We have discovered a flaw in our finalization algorithm: the finalized block notarized in the metachain was different from the block seen as final in a shard. Based on our findings from shard 3, I will briefly describe what happened and how shard 3 got stuck.
The current view of a validator in shard 3 at 17:50 UTC looks like this:
In shard 3, the latest header seen from the metachain has nonce 18685. That header records nonce 17175 as the highest finalized block in shard 3.
Looking at page 3 of the shard 3 view in the blockchain explorer, we see block 17175 at the bottom, which was the latest notarized final header for shard 3. The shard kept working without problems up to block 17179, which was sent to the metachain and made block 17178 final for shard 3 in the metachain's view. From block 17179 onward, no blocks were created in shard 3 for 15 minutes. Since no blocks were created, no new metachain headers were included in any shard 3 block, so shard 3 still considered block 17175 final and tried to roll back to it.
During the latest stall in our network, when the metachain stood still, we built in a mechanism to fix stalls: all nodes in the shard revert to a lower nonce (nonce-1) and try to create a new block at that nonce. If they still do not succeed, the nonce is decreased again, down to the latest finalized header known from the metachain, in this case 17175.
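The rollback loop described above can be sketched as follows. This is only an illustration under stated assumptions, not our node code: the function name `roll_back_until_progress` and the callback `can_build_on` are hypothetical, and the sketch reduces "the shard manages to create a block" to a simple yes/no check per nonce.

```python
from typing import Callable


def roll_back_until_progress(
    head_nonce: int,
    final_nonce: int,
    can_build_on: Callable[[int], bool],
) -> int:
    """Return the nonce the shard ends up building on.

    Hypothetical model: starting at the current head, step back one nonce
    at a time (nonce-1) while block creation keeps failing, and never go
    below the block the shard itself considers final.
    """
    if final_nonce > head_nonce:
        raise ValueError("final nonce cannot be above the head nonce")
    nonce = head_nonce
    while nonce > final_nonce and not can_build_on(nonce):
        nonce -= 1
    return nonce


# Shard 3's stale view: it believed 17175 was final, so that was the floor.
shard_view_of_final = 17175
metachain_view_of_final = 17178

# Assume block creation fails at every nonce during the stall.
landed_on = roll_back_until_progress(17179, shard_view_of_final, lambda n: False)
rolled_back_past_metachain_final = landed_on < metachain_view_of_final
```

With the values from this incident, the loop bottoms out at 17175, which is below the 17178 that the metachain already considered final. The floor of the rollback is only as good as the shard's own view of finality.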
During the 15 minutes in which no blocks were accepted in shard 3, the revert from block 17179 to 17175 took place. Shard 3 then resumed producing blocks, starting again from block 17175, and created a fork. This fork continued until block 17230 in shard 3. During the fork, duplicates of blocks 17176, 17177 and 17178 were created, and a new 17179 was created in the shard, but it could not be notarized in the metachain because it was built on top of a different block 17178. When we roll back blocks, they are moved from persistent storage into our datapools and deleted from storage. This is what made the current issue unrecoverable: block 17178, for example, was removed from the shard 3 datapool after some rounds.
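To make the storage side of this concrete, here is a minimal toy model of a rollback that moves a header from persistent storage into a short-lived pool. The class name `ShardNode`, the hash strings, and the fixed eviction delay `pool_ttl_rounds` are all assumptions made for illustration; the real datapool eviction policy is only known here as "after some rounds".

```python
from typing import Dict, Optional, Tuple


class ShardNode:
    """Toy model: persistent storage plus a datapool that forgets old entries."""

    def __init__(self, pool_ttl_rounds: int) -> None:
        self.storage: Dict[int, str] = {}                # nonce -> header hash
        self.datapool: Dict[int, Tuple[str, int]] = {}   # nonce -> (hash, rounds left)
        self.pool_ttl_rounds = pool_ttl_rounds           # assumed eviction delay

    def commit(self, nonce: int, header_hash: str) -> None:
        self.storage[nonce] = header_hash

    def roll_back(self, nonce: int) -> None:
        # The header leaves persistent storage and lives on only in the pool.
        header_hash = self.storage.pop(nonce)
        self.datapool[nonce] = (header_hash, self.pool_ttl_rounds)

    def tick(self) -> None:
        # One round passes; entries whose time is up are dropped.
        self.datapool = {
            nonce: (header_hash, left - 1)
            for nonce, (header_hash, left) in self.datapool.items()
            if left > 1
        }

    def lookup(self, nonce: int) -> Optional[str]:
        if nonce in self.storage:
            return self.storage[nonce]
        if nonce in self.datapool:
            return self.datapool[nonce][0]
        return None


node = ShardNode(pool_ttl_rounds=3)
node.commit(17178, "original-17178")
node.roll_back(17178)
still_recoverable = node.lookup(17178)   # the pool still holds the original

for _ in range(3):
    node.tick()
after_eviction = node.lookup(17178)      # the original is gone from both places

node.commit(17178, "fork-17178")         # the fork's duplicate takes its place
```

Once the pool entry expires, the node holds only the fork's 17178, so it has nothing left to switch back to.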
Afterwards, shard 3 tried to include metachain headers in its blocks, but its latest finalized block nonce could not advance past 17178, because the metachain would not accept blocks built on top of a fork. We ended up in a deadlock in which shard 3 could not advance its finalized blocks.
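The rejection on the metachain side can be illustrated with a simple chain-link check. This is a sketch under assumptions: the `ShardHeader` type, the hash strings, and the rule in `metachain_accepts` (next nonce, and previous hash equal to the last finalized header's hash) are hypothetical stand-ins for the real notarization logic, which checks more than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardHeader:
    nonce: int
    hash: str
    prev_hash: str


def metachain_accepts(last_final: ShardHeader, candidate: ShardHeader) -> bool:
    """Assumed rule: the candidate must directly extend the header that the
    metachain already treats as final for this shard."""
    return (
        candidate.nonce == last_final.nonce + 1
        and candidate.prev_hash == last_final.hash
    )


# What the metachain holds as final for shard 3 (original chain).
final_17178 = ShardHeader(17178, "original-17178", "original-17177")

# What shard 3 produced after the rollback (fork).
fork_17179 = ShardHeader(17179, "fork-17179", "fork-17178")

# What would have been accepted: a block on top of the original 17178.
good_17179 = ShardHeader(17179, "original-17179", "original-17178")

fork_accepted = metachain_accepts(final_17178, fork_17179)
original_accepted = metachain_accepts(final_17178, good_17179)
```

In this model the fork's 17179 fails the previous-hash comparison and every later fork block inherits the same problem, which matches the deadlock we observed.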
We are working on a fix now that we know what went wrong. We’ll keep you posted with updates.