Ethereum Bug Knocks Out Nodes, Now Fixed – Trustnodes

"We experienced a production incident where all 4 sets of CL+EL deployed in our system were rebooted simultaneously," an ethereum staker says, adding:

"This kind of phenomenon was rarely seen before. After investigation, it was found that these Prysm beacon nodes suddenly doubled their memory usage at 2023-05-12 20:12:00 UTC+0, exceeding our memory limit and triggering OOM.

"I have checked the relevant logs, and it seems that the sudden printing of 're-subscribe to topic' logs during normal operation may be the cause of this memory surge. Not only memory, but CPU usage also doubled."

The problem seems to have lasted for about an hour, starting at 9PM UTC on Thursday, during which participation dropped to as low as 40%.

A few hours prior, crypto prices started to fall, but the problem was addressed quickly with the network now back to running as normal.

It's currently unclear exactly what happened. Nishant Das, an ethereum 2.0 developer, said the team has been debugging the issue for the past day, adding: "We will post a more detailed summary on the incident by today."

In an overview of the incident, a Prysm spokesperson said:

"Prysm nodes received many attestations for previous epochs where the block did not reflect the latest checkpoint in fork choice.

"The peered nodes that likely sent the attestations didn't have all the blocks for the rest of the epoch. Because of this, Prysm spent a lot of resources replaying state and eventually fell into a death spiral (CPU spikes / OOM).

"Prysm nodes have also discovered a subtle bug where they didn't use the correct state to compute shuffling during the death spiral.

"Coming out of this, we have a few optimizations to the caching scheme, and are also using heuristics to filter unviable attestations. You should expect a release from us early next week."
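
To give a rough idea of what filtering unviable attestations could look like, here is a minimal Python sketch. It is not Prysm's code or its actual fix: the types, the `is_viable` and `filter_viable` helpers, and the specific checks are all hypothetical, chosen only to illustrate rejecting attestations with cheap lookups before any expensive state replay is attempted. The only real-world value used is the 32 slots per epoch of the live network.

```python
from dataclasses import dataclass

SLOTS_PER_EPOCH = 32  # mainnet value


@dataclass(frozen=True)
class Checkpoint:
    epoch: int
    root: str


@dataclass(frozen=True)
class Attestation:
    slot: int
    beacon_block_root: str
    target: Checkpoint


def is_viable(att, current_epoch, known_blocks, cached_checkpoints):
    """Hypothetical cheap checks run before any state replay.

    known_blocks: set of block roots the node has already imported.
    cached_checkpoints: set of Checkpoints whose state is already cached.
    """
    att_epoch = att.slot // SLOTS_PER_EPOCH
    # Only attestations from the current or previous epoch are considered.
    if att_epoch + 1 < current_epoch:
        return False
    # The target checkpoint must belong to the attestation's own epoch.
    if att.target.epoch != att_epoch:
        return False
    # Drop votes for blocks this node has never seen.
    if att.beacon_block_root not in known_blocks:
        return False
    # Assumption: a target without a cached state would force a replay,
    # so this sketch drops it instead of regenerating the state.
    if att.target not in cached_checkpoints:
        return False
    return True


def filter_viable(atts, current_epoch, known_blocks, cached_checkpoints):
    """Keep only the attestations that pass every cheap check."""
    return [
        a for a in atts
        if is_viable(a, current_epoch, known_blocks, cached_checkpoints)
    ]
```

The design choice in this sketch is that every check is a set lookup or integer comparison, so a flood of stale attestations costs very little to discard. Whether the real client drops, defers, or rate-limits such attestations is not stated in the overview above.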

As far as we are aware, this is the first time that participation in the live ethereum network has dropped so low.

Some suggest the problem was only with the Prysm staking client, but there are reports other clients were affected too.

Yet the network is now running as normal, having proven sufficiently resilient to heal by itself. The problem has been largely addressed, with more optimizations coming, and apparently no protocol level changes are needed.
