In an engine, eval stability from depth to depth is achieved by requiring a new best move to be searched to the same depth as the previous best move. In ChessDB, every leaf gets at least a depth-22 eval, but beyond that a tree with a handful of nodes can be backed up to the root and, through minimaxing, take preference over a tree with a million nodes. Automatic exploration will later pick the small tree up and expand it, and either refute a key move (reverting to the big tree) or confirm the finding and keep it.
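That backup rule, a parent's score being the best of the negated child scores regardless of subtree size, can be sketched in a few lines. This is not CDB's code; the moves, evals, and tree shape are made up for illustration:

```python
def backup(node):
    """Negamax score of `node`, from the side-to-move's perspective.
    A leaf is an int (an engine eval in centipawns); an inner node
    maps moves to child nodes. Subtree size plays no role."""
    if isinstance(node, int):
        return node
    return max(-backup(child) for child in node.values())

def best_move(root):
    """Pick the root move whose negated child score is highest."""
    return max(root, key=lambda mv: -backup(root[mv]))

# A deeply explored line: the leaf eval is +10 for the side to move there.
deep_line = {"e5": {"Nf3": {"Nc6": 10}}}
# A barely explored alternative: a single leaf where the opponent stands -25.
shallow_line = -25

root = {"e4": deep_line, "g4": shallow_line}
print(best_move(root))      # one optimistic shallow leaf outranks the big tree
print(-backup(root["e4"]))  # 10
print(-backup(root["g4"]))  # 25
```

Until exploration expands the shallow node and either confirms or refutes its leaf eval, that single eval decides the root.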
This has some advantages: moves may be explored to very unequal depths that don't reflect their quality, and the search tree shape is less regular than in an engine search. If you could explore one line 100 plies deep and the root score couldn't switch to another line until that line was also explored 100 plies deep, that would be a problem.
But it comes with the drawback that the root score is easier to change with a shallow line.
In practice, black must walk a fine line to keep the (likely) winning advantage, while white has a lot of drawing attempts available. The difficulty of moving the score is directly related to the number of lines that must be scored differently, and that number is very roughly (number of enemy moves to refute)^(depth to refute).
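A quick back-of-envelope of that exponential makes the asymmetry concrete. The branching and depth figures below are illustrative, not measured CDB data:

```python
# At each enemy decision point, every plausible reply needs its own
# refutation before the node's score can flip, so the line count
# grows as (enemy moves to refute) ** (decision points to refute).

def lines_to_refute(enemy_moves: int, decision_points: int) -> int:
    """Very rough count of distinct lines that all need a new score."""
    return enemy_moves ** decision_points

print(lines_to_refute(3, 2))  # 9: a shallow claim is cheap to overturn
print(lines_to_refute(3, 8))  # 6561: a deep one costs orders of magnitude more
```

This is why a single shallow drawing line can drag the score down, while restoring it requires re-scoring a whole bushy subtree.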
If you wanted to change CDB's 1. g4 score, you'd manage a much bigger change by showing it a way to draw what it thought was a key attacking line than by busting one of its drawing attempts. On the other hand, once you've pushed the score down, your drawing line will be the only one with such a low score, so a single new attacking move that refutes it restores the previous evaluation.
So the challenge for someone who wants to prove a draw is to push the score down and keep it down.
I'm waiting on mmt's big hardware analysis of the 2. c4 Bxg4 line.
