Car industry and the Elo race

hgm · Post by **hgm** » Sun Apr 28, 2024 11:50 am

I noticed an interesting parallel between chess-engine development and the car industry. Let me start with a story that will teach us a moral lesson:

As demand for cleaner cars arose, government agencies devised tests for measuring car emissions. Unfortunately these tests did not mimic the conditions of everyday use of a car very well. So it turned out to be possible to do very well in the tests, while being quite polluting in real-life usage. When it was discovered that car manufacturers like Volkswagen exploited this to the max, society did not look very kindly on these great technical achievements. It was considered cheating, the software doing this optimization was called a 'defeat device', legal claims were filed against the companies, and the involved board members were sued. If there had been a Discord channel on which car engineers exchanged ideas for how to make their cars perform better in the tests, I am pretty sure the participants there would have been charged with 'conspiracy to defraud'.

This is what can happen if testing is not realistic, and people start to see passing the test as the goal, rather than delivering a good-quality product.

In computer chess a similar situation seems to be developing. Elo is tested in engine-engine games, but this is merely a testing method, and cannot be called an application. It is true that it can be fun to play engine tourneys, but the amount of fun hardly depends on the strength of the participants, and nowadays you don't have to look very far to push the level beyond human understanding. The real application for which Elo matters is analysis: human chess players who want the 'absolute truth' about positions they encounter in games, or about opening lines they are preparing.

It would therefore be a bad thing if the testing conditions were dissimilar to those used in analysis. But unfortunately it seems they are. Rating testers run their tests with generous hash allocation, which is easy, because even at LTC the time per move is sub-minute. But in analysis the time can be measured in hours. To keep the same hash-size/search-time ratio as in the tests you quickly get to insane amounts of memory, which would never be useful for any other application, and would have to be purchased for no other reason than doing the analysis. Understandably most users won't be willing to do that; for them the only thing that matters is how the engine performs on the hardware they can afford.
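To put a rough number on the memory argument, here is a minimal sketch that scales the hash size linearly with search time, which is what keeping the hash-size/search-time ratio constant amounts to. The function name and the figures (256 MB at 30 seconds per move in testing, 2 hours of analysis) are illustrative assumptions, not values taken from any rating list.

```python
def hash_for_same_ratio(test_hash_mb: float,
                        test_seconds: float,
                        analysis_seconds: float) -> float:
    """Hash size in MB that keeps the test's hash-size/search-time ratio."""
    return test_hash_mb * analysis_seconds / test_seconds


# Hypothetical numbers: 256 MB hash at 30 s per move in testing,
# versus a 2-hour analysis of a single position.
needed_mb = hash_for_same_ratio(256, 30, 2 * 3600)
print(f"{needed_mb / 1024:.0f} GB")  # prints "60 GB"
```

Under these assumed numbers a modest test setup already corresponds to tens of gigabytes of hash for a single long analysis.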

Of course you could argue that this is just the user's fault. Just like it was the driver's fault that they were not limiting their car trips to roads where they could drive at constant speed, without braking and accelerating, on days without wind, just as the emission tests on the roller bench did. But that argument did not fly very well in court. An engine that doesn't perform at its advertised Elo under conditions where people would want to use it stinks as much as a car that passed the emission test by means of a 'defeat device'.

So what I want to advocate is that testing should really be done under conditions where the node count in the search tree of a typical move is a factor 10 to 100 larger than what would fit in the hash table. (Assuming, say, each entry would take 16 bytes, as we don't want to punish designs that make more efficient use of memory.) Otherwise we will run the risk that engines will only be designed to pass meaningless tests, wrecking them for use in analysis if that is needed to crank up the Elo under artificial conditions.
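The proposed test condition can be turned into a small calculation. The sketch below assumes the 16 bytes per entry mentioned above and treats every node of the search tree as wanting one table entry, which is a simplification; the helper names are hypothetical and not part of any engine or testing framework.

```python
ENTRY_BYTES = 16  # entry size assumed in the post
MB = 1024 * 1024


def hash_entries(hash_mb: int) -> int:
    """Number of 16-byte entries that fit in a hash table of hash_mb MB."""
    return hash_mb * MB // ENTRY_BYTES


def overload_factor(nodes_per_move: int, hash_mb: int) -> float:
    """Ratio of nodes searched per move to available hash entries."""
    return nodes_per_move / hash_entries(hash_mb)


def max_hash_mb(nodes_per_move: int, factor: float) -> float:
    """Largest hash (in MB) that still gives at least the requested overload."""
    return nodes_per_move * ENTRY_BYTES / factor / MB


# Hypothetical test: about 10.5 million nodes per move with a 16 MB table.
print(overload_factor(10_485_760, 16))  # prints 10.0
```

A tester could use `max_hash_mb` the other way around: measure the typical node count per move at the chosen time control, then set the hash small enough that the overload factor lands in the 10 to 100 range.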

chesskobra · Post by **chesskobra** » Sun Apr 28, 2024 12:41 pm

With a disclaimer that I am not an expert, I agree. From whatever little testing I do, I think that rating lists are meaningless. Many top-rated engines seem to miss elementary concepts in endgames, for example, not being able to mate for 120 moves when one side has Q, B, 3P against 1P, playing on when there is no possibility of a win for either side, and so on. A 1600-rated player would notice the weirdness in the endgame play. I could be wrong, but I think the only thing many engines are good at is middlegame calculation. In the middlegame, for me, it is hard to notice the weirdness (maybe GMs can notice), but in the endgame it is easy to see. I would be interested in a rating list for endgame play (not endgame problem solving).

connor_mcmonigle · Post by **connor_mcmonigle** » Sun Apr 28, 2024 3:10 pm

hgm wrote: ↑Sun Apr 28, 2024 11:50 am ...

What a load of nonsense. Maybe you'd have the beginnings of an argument if you actually had empirical evidence supporting your claims. Time and time again, the current state-of-the-art testing methodology has been empirically demonstrated to provide results which extrapolate well to a variety of testing conditions. If you're going to claim there's a problem, at the very least present evidence supporting your claim.

I don't care about silly analogies, but even your car manufacturer analogy doesn't make sense: here, the car manufacturers have designed the tests themselves to improve their cars.

towforce · Post by **towforce** » Sun Apr 28, 2024 10:44 pm

hgm wrote: ↑Sun Apr 28, 2024 11:50 am...

"You get what you measure"

smatovic · Post by **smatovic** » Sun Apr 28, 2024 11:00 pm

One of my math teachers used to say: "Wer viel misst, misst Mist." (roughly: "Who measures a lot, measures rubbish.") I guess he was into quantum physics.

--
Srdja

connor_mcmonigle · Post by **connor_mcmonigle** » Mon Apr 29, 2024 7:39 am

towforce wrote: ↑Sun Apr 28, 2024 10:44 pm
hgm wrote: ↑Sun Apr 28, 2024 11:50 am...

"You get what you measure"

We measure Elo.

chrisw · Post by **chrisw** » Mon Apr 29, 2024 2:36 pm

connor_mcmonigle wrote: ↑Mon Apr 29, 2024 7:39 am
towforce wrote: ↑Sun Apr 28, 2024 10:44 pm
hgm wrote: ↑Sun Apr 28, 2024 11:50 am...

"You get what you measure"

We measure Elo.

The point hgm was trying to make was that the USE case for the engine is very deep long time control analysis of individual positions (to gain the elusive truth of the position). Of necessity this runs into heavy overload of the available hash.
Bullet testing for Elo is done in an environment where the hash is not overloaded.
Hash overload has a significant effect. Hence his claim is that you’re testing in an environment very different from the final usage environment, and the car analogy was therefore good.

connor_mcmonigle · Post by **connor_mcmonigle** » Mon Apr 29, 2024 3:30 pm

chrisw wrote: ↑Mon Apr 29, 2024 2:36 pm
connor_mcmonigle wrote: ↑Mon Apr 29, 2024 7:39 am
towforce wrote: ↑Sun Apr 28, 2024 10:44 pm
hgm wrote: ↑Sun Apr 28, 2024 11:50 am...

"You get what you measure"

We measure Elo.

The point hgm was trying to make was that the USE case for the engine is very deep long time control analysis of individual positions (to gain the elusive truth of the position). Of necessity this runs into heavy overload of the available hash.
Bullet testing for Elo is done in an environment where the hash is not overloaded.
Hash overload has a significant effect. Hence his claim is that you’re testing in an environment very different from the final usage environment, and the car analogy was therefore good.

Fundamentally, you have to pick some hash pressure target to optimize for. The hash pressure target chosen by most engine devs actually lines up reasonably with the configuration used at TCEC in terms of Hash and TC.

There's always going to be some misalignment between the very short time controls we use for testing and longer time controls. But, in the absence of a practical alternative (i.e., testing at longer time controls is not feasible), I contend that it remains the best possible methodology.

Viz · Post by **Viz** » Mon Apr 29, 2024 4:30 pm

Also, we have made really good progress in understanding which parameters are TC sensitive and thus can be optimized for longer searches.
Namely, almost every parameter in singular extensions is extremely sensitive to average game length (depth of search, or whatever measurement you like).

chrisw · Post by **chrisw** » Mon Apr 29, 2024 6:42 pm

Viz wrote: ↑Mon Apr 29, 2024 4:30 pm Also, we have made really good progress in understanding which parameters are TC sensitive and thus can be optimized for longer searches.
Namely, almost every parameter in singular extensions is extremely sensitive to average game length (depth of search, or whatever measurement you like).

SE is a good example (looks like loadsa Elo possible) of asking a developer a) why this works and b) what would be the scientific basis for the particular tweak adjustments to experiment on. Negative extensions were a neat idea, and more extension for larger singularity likewise. But we'll reach the point where the logical basis for other ideas begins to shrink and you're down to throwing guesses at it, possibly for diminishing returns. I guess, one day, with a gazillion resources, the process can be automated with a universal adjustable black-box function, i.e. you no longer have to guess at an algorithm and then use learning on the parameters, but can also meta-guess at a range of algorithmic possibilities and apply learning on those.
