Not at all - I think you've raised some very interesting points! MCTS averaging does seem fundamentally mismatched to Chess. That's why I was so amazed A0 actually worked.
It seems it's not that good in Go either: as recent Leela games show, it blunders tactically in Go on a regular basis. Go is a bit different in that it's possible to beat humans there without much tactical awareness, but winning against humans in a board game isn't exactly a high bar these days. In engine-vs-engine matches it's clear that MCTS in its pure form isn't working.
There is some kind of component missing: right now you can get to 100k playouts, discover that the line is a total disaster (losing by force), and it will still take a long while for the search to prefer a different move.
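To make the "slow to change its mind" point concrete, here is a minimal Python sketch (toy numbers, not Leela's actual backup code) of why a node's averaged Q creeps toward the true losing value only slowly once a refutation is found:

```python
# Toy running-average backup (not Leela's code): once a node has piled up
# optimistic playouts, its averaged Q moves toward the true losing value
# only slowly, so the search keeps preferring the refuted move for a while.

def backed_up_q(n_old, q_old, n_new, q_new):
    """Mean value after n_new extra playouts that each return q_new."""
    return (n_old * q_old + n_new * q_new) / (n_old + n_new)

# 100k playouts believed the line was fine (Q = +0.30); then every new
# playout returns the forced loss (-1.0).
after_10k = backed_up_q(100_000, 0.30, 10_000, -1.0)    # ~ +0.18: still looks "fine"
after_100k = backed_up_q(100_000, 0.30, 100_000, -1.0)  # ~ -0.35: still far from -1
```

Even after doubling the total playout count with pure refutations, the averaged value is nowhere near the true -1, whereas a minimax-style backup would flip immediately.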
I personally believe policy-guided search, with the policy trained on many games, will work, but that the move selection itself will end up more in line with alpha-beta. Overall I am excited (and wish I had time away from programming engines for card games to participate). If anything, Leela plays like a naive, optimistic human of 2200-2300 Elo, and that's really cool to have.
Hi Piotr,
Slightly off-topic, but I'm curious: what could you recommend as a good program/AI for hold'em (or poker in general)?
And yes, the self-play match games between networks on the main site are terrible and misleading.
I wonder why that is. Now that the matches are no longer used for gating and there is much more opening variety, the graph should in principle be correct on average.
So it seems that Elo is not additive in this case.
One possible explanation might be that buggy engines do not satisfy the Elo model. This was an observation by HGM in a slightly different context. Of course, it is a bit unclear how to define a buggy engine...
During the bug (underpromotion), weren't the nets self-tested against another, almost identical underpromoting engine? Even I would have seen progress, with a book and fixed time. Then, AFAIK, they only slowly re-trained the net (shouldn't they just have started all over again from ID124, or even from the "smallnet" ID122?). And then they tested non-buggy engine against non-buggy engine, which slowly started to promote to queens again, so progress was again assured (on average)? There could or should have been one drop, but due to the slow changes it is barely visible; there are many lower and higher results all the way.
Probably one could imagine certain bugs (say, a certain rate of time losses) and construct a gedanken experiment showing that three engines don't satisfy the additivity underlying the Elo model.
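The gedanken experiment can be made numeric. A hedged Python sketch, with made-up head-to-head scores: under the Elo model each expected score implies a rating difference, and additivity demands those implied differences compose, which these numbers refuse to do:

```python
import math

def elo_diff(score):
    """Rating difference implied by an expected score under the Elo model."""
    return -400 * math.log10(1 / score - 1)

# Made-up head-to-head scores for three engines A, B, C, where a bug
# (say, time losses that only trigger against C) skews the A-C result:
d_ab = elo_diff(0.60)  # A over B: ~ +70 Elo
d_bc = elo_diff(0.60)  # B over C: ~ +70 Elo
d_ac = elo_diff(0.55)  # A over C: ~ +35 Elo, not the ~ +141 additivity demands
```

No single rating per engine can fit all three results at once, so fitting these engines into one Elo list necessarily misrepresents at least one of the pairings.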
CMCanavessi wrote:My tests arrive at very similar numbers to yours, Kai (though I've tested 150 as the strongest, and have not tested any newer network... maybe 156 will be the next one). And yes, the self-play match games between networks on the main site are terrible and misleading.
My gauntlet numbers:
The calculated Elo:
It seems there is a large jump, outside the error margins, with ID156; maybe you can tell the devs. It is now the best net, beyond the error margins.
I'm running the gauntlet right now with 156, and it's proving to be the best network so far, but not by much (early estimates of around 40 elo). Only 30% of the games played, so it has some way to go.
Is Leela really regaining the lost knowledge since the fix of the promotion bug?
I suspect that the newer training just skips the positions leading to the promotion instead of retraining on them, because the buggy network already tells the search that it's losing.
As a result, the succeeding networks seem to get better and better relative to the previous buggy one, because they don't have to deal with those positions against each other.
But when faced with a network from before the bug, like ID125, it's not much better.
CMCanavessi wrote:I'm running the gauntlet right now with 156, and it's proving to be the best network so far, but not by much (early estimates of around 40 elo). Only 30% of the games played, so it has some way to go.
Our error margins are large with only 200 games, but I can confirm that ID159 comes close to that high result of ID156, so it was not a 2.5-standard-deviation fluke. By now, we can both confirm that the new nets are the best ones, and they will probably keep getting better and better.
OTOH, I couldn't see a significant jump on either the opening positional or the middlegame tactical suites, just a small improvement over, say, ID147. I don't know why; maybe some other aspects of the gameplay improved, say endgames.
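For reference, the error margin of a 200-game match can be sketched like this (hypothetical W/D/L numbers; the formula is the usual normal approximation on the per-game score, propagated through the logistic Elo curve):

```python
import math

def elo_and_error(wins, draws, losses, z=1.96):
    """Elo estimate and ~95% margin from a match result (normal approximation)."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # sample variance of the per-game score (win=1, draw=0.5, loss=0)
    var = (wins * (1 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0 - score) ** 2) / (n - 1)
    se = math.sqrt(var / n)
    elo = -400 * math.log10(1 / score - 1)
    # propagate the score's standard error through the logistic curve
    margin = z * se * 400 / (math.log(10) * score * (1 - score))
    return elo, margin
```

With a hypothetical 90-60-50 result over 200 games this gives roughly +70 Elo with a margin around ±41, which is why a 40 Elo jump in such a gauntlet is barely outside the noise.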
Do you have a list of the results of different versions of LCZero in tactics?
Yes, I have some sort of list. For the ECM200.epd middlegame tactical suite (200 positions), analyzed at 20s/position: at this time control and on my hardware, LC0 performs overall (Elo-wise) comparably to GreKo 6.5, a 2330 Elo CCRL standard A/B engine, which fares much better tactically (but much worse positionally). And on this tactical middlegame suite, ID124 still seems to be the best of the nets.
gladius wrote:But the entire process is designed to have it solve tactics. The policies are trained to match the output of an 800 node search, so it's being trained to take the tactics into account. Even modern chess evaluation features do this (with eg. huge penalties for queen under threat, and restricting queen mobility to "safe" squares).
Don't you think that the network can learn to predict tactics?
What I don't quite get is what the move probabilities are supposed to stand for.
If the move probabilities are supposed to single out "good" moves, then a move that simply looks bad but happens to have a deep (or even shallow) tactic behind it would score badly and would not guide the search toward discovering the tactic.
If the move probabilities are supposed to single out "unclear" moves, then things could work. But I don't really see how the whole updating process would work towards identifying "unclear" moves.
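A toy simulation of AlphaZero-style PUCT selection illustrates the worry: if the prior on a "bad-looking" move is tiny, the exploration term may never give it enough visits for the deep tactic behind it to surface. All constants below (c_puct, the visit threshold, the Q values) are made up for illustration and are not Leela's actual parameters:

```python
import math

# Toy PUCT selection (assumed constants, NOT Leela's actual parameters).
# A move whose shallow evaluation looks bad (Q = -0.2) only reveals its
# winning tactic after DEEP_VISITS visits (then its Q jumps to +0.4).
C_PUCT = 1.5
DEEP_VISITS = 50  # hypothetical search effort needed to see the tactic

def puct_score(child, parent_visits):
    """AlphaZero-style score: Q plus prior-weighted exploration bonus."""
    u = C_PUCT * child["prior"] * math.sqrt(parent_visits) / (1 + child["n"])
    return child["q"] + u

def simulate(tactic_prior, playouts=3000):
    """Return how many visits the 'tactical' move receives."""
    normal = {"prior": 1.0 - tactic_prior, "q": 0.10, "n": 0}
    tactic = {"prior": tactic_prior, "q": -0.20, "n": 0}
    for t in range(1, playouts + 1):
        pick = max((normal, tactic), key=lambda c: puct_score(c, t))
        pick["n"] += 1
        if tactic["n"] >= DEEP_VISITS:
            tactic["q"] = 0.40  # the deep tactic is now reflected in its value
    return tactic["n"]

low = simulate(0.02)   # tiny prior: the tactic is starved of visits
high = simulate(0.30)  # decent prior: the tactic gets searched deeply
```

With a prior of 0.02 the tactical move collects only a handful of visits in 3000 playouts and never crosses the discovery threshold; with a prior of 0.30 it is searched deeply enough for the win to show up in its Q. So for the search to find such tactics, the priors effectively have to single out "worth exploring" moves, not just "obviously good" ones.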