Tony Thomas wrote:I am almost done testing Romi and the results so far are inconclusive. Testing against Hamsters seem to have Romi and she is also showing random behavior. For example, in her game against Danasah, after 15 matches she was leading 10-5, however in the next fifteen matches she only managed to get 5.5 points.

The current version seems to be the same strength as the previous 3 or 4 versions, so I am wishing Mike the best of luck with the new rewrite. Note that this is the first time Romichess has won the match against Francesca; however, it is also the first time Romi has lost to Arasan in the last five matches, so I guess one result cancels the other out.
10 RomiChessNG2         : 2540  300 (+125,= 55,-120), 50.8 %
   Danasah 2.85         :        30 (+  8,= 15,-  7), 51.7 %
   Delphil 1.6c         :        30 (+ 11,= 11,-  8), 55.0 %
   Francesca MAD 0.13   :        30 (+ 13,=  5,- 12), 51.7 %
   Lime 6.3             :        30 (+ 27,=  3,-  0), 95.0 %
   Zeus 1.28            :        30 (+ 21,=  2,-  7), 73.3 %
   Zappa 1.1            :        30 (+  7,=  3,- 20), 28.3 %
   GreKo 5.2            :        30 (+ 19,=  4,-  7), 70.0 %
   Arasan 9.5           :        30 (+  9,=  4,- 17), 36.7 %
   Naum 2.0             :        30 (+  8,=  3,- 19), 31.7 %
   ChessTiger2007.1 UCI :        30 (+  2,=  5,- 23), 15.0 %
Hi Tony,
It is very frustrating to have done way better than before against the strongest two engines (new records) and also against the least strong engine (also a new record) not to mention the first ever win against FranMad, only to flop in the rest.
But why do you have NG2's rating at 2540 instead of 2440? Must be a typo!
Bob, author of Crafty, claims in the programmers forum that the number of games needed from a set of 40 fixed positions at long time controls to arrive at an accurate rating is 2,560 games versus each opponent. Since 1+1 games suffer from 'time jitter' randomness far more than long time control games, the number of games needed would be many times more at 1+1. Add the randomness of opening books and the number of games needed is enormous. Even so, I do not think that the jump from about 8% vs Naum to 31.7% can be accounted for by randomness alone. So I am hopeful that this version is really stronger.
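For a rough sense of how much noise a 30-game match carries, here is a minimal Python sketch. It is not taken from anyone's actual test setup: the function name `score_and_stderr` is made up for illustration, and it assumes the games are independent, which ignores opening book and time control effects. It uses the Naum 2.0 line from the table above (+8 =3 -19).

```python
import math

def score_and_stderr(wins, draws, losses):
    """Return (score fraction, standard error) for one match.

    Hypothetical helper: assumes independent games, so the
    result is only a rough estimate of the real uncertainty.
    """
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    # Standard error of the mean score over n games.
    return score, math.sqrt(var / n)

# NG2 vs Naum 2.0: +8 =3 -19
score, se = score_and_stderr(8, 3, 19)
low, high = score - 1.96 * se, score + 1.96 * se
print(f"score {score:.1%}, approx. 95% interval {low:.1%} to {high:.1%}")
```

With only 30 games the interval comes out wide, roughly 15 percentage points either side of the score, and the earlier figure of about 8% carries its own uncertainty that this sketch does not account for. So it is a sanity check on the jump, not proof of it.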
In my shoddy testing of 100-game matches, if Romi gains a new record against at least two engines, even if it is only half a point, I consider it an improved version. So there are ups and downs in strength; however, the general trend is up! So for NG2, that is seven new records! I had new records versus Hamsters2, Olithink and TSCP (finally 100%, previous best 99%, just for fun).
Also the new 'Ninja Girl' code seems to work better at longer time controls, as it uses information from the search to modify the eval. Therefore the longer the search, the more reliable the data from the search.
Ninja Girl is still very young and much tuning is still needed to get it working really well.
Thanks for the testing. It is very helpful!
Mike