Ajedrecista wrote:
Hello George:
With the data you provide, I get the following error bars and the minimum score required for assuring that IH is better than SF, both for 2-sigma confidence (~ 95.45% confidence):
1 Ivanhoe B46a x64 +49 +14/-7/=29 57.00% 28.5/50
2 Stockfish 2.2.2 JA SSE42 -49 +7/-14/=29 43.00% 21.5/50
-----------------------------------------------------------------------
Confidence interval for 2-sigma:
Elo rating difference: 48.962560037161944 Elo
Lower rating difference: -13.548115630226197 Elo
Upper rating difference: 114.82934892620664 Elo
Lower bound uncertainty: -62.510675667388142 Elo
Upper bound uncertainty: 65.866788889044697 Elo
Average error: +- 64.188732278216419 Elo
K = (average error)*[sqrt(n)] = 453.88287869694658
Elo interval: ] -13.548115630226197 , 114.82934892620664 [
-----------------------------------------------------------------------
Minimum score for no regression: 58.819171036881969 %
1 Ivanhoe 46h x64 +78 +17/-6/=27 61.00% 30.5/50
2 Stockfish 2.2.2 JA SSE42 -78 +6/-17/=27 39.00% 19.5/50
-----------------------------------------------------------------------
Confidence interval for 2-sigma:
Elo rating difference: 77.706091193707124 Elo
Lower rating difference: 13.396545663492863 Elo
Upper rating difference: 147.79531731086832 Elo
Lower bound uncertainty: -64.309545530214261 Elo
Upper bound uncertainty: 70.089226117161201 Elo
Average error: +- 67.199385823687731 Elo
K = (average error)*[sqrt(n)] = 475.17141407500744
Elo interval: ] 13.396545663492863 , 147.79531731086832 [
-----------------------------------------------------------------------
Minimum score for no regression: 59.229582069908972 %
1 Ivanhoe B46fC x64 +85 +18/-6/=26 62.00% 31.0/50
2 Stockfish 2.2.2 JA SSE42 -85 +6/-18/=26 38.00% 19.0/50
-----------------------------------------------------------------------
Confidence interval for 2-sigma:
Elo rating difference: 85.043237152577479 Elo
Lower rating difference: 19.537412490302369 Elo
Upper rating difference: 157.16130696200742 Elo
Lower bound uncertainty: -65.505824662275111 Elo
Upper bound uncertainty: 72.118069809429936 Elo
Average error: +- 68.811947235852524 Elo
K = (average error)*[sqrt(n)] = 486.57394517122224
Elo interval: ] 19.537412490302369 , 157.16130696200742 [
-----------------------------------------------------------------------
Minimum score for no regression: 59.428090415820634 %
1 Ivanhoe B46fE.02 x64 +35 +13/-8/=29 55.00% 27.5/50
2 Stockfish 2.2.2 JA SSE42 -35 +8/-13/=29 45.00% 22.5/50
-----------------------------------------------------------------------
Confidence interval for 2-sigma:
Elo rating difference: 34.860070287560063 Elo
Lower rating difference: -28.241740262805108 Elo
Upper rating difference: 100.36872486273647 Elo
Lower bound uncertainty: -63.101810550365171 Elo
Upper bound uncertainty: 65.508654575176408 Elo
Average error: +- 64.305232562770790 Elo
K = (average error)*[sqrt(n)] = 454.70666010913216
Elo interval: ] -28.241740262805108 , 100.36872486273647 [
-----------------------------------------------------------------------
Minimum score for no regression: 58.819171036881969 %
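For anyone who wants to check figures like these, here is a minimal Python sketch. It assumes the interval is a plain normal approximation: the per-game variance is taken from the win/draw/loss counts, the score is shifted by 2 sigma either way, and each bound is converted with the usual logistic Elo formula. That is a guess at the model, not something stated in this post, although it matches the difference and bounds of the first block to about two decimal places. It does not attempt the "minimum score" line, and the function names are made up for illustration.

```python
import math

def elo_from_score(score):
    """Logistic Elo difference implied by an expected score in (0, 1)."""
    return 400.0 * math.log10(score / (1.0 - score))

def elo_interval(wins, losses, draws, n_sigma=2.0):
    """Elo difference with a normal-approximation interval (assumed model)."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    sigma = math.sqrt(var / n)  # standard error of the mean score
    # Note: a bound that leaves (0, 1) would make log10 fail; not handled here.
    lower = elo_from_score(score - n_sigma * sigma)
    upper = elo_from_score(score + n_sigma * sigma)
    return elo_from_score(score), lower, upper

# First match above: +14/-7/=29 from Ivanhoe's point of view.
diff, lower, upper = elo_interval(wins=14, losses=7, draws=29)
print(f"{diff:.2f} Elo, interval ] {lower:.2f} , {upper:.2f} [")
```

The interval comes out asymmetric because the bounds are symmetric in score, not in Elo, and the score-to-Elo conversion is not linear.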
2-sigma confidence is more or less 95.45% confidence, i.e. correct 21 times out of 22. I use a mathematical model that is not reliable with a small number of games (like all those matches), so please take the data I post with care.
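As a quick check of that 95.45% figure, here is a small Python sketch, assuming the usual normal distribution behind "2-sigma" (the helper name is only illustrative):

```python
import math

def two_sided_coverage(n_sigma):
    """Probability that a normal variable lies within n_sigma of its mean."""
    return math.erf(n_sigma / math.sqrt(2.0))

coverage = two_sided_coverage(2.0)
print(f"2-sigma coverage: {coverage:.4%}")
print(f"21 out of 22:     {21 / 22:.4%}")
```

The two values agree to about four decimal places, which is where the "21 out of 22" rule of thumb comes from.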
I would say that IH and SF are now roughly equal in Elo terms (please take a look at the 200-game test by Pal Larkin, although 200 games are not enough for my taste). I perfectly understand that running that many games is very time consuming, so personally I appreciate it. But I also agree with Julien that luck is a very important factor in tests with a low number of games.
So, I say that few conclusions can be drawn from 50-game matches... although I enjoy viewing the results of these mini-matches! Please continue those tests.
Regards from Spain.
Ajedrecista.
Look, why are you telling me this? I already know the + and - error stats for 200 games, 400 games, right up to 2000. Your info is very interesting. What does it have to do with these matches?
And what in hell, I wonder, did Julien put in his reply for? I been testing for close to 10 years and I want to ask you and Julien a question- no use posting 2 threads. "Do you think I am either an idiot or brain damaged? Exactly where did I say these versions are better than Stockfish?"
All I said was they humbled an extremely strong engine. What in that statement is not true?! And if they did, it stands to reason the next 4 will likely do the same- as they are fairly equal. As far as I can tell, Julien is the only one who said Ivanhoe was stronger. I didn't. Which Ivanhoes is he speaking of? Just these, or all Ivanhoe versions? In 32-bit, Stockfish was stronger than some. And the ones who beat him never did so by more than 3 to 5 games.
Look, when you refer to my "mini-matches"- that is like saying have fun but they don't mean anything. If you want to think that- don't let me stand in your way.
I see the problem. If I had posted that Stockfish humbled 4 Ivanhoe engines, no one would have even replied. And how do you know how many 50-game matches I have run?
I am going to say this ONE MORE TIME AND THAT'S IT! I ran 12 Ivanhoes ag. Komodo. And I may or may not run 12 ag. Stockfish. I am trying to get a general idea of the order the top 12 or 15 Ivanhoe engines are in. And I can't do that running them ag. just one common opponent. And 2 is not really enough. If you had bothered to keep up- you would have seen in the beginning I stated what I was doing. I can show you the statement where I said wins or losses ag. Komodo were irrelevant. That was not what I was after. And I only had to repeat it 3 or 4 times.
But if you don't approve of- or see any value in what I am doing- or if it bothers you and Julien- I can get the hell out of this goddam place.
One last time- how do you know how many matches I have run? Are you sure these are the 1st ones ag. Stockfish?
But thank you for pointing it out to me that my "mini-matches"- tho maybe fun- actually mean nothing. I never would have noticed if you had not pointed it out.
Last time- where did I say they were stronger than Stockfish? I said Stockfish was humbled. You think not? You go back and read my thread again- then you come tell me you find one sentence- one lousy damn sentence- that you don't agree with.
Shit, I gotta get some sleep. I been up 48 damn hours. But you need to let me know if it is ok with you for me to post the results of the next 2 or maybe 4 matches. I need a lot of things, but someone assuming they can read my mind isn't on the list.
Am I mad? You betcha.