Extremely Strong Engine Humbled Beyond Belief!

geots · Post by **geots** » Tue May 08, 2012 4:43 pm

Stockfish 2.2.2 SSE42 x64 vs. 4 Strong 64bit Ivanhoe Versions

I think it is easier to grasp and keep up if I put all 4 results in 1 thread. This is no patzer these Ivanhoes are marching thru. Problem for Stockfish is the next 2 versions may do as well as these. They should. And the 2 versions that follow after that in theory should do much more damage than any so far- assuming they played to their full potential- which you never know.

Intel i5 w/4TCs
Fritz 11 gui
1CPU/64bit
128MB hash
Bases=NONE
Ponder_Learning=OFF
Perfect 12.32 book w/12-move limit
40/3 Repeating
Match=50 games

Stockfish 2.2.2 SSE42 x64:

1   Ivanhoe B46a x64           +49    +14/-7/=29   57.00%   28.5/50
2   Stockfish 2.2.2 JA SSE42   -49    +7/-14/=29   43.00%   21.5/50

1   Ivanhoe 46h x64            +78    +17/-6/=27   61.00%   30.5/50
2   Stockfish 2.2.2 JA SSE42   -78    +6/-17/=27   39.00%   19.5/50

1   Ivanhoe B46fC x64          +85    +18/-6/=26   62.00%   31.0/50
2   Stockfish 2.2.2 JA SSE42   -85    +6/-18/=26   38.00%   19.0/50

1   Ivanhoe B46fE.02 x64       +35    +13/-8/=29   55.00%   27.5/50
2   Stockfish 2.2.2 JA SSE42   -35    +8/-13/=29   45.00%   22.5/50
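
For anyone checking the Elo column in the tables above: the figures are consistent with the usual logistic conversion from score percentage to rating difference. The sketch below assumes that formula (the GUI's actual method is not stated in this thread); the function name `elo_from_score` is just an illustrative choice.

```python
import math


def elo_from_score(score: float) -> float:
    """Elo difference implied by a score fraction strictly between 0 and 1.

    Assumes the standard logistic Elo model; not taken from the GUI itself.
    """
    return -400.0 * math.log10(1.0 / score - 1.0)


# Points scored by each Ivanhoe version out of 50 games (from the tables above).
ivanhoe_points = {
    "Ivanhoe B46a x64": 28.5,
    "Ivanhoe 46h x64": 30.5,
    "Ivanhoe B46fC x64": 31.0,
    "Ivanhoe B46fE.02 x64": 27.5,
}

for name, points in ivanhoe_points.items():
    # Print the rounded Elo difference with an explicit sign, as in the tables.
    print(f"{name:<22} {elo_from_score(points / 50):+.0f}")
```

Rounded to whole numbers, this gives the same +49, +78, +85 and +35 shown in the Elo column.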

Once Again Tomorrow-

george

JuLieN · Post by **JuLieN** » Tue May 08, 2012 5:43 pm

Although there's definitely a pattern here (Ivanhoe is stronger than Stockfish), 50 games is a VERY small sample, George. My own tests at bullet time controls show that a result after 1000 games can still sometimes get reversed in a 2000-game bullet match between two engines.
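
To put rough numbers on the sample-size point, here is a minimal sketch of how the 2-sigma margin shrinks with the number of games. It assumes two equally strong engines, independent games, and a 50% draw ratio; those are illustrative assumptions, not figures from this thread, and `two_sigma_margin` is a hypothetical helper name.

```python
import math


def elo_from_score(score: float) -> float:
    """Elo difference implied by a score fraction (standard logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def two_sigma_margin(n_games: int, draw_ratio: float = 0.5) -> float:
    """2-sigma half-width of the score fraction for two equal engines.

    A game scores 0, 0.5 or 1. With mean 0.5, only decisive games add
    variance, so the per-game variance is (1 - draw_ratio) / 4.
    """
    variance = (1.0 - draw_ratio) / 4.0
    return 2.0 * math.sqrt(variance / n_games)


for n in (50, 200, 1000, 2000):
    margin = two_sigma_margin(n)
    # Show the margin both as a score percentage and as an Elo equivalent.
    print(f"{n:>5} games: +/- {100 * margin:.2f}% score, "
          f"about +/- {elo_from_score(0.5 + margin):.1f} Elo")
```

The margin falls only with the square root of the game count, so quadrupling the games halves it.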

ernest · Post by **ernest** » Tue May 08, 2012 8:07 pm

geots wrote:Stockfish 2.2.2 SSE42 x64:

Why not play it (again?) against your famed B46e?

Ajedrecista · Post by **Ajedrecista** » Tue May 08, 2012 9:02 pm

Hello George:

With the data you provide, I get the following error bars and the minimum score required for assuring that IH is better than SF, both for 2-sigma confidence (~ 95.45% confidence):

1   Ivanhoe B46a x64           +49    +14/-7/=29   57.00%   28.5/50
2   Stockfish 2.2.2 JA SSE42   -49    +7/-14/=29   43.00%   21.5/50
-----------------------------------------------------------------------

Confidence interval for            2-sigma:

Elo rating difference:      48.962560037161944    Elo

Lower rating difference:     -13.548115630226197    Elo
Upper rating difference:      114.82934892620664    Elo

Lower bound uncertainty:     -62.510675667388142    Elo
Upper bound uncertainty:      65.866788889044697    Elo
Average error: +-     64.188732278216419    Elo

K = (average error)*[sqrt(n)] =      453.88287869694658

Elo interval: ]    -13.548115630226197    ,     114.82934892620664    [
-----------------------------------------------------------------------

Minimum score for no regresion:     58.819171036881969    %

1   Ivanhoe 46h x64            +78    +17/-6/=27   61.00%   30.5/50
2   Stockfish 2.2.2 JA SSE42   -78    +6/-17/=27   39.00%   19.5/50
-----------------------------------------------------------------------

Confidence interval for            2-sigma:

Elo rating difference:      77.706091193707124    Elo

Lower rating difference:      13.396545663492863    Elo
Upper rating difference:      147.79531731086832    Elo

Lower bound uncertainty:     -64.309545530214261    Elo
Upper bound uncertainty:      70.089226117161201    Elo
Average error: +-     67.199385823687731    Elo

K = (average error)*[sqrt(n)] =      475.17141407500744

Elo interval: ]     13.396545663492863    ,     147.79531731086832    [
-----------------------------------------------------------------------

Minimum score for no regresion:     59.229582069908972    %

1   Ivanhoe B46fC x64          +85    +18/-6/=26   62.00%   31.0/50
2   Stockfish 2.2.2 JA SSE42   -85    +6/-18/=26   38.00%   19.0/50
-----------------------------------------------------------------------

Confidence interval for            2-sigma:

Elo rating difference:      85.043237152577479    Elo

Lower rating difference:      19.537412490302369    Elo
Upper rating difference:      157.16130696200742    Elo

Lower bound uncertainty:     -65.505824662275111    Elo
Upper bound uncertainty:      72.118069809429936    Elo
Average error: +-     68.811947235852524    Elo

K = (average error)*[sqrt(n)] =      486.57394517122224

Elo interval: ]     19.537412490302369    ,     157.16130696200742    [
-----------------------------------------------------------------------

Minimum score for no regresion:     59.428090415820634    %

1   Ivanhoe B46fE.02 x64       +35    +13/-8/=29   55.00%   27.5/50
2   Stockfish 2.2.2 JA SSE42   -35    +8/-13/=29   45.00%   22.5/50
-----------------------------------------------------------------------

Confidence interval for            2-sigma:

Elo rating difference:      34.860070287560063    Elo

Lower rating difference:     -28.241740262805108    Elo
Upper rating difference:      100.36872486273647    Elo

Lower bound uncertainty:     -63.101810550365171    Elo
Upper bound uncertainty:      65.508654575176408    Elo
Average error: +-     64.305232562770790    Elo

K = (average error)*[sqrt(n)] =      454.70666010913216

Elo interval: ]    -28.241740262805108    ,     100.36872486273647    [
-----------------------------------------------------------------------

Minimum score for no regresion:     58.819171036881969    %

2-sigma confidence is more or less 95.45% confidence, i.e. correct 21 times out of 22. I use a mathematical model that is not reliable with a small number of games (like all these matches), so please treat the data I post with care.
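
The exact model behind the figures above is not spelled out in the post. As a hedged sketch, a plain normal approximation on the win/draw/loss counts appears to reproduce the posted Elo intervals: take the mean score, estimate the per-game variance from the three outcomes, move 2 standard errors either side, and convert each bound to Elo. The names `elo_interval` and `two_sided_confidence` are illustrative, and the "minimum score" line is not attempted because its formula is not given.

```python
import math


def elo_from_score(score: float) -> float:
    """Elo difference implied by a score fraction (standard logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_interval(wins: int, losses: int, draws: int, sigmas: float = 2.0):
    """Return (lower, centre, upper) Elo difference from one engine's viewpoint.

    Assumption: games are independent and the mean score is roughly normal.
    """
    n = wins + losses + draws
    mean = (wins + 0.5 * draws) / n
    # Per-game variance of a result that is 1 (win), 0.5 (draw) or 0 (loss).
    variance = (
        wins * (1.0 - mean) ** 2
        + draws * (0.5 - mean) ** 2
        + losses * (0.0 - mean) ** 2
    ) / n
    half_width = sigmas * math.sqrt(variance / n)
    return (
        elo_from_score(mean - half_width),
        elo_from_score(mean),
        elo_from_score(mean + half_width),
    )


def two_sided_confidence(sigmas: float) -> float:
    """Probability mass of a normal distribution within +/- sigmas."""
    return math.erf(sigmas / math.sqrt(2.0))


# First match: Ivanhoe B46a x64 scored +14 -7 =29.
lower, centre, upper = elo_interval(wins=14, losses=7, draws=29)
print(f"Elo: {centre:.2f}  interval: ]{lower:.2f}, {upper:.2f}[")
print(f"2-sigma confidence: {100 * two_sided_confidence(2.0):.2f}%")
```

For the first match this comes out at roughly 48.96 Elo with an interval of about ]-13.55, 114.83[, in line with the posted values.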

I would say that IH and SF are now roughly equal in Elo terms (please take a look at the 200-game test by Pal Larkin, although 200 games are not enough for my taste). I perfectly understand that running a large number of games is very time-consuming, so I personally appreciate it. But I also agree with Julien that luck is a very important factor in tests with a low number of games.

So, I say that few conclusions can be extracted from 50-game matches... although I enjoy viewing the results of these mini-matches! Please continue those tests.

Regards from Spain.

Ajedrecista.

geots · Post by **geots** » Tue May 08, 2012 11:22 pm

Ajedrecista wrote:Hello George:

With the data you provide, I get the following error bars and the minimum score required for assuring that IH is better than SF, both for 2-sigma confidence (~ 95.45% confidence):

1   Ivanhoe B46a x64           +49    +14/-7/=29   57.00%   28.5/50
2   Stockfish 2.2.2 JA SSE42   -49    +7/-14/=29   43.00%   21.5/50
-----------------------------------------------------------------------

Confidence interval for            2-sigma:

Elo rating difference:      48.962560037161944    Elo

Lower rating difference:     -13.548115630226197    Elo
Upper rating difference:      114.82934892620664    Elo

Lower bound uncertainty:     -62.510675667388142    Elo
Upper bound uncertainty:      65.866788889044697    Elo
Average error: +-     64.188732278216419    Elo

K = (average error)*[sqrt(n)] =      453.88287869694658

Elo interval: ]    -13.548115630226197    ,     114.82934892620664    [
-----------------------------------------------------------------------

Minimum score for no regresion:     58.819171036881969    %

1   Ivanhoe 46h x64            +78    +17/-6/=27   61.00%   30.5/50
2   Stockfish 2.2.2 JA SSE42   -78    +6/-17/=27   39.00%   19.5/50
-----------------------------------------------------------------------

Confidence interval for            2-sigma:

Elo rating difference:      77.706091193707124    Elo

Lower rating difference:      13.396545663492863    Elo
Upper rating difference:      147.79531731086832    Elo

Lower bound uncertainty:     -64.309545530214261    Elo
Upper bound uncertainty:      70.089226117161201    Elo
Average error: +-     67.199385823687731    Elo

K = (average error)*[sqrt(n)] =      475.17141407500744

Elo interval: ]     13.396545663492863    ,     147.79531731086832    [
-----------------------------------------------------------------------

Minimum score for no regresion:     59.229582069908972    %

1   Ivanhoe B46fC x64          +85    +18/-6/=26   62.00%   31.0/50
2   Stockfish 2.2.2 JA SSE42   -85    +6/-18/=26   38.00%   19.0/50
-----------------------------------------------------------------------

Confidence interval for            2-sigma:

Elo rating difference:      85.043237152577479    Elo

Lower rating difference:      19.537412490302369    Elo
Upper rating difference:      157.16130696200742    Elo

Lower bound uncertainty:     -65.505824662275111    Elo
Upper bound uncertainty:      72.118069809429936    Elo
Average error: +-     68.811947235852524    Elo

K = (average error)*[sqrt(n)] =      486.57394517122224

Elo interval: ]     19.537412490302369    ,     157.16130696200742    [
-----------------------------------------------------------------------

Minimum score for no regresion:     59.428090415820634    %

1   Ivanhoe B46fE.02 x64       +35    +13/-8/=29   55.00%   27.5/50
2   Stockfish 2.2.2 JA SSE42   -35    +8/-13/=29   45.00%   22.5/50
-----------------------------------------------------------------------

Confidence interval for            2-sigma:

Elo rating difference:      34.860070287560063    Elo

Lower rating difference:     -28.241740262805108    Elo
Upper rating difference:      100.36872486273647    Elo

Lower bound uncertainty:     -63.101810550365171    Elo
Upper bound uncertainty:      65.508654575176408    Elo
Average error: +-     64.305232562770790    Elo

K = (average error)*[sqrt(n)] =      454.70666010913216

Elo interval: ]    -28.241740262805108    ,     100.36872486273647    [
-----------------------------------------------------------------------

Minimum score for no regresion:     58.819171036881969    %

2-sigma confidence is more or less 95.45% confidence, i.e. correct 21 times out of 22. I use a mathematical model that is not reliable with a small number of games (like all these matches), so please treat the data I post with care.

I would say that IH and SF are now roughly equal in Elo terms (please take a look at the 200-game test by Pal Larkin, although 200 games are not enough for my taste). I perfectly understand that running a large number of games is very time-consuming, so I personally appreciate it. But I also agree with Julien that luck is a very important factor in tests with a low number of games.

So, I say that few conclusions can be extracted from 50-game matches... although I enjoy viewing the results of these mini-matches! Please continue those tests.

Regards from Spain.

Ajedrecista.

Look, why are you telling me this. I already know the + and - error stats for 200 games, 400 games, right up to 2000. Your info is very interesting. What does it have to do with these matches.

And what in hell I wonder did Julien put in his reply for. I been testing for close to 10 years and I want to ask you and Julien a question- no use posting 2 threads. "Do you think I am either an idiot or brain damaged? Exactly where did I say these versions are better than Stockfish?" All I said was they humbled an extremely strong engine. What in that statement is not true?! And if they did, it stands to reason the next 4 will likely do the same- as they are fairly equal. As far as I can tell, Julien is the only one who said Ivanhoe was stronger. I didn't. Which Ivanhoes is he speaking of. Just these, or all Ivanhoe versions. In 32 bit, Stockfish was stronger than some. And the ones who beat him never did by over 3 to 5 games.

Look, when you refer to my "mini-matches"- that is like saying have fun but they don't mean anything. If you want to think that- don't let me stand in your way.

I see the problem. If I had posted that Stockfish humbled 4 Ivanhoe engines, no one would have even replied. And how do you know how many 50 game matches I have run.

I am going to say this ONE MORE TIME AND THAT'S IT! I ran 12 Ivanhoes ag. Komodo. And I may or may not run 12 ag. Stockfish. I am trying to get a general idea of the order the top 12 or 15 Ivanhoe engines are in. And I can't do that running them ag. just one common opponent. And 2 is not really enough. If you had bothered to keep up- you would have seen in the beginning I stated what I was doing. I can show you the statement where I said wins or losses ag. Komodo were irrelevant. That was not what I was after. And I only had to repeat it 3 or 4 times.

But if you don't approve of- or see any value in what I am doing- or if it bothers you and Julien- I can get the hell out of this goddam place.

One last time- how do you know how many matches I have run? Are you sure these are the 1st ones ag. Stockfish?

But thank you for pointing it out to me that my "mini-matches"- tho maybe fun- actually mean nothing. I never would have noticed if you had not pointed it out.

Last time- where did I say they were stronger than Stockfish. I said Stockfish was humbled. You think not? You go back and read my thread again- then you come tell me you find one sentence- one lousy damn sentence- that you don't agree with.

Shit, I gotta get some sleep. I been up 48 damn hours. But you need to let me know if it is ok with you for me to post the results of the next 2 or maybe 4 matches. I need a lot of things, but someone assuming they can read my mind isn't on the list.

Am I mad? You betcha.

Kyodai · Post by **Kyodai** » Wed May 09, 2012 6:52 am

George - just keep the results coming! They are very interesting - as is
the engine (s) Ivanhoe!

Ajedrecista · Post by **Ajedrecista** » Wed May 09, 2012 9:27 am

Hello George:

geots wrote: Do you think I am either an idiot or brain damaged?

Of course not. But I think you are very touchy, judging by many of your posts. My intention was not to offend you, but unfortunately I know I have managed to.

geots wrote: Look, when you refer to my "mini-matches"- that is like saying have fun but they don't mean anything. If you want to think that- don't let me stand in your way.

Yes, I enjoy them and I also think that they mean almost nothing statistically. But this is better than nothing!

geots wrote: I see the problem. If I had posted that Stockfish humbled 4 Ivanhoe engines, no one would have even replied. And how do you know how many 50 game matches I have run.

Completely wrong: I was referring to the number of games, regardless of the opponents. In fact, IH is an engine I like very much, the same as SF.

geots wrote: But if you don't approve of- or see any value in what I am doing- or if it bothers you and Julien- I can get the hell out of this goddam place.

I cannot speak for Julien, but I think your tests are interesting in spite of the number of games, and I encouraged you in my other post to keep up the good work.

geots wrote: One last time- how do you know how many matches I have run? Are you sure these are the 1st ones ag. Stockfish?

I am aware that you have tested SF before; I am also aware that you have run lots of tests with lots of engines, and I realize that this is a very difficult task.

geots wrote: But thank you for pointing it out to me that my "mini-matches"- tho maybe fun- actually mean nothing. I never would have noticed if you had not pointed it out.

Very sarcastic.

geots wrote: Shit, I gotta get some sleep. I been up 48 damn hours. But you need to let me know if it is ok with you for me to post the results of the next 2 or maybe 4 matches. I need a lot of things, but someone assuming they can read my mind isn't on the list.

Am I mad? You betcha.

I agree, you need to sleep because 48 hours without sleep can not be a good thing. I like the idea that you post more tests.

Anyway, I also agree with those who think that the computer chess community sucks nowadays and that replying to some people can be a complete waste of time. I registered on TalkChess not to make friends, but also not to make enemies... I managed the latter unintentionally.

Regards from Spain.

Ajedrecista.

geots · Post by **geots** » Wed May 09, 2012 4:11 pm

Ajedrecista wrote:Hello George:

Do you think I am either an idiot or brain damaged?
Of course not. But I think you are very touchy, judging by many of your posts. My intention was not to offend you, but unfortunately I know I have managed to.

Look, when you refer to my "mini-matches"- that is like saying have fun but they don't mean anything. If you want to think that- don't let me stand in your way.
Yes, I enjoy them and I also think that they mean almost nothing statistically. But this is better than nothing!

I see the problem. If I had posted that Stockfish humbled 4 Ivanhoe engines, no one would have even replied. And how do you know how many 50 game matches I have run.
Completely wrong: I was referring to the number of games, regardless of the opponents. In fact, IH is an engine I like very much, the same as SF.

But if you don't approve of- or see any value in what I am doing- or if it bothers you and Julien- I can get the hell out of this goddam place.
I cannot speak for Julien, but I think your tests are interesting in spite of the number of games, and I encouraged you in my other post to keep up the good work.

One last time- how do you know how many matches I have run? Are you sure these are the 1st ones ag. Stockfish?
I am aware that you have tested SF before; I am also aware that you have run lots of tests with lots of engines, and I realize that this is a very difficult task.

But thank you for pointing it out to me that my "mini-matches"- tho maybe fun- actually mean nothing. I never would have noticed if you had not pointed it out.
Very sarcastic.

Shit, I gotta get some sleep. I been up 48 damn hours. But you need to let me know if it is ok with you for me to post the results of the next 2 or maybe 4 matches. I need a lot of things, but someone assuming they can read my mind isn't on the list.

Am I mad? You betcha.
I agree, you need to sleep because 48 hours without sleep can not be a good thing. I like the idea that you post more tests.

Anyway, I also agree with those who think that the computer chess community sucks nowadays and that replying to some people can be a complete waste of time. I registered on TalkChess not to make friends, but also not to make enemies... I managed the latter unintentionally.

Regards from Spain.

Ajedrecista.

I am not in the business of taking someone's thread and picking it apart phrase by phrase. Because in the first place it simply doesn't make sense unless it is taken in the whole context. I have never been impressed with people who did this.

First- no way to prove it- it would never be admitted. Julien doesn't give a rat's ass about any of my match threads- or if I ever post any. His intent yesterday was in response to a complaint from someone bothered by the results- not by anything I said. When he got here- he naturally saw nothing to moderate.

My mini-matches mean almost nothing statistically? That is a crock of shit- unless you know exactly how many 50game matches I have between the same exact 2 engine versions. Why would I show them if I had them- I am after something else that doesn't include the strength relationship between Stockfish and Ivanhoe.

3 things. You picked the 13-8 score to use in your statistical analysis. Why did you not pick the 18-6 score? Not that I really care what you do or don't do. Maybe you will use the 16-4 score I post in a little while. Again- I don't care. On its own it means little except to me.

2nd- your analysis is a waste of time for more than one reason. It might make sense if first you knew what I was after and even then if you asked me BEFORE you went off on some statistical hunt- how many 50game matches did I have besides this 1 to back it up. You assume way too much.

Lastly, and by God this really is THE LAST TIME I SAY IT. I don't care whether Stockfish wins all matches- or none- against Ivanhoe versions. Stockfish is one of my bases in a litmus test to help me rate the top 15 or so Ivanhoe versions. Nothing more- nothing less. I am running 12 or so Ivanhoe versions ag. some common opponents. After I run them ag. maybe 6 to 8 common ones, I will make my list of the order of the "Top 12" of Ivanhoe versions. People like Brent will have access to my results and my rating list. You won't- so no need for you to wonder about it. If it is not exactly perfect- you will never be bothered with it.

And next time you start off on a tester's match- at least have the good sense to first ask him what he is after. You look silly spending time on a statistical foray that means not one thing when you consider my intentions for running the goddam matches in the first place.

To close- there is nothing- n-o-t / o-n-e / s-i-n-g-l-e / t-h-i-n-g, that you can tell me about number of games and + and - error bars that I don't already know. I know exactly how much it changes with 100 more games, 200 more games, 400 more games, 700 more games, 1000 more games, 2000 more games. The only thing I don't know, is the detail and the math/QA that is used to arrive at the figures. And I don't know because I don't give a shit. Don't show me how you arrived at the figures- I don't care. And I already KNOW the figures, so exactly what is it that you have that I need to know?!

gts

geots · Post by **geots** » Wed May 09, 2012 4:13 pm

Kyodai wrote:George - just keep the results coming! They are very interesting - as is
the engine (s) Ivanhoe!

Thank you Sune. I appreciate your interest!

gts

JuLieN · Post by **JuLieN** » Wed May 09, 2012 4:25 pm

geots wrote: First- no way to prove it- it would never be admitted. Julien doesn't give a rat's ass about any of my match threads- or if I ever post any. His intent yesterday was in response to a complaint from someone bothered by the results- not by anything I said. When he got here- he naturally saw nothing to moderate.

George, get real! Nobody complained about your post!

1) why would someone complain about your original post?

2) then, in the extraordinary event where someone wanted to complain, there would be nothing to moderate: that's what I would PM this person, and I would post nothing in your thread.

3) if I posted something in your thread it's obviously that I "gave a rat's ass", to quote you. And I found your post interesting.

4) My post was friendly, and I was just sharing my own experience about testing and statistics. So I'm extremely surprised by your enormous aggressiveness. Just as Thord pointed out, there is something dysfunctional and toxic about this community, and as soon as I'm done with my duty as a mod I will just leave to never come back.
