Notice- Some EXTREMELY Interesting Stockfish Results!

geots
Posts: 4790
Joined: Sat Mar 11, 2006 12:42 am

Notice- Some EXTREMELY Interesting Stockfish Results!

Post by geots »

I have 3 short matches here. The time controls were not ideal for comparison: one match was 5'+3" and the other two were 40/10 repeating (benched to adapt to 40/20). The matches were 20 games, 30 games and 30 games. Marco already has the results from the 2nd match, but it is the 3rd match that was the stunner! The first 2 matches used Stockfish 2.3.1 and 2.2.2 respectively, though there is probably no more than a few Elo between them. All 3 were 4CPU matches run on my XP quad, which is now in transit to a good friend, Tom Likens. At any rate, you can draw your own conclusions.




XP Pro x64 Intel Quad
Fritz 11 gui
4CPU/64bit****
128MB hash
Bases=NONE
Ponder_Learning=OFF
Fritz 11.ctg w/10-move limit*
5'+3"
Match=20 games



Ivanhoe 46h x64       +53    +9/-6/=5   57.50%   11.5/20 
Stockfish 2.3.1 x64   -53    +6/-9/=5   42.50%    8.5/20




XP Pro x64 Intel Quad
Fritz 11 gui
4CPU/64bit****
128MB hash
Bases=NONE
Ponder_Learning=OFF
Fritz 11.ctg w/10-move limit*
40/10 Repeating (Benched to adapt to 40/20)
Match=30 games



Ivanhoe B46fE.02 x64       +47    +7/-3/=20   56.67%   17.0/30
Stockfish 2.2.2 JA 64bit   -47    +3/-7/=20   43.33%   13.0/30


Now to the 3rd match, where Ivanhoe 46h again plays, but this time against the latest Stockfish beta of Marco's.




XP Pro x64 Intel Quad
Fritz 11 gui
4CPU/64bit****
128MB hash
Bases=NONE
Ponder_Learning=OFF
Fritz 11.ctg w/10-move limit*
40/10 Repeating (Benched to adapt to 40/20)
Match=30 games



Stockfish 25-03-13 64bit   0    +5/-5/=20   50.00%   15.0/30
Ivanhoe 46h x64            0    +5/-5/=20   50.00%   15.0/30
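
For reference, the percentage and Elo columns in the three tables above follow from the +W/-L/=D counts. Here is a minimal Python sketch, assuming the usual logistic Elo model. The function names are made up for illustration, and this is not necessarily how the Fritz GUI computes its figures.

```python
import math

def score_fraction(wins, losses, draws):
    """Points scored divided by games played (a draw counts as half a point)."""
    games = wins + losses + draws
    return (wins + 0.5 * draws) / games

def elo_diff(score):
    """Elo difference implied by a score fraction under the logistic model."""
    return 400.0 * math.log10(score / (1.0 - score))

def draw_rate(wins, losses, draws):
    """Fraction of games that ended in a draw."""
    return draws / (wins + losses + draws)

# (wins, losses, draws) from the first-listed engine's point of view,
# copied from the three result tables in this post.
matches = {
    "Ivanhoe 46h vs Stockfish 2.3.1": (9, 6, 5),
    "Ivanhoe B46fE.02 vs Stockfish 2.2.2 JA": (7, 3, 20),
    "Stockfish 25-03-13 vs Ivanhoe 46h": (5, 5, 20),
}

for name, wld in matches.items():
    s = score_fraction(*wld)
    print(f"{name}: {100 * s:.2f}%  Elo {elo_diff(s):+.0f}  "
          f"draws {100 * draw_rate(*wld):.1f}%")
```

This reproduces the +53 and +47 in the first two tables and 0 for the third. The two 30-game matches each had 20 draws, a draw rate of about 67%.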


Results are left to individual interpretation.



Bye-

george
hgm
Posts: 28446
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Notice- Some EXTREMELY Interesting Stockfish Results!

Post by hgm »

OK, so it won two games where the other version lost. Sometimes you just get lucky...

Seems your book is too drawish to do sensitive testing of the engine strength, btw. 80% draws is awful.
geots
Posts: 4790
Joined: Sat Mar 11, 2006 12:42 am

Re: Notice- Some EXTREMELY Interesting Stockfish Results!

Post by geots »

hgm wrote:OK, so it won two games where the other version lost. Sometimes you just get lucky...

Seems your book is too drawish to do sensitive testing of the engine strength, btw. 80% draws is awful.

HG, I am beginning to wonder what you actually know about testing. You could not be any more wrong. All books, generic or not- in testing are designed for just that. In a perfect world- the engines should come out of the opening DEAD EVEN. And you call that drawish- that is the way it is supposed to be. Then it is up to each engine to win from an even position. Like CCRL for example. When I tested for them, if an engine came out of the opening with +0.75 or more- the game was thrown out. And any advantage an engine had over +1.00 coming out of the opening would NEVER EVER be used.

Your idea is great for engines playing each other using their OWN BOOKS. But that is not the way CCRL tests, not the way CEGT tests and certainly not the way I test.

So here, your argument is not worth the bandwidth used to type it. And note, alas, that I said the results were left to individual interpretation. But for the draws to be blamed on the book is insanity- pure and simple.



gts
lucasart
Posts: 3243
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Notice- Some EXTREMELY Interesting Stockfish Results!

Post by lucasart »

geots wrote:
hgm wrote:OK, so it won two games where the other version lost. Sometimes you just get lucky...

Seems your book is too drawish to do sensitive testing of the engine strength, btw. 80% draws is awful.

HG, I am beginning to wonder what you actually know about testing. You could not be any more wrong. All books, generic or not- in testing are designed for just that. In a perfect world- the engines should come out of the opening DEAD EVEN. And you call that drawish- that is the way it is supposed to be. Then it is up to each engine to win from an even position. Like CCRL for example. When I tested for them, if an engine came out of the opening with +0.75 or more- the game was thrown out. And any advantage an engine had over +1.00 coming out of the opening would NEVER EVER be used.

Your idea is great for engines playing each other using their OWN BOOKS. But that is not the way CCRL tests, not the way CEGT tests and certainly not the way I test.

So here, your argument is not worth the bandwidth used to type it. And note, alas, that I said the results were left to individual interpretation. But for the draws to be blamed on the book is insanity- pure and simple.



gts
Excuse me, but you are the one who is clueless here. And it's not the first time you have demonstrated your ignorance.

HGM is right: getting 2 points more on a 30 game sample doesn't mean anything.

If you want to look at some *meaningful* statistics about Stockfish, you should look at Gary's page:
http://54.235.120.254:6543/tests

As you can see, the last test between SF and SF 2.3.1 shows this result (time control is 60"+0.05")

ELO: 18.76 +-2.8 (95%) LOS: 100.0%
Total: 20000 W: 3903 L: 2824 D: 13273

With *that* kind of result, you can distinguish the signal from the noise. With a probability of 95%, we can say that SF is better than SF 2.3.1 by an Elo margin between 18.76-2.8 and 18.76+2.8.
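
The figures above can be checked from the W/L/D counts alone. The sketch below is a rough Python version, assuming a logistic Elo model, a normal approximation for the 95% interval, and the commonly used erf approximation for LOS. The function names are hypothetical and this is not necessarily the exact method the test page uses.

```python
import math

def elo_from_score(score):
    """Elo difference implied by a score fraction under the logistic model."""
    return 400.0 * math.log10(score / (1.0 - score))

def match_stats(wins, losses, draws, z=1.96):
    """Return (elo, half-width of the ~95% interval in Elo, LOS)."""
    games = wins + losses + draws
    mean = (wins + 0.5 * draws) / games
    # Per-game variance of the result (1, 0.5 or 0 points).
    variance = (wins * (1.0 - mean) ** 2
                + draws * (0.5 - mean) ** 2
                + losses * (0.0 - mean) ** 2) / games
    stderr = math.sqrt(variance / games)
    low = elo_from_score(mean - z * stderr)
    high = elo_from_score(mean + z * stderr)
    # Likelihood of superiority: draws carry no information here.
    los = 0.5 * (1.0 + math.erf((wins - losses)
                                / math.sqrt(2.0 * (wins + losses))))
    return elo_from_score(mean), (high - low) / 2.0, los

# The 20000-game test quoted above.
elo, margin, los = match_stats(3903, 2824, 13273)
print(f"Elo {elo:.2f} +- {margin:.1f}  LOS {100 * los:.1f}%")

# The 30-game match from the first post (+5/-5/=20).
elo, margin, los = match_stats(5, 5, 20)
print(f"Elo {elo:.2f} +- {margin:.0f}  LOS {100 * los:.1f}%")
```

Under these assumptions the 20000-game test comes out at the quoted 18.76 +-2.8, while a 30-game match with 20 draws has a 95% margin of roughly +-70 Elo. That is why a 2-point swing in 30 games cannot be told apart from noise.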
geots
Posts: 4790
Joined: Sat Mar 11, 2006 12:42 am

Re: Notice- Some EXTREMELY Interesting Stockfish Results!

Post by geots »

lucasart wrote:
geots wrote:
hgm wrote:OK, so it won two games where the other version lost. Sometimes you just get lucky...

Seems your book is too drawish to do sensitive testing of the engine strength, btw. 80% draws is awful.

HG, I am beginning to wonder what you actually know about testing. You could not be any more wrong. All books, generic or not- in testing are designed for just that. In a perfect world- the engines should come out of the opening DEAD EVEN. And you call that drawish- that is the way it is supposed to be. Then it is up to each engine to win from an even position. Like CCRL for example. When I tested for them, if an engine came out of the opening with +0.75 or more- the game was thrown out. And any advantage an engine had over +1.00 coming out of the opening would NEVER EVER be used.

Your idea is great for engines playing each other using their OWN BOOKS. But that is not the way CCRL tests, not the way CEGT tests and certainly not the way I test.

So here, your argument is not worth the bandwidth used to type it. And note, alas, that I said the results were left to individual interpretation. But for the draws to be blamed on the book is insanity- pure and simple.



gts
Excuse me, but you are the one who is clueless here. And it's not the first time you have demonstrated your ignorance.

HGM is right: getting 2 points more on a 30 game sample doesn't mean anything.

If you want to look at some *meaningful* statistics about Stockfish, you should look at Gary's page:
http://54.235.120.254:6543/tests

As you can see, the last test between SF and SF 2.3.1 shows this result (time control is 60"+0.05")

ELO: 18.76 +-2.8 (95%) LOS: 100.0%
Total: 20000 W: 3903 L: 2824 D: 13273

With *that* kind of result, you can distinguish the signal from the noise. With a probability of 95%, we can say that SF is better than SF 2.3.1 by an Elo margin between 18.76-2.8 and 18.76+2.8.



And another person who doesn't take the goddam time to even read a thread. He was wrong about the book- and you were wrong about my purpose for posting the results. Did I say it was enough games to prove a thing- NO. Did I say I considered these results were enough games to give an accurate elo representation- NO. Did I say the interpretation was left to each individual to decide for himself- YES. So read the threads before you start replying with bullshit rubbish. Is this some kind of disease going around with you guys? If I'm ignorant- you need to look in the mirror. And be sure after you get thru typing, wash your hands good. This was windows, and you don't like to get your hands soiled.

I have just about had enough of you and your half-ass assumptions to last a lifetime.

I have replied my last time to you. At my age, time becomes valuable- and I don't have any to waste dealing with you.
hgm
Posts: 28446
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Notice- Some EXTREMELY Interesting Stockfish Results!

Post by hgm »

geots wrote:All books, generic or not- in testing are designed for just that. In a perfect world- the engines should come out of the opening DEAD EVEN. And you call that drawish- that is the way it is supposed to be.
And I say that is not a very smart way of testing, because you waste most of your time on producing draws. That you have always done it this way, and that everyone else has always done it this way, doesn't make it smart: you are all wasting most of your testing time producing bloodless draws.

This problem is comparatively new, because it only occurs now that the level of play has become so high (for 3000+ engines at long TC) that 'dead even' is too deep into the drawing zone to overcome in most cases. This is seen by the draw rate shooting up from its usual 32% to a much higher number.

That engines should come out of the opening dead even is in fact nothing but an unfounded superstition (albeit a very common one). Can you point me to test results that show a 'dead even' book produces different results from an 'on the edge' book where all out-of-book positions are on the brink of being lost or won? I bet you cannot.

It is easily predicted that, for determining LOS, an on-the-edge book (preferably with all positions also played with reversed colors) would give you more reliable results in fewer games at these high levels of play. It seems to me that what you describe as 'knowing about testing' actually stands for 'conforming to tradition without knowing or understanding'.
And note, alas, that I said the results were left to individual interpretation.
Well, so I shared my interpretation with the other forum dwellers. Do you have a problem with that?
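
To put a rough number on the draw-rate argument in this post: LOS depends only on the decisive games, so a higher draw rate means more games are needed to collect the same evidence. The Python sketch below rests on a strong assumption that is exactly the point under dispute in this thread, namely that the stronger engine's share of the decisive games stays the same whichever book is used. The function name and the 58% win share are made up for illustration.

```python
import math

def games_needed(draw_rate, win_share, z_target=1.96):
    """Expected number of games before (W - L) / sqrt(W + L) reaches z_target.

    draw_rate: fraction of all games that are drawn.
    win_share: stronger engine's share of the decisive games (assumed fixed).
    """
    decisive_per_game = 1.0 - draw_rate
    edge = 2.0 * win_share - 1.0  # (W - L) / (W + L)
    return z_target ** 2 / (decisive_per_game * edge ** 2)

usual = games_needed(draw_rate=0.32, win_share=0.58)      # the "usual 32%"
drawish = games_needed(draw_rate=20 / 30, win_share=0.58)  # 20 draws in 30 games

print(f"games needed at 32% draws: {math.ceil(usual)}")
print(f"games needed at 67% draws: {math.ceil(drawish)}")
print(f"ratio: {drawish / usual:.2f}")
```

With everything else held equal, going from a 32% draw rate to 20 draws in 30 games roughly doubles the number of games needed. Whether an on-the-edge book really leaves the win share unchanged is something only test results can show.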
geots
Posts: 4790
Joined: Sat Mar 11, 2006 12:42 am

Re: Notice- Some EXTREMELY Interesting Stockfish Results!

Post by geots »

hgm wrote:
geots wrote:All books, generic or not- in testing are designed for just that. In a perfect world- the engines should come out of the opening DEAD EVEN. And you call that drawish- that is the way it is supposed to be.
And I say that is not a very smart way of testing, because you waste most of your time on producing draws. That you have always done it this way, and that everyone else has always done it this way, doesn't make it smart: you are all wasting most of your testing time producing bloodless draws.

This problem is comparatively new, because it only occurs now that the level of play has become so high (for 3000+ engines at long TC) that 'dead even' is too deep into the drawing zone to overcome in most cases. This is seen by the draw rate shooting up from its usual 32% to a much higher number.

That engines should come out of the opening dead even is in fact nothing but an unfounded superstition (albeit a very common one). Can you point me to test results that show a 'dead even' book produces different results from an 'on the edge' book where all out-of-book positions are on the brink of being lost or won? I bet you cannot.

It is easily predicted that, for determining LOS, an on-the-edge book (preferably with all positions also played with reversed colors) would give you more reliable results in fewer games at these high levels of play. It seems to me that what you describe as 'knowing about testing' actually stands for 'conforming to tradition without knowing or understanding'.
And note, alas, that I said the results were left to individual interpretation.
Well, so I shared my interpretation with the other forum dwellers. Do you have a problem with that?



Problem is- or maybe it is not a problem- but I am not interested in the least as to what you think about my testing methods. Try this with CCRL and CEGT and see how much attention they pay to you. I can already tell you. It is highly unlikely they will bother to answer.

One last time- these draws were not the norm. Very seldom does it happen with this many as you refer to. But there are exceptions from time to time and you live with them. But to have what you want- all the testers would have to use "own books" with the engines. Because with generic books, the idea as I said is to come out of the opening in exactly the opposite condition as you think is better- which is as close to dead even as possible.

I have discussed this with you as far as I am willing to go. Think what you like- it's a free world. Just don't expect any of the other testers to listen to you as much as I have been willing to. But it is time to close this discussion from my end.

Let me just say this in closing, because you started it.


1. Marco and I have talked. He is happy with my work and my help.

2. I beta tested Komodo 5 for Don and am beta testing Komodo 6 as we speak. He is satisfied with my work.

3. I am testing Strelka 5.6 for Yuri. The only one he has allowed to have it.

4. I am helping Giancarlo and beta testing Equinox for him. The only one he will allow anywhere near his betas.

5. I am beta testing Djinn and helping my friend Tom Likens. Again, the only one he allows near his betas.

Ask Yuri, Giancarlo and Tom if they would be willing to trade me for another tester.

So- I stay busy like there are not enough hours in the day. But I enjoy it. So you have issues with my testing. Like I said, it's a free world- knock yourself out. Just don't expect me to care.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Notice- Some EXTREMELY Interesting Stockfish Results!

Post by Adam Hair »

Just a few things to note:

For the CCRL database, games are thrown out when both engines report an evaluation whose absolute value exceeds 0.75 for the whole game. That throws out openings that are too unbalanced in a particular batch of games. This is not necessarily a cure-all.

The goal is for both engines to have a chance to win the game when leaving the book. However, there are the openings where more than likely two engines of similar strength will draw. For the most part, these positions are not a big problem for the CCRL. The CCRL rating list has low resolution, and its focus is on a wide selection of engines.

In the case of individual matches between strong opponents that are less likely to make bad moves, more opening positions have a chance to be "drawish". If the purpose of the matches is to see which engine is stronger, then I agree with HG. Using more "on edge" openings would be informative. But it would take some work to collect these types of openings. Using openings that are too unbalanced is not very informative either.
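
The filter described at the top of this post could be sketched like this in Python. This is only one reading of the rule: a game is discarded when every evaluation reported by both engines has an absolute value above 0.75. The function name and the list-of-floats input are hypothetical and are not CCRL's actual tooling.

```python
def is_discarded(evals_a, evals_b, threshold=0.75):
    """Return True when both engines report |eval| > threshold on every move.

    evals_a, evals_b: per-move evaluations in pawns, one list per engine
    (hypothetical input format). A game with no evaluations is kept.
    """
    all_evals = list(evals_a) + list(evals_b)
    if not all_evals:
        return False
    return all(abs(e) > threshold for e in all_evals)

# Unbalanced from the first move out of book: thrown out.
print(is_discarded([0.90, 1.20, 2.50], [-0.80, -1.50, -2.60]))

# Starts roughly level and is only decided later: kept.
print(is_discarded([0.10, 0.40, 1.80], [0.00, -0.50, -1.90]))
```

A stricter variant might also require the sign to stay the same throughout, so that only games where one side was ahead all along are dropped. The description here does not say which of the two is meant.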
geots
Posts: 4790
Joined: Sat Mar 11, 2006 12:42 am

Re: Notice- Some EXTREMELY Interesting Stockfish Results!

Post by geots »

Adam Hair wrote:Just a few things to note:

For the CCRL database, games are thrown out when both engines report an evaluation whose absolute value exceeds 0.75 for the whole game. That throws out openings that are too unbalanced in a particular batch of games. This is not necessarily a cure-all.

The goal is for both engines to have a chance to win the game when leaving the book. However, there are the openings where more than likely two engines of similar strength will draw. For the most part, these positions are not a big problem for the CCRL. The CCRL rating list has low resolution, and its focus is on a wide selection of engines.

In the case of individual matches between strong opponents that are less likely to make bad moves, more opening positions have a chance to be "drawish". If the purpose of the matches is to see which engine is stronger, then I agree with HG. Using more "on edge" openings would be informative. But it would take some work to collect these types of openings. Using openings that are too unbalanced is not very informative either.





I fully agree, but the point is you still use the same generic book for both engines. At least when I was there. And the way to vary the openings more is by use of the slider in book options. Where of course you have to be very careful about how far you slide it to the right- too far is big trouble. One other note- in chessbase "game options" you have the choice of the setting "early draw" or "late draw"- tho I am not positive how much difference that makes in the long run. But what you have to understand, none of what you mention would satisfy the point he was trying feebly to make- which is the use of generic books for each engine is not a good idea. He thinks coming out of the opening in an even position is the biggest cause of too many draws. And that still remains what we are all striving for. No matter what we do, there are always going to be cases where there were way too many draws for our liking in some matches. It's the nature of the beast. It's not a perfect world.



Best,

george
hgm
Posts: 28446
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Notice- Some EXTREMELY Interesting Stockfish Results!

Post by hgm »

geots wrote:Problem is- or maybe it is not a problem- but I am not interested in the least as to what you think about my testing methods.
Try this with CCRL and CEGT and see how much attention they pay to you. I can already tell you. It is highly unlikely they will bother to answer.
Well, you don't have a reputation for paying attention to anyone, and yes, this is a problem, but mainly for you.
One last time- these draws were not the norm. Very seldom does it happen with this many as you refer to. But there are exceptions from time to time and you live with them.
If you had paid attention to others you would have seen that this sharp increase in draw rates was reported several times here, for tests of these super-strong engines at long TC.
But to have what you want- all the testers would have to use "own books" with the engines. Because with generic books, the idea as I said is to come out of the opening in exactly the opposite condition as you think is better- which is as close to dead even as possible.
There is no logical basis for this statement. It is totally trivial to make a generic book where every engine comes out of the opening a Queen ahead. There are equal generic books and there are wild generic books. And any book, irrespective of content, can be used as GUI book servicing all engines. This is just another remark that falls in the category "I, George Speight, have never done this, so it must be impossible"...
I have discussed this with you as far as I am willing to go. Think what you like- it's a free world. Just don't expect any of the other testers to listen to you as much as I have been willing to. But it is time to close this discussion from my end.

Let me just say this in closing, because you started it.


1. Marco and I have talked. He is happy with my work and my help.

2. I beta tested Komodo 5 for Don and am beta testing Komodo 6 as we speak. He is satisfied with my work.

3. I am testing Strelka 5.6 for Yuri. The only one he has allowed to have it.

4. I am helping Giancarlo and beta testing Equinox for him. The only one he will allow anywhere near his betas.

5. I am beta testing Djinn and helping my friend Tom Likens. Again, the only one he allows near his betas.

Ask Yuri, Giancarlo and Tom if they would be willing to trade me for another tester.

So- I stay busy like there are not enough hours in the day. But I enjoy it. So you have issues with my testing. Like I said, it's a free world- knock yourself out. Just don't expect me to care.
I think you overlook that this is a forum post, not a communication through PM. So if I give advice on how your methods could be improved and possible shortcomings remedied, it is not primarily directed at you, but mainly at the perhaps 100 other people that read it. Whether you want to benefit from the advice of others is of no importance to anyone but yourself.

But this whole issue was a side track anyway. On the main topic of the thread my comment was "OK, so two games were won which in the other match were lost...". That is not something I would call 'EXTREMELY interesting'. It would not even be something I would call 'interesting'. It is something that just happens.