Deep Blue vs Rybka

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: Deep Blue vs Rybka

Post by rbarreira »

Milos wrote:
bob wrote:Current tests show Crafty is 200 Elo below Stockfish. Our clusters have a total of over 750 nodes. A doubling is 70 Elo. Do you not think that, even _conservatively_, 750 nodes would give 8x the performance, worst case? Work on that math a bit and think before posting. :)
I was talking about SF on an i7; that's 6 real cores, each of them twice the strength of your cluster node.

It's 200 elo in your measurements. All other "official" lists show more than 250. Sorry, but in this case I really don't believe your 200.
You think you can gain 250 elo with 60 times more computing power (6 doublings)??? LOL
You can dream of 70 elo.
Going from 1 to 4 cores Crafty 23.2 gains 95 elo (CCRL data at 40/40, huuuge error margins; realistically the gain is much smaller).
Going from 2 to 4 cores Crafty 23.0 gains only 22 elo (CCRL data at 40/4, much smaller error margins, more realistic data).
Going from, for example, 256 to 512 cores, Crafty 23.3 would not gain more than 20 elo in the best case.
Be realistic, we are not kids.
Are you really quibbling about 22-50 elo differences on ratings with ±16 error margins? That just doesn't make any sense.
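For anyone who wants to sanity-check the arithmetic being argued here, a minimal Python sketch of the constant-gain-per-doubling model; the 70-Elo and 20-Elo per-doubling figures, the 60x factor and the 8x "worst case" speedup are all taken from the posts above as assumptions, not measured here:

Code:

    import math

    def elo_gain(speed_ratio, elo_per_doubling=70):
        """Naive model: Elo gain = (Elo per doubling) * log2(speed ratio).
        Assumes a constant gain per doubling, which is exactly the point
        in dispute (diminishing returns vs. the measured 70-80 Elo)."""
        return elo_per_doubling * math.log2(speed_ratio)

    print(elo_gain(60))       # ~413 Elo for 60x power at 70 per doubling
    print(elo_gain(60, 20))   # ~118 Elo at a pessimistic 20 per doubling
    print(elo_gain(8))        # 210 Elo for an 8x "worst case" speedup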
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Deep Blue vs Rybka

Post by Milos »

rbarreira wrote:Are you really quibbling about 22-50 elo differences on ratings with ±16 error margins? That just doesn't make any sense.
Sure, 70 elo is much more realistic (add twice the error margin and you'll still be far from 70 elo). Give me a break. Bob is a big authority in the field, but he is enormously biased when something of his own is in question!
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Deep Blue vs Rybka

Post by Don »

Milos wrote:
bob wrote:Current tests show Crafty is 200 Elo below Stockfish. Our clusters have a total of over 750 nodes. A doubling is 70 Elo. Do you not think that, even _conservatively_, 750 nodes would give 8x the performance, worst case? Work on that math a bit and think before posting. :)
I was talking about SF on an i7; that's 6 real cores, each of them twice the strength of your cluster node.

It's 200 elo in your measurements. All other "official" lists show more than 250. Sorry, but in this case I really don't believe your 200.
You think you can gain 250 elo with 60 times more computing power (6 doublings)??? LOL
You can dream of 70 elo.
Going from 1 to 4 cores Crafty 23.2 gains 95 elo (CCRL data at 40/40, huuuge error margins; realistically the gain is much smaller).
Going from 2 to 4 cores Crafty 23.0 gains only 22 elo (CCRL data at 40/4, much smaller error margins, more realistic data).
Going from, for example, 256 to 512 cores, Crafty 23.3 would not gain more than 20 elo in the best case.
Be realistic, we are not kids.
Uri Blass
Posts: 10282
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Deep Blue vs Rybka

Post by Uri Blass »

bob wrote:
Uri Blass wrote:
bob wrote:
Uri Blass wrote:
bob wrote:
Don wrote:
Gerd Isenberg wrote:
bob wrote:
uaf wrote:
bob wrote: I recall them losing one game due to a power outage, and one game due to comm problems (Fritz in Hong Kong), which is an incredible streak over 11 years.
And IIRC it was Deep Thought II that lost to Fritz, not Deep Blue as always advertised by ChessBase. Deep Blue was not yet ready.
Confusion caused by IBM. That was "deep blue prototype", which Hsu/Campbell had said was "deep blue software running on deep thought hardware". So you are correct. I was lumping them all together. Chiptest first played in 1986 with serious bugs. It won the ACM event in 1987 and every year after that, only losing the two games I mentioned to the best of my recollection, one on time due to a power failure at the Watson center, one in Hong Kong primarily caused by a comm failure...
Was the game against Mephisto from ACM 1989 the one with the power failure?
Deep Blue was remarkably strong for 1997, but it was far from unbeatable. Draws and losses were rare, but they did happen. I think we can estimate that it was about 4-5 years ahead of the PC programs. By 2002 a lot of very smart people believed that Junior or Fritz would beat it in a match. No point arguing about it because we can never know for sure.

I read somewhere (and I'll try to find it) that if you consider the various incarnations of Deep Blue that actually played in tournaments, and performance-rate its total results, it is not particularly impressive, because it only indicates something like a 200 Elo superiority over the best. But I think a lot of the games it lost were due to unfortunate issues, so this is probably far from a fair metric (also considering that so few games were played). I think in reality it was stronger than this. A crude calculation is that if it took programs 5 years to catch it, you can guesstimate its superiority, and I think that puts it at more like 400 Elo better than anyone else.

The Deep Blue team was very humble and a joy to talk with. At the Hong Kong tournament Murray told me that they estimated their winning chances to be right around 50%. That sounds incredible at first, unless you do the math. To survive a 5-round tournament with 24 players and have a 50% chance of being the winner, you must not only be the best player, but the best by a good margin. If their chances were 50%, the chances of the 23 other contestants were divided up among the remaining 50%, so that is pretty impressive.

But this tells you that even the Deep Blue team expected to lose games relatively frequently, just much less frequently than anyone else! When it's all soberly analyzed and all the hype removed, Deep Blue stands out as the most outstanding program of its day, but no more. (I am not sure if some early programs stand out even more, such as Belle, or even earlier the Chess 4.7 program; they were also seemingly unbeatable, so this deserves a fact check.)
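A rough way to see why a ~50% chance of winning a 5-round event implies a large rating edge is the toy calculation below: treat the event as five independent games against a field of some average rating (the 2500 figure is only a placeholder), ignore draws, pairings and tie-breaks, and ask how often the favourite scores 5/5 or at least 4/5:

Code:

    def expected_score(elo_a, elo_b):
        """Standard Elo expected score of player A against player B."""
        return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))

    field_elo = 2500  # placeholder average strength of the opposition
    for edge in (100, 200, 300, 400):
        p = expected_score(field_elo + edge, field_elo)  # per-game expectation
        p_5_of_5 = p ** 5                                # win every game
        p_4_plus = p ** 5 + 5 * p ** 4 * (1 - p)         # at least 4 wins of 5
        print(f"edge {edge}: P(5/5)={p_5_of_5:.2f}  P(>=4/5)={p_4_plus:.2f}")

Even in this crude model, a 50% chance of the near-perfect score usually needed for clear first corresponds to an edge of a few hundred Elo points over the field.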
If you go back to 1997 when they won the Fredkin prize, they had a FIDE-equivalent rating of 2650+. I don't remember the exact number, but it was _well_ beyond the Fredkin prize requirement... What micro was close to that in long games? A couple of micros had beaten GM players in blitz (Cray Blitz defeated GM players all over the place in the 90's, as a reference). So they were very strong, and based on Deep Thought vs everyone else through 1994 in ACM-sponsored events, they were clearly well "above and beyond."

How far is debatable. But I would not use that 200 number myself, since we have no data for micros playing super-GM players at 40 moves in 2 hours.
We clearly have data about computers that played humans at the 40/120 time control.

I remember reading that Fritz3 on a P90 could get an IM norm in tournament time-control games, so it is not correct to say that programs did not play long time-control games.

No micro was close to 2650, but I am sure that micros were at least at the 2400 level at that time, so a 200 elo difference between Deep Thought or the Deep Blue prototype and the best micros of the same time is not illogical.

Uri
When did Fritz play in such tournaments in the 1987-1988 time frame? The DB project had a pretty daunting task to win the Fredkin stage 3 prize. And my dates were wrong.

DB produced that 2650+ rating in 1988. Not 1997. 1997 was for the final stage of the Fredkin prize, beating the world champion in a match.

So, more correctly, do you believe Fritz in 1988 was within 200 points of a program that had just earned a rating of 2600+ playing 24 games against only GM-level competition? IMHO, not a chance in hades.... Most micros were jokes in 1988...
Fritz3 did it in 1994 or 1995, based on my memory, and I was not talking about 1988.

I agree that the gap between Deep Thought and the micros was more than 200 elo in 1988.

I also find it hard to believe the 2650 rating of Deep Thought in 1988, because I clearly remember worse results than 2650 for Deep Thought after 1988.
You don't have to "believe" it, you can "confirm" it with a quick Google search. DB won the Fredkin stage 2 prize in 1988. This required a 2550+ rating over 24 consecutive games against only GM-level players. They finished up somewhere in the 2650 area. I don't _ever_ remember worse than 2650 results for DB, unless you pick an event like Hong Kong where they lost one game out of 5.


These worse results include 2 losses against Kasparov in 1989 and a loss against Karpov in 1990 (I expect a 2650 player to score 0.5/2 against Kasparov, and I also remember that the estimate in the newspaper for Deep Thought's level before the games against Kasparov was 2550, not 2650).
These worse results include a tournament in 1991 in which Deep Thought got a performance only slightly above 2400, scoring 2.5/7 against GM opponents rated 2480-2560.

I remember Deep Thought had a good tournament when it scored 6.5/8 and won first place in 1988, but I do not remember other tournaments in which Deep Thought got a performance above 2600.

I would like to see the list of 24 opponents, together with their ratings, the results, and the dates of the games, to understand the basis for the claim that DT got a performance higher than 2650 in 1988.
Look up the Fredkin prize results. It was discussed at length back then and was quite convincing. No micro 5 years later could approach that. Maybe by 2000 it was barely becoming possible...
I can find that DT got the Fredkin prize, but I do not see a performance of 2650.

http://www.aaai.org/ojs/index.php/aimag ... ew/753/671

They had one good tournament in which they scored 6.5/8 against an average rating of 2492.

The tournaments that they played against humans in 1988 are:

1) May 28–30, DT tied for second in a field of over 20 masters and ahead of three other computers.

2) In August at the U.S. Open, DT scored 8.5–3.5 to tie for eighteenth place with Arnold Denker among others.

3) In the American Open tournament in Los Angeles in October, DT scored a modest 4.5–1.5.

4) However, three weeks later at the U.S. Class Championships in New Jersey, DT had an impressive 5–1, beating two IMs (Bonin and Zlotnikov).

5) In early November, DT won its first major tournament, scoring 4.5–0.5. During the tournament, it beat another IM (Blocker) and drew with co-winner IM Igor Ivanov.

6) Later, DT achieved the greatest computer success to date. It tied for first with GM Tony Miles in the prestigious Software Toolworks Open in Los Angeles with a score of 6.5–1.5.

Unfortunately I cannot find the exact results, and I believe that they performed above 2550, but not above 2650 except in a single tournament.

They got a USCF rating of 2551, and I believe that they would have gotten a higher rating if they had performed above 2650 over 25 games.
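For reference, the usual back-of-the-envelope performance-rating formula (average opponent rating plus 400*(W-L)/N) applied to the results quoted in this post. This is only the standard linear approximation, not the exact USCF/FIDE computation, and the 2520 figure below is just an assumed midpoint of the 2480-2560 range mentioned earlier:

Code:

    def performance_rating(avg_opp, score, games):
        """Linear approximation: avg opponent rating + 400 * (W - L) / N,
        where W - L = 2*score - games when draws count as half a point."""
        return avg_opp + 400.0 * (2 * score - games) / games

    # Software Toolworks Open: 6.5/8 against a 2492 average
    print(performance_rating(2492, 6.5, 8))   # ~2742

    # The 1991 event mentioned earlier: 2.5/7 against ~2520 opposition
    print(performance_rating(2520, 2.5, 7))   # ~2406, i.e. "slightly above 2400"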
mhull
Posts: 13447
Joined: Wed Mar 08, 2006 9:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

Re: Deep Blue vs Rybka

Post by mhull »

Milos wrote:
bob wrote:Current tests show Crafty is 200 Elo below Stockfish. Our clusters have a total of over 750 nodes. A doubling is 70 Elo. Do you not think that, even _conservatively_, 750 nodes would give 8x the performance, worst case? Work on that math a bit and think before posting. :)
I was talking about SF on an i7; that's 6 real cores, each of them twice the strength of your cluster node.

It's 200 elo in your measurements. All other "official" lists show more than 250. Sorry, but in this case I really don't believe your 200.
So you believe lists with larger error margins than Bob's tests are more accurate. And you think faster i7s increase the Elo delta between Crafty and SF. That's some opinion you've got there.
Matthew Hull
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Deep Blue vs Rybka

Post by bob »

Uri Blass wrote:
bob wrote:
Uri Blass wrote:
bob wrote:
Uri Blass wrote:
bob wrote:
Don wrote:
Gerd Isenberg wrote:
bob wrote:
uaf wrote:
bob wrote: I recall them losing one game due to a power outage, and one game due to comm problems (Fritz in Hong Kong), which is an incredible streak over 11 years.
And IIRC it was Deep Thought II that lost to Fritz, not Deep Blue as always advertised by ChessBase. Deep Blue was not yet ready.
Confusion caused by IBM. That was "deep blue prototype", which Hsu/Campbell had said was "deep blue software running on deep thought hardware". So you are correct. I was lumping them all together. Chiptest first played in 1986 with serious bugs. It won the ACM event in 1987 and every year after that, only losing the two games I mentioned to the best of my recollection, one on time due to a power failure at the Watson center, one in Hong Kong primarily caused by a comm failure...
Was the game against Mephisto from ACM 1989 the one with the power failure?
Deep Blue was remarkably strong for 1997, but it was far from unbeatable. Draws and losses were rare, but they did happen. I think we can estimate that it was about 4-5 years ahead of the PC programs. By 2002 a lot of very smart people believed that Junior or Fritz would beat it in a match. No point arguing about it because we can never know for sure.

I read somewhere (and I'll try to find it) that if you consider the various incarnations of Deep Blue that actually played in tournaments, and performance-rate its total results, it is not particularly impressive, because it only indicates something like a 200 Elo superiority over the best. But I think a lot of the games it lost were due to unfortunate issues, so this is probably far from a fair metric (also considering that so few games were played). I think in reality it was stronger than this. A crude calculation is that if it took programs 5 years to catch it, you can guesstimate its superiority, and I think that puts it at more like 400 Elo better than anyone else.

The Deep Blue team was very humble and a joy to talk with. At the Hong Kong tournament Murray told me that they estimated their winning chances to be right around 50%. That sounds incredible at first, unless you do the math. To survive a 5-round tournament with 24 players and have a 50% chance of being the winner, you must not only be the best player, but the best by a good margin. If their chances were 50%, the chances of the 23 other contestants were divided up among the remaining 50%, so that is pretty impressive.

But this tells you that even the Deep Blue team expected to lose games relatively frequently, just much less frequently than anyone else! When it's all soberly analyzed and all the hype removed, Deep Blue stands out as the most outstanding program of its day, but no more. (I am not sure if some early programs stand out even more, such as Belle, or even earlier the Chess 4.7 program; they were also seemingly unbeatable, so this deserves a fact check.)
If you go back to 1997 when they won the Fredkin prize, they had a FIDE-equivalent rating of 2650+. I don't remember the exact number, but it was _well_ beyond the Fredkin prize requirement... What micro was close to that in long games? A couple of micros had beaten GM players in blitz (Cray Blitz defeated GM players all over the place in the 90's, as a reference). So they were very strong, and based on Deep Thought vs everyone else through 1994 in ACM-sponsored events, they were clearly well "above and beyond."

How far is debatable. But I would not use that 200 number myself, since we have no data for micros playing super-GM players at 40 moves in 2 hours.
We clearly have data about computers that played humans at the 40/120 time control.

I remember reading that Fritz3 on a P90 could get an IM norm in tournament time-control games, so it is not correct to say that programs did not play long time-control games.

No micro was close to 2650, but I am sure that micros were at least at the 2400 level at that time, so a 200 elo difference between Deep Thought or the Deep Blue prototype and the best micros of the same time is not illogical.

Uri
When did Fritz play in such tournaments in the 1987-1988 time frame? The DB project had a pretty daunting task to win the Fredkin stage 3 prize. And my dates were wrong.

DB produced that 2650+ rating in 1988. Not 1997. 1997 was for the final stage of the Fredkin prize, beating the world champion in a match.

So, more correctly, do you believe Fritz in 1988 was within 200 points of a program that had just earned a rating of 2600+ playing 24 games against only GM-level competition? IMHO, not a chance in hades.... Most micros were jokes in 1988...
Fritz3 did it in 1994 or 1995, based on my memory, and I was not talking about 1988.

I agree that the gap between Deep Thought and the micros was more than 200 elo in 1988.

I also find it hard to believe the 2650 rating of Deep Thought in 1988, because I clearly remember worse results than 2650 for Deep Thought after 1988.
You don't have to "believe" it, you can "confirm" it with a quick Google search. DB won the Fredkin stage 2 prize in 1988. This required a 2550+ rating over 24 consecutive games against only GM-level players. They finished up somewhere in the 2650 area. I don't _ever_ remember worse than 2650 results for DB, unless you pick an event like Hong Kong where they lost one game out of 5.


These worse results include 2 losses against Kasparov in 1989 and a loss against Karpov in 1990 (I expect a 2650 player to score 0.5/2 against Kasparov, and I also remember that the estimate in the newspaper for Deep Thought's level before the games against Kasparov was 2550, not 2650).
These worse results include a tournament in 1991 in which Deep Thought got a performance only slightly above 2400, scoring 2.5/7 against GM opponents rated 2480-2560.

I remember Deep Thought had a good tournament when it scored 6.5/8 and won first place in 1988, but I do not remember other tournaments in which Deep Thought got a performance above 2600.

I would like to see the list of 24 opponents, together with their ratings, the results, and the dates of the games, to understand the basis for the claim that DT got a performance higher than 2650 in 1988.
Look up the Fredkin prize results. It was discussed at length back then and was quite convincing. No micro 5 years later could approach that. Maybe by 2000 it was barely becoming possible...
I can find that DT got the Fredkin prize, but I do not see a performance of 2650.

http://www.aaai.org/ojs/index.php/aimag ... ew/753/671

They had one good tournament in which they scored 6.5/8 against an average rating of 2492.

The tournaments that they played against humans in 1988 are:

1) May 28–30, DT tied for second in a field of over 20 masters and ahead of three other computers.

2) In August at the U.S. Open, DT scored 8.5–3.5 to tie for eighteenth place with Arnold Denker among others.

3) In the American Open tournament in Los Angeles in October, DT scored a modest 4.5–1.5.

4) However, three weeks later at the U.S. Class Championships in New Jersey, DT had an impressive 5–1, beating two IMs (Bonin and Zlotnikov).

5) In early November, DT won its first major tournament, scoring 4.5–0.5. During the tournament, it beat another IM (Blocker) and drew with co-winner IM Igor Ivanov.

6) Later, DT achieved the greatest computer success to date. It tied for first with GM Tony Miles in the prestigious Software Toolworks Open in Los Angeles with a score of 6.5–1.5.

Unfortunately I cannot find the exact results, and I believe that they performed above 2550, but not above 2650 except in a single tournament.

They got a USCF rating of 2551, and I believe that they would have gotten a higher rating if they had performed above 2650 over 25 games.
The tournaments don't count "in toto". The Fredkin stage 2 prize required a 2550+ rating over 24 consecutive games against GM players only. However, if you look at old USCF rating reports, you can find a 2551 rating in 1988, although I don't have 'em for the entire year.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Deep Blue vs Rybka

Post by bob »

Milos wrote:
bob wrote:Current tests show Crafty is 200 Elo below Stockfish. Our clusters have a total of over 750 nodes. A doubling is 70 Elo. Do you not think that, even _conservatively_, 750 nodes would give 8x the performance, worst case? Work on that math a bit and think before posting. :)
I was talking about SF on an i7; that's 6 real cores, each of them twice the strength of your cluster node.

It's 200 elo in your measurements. All other "official" lists show more than 250. Sorry, but in this case I really don't believe your 200.
Totally up to you as to what you believe. As far as results go, here's at least one to chew on, as I am running a calibration match right now to replace Stockfish 1.6 with the latest 1.8.

Code:

    Engine                Elo    +    -  games score  oppo. draws
    Stockfish 1.8 64bit  2878    3    3 56621   83%  2606   18% 
    Crafty-23.4-1        2672    4    4 30000   61%  2582   20% 
    Crafty-23.4R01-1     2669    4    4 30000   61%  2582   21% 
The R01 version is one with a simple change that really doesn't make any difference.

You can do the subtraction to see the difference between the two programs.

So, whether you believe my numbers or not doesn't really matter. They are what they are. However, as I said, my testing is done a bit differently. Equal hardware. No parallel search (from significant testing, Crafty will pick up 20+ elo over Stockfish on an 8-core platform). No opening book (probably significant, as we have never released a customized book at all, and we simply play from 3000 equal starting positions, spread across _all_ popular openings being played by IM/GM players).
You think you can gain 250 elo with 60 times more computing power (6 doublings)??? LOL
You can dream of 70 elo.
I tend to not dream, and actually _measure_. I just showed results yesterday, here, that cutting the speed by 1/2 drops Elo by 70. Cutting it by 1/2 again drops it by 80. If you can't handle the math, that would appear to be _your_ problem, not mine.

Going from 1 to 4 cores Crafty 23.2 gains 95 elo (CCRL data at 40/40, huuuge error margins; realistically the gain is much smaller).

Seems within reason. 4 cores = 3.1-3.3 times faster. So I guess I fail to see your point, unless you are simply trying to show how little you understand about this process???
Going from 2 to 4 cores Crafty 23.0 gains only 22 elo (CCRL data at 40/4, much smaller error margins, more realistic data).
Going from, for example, 256 to 512 cores, Crafty 23.3 would not gain more than 20 elo in the best case.
Be realistic, we are not kids.

I have no idea what you are talking about. But even if that were true, which it probably is not: if one could gain +20 Elo every time the number of processors is doubled, that would produce a difficult-to-beat machine, given that there are 64K-node machines around. That is only 16 doublings, and at "only" 20 Elo per doubling, that is +320. And for processor counts below 64, +20 per doubling is an underestimate.
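To make that arithmetic concrete, a small sketch that accumulates Elo over doublings under two assumptions: a flat +20 per doubling (the best-case figure quoted above), and a schedule that starts near the measured 70-80 per doubling and decays. The 15% decay rate is purely illustrative, not a measured value:

Code:

    def cumulative_elo(doublings, first_gain, decay=1.0):
        """Sum of per-doubling gains; decay < 1 models diminishing returns."""
        return sum(first_gain * decay ** i for i in range(doublings))

    # Flat +20 per doubling over 16 doublings (1 -> 64K processors)
    print(cumulative_elo(16, 20))                      # 320.0

    # Start at 70 per doubling, shrink 15% each doubling (illustrative)
    print(round(cumulative_elo(16, 70, decay=0.85)))   # ~432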

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Deep Blue vs Rybka

Post by bob »

Milos wrote:
rbarreira wrote:Are you really quibbling about 22-50 elo differences on ratings with ±16 error margins? That just doesn't make any sense.
Sure, 70 elo is much more realistic (add twice the error margin and you'll still be far from 70 elo). Give me a break. Bob is a big authority in the field, but he is enormously biased when something of his own is in question!
I simply reported the doubling (really halving) Elo change as part of the hardware vs software debate. Anyone can reproduce that test if they have the time and the interest. Put Crafty in a pool of players, and play heads-up for 30,000 games. Then re-run, but only give Crafty 1/2 the time (the others get the original time). Elo dropped by 70. Do it again so that we are now at 1/4 the time, or 4x slower. Elo dropped by 150 total, or 80 for this second "halving".

You have two reasonable alternatives:

(1) Run the test and post your results. If they are different from mine, then we can try to figure out why.

(2) Be quiet. Guessing, thinking, supposing and such have no place in a discussion about _real_ data. And "real" data is all I have ever provided here, as in my previous data about SF 1.8 vs Crafty 23.4...
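For anyone who wants to reproduce that kind of comparison, the post-processing step is just the logistic Elo formula plus an error bar on the match score. A minimal sketch; the 30,000-game win/draw/loss split below is made up purely to show the calculation, not a result from any actual test:

Code:

    import math

    def elo_and_margin(wins, draws, losses, z=1.96):
        """Elo difference implied by a match score, with an approximate
        95% confidence interval from the normal approximation."""
        n = wins + draws + losses
        score = (wins + 0.5 * draws) / n

        def to_elo(p):
            return -400.0 * math.log10(1.0 / p - 1.0)

        # per-game variance of the score, then standard error of the mean
        var = (wins * (1.0 - score) ** 2 + draws * (0.5 - score) ** 2
               + losses * (0.0 - score) ** 2) / n
        se = math.sqrt(var / n)
        lo = to_elo(max(score - z * se, 1e-6))
        hi = to_elo(min(score + z * se, 1 - 1e-6))
        return to_elo(score), lo, hi

    # Made-up example: 30,000 games, 55% score for the full-time version
    print(elo_and_margin(wins=12000, draws=9000, losses=9000))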
Uri Blass
Posts: 10282
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Deep Blue vs Rybka

Post by Uri Blass »

bob wrote:The tournaments don't count "in toto". The Fredkin stage 2 prize required a 2550+ rating over 24 consecutive games against GM players only. However, if you look at old USCF rating reports, you can find a 2551 rating in 1988, although I don't have 'em for the entire year.
In this case maybe Deep Thought performed better against GMs than against other players.

I do not remember that Deep Thought played against 24 GMs in 1988, so it would be nice to have a list of the GMs who played against it in 1988.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Deep Blue vs Rybka

Post by bob »

mhull wrote:
Milos wrote:
bob wrote:Current tests show Crafty is 200 Elo below Stockfish. Our clusters have a total of over 750 nodes. A doubling is 70 Elo. Do you not think that, even _conservatively_, 750 nodes would give 8x the performance, worst case? Work on that math a bit and think before posting. :)
I was talking about SF on an i7; that's 6 real cores, each of them twice the strength of your cluster node.

It's 200 elo in your measurements. All other "official" lists show more than 250. Sorry, but in this case I really don't believe your 200.
So you believe lists with larger error margins than Bob's tests are more accurate. And you think faster i7s increase the Elo delta between Crafty and SF. That's some opinion you've got there.
Remember:

"Opinions are like assholes, everybody has one, and nobody wants to look at anyone else's" - unix obscene fortune circa 1985 from Sun Microsystems. :)

Some people juxtapose facts in a way that makes little sense. Do we believe that as the hardware speeds up, SF actually _gains_ on Crafty, when for years everyone has talked about "diminishing returns" as we go deeper, and there is ample evidence that such is actually the case?

Sometimes it is hard to understand what a comment is supposed to mean. The one you quoted is a good example. The good thing about my tests is that _everything_ is constant. Opponents don't change. Hardware doesn't change. No books or learning. Rating lists are fine, but they offer nowhere near the accuracy I need for the tuning I am doing. Ten different testers, using different hardware, various books, unknown learning, unknown superfluous applications running that steal time from the programs, etc. I don't have any of that in my testing environment, other than dealing with unexpected interrupts here and there that _slightly_ alter the timing.