Deep Blue vs Rybka

bob · Post by **bob** » Tue Sep 14, 2010 7:06 pm

rbarreira wrote:
bob wrote:
Milos wrote:
mhull wrote:The point is that Bob knows how to do it and has the resources to test and perfect the technique.
The only thing I find interesting is whether or not he'd be able to achieve at least 20 Elo per node doubling starting from a 32-node cluster. So that, for example, with 1000 nodes he could get 100 Elo more compared to a 32-node cluster.
It would really surprise me if he could.
I have actually run on machines with up to 64 CPUs (not a cluster, a real shared-memory SMP box). And thru 64, the speedup pretty well matched my linear approximation of

speedup = 1 + (NCPUS - 1) * 0.7

Or, for the 64 CPU case, about 45x faster than one CPU. Not great, but seriously faster, still. Clusters are a different animal. And I play with this from time to time. It's a harder problem and I am not going to try to predict what is possible. NPS will be huge, but speedup is unknown. GCP could probably give a starting point for estimates here since he is doing this already.
I'm not doubting the results of your tests with up to 64 CPUs, but that formula can't go below 70% efficiency even with infinite CPUs, which makes me doubt its application to a high number of CPUs. That formula actually says that beyond 8 CPUs or so, the overhead of adding CPUs almost doesn't increase at all.

Unless there is some reason to believe that efficiency will never get below 70%, which I would find quite surprising.

The formula was derived from the results, not computed independently and then compared to them. So of course I would not expect it to hold beyond 64 unless I one day run on a 128- or 256-CPU SMP box and discover that it does work. Or that it needs tweaking.

But for 1-64 it has been quite accurate. Never perfect. But it gives a reasonable idea...
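For anyone who wants to check the numbers, here is a minimal Python sketch of the linear approximation above. The function names `smp_speedup` and `efficiency` are illustrative only, and the 0.7 slope is the empirical value from the post, fitted only for 1 to 64 CPUs on shared-memory hardware; it says nothing about bigger boxes or clusters.

```python
def smp_speedup(ncpus: int, slope: float = 0.7) -> float:
    """Linear approximation from the post: speedup = 1 + (NCPUS - 1) * 0.7."""
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    return 1.0 + (ncpus - 1) * slope


def efficiency(ncpus: int, slope: float = 0.7) -> float:
    """Speedup divided by CPU count (1.0 would be perfect scaling)."""
    return smp_speedup(ncpus, slope) / ncpus


if __name__ == "__main__":
    # Only the 1-64 range is covered by the measurements mentioned in the post.
    for n in (1, 2, 4, 8, 16, 32, 64):
        print(f"{n:3d} CPUs: speedup {smp_speedup(n):5.1f}x, "
              f"efficiency {efficiency(n):.3f}")
```

The efficiency column drops toward 0.7 but never reaches it, which is exactly the objection quoted above: a straight-line fit cannot show efficiency decaying further, so it should only be trusted inside the range it was fitted on.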

michiguel · Post by **michiguel** » Tue Sep 14, 2010 7:20 pm

bob wrote:
michiguel wrote:
bob wrote:
Milos wrote:
rbarreira wrote:Are you really quibbling about 22-50 elo differences on ratings with +- 16 error margins? That just doesn't make any sense.
Sure 70 elo is much more realistic (add twice the error margin and you'll still be far off 70 elo). Give me a break. Bob is a big authority in the field but he is enormously biased when something of his own is in question!
I simply reported the doubling (really halving) Elo change as part of the hardware vs software debate. Anyone can reproduce that test if they have the time and the interest. Put Crafty in a pool of players, and play heads-up for 30,000 games. Then re-run but only give Crafty 1/2 the time (others get original time). Elo dropped by 70. Do it again so that we are now at 1/4 the time or 4x slower. Elo dropped by 150 total, or 80 for this second "halving".

Just for the record, because I see that mentioning this will become a truth if it is not contested. This second "halving" was obtained with an engine that scored ~10% of the points in that particular pool. There are three problems. First, the error in ELO increases a lot because the difference between 10 and 11% is much bigger than 50-51%. Second, the error bar for 10% is bigger than 50%. Third, the ELO approximation at the tails of the curve may not be accurate anymore.

The difference between 70 and 80 may be just error.
I believe I mentioned that. And that the only solution was to find weaker opponents, which is a gigantic waste of time for anything except to answer this question.

Yes, but I am making sure this caveat is noticed and the reasons why. There were no comments about it before.
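To put rough numbers on the first two caveats, here is a small Python sketch using the standard logistic Elo formula. The helper names are made up for illustration, and the error estimate assumes every game is a plain win or loss (no draws), which overstates the variance for real engine matches; treat it as a way to compare 10% with 50%, not as the actual error bars of the test.

```python
import math


def elo_diff(score: float) -> float:
    """Elo difference implied by an expected score under the logistic model."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_per_point(score: float) -> float:
    """Elo change caused by one extra percentage point of score."""
    return elo_diff(score + 0.01) - elo_diff(score)


def elo_stderr(score: float, games: int) -> float:
    """Delta-method standard error in Elo.

    Assumption: games are independent win/loss trials (draws ignored).
    """
    se_score = math.sqrt(score * (1.0 - score) / games)
    slope = 400.0 / (math.log(10.0) * score * (1.0 - score))
    return slope * se_score


if __name__ == "__main__":
    for s in (0.50, 0.10):
        print(f"score {s:.2f}: {elo_per_point(s):5.2f} Elo per extra point, "
              f"stderr {elo_stderr(s, 30000):.2f} Elo over 30000 games")
```

Under these assumptions one percentage point is worth about 7 Elo near 50% but about 18 Elo near 10%, and the Elo error bar for the same number of games is wider at 10% than at 50%. The third caveat, whether the Elo curve itself is accurate at the tails, is not something a formula like this can check.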

Miguel

Miguel

You have two reasonable alternatives:

(1) run the test and post your results. If they are different from mine, then we can try to figure out why.

(2) be quiet. guessing, thinking, supposing and such have no place in a discussion about _real_ data. And "real" data is all I have ever provided here. As in my previous data about SF 1.8 vs Crafty 23.4...

Don · Post by **Don** » Tue Sep 14, 2010 7:39 pm

bob wrote:
Don wrote:Bob,

I'm not following this too closely any longer. I don't know to what extent you have taken these 2 things into consideration - maybe you already have but if not, here goes:

Crafty gets 100 ELO going from 1 to 4 processors. That is 2 doublings and that means you get 50 ELO per doubling. If you go with MORE processors you get even less ELO per doubling. So the point is that you cannot mix and match any way you want to and call it science. I'm not saying you are doing that as I am only quickly skimming these discussions. So if you talk about nodes per second, number of cores, or speedup per core you have to separate them and make sure you are being scientifically rigid, at least as much as tests like this can permit.
That is not "two doublings". This is, once again, apples and oranges. SMP overhead comes in and this changes things.

I didn't say you were doing anything wrong here, I'm only making the point that we must use great care. And since I don't have time to carefully parse all the posts flooding in right now and answer every point, I just wanted to remind everyone involved of this.

For example, it would be wrong to test a 1 cpu program against a 4 cpu program and say, "even after 2 doublings we only get 100 ELO", but then later say a doubling in hardware is worth a full 60 or 70 ELO without distinguishing what KIND of doubling it was.

At the very beginning I stressed that we should not even be considering MP programs in all of this - it can be worked out later. Keep it simple stupid, the KISS principle. Otherwise it gets terribly confusing. What you should have done is estimate the 1 cpu hardware improvements over 15 years, then the 1 cpu software improvement over 15 years, and left it at that. But I feel that you changed the point of reference in each case to suit whatever point you happened to be trying to make at the time. Whether you did or did not, you made it really confusing.

So let's please keep this real simple and leave out MP completely. Do the 1 core calculation only for old and new programs. THEN we can see how much we get for 6 more cores in 2010.

I cannot help but feel that your argument is really weak when you feel the need now to talk about the inferiority of various SMP machines and now are pushing to invalidate the results of the ratings agencies due to this.

mhull · Post by **mhull** » Tue Sep 14, 2010 7:42 pm

Don wrote:
bob wrote:
Don wrote:Bob,

I'm not following this too closely any longer. I don't know to what extent you have taken these 2 things into consideration - maybe you already have but if not, here goes:

Crafty gets 100 ELO going from 1 to 4 processors. That is 2 doublings and that means you get 50 ELO per doubling. If you go with MORE processors you get even less ELO per doubling. So the point is that you cannot mix and match any way you want to and call it science. I'm not saying you are doing that as I am only quickly skimming these discussions. So if you talk about nodes per second, number of cores, or speedup per core you have to separate them and make sure you are being scientifically rigid, at least as much as tests like this can permit.
That is not "two doublings". This is, once again, apples and oranges. SMP overhead comes in and this changes things.
I didn't say you were doing anything wrong here, I'm only making the point that we must use great care. And since I don't have time to carefully parse all the posts flooding in right now and answer every point, I just wanted to remind everyone involved of this.

For example, it would be wrong to test a 1 cpu program against a 4 cpu program and say, "even after 2 doublings we only get 100 ELO", but then later say a doubling in hardware is worth a full 60 or 70 ELO without distinguishing what KIND of doubling it was.

At the very beginning I stressed that we should not even be considering MP programs in all of this - it can be worked out later. Keep it simple stupid, the KISS principle. Otherwise it gets terribly confusing. What you should have done is estimate the 1 cpu hardware improvements over 15 years, then the 1 cpu software improvement over 15 years, and left it at that. But I feel that you changed the point of reference in each case to suit whatever point you happened to be trying to make at the time. Whether you did or did not, you made it really confusing.

So let's please keep this real simple and leave out MP completely. Do the 1 core calculation only for old and new programs. THEN we can see how much we get for 6 more cores in 2010.

I cannot help but feel that your argument is really weak when you feel the need now to talk about the inferiority of various SMP machines and now are pushing to invalidate the results of the ratings agencies due to this.

So, in 1997, we (in principle) should have been comparing Deep Blue on one node against the competition?

mhull · Post by **mhull** » Tue Sep 14, 2010 7:46 pm

bob wrote:
Uri Blass wrote:
bob wrote:The tournaments don't count "in toto". The Fredkin stage 2 prize required a 2550+ rating over 24 consecutive games against GM players only. However, if you look at old USCF rating reports, You can find a 2551 rating in 1988, although I don't have 'em for the entire year.
In this case maybe Deep thought performed better against GM's relative to other players.

I do not remember that deep thought played against 24 GM's in 1988
so it is going to be nice to have a list of GM's who played against it in 1988.
It had to, as that was a requirement. And IIRC some of those games were not rated, because USCF had rules about "matches" between computers and humans, because of the commercial bullshit from years before where they would pay someone to play an "arranged match" to give them a high rating for their advertising...

I do not remember the names of all the GMs. I do remember it playing Byrne, Browne and Spraggett, but I suspect this must be recorded somewhere in the announcement of their winning the prize.

Here's another brief summary of the Fredkin Prize stages:
http://chessprogramming.wikispaces.com/ ... in%20Prize

Don · Post by **Don** » Tue Sep 14, 2010 7:56 pm

mhull wrote:
Don wrote:
bob wrote:
Don wrote:Bob,

I'm not following this too closely any longer. I don't know to what extent you have taken these 2 things into consideration - maybe you already have but if not, here goes:

Crafty gets 100 ELO going from 1 to 4 processors. That is 2 doublings and that means you get 50 ELO per doubling. If you go with MORE processors you get even less ELO per doubling. So the point is that you cannot mix and match any way you want to and call it science. I'm not saying you are doing that as I am only quickly skimming these discussions. So if you talk about nodes per second, number of cores, or speedup per core you have to separate them and make sure you are being scientifically rigid, at least as much as tests like this can permit.
That is not "two doublings". This is, once again, apples and oranges. SMP overhead comes in and this changes things.
I didn't say you were doing anything wrong here, I'm only making the point that we must use great care. And since I don't have time to carefully parse all the posts flooding in right now and answer every point, I just wanted to remind everyone involved of this.

For example, it would be wrong to test a 1 cpu program against a 4 cpu program and say, "even after 2 doublings we only get 100 ELO", but then later say a doubling in hardware is worth a full 60 or 70 ELO without distinguishing what KIND of doubling it was.

At the very beginning I stressed that we should not even be considering MP programs in all of this - it can be worked out later. Keep it simple stupid, the KISS principle. Otherwise it gets terribly confusing. What you should have done is estimate the 1 cpu hardware improvements over 15 years, then the 1 cpu software improvement over 15 years, and left it at that. But I feel that you changed the point of reference in each case to suit whatever point you happened to be trying to make at the time. Whether you did or did not, you made it really confusing.

So let's please keep this real simple and leave out MP completely. Do the 1 core calculation only for old and new programs. THEN we can see how much we get for 6 more cores in 2010.

I cannot help but feel that your argument is really weak when you feel the need now to talk about the inferiority of various SMP machines and now are pushing to invalidate the results of the ratings agencies due to this.
So, in 1997, we (in principle) should have been comparing Deep Blue on one node against the competition?

Look at what I wrote. What I advocate is not ignoring MP, but to break the problem into more easily resolvable issues first. It's a lot easier to ignore MP and then factor it in later. We get a firm number we all agree with (yeah, right) and then we can attribute 50 ELO per SMP doubling to hardware (after arguing about it for a couple of days first.)

<cynical talk>The only problem with doing that is that it is too simple, and it makes it more difficult to construct biased experiments.</cynical talk>

mhull · Post by **mhull** » Tue Sep 14, 2010 8:04 pm

Don wrote: <cynical talk>The only problem with doing that is that it is too simple, and it makes it more difficult to construct biased experiments.</cynical talk>

In your posts, cynical tags would be a superfluity.

Ralph Stoesser · Post by **Ralph Stoesser** » Tue Sep 14, 2010 9:01 pm

Milos wrote:
mhull wrote:Bob's tests reduce the error margin by playing more games with fewer unknown variables.

I'm not saying you're wrong by definition, I'm just saying yours is an opinion based on weaker data.
Bob's tens of thousands of games have nothing to do with accuracy. Simply his testing methodology is faulty. He could play a million games and his results would still be inaccurate.
I understand some ppl get easily impressed by this, but when different test methodologies, opening books etc., all agree to within 10-15 elo (with 15-20 elo error margin) and Bob's results are off by almost 100 elo with his 4 elo margin, everything you say is just grasping at straws.
Bob has a systematic error in his testing methodology which he (and some other ppl) are not willing to admit.
I hope you remember The Emperor's New Clothes tale...

Show us how Crafty performs against Fruit and Glaurung family engines in "official" rating lists. If you are right, you should find numbers similar to Mr. Hyatt's cluster test results.

bob · Post by **bob** » Tue Sep 14, 2010 9:11 pm

Don wrote:
bob wrote:
Don wrote:Bob,

I'm not following this too closely any longer. I don't know to what extent you have taken these 2 things into consideration - maybe you already have but if not, here goes:

Crafty gets 100 ELO going from 1 to 4 processors. That is 2 doublings and that means you get 50 ELO per doubling. If you go with MORE processors you get even less ELO per doubling. So the point is that you cannot mix and match any way you want to and call it science. I'm not saying you are doing that as I am only quickly skimming these discussions. So if you talk about nodes per second, number of cores, or speedup per core you have to separate them and make sure you are being scientifically rigid, at least as much as tests like this can permit.
That is not "two doublings". This is, once again, apples and oranges. SMP overhead comes in and this changes things.
I didn't say you were doing anything wrong here, I'm only making the point that we must use great care. And since I don't have time to carefully parse all the posts flooding in right now and answer every point, I just wanted to remind everyone involved of this.

For example, it would be wrong to test a 1 cpu program against a 4 cpu program and say, "even after 2 doublings we only get 100 ELO", but then later say a doubling in hardware is worth a full 60 or 70 ELO without distinguishing what KIND of doubling it was.

I agree completely. And have tried to be quite clear. If you talk about pure hardware speed, a 6 core is close to 6x faster than a single core at the same speed. A lot depends on memory and such and also cache (two 4-core chips might be better than one 8-core for example, since you suddenly have twice the paths to memory, if memory has been designed with interleaving or some such). If you talk about program speed, then I try to use (for SMP discussions) the actual SMP speedup numbers, which are always less than the number of cores, on average...

At the very beginning I stressed that we should not even be considering MP programs in all of this - it can be worked out later. Keep it simple stupid, the KISS principle. Otherwise it gets terribly confusing. What you should have done is estimate the 1 cpu hardware improvements over 15 years, then the 1 cpu software improvement over 15 years, and left it at that. But I feel that you changed the point of reference in each case to suit whatever point you happened to be trying to make at the time. Whether you did or did not, you made it really confusing.

If you have noticed, _all_ of my testing has been single-cpu here. I do lots of SMP testing, but never in this discussion, and primarily only do that when I want to test SMP changes, since it slows testing down by 8x.

My "Crafty" numbers could be written two different ways. Clearly if you run Crafty on a single-cpu I7, it will run at N nodes per second. Just as clearly, a 6-core (or dual-6-core) i7 would be 6 or 12 times faster in overall computing power. And less in terms of chess-playing skill. Which number to use? I use Crafty's speedup numbers since they have been verified dozens of times, by many different people. But one could argue that since Crafty's NPS would actually be 6x or 12x faster on the above hardware, that is a software shortcoming, and the hardware should get full credit. If we can't use it effectively, is that the engineer's fault?

But for my summary of results, I certainly used the effective speedup number, which was about 1024, as opposed to the theoretical max, which was 1500x.

So let's please keep this real simple and leave out MP completely. Do the 1 core calculation only for old and new programs. THEN we can see how much we get for 6 more cores in 2010.

Already did that. If you want to remove the 4.5x speedup for 6 cores, we go back to 250x faster. Which is 8 doublings, or something in the range of 560 Elo assuming just 70 per doubling. But there is definitely another two doublings without being optimistic, because we have such 6-core boxes already and have the performance data showing 4.5x (actually closer to 5.0 for a couple of hundred test positions, and not tactical positions that are well-suited).
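The arithmetic in that paragraph can be reproduced with a few lines of Python. This is only a sketch: the 250x and 4.5x ratios and the 70 Elo per doubling are the figures quoted in the post, the function names are invented for illustration, and a constant Elo gain per doubling is itself an assumption that is being argued about in this thread.

```python
import math

ELO_PER_DOUBLING = 70.0  # figure used in the post, not a measured constant


def doublings(speed_ratio: float) -> float:
    """Number of speed doublings contained in a speed ratio."""
    if speed_ratio <= 0:
        raise ValueError("speed_ratio must be positive")
    return math.log2(speed_ratio)


def elo_gain(speed_ratio: float,
             elo_per_doubling: float = ELO_PER_DOUBLING) -> float:
    """Elo gain assuming a constant gain per doubling (an assumption)."""
    return doublings(speed_ratio) * elo_per_doubling


if __name__ == "__main__":
    single_core = 250.0   # single-CPU hardware speedup quoted in the post
    smp_six_cores = 4.5   # effective 6-core SMP speedup quoted in the post
    print(f"single core: {doublings(single_core):.2f} doublings, "
          f"{elo_gain(single_core):.0f} Elo")
    print(f"6-core SMP : {doublings(smp_six_cores):.2f} extra doublings")
```

log2(250) is just under 8, so "8 doublings" and "in the range of 560 Elo" both check out, and 4.5x is a little over two further doublings. The 4.5x figure is also what the linear SMP approximation 1 + (NCPUS - 1) * 0.7 gives for 6 cores.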

I cannot help but feel that your argument is really weak when you feel the need now to talk about the inferiority of various SMP machines and now are pushing to invalidate the results of the ratings agencies due to this.

Where am I pushing to invalidate anything? Someone questioned why my results are off by 50 or whatever compared to others. Simple questions:

What about the book? Do you believe it matters? I believe a good book is worth 100 Elo, from many years of doing this stuff. I don't release our tournament book, period, we keep it for tournaments.

What about learning? I can't count the number of mistakes I have made due to book and position learning. The first version looks bad, the new version looks better. Not because it is better, but because of the learning from the first version.

What about variable loads on a test machine? Run a short query while A is thinking and it gets hurt. Run it while B is running and B gets hurt. I don't have that problem at all.

I don't mix and match and change opponents all the time. I try to keep the same opponents, the same positions, the same hardware, the exact same time control, the same everything. And no books or other outside influences.

And exactly what "inferiority of various SMP machines" have I started to talk about? No idea what that means...

Don · Post by **Don** » Tue Sep 14, 2010 9:11 pm

Bob,

I tried the link you gave me and it's not working. Is it working for anyone else or is there something wrong with my connection?
