Deep Blue vs Rybka

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Deep Blue vs Rybka

Post by bob »

Uri Blass wrote:
bob wrote:The tournaments don't count "in toto". The Fredkin stage 2 prize required a 2550+ rating over 24 consecutive games against GM players only. However, if you look at old USCF rating reports, you can find a 2551 rating in 1988, although I don't have 'em for the entire year.
In this case maybe Deep Thought performed better against GMs relative to other players.

I do not remember that Deep Thought played against 24 GMs in 1988, so it would be nice to have a list of the GMs who played against it in 1988.
It had to, as that was a requirement. And IIRC some of those games were not rated: USCF had rules about "matches" between computers and humans, a result of the commercial bullshit from years before, where they would pay someone to play an "arranged match" to give them a high rating for their advertising...

I do not remember the names of all the GMs. I do remember it playing Byrne, Browne and Spraggett, but I suspect this must be recorded somewhere in the announcement of their winning the prize.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Deep Blue vs Rybka

Post by bob »

Milos wrote:
mhull wrote:The point is that Bob knows how to do it and has the resources to test and perfect the technique.
The only thing I find interesting is whether or not he'd be able to achieve at least 20 Elo per node doubling starting from a 32-node cluster, so that, for example, with 1000 nodes he could get 100 Elo more than with the 32-node cluster.
It would really surprise me if he could.
I have actually run on machines with up to 64 CPUs (not a cluster, a real shared-memory SMP box). And thru 64, the speedup pretty well matched my linear approximation of

speedup = 1 + (NCPUS - 1) * 0.7

Or, for the 64 CPU case, about 45x faster than one CPU. Not great, but seriously faster, still. Clusters are a different animal. And I play with this from time to time. It's a harder problem and I am not going to try to predict what is possible. NPS will be huge, but speedup is unknown. GCP could probably give a starting point for estimates here since he is doing this already.
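A quick way to see what that linear approximation implies, as a minimal Python sketch (the formula is the one quoted above; the loop itself is illustrative only, not Crafty's SMP code):

Code:

    def approx_speedup(ncpus: int) -> float:
        """speedup = 1 + (NCPUS - 1) * 0.7"""
        return 1 + (ncpus - 1) * 0.7

    for ncpus in (1, 2, 4, 8, 16, 32, 64):
        print(f"{ncpus:3d} CPUs -> ~{approx_speedup(ncpus):.1f}x")
    # 64 CPUs -> ~45.1x, matching the "about 45x" figure above.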
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Deep Blue vs Rybka

Post by Milos »

bob wrote:Totally up to you as to what you believe. As far as results go, here's at least one to chew on as I am running a calibration match right now to replace stockfish 1.6 with the latest 1.8.

Code:

    Stockfish 1.8 64bit  2878    3    3 56621   83%  2606   18% 
    Crafty-23.4-1        2672    4    4 30000   61%  2582   20% 
    Crafty-23.4R01-1     2669    4    4 30000   61%  2582   21% 
You can do the subtraction to see the difference between the two programs.

So, whether you believe my numbers or not doesn't really matter. They are what they are. However, as I said, my testing is done a bit differently. Equal hardware. No parallel search (from significant testing, Crafty will pick up 20+ Elo over Stockfish on an 8-core platform). No opening book (probably significant, as we have never released a customized book at all, and we simply play from 3000 equal starting positions, spread across _all_ popular openings being played by IM/GM players).
Crafty's results are exaggerated. In reality you only test against two families of engines (Glaurung and Fruit), and Crafty has in fact been tuned against them over the years. You can see that if you run it against Rybka: suddenly things change and Crafty becomes really weak.
But then you cannot run closed-source programs on your cluster. Try running it against Ivanhoe and you'd be surprised by the result. I did this, and not by running 30k games; even 1000 was sufficient.

There is no point in running 30k-game matches and cutting the error bars down to 4 Elo when you have a systematic error of at least 20 Elo from a non-representative sample of opponents.
Sorry, but your testing methodology is flawed.
Maybe one day you'll understand this.

I tend to not dream, and actually _measure_. I just showed results yesterday, here, that cutting the speed by 1/2 drops Elo by 70. Cutting it by 1/2 again drops it by 80. If you can't handle the math, that would appear to be _your_ problem, not mine.
Doubling the speed does increase Elo by 70; doubling the cores doesn't. You are aware of this, but somehow you keep ignoring it.
Increasing the speed 128 times improves Elo by almost 500. Going from 32 to 4096 nodes doesn't give you even 200 Elo. If you don't believe me, test it on your cluster and publish the results, if you dare.

I have no idea what you are talking about. Even if that were true, which it probably is not, if one could gain +20 every time the number of processors is doubled, that would produce a difficult-to-beat machine, given that there are 64K-node machines around. That is only 16 doublings, and at "only" 20 Elo each, that is +320. And for processor counts below 64, +20 per doubling is an underestimate.
Node doubling saturates. Going from 32768 to 65536 nodes you would not gain anything at all; in fact you would be lucky not to lose Elo.
Show us your cluster results when testing, for example, 378 against 756 nodes. The difference will be less than 20 Elo. It's a simple experiment for you, I guess ;).
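For reference, the +320 above is just a fixed per-doubling gain compounded over 16 doublings, and whether the per-doubling gain really stays fixed that far is exactly what is disputed here. A minimal sketch of the arithmetic (the 20 Elo per doubling is the hypothetical under debate, not a measurement):

Code:

    import math

    nodes = 64 * 1024              # "64K node machines"
    elo_per_doubling = 20          # the hypothetical per-doubling gain under debate
    doublings = math.log2(nodes)   # 16 doublings starting from a single node
    print(doublings, doublings * elo_per_doubling)   # 16.0 320.0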
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Deep Blue vs Rybka

Post by Don »

Bob,

I'm not following this too closely any longer. I don't know to what extent you have taken these two things into consideration; maybe you already have, but if not, here goes:

Crafty gets 100 Elo going from 1 to 4 processors. That is 2 doublings, which means you get 50 Elo per doubling. If you go with MORE processors you get even less Elo per doubling. So the point is that you cannot mix and match any way you want and call it science. I'm not saying you are doing that, as I am only quickly skimming these discussions. So if you talk about nodes per second, number of cores, or speedup per core, you have to separate them and make sure you are being scientifically rigorous, at least as much as tests like this permit.
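One way to keep that bookkeeping straight is to normalize any measured gain by the number of doublings involved. A small sketch using the figures quoted in this thread (100 Elo for 1 to 4 cores, 70 Elo per pure time doubling); these are the posters' numbers, not new measurements:

Code:

    import math

    def elo_per_doubling(total_gain: float, factor: float) -> float:
        """Average Elo per doubling when multiplying cores or speed by
        `factor` produced `total_gain` Elo overall."""
        return total_gain / math.log2(factor)

    print(elo_per_doubling(100, 4))   # cores 1 -> 4: 50.0 Elo per doubling
    print(elo_per_doubling(70, 2))    # pure time doubling: 70.0 Elo per doubling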

The other issue is that how much you get per doubling is not a constant either. Modern programs have excellent branching factors compared to the older programs. This is software improvement. In fact I don't really think there is a good way to resolve this. But consider this:

If you take a 1995 program and test it with a bunch of doublings, you won't come up with a very impressive number in terms of Elo. If you do the same with a modern program, you will come up with much more impressive numbers.

It would be a real mistake to attribute all of this to hardware by observing how much modern programs improve with a speedup. I know how you think, and you are going to consider this completely fair. But it isn't, because this kind of improvement was simply not possible without the software improvements that lowered the branching factor so much.

Also, you really NEED to give us all of your source code and binaries, for both the old program and the new.

You can of course do anything you want, but I have to say that I am extremely uncomfortable with YOU being in complete control of all the tests, designed by YOU under YOUR conditions, when these tests are designed so that YOU can make a point. Nobody was given any feedback on how these tests were run.

So unless you hand over the sources and binaries involved so that there can be some transparency, I think it would be foolish on our part to continue to entertain this.

It's stupid for us to keep bringing up issues and then for us to wait for you to tell us if we are right or wrong based on some private testing that you decided on.


bob wrote:
Uri Blass wrote:
bob wrote:The tournaments don't count "in toto". The Fredkin stage 2 prize required a 2550+ rating over 24 consecutive games against GM players only. However, if you look at old USCF rating reports, you can find a 2551 rating in 1988, although I don't have 'em for the entire year.
In this case maybe Deep Thought performed better against GMs relative to other players.

I do not remember that Deep Thought played against 24 GMs in 1988, so it would be nice to have a list of the GMs who played against it in 1988.
It had to, as that was a requirement. And IIRC some of those games were not rated: USCF had rules about "matches" between computers and humans, a result of the commercial bullshit from years before, where they would pay someone to play an "arranged match" to give them a high rating for their advertising...

I do not remember the names of all the GMs. I do remember it playing Byrne, Browne and Spraggett, but I suspect this must be recorded somewhere in the announcement of their winning the prize.
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Deep Blue vs Rybka

Post by Milos »

bob wrote:I have actually run on machines with up to 64 CPUs (not a cluster, a real shared-memory SMP box). And thru 64, the speedup pretty well matched my linear approximation of

speedup = 1 + (NCPUS - 1) * 0.7

Or, for the 64 CPU case, about 45x faster than one CPU. Not great, but seriously faster, still. Clusters are a different animal. And I play with this from time to time. It's a harder problem and I am not going to try to predict what is possible. NPS will be huge, but speedup is unknown. GCP could probably give a starting point for estimates here since he is doing this already.
You are talking all the time about speedup, which I suppose you define as time to a fixed depth. This is wrong, since time to fixed depth is not linearly related to strength.
Moreover, even the linear approximation above is very hard to believe.
If you want to do a real test, run the 1-CPU case against your standard set of opponents, then the 2-CPU case, then the 4-CPU case, etc., and plot the curve of Elo vs. num_cpus. I guarantee you it's not going to be linear; as a matter of fact it is probably going to be something of the form
elo_gain = A * (1 - exp(-k * num_cpus)).
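A minimal sketch of the saturating form Milos suggests; the constants A and k below are arbitrary placeholders (none were given) chosen only to show the shape of the curve:

Code:

    import math

    A, K = 300.0, 0.05   # illustrative values only

    def elo_gain(num_cpus: int) -> float:
        """elo_gain = A * (1 - exp(-k * num_cpus))"""
        return A * (1.0 - math.exp(-K * num_cpus))

    for n in (1, 2, 4, 8, 16, 32, 64):
        print(n, round(elo_gain(n), 1))
    # the gain flattens as num_cpus grows, approaching A as an asymptote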
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Deep Blue vs Rybka

Post by michiguel »

bob wrote:
Milos wrote:
rbarreira wrote:Are you really quibbling about 22-50 Elo differences on ratings with ±16 error margins? That just doesn't make any sense.
Sure, 70 Elo is much more realistic (add twice the error margin and you'll still be far from 70 Elo). Give me a break. Bob is a big authority in the field, but he is enormously biased when something of his own is in question!
I simply reported the doubling (really halving) Elo change as part of the hardware vs. software debate. Anyone can reproduce that test if they have the time and the interest. Put Crafty in a pool of players and play heads-up for 30,000 games. Then re-run, but give Crafty only 1/2 the time (the others get the original time). Elo dropped by 70. Do it again so that we are now at 1/4 the time, or 4x slower. Elo dropped by 150 total, or 80 for this second "halving".
Just for the record, because I see that if it is not contested, mentioning this will become accepted as truth: this second "halving" was obtained with an engine that scored ~10% of the points in that particular pool. There are three problems. First, the error in Elo increases a lot, because the Elo difference between a 10% and an 11% score is much bigger than between 50% and 51%. Second, the error bar at 10% is bigger than at 50%. Third, the Elo approximation at the tails of the curve may not be accurate anymore.

The difference between 70 and 80 may be just error.

Miguel
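Miguel's point about the error bars at the tails can be checked with a back-of-the-envelope calculation. This sketch assumes a plain binomial model with no draw correction, so the absolute numbers are rough, but the trend is what matters:

Code:

    import math

    def elo_stderr(score: float, games: int) -> float:
        """Approximate 1-sigma error of an Elo estimate at a given score:
        binomial error in the score times the local slope of the Elo curve."""
        se_score = math.sqrt(score * (1 - score) / games)
        slope = 400 / (math.log(10) * score * (1 - score))   # dElo/dscore
        return slope * se_score

    print(round(elo_stderr(0.50, 30000), 2))   # ~2.01 Elo at a 50% score
    print(round(elo_stderr(0.10, 30000), 2))   # ~3.34 Elo at a 10% score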

You have two reasonable alternatives:

(1) run the test and post your results. If they are different from mine, then we can try to figure out why.

(2) be quiet. guessing, thinking, supposing and such have no place in a discussion about _real_ data. And "real" data is all I have ever provided here. As in my previous data about SF 1.8 vs Crafty 23.4...
rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: Deep Blue vs Rybka

Post by rbarreira »

bob wrote:
Milos wrote:
mhull wrote:The point is that Bob knows how to do it and has the resources to test and perfect the technique.
The only thing I find interesting is whether or not he'd be able to achieve at least 20 Elo per node doubling starting from a 32-node cluster, so that, for example, with 1000 nodes he could get 100 Elo more than with the 32-node cluster.
It would really surprise me if he could.
I have actually run on machines with up to 64 CPUs (not a cluster, a real shared-memory SMP box). And thru 64, the speedup pretty well matched my linear approximation of

speedup = 1 + (NCPUS - 1) * 0.7

Or, for the 64 CPU case, about 45x faster than one CPU. Not great, but seriously faster, still. Clusters are a different animal. And I play with this from time to time. It's a harder problem and I am not going to try to predict what is possible. NPS will be huge, but speedup is unknown. GCP could probably give a starting point for estimates here since he is doing this already.
I'm not doubting the results of your tests with up to 64 CPUs, but that formula can't go below 70% efficiency even with infinite CPUs, which makes me doubt its applicability to a high number of CPUs. That formula actually says that beyond 8 CPUs or so, the overhead of adding CPUs hardly increases at all.

Unless there is some reason to believe that efficiency will never get below 70%, which I would find quite surprising.
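That observation follows directly from the formula itself: the implied parallel efficiency approaches 0.7 from above and never drops below it. A quick check of the formula only (this says nothing about real hardware):

Code:

    def efficiency(ncpus: int) -> float:
        """Parallel efficiency implied by speedup = 1 + (ncpus - 1) * 0.7."""
        return (1 + (ncpus - 1) * 0.7) / ncpus

    for n in (2, 8, 64, 1024, 65536):
        print(n, round(efficiency(n), 4))
    # 0.85, 0.7375, 0.7047, 0.7003, 0.7 -> the limit is 0.7, never lower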
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Deep Blue vs Rybka

Post by michiguel »

Milos wrote:
bob wrote:I have actually run on machines with up to 64 CPUs (not a cluster, a real shared-memory SMP box). And thru 64, the speedup pretty well matched my linear approximation of

speedup = 1 + (NCPUS - 1) * 0.7

Or, for the 64 CPU case, about 45x faster than one CPU. Not great, but seriously faster, still. Clusters are a different animal. And I play with this from time to time. It's a harder problem and I am not going to try to predict what is possible. NPS will be huge, but speedup is unknown. GCP could probably give a starting point for estimates here since he is doing this already.
You are talking all the time about speedup, which I suppose you define as time to a fixed depth. This is wrong, since time to fixed depth is not linearly related to strength.
Moreover, even the linear approximation above is very hard to believe.
If you want to do a real test, run the 1-CPU case against your standard set of opponents, then the 2-CPU case, then the 4-CPU case, etc., and plot the curve of Elo vs. num_cpus. I guarantee you it's not going to be linear; as a matter of fact it is probably going to be something of the form
elo_gain = A * (1 - exp(-k * num_cpus)).
Probably better:

EloGain = A * ln (n / (k+n)) + MaxEloGain

where n is num_cpus and A is a constant close to ~100.
k is another constant.

Miguel
EDIT: EloGain compared to 1 cpu
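For comparison, the two saturating forms proposed in this subthread can be put side by side. The constants below are illustrative guesses (only "A close to ~100" was stated), and the log form is normalized so its gain is zero at one CPU, matching the "compared to 1 cpu" note:

Code:

    import math

    A_EXP, K_EXP = 300.0, 0.05               # placeholders for Milos's form
    A_LOG, K_LOG = 100.0, 4.0                # A ~ 100 as Miguel suggests; k is a guess
    MAX_GAIN = A_LOG * math.log(1 + K_LOG)   # chosen so the log form is zero at n = 1

    def gain_exp(n: int) -> float:
        """Milos: A * (1 - exp(-k * n))"""
        return A_EXP * (1.0 - math.exp(-K_EXP * n))

    def gain_log(n: int) -> float:
        """Miguel: A * ln(n / (k + n)) + MaxEloGain, relative to 1 CPU"""
        return A_LOG * math.log(n / (K_LOG + n)) + MAX_GAIN

    for n in (1, 2, 4, 8, 16, 32, 64):
        print(n, round(gain_exp(n), 1), round(gain_log(n), 1))
    # both curves flatten as n grows; they differ mainly in how fast they saturate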
User avatar
mhull
Posts: 13447
Joined: Wed Mar 08, 2006 9:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

Re: Deep Blue vs Rybka

Post by mhull »

Milos wrote:Crafty's results are exaggerated. In reality you only test against two families of engines (Glaurung and Fruit), and Crafty has in fact been tuned against them over the years. You can see that if you run it against Rybka: suddenly things change and Crafty becomes really weak.
But then you cannot run closed-source programs on your cluster. Try running it against Ivanhoe and you'd be surprised by the result. I did this, and not by running 30k games; even 1000 was sufficient.
The difference between you and Bob is that Bob posts measurements that support his claims.
Milos wrote:There is no point in running 30k-game matches and cutting the error bars down to 4 Elo when you have a systematic error of at least 20 Elo from a non-representative sample of opponents.
Sorry, but your testing methodology is flawed.
Maybe one day you'll understand this.
Maybe if you posted real measurements, your opinion would have credibility.
Matthew Hull
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Deep Blue vs Rybka

Post by Milos »

mhull wrote:The difference between you and Bob is that Bob posts measurements that support his claims.

Maybe if you posted real measurements, your opinion would have credibility.
OK, let's see:

CCRL 40/4:
Stockfish 1.7.1 64-bit 3142 +13 −13 2402 65.5%
Crafty 23.2 64-bit 2820 +15 −15 1655 56.1%
322 Elo

CEGT 40/4:
Stockfish 1.7.1 x64 1CPU 3099 11 11 2540 62.4%
Crafty 23.1 x64 1CPU 2775 38 38 200 60.2%
324 Elo

IPON:
Stockfish 1.7.1 JA 2883 11 11 3000 73%
Crafty 23.3 JA 2600 14 14 1900 30%
Crafty 23.1 JA 2545 10 10 4000 26%
283 Elo compared to 23.3, 338 Elo compared to 23.1

SWCR:
Stockfish 1.8.0 JA x64 2892 22 21 800
Crafty 23.3 JA x64 2582 20 21 800
310 Elo

Are you really convinced all these lists are wrong and only Bob's cluster results are correct???
Remember, we are not talking about a 10-15 Elo difference. This is almost a 100 Elo difference (or at least an 85 Elo difference in the best case for Bob)!
Last edited by Milos on Tue Sep 14, 2010 6:34 pm, edited 1 time in total.
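To put the disputed gaps in perspective, an Elo difference maps to an expected score through the standard logistic formula. A small sketch: 206 is simply the Stockfish-Crafty gap in Bob's table above, and 320 is roughly the size of the rating-list gaps quoted here:

Code:

    def expected_score(elo_diff: float) -> float:
        """Expected score of the stronger side under the standard Elo model."""
        return 1.0 / (1.0 + 10 ** (-elo_diff / 400.0))

    print(round(expected_score(206), 3))   # ~0.766, the gap in Bob's cluster table
    print(round(expected_score(320), 3))   # ~0.863, the gap on the rating lists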