Deep Blue vs Rybka

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Deep Blue vs Rybka

Post by bob »

Don wrote:Bob,

I'm not following this too closely any longer. I don't know to what extent you have taken these 2 things into consideration - maybe you already have but if not, here goes:

Crafty gets 100 Elo going from 1 to 4 processors. That is 2 doublings, which means you get 50 Elo per doubling. If you go with MORE processors you get even less Elo per doubling. So the point is that you cannot mix and match any way you want to and call it science. I'm not saying you are doing that, as I am only quickly skimming these discussions. But if you talk about nodes per second, number of cores, or speedup per core, you have to separate them and make sure you are being scientifically rigorous, at least as much as tests like this can permit.
That is not "two doublings". This is, once again, apples and oranges. SMP overhead comes in and this changes things.

So can we use some sensible numbers? Going from 1 CPU to 4 is about a 3x speedup. But not knowing the hardware makes even this inexact, as there are good SMP boxes and bad SMP boxes. And some take an SMP version and run it on NUMA hardware, which performs significantly below optimal if Crafty doesn't know about the topology. Etc. So 3x = +100 is "in the range". But it isn't 2 doublings.

If you recall, in our hardware discussion I gave both a "NPS speed" which is just raw hardware, but then I factored in SMP to go from 1 i7 core to 6, and ended up with 4.5x rather than 6x.
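To make the arithmetic in this exchange concrete: a 3x speedup is log2(3) ≈ 1.58 doublings, not 2, so the implied Elo per doubling is higher than the naive 50. A minimal sketch of the conversion (the 3x and +100 Elo figures are the ones quoted above; the log2 conversion is the standard way to count doublings):

```python
import math

def elo_per_doubling(speedup: float, elo_gain: float) -> float:
    """Convert a measured speedup and total Elo gain into Elo per doubling."""
    doublings = math.log2(speedup)  # 3x speedup is ~1.58 doublings, not 2
    return elo_gain / doublings

# 1 -> 4 CPUs: ~3x effective speedup, ~+100 Elo (figures from the post above)
print(round(elo_per_doubling(3.0, 100)))  # ~63 Elo per doubling, not 50
```

This is why treating 1-to-4 CPUs as "two doublings" understates the per-doubling gain.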




The other issue is that how much you get per doubling is not a constant either. Modern programs have excellent branching factors compared to the older programs. This is a software improvement. In fact I don't really think there is a good way to resolve this. But consider this:

If you take a 1995 program and test it with a bunch of doublings, you won't come up with a very impressive number in terms of Elo. If you could do the same with a modern program you would come up with much more impressive numbers.
I don't believe that, because I have tried it. In fact, as we get more speculative in the search, going 2x faster today is not the same as going 2x faster 10 years ago, because some fraction of that 2x goes up in smoke due to error. I just ran some "doubling" experiments for the old version. I'll crank out a couple for the new code to compare, again.


It would be a real mistake to attribute all of this to hardware by observing how much modern programs improve with a speedup. I know how you think, and you are going to consider this completely fair. But it isn't, because this kind of improvement was just not possible without the software improvements that lowered the branching factor so much.

Also, you really NEED to give us all of your source code and binaries, for both the old program and the new.
Already done. Always been on my ftp machine. I can stick the modified version I tested there, but again, I am not certain at all that it will run with xboard, the protocol has changed a lot over the years. 23.3 is already available and everyone interested has a copy, and it is still on my ftp box. 23.4 is maybe 5-6 Elo stronger and isn't available, but I don't see how that makes enough difference to matter.

I did just copy my 10.x version over, and I called it 10.x to keep it from being confused with 10.18. Test at your own risk. It does require 64 bit hardware only, and it works with my cluster referee program, no idea about xboard/winboard...



You can of course do anything you want, but I have to say that I am extremely uncomfortable with YOU being in complete control of all the tests designed by YOU under YOUR conditions when these tests are designed so that YOU can make a point. Nobody was given any feedback on how these tests were run.

So unless you hand over the sources and binaries involved so that there can be some transparency, I think it would be foolish on our part to continue to entertain this.
Funny guy. Who was making claims about Komodo as well during the discussion? And where, exactly, is _your_ source? I know, this is a one-way street. In any case, my source has been available the whole time on ftp.cis.uab.edu/pub/hyatt/source, you just choose the version numbers to look at. I just put my crafty-10.x.tar over there, which may work for you, or not. The normal 10.18 certainly will not, as I could not get it to compile and run without making quite a few changes.

It's stupid for us to keep bringing up issues and then for us to wait for you to tell us if we are right or wrong based on some private testing that you decided on.
No, it is _lazy_ for you to not run the tests yourself. I try to be open, and post everything relevant, and you want to imply that I am being secretive and not open. Again, where's your source? Got something to hide? Clearly I don't...


bob wrote:
Uri Blass wrote:
bob wrote:The tournaments don't count "in toto". The Fredkin stage 2 prize required a 2550+ rating over 24 consecutive games against GM players only. However, if you look at old USCF rating reports, you can find a 2551 rating in 1988, although I don't have 'em for the entire year.
In this case maybe Deep Thought performed better against GMs relative to other players.

I do not remember that Deep Thought played against 24 GMs in 1988,
so it would be nice to have a list of the GMs who played against it in 1988.
It had to, as that was a requirement. And IIRC some of those games were not rated, because USCF had rules about "matches" between computers and humans, because of the commercial bullshit from years before where they would pay someone to play an "arranged match" to give them a high rating for their advertising...

I do not remember the names of all the GMs. I do remember it playing Byrne, Browne and Spraggett, but I suspect this must be recorded somewhere in the announcement of their winning the prize.
mhull
Posts: 13447
Joined: Wed Mar 08, 2006 9:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

Re: Deep Blue vs Rybka

Post by mhull »

bob wrote:
Uri Blass wrote:
bob wrote:The tournaments don't count "in toto". The Fredkin stage 2 prize required a 2550+ rating over 24 consecutive games against GM players only. However, if you look at old USCF rating reports, you can find a 2551 rating in 1988, although I don't have 'em for the entire year.
In this case maybe Deep Thought performed better against GMs relative to other players.

I do not remember that Deep Thought played against 24 GMs in 1988,
so it would be nice to have a list of the GMs who played against it in 1988.
It had to, as that was a requirement. And IIRC some of those games were not rated, because USCF had rules about "matches" between computers and humans, because of the commercial bullshit from years before where they would pay someone to play an "arranged match" to give them a high rating for their advertising...

I do not remember the names of all the GMs. I do remember it playing Byrne, Browne and Spraggett, but I suspect this must be recorded somewhere in the announcement of their winning the prize.
Here is an account of winning the Fredkin Intermediate Prize for attaining a 2500+ rating.
www.aaai.org/ojs/index.php/aimagazine/a ... ad/753/671
Matthew Hull
mhull
Posts: 13447
Joined: Wed Mar 08, 2006 9:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

Re: Deep Blue vs Rybka

Post by mhull »

Milos wrote:
mhull wrote:The difference between you and Bob is that Bob posts measurements that support his claims.

Maybe if you posted real measurements, your opinion would have credibility.
OK, let's see:

CCRL 40/4:
Stockfish 1.7.1 64-bit 3142 +13 −13 2402 65.5%
Crafty 23.2 64-bit 2820 +15 −15 1655 56.1%
A 322 Elo difference.
Which engines does the common book favor? You don't know. Which engines does ponder-off favor? You don't know.
Milos wrote: Are you really convinced all these lists are wrong and only Bob's cluster results are correct???
Remember we are not talking about a 10-15 Elo difference. This is almost a 100 Elo difference (or at least an 85 Elo difference in the best case for Bob)!
Bob's tests reduce the error margin by playing more games with fewer unknown variables.

I'm not saying you're wrong by definition, I'm just saying yours is an opinion based on weaker data.
Matthew Hull
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Deep Blue vs Rybka

Post by michiguel »

mhull wrote:
Milos wrote:
mhull wrote:The difference between you and Bob is that Bob posts measurements that support his claims.

Maybe if you posted real measurements, your opinion would have credibility.
OK, let's see:

CCRL 40/4:
Stockfish 1.7.1 64-bit 3142 +13 −13 2402 65.5%
Crafty 23.2 64-bit 2820 +15 −15 1655 56.1%
A 322 Elo difference.
Which engines does the common book favor? You don't know. Which engines does ponder-off favor? You don't know.
Milos wrote: Are you really convinced all these lists are wrong and only Bob's cluster results are correct???
Remember we are not talking about a 10-15 Elo difference. This is almost a 100 Elo difference (or at least an 85 Elo difference in the best case for Bob)!
Bob's tests reduce the error margin by playing more games with fewer unknown variables.

I'm not saying you're wrong by definition, I'm just saying yours is an opinion based on weaker data.
This is a typical example of accuracy vs. precision. Bob's measurements are very precise, but possibly less accurate (we do not know the accuracy or lack thereof). The accuracy problem is smaller when two Crafty versions are compared, but potentially bigger when Crafty is compared to one of the control programs.

Miguel
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Deep Blue vs Rybka

Post by bob »

Milos wrote:
bob wrote:I have actually run on machines with up to 64 CPUs (not a cluster, a real shared-memory SMP box). And thru 64, the speedup pretty well matched my linear approximation of

speedup = 1 + (NCPUS - 1) * 0.7

Or, for the 64 CPU case, about 45x faster than one CPU. Not great, but seriously faster, still. Clusters are a different animal. And I play with this from time to time. It's a harder problem and I am not going to try to predict what is possible. NPS will be huge, but speedup is unknown. GCP could probably give a starting point for estimates here since he is doing this already.
You are talking all the time about speedup, which I suppose you define as time to fixed depth. This is wrong, since time to fixed depth is not linearly related to strength.
Moreover, even the upper linear approximation is very hard to believe.
If you want to do a real test, run the 1-CPU case against your standard set of opponents, then the 2-CPU case, then the 4-CPU case, etc., and plot the curve of Elo vs. num_cpus. I guarantee you it's not going to be linear; as a matter of fact it is probably going to be something of the form
elo_gain=A*(1-exp(-k*num_cpus)).
Pure nonsense. If a program gets to the same depth 3x faster, it is 3x better. Doesn't matter whether this is done by SMP search or just a single CPU that is 3x faster. Please come back when you actually have something that is based on reality, rather than just random nonsense. Twice as fast gives (for my old program, not verified for new one yet, but will be) +70 to +80. Whether you give me one CPU that is twice as fast, or enough CPUs to let the search reach the same point twice as fast does not matter.

I can absolutely guarantee you that for what I have done so far, which stops at 64 CPUs, my linear formula is very accurate. For 1-8 cpus it is actually pessimistic. You can find a discussion about 1-8 cpu speedups a few years ago when I was running on an 8-cpu opteron box. I ran the test positions Vincent asked for, and posted the results on my ftp box. Several looked at them, Martin F. took the time to go thru several hundred positions to compute the speedup and found that my estimation formula was a bit off. For example, real was 3.3, predicted was 3.1.. for 8, real was 6.3 (I believe) and predicted was 5.9. Since, I have had the chance to run on 12, 24, 32 and 64 cpu boxes, and found the formula to fit. When I tested on 32 and 64, it took some significant tweaking to get reasonable numbers. That was 2+ years ago however.

So stop spouting nonsense about how to measure performance. Been doing it for 32 years on parallel algorithms and know how to do it correctly. You should read a bit and learn the same for yourself.

If you don't think there is a linear relationship between depth and strength, you need to get out more and read some. Ken Thompson, then Berliner, then Heinz all ran these varying-depth tests to measure Elo improvement, searching for that elusive "diminishing returns" look on a graph.
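As an editorial aside, the two competing models in this exchange are easy to compare numerically. A hedged sketch: the linear approximation is quoted verbatim from Bob's post, and the measured speedups (3.3 at 4 CPUs, 6.3 at 8) are the ones attributed to Martin F. above; Milos's saturating form was stated without values for A and k, so the parameters used here are purely illustrative guesses:

```python
import math

def linear_speedup(ncpus: int) -> float:
    """Bob's linear approximation from the post: speedup = 1 + (NCPUS - 1) * 0.7."""
    return 1 + (ncpus - 1) * 0.7

def saturating_gain(ncpus: int, a: float = 500.0, k: float = 0.05) -> float:
    """Milos's proposed form: elo_gain = A * (1 - exp(-k * num_cpus)).
    A and k are illustrative only; the post gives no values for them."""
    return a * (1 - math.exp(-k * ncpus))

# Predicted vs. measured speedups quoted in the thread
for n, measured in [(4, 3.3), (8, 6.3)]:
    print(n, linear_speedup(n), measured)   # prediction is slightly pessimistic
print(64, linear_speedup(64))               # ~45x, matching "about 45x" above
```

The disagreement in the thread is not about these numbers but about whether speedup (time to fixed depth) translates linearly into Elo.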
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Deep Blue vs Rybka

Post by Milos »

mhull wrote:Bob's tests reduce the error margin by playing more games with fewer unknown variables.

I'm not saying you're wrong by definition, I'm just saying yours is an opinion based on weaker data.
Bob's tens of thousands of games have nothing to do with accuracy. Simply put, his testing methodology is faulty. He could play millions of games and his results would still be inaccurate.
I understand some people get easily impressed by this, but when different test methodologies, opening books, etc. all agree to within 10-15 Elo (with 15-20 Elo error margins) and Bob's results are off by almost 100 Elo with his 4 Elo margin, everything you say is just grasping at straws.
Bob has a systematic error in his testing methodology which he (and some other people) are not willing to admit.
I hope you remember The Emperor's New Clothes tale...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Deep Blue vs Rybka

Post by bob »

michiguel wrote:
bob wrote:
Milos wrote:
rbarreira wrote:Are you really quibbling about 22-50 elo differences on ratings with +- 16 error margins? That just doesn't make any sense.
Sure, 70 Elo is much more realistic (add twice the error margin and you'll still be far from 70 Elo). Give me a break. Bob is a big authority in the field, but he is enormously biased when something of his own is in question!
I simply reported the doubling (really halving) Elo change as part of the hardware vs software debate. Anyone can reproduce that test if they have the time and the interest. Put Crafty in a pool of players, and play heads-up for 30,000 games. Then re-run, but only give Crafty 1/2 the time (others get the original time). Elo dropped by 70. Do it again so that we are now at 1/4 the time, or 4x slower. Elo dropped by 150 total, or 80 for this second "halving".
Just for the record, because I see that mentioning this will become a truth if it is not contested. This second "halving" was obtained with an engine that scored ~10% of the points in that particular pool. There are three problems. First, the error in Elo increases a lot because the Elo difference between a 10% and an 11% score is much bigger than between 50% and 51%. Second, the error bar at 10% is bigger than at 50%. Third, the Elo approximation at the tails of the curve may not be accurate anymore.

The difference between 70 and 80 may be just error.
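Miguel's point about the tails can be illustrated with the standard logistic Elo conversion (this is the usual formula relating score fraction to rating difference, not something stated in the thread):

```python
import math

def elo_from_score(p: float) -> float:
    """Standard logistic Elo model: rating difference implied by score fraction p."""
    return 400 * math.log10(p / (1 - p))

# Near 50%, one percentage point of score is worth about 7 Elo...
print(round(elo_from_score(0.51) - elo_from_score(0.50)))  # ~7
# ...but near 10%, the same percentage point is worth roughly 18.5 Elo,
# so identical score noise produces much larger Elo noise at the tails.
print(f"{elo_from_score(0.11) - elo_from_score(0.10):.1f}")  # ~18.5
```

This is why a 70-vs-80 Elo difference measured at a ~10% score could plausibly be just error.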
I believe I mentioned that. And that the only solution was to find weaker opponents, which is a gigantic waste of time for anything except to answer this question.

Miguel

You have two reasonable alternatives:

(1) run the test and post your results. If they are different from mine, then we can try to figure out why.

(2) be quiet. guessing, thinking, supposing and such have no place in a discussion about _real_ data. And "real" data is all I have ever provided here. As in my previous data about SF 1.8 vs Crafty 23.4...
mhull
Posts: 13447
Joined: Wed Mar 08, 2006 9:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

Re: Deep Blue vs Rybka

Post by mhull »

bob wrote:...the only solution was to find weaker opponents, which is a gigantic waste of time for anything except to answer this question.
If someone had the time to waste, perhaps Gnuchess 5.17, Gnuchess 4.x (completely different program), Phalanx XXII. Dann Corbit might have more suggestions.
Matthew Hull
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Deep Blue vs Rybka

Post by bob »

Milos wrote:
bob wrote:Totally up to you as to what you believe. As far as results go, here's at least one to chew on as I am running a calibration match right now to replace stockfish 1.6 with the latest 1.8.

Code:

    Stockfish 1.8 64bit  2878    3    3 56621   83%  2606   18% 
    Crafty-23.4-1        2672    4    4 30000   61%  2582   20% 
    Crafty-23.4R01-1     2669    4    4 30000   61%  2582   21% 
You can do the subtraction to see the difference between the two programs.

So, whether you believe my numbers or not doesn't really matter. They are what they are. However, as I said, my testing is done a bit differently. Equal hardware. No parallel search (from significant testing, Crafty will pick up 20+ Elo over Stockfish on an 8-core platform). No opening book (probably significant, as we have never released a customized book at all, and simply play from 3000 equal starting positions, spread across _all_ popular openings being played by IM/GM players).
Crafty's results are exaggerated. In reality you only test against 2 families of engines (Glaurung and Fruit), and Crafty has in fact been tuned against them for years. You can see that if you run it against Rybka. Suddenly things change and Crafty becomes really weak.
But then you cannot run it on your cluster against closed-source programs. Try running it then against Ivanhoe and you'd be surprised with the result. I did this, not by running 30k games; even 1000 was sufficient.

There is no point in running 30k-game matches and cutting the error bars down to 4 Elo when you have a systematic error of at least 20 Elo from a non-representative sample of opponents.
Sorry, but your testing methodology is flawed.
Maybe once you'll understand this.

I tend to not dream, and actually _measure_. I just showed results yesterday, here, that cutting the speed by 1/2 drops Elo by 70. Cutting it by 1/2 again drops it by 80. If you can't handle the math, that would appear to be _your_ problem, not mine.
Speed doubling does increase Elo by 70. Core doubling doesn't. You are aware of this but somehow you are ignoring this fact all the time.
Increasing speed 128 times improves Elo by almost 500. Going from 32 to 4096 nodes doesn't give you even 200 Elo. If you don't believe me, test it on your cluster and publish the results, if you dare.

Again, the first prerequisite for participating in a discussion is "learn to read". I've discussed _both_ aspects. And in all the comparisons I made, the numbers were _corrected_ for SMP loss. My 6-core i7 numbers were based on 4.5x faster performance, which _is_ correct. I did not call it 6x, except when I used the term "raw hardware speed". That took my 1500x raw hardware improvement down to just over 1024x, if you'd only read instead of skimming, and then analyze rather than diving into what you are going to write next without following the discussion at all.



I have no idea what you are talking about. Even if that were true, which it probably is not, if one could gain +20 for every time the number of processors is doubled, that would produce a difficult-to-beat machine, knowing that there are 64K node machines around. That is only 16 doublings, and at "only" 20 Elo, that is +320. And for processor numbers below 64, +20 for doubling is an under-estimation.
Node doubling saturates. When going from 32768 to 65536 nodes you would not gain anything at all. In fact you would be lucky not to lose Elo.
Show us your cluster results for example when testing 378 against 756 nodes. It will be less than 20 elo of difference. It's a simple experiment for you, I guess ;).
Have you seen me discuss 32768 nodes other than theoretically? Of course not. Nowhere in the discussion did we go off into never-never land, except for the hypothetical case of what _might_ happen one day.
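The per-doubling rates implied by the figures Milos quotes are worth writing out, since both sides agree on the speed side but not the node side. A short sanity check (the 128x/+500 and 32-to-4096-node/<200 Elo figures are Milos's claims from the exchange above; the arithmetic is just counting doublings):

```python
import math

# Milos's figures above: 128x speed ~ +500 Elo; 32 -> 4096 nodes ~ < +200 Elo.
speed_doublings = math.log2(128)        # 7 doublings of raw speed
node_doublings = math.log2(4096 / 32)   # also 7 doublings of node count

print(500 / speed_doublings)  # ~71 Elo per speed doubling, matching the +70 figure
print(200 / node_doublings)   # < ~29 Elo per node doubling, the claimed ceiling
```

So the disagreement reduces to whether a node doubling on a cluster is worth ~29 Elo or something closer to a speed doubling's ~70.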
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Deep Blue vs Rybka

Post by Milos »

bob wrote:I can absolutely guarantee you that for what I have done so far, which stops at 64 CPUs, my linear formula is very accurate. For 1-8 cpus it is actually pessimistic. You can find a discussion about 1-8 cpu speedups a few years ago when I was running on an 8-cpu opteron box. I ran the test positions Vincent asked for, and posted the results on my ftp box. Several looked at them, Martin F. took the time to go thru several hundred positions to compute the speedup and found that my estimation formula was a bit off. For example, real was 3.3, predicted was 3.1.. for 8, real was 6.3 (I believe) and predicted was 5.9. Since, I have had the chance to run on 12, 24, 32 and 64 cpu boxes, and found the formula to fit. When I tested on 32 and 64, it took some significant tweaking to get reasonable numbers. That was 2+ years ago however.
Sure, you ran test positions.
Then why don't you just use test positions to optimize the strength of your engine, instead of running test matches???
I say test positions are nonsense and you can't measure program strength with them.
You have never shown test-match data.
When you are so convinced, why don't you run the test I proposed above? It is very easy to run ;)