Deep Blue vs Rybka

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Deep Blue vs Rybka

Post by Don »

Just to add random noise to this conversation, the way I measure the effect of some change on the speed in Komodo is to run off some fixed depth games and note the time for the first N ply, usually to ply 140. My tester measures this.

I have to run enough games that the number settles down which happens after 100 or 200 games.

So this is a test I do when I want to know if some selectivity change (for instance) really gave much of a speedup.

I have tried just running 100 or 200 random positions but I think this is better.
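As an illustration of that kind of measurement (not Don's actual tester), here is a minimal sketch that assumes the fixed-depth games have been reduced to a hypothetical CSV of per-move records game_id,ply,seconds, and reports how the average time-to-ply-140 settles as more games are added:

Code: Select all

# Minimal sketch of a fixed-depth timing harness (not Don's tester).
# Assumes a hypothetical CSV "fixed_depth_moves.csv" with columns
# game_id,ply,seconds produced by fixed-depth games.
import csv
from collections import defaultdict

MAX_PLY = 140

def time_to_ply(path="fixed_depth_moves.csv", max_ply=MAX_PLY):
    # sum per-move times up to max_ply for each game
    per_game = defaultdict(float)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if int(row["ply"]) <= max_ply:
                per_game[row["game_id"]] += float(row["seconds"])
    return list(per_game.values())

def running_mean(times):
    total = 0.0
    for n, t in enumerate(times, start=1):
        total += t
        yield n, total / n

if __name__ == "__main__":
    for n, mean in running_mean(time_to_ply()):
        if n % 25 == 0:   # watch the number settle as games accumulate
            print(f"after {n:4d} games: mean time to ply {MAX_PLY} = {mean:8.1f}s")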

bob wrote:
Milos wrote:
bob wrote:The reason these positions were used is that before I wrote the DTS paper, as opposed to my dissertation, I had been asked "OK, we see what kind of speedup you get on random positions, but how does that translate to games, where you have the hash table carrying information from one position to the next, etc?" I didn't have an answer, so I undertook the challenge to answer that question. I can look thru any longish game Crafty plays on ICC, and find over the course of 10-15 moves search depths that vary by 10 plies.
Uh where to start... so much nonsense just in one post, it's hard.
Your biggest argument is to repeat like a parrot how ignorant I am, which actually says a lot about how grounded your claims are.

Let's see, somebody told you you should follow positions from a single chess game without resetting the hash and that would be a representative sample. My comment is that you really had idiots on your thesis committee. More problematic is that you accepted their suggestions...

So, up until 1995 (or whenever that was published) no one had _ever_ considered the actual speedup produced over a real game. Just random positions chosen from here and there. Several had noted that "deeper is better" for parallel search, something still true today. And YOU think that trying to answer that question is not worthwhile? No point in knowing what the speedup is in a game? I mean we don't actually use these things to play games anyway, correct?

The only idiot here is arguing with me at the moment.


You don't see any points at all, obviously. Had you read any previous research papers on parallel search, you would have known that deeper gives more accurate numbers. 15 ply searches are a fraction of a second. Time measurement noise becomes larger than the total search time. So get serious and study this a bit before making ridiculous statements.

And not very useful with some searches taking milliseconds...
Let's see further: you laugh at a 15-ply search. 15 plies (with reset hash) on 8 cores takes how long? Milliseconds?
You must be joking; it takes at least 0.1s for all of your 40 positions. For most positions it takes about a second. I know you can't judge things correctly any more (it obviously comes with age), but you could at least try to run it yourself before writing ridiculous things like this.

??? You have already run these positions? I can show you endgame searches that reach 40 plies in < .01 seconds, the smallest amount of time I can measure inside Crafty. Do you know anything about parallel tree search? You do realize there is overhead (computational, not extra nodes) to simply kick off parallel searches? And of course there is the extra node overhead caused by incorrect move ordering. And in a 15 ply search, in simple positions, there is more overhead involved in getting the split done than there is in doing the split search?

This really is pointless when you have no idea whatsoever about how a parallel alpha/beta search works...

In addition, in your DTS paper you hardly reached the amazing depth of an incredible 11 plies. How ridiculous must that be then, when 15 is ridiculously small for you?
In the C90, in 1994, we were doing 11 ply searches in 3 minutes. Today, on an 8-core, I see 24-26-28 depending. 15 sounds pretty miniscule, eh? In 2004, when I ran the last set of tests using the 8-way Opteron box, 15 plies was not exactly a long search either, although nowhere near as short as it is today. I just tried a 15 ply search on my 8-core Intel box. Starting position. Results:

Code: Select all

               15->   0.32   0.36   1. Nf3 Nc6 2. Nc3 Nf6 3. e4 e5 4. d4
                                    exd4 5. Nxd4 Bc5 6. Nxc6 bxc6 7. Bd3
                                    O-O 8. O-O
              time=0.33  mat=0  n=2128730  fh=91%  nps=6.5M
6.5M nps

Then run a little longer to let the parallel search have a chance to overcome the overhead it incurs and:

Code: Select all

               23    41.31   0.22   1. e4 e5 2. Nf3 Nc6 3. Nc3 Nf6 4. d4
                                    exd4 5. Nxd4 Bc5 6. Nxc6 bxc6 7. Bc4
                                    O-O 8. O-O d6 9. Bg5 Be6 10. Bxe6 fxe6
                                    11. Qf3 Bd4 12. Rfe1
               23->  48.53   0.22   1. e4 e5 2. Nf3 Nc6 3. Nc3 Nf6 4. d4
                                    exd4 5. Nxd4 Bc5 6. Nxc6 bxc6 7. Bc4
                                    O-O 8. O-O d6 9. Bg5 Be6 10. Bxe6 fxe6
                                    11. Qf3 Bd4 12. Rfe1
               24    48.53   1/20*  1. e4      (14.9Mnps)             
              time=59.88  mat=0  n=912806093  fh=91%  nps=15.2M
So, now 2.5x faster.

See what I mean? There's a _lot_ you don't know about this that those of us doing this stuff do understand.
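For what it's worth, the two runs quoted above already show the effect: the same 8-core box reports 6.5M nps on the 0.33-second depth-15 search and 15.2M nps once the search runs long enough. A quick calculation of the ratio:

Code: Select all

# Back-of-the-envelope check using the two Crafty runs quoted above.
short_nps = 6.5e6    # nps reported at depth 15 (0.33s total)
long_nps = 15.2e6    # nps reported after ~60s at depth 23+
print(f"NPS ratio, long vs short search: {long_nps / short_nps:.2f}x")  # ~2.3x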


And then you talk about accurate numbers by running a total of 40 positions. A really good representative sample of a chess game. Oh dear...
Funny. I've been playing chess for 50 years now. I don't play many 200 move games. In fact I don't recall having played _any_ games over 100 moves long in all the tournaments and skittles games I have played. I excluded the first 10 moves or so because those were book moves. Hard to get a parallel speedup there, eh? And I excluded that last 10-15 moves because the game was over and the search speedup was greatly inflated once mate is found. Seems like the only viable way to answer the question, given the difficulty of accessing a 70 million buck machine in 1994 when the actual game was played in the ACM tournament.


There are two obvious ways to run a test like this...
.
.
.
Ideally one wants each position to take about as long as the other positions, so that the average doesn't weight some speedups more heavily than others. But I suppose that is yet another thing you had no clue about?
No, there are 2 of Bob's ways. Bob of course thinks these are the only 2 possible ways. This only speaks to Bob's arrogance.
Or your ignorance. I'm waiting for an alternative. I'll bet this is going to be earth-shattering in its significance. Or stupidity. I'll read on to see.


But since Bob is not a god, and his writings are not the bible (even though most of his findings are as old as the latter), there is absolutely no reason to follow them.
There is an infinite number of ways, and none of them is more correct than the others (specifically not because Bob says the opposite).
You could, for example, run up to a fixed number of nodes each time.
First suggestion is stupidity. Now you factor out search overhead because you always stop after N nodes. Great parallel speedup to do that. Meaningless. But great. On to the next...

By running to a fixed depth you get differently shaped (sized) trees, you get more variety, and this is certainly not worse than always running for the same amount of time.
So early in the game, your fixed depth searches take 2 minutes. Late in the game they take 2 seconds. And you are going to take the speedup for a 2 minute computation and treat it equally to the speedup for a 2 second computation? Which would help the program the most, a good speedup on the big tree, or a good speedup on the small tree? If you search the small tree in 0 seconds, you only save 2 seconds total to help you in the game. If you search the 2 minute position in only 1 minute, a paltry 2x speedup, you just saved a minute. In a timed game of chess, which is more important?

So stupid idea #2. Got any more of these?
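To make the weighting argument above concrete, here is a tiny worked example with invented numbers: a 2x speedup on a 2-minute search saves far more clock time than a 4x speedup on a 2-second search, so a plain average of per-position speedups overstates what the parallel search buys you in a timed game.

Code: Select all

# Hypothetical two-position example of the weighting argument.
positions = [
    {"serial": 120.0, "parallel": 60.0},   # big early-game tree, 2.0x speedup
    {"serial": 2.0,   "parallel": 0.5},    # tiny late-game tree, 4.0x speedup
]
per_pos = [p["serial"] / p["parallel"] for p in positions]
unweighted = sum(per_pos) / len(per_pos)
overall = sum(p["serial"] for p in positions) / sum(p["parallel"] for p in positions)
print(f"average of per-position speedups: {unweighted:.2f}x")  # 3.00x
print(f"speedup in total elapsed time:    {overall:.2f}x")      # ~2.02x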

As far as how many runs go, I addressed that in my dissertation. I ran everything on a 30-processor Sequent, and did a statistical analysis to determine how many repeats were needed for each number of processors to get the SD down to something acceptable. For 2 and 4 it doesn't take very many at all; 4 runs for 4 cpus is reasonable. For 8 processors, at least 8 for a _good_ average. And for 16, double again and the numbers are good.
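A sketch of that kind of repeat-count analysis, with invented pilot data: measure the spread of the speedups for one processor count, then size the number of runs so the standard error of the mean drops below a chosen target.

Code: Select all

# Sketch of a repeat-count analysis. The pilot speedup values below are
# invented for illustration; only the method matters.
import math, statistics

pilot_8cpu = [5.1, 6.4, 4.8, 7.2, 5.9, 6.1, 5.4, 6.8]   # hypothetical 8-cpu speedups
target_sem = 0.2                                          # desired standard error of the mean

sd = statistics.stdev(pilot_8cpu)
needed = math.ceil((sd / target_sem) ** 2)    # SEM = sd / sqrt(n)  ->  n >= (sd / target)^2
print(f"sample sd = {sd:.2f}, runs needed for SEM <= {target_sem}: {needed}")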
Lol at some OS from the time when Linus Torvalds was not born.
You do realize the recent data was on linux in 2004? The original was done on unix in 1994. No significant difference. In fact, for threads, the Cray was significantly better with no virtual memory, -much- more efficient semaphore mechanism, etc. Your ignorance might be approaching boundless...
Try something from this millennium, you might be surprised by the results :D.
Was not 2004 from "this millennium"? Quad-socket, dual-core AMD Opteron box. Running Linux.

Now we can definitely say "boundless". You have no clue what has been done, what has been written, or even what this experiment was about. Which is no less than I expected, of course...

Keep trying, so far batting average is 0.000, if you actually know what that means.
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Deep Blue vs Rybka

Post by Milos »

bob wrote:So, up until 1995 (or whenever that was published) no one had _ever_ considered the actual speedup produced over a real game. Just random positions chosen from here and there. Several had noted that "deeper is better" for parallel search, something still true today. And YOU think that trying to answer that question is not worthwhile? No point in knowing what the speedup is in a game? I mean we don't actually use these things to play games anyway, correct?
Bob, you are just hilarious. You contradicted yourself so many times in this thread that you broke even Rolf's record.
bob, from a couple of pages ago wrote:
The only idiot here is arguing with me at the moment.
On this I could definitively agree. :D

??? You have already run these positions?
Yes I actually did. I don't go around throwing ridiculous numbers as you do.

I can show you endgame searches that reach 40 plies in < .01 seconds, the smallest amount of time I can measure inside Crafty.
That's completely irrelevant. The chance of that position occurring in an actual game is smaller than the chance of an elephant tunneling through a mouse hole. In other words, one more logical fallacy of yours.
Do you know anything about parallel tree search? You do realize there is overhead (computational, not extra nodes) to simply kick off parallel searches? And of course there is the extra node overhead caused by incorrect move ordering. And in a 15 ply search, in simple positions, there is more overhead involved in getting the split done than there is in doing the split search?
And that overhead did not exist in 1994 in an 11-ply search in DTS???
Again you contradict yourself. You are in a war with the simplest logic since you cannot grasp even the simplest thing.
If 15 plies is meaningless, then your 1994 paper with 11-ply depth is even more meaningless.
In the C90, in 1994, we were doing 11 ply searches in 3 minutes. Today, on an 8-core, I see 24-26-28 depending. 15 sounds pretty miniscule, eh?
Time is completely irrelevant. Another one of your logical fallacies.
Reaching depth 11 in 100 hours of computation on a processor running on a 30kHz clock, or reaching the same depth on the same processor running at 6GHz in 1.8 sec, is all the same.

Funny. I've been playing chess for 50 years now. I don't play many 200 move games. In fact I don't recall having played _any_ games over 100 moves long in all the tournaments and skittles games I have played. I excluded the first 10 moves or so because those were book moves. Hard to get a parallel speedup there, eh? And I excluded that last 10-15 moves because the game was over and the search speedup was greatly inflated once mate is found. Seems like the only viable way to answer the question, given the difficulty of accessing a 70 million buck machine in 1994 when the actual game was played in the ACM tournament.
This is just a lousy excuse, which does not make your 1994 paper results any less crappy.

Now you factor out search overhead because you always stop after N nodes.
This is simply not true. The time to reach N nodes will be different each time you run the test, the search tree will be different each time, and there is no such thing as "factoring out" the overhead.

You do realize the recent data was on linux in 2004? The original was done on unix in 1994. No significant difference. In fact, for threads, the Cray was significantly better with no virtual memory, -much- more efficient semaphore mechanism, etc. Your ignorance might be approaching boundless...
Thanks for exactly proving my point (and contradicting yourself again in the process :lol:)
Since the Cray was better, it means the spread of uncertainty in the time to reach depth was much smaller than it is on today's OSs, which means that your 4 runs per position on a 4-node machine are not sufficient today and were not sufficient in 2004 either.
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Deep Blue vs Rybka

Post by Milos »

Don wrote:Just to add random noise to this conversation, the way I measure the effect of some change on the speed in Komodo is to run off some fixed depth games and note the time for the first N ply, usually to ply 140. My tester measures this.

I have to run enough games that the number settles down which happens after 100 or 200 games.

So this is a test I do when I want to know if some selectivity change (for instance) really gave much of a speedup.

I have tried just running 100 or 200 random positions but I think this is better.
Of course it's better. It is a very good way of easily getting a representative sample. 100 games x 140 plies = 14000 representative positions ("only" 350 times more than what Bob is using).
And fixed depth means that all tree sizes are included and TC factors are excluded from the equation.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Deep Blue vs Rybka

Post by bob »

Don wrote:Just to add random noise to this conversation, the way I measure the effect of some change on the speed in Komodo is to run off some fixed depth games and note the time for the first N ply, usually to ply 140. My tester measures this.

I have to run enough games that the number settles down which happens after 100 or 200 games.

So this is a test I do when I want to know if some selectivity change (for instance) really gave much of a speedup.

I have tried just running 100 or 200 random positions but I think this is better.
Note that my "random positions" are not random at all. They are consecutive btm positions from a single game. On the C90 testing was limited. Today, not so much. Today, I simply make an SMP change, and then play 30K games using SMP Crafty. If the Elo goes up, the SMP change is good. If the Elo goes down, it is not... Much simpler, and since I can play 70 8-thread games at a time, it goes reasonably quickly.

The testing we are discussing here originated on the C90, using a _long_ game at 16 cpus for the base values, then backing up to 8, 4, 2 and eventually 1 cpu to compute speedups. This is not the way I test anything today, but it was all that could be done using that hardware.
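As a rough sketch of that accept/reject test (the result below is invented), the match score of the changed version against the unchanged one converts to an Elo difference in the usual way:

Code: Select all

# Convert a match score into an Elo difference. The game counts are a
# hypothetical outcome of a 30,000-game SMP test, not real data.
import math

def elo_diff(wins, losses, draws):
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    return 400.0 * math.log10(score / (1.0 - score))

print(f"Elo difference: {elo_diff(10600, 10200, 9200):+.1f}")   # ~ +4.6 for this score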
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Deep Blue vs Rybka

Post by bob »

Milos wrote:
bob wrote:So, up until 1995 (or whenever that was published) no one had _ever_ considered the actual speedup produced over a real game. Just random positions chosen from here and there. Several had noted that "deeper is better" for parallel search, something still true today. And YOU think that trying to answer that question is not worthwhile? No point in knowing what the speedup is in a game? I mean we don't actually use these things to play games anyway, correct?
Bob, you are just hilarious. You contradicted yourself so many times in this thread that you broke even Rolf's record.
Only in _your_ mind. Which really doesn't matter to me.
bob, from a couple of pages ago wrote:
The only idiot here is arguing with me at the moment.
On this I could definitively agree. :D

??? You have already run these positions?
Yes I actually did. I don't go around throwing ridiculous numbers as you do.
Small lie, here? there are _not_ 40 positions. :) Of course if you didn't look at the paper, you wouldn't know that. So read it again. There were actually 24 that were used. Book move positions were excluded. Mate-in-N at the end were excluded. So you actually ran the non-existent positions, you say? And you want to talk about credibility and honesty? May as well remove the suspense and post the FEN for the _24_ positions now, which can be found in the paper I gave the link to:

Code: Select all

r2qkbnr/ppp2p1p/2n5/3P4/2BP1pb1/2N2p2/PPPQ2PP/R1B2RK1 b kq -  ; bm Na5
r2qkbnr/ppp2p1p/8/nB1P4/3P1pb1/2N2p2/PPPQ2PP/R1B2RK1 b kq - ; bm c6
r2qkbnr/pp3p1p/2p5/nB1P4/3P1Qb1/2N2p2/PPP3PP/R1B2RK1 b kq - ; bm cxb5
r2qkb1r/pp3p1p/2p2n2/nB1P4/3P1Qb1/2N2p2/PPP3PP/R1B1R1K1 b kq - ; bm Kd7
r2q1b1r/pp1k1p1p/2P2n2/nB6/3P1Qb1/2N2p2/PPP3PP/R1B1R1K1 b - - ; bm bxc6
r2q1b1r/p2k1p1p/2p2n2/nB6/3PNQb1/5p2/PPP3PP/R1B1R1K1 b - - ; bm f2+
r2q1b1r/p2k1p1p/2p5/nB6/3Pn1Q1/5p2/PPP3PP/R1B1R1K1 b - - ; bm Kc7
r2q1b1r/p1k2p1p/2p5/nB6/3PR1Q1/5p2/PPP3PP/R1B3K1 b - - ; bm cxb5
r2q1b1r/p1k2p1p/8/np6/3PR3/5Q2/PPP3PP/R1B3K1 b - - ; bm Bd6
r4b1r/p1kq1p1p/8/np6/3P1R2/5Q2/PPP3PP/R1B3K1 b - - ; bm Be7
r6r/p1kqbR1p/8/np6/3P4/5Q2/PPP3PP/R1B3K1 b - - ; bm Raf8
5r1r/p1kqbR1p/8/np6/3P1B2/5Q2/PPP3PP/R5K1 b - - ; bm Kb6
5r1r/p2qbR1p/1k6/np2B3/3P4/5Q2/PPP3PP/R5K1 b - - ; bm Rxf7
5rr1/p2qbR1p/1k6/np2B3/3P4/2P2Q2/PP4PP/R5K1 b - - ; bm Qe8
5rr1/p2qbR1p/1kn5/1p2B3/3P4/2P2Q2/PP4PP/4R1K1 b - - ; bm Rxf7
4qRr1/p3b2p/1kn5/1p2B3/3P4/2P2Q2/PP4PP/4R1K1 b - - ; bm Rxf8
5qr1/p3b2p/1kn5/1p1QB3/3P4/2P5/PP4PP/4R1K1 b - - ; bm Qd8
5q2/p3b2p/1kn5/1p1QB1r1/P2P4/2P5/1P4PP/4R1K1 b - - ; bm bxa4
5q2/p3b2p/1kn5/3QB1r1/p1PP4/8/1P4PP/4R1K1 b - - ; bm Nxe5
5q2/p3b2p/1k6/3QR1r1/p1PP4/8/1P4PP/6K1 b - - ; bm Rxe5
5q2/p3b2p/1k6/4Q3/p1PP4/8/1P4PP/6K1 b - - ; bm a6
3q4/p3b2p/1k6/2P1Q3/p2P4/8/1P4PP/6K1 b - - ; bm Kb5
3q4/p3b2p/8/1kP5/p2P4/8/1P2Q1PP/6K1 b - - ; bm Kb4
3q4/p3b2p/8/2P5/pk1P4/3Q4/1P4PP/6K1 b - - ; bm Qd5
The "bm" is not guaranteed to be the "best move". It is the necessary move so that the next position will follow the game.

What to say now that you have been caught in a false statement???


I can show you endgame searches that reach 40 plies in < .01 seconds, the smallest amount of time I can measure inside Crafty.
That's completely irrelevant. The chance of that position occurring in an actual game is smaller than the chance of an elephant tunneling through a mouse hole. In other words, one more logical fallacy of yours.
You never know when to quit while you are ahead, do you? Here's a 32 ply search in .1 seconds from a game played on ICC recently. I am just skimming a few log files to see how many elephants have gone thru mouseholes in the last week.


Looks like a few elephants get through mouseholes regularly, for those that know our butt from a hatband. A few searches grabbed from log files of games played on ICC this week:

Code: Select all

               32->   0.13   0.01   44. b7 Rb2 45. Ke1 Kf8 46. Kd1 g4 47.
                                    Kc1 Rb6 48. Bd5 Ke7 49. g3 Kd6 50.
                                    Bxf7 Rxb7 51. Bh5 Rc7+ 52. Kd2 Rg7
                                    53. f3 gxf3 54. Bxf3 <HT>

               32     0.06   0.01   38. Be2 Nc1 39. Bc4 Re4 40. b6 Rxc4
                                    41. b7 Rb4 42. b8=Q+ Rxb8 43. Bxb8
                                    Kg7 44. Bf4 Nd3 45. Bd2 g5 46. Ke2
                                    Nc5 47. Bb4 Ne6 48. Bd6 f5 49. Ke3
                                    Kg6 50. Kd3 Kh5 51. Ke3 Kg4 52. h3+
                                    Kh4 53. Bg3+ Kh5 <HT>

               40     0.85   4.19   91. ... Ra6 92. Kb4 Ra1 93. h6+ Kh7
                                    94. Kc4 Rc1+ 95. Kd3 Ra1 96. Kd2 Ra2+
                                    97. Kc3 Ra4 98. Kb3 Ra1 99. Kb4 Rb1+
                                    100. Ka4 Ra1+ 101. Kb3 Ra6 102. Kc4
                                    Ra1 103. Kb5 Rb1+ 104. Kc6 Rc1+ 105.
                                    Kd7 Rd1+ 106. Ke6 Re1+ 107. Kf7 Ra1
                                    108. Ke7 Re1+ 109. Kd8 Rd1+ 110. Kc8
                                    Rc1+ 111. Kb7 Rb1+ 112. Kc7 Rc1+ 113.
                                    Kd6 Rd1+ 114. Ke5 Re1+ 115. Kf5 Rf1+
                                    116. Ke4 Re1+ 117. Kd4 Ra1 118. Kd3
                                    Rd1+ 119. Kc2 Ra1 120. Kc3 Kxh6 

               40     0.45   4.26   91. ... Rb1+ 92. Ka4 Ra1+ 93. Kb3 Ra5
                                    94. Kb2 Rb5+ 95. Kc2 Ra5 96. Kd3 Ra3+
                                    97. Ke2 Ra5 98. Kd2 Ra2+ 99. Ke3 Ra5
                                    100. Ke4 <HT>

               48->   0.37   4.26   90. ... Ra3+ 91. Ke2 Ra5 92. Kd2 Ra2+
                                    93. Ke3 Ra5 94. Ke4 Ra3 95. Kf5 <HT>

That should make the point. If you want the log files, I'll be happy to put 'em on my ftp box. Yet again, your ignorance of current program and hardware capabilities is beyond belief. Those came from the first two games I looked at that lasted for more than 60 moves. There are dozens more. And in those games there are dozens of searches like the above that show elephants can elongate themselves farther than anyone thought possible.


Do you know anything about parallel tree search? You do realize there is overhead (computational, not extra nodes) to simply kick off parallel searches? And of course there is the extra node overhead caused by incorrect move ordering. And in a 15 ply search, in simple positions, there is more overhead involved in getting the split done than there is in doing the split search?
And that overhead did not exist in 1994 in an 11-ply search in DTS???
No, you just continue to expose more and more ignorance of parallel search. You did know that Crafty was doing parallel searches back when we could only do 9 ply searches in the middlegame, right, on the quad P6/200 type machines? You do realize one can tune a parallel search to the shape/size of the tree? You do know that every time we have a quantum leap in hardware speed, or number of cores, or in selective search, that I have to carefully re-tune parallel search parameters inside Crafty to limit overhead, right? There is so much you don't know...
Again you contradict yourself. You are in a war with the simplest logic since you cannot grasp even the simplest thing.
If 15 plies is meaningless, then your 1994 paper with 11-ply depth is even more meaningless.
So in addition to everything else you don't know, you don't know that in 1994 the effective branching factor was 5+? While in 2010 it is 2+. In 2004 when these opteron tests were run, it was about 3 to 3.5 for crafty. You do realize that effective branching factor has a profound effect on how a parallel search must be conducted, right? You do realize that if one splits too close to the tips, overhead goes up, particularly if the tips disappear right and left due to forward pruning, leaving only the overhead of doing the split. You did know all of that, right?

In the C90, in 1994, we were doing 11 ply searches in 3 minutes. Today, on an 8-core, I see 24-26-28 depending. 15 sounds pretty miniscule, eh?
Time is completely irrelevant. Another one of your logical fallacies.
Reaching depth 11 in 100 hours of computation on a processor running on a 30kHz clock, or reaching the same depth on the same processor running at 6GHz in 1.8 sec, is all the same.
You do know that the C90 ran at 4.1ns, or about 250MHz. It had 16 CPUs. Each CPU could read four 64-bit words and write two 64-bit words in one cycle, right? You realize that turns into 250M * 16 * 48 bytes per second. Fully pipelined. Right? Which means that memory copying cost us almost nothing. Today it costs a fortune in terms of wasted clock cycles. The shape of the trees is different. Our middlegame trees today look more like our endgame trees in 1994 because of current selective search, pruning, reductions and such. Of course, an expert in parallel tree search would understand how the shape of the tree, the hardware constraints, all play into tuning AND design, and influence how things are done for that particular platform. And now, even you know it since I just told you, if you were able to follow.
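Spelling out that bandwidth arithmetic with the figures quoted above (a 4.1ns clock, 16 CPUs, four 64-bit reads plus two 64-bit writes per CPU per cycle):

Code: Select all

# Bandwidth arithmetic from the paragraph above (C90 figures as quoted).
cycle_ns = 4.1
clock_hz = 1e9 / cycle_ns                 # ~244 MHz, "about 250MHz"
bytes_per_cycle = (4 + 2) * 8             # 48 bytes per CPU per cycle
per_cpu = clock_hz * bytes_per_cycle      # ~11.7 GB/s per CPU
total = per_cpu * 16                      # ~187 GB/s aggregate
print(f"per CPU: {per_cpu / 1e9:.1f} GB/s, 16 CPUs: {total / 1e9:.0f} GB/s")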


Funny. I've been playing chess for 50 years now. I don't play many 200 move games. In fact I don't recall having played _any_ games over 100 moves long in all the tournaments and skittles games I have played. I excluded the first 10 moves or so because those were book moves. Hard to get a parallel speedup there, eh? And I excluded that last 10-15 moves because the game was over and the search speedup was greatly inflated once mate is found. Seems like the only viable way to answer the question, given the difficulty of accessing a 70 million buck machine in 1994 when the actual game was played in the ACM tournament.
This is just a lousy excuse, which does not make your 1994 paper results any less crappy.
Nice, factual argument. I give specifics. I give data. I give examples. You give zip. nada. One needs information to argue. You should try getting some, first.
Now you factor out search overhead because you always stop after N nodes.
This is simply not true. The time to reach N nodes will be different each time you run the test, the search tree will be different each time, and there is no such thing as "factoring out" the overhead.
Sure there is. When I search with 2 processors I search N extra nodes. When I search with 3 or 4, I search maybe 2N or 3N extra nodes. But I don't search them all in your example, I just stop. Giving me perfect speedups, incorrectly. You _really_ don't understand parallel search, so why on earth are you trying to look so foolish arguing about points you have no clue about???
You do realize the recent data was on linux in 2004? The original was done on unix in 1994. No significant difference. In fact, for threads, the Cray was significantly better with no virtual memory, -much- more efficient semaphore mechanism, etc. Your ignorance might be approaching boundless...
Thanks for exactly proving my point (and contradicting yourself again in the process :lol:)
Since the Cray was better, it means the spread of uncertainty in the time to reach depth was much smaller than it is on today's OSs, which means that your 4 runs per position on a 4-node machine are not sufficient today and were not sufficient in 2004 either.
"spread of uncertainty". You creating your own personal vocabulary today? Unix of 1994 is no different than unix of today at the process level. The only issue is statistical variance. One _can_ take observations (speedups) and measure variance to decide how many runs you want to drive the standard deviation down to whatever level you want. 99% of the variability in a parallel engine is _not_ a function of the operating system. It has almost nothing to do with it. But anyone doing parallel search would know this. Which excludes you. BTW I can only assume that the "lol" applies to yourself? People are laughing at you... Justifiably... Essentially everything you posted is either wrong or imaginary...
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Deep Blue vs Rybka

Post by Milos »

bob wrote:Small lie, here? there are _not_ 40 positions. :) Of course if you didn't look at the paper, you wouldn't know that. So read it again. There were actually 24 that were used. Book move positions were excluded. Mate-in-N at the end were excluded. So you actually ran the non-existent positions, you say? And you want to talk about credibility and honesty?
I ran 40 consecutive positions starting from move number 9 btm, ending with move 29 wtm. You ran only 24 black positions. So I actually ran 4 positions short. However, I've now checked those remaining 4 and they are also above 0.1s time to reach depth 15. So you're doing nothing but trying to change the subject in order to avoid questions that make you look stupid. Obviously unsuccessfully.
Looks like a few elephants get through mouseholes regularly, for those that know our butt from a hatband. A few searches grabbed from log files of games played on ICC this week:

Code: Select all

               32->   0.13   0.01   44. b7 Rb2 45. Ke1 Kf8 46. Kd1 g4 47.
                                    Kc1 Rb6 48. Bd5 Ke7 49. g3 Kd6 50.
                                    Bxf7 Rxb7 51. Bh5 Rc7+ 52. Kd2 Rg7
                                    53. f3 gxf3 54. Bxf3 <HT>

               32     0.06   0.01   38. Be2 Nc1 39. Bc4 Re4 40. b6 Rxc4
                                    41. b7 Rb4 42. b8=Q+ Rxb8 43. Bxb8
                                    Kg7 44. Bf4 Nd3 45. Bd2 g5 46. Ke2
                                    Nc5 47. Bb4 Ne6 48. Bd6 f5 49. Ke3
                                    Kg6 50. Kd3 Kh5 51. Ke3 Kg4 52. h3+
                                    Kh4 53. Bg3+ Kh5 <HT>

               40     0.85   4.19   91. ... Ra6 92. Kb4 Ra1 93. h6+ Kh7
                                    94. Kc4 Rc1+ 95. Kd3 Ra1 96. Kd2 Ra2+
                                    97. Kc3 Ra4 98. Kb3 Ra1 99. Kb4 Rb1+
                                    100. Ka4 Ra1+ 101. Kb3 Ra6 102. Kc4
                                    Ra1 103. Kb5 Rb1+ 104. Kc6 Rc1+ 105.
                                    Kd7 Rd1+ 106. Ke6 Re1+ 107. Kf7 Ra1
                                    108. Ke7 Re1+ 109. Kd8 Rd1+ 110. Kc8
                                    Rc1+ 111. Kb7 Rb1+ 112. Kc7 Rc1+ 113.
                                    Kd6 Rd1+ 114. Ke5 Re1+ 115. Kf5 Rf1+
                                    116. Ke4 Re1+ 117. Kd4 Ra1 118. Kd3
                                    Rd1+ 119. Kc2 Ra1 120. Kc3 Kxh6 

               40     0.45   4.26   91. ... Rb1+ 92. Ka4 Ra1+ 93. Kb3 Ra5
                                    94. Kb2 Rb5+ 95. Kc2 Ra5 96. Kd3 Ra3+
                                    97. Ke2 Ra5 98. Kd2 Ra2+ 99. Ke3 Ra5
                                    100. Ke4 <HT>

               48->   0.37   4.26   90. ... Ra3+ 91. Ke2 Ra5 92. Kd2 Ra2+
                                    93. Ke3 Ra5 94. Ke4 Ra3 95. Kf5 <HT>

Let's see. You take samples, obviously from late endgames.
So how is anything of that applicable to your sample of positions? You have none of those cases in your positions. You actually intentionally avoided them. So what's the purpose of those special cases except for (again unsuccessfully) trying to avoid the topic?

You did know that Crafty was doing parallel searches back when we could only do 9 ply searches in the middlegame, right, on the quad P6/200 type machines? You do realize one can tune a parallel search to the shape/size of the tree? You do know that every time we have a quantum leap in hardware speed, or number of cores, or in selective search, that I have to carefully re-tune parallel search parameters inside Crafty to limit overhead, right? There is so much you don't know...
And on the previous page you say:
The more aggressive pruning, and the reduction stuff might even require some more tuning, since I have not tested nor tuned SMP search since the more aggressive stuff was added this past year or so.
So you are finally accepting more reduction, after years of stubbornness, but you obviously don't find it important to tune SMP search. So when was the last time you really tuned it, in Crafty 10?? :lol:
Sure there is. When I search with 2 processors I search N extra nodes. When I search with 3 or 4, I search maybe 2N or 3N extra nodes. But I don't search them all in your example, I just stop. Giving me perfect speedups, incorrectly. You _really_ don't understand parallel search, so why on earth are you trying to look so foolish arguing about points you have no clue about???
Perfect speedups you say?? :)
Let me remind you of your post in the other thread:
bob contradicting himself for a millionth time wrote: (1) split cost. Non-trivial copying a lot of data to various split blocks. Totally unavoidable, one just has to try to limit the number of splits.

(2) search overhead. Every time you do a split, and one of the threads gets a fail high, some (or all) of what the other threads have done is wasted. How much overhead is difficult to project. If you generate 40 moves, and for some reason the fail-high move is at slot 20, you search move one serially to satisfy the YBWC, then you split and start the search. If any moves past slot 20 are searched, partially or completely, by the time move 20 fails high, that is wasted effort.

(3) synchronization. Locks are generally used. It is possible to write lockless code, and various algorithms are given in parallel programming books, assuming you can guarantee the order of memory reads and writes. This is not true for all processors today, and there is no guarantee it will be true even for Intel in the future. So it is dangerous. But in any case, locks are sequential, by definition, and factor directly into Amdahl's law, where the first two items do not necessarily.

(4) hardware. If you share things, and they end up close in memory, they can share a cache block. And you can generate a ton of cache coherency traffic keeping up with who has the most current copy of something. This is difficult to measure unless you have some software tools. AMD has some slick stuff, but it is non-public, or at least was the last time they ran it for me. There are other issues such as memory conflicts, bus conflicts, does each processor have its own cache (L1 and L2) or a local L1 and a shared L2, or a local L1 and a L2 shared between 2 or 4 cores? Each changes performance by adding some unknown serialization.
Overhead from points (1), (3) and (4) can be measured with time to a fixed number of nodes. Again you are contradicting yourself. You are really a champion.
"spread of uncertainty". You creating your own personal vocabulary today? Unix of 1994 is no different than unix of today at the process level. The only issue is statistical variance. One _can_ take observations (speedups) and measure variance to decide how many runs you want to drive the standard deviation down to whatever level you want. 99% of the variability in a parallel engine is _not_ a function of the operating system. It has almost nothing to do with it. But anyone doing parallel search would know this. Which excludes you.
It's certainly not only the OS, but it is mostly OS+hardware. On one side you have a Cray, on the other some modern quad.
So what is going to be your next "argument": that a quad with OOE, shared L3 (as you say in the other thread), and a million other new things introduces less uncertainty than a Cray???
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Deep Blue vs Rybka

Post by bob »

Milos wrote:
bob wrote:Small lie, here? there are _not_ 40 positions. :) Of course if you didn't look at the paper, you wouldn't know that. So read it again. There were actually 24 that were used. Book move positions were excluded. Mate-in-N at the end were excluded. So you actually ran the non-existent positions, you say? And you want to talk about credibility and honesty?
I ran 40 consecutive positions starting from move number 9 btm, ending with move 29 wtm. You ran only 24 black positions. So I actually ran 4 positions short. However, I've now checked those remaining 4 and they are also above 0.1s time to reach depth 15. So you're doing nothing but trying to change the subject in order to avoid questions that make you look stupid. Obviously unsuccessfully.
Your test is broken. The intent was to take the moves, as played by the program, connect them thru the game order via the transposition table, to see how the speedup happens in a game. Again, _all_ of this is spelled out in the paper. Generally, when one is quibbling about data, one tries to run the _same_ experiment to either verify or disprove the numbers. Except for you, of course...


Looks like a few elephants get through mouseholes regularly, for those that know our butt from a hatband. A few searches grabbed from log files of games played on ICC this week:

Code: Select all

               32->   0.13   0.01   44. b7 Rb2 45. Ke1 Kf8 46. Kd1 g4 47.
                                    Kc1 Rb6 48. Bd5 Ke7 49. g3 Kd6 50.
                                    Bxf7 Rxb7 51. Bh5 Rc7+ 52. Kd2 Rg7
                                    53. f3 gxf3 54. Bxf3 <HT>

               32     0.06   0.01   38. Be2 Nc1 39. Bc4 Re4 40. b6 Rxc4
                                    41. b7 Rb4 42. b8=Q+ Rxb8 43. Bxb8
                                    Kg7 44. Bf4 Nd3 45. Bd2 g5 46. Ke2
                                    Nc5 47. Bb4 Ne6 48. Bd6 f5 49. Ke3
                                    Kg6 50. Kd3 Kh5 51. Ke3 Kg4 52. h3+
                                    Kh4 53. Bg3+ Kh5 <HT>

               40     0.85   4.19   91. ... Ra6 92. Kb4 Ra1 93. h6+ Kh7
                                    94. Kc4 Rc1+ 95. Kd3 Ra1 96. Kd2 Ra2+
                                    97. Kc3 Ra4 98. Kb3 Ra1 99. Kb4 Rb1+
                                    100. Ka4 Ra1+ 101. Kb3 Ra6 102. Kc4
                                    Ra1 103. Kb5 Rb1+ 104. Kc6 Rc1+ 105.
                                    Kd7 Rd1+ 106. Ke6 Re1+ 107. Kf7 Ra1
                                    108. Ke7 Re1+ 109. Kd8 Rd1+ 110. Kc8
                                    Rc1+ 111. Kb7 Rb1+ 112. Kc7 Rc1+ 113.
                                    Kd6 Rd1+ 114. Ke5 Re1+ 115. Kf5 Rf1+
                                    116. Ke4 Re1+ 117. Kd4 Ra1 118. Kd3
                                    Rd1+ 119. Kc2 Ra1 120. Kc3 Kxh6 

               40     0.45   4.26   91. ... Rb1+ 92. Ka4 Ra1+ 93. Kb3 Ra5
                                    94. Kb2 Rb5+ 95. Kc2 Ra5 96. Kd3 Ra3+
                                    97. Ke2 Ra5 98. Kd2 Ra2+ 99. Ke3 Ra5
                                    100. Ke4 <HT>

               48->   0.37   4.26   90. ... Ra3+ 91. Ke2 Ra5 92. Kd2 Ra2+
                                    93. Ke3 Ra5 94. Ke4 Ra3 95. Kf5 <HT>

Let's see. You take samples, obviously from late endgames.
So how is anything of that applicable to your sample of positions? You have none of those cases in your positions. You actually intentionally avoided them. So what's the purpose of those special cases except for (again unsuccessfully) trying to avoid the topic?
Just go back to my comment, look at the context, and all will become clear. Here it is, for reference:
That's completely irrelevant. The chance of that position occurring in an actual game is smaller than the chance of an elephant tunneling through a mouse hole. In other words, one more logical fallacy of yours.
"in an actual game." Not "In the part of the game you used in your DTS paper." Who, exactly, looks stupid here? I gave you a couple of searches from different games, actually played within the last day or two on ICC. You _would_ call those "actual games" I would think? And Crafty _did_ do a 40 ply search in a tiny fraction of a second? But it "never happens" according to you.

As a pitcher, you keep throwing fastballs over the plate, high and on the outside corner, and I am going to keep slapping 'em over the scoreboard in center field. Your ERA must be up to over 1,000 by now... You could call it a game and quit. Or keep getting knocked out of the park. Your choice.

You did know that Crafty was doing parallel searches back when we could only do 9 ply searches in the middlegame, right, on the quad P6/200 type machines? You do realize one can tune a parallel search to the shape/size of the tree? You do know that every time we have a quantum leap in hardware speed, or number of cores, or in selective search, that I have to carefully re-tune parallel search parameters inside Crafty to limit overhead, right? There is so much you don't know...
And on the previous page you say:
The more aggressive pruning, and the reduction stuff might even require some more tuning, since I have not tested nor tuned SMP search since the more aggressive stuff was added this past year or so.
So you are finally accepting more reduction, after years of stubbornness, but you obviously don't find it important to tune SMP search. So when was the last time you really tuned it, in Crafty 10?? :lol:
"accepting more reduction, after years of stubbornness"??? Why don't you enlighten me as to when I started using reductions? Or should I enlighten you? Perhaps just a hint to start with. Shortly after Fruit came out. And you did read the "I have not tuned it in the last year so"? I assume "lol" means, once again, you are laughing at your own ignorance? But, for the record:

* 23.0 Essentially a cleaned up 22.9 version. Comments have been *
* reviewed to make sure they are consistent with what is actually *
* done in the program. Major change is that the random numbers *
* used to produce the Zobrist hash signature are now statically *
* initialized which eliminates a source of compatibility issues *
* where a different stream of random numbers is produced if an *
* architecture has some feature that changes the generator, such *
* as a case on an older 30/36 bit word machine. The issue with *
* this change is that the old binary books are not compatible and *
* need to be re-created with the current random numbers. The *
* "lockless hash table" idea is back in. It was removed because *
* the move from the hash table is recognized as illegal when this *
* is appropriate, and no longer causes crashes. However, the above *
* pawn hash issue showed that this happens enough that it is better *
* to avoid any error at all, including the score, for safety. We *
* made a significant change to the parallel search split logic in *
* this version. We now use a different test to limit how near the *

That's from main.c in the Crafty source. Notice the version? Not quite 10.0, now was it? I am working on 23.4. This was done in 23.0. You really don't have a clue about anything, do you?
Sure there is. When I search with 2 processors I search N extra nodes. When I search with 3 or 4, I search maybe 2N or 3N extra nodes. But I don't search them all in your example, I just stop. Giving me perfect speedups, incorrectly. You _really_ don't understand parallel search, so why on earth are you trying to look so foolish arguing about points you have no clue about???
Perfect speedups you say?? :)

This is simple. If I search a tree with 1 cpu, to a fixed depth, it searches N nodes. If I search it to the _same_ depth, using 2 cpus, it generally searches _more_ nodes. Crafty keeps both CPUs busy searching almost perfectly 100% of the time. So why isn't the speedup 2x? Because of those extra nodes. Example, first:

Code: Select all

log.001:              time=3.06  mat=0  n=11455392  fh=94%  nps=3.7M
log.002:              time=1.88  mat=0  n=12375352  fh=94%  nps=6.6M
Notice the "n=some-number": 2 cpus search a larger tree most of the time. This is called "parallel search overhead." You want to use a fixed node search, which means that each of the above will search exactly the same number of nodes. The second will stop _before_ the complete 20 ply search is done (both above searched the same position to depth=20). The speedup will be reported as almost exactly 2.0. And it would be wrong. In reality, the speedup should be 2.0 for a perfect algorithm. But in the above it is significantly less because of (a) that extra 1M nodes searched by the 2 cpu program and (b) the search is not really deep enough to get going very well. NPS should be double. Run longer, and the 2 cpu nps will climb to near 2x the 1 cpu nps. But since the tree grows as well, it can't search it twice as fast. Unless you use your flawed (node count) idea.

If you don't get it, you don't get it.
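Working through the two log lines quoted above makes the same point numerically: the real fixed-depth speedup is the time ratio, and the gap from 2.0x is mostly the extra nodes the 2-cpu search had to visit.

Code: Select all

# Numbers taken from the two log lines quoted above (1 cpu vs 2 cpus,
# same position, same depth).
one_cpu = {"time": 3.06, "nodes": 11_455_392}
two_cpu = {"time": 1.88, "nodes": 12_375_352}

speedup = one_cpu["time"] / two_cpu["time"]
overhead = two_cpu["nodes"] / one_cpu["nodes"] - 1.0
print(f"observed speedup:        {speedup:.2f}x")    # ~1.63x, not 2.0x
print(f"search overhead (nodes): {overhead:.1%}")    # ~8% extra nodes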
Let me remind you of your post in the other thread:
bob contradicting himself for a millionth time wrote: (1) split cost. Non-trivial copying a lot of data to various split blocks. Totally unavoidable, one just has to try to limit the number of splits.

(2) search overhead. Every time you do a split, and one of the threads gets a fail high, some (or all) of what the other threads have done is wasted. How much overhead is difficult to project. If you generate 40 moves, and for some reason the fail-high move is at slot 20, you search move one serially to satisfy the YBWC, then you split and start the search. If any moves past slot 20 are searched, partially or completely, by the time move 20 fails high, that is wasted effort.

(3) synchronization. Locks are generally used. It is possible to write lockless code, and various algorithms are given in parallel programming books, assuming you can guarantee the order of memory reads and writes. This is not true for all processors today, and there is no guarantee it will be true even for Intel in the future. So it is dangerous. But in any case, locks are sequential, by definition, and factor directly into Amdahl's law, where the first two items do not necessarily.

(4) hardware. If you share things, and they end up close in memory, they can share a cache block. And you can generate a ton of cache coherency traffic keeping up with who has the most current copy of something. This is difficult to measure unless you have some software tools. AMD has some slick stuff, but it is non-public, or at least was the last time they ran it for me. There are other issues such as memory conflicts, bus conflicts, does each processor have its own cache (L1 and L2) or a local L1 and a shared L2, or a local L1 and a L2 shared between 2 or 4 cores? Each changes performance by adding some unknown serialization.
Overhead from points (1), (3) and (4) can be measured with time to a fixed number of nodes. Again you are contradicting yourself. You are really a champion.
On a dual-cpu system, they are minimal. The test I ran has shared L2 on a core-2 (above). So no cache coherency traffic, minimal memory conflicts. Search overhead, on the other hand, remains. Most would understand that if you go to 8 cores, with either 2, 4 or 8 L2 caches (depending on the specific processor) you will see varying cache issues. And memory issues. But, as I pointed out in a different thread on SMP, the above numbers represent maybe 0.1% of the total "cost". Most of it is in those extra nodes.

Which, once again, you "don't get".

I believe I have forgotten more about parallel search than you know. And unfortunately for you, I have forgotten _very_ little since I work on it daily. But don't let me stop you from continuing to try. Just keep throwing 'em high and outside and watch 'em keep going into the center field stands...


"spread of uncertainty". You creating your own personal vocabulary today? Unix of 1994 is no different than unix of today at the process level. The only issue is statistical variance. One _can_ take observations (speedups) and measure variance to decide how many runs you want to drive the standard deviation down to whatever level you want. 99% of the variability in a parallel engine is _not_ a function of the operating system. It has almost nothing to do with it. But anyone doing parallel search would know this. Which excludes you.
It's certainly not only the OS, but it is mostly OS+hardware. On one side you have a Cray, on the other some modern quad.
So what is going to be your next "argument": that a quad with OOE, shared L3 (as you say in the other thread), and a million other new things introduces less uncertainty than a Cray???

Again, 99.9% of the variability within a parallel search is the extra nodes. The other issues can be a big hurdle when you first move to an architecture with a new feature. But they can eventually be controlled. Extra nodes? Not so easy. And so far, no one has solved that.

WHACK! another one in the stands...

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Deep Blue vs Rybka

Post by bob »

Since I know you are not going to "get this" even after the last discussion, here is a pretty good example of why the speedup varies all over the place, without much influence from cache or anything else on this box (I have highly optimized the code to prevent just that, until some new hardware change requires more attention).

Here is the _same_ position (kopec #22) run to the same depth (20), using 8 cpus on my E5345 dual-cpu cluster on a single node. I just repeat the same position, same depth, over and over. Check the node counts:

Code: Select all

log.001:              time=2.77  mat=0  n=47434414  fh=95%  nps=17.1M
log.002:              time=2.39  mat=0  n=42481820  fh=95%  nps=17.8M
log.003:              time=2.51  mat=0  n=43400723  fh=95%  nps=17.3M
log.004:              time=2.93  mat=0  n=50864431  fh=94%  nps=17.4M
log.001:              time=3.04  mat=0  n=48652443  fh=94%  nps=16.0M
log.002:              time=2.56  mat=0  n=44432126  fh=94%  nps=17.4M
log.003:              time=3.30  mat=0  n=54420287  fh=94%  nps=16.5M
log.004:              time=2.61  mat=0  n=43579544  fh=94%  nps=16.7M
log.001:              time=2.77  mat=0  n=47679815  fh=95%  nps=17.2M
log.002:              time=2.84  mat=0  n=47706305  fh=95%  nps=16.8M
log.003:              time=2.86  mat=0  n=47449906  fh=94%  nps=16.6M
log.004:              time=2.55  mat=0  n=43616545  fh=95%  nps=17.1M
I ran the same position 4 times, and then repeated that 3 times. So all 12 lines above are the same position to the same depth. Notice the nodes searched. Here is the same position, same depth, using just one CPU:

Code: Select all

              time=12.88  mat=0  n=41469209  fh=95%  nps=3.2M
              time=12.93  mat=0  n=41469209  fh=95%  nps=3.2M
              time=12.90  mat=0  n=41469209  fh=95%  nps=3.2M
              time=12.95  mat=0  n=41469209  fh=95%  nps=3.2M
Nodes remain constant, as they should. Everything is constant except for a tiny bit of time jitter. That is the operating system / cache / hardware "noise" that gets injected. Best time was 12.88, worst was 12.95, a jitter of .07, which translates to 0.5% noise. If I run the positions 4 times, but longer (going to depth 22 rather than depth 20), I get this:

Code: Select all

log.001:              time=36.30  mat=0  n=119782767  fh=95%  nps=3.3M
log.002:              time=36.23  mat=0  n=119782767  fh=95%  nps=3.3M
log.003:              time=36.29  mat=0  n=119782767  fh=95%  nps=3.3M
log.004:              time=36.37  mat=0  n=119782767  fh=95%  nps=3.3M

Now, jitter is from 36.23 to 36.37, or .14, which is 0.3%. Go to our usual tournament time controls of 90-120 seconds per move, and it drops to the 0.1% level. And that is all the variability there is from the O/S. But when you toss in parallel search, the node counts go nuts and the variability becomes huge.

Anyone working with parallel search would already know this, however.  So it appears you are not familiar with the topic at all...
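Putting numbers on that contrast using the logs above: the spread of the serial timings versus the spread of the 8-cpu node counts for the same position searched to the same depth.

Code: Select all

# Values transcribed from the serial and 8-cpu logs quoted above.
import statistics

serial_times = [12.88, 12.93, 12.90, 12.95]
parallel_nodes = [
    47_434_414, 42_481_820, 43_400_723, 50_864_431,
    48_652_443, 44_432_126, 54_420_287, 43_579_544,
    47_679_815, 47_706_305, 47_449_906, 43_616_545,
]

def cv(xs):
    # coefficient of variation: standard deviation relative to the mean
    return statistics.stdev(xs) / statistics.mean(xs)

print(f"serial time jitter:         {cv(serial_times):.2%}")     # well under 1%
print(f"parallel node-count spread: {cv(parallel_nodes):.2%}")   # several percent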