YBWC: Active Reparenting

Discussion of chess software programming and technical issues.

Moderator: Ras

diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: YBWC: Active Reparenting

Post by diep »

Daniel Shawul wrote:
bob wrote: My DTS search was NOT "optimized for the Cray". It was actually developed on a non-Cray shared memory system, and it has been run on a ton of different machines. It IS optimized for shared memory machines with symmetric hardware (no true master-slave cpus and such).
OK, if you say so. I read this in your paper and it is hard to interpret it any other way:
To process the HELP command, a simple test was added to the sequential search code. After any node is processed, HELP(i) is checked, and if set, the tree state is copied to shared memory, the flag is cleared, and the search continues normally. This reduces the idle time for a processor to the time required for a processor to search one node and return to the top of Search() and then store the "tree state" for the idle processor to examine. (About 30 microseconds on a Cray C90.) After this time, an idle processor will have at least one tree state to examine for split points, making this reasonably efficient.
So you copy the tree state whenever an idle processor asks for it. How big is that tree state?
As I woke up sooner than Bob: 64 KB is what Cray Blitz copied.
But is it relevant? Maybe for the speedup of Cray Blitz.
It is not so interesting 35 years later.

Enough has been said about how Bob wrote his articles around 2002.

DTS is a real invention; don't ever forget that.

Not sure when Bob invented it, yet it's 2012 now and it still works great for up to 32 cores or so on today's hardware, giving a magnificent speedup, whereas it's not so complicated to build in a centralized manner.

I have yet to see any of us invent something that still works 35 years later.

Cilkchess and Zugzwang and Hydra and Deep Blue are far more recent creations, and those lose a factor of 40-50 hands down; Deep Blue effectively loses a lot more than that.
In the common YBW approach, working threads pick up idle ones and the actual copying is done when the idle processor is about to start the parallel search. Apparently you saw some benefit from gathering split points and picking the best one. Marco did something similar, only he checks for the best one on the fly. Why didn't you do that? Does the working processor offer a different split point than the one it is currently working on when it gets a HELP from idle processors? I can only understand the copying if that is the case.
Max theoretical bandwidth is pointless when a PC has no hope of reaching those numbers. There's absolutely nothing in DTS that I would change to use it in Crafty. I don't use it because I don't want to get rid of the clean recursive search and go to an iterated search that is significantly messier to write, test, debug, and maintain. One can look at Cray Blitz's search to see what I mean. But it would work perfectly, and has been implemented on PCs more than once by others, with no problems of any kind other than the complexity.
I got curious and did some reading on Cray Blitz, mainly from the talk by Harry Nelson. I see he is the Cray expert who used vectorization to optimize your program and gave you a much faster Attacks(). Move generation is vectorized using SIMD on short vectors (similar to current SIMD with registers). Attacks() was sped up by 75% according to him. So you (at least Harry) did do some SIMD on the Cray. The NPS figures of that era were shocking (2000, 10000, etc.), so I don't know whether findings from then are directly applicable to current hardware. That was on the Cray-1. The Cray X-MP with 2/4 processors and then the Cray Y-MP with 8 is where you developed DTS. If you had wanted to, you could have written a purely SIMD chess program on the Cray at that time, but since it is also an MIMD machine there was no need for that. NVIDIA GPUs are SIMT (not SIMD), which is somewhere in between. They are more flexible than pure SIMD machines, and that allows for applications other than what is commonly considered SIMD friendly.
I don't know what's wrong with you there. Wake up - we live in 2012.

You're comparing technology from half a century ago with the superior GPU technology we have right now from Nvidia, with another monster soon to show up from them that you measure in TERAFLOPS per GPU. Nothing on this planet can rival THAT.

Go have a look at Zugzwang 1988, OK? It's 10 years after Cray Blitz. They ran on a 512-processor box with each 'processor' being 1 MHz.

Sure, those are not processors that get the same IPC as today's ones, and the box is not symmetric multiprocessing either, yet...

How many nodes a second does the box reach in total for Zugzwang?

200.
So not 200k. Not 20k. Two hundred.

Check out the specs of the box. Its network was magnificent compared to the CPUs.

Over 2.5 million cycles per node.

Their paper claims a speedup of 50%, kind of.

After a longer study of the weird technical terminology used in that article, I concluded they divided the number of nodes searched by 1 processor by the number of nodes searched by n processors and claimed that as their speedup.

Now, as they invented (?) YBW, that's going to look pretty good of course, especially if you forget to mention it...

They forgot the factor time... so the simple algorithm of keeping 511 CPUs idle while just 1 searches would get a magnificent 100% speedup by this measure...

I hope you're not projecting the standards of those days onto today...
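To spell the distinction out (illustrative formulas only, nothing from their paper):

/* Illustrative only: real speedup is about wall-clock time; a pure node ratio
   says nothing unless the parallel NPS also scales.                          */
double time_speedup(double t_1cpu, double t_ncpu)       { return t_1cpu / t_ncpu; }
double node_ratio(double nodes_1cpu, double nodes_ncpu) { return nodes_1cpu / nodes_ncpu; }
/* time_speedup == node_ratio * (nps_ncpu / nps_1cpu), so with 511 CPUs sitting
   idle the node ratio is a perfect 1.0 while the real speedup is also just 1. */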
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: YBWC: Active Reparenting

Post by diep »

Daniel Shawul wrote:
1. 4 elo is tough to measure. Even 30,000 games gives an error bar of +/-4 so that's nowhere near enough. To get to +/- 1, you need something north of 150,000 games. Did you ACTUALLY measure accurately enough to say +4 elo?
I did not do the testing but only predicted it won't help much. Marco here did, and said that he got +4 elo from some thousand blitz games. So it does confirm my suspicion that it won't help much.
And I laughed at it and still do :)
2. I don't see any measurable idle time in my current search, and saw absolutely none in CB's search. That is not much difference, I don't see how it would add up to +4 elo...
Sure, mine also has very small idle time. That is why my NPS almost doubles on 2 processors, but has some losses on more than 8 processors. The OP thought he might gain something there, which is why I replied with my guess that it will not help much in that regard. My guess (as I stated before) is that selecting the better split point among those available may help even if there is zero reduction of idle time. To avoid confusion, by a better split point I mean one which has more work. If you can split near the root, you will have fewer splits, which means you avoid copying the search stack every time you split. Similar to the reason why we only allow parallel search above a certain remaining depth.
Diep splits down to a remaining depth of 2 plies.

No need to limit it much - just make splitting cheap.
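Purely for illustration, a minimal sketch of such a depth guard in the split decision (invented names, not Diep's actual code):

#define MIN_SPLIT_DEPTH 2   /* Diep-style: still split with only 2 plies of depth left */

/* Worth splitting here? Enough remaining depth, YBW satisfied,
   and somebody actually idle to pick up the work.              */
static int ok_to_split(int depth_left, int moves_searched, int idle_threads)
{
    if (depth_left < MIN_SPLIT_DEPTH) return 0;   /* subtree too cheap to pay the split cost */
    if (moves_searched < 1)           return 0;   /* YBW: first move must finish serially    */
    return idle_threads > 0;
}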
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: YBWC: Active Reparenting

Post by Daniel Shawul »

And I laughed at it and still do
Well then you are in a good mood :) Bob thought I did the testing, and it is fair that he asked for the error margin on the +4 Elo. But you knew that I didn't do the test. Do you test everything to +/-1 Elo even if you start out with an idea that you think won't help very much? That would be really silly.
Diep splits down to a remaining depth of 2 plies.

No need to limit it much - just make splitting cheap.
Mine is big, with the piece-list move representation and all, so at least 4 plies for me. And much more when running on clusters.
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: YBWC: Active Reparenting

Post by Daniel Shawul »

As I woke up sooner than Bob: 64 KB is what Cray Blitz copied.
But is it relevant? Maybe for the speedup of Cray Blitz.
It is not so interesting 35 years later.

Enough has been said about how Bob wrote his articles around 2002.

DTS is a real invention; don't ever forget that.

Not sure when Bob invented it, yet it's 2012 now and it still works great for up to 32 cores or so on today's hardware, giving a magnificent speedup, whereas it's not so complicated to build in a centralized manner.

I have yet to see any of us invent something that still works 35 years later.

Cilkchess and Zugzwang and Hydra and Deep Blue are far more recent creations, and those lose a factor of 40-50 hands down; Deep Blue effectively loses a lot more than that.
Vincent, I do not doubt it is a great invention. The online article about DTS is a bit confusing about what is going on with the HELP command. I made another post about when and how much is copied when a HELP is issued, but later I deleted it because I found something else in the IEEE journal which implied very little (not some 64 KB) is being copied. So I just decided to give Bob a chance to explain. My deleted post is quoted below.
Daniel wrote: First this:
online_article wrote: The HELP command is the primary signaling mechanism within DTS. Whenever a processor is "out of work" it sends the HELP command to what is effectively the entire group of active processors. The HELP command simply requests that any processors that are actively searching subtrees temporarily stop, copy the "tree state" to shared memory, and then continue searching.

As these "tree states" become available, the idle processor that initiated the HELP command analyzes each state to determine if it can find a satisfactory split point. If not, it simply re-broadcasts the HELP command.
But later you say.
online_article wrote: An interesting feature here, is that once a processor finds a viable split-point, and the Split() operation is performed, whenever any other processor becomes idle, they check for active split-points before broadcasting a HELP command. If a processor locates any valid split points, and there is work remaining at any of them, it simply attaches to the split point with the most work remaining, and does not broadcast a help command.
So clearly there is no copying if there are active split points. So the 'HELP' command, and hence the copying of tree state to shared memory, is _rarely_ done, which is the key here. Also, even after the HELP command is issued (i.e. once we determine the active split points are no good), why do we ask the busy processors to copy the tree state to shared memory? Can't the idle processor look for a good split point the way it did when testing for reattachment to active split points? Copying should be done only when we actually split, IMO; doing it while looking for a good split point is sure to consume bandwidth. The whole reason we don't split at shallow depths is to reduce the cost of splitting (i.e. copying tree state). I can understand that finding good split points was the top priority at that time because YBW had not been invented yet, so the copying while looking for split points could be justified by that.
The IEEE journal says:
Whenever a processor exhausts the work (sub-tree) that it is working on, it broadcasts a help request to all busy processors. These processors make a quick copy of the type of each node they are searching in the current sub-tree and the number of unsearched branches at each node, and give this information to the idle processor. The busy processors then resume searching where they were interrupted. The idle processor (or processors if more than one is idle) examines the data and picks the most likely split point based on [...] immediately stop and try to find more useful work to proceed with, by broadcasting a help request. As a processor finds a new best score for a split point, it shares the value with other parallel searchers at that split point to improve their AB cutoff performance. These issues are dealt with more deeply in Hyatt’s thesis [7].
So this clearly says very little is copied. This makes much more sense. Does someone have the PhD thesis, btw?
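To make the contrast concrete, a purely illustrative sketch (all names invented) of the two readings - a full tree state versus the small per-node summary the IEEE text describes:

#define MAX_PLY   64
#define MAX_MOVES 256

typedef struct {                       /* heavyweight: enough to resume the search elsewhere */
    int board[64];                     /* position, plus move lists, bounds, killers, ...    */
    int move_list[MAX_PLY][MAX_MOVES];
    int move_count[MAX_PLY];
    int alpha[MAX_PLY], beta[MAX_PLY];
} FullTreeState;                       /* easily tens of KB                                  */

typedef struct {                       /* lightweight: just enough to pick a split point     */
    unsigned char node_type[MAX_PLY];  /* PV / CUT / ALL                                     */
    unsigned char unsearched[MAX_PLY]; /* branches still to be searched at each ply          */
} SplitSummary;                        /* on the order of a hundred bytes per busy processor */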
I don't know what's wrong with you there. Wake up - we live in 2012.

You're comparing technology from half a century ago with the superior GPU technology we have right now from Nvidia, with another monster soon to show up from them that you measure in TERAFLOPS per GPU. Nothing on this planet can rival THAT.

Go have a look at Zugzwang 1988, OK? It's 10 years after Cray Blitz. They ran on a 512-processor box with each 'processor' being 1 MHz.

Sure, those are not processors that get the same IPC as today's ones, and the box is not symmetric multiprocessing either, yet...

How many nodes a second does the box reach in total for Zugzwang?

200.
So not 200k. Not 20k. Two hundred.

Check out the specs of the box. Its network was magnificent compared to the CPUs.

Over 2.5 million cycles per node.

Their paper claims a speedup of 50%, kind of.

After a longer study of the weird technical terminology used in that article, I concluded they divided the number of nodes searched by 1 processor by the number of nodes searched by n processors and claimed that as their speedup.

Now, as they invented (?) YBW, that's going to look pretty good of course, especially if you forget to mention it...

They forgot the factor time... so the simple algorithm of keeping 511 CPUs idle while just 1 searches would get a magnificent 100% speedup by this measure...

I hope you're not projecting the standards of those days onto today...
Maybe I got carried away a bit, but it sure seemed to be exciting for the participants at that time. Also, I want to get to the bottom of what the old Crays were about. Those supercomputers have very few cores, each with performance in the 1-2 gflops range. Like any other supercomputer the performance is measured with LINPACK, so it is kind of weird to have a chess application on it. Harry said less than 5% of chess is vectorizable, but he still managed to vectorize some of the move generation and attacks. The large number of registers also helped to do a number of hash probes at one time (up to 7), and to store bitboards and other things that can't easily be done on non-vector processors. If you can access the IEEE journal you may find it interesting too :)
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: YBWC: Active Reparenting

Post by diep »

Daniel Shawul wrote:
And I laughed at it and still do
Well then you are in a good mood :) Bob thought I did the testing, and it is fair that he asked for the error margin on the +4 Elo. But you knew that I didn't do the test. Do you test everything to +/-1 Elo even if you start out with an idea that you think won't help very much? That would be really silly.
If Diep had been tested to +/-10 Elo accuracy, it would of course be rated 4000+ Elo :)

I don't have the system time for that, though with the new cluster here things are a lot better than the zero test machines I had before :)
Diep splits down to a remaining depth of 2 plies.

No need to limit it much - just make splitting cheap.
Mine is big, with the piece-list move representation and all, so at least 4 plies for me. And much more when running on clusters.
Is what you do more expensive than what Crafty is doing?
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: YBWC: Active Reparenting

Post by diep »

Daniel Shawul wrote:
As I woke up sooner than Bob: 64 KB is what Cray Blitz copied.
But is it relevant? Maybe for the speedup of Cray Blitz.
It is not so interesting 35 years later.

Enough has been said about how Bob wrote his articles around 2002.

DTS is a real invention; don't ever forget that.

Not sure when Bob invented it, yet it's 2012 now and it still works great for up to 32 cores or so on today's hardware, giving a magnificent speedup, whereas it's not so complicated to build in a centralized manner.

I have yet to see any of us invent something that still works 35 years later.

Cilkchess and Zugzwang and Hydra and Deep Blue are far more recent creations, and those lose a factor of 40-50 hands down; Deep Blue effectively loses a lot more than that.
Vincent, I do not doubt it is a great invention. The online article about DTS is a bit confusing about what is going on with the HELP command. I made another post about when and how much is copied when a HELP is issued, but later I deleted it because I found something else in the IEEE journal which implied very little (not some 64 KB) is being copied. So I just decided to give Bob a chance to explain. My deleted post is quoted below.
Daniel wrote: First this:
online_article wrote: The HELP command is the primary signaling mechanism within DTS. Whenever a processor is "out of work" it sends the HELP command to what is effectively the entire group of active processors. The HELP command simply requests that any processors that are actively searching subtrees temporarily stop, copy the "tree state" to shared memory, and then continue searching.

As these "tree states" become available, the idle processor that initiated the HELP command analyzes each state to determine if it can find a satisfactory split point. If not, it simply re-broadcasts the HELP command.
But later you say.
online_article wrote: An interesting feature here, is that once a processor finds a viable split-point, and the Split() operation is performed, whenever any other processor becomes idle, they check for active split-points before broadcasting a HELP command. If a processor locates any valid split points, and there is work remaining at any of them, it simply attaches to the split point with the most work remaining, and does not broadcast a help command.
So clearly there is no copying if there are active split points. So the 'HELP' command, and hence the copying of tree state to shared memory, is _rarely_ done, which is the key here. Also, even after the HELP command is issued (i.e. once we determine the active split points are no good), why do we ask the busy processors to copy the tree state to shared memory? Can't the idle processor look for a good split point the way it did when testing for reattachment to active split points? Copying should be done only when we actually split, IMO; doing it while looking for a good split point is sure to consume bandwidth. The whole reason we don't split at shallow depths is to reduce the cost of splitting (i.e. copying tree state). I can understand that finding good split points was the top priority at that time because YBW had not been invented yet, so the copying while looking for split points could be justified by that.
The IEEE journal says:
Whenever a processor exhausts the work (sub-tree) that it is working on, it broadcasts a help request to all busy processors. These processors make a quick copy of the type of each node they are searching in the current sub-tree and the number of unsearched branches at each node, and give this information to the idle processor. The busy processors then resume searching where they were interrupted. The idle processor (or processors if more than one is idle) examines the data and picks the most likely split point based on [...] immediately stop and try to find more useful work to proceed with, by broadcasting a help request. As a processor finds a new best score for a split point, it shares the value with other parallel searchers at that split point to improve their AB cutoff performance. These issues are dealt with more deeply in Hyatt’s thesis [7].
So this clearly says very little is copied. This makes much more sense. Does someone have the PhD thesis, btw?
Yes, I have all those PhD theses, including Bob's. Search for Robert Morgan Hyatt.

You're speaking about Bob becoming world champion in the '70s. His thesis he delivered ten years or so later. 1988 or so?

In his thesis he claims a speedup of around 7.78 out of 16. Looks like a correct claim to me for what he did, and that wasn't on a Cray of course.

Why are you interested in all this old junk?

Science was at a different level back then. Lying by a factor of 50 was common. Cilk is a good example there. Total junk idiocy from a few professors that slows you down by a factor of 40+. Had Don done the SMP himself, I'm sure he wouldn't have lost that much. But then he wouldn't have had the supercomputer that Leiserson & co could get so easily by scribbling something on an A4 paper about computer chess being similar to guided missile flight...

I am very much against all that happened back then in terms of scaling claims. If your program first gets slowed down by a factor of 40 to 50, I feel one should mention that before making a speedup claim, and not in some hidden one-liner like you usually see.

Yet are you happy with the postings some make here, who basically lift 3200 Elo points from someone else and then act as if they know all about the 2 Elo points they won with some testing another dude had set up?

You cannot compare anything of today with searches of the '80s and before. First of all, they didn't have nullmove, and they had either no hashtables or total junk hashtables. Branching factors of 10+ were very common, and we speak of programs that lose a factor of 40 to 50 somewhere.

Dealing with all that today is a lot harder - of course it's also easier in the sense that up to 16 cores or so you can easily test at home.
I remember how I wrote my own SMP code back in 1998-1999. Completely at home. Later on I could log in for a while on one of Bob's machines, a quad Pentium Pro; that really fixed a lot of issues :)
In 2002-2003 I had to set up a new framework for Diep. At home I had a dual K7 machine. How do you simulate a 500-CPU search there?
On a 32-CPU partition it seemed to work OK when I was allowed to run Diep on 32 CPUs for a few minutes.
This was already December 2002. I had worked nearly full time on Diep's SMP in the months before.
I had to wait until the end of February to run on a bigger partition and use 130 CPUs or so there. This was only the opening position for a 1-hour run.
After some weeks it finally got executed. Diep got... 1000 nodes a second.
Then I realized something was wrong... and that all my assumptions had to be redone and that I had to CHANGE something in the program :)
I figured out somewhat later that one of the problems was that each thread of Diep timed itself using gettimeofday. When I removed that, then 3 months later when it ran on 130 CPUs, it scaled a lot better. The reason was a centralized clock processor. If 130 CPUs regularly go ask the time from 1 centralized processor, that chokes... so Diep could not self-measure how efficiently each CPU ran; that would backfire later on during the world champs 2003...
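For what it's worth, the usual workaround looks roughly like this (an illustrative sketch with invented names, not Diep's actual code): poll the clock only every few thousand nodes per thread.

#include <sys/time.h>

#define TIME_CHECK_INTERVAL 4096      /* nodes between clock polls, per thread */

typedef struct {
    long      nodes_since_check;
    long long start_ms;               /* set once when the search starts       */
    long long elapsed_ms;
} ThreadTimer;

/* Called from the search loop: one gettimeofday() per ~4k nodes per thread,
   so 130 threads no longer hammer a centralized clock processor.            */
static void maybe_check_time(ThreadTimer *t)
{
    struct timeval tv;

    if (++t->nodes_since_check < TIME_CHECK_INTERVAL)
        return;
    t->nodes_since_check = 0;
    gettimeofday(&tv, NULL);
    t->elapsed_ms = (long long)tv.tv_sec * 1000 + tv.tv_usec / 1000 - t->start_ms;
}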

Again, if I started 130 processes on my dual K7 at home it worked great, and it also worked great when running the opening position for a few minutes on 32 CPUs. What was the problem?

I could not expect any help from others. A few months before I had already emailed all kinds of scientists asking for technical data regarding the latencies of the machine, professor Aad v/d Steen as well. No answer. By phone I received the advice not to ask Aad or the others anything again, as they were 'too busy' with more important things in life than helping out others. After all, government money may not be used for anything useful.

The next hour I could use more than 130 CPUs of the big 512-processor partition was in September 2003, when it rebooted in between the runs of some big challenge project that was calculating the height of the sea water. They had started those calculations on the then world's fastest supercomputer, the Earth Simulator in Japan.
In the summer of 2003 they had already made a Discovery Channel series about the height of the sea water. It would rise 1 meter in the coming 100 years.
The project kept running the entire summer - when I wondered why they ran for so long, longer to my feeling than had been allotted - I saw they had had a problem. And as they occupied that many CPUs of this 512-processor partition, I could never even test on it for 1 hour and had to wait.
In the logfiles of this project I saw the problem they had had. "Oops, mistake in initialization," it said. "We initialized the sea water level 1 meter too high."
Very good scientists, you know. Brilliant guys. But the next run I had, for 1 hour, was September 2003.
Very nervously I awaited the results of the run; the actual run you couldn't see, it ran in the batch.
When I got the results back I was so, so disappointed.
"Error allocating shared memory". The default Unix function to randomize which number to get had a hash error already after 166 numbers. CPU 43 or something hashed to the same number as 166. How on planet Earth do you manage to write a hash function for Unix that gets a collision this quickly?
A few days before the world champs 2003 came the next attempt for Diep.
Again the opening position. I had dared to take 3 hours this time.
I was very worried, though, when I got the results back.
5 million nps.
On the back of a napkin I calculated that for 100% scaling I should've seen 20k nps * 500 CPUs = 10 million nps...
The world champs 2003, a few days later, was my next chance.
On Friday I got access.
Guess what.
NO INTERNET AT THE LOCATION.
Despite the fact that I had asked the organisation for internet already for that Friday.
Normally speaking I would have gone home and sued the organisation, but when using a government supercomputer you don't do that.
I arrived there at 10 AM and had to wait until 6 PM for internet.
A few minutes later we already had the order to leave the location.
The next morning I arrived again. It took 3 hours just to START Diep.
With a new opening book I started up Diep and prayed it would start in time for the first round, which was to begin 3 hours and 2 minutes later.
Exactly 1 minute before the game had to start, it had started up.
Just typing a quick 'quit' 'by accident', as I'm so used to doing, was OUT OF THE QUESTION.
I would lose then, as it would take 3 hours to start up again.
During the game it worked like crap. Initially 5 million nps, but when a move was not predicted it really worked badly. It got like 9 plies then. On a predicted next move, suddenly 17 plies.
Do you really want to compare those days, when most could hardly test anything, with today's multicore CPUs, where everyone can at least test *something*?
Do you realize that things had to work right from scratch?
Jonathan Schaeffer had had the same experience with an SGI box for Chinook there, so he told me.
SGI had sworn to me 3 times (twice on paper in my mailbox and once by phone) that the worst-case latency to remote CPUs was 960 nanoseconds.
In fact, when I tested it during the world champs 2003 on 460 CPUs, it turned out that the AVERAGE was 5.8 microseconds. Not 960 ns at all.
Worst case it was a factor of 12 difference.
This is a world of problems that you don't have with hardware you have at home.
Based upon THAT I played over there. With 2 hours of sleep a night I worked on through the night in the hotel room, blindly guessing what the problem was.
Then the first hour in the morning I was allowed to be there, waiting for the official to open the cave, get in, and put the new version on the box. Start it, and then for the next round pray you didn't break something.
In one game I had obviously broken something in a version. Instead of splitting down to a depth left of 2 plies (meaning searches of 1 ply get done in parallel in this case), I had set that limit higher, to 3+ plies.
That was against Falcon. Diep got horrible search depths.
After the round with Fritz there were a few spare hours. Finally I discovered the real problem. Again 2 hours of sleep, and from game 8 onwards Diep scaled fantastically on the supercomputer.
Yet Diep lost that game versus Fritz, which it otherwise would have WON.
So that influenced the tournament outcome a lot later on.
The emergency fix I had done back then scaled a lot better. Years later I would get a much better idea of how to fix that bottleneck, yet as I never again ran on hundreds of CPUs, I never got the chance to show it with THAT fix :)
The reason for that was a conversation after the world champs. By phone. On the other end of the line was professor Jaap v/d Herik, who had supported me fantastically.
I told him that carrying on in this amateurish manner was not possible unless it got paid. I begged for 100 euro a month as a reward for writing all those reports and working with 9 government organisations just for 1 supercomputer. Jaap was resolute. Paying for science was impossible and for sure not worth 100 euro a month.
So that stopped the supercomputer project from Diep's side.
Jaap advised me to write a management report, so that the organisations could store it in a desk drawer and let it gather dust.
Instead I wrote something publicly. One of my complaints was SGI lying about the latencies. A public reaction - to which the Dutch computer chess federation gave me no opportunity to respond - was that this person only needed to ask 'his friend Aad v/d Steen', who assured him the latency should be around 4 microseconds. I should have done that as well and shouldn't have written this nonsense, said the guy...
You know, these governments are all so amateurish, with the N*SA (also meaning the locals here) being the biggest amateurs, who only by means of some espionage look 1 IQ point more clever than IQ-100 dudes; you should really read what happened back then in a different light...

Vincent
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: YBWC: Active Reparenting

Post by bob »

Daniel Shawul wrote:
bob wrote: My DTS search was NOT "optimized for the Cray". It was actually developed on a non-Cray shared memory system, and it has been run on a ton of different machines. It IS optimized for shared memory machines with symmetric hardware (no true master-slave cpus and such).
OK, if you say so. I read this in your paper and it is hard to interpret it any other way:
To process the HELP command, a simple test was added to the sequential search code. After any node is processed, HELP(i) is checked, and if set, the tree state is copied to shared memory, the flag is cleared, and the search continues normally. This reduces the idle time for a processor to the time required for a processor to search one node and return to the top of Search() and then store the "tree state" for the idle processor to examine. (About 30 microseconds on a Cray C90.) After this time, an idle processor will have at least one tree state to examine for split points, making this reasonably efficient.
So you copy the tree state whenever an idle processor asks for it. How big is that tree state? In the common YBW approach, working threads pick up idle ones and the actual copying is done when the idle processor is about to start the parallel search. Apparently you saw some benefit from gathering split points and picking the best one. Marco did something similar, only he checks for the best one on the fly. Why didn't you do that? Does the working processor offer a different split point than the one it is currently working on when it gets a HELP from idle processors? I can only understand the copying if that is the case.
In my current YBW implementation, when I notice that a thread is idle, a searching thread stops if the current node has satisfied the YBW criterion, copies the necessary data to at least two split blocks, one for itself, one for the new "partner", and then both start to search.

In Cray Blitz, I did it differently. I let an idle process do all the work. In fact, the last version used did not have the "searching task" copy anything at all. The idle thread looked at everyone's tree state, selected the best-looking split point, and did the necessary copying itself. The only trick was to set a specific lock for the thread we are "joining" at the specific ply where we have chosen to split, so that we can do the split stuff and then let him discover "Hmm, someone split with me here" (although it would really not notice that; it would think it had always been that way).

It was a minor optimization to keep actual searching threads from doing unnecessary work while an idle thread did nothing.

The split blocks on Cray Blitz were no bigger than what is currently used on Crafty. Wonder why? :) However, I have not taken the step of rewriting my current search into an iterative version so that I could go "full-blown DTS" although it may happen one day.

You are not catching the subtle (but significant) difference.

1. If there was an active split point in Cray Blitz, and a thread became idle, it would instantly join an active split point, assuming there was still work to do there. Just as Marco is talking about. That was zero-cost. But it wasn't that common.

2. The issue of "who does the split work" is the difference here. In the later Cray Blitz versions, the idle thread did the work rather than a busy thread. Again, a relatively small optimization.
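In code terms, the zero-cost case of point 1 looks roughly like this (invented names, not the actual Cray Blitz code):

typedef struct SplitPoint {
    struct SplitPoint *next;
    int moves_remaining;
} SplitPoint;

extern SplitPoint *active_split_points;
void attach_to(SplitPoint *sp);       /* just join: no busy thread has to copy anything */
void broadcast_help(void);            /* otherwise ask the busy processors for state    */

SplitPoint *find_work(void)
{
    SplitPoint *best = 0;
    for (SplitPoint *sp = active_split_points; sp; sp = sp->next)
        if (sp->moves_remaining > 0 && (!best || sp->moves_remaining > best->moves_remaining))
            best = sp;
    if (best) { attach_to(best); return best; }
    broadcast_help();
    return 0;
}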


Max theoretical bandwidth is pointless when a PC has no hope of reaching those numbers. There's absolutely nothing in DTS that I would change to use it in Crafty. I don't use it because I don't want to get rid of the clean recursive search and go to an iterated search that is significantly messier to write, test, debug, and maintain. One can look at Cray Blitz's search to see what I mean. But it would work perfectly, and has been implemented on PCs more than once by others, with no problems of any kind other than the complexity.
I got curious and did some reading on Cray Blitz, mainly from the talk by Harry Nelson. I see he is the Cray expert who used vectorization to optimize your program and gave you a much faster Attacks(). Move generation is vectorized using SIMD on short vectors (similar to current SIMD with registers). Attacks() was sped up by 75% according to him. So you (at least Harry) did do some SIMD on the Cray. The NPS figures of that era were shocking (2000, 10000, etc.), so I don't know whether findings from then are directly applicable to current hardware. That was on the Cray-1. The Cray X-MP with 2/4 processors and then the Cray Y-MP with 8 is where you developed DTS. If you had wanted to, you could have written a purely SIMD chess program on the Cray at that time, but since it is also an MIMD machine there was no need for that. NVIDIA GPUs are SIMT (not SIMD), which is somewhere in between. They are more flexible than pure SIMD machines, and that allows for applications other than what is commonly considered SIMD friendly.
Cray was never SIMD. It was an actual vector box, where vector units could "chain" together so that the output from one goes to the input of the next, a classic "arithmetic pipeline". But it was not SIMD-like in the classic sense of machines such as the Goodyear MPP.

The vector stuff had NOTHING to do with parallel search. It was about efficiently pulling things in from main memory with zero latency after the first element, which was extremely fast. DTS was purely shared-memory SMP hardware based. It was developed on a non-Cray (a Sequent Balance 21000 with 30 CPUs, to be precise). We just could not use the Cray assembly-language replacement modules on that machine and had to run the pure Fortran version. Irrelevant to the DTS algorithm and its development.

If you read my dissertation, all the results were from the Sequent, not from a Cray.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: YBWC: Active Reparenting

Post by bob »

diep wrote:
Daniel Shawul wrote:
bob wrote: My DTS search was NOT "optimized for the Cray". It was actually developed on a non-Cray shared memory system, and it has been run on a ton of different machines. It IS optimized for shared memory machines with symmetric hardware (no true master-slave cpus and such).
OK, if you say so. I read this in your paper and it is hard to interpret it any other way:
To process the HELP command, a simple test was added to the sequential search code. After any node is processed, HELP(i) is checked, and if set, the tree state is copied to shared memory, the flag is cleared, and the search continues normally. This reduces the idle time for a processor to the time required for a processor to search one node and return to the top of Search() and then store the "tree state" for the idle processor to examine. (About 30 microseconds on a Cray C90.) After this time, an idle processor will have at least one tree state to examine for split points, making this reasonably efficient.
So you copy the tree state whenever an idle processor asks for it. How big is that tree state?
As I woke up sooner than Bob: 64 KB is what Cray Blitz copied.
But is it relevant? Maybe for the speedup of Cray Blitz.
It is not so interesting 35 years later.

Enough has been said about how Bob wrote his articles around 2002.

DTS is a real invention; don't ever forget that.

Not sure when Bob invented it, yet it's 2012 now and it still works great for up to 32 cores or so on today's hardware, giving a magnificent speedup, whereas it's not so complicated to build in a centralized manner.

I have yet to see any of us invent something that still works 35 years later.

Cilkchess and Zugzwang and Hydra and Deep Blue are far more recent creations, and those lose a factor of 40-50 hands down; Deep Blue effectively loses a lot more than that.
In the common YBW approach, working threads pick up idle ones and the actual copying is done when the idle processor is about to start the parallel search. Apparently you saw some benefit from gathering split points and picking the best one. Marco did something similar, only he checks for the best one on the fly. Why didn't you do that? Does the working processor offer a different split point than the one it is currently working on when it gets a HELP from idle processors? I can only understand the copying if that is the case.
Max theoretical bandwidth is pointless when a PC has no hope of reaching those numbers. There's absolutely nothing in DTS that I would change to use it in Crafty. I don't use it because I don't want to get rid of the clean recursive search and go to an iterated search that is significantly messier to write, test, debug, and maintain. One can look at Cray Blitz's search to see what I mean. But it would work perfectly, and has been implemented on PCs more than once by others, with no problems of any kind other than the complexity.
I got curious and did some reading on Cray Blitz, mainly from the talk by Harry Nelson. I see he is the Cray expert who used vectorization to optimize your program and gave you a much faster Attacks(). Move generation is vectorized using SIMD on short vectors (similar to current SIMD with registers). Attacks() was sped up by 75% according to him. So you (at least Harry) did do some SIMD on the Cray. The NPS figures of that era were shocking (2000, 10000, etc.), so I don't know whether findings from then are directly applicable to current hardware. That was on the Cray-1. The Cray X-MP with 2/4 processors and then the Cray Y-MP with 8 is where you developed DTS. If you had wanted to, you could have written a purely SIMD chess program on the Cray at that time, but since it is also an MIMD machine there was no need for that. NVIDIA GPUs are SIMT (not SIMD), which is somewhere in between. They are more flexible than pure SIMD machines, and that allows for applications other than what is commonly considered SIMD friendly.
I don't know what's wrong with you there. Wake up - we live in 2012.

You're comparing technology from half a century ago with the superior GPU technology we have right now from Nvidia, with another monster soon to show up from them that you measure in TERAFLOPS per GPU. Nothing on this planet can rival THAT.

Go have a look at Zugzwang 1988, OK? It's 10 years after Cray Blitz. They ran on a 512-processor box with each 'processor' being 1 MHz.

Sure, those are not processors that get the same IPC as today's ones, and the box is not symmetric multiprocessing either, yet...

How many nodes a second does the box reach in total for Zugzwang?

200.
So not 200k. Not 20k. Two hundred.

Check out the specs of the box. Its network was magnificent compared to the CPUs.

Over 2.5 million cycles per node.

Their paper claims a speedup of 50%, kind of.

After a longer study of the weird technical terminology used in that article, I concluded they divided the number of nodes searched by 1 processor by the number of nodes searched by n processors and claimed that as their speedup.

Now, as they invented (?) YBW, that's going to look pretty good of course, especially if you forget to mention it...

They forgot the factor time... so the simple algorithm of keeping 511 CPUs idle while just 1 searches would get a magnificent 100% speedup by this measure...

I hope you're not projecting the standards of those days onto today...
Let me add that they invented the TERM "YBW", not the concept. Monty Newborn's original "PVS (principal variation split)" search used exactly the same idea, that you had to search the first move at any node sequentially before starting a parallel search at that node. The only difference was that Monty only did it down the "left-hand edge" of the tree. That is, for a 10-ply search, you had exactly 10 split points. Not bad for 2 CPUs, and it worked reasonably for 4. But not beyond.
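Schematically (illustrative only, not Monty's actual code), the difference is just where a split is allowed:

/* PV-splitting: splits only along the leftmost path, so roughly one split point
   per ply of the iteration depth. YBW: any node, once its first move is done.  */
int pvsplit_may_split(int on_leftmost_path, int first_move_done)
{
    return on_leftmost_path && first_move_done;
}

int ybw_may_split(int first_move_done, int moves_remaining)
{
    return first_move_done && moves_remaining > 0;
}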

DTS was working in 1986 in an early version, at the WCCC that year... The final version for my dissertation was completed in 1987. It continued to be changed over the years until I officially retired Cray Blitz after the 1994 ACM tournament.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: YBWC: Active Reparenting

Post by bob »

Daniel Shawul wrote:
As I woke up sooner than Bob: 64 KB is what Cray Blitz copied.
But is it relevant? Maybe for the speedup of Cray Blitz.
It is not so interesting 35 years later.

Enough has been said about how Bob wrote his articles around 2002.

DTS is a real invention; don't ever forget that.

Not sure when Bob invented it, yet it's 2012 now and it still works great for up to 32 cores or so on today's hardware, giving a magnificent speedup, whereas it's not so complicated to build in a centralized manner.

I have yet to see any of us invent something that still works 35 years later.

Cilkchess and Zugzwang and Hydra and Deep Blue are far more recent creations, and those lose a factor of 40-50 hands down; Deep Blue effectively loses a lot more than that.
Vincent, I do not doubt it is a great invention. The online article about DTS is a bit confusing about what is going on with the HELP command. I made another post about when and how much is copied when a HELP is issued, but later I deleted it because I found something else in the IEEE journal which implied very little (not some 64 KB) is being copied. So I just decided to give Bob a chance to explain. My deleted post is quoted below.
Daniel wrote: First this:
online_article wrote: The HELP command is the primary signaling mechanism within DTS. Whenever a processor is "out of work" it sends the HELP command to what is effectively the entire group of active processors. The HELP command simply requests that any processors that are actively searching subtrees temporarily stop, copy the "tree state" to shared memory, and then continue searching.

As these "tree states" become available, the idle processor that initiated the HELP command analyzes each state to determine if it can find a satisfactory split point. If not, it simply re-broadcasts the HELP command.
But later you say.
online_article wrote: An interesting feature here, is that once a processor finds a viable split-point, and the Split() operation is performed, whenever any other processor becomes idle, they check for active split-points before broadcasting a HELP command. If a processor locates any valid split points, and there is work remaining at any of them, it simply attaches to the split point with the most work remaining, and does not broadcast a help command.
So clearly there is no copying if there are active split points. So the 'HELP' command, and hence the copying of tree state to shared memory, is _rarely_ done, which is the key here. Also, even after the HELP command is issued (i.e. once we determine the active split points are no good), why do we ask the busy processors to copy the tree state to shared memory? Can't the idle processor look for a good split point the way it did when testing for reattachment to active split points? Copying should be done only when we actually split, IMO; doing it while looking for a good split point is sure to consume bandwidth. The whole reason we don't split at shallow depths is to reduce the cost of splitting (i.e. copying tree state). I can understand that finding good split points was the top priority at that time because YBW had not been invented yet, so the copying while looking for split points could be justified by that.
The IEEE journal says:
Whenever a processor exhausts the work (sub-tree) that it is working on, it broadcasts a help request to all busy processors. These processors make a quick copy of the type of each node they are searching in the current sub-tree and the number of unsearched branches at each node, and give this information to the idle processor. The busy processors then resume searching where they were interrupted. The idle processor (or processors if more than one is idle) examines the data and picks the most likely split point based on [...] immediately stop and try to find more useful work to proceed with, by broadcasting a help request. As a processor finds a new best score for a split point, it shares the value with other parallel searchers at that split point to improve their AB cutoff performance. These issues are dealt with more deeply in Hyatt’s thesis [7].
So this clearly says very little is copied. This makes much more sense. Does someone have the PhD thesis, btw?
I don't know what's wrong with you there. Wake up - we live in 2012.

You're comparing technology from half a century ago with the superior GPU technology we have right now from Nvidia, with another monster soon to show up from them that you measure in TERAFLOPS per GPU. Nothing on this planet can rival THAT.

Go have a look at Zugzwang 1988, OK? It's 10 years after Cray Blitz. They ran on a 512-processor box with each 'processor' being 1 MHz.

Sure, those are not processors that get the same IPC as today's ones, and the box is not symmetric multiprocessing either, yet...

How many nodes a second does the box reach in total for Zugzwang?

200.
So not 200k. Not 20k. Two hundred.

Check out the specs of the box. Its network was magnificent compared to the CPUs.

Over 2.5 million cycles per node.

Their paper claims a speedup of 50%, kind of.

After a longer study of the weird technical terminology used in that article, I concluded they divided the number of nodes searched by 1 processor by the number of nodes searched by n processors and claimed that as their speedup.

Now, as they invented (?) YBW, that's going to look pretty good of course, especially if you forget to mention it...

They forgot the factor time... so the simple algorithm of keeping 511 CPUs idle while just 1 searches would get a magnificent 100% speedup by this measure...

I hope you're not projecting the standards of those days onto today...
Maybe I got carried away a bit, but it sure seemed to be exciting for the participants at that time. Also, I want to get to the bottom of what the old Crays were about. Those supercomputers have very few cores, each with performance in the 1-2 gflops range. Like any other supercomputer the performance is measured with LINPACK, so it is kind of weird to have a chess application on it. Harry said less than 5% of chess is vectorizable, but he still managed to vectorize some of the move generation and attacks. The large number of registers also helped to do a number of hash probes at one time (up to 7), and to store bitboards and other things that can't easily be done on non-vector processors. If you can access the IEEE journal you may find it interesting too :)
Unfortunately, you are looking at something that evolved over time, whereas a specific article is about something being done at that specific interval in time.

Originally, when a thread became idle, it sent a "help" message (actually it just set a help flag in global memory for each busy CPU), which caused one active thread to acquire a lock, clear the flag so that others don't try to help, and then release the lock. Now I could look to see where in my current tree state I wanted this guy to help me, do a copy for him, and away we went.
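Roughly, in illustrative C with invented names (the real thing was Fortran and considerably more involved):

typedef struct TreeState   TreeState;     /* a CPU's private search state (opaque here)   */
typedef struct SharedSplit SharedSplit;   /* the shared area the idle CPU will examine    */

extern volatile int help_wanted;          /* set by an idle processor                     */
extern SharedSplit *help_slot;            /* where the idle processor wants the copy      */

int  trylock_help(void);                  /* nonzero if we got the help lock              */
void unlock_help(void);
int  pick_split_ply(const TreeState *ts); /* where in my tree the idle CPU can help       */
void copy_split_state(SharedSplit *dst, const TreeState *src, int ply);

/* Called after every node by a busy CPU: the first one to grab the lock picks a
   split point in its own tree, copies it out, and simply resumes its search.   */
void check_for_help(TreeState *my_tree)
{
    if (!help_wanted) return;             /* cheap test                                   */
    if (!trylock_help()) return;          /* another busy CPU is already handling it      */
    if (help_wanted) {
        help_wanted = 0;                  /* so the remaining busy CPUs keep searching    */
        copy_split_state(help_slot, my_tree, pick_split_ply(my_tree));
    }
    unlock_help();
}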

Later, I let each active searcher copy their state to shared memory (each used local memory that technically could not be seen by other threads) so that the idle processor made the decision. This was a bit more efficient.

Finally, I let the idle searcher look at each thread's "local memory" via a tricky bit of addressing in a language that did not support pointers, so that no copy was needed except when we actually found the place where we wanted to force a split.

Crafty, today, doesn't do any of that. When a thread becomes idle, any active thread notices and begins to check the YBW condition whenever it completes the search of a single move at any node. If one move has been searched already, then we can split, unless there is no work left. But we only split at the current ply, or not at all, which is less efficient than what Cray Blitz would do. For example, if I am doing a 24-ply iteration, and I am at ply 18 when a thread goes idle today, I split at 18, or else wait until I get to 19 or 20 or whatever and have satisfied YBW. All other threads do the same. With CB, if I am at 18 and a thread goes idle, I can split anywhere, preferably nearer to the root, and preferably where more than one move has already been searched, leaving me even more convinced that this is an ALL node than the simple YBW idea does.
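In rough, illustrative C (not Crafty's actual code), the current rule boils down to:

int thread_is_idle(void);   /* is some other thread waiting for work?                  */
int do_split_at(int ply);   /* set up a split point at this ply and hand out the moves */

/* An active thread may split only at the ply it is currently searching, and only once
   YBW is satisfied there; DTS could instead pick any ply, preferably near the root.   */
int maybe_split_here(int ply, int moves_searched, int moves_remaining)
{
    if (!thread_is_idle())   return 0;    /* nobody to help                       */
    if (moves_searched < 1)  return 0;    /* YBW: first move must finish serially */
    if (moves_remaining < 1) return 0;    /* nothing left to hand out at this ply */
    return do_split_at(ply);
}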
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: YBWC: Active Reparenting

Post by bob »

Daniel Shawul wrote:
1. 4 elo is tough to measure. Even 30,000 games gives an error bar of +/-4 so that's nowhere near enough. To get to +/- 1, you need something north of 150,000 games. Did you ACTUALLY measure accurately enough to say +4 elo?
I did not do the testing but only predicted it won't help much. Marco here did, and said that he got +4 elo from some thousand blitz games. So it does confirm my suspicion that it won't help much.
I'm not convinced there at all. 1000 games has an error bar FAR wider than 4 elo...
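Back of the envelope (a rough sketch using a normal approximation on the logistic Elo curve; a tool like BayesElo does this properly):

#include <math.h>
#include <stdio.h>

/* Rough 95% error bar in Elo for n games near a 50% score, given the draw ratio. */
static double elo_error95(double n_games, double draw_ratio)
{
    double var = 0.25 * (1.0 - draw_ratio);     /* variance of one game's score (0, 0.5, 1) */
    double sd_score = sqrt(var / n_games);      /* standard error of the match score        */
    double elo_per_score = 1600.0 / log(10.0);  /* dElo/dscore at 50%, about 695            */
    return 1.96 * elo_per_score * sd_score;
}

int main(void)
{
    printf("1000 games, 30%% draws: +/- %.0f Elo\n", elo_error95(1000.0, 0.3));
    printf("30000 games, no draws: +/- %.1f Elo\n", elo_error95(30000.0, 0.0));
    return 0;
}

With a thousand games you are looking at something like +/- 15-20 Elo depending on the draw ratio, so a +4 result from that sample is well inside the noise.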
2. I don't see any measurable idle time in my current search, and saw absolutely none in CB's search. That is not much difference, I don't see how it would add up to +4 elo...
Sure, mine also has very small idle time. That is why my NPS almost doubles on 2 processors, but has some losses on more than 8 processors. The OP thought he might gain something there, which is why I replied with my guess that it will not help much in that regard. My guess (as I stated before) is that selecting the better split point among those available may help even if there is zero reduction of idle time. To avoid confusion, by a better split point I mean one which has more work. If you can split near the root, you will have fewer splits, which means you avoid copying the search stack every time you split. Similar to the reason why we only allow parallel search above a certain remaining depth.
Note that it also might not be idle time, but memory conflicts. 8 threads can begin to hit a bottleneck getting stuff out of memory. A simple test is to run 8 separate instances of the program at the same time and see what happens to NPS. If it is not 8x that of one instance, then memory is the issue, not the parallel algorithm.

If it is 8x, then either the parallel search has some idle time, OR you might have significant cache-coherency traffic due to poorly laid-out variables in memory, so that two end up in the same cache block but are updated by different cores.
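If it turns out to be the second case, the usual cure is simply padding the per-thread data, along these lines (illustrative sketch, invented names):

#define CACHE_LINE  64
#define MAX_THREADS 16

/* One cache line per thread, so updating my counter never invalidates yours. */
typedef struct {
    unsigned long long nodes;
    char pad[CACHE_LINE - sizeof(unsigned long long)];
} PaddedCounter;

PaddedCounter node_count[MAX_THREADS];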