Daniel Shawul wrote:
As I woke up earlier than Bob: 64 KB is what Cray Blitz copied.
But is it relevant? Sure, maybe for the speedup of Cray Blitz.
This is not so interesting 35 years later.
Enough has been said about how Bob writes his articles around 2002.
DTS is a real invention, don't ever forget that.
Not sure when Bob invented it, yet here in 2012 it still works great for up to 32 cores or so on today's hardware, giving a magnificent speedup, whereas it's not so complicated to build in a centralized manner.
I have yet to see any of us invent something that still works 35 years later.
Cilkchess and Zugzwang and Hydra and Deep Blue are far more recent creations, and those lose a factor of 40-50 hands down, and Deep Blue effectively loses a lot more than that.
Vincent, I do not doubt it is a great invention. The online article about DTS is a bit confusing about what is going on with the HELP command. I made another post about when and how much is copied when a HELP is issued. But later I deleted it, because I found something else in the IEEE journal which implied very little (not some 64 KB) is being copied. So I just decided to give Bob a chance to explain. My deleted post is quoted below.
Daniel wrote:
First this:
online_article wrote:
The HELP command is the primary signaling mechanism within DTS. Whenever a processor is "out of work" it sends the HELP command to what is effectively the entire group of active processors. The HELP command simply requests that any processors that are actively searching subtrees temporarily stop, copy the "tree state" to shared memory, and then continue searching.
As these "tree states" become available, the idle processor that initiated the HELP command analyzes each state to determine if it can find a satisfactory split point. If not, it simply re-broadcasts the HELP command.
But later you say:
online_article wrote:
An interesting feature here, is that once a processor finds a viable split-point, and the Split() operation is performed, whenever any other processor becomes idle, they check for active split-points before broadcasting a HELP command. If a processor locates any valid split points, and there is work remaining at any of them, it simply attaches to the split point with the most work remaining, and does not broadcast a help command.
So clearly there is no copying if there are active split points. So the "HELP" command, and hence the copying of tree state to shared memory, is _rarely_ done, which is the key here. Also, even after the HELP command is issued (i.e. once we determine the active split points are no good), why do we ask the busy processors to copy the tree state to shared memory? Can't the idle processor look for a good split point the same way it tested for reattaching to active split points? Copying should be done only when we actually split, IMO; doing it while merely looking for a good split point is sure to consume bandwidth. The whole reason why we don't split at shallow depths is to reduce the cost of splitting (i.e. copying tree state). I can understand that finding good split points was top priority at that time because YBW had not been invented yet, so the copying while looking for split points could be justified because of that.
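To make the flow concrete, here is roughly how I read the idle-processor side from those two quotes. This is only my reading of the online article, not Bob's actual code; every name below is a made-up placeholder.

Code:
/* Sketch of the idle-processor flow as I read the online DTS article.
 * All types and functions here are hypothetical placeholders.          */
typedef struct SplitPoint SplitPoint;
typedef struct TreeState  TreeState;

SplitPoint *find_split_point_with_most_work(void);
void        attach_and_search(SplitPoint *sp, int my_id);
void        broadcast_help(int my_id);
TreeState  *wait_for_tree_states(void);
SplitPoint *choose_split_point(TreeState *states);

void idle_processor_loop(int my_id)
{
    for (;;) {
        /* 1. Prefer an existing split point with work remaining:
         *    attach there, no HELP broadcast and no copying at all.    */
        SplitPoint *sp = find_split_point_with_most_work();
        if (sp) {
            attach_and_search(sp, my_id);
            continue;
        }
        /* 2. Otherwise broadcast HELP; the busy processors copy their
         *    "tree state" to shared memory and keep on searching.      */
        broadcast_help(my_id);
        TreeState *states = wait_for_tree_states();

        /* 3. Analyze those states; if a satisfactory split point is
         *    found, Split() there and join it.                         */
        SplitPoint *new_sp = choose_split_point(states);
        if (new_sp)
            attach_and_search(new_sp, my_id);
        /* 4. No viable split point: loop and re-broadcast HELP.        */
    }
}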
The IEEE journal says
Whenever a processor exhausts the work (sub-tree) that it is working on, it broadcasts a help request to all busy processors. These processors make a quick copy of the type of each node they are searching in the current sub-tree and the number of unsearched branches at each node, and give this information to the idle processor. The busy processors then resume searching where they were interrupted. The idle processor (or processors if more than one is idle) examines the data and picks the most likely split point based on [...] immediately stop and try to find more useful work to proceed with, by broadcasting a help request. As a processor finds a new best score for a split point, it shares the value with other parallel searchers at that split point to improve their AB cutoff performance. These issues are dealt with more deeply in Hyatt's thesis [7].
So this clearly says very little is copied. This makes much more sense. Does someone have the PhD thesis, btw?
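If that IEEE wording is taken literally, the "quick copy" each busy processor makes is on the order of a couple of bytes per node of its current sub-tree, nothing like a full 64 KB tree state. A back-of-the-envelope guess at such a snapshot (my own layout, not Bob's):

Code:
/* Hypothetical per-node info matching the IEEE description: the node
 * type plus the number of unsearched branches. For a path of at most
 * MAX_PLY nodes the whole snapshot stays well under a kilobyte.       */
#define MAX_PLY 64

enum NodeType { PV_NODE, CUT_NODE, ALL_NODE };

struct NodeInfo {
    unsigned char type;              /* one of enum NodeType            */
    unsigned char unsearched_moves;  /* branches not yet searched       */
};

struct TreeStateSnapshot {
    int             depth;           /* number of nodes on current path */
    struct NodeInfo node[MAX_PLY];   /* ~2 bytes per node               */
};                                   /* roughly 132 bytes, not 64 KB    */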
Yes, I have all those PhD theses, also Bob's. Search for Robert Morgan Hyatt.
You're speaking about Bob becoming world champion in the '70s. His thesis he delivered some 10 years later, 1988 or so?
In his thesis he claims a speedup of around 7.78 out of 16. Looks like a correct claim to me for what he did, and that wasn't on a Cray of course.
Why are you interested in all this old junk?
Science was at a different level back then. Lying by a factor of 50 was common. Cilk is a good example there. Total junk idiocy from a few professors that slows you down a factor of 40+. Had Don done the SMP himself I'm sure he wouldn't have lost that much. But then he wouldn't have had the supercomputer that Leiserson & co could get so easily by scribbling something on an A4 about computer chess being similar to guided missile flight...
I am very much against all that happened back then in terms of scaling claims. If your program first gets slowed down by a factor of 40 to 50, I feel one should mention that first, before making a speedup claim. And not in some hidden one-liner like you usually see.
Yet are you happy with the postings some here make, who basically lift on someone else's 3200 Elo points and then act as if they know all about the 2 Elo points they won in some testing another dude had set up?
You cannot compare anything of today with the searches of the '80s and before. First of all, they didn't have null move, and they had either no hash tables or total junk hash tables. Branching factors of 10+ were very common, and we speak of programs that lose a factor of 40 to 50 somewhere.
Dealing with all that today is a lot harder - of course it's also easier in the sense that at up to 16 or so cores you can easily test it at home.
I remember how I made my own SMP code back in 1998-1999. Completely at home. Later on I could log in for a while at one of Bob's machines, a quad Pentium Pro; that really fixed a lot of issues.

In 2002-2003 I had to set up a new framework for Diep. At home I had a dual K7 machine. How to simulate a 500 CPU search there?
At a 32 CPU partition it seemed to work OK when I was allowed to run Diep at 32 CPUs for a few minutes.
This was already December 2002. I had worked nearly full-time on Diep's SMP in the months before.
I had to wait until the end of February to run on a bigger partition and use some 130 CPUs there. This was only the opening position, for a 1 hour run.
After some weeks it finally got executed. Diep got... ...1000 nodes a second.
Then I realized something was wrong... ...and that all my assumptions had to be redone and that I had to CHANGE something in the program.

I figured out somewhat later that one of the problems was that each thread of Diep timed itself using gettimeofday. When I removed that, then 3 months later when it ran at 130 CPUs, it scaled a lot better. The reason was a centralized clock processor. If 130 CPUs regularly go ask the time from 1 centralized processor, that will choke... ...so Diep could no longer self-measure how efficiently each CPU ran; that would backfire later on during the world champs 2003...
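The usual way around that kind of clock bottleneck is to not let every thread query the time per node at all, but only once every so many thousand nodes, reusing a cached value in between. A minimal sketch, assuming gettimeofday() is the call in question; this is an illustration, not Diep's actual code:

Code:
#include <sys/time.h>

/* Query the real clock only once every TIME_CHECK_INTERVAL nodes and
 * reuse the cached value in between, so 130 threads don't hammer a
 * centralized clock processor. In a threaded engine these statics
 * would live in the per-thread search state.                          */
#define TIME_CHECK_INTERVAL 10000

static double now_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

double search_time(long nodes_searched)
{
    static long   next_check = 0;
    static double last_time  = 0.0;

    if (nodes_searched >= next_check) {
        last_time  = now_seconds();
        next_check = nodes_searched + TIME_CHECK_INTERVAL;
    }
    return last_time;
}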
Again, if I started 130 processes on my dual K7 at home it worked great, and it also worked great at 32 CPUs when running the opening position for a few minutes. What was the problem?
Any help from others I could not expect. A few months before that I had already emailed all kinds of scientists asking for technical data regarding the latencies of the machine. Also professor Aad v/d Steen. No answer. By phone I received the advice not to ask Aad nor anyone else anything again, as they were 'too busy' with more important things in life than helping out others. After all, government money may not be put to useful use.
The next 1 hour slot in which I could use more than 130 CPUs on the big 512 processor partition was in September 2003, when it rebooted in between the runs of some big challenge project that was calculating the height of the seawater. They had started those calculations on the then world's fastest supercomputer, the Earth Simulator in Japan.
In the summer of 2003 they had already made a Discovery Channel series about the height of the seawater. It would rise 1 meter over the coming 100 years.
The project kept running the entire summer - when I wondered why they ran for so long, longer to my feeling than had been allotted - I saw they had had a problem. And as they occupied that many CPUs of this 512 processor partition, I could never even test on it for 1 hour, and had to wait.
In the log files of this project I saw the problem they had had. "Oops, mistake in initialization," it said. "We initialized the seawater level 1 meter too high."
Very good scientists, you know. Brilliant guys. But the next run I had for 1 hour was September 2003.
Very nervously I awaited the results of the run; the actual run you couldn't see, it ran in batch.
When I got back the results I was so, so disappointed.
"Error allocating shared memory". The default Unix function to generate which key number to use had a focking collision already after 166 numbers. CPU 43 or something hashed to the same number as 166. How on planet earth do you manage to write a hash function for Unix that gets a collision that quickly?
A few days before the world champs 2003 was the next attempt for Diep.
Again the opening position. I had dared to take 3 hours this time.
I was very worried though when I got back the results.
5 million nps.
On the back of a napkin I calculated that for 100% scaling I should've seen 20k nps * 500 CPUs = 10 million nps...
World champs 2003 a few days later was my next chance.
On Friday I got access.
Guess what.
NO INTERNET AT THE LOCATION.
Despite the fact that I had already asked the organisation for internet for that Friday.
Normally speaking I would have gone home then and sued the organisation, but when using a government supercomputer you don't do that.
I arrived there at 10 AM and had to wait until 6 PM for internet.
A few minutes later we already had the order to leave the location.
The next morning I arrived again. It took 3 hours just to START Diep.
With a new opening book I started up Diep and prayed it would start in time for the first round, which was to begin 3 hours and 2 minutes later.
Exactly 1 minute before the game had to start, it had started up.
Just typing a quick 'quit' 'by accident', as I'm so used to doing, was OUT OF THE QUESTION.
I would lose then, as it would take 3 hours to start up again.
During the game it worked like shit. Initially 5 million nps, but when a move was not predicted it really worked badly. It got like 9 plies then. On a predicted next move suddenly 17 plies.
You really want to compare those days, when most could hardly test anything, with today's multicore CPUs where everyone can at least test *something*?
Do you really realize that things had to work 'right from scratch'?
Jonathan Schaeffer had had the same experience there with an SGI box for Chinook, so he told.
In my mailbox SGI had sworn to me 3 times (of which 2 times on paper and 1 time by phone) that the worst case latency to remote CPUs was 960 nanoseconds.
In fact, when I tested it during the world champs 2003 at 460 CPUs, it appeared that the AVERAGE was 5.8 microseconds. Not 960 ns at all.
Worst case it was a factor of 12 difference.
This is a world of problems that you don't have with hardware you have at home.
Based upon THAT I played over there. With 2 hours of sleep a night I worked on through the night in the hotel room, just blindly guessing what the problem was.
Then the first hour in the morning I was allowed to be there, waiting for the official to open the cave, get in, put the new version on the box. Start it, and then the next round pray you didn't break something.
In one game I obviously had broken something in the version. Instead of splitting down to a depthleft of 2 ply (meaning searches of 1 ply get done in this case), I had set that threshold higher. To 3+ ply.
That was against Falcon. Diep got horrible search depths.
After the round with Fritz there were a few spare hours. Finally I discovered the real problem. Again 2 hours of sleep, and from game 8 onwards Diep scaled fantastically at the supercomputer.
Yet Diep lost that game versus Fritz, which it otherwise would have WON.
So that influenced the tournament outcome a lot later on.
The emergency fix I had done back then scaled a lot better. Years later I would get a much better idea of how to fix that bottleneck, yet as I never again ran at hundreds of CPUs, I never got the chance to show it with THAT fix.

The reason for that was a conversation after the world champs. By phone. At the other side of the line was professor Jaap v/d Herik, who had supported me fantastically.
I told him that working on in this amateurish manner was not possible unless it got paid. I begged for 100 euro a month as a reward for writing all those reports and dealing with 9 government organisations just for 1 supercomputer. Jaap was resolute. Paying for science was impossible, and for sure not worth 100 euro a month.
So that stopped the supercomputer project from Diep's side.
Jaap advised me to write a management report, so that the organisations could store it in a drawer and let it gather dust.
Instead I wrote something publicly. One of my complaints was SGI lying about the latencies. A public reaction - which the Dutch computer chess federation gave me no opportunity to respond to - was that this person only needed to ask 'his friend Aad v/d Steen', who assured him the latency should be around 4 microseconds. I should have done that as well and shouldn't have written this nonsense, said the guy...
You know, these governments are all so amateurish, with the N*SA (also meaning the locals here) being the biggest amateurs, who only by means of some espionage look 1 IQ point more clever than IQ 100 dudes; you should really read what happened back then in a different light...
Vincent