Laskos wrote:diep wrote:Laskos wrote:diep wrote:Laskos wrote:diep wrote:
2400 cores seems really a lot to me there
Assuming this is first time Jonny uses that many - any speedup numbers from Jonny?
What speed-up? It gets 1 billion NPS. Don't know effective speed-up. Mark says it ponders many moves simultaneously, but ponder alone cannot mean more than twice the thinking time. Pity this thing is not tested in 100 games against Komodo, to have a better idea what this is.
Good pondering can be very powerful. Not seldom it cuts your opponent's time in half at critical moments.
So it reduces elo bigtime of your opponent in such case.
with 2400 cores i would say a clever choice from Jonny.
Ponder is at most a factor 2 in thinking time on average (either decrease the opponent's time or increase yours), and at this level would mean some 30-40 ELO points.
Reducing the elo of your opponent is way more than factor 2.
Opponent doesn't have 2400 cores.
Talking about elo is not really interesting as you just play 2 games in total.
If you have 2400 cores and you group it in 4 x 600 cores which in turn calculate at different positions, sometimes 3 plies ahead sometimes pondering other moves,
that's IMHO pretty clever.
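A rough sketch of that grouping, just to make the idea concrete. The function name and the task labels are made up for illustration; this is not Jonny's actual code, and how the pools really get their positions is an assumption:

```python
def partition_cores(total_cores, pools):
    """Split core ids 0..total_cores-1 into `pools` contiguous groups of near-equal size."""
    base, extra = divmod(total_cores, pools)
    groups, start = [], 0
    for i in range(pools):
        # The first `extra` groups take one more core when the split is uneven.
        size = base + (1 if i < extra else 0)
        groups.append(range(start, start + size))
        start += size
    return groups

# Hypothetical assignment: one pool on the expected reply, the others ponder alternatives.
tasks = ["expected reply", "ponder move A", "ponder move B", "ponder move C"]
pools = partition_cores(2400, 4)
assignment = {task: (len(cores), cores[0], cores[-1]) for task, cores in zip(tasks, pools)}

for task, (count, first, last) in assignment.items():
    print(f"{task}: {count} cores ({first}..{last})")
```

Each pool then runs its own search on its own position, so the pools don't need to talk to each other until a move is actually played.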
Factor 2 scaling won't give Jonny +40 elo for sure.
Wish i had done that world champs 2003...
What you say is clever only if the scaling of the usual parallel search with many cores is bad. Would you adopt the same strategy with 4 or 8 cores only? And again, ponder is a factor of 2 only, normal chess is mostly about playing good moves, not so much about stealing time.
With Diep the position is different from other engines. The engine back then, and even on today's hardware, is 10x lower in nps because of having more chess knowledge. Though not so well tuned back then.
Then further you need to be realistic about supercomputers. It has for example the GCC compiler usually and slower processors usually.
So you already lose a factor 2 in advance because of inferior compiler, communication overhead bla bla.
In case of Jonny i suspect he uses the same SMP algorithm as most do nowadays. It's not a true YBW, it's more a kind of 'attempt' to do a little YBW without hurting scaling.
Diep i used a full blown hard YBW and crafty for example also uses a hard YBW and Glaurung (nowadays stockfish) also did do a hard YBW (didn't check stockfish there).
The German dudes and also Belgium (which automatically also means rybka/houdini and komodo) are doing this soft YBW algorithm.
Which IMHO doesn't give that much improvement going from 200 cpu's to 400 cpu's.
So if you got 2400 then and divide that in 4 x 600, i fully would agree with that.
With 8 cores that's just not an option. The average hardware here from top programs is like 28-32 cores or so.
If you are doing a very well implemented hard YBW then so to speak you can build very cheaply supercomputer.
As you need to use GCC, the amd processor, older 12 core ones, are pretty ok.
not to confuse with the failed bulldozer with minicores.
Build 8 machines with 48 cores AMD, second hand on ebay for peanuts. That's like 1500 euro a box maximum?
Then use a good mellanox switch (actually that'll have 36 ports soon) and put in 2 mellanox infiniband cards, so we call that double rail.
As the pci at those 4 socket motherboards is attached to different CPU's (not cores, really separate cpu's) that's double the number of messages a second you can ship around to/from the switch (the switch isn't the problem).
You have then for peanuts a "supercomputer" of 48 x 8 = 384 cores and each core is around 2.3 GHz then or so.
So that's 883.2 GHz in total.
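The arithmetic above written out, for anyone who wants to plug in other box counts. The constant names are just for illustration, and the price line only uses the "1500 euro a box maximum" guess, so it leaves out the switch and the infiniband cards:

```python
NODES = 8                 # boxes
CORES_PER_NODE = 48       # 4 sockets x 12 cores
CLOCK_MHZ = 2300          # "around 2.3 GHz" per core
MAX_EURO_PER_BOX = 1500   # upper guess from the post, boxes only

total_cores = NODES * CORES_PER_NODE
# Work in MHz so the sum stays an exact integer until the final division.
aggregate_ghz = total_cores * CLOCK_MHZ / 1000
max_box_cost = NODES * MAX_EURO_PER_BOX

print(f"{total_cores} cores, {aggregate_ghz} GHz aggregate, at most {max_box_cost} euro for the boxes")
```

Summed GHz is of course only a crude yardstick; what you get out of it depends on how well the search scales over the nodes.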
The SMP algorithms used by all the PC programs aren't good enough to run at such 'supercomputers'.
Latency from node to node IS simply 1.0 microsecond and that's not so fast. Yet it'll be factor 4-8 better than the box Jonny runs at.
So the hack i used at world champs 2003 was to partition in 4 pools; in this case you could use 8 pools.
You quickly start the YBW search with 48 cores and then when enough splitpoints are there you split in the other nodes cores in a few dangs.
That means you lose some scaling yet the speedup is great.
Diep at such hardware we can easily figure out how many nps it gets. Which might not seem so great to you.
Latency to the RAM or to other nodes is the reason why most pc programs won't work there.
So for them it is never an option to run on a supercomputer and do what Jonny is doing.
they CAN get 128+ processor machines easily for WCC - yet their algorithms do not scale well enough.
short answer: first 1000 processors i wouldn't consider doing it - yet realize that jonny is a 10x faster searcher than Diep. So he'll need the hardware less