George Tsavdaris wrote: What is Delta ELO?

Nothing but the Elo gain when increasing the number of CPUs.
In other words, what Miguel is saying is that you have to play some games. A lot of games, actually...
M ANSARI wrote: Actually, I have never tested against Crafty, but when testing Rybka against Zappa, the "scaling" I mean was that, for example, scores of Rybka 2.3.2a against ZMII on a single core were say 80 Elo in Rybka's favor; on 2 cores it would be something like 60 Elo, and so on. This was quite linear until, at 8 cores and 5_0 matches, ZMII was pulling very close, and pulling ahead when the cores were pushed to 5 GHz. That led me to think that ZMII was scaling better than R 2.3.2a. This was not the case with R3, where scores against ZMII stayed pretty much the same as you increased cores.

bob wrote: That's a very imprecise way of measuring "scaling" and could in fact be a way of measuring "parallel search bugs".

M ANSARI wrote: Probably, but it is the only way I can think of to measure scaling across platforms, as I have yet to see two original engines produce similar kNPS profiles.

George Tsavdaris wrote: The only general and valid way I see for measuring scaling is the simple procedure of, say, taking 10 different positions, measuring the time an engine needs to reach a certain depth in each position (the bigger the depth the better) on each hardware, then dividing the times from the different hardware configurations to get the speedup for each engine, and finally taking the average or something similar to get a good value for the actual speedup.

For example, for Rybka and Crafty on a Quad and an Octal, with positions named P1, P2, ..., P10, depths to reach in the corresponding positions D1 (e.g. 25 plies), D2, ..., D10, and times to reach every depth in every position t1, t2, ..., t10:

You first let Rybka run each of the positions on the Quad and keep the times:
Quad_P1: time to reach D1 -> tq1
Quad_P2: time to reach D2 -> tq2
...
Quad_P10: time to reach D10 -> tq10

Then you let Rybka run each of the positions on the Octal, and again keep the times:
Octal_P1: time to reach D1 -> to1
Octal_P2: time to reach D2 -> to2
...
Octal_P10: time to reach D10 -> to10

The speedup, i.e. the scaling from Quad to Octal for Rybka, is then:
Position-1: tq1/to1
Position-2: tq2/to2
...
Position-10: tq10/to10

So the average speedup is the average of the above. The same goes for Crafty.

This seems a legitimate method of measuring the actual speedup, i.e. the scaling. Is there a better one, or does this have any flaw?

That's about as good as you can do, with one addition. You will probably want to "normalize" the times so that each position using one CPU takes the same amount of time. Since you can't do this just by setting depth, you can come up with a multiplier so that for one CPU each position takes the same "adjusted" time.
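To make the arithmetic concrete, here is a minimal sketch of that procedure in Python, including the per-position normalization multiplier suggested above. All timing numbers are invented placeholders, not real Rybka or Crafty measurements.

# Sketch of the Quad-vs-Octal measurement described above.
# All times are hypothetical placeholders, in seconds.

quad_times  = [812.0, 640.5, 1210.2, 455.9, 980.0,
               733.1, 1502.7, 611.4, 890.3, 1045.6]   # tq1..tq10
octal_times = [430.1, 395.2,  610.8, 260.3, 600.1,
               410.9,  790.5, 350.2, 470.7,  555.0]   # to1..to10

# The suggested addition: a per-position multiplier that makes every
# position take the same "adjusted" time on the baseline box. Applying
# the same multiplier to both boxes leaves each ratio intact, but gives
# the positions equal weight if you ever aggregate raw times.
target = sum(quad_times) / len(quad_times)
mult = [target / tq for tq in quad_times]
adj_quad  = [m * tq for m, tq in zip(mult, quad_times)]   # all equal to target
adj_octal = [m * to for m, to in zip(mult, octal_times)]

speedups = [tq / to for tq, to in zip(quad_times, octal_times)]
for i, s in enumerate(speedups, 1):
    print(f"Position-{i}: tq{i}/to{i} = {s:.2f}")
print(f"average speedup: {sum(speedups) / len(speedups):.2f}")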
George Tsavdaris wrote: But in fact I can't understand why they are dependent on the positions. I mean, a very small difference is normal, but a big one such as you suggest is a puzzle to me. I would have expected only tiny differences, and even 10 positions to be many and actually a bit of a waste of time. Does such a high deviation of the speedups between various positions really occur?

Milos wrote: Before posting ridiculous stuff, try to at least read Bob's paper on parallel search...

OK, thanks, although you didn't forgive my ignorance. But at least you helped.
George Tsavdaris wrote: The only general and valid way I see for measuring scaling is the simple procedure of, say, taking 10 different positions, measuring the time an engine needs to reach a certain depth in each position on each hardware, and dividing the times to get the speedup for each engine...

michiguel wrote: 10 positions is way too few, because the speedup is very dependent on positions. I would choose enough that the SD becomes small enough.

George Tsavdaris wrote: If the speedup is heavily dependent on the positions, then this method is not so good after all. But in fact I can't understand why they are dependent on the positions. I mean, a very small difference is normal, but a big one such as you suggest is a puzzle to me. I would have expected only tiny differences, and even 10 positions to be many and actually a bit of a waste of time. Does such a high deviation of the speedups between various positions really occur?

I've posted a ton of data on this subject, here and in several papers on parallel search. Some positions produce almost exactly the same speedup each time they are run with the same number of CPUs. Some produce wildly variable speedups under the same circumstances. The majority produce less wild variability, but when I say "less wild" I mean _exactly_ that. It is quite common to see a run with 8 CPUs that looks like this: 1.5m, 2.8m, 1.1m, 2.3m, 5.8m. One "outlier point" is common, and that 3x+ variability is not uncommon either.
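As a rough illustration, here is a tiny sketch using exactly the five repeated 8-CPU run times quoted above; the statistics code is mine, not from the thread. michiguel's criterion then amounts to adding positions (and repeating runs) until the SD of the resulting estimate is acceptably small.

import statistics

# The five repeated 8-CPU runs quoted above, in minutes.
runs = [1.5, 2.8, 1.1, 2.3, 5.8]

print(f"mean   : {statistics.mean(runs):.2f} min")        # 2.70
print(f"SD     : {statistics.stdev(runs):.2f} min")       # ~1.86
print(f"max/min: {max(runs) / min(runs):.1f}x spread")    # ~5.3x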
michiguel wrote: I think that the only method that takes into account all the potential issues and parameters is to measure Delta ELO vs. the number of CPUs, but of course it is time consuming.

What is Delta ELO?
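For reference, the Elo difference implied by a match score follows from the standard logistic rating model, Elo = -400 * log10(1/s - 1), where s is the score fraction. A short sketch (the 60% example is illustrative, not from the thread):

import math

def elo_diff(score: float) -> float:
    """Elo difference implied by a score fraction in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

# An engine scoring 60% in a match implies roughly +70 Elo:
print(round(elo_diff(0.60)))   # -> 70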
George Tsavdaris wrote: I guess Bob's paper is the following, right?
http://www.cis.uab.edu/hyatt/search.html

Yes. Check the tables at the end, and mind that in those times the variability in node count was noticeably smaller than today, when search is far more complex.
bob wrote: Technically, no matter how you compute speedup, you introduce bias. You can just add all the times and average, but then the longest searches get weighted disproportionately. You can compute each speedup (for each position) and then average, but this can unfairly bias the results if you have unusual positions here and there. You can do as I suggested. The problem with any of the above is that an "average of an average" is an issue. There are more accurate ways, including running each position N times and using some sort of smoothing formula to get rid of the variability. The bottom line is a lot of positions, run a lot of times (more times as you use more CPUs, because variability will go through the roof).

If you have a cluster you can always play games to accurately determine the speedup in Elo.
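A minimal sketch of the aggregation choices Bob contrasts, with invented times: the ratio of summed times weights long searches heavily, while the mean of per-position ratios weights every position equally and is pulled hard by one unusual position.

# Hypothetical 1-CPU and 8-CPU times (seconds) for four positions;
# the last is an "unusual" position with a super-linear speedup.
t1 = [100.0, 200.0, 400.0, 50.0]
t8 = [ 20.0,  40.0,  80.0,  2.0]

# Estimator 1: ratio of summed times; long searches dominate.
total_ratio = sum(t1) / sum(t8)

# Estimator 2: mean of per-position speedups; every position counts
# equally, so the outlier pulls the average hard.
ratios = [a / b for a, b in zip(t1, t8)]
mean_ratio = sum(ratios) / len(ratios)

print(f"total-time ratio: {total_ratio:.2f}")   # ~5.28
print(f"mean of ratios:   {mean_ratio:.2f}")    # 10.00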
Milos wrote: If you have a cluster you can always play games to accurately determine the speedup in Elo.

That's still problematic. The cluster I use has dual CPUs per node, which is not much help. Our other cluster has only 70 nodes, 8 cores per node, but it is much busier. And SMP improves with depth, so fast games are worse than slower games. And longer games become much more problematic in terms of the time needed...
bob wrote: That's still problematic. The cluster I use has dual CPUs per node...

I know, I was joking a bit; playing games to determine the speedup is extremely impractical.
Milos wrote: I know, I was joking a bit; playing games to determine the speedup is extremely impractical. However, a serious question now: how is it that in the times of Cray Blitz, depths in the range of 9-12 (9 on a single CPU, 12 on 16 CPUs, opening and middlegame positions) were sufficient to claim accuracy, while today depths of 15-18 (for very fast games) are not sufficient, even with much larger node counts? Yes, there are different prunings, LMR, things that change the tree shape a lot, but still, would that mean that if you repeated the Cray Blitz parallel search tests on today's hardware you would get much better performance??? My understanding is that SMP improves with depth only up to a certain point (which is in the depth range of 15-20 for today's hardware and engines), and beyond that you don't see any significant improvement any more.

That covers a lot of ground.
bob wrote: With respect to depth, if you crank up Crafty on (say) a 16-CPU system, in a middlegame position, and tell it to "go", it will start to search and display the NPS frequently. It will start off at about half the speed it will be searching at after 30 seconds or so. I've not noticed any "stabilization" of parallel performance after a certain depth. The deeper the search, the less overhead we seem to see, in general, because the further from the tips you split, the better the move ordering is and the less likely you are to split at a CUT node, which kills performance. Is a 30-minute search better than a 1-minute one? Yes. A lot better? Probably not. But there is a steady gain, at least as far as I have measured. I have not compared 1-day searches to 1-hour searches, for obvious reasons.

So then logically comes the question: if you left it running on a 16-CPU system for days, would 16 be the asymptotic speedup for most positions, or would there still be a majority of positions that can never reach a linear speedup, no matter how long they are left to run?
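The NPS ramp-up Bob describes is easy to watch for yourself. Below is a rough sketch, assuming a multi-threaded UCI engine (the ./engine path is a placeholder, and "Threads" is the conventional option name in many UCI engines; Crafty itself speaks its own console protocol, so this only approximates his experiment). It starts a 60-second search and prints each NPS figure the engine reports, which should climb as all threads pick up useful work.

import subprocess

ENGINE = "./engine"   # placeholder path to any multi-threaded UCI engine

p = subprocess.Popen([ENGINE], stdin=subprocess.PIPE,
                     stdout=subprocess.PIPE, text=True, bufsize=1)

def send(cmd: str) -> None:
    p.stdin.write(cmd + "\n")

send("uci")
send("setoption name Threads value 16")
send("position startpos")
send("go movetime 60000")   # search for 60 seconds

# Print every NPS value the engine reports in its "info" lines.
for line in p.stdout:
    tokens = line.split()
    if "nps" in tokens:
        print(tokens[tokens.index("nps") + 1])
    if line.startswith("bestmove"):
        break
send("quit")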