Laskos wrote:
bob wrote:
Where do "I claim 1.9x speedup from 16 cores to 32 cores?" Never have, never will. Unless you talk about my simple approximation, which has NEVER been a claim of anything, just a simple estimate. You misuse the word "claim" badly here. I claim that is just a rough estimate. and I've made the claim much weaker for 16-32 and beyond as well.
So, you say that formula doesn't apply anywhere near 16 or 32 threads. Couldn't you just remember 3 numbers for doublings? "Approximation" a la Hyatt.
Grow up. When was that formula developed? Back when most parallel PCs were dual-CPU machines and I had a quad Pentium Pro box. The 8-core Opteron then fit those three data points (1, 2 and 4) quite accurately as well. And 12 is right in there too; I have done a lot of that testing, since I have a cluster of 70 12-core nodes.
It is old. I have not tried to re-calibrate it as of yet, although I suppose that now that we have 16 cores and up, even on a single chip, it is about time. But it is not particularly easy, since a more accurate formula would need N different sets of coefficients; not all parallel machines are created equal.
The number is highly accurate for most machines; certainly through 8 cores it is even a bit pessimistic, as previous data has shown. Beyond 16 I have not claimed any sort of error estimate. If you choose to use it beyond where it was really intended, then don't blame me for something I have never claimed.
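For illustration only: the formula itself is not quoted anywhere in this thread, so the sketch below assumes a simple linear fit of the kind described, calibrated on the 1-, 2- and 4-processor data points mentioned above. The 0.7 coefficient is an assumption for the sake of the example, not a figure taken from this post.

```python
# A minimal sketch, NOT the exact formula from this thread: an assumed linear
# time-to-depth speedup approximation of the form 1 + coeff * (ncpus - 1).
def approx_speedup(ncpus, coeff=0.7):
    """Rough speedup estimate; the 0.7 coefficient is an assumed fit parameter."""
    return 1.0 + coeff * (ncpus - 1)

for n in (1, 2, 4, 8, 12, 16, 32, 64):
    print(f"{n:3d} cores -> ~{approx_speedup(n):.1f}x")
```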
Now guy, you said that you got a 32x effective speedup (time to strength) on a 64-core Itanium box. Then you smirked when I said that IS a very high effective speed-up, citing some papers from the '80s on DTS for a completely different architecture. Do you know what a 32x _effective_ speed-up means in terms of the _average_ effective speed-up per doubling, out to 64 threads?
I claimed a 32x parallel time-to-depth speedup on 64 cores. Nothing about time to strength. I have not personally tried to quantify speedup-to-Elo in my testing, because 99% of my testing does not use parallel search anyway. I also stated that 32x does not particularly impress me. We saw some numbers by Feldman quite a few years ago showing a speedup of something like 256 on a 512-node hypercube-type box. None of us considered THAT to be particularly interesting. I don't have any of my journals here at home or I would try to provide a citation. The 256 might be off a bit up or down; I do not recall. But none of us (those working on parallel search) consider a 50% hardware loss acceptable.
1.78 per doubling, over 6 doublings.
That is a VERY HIGH effective speed-up for 6 doublings. Either the hardware is very different, or Crafty is an exceptionally, almost linearly, scaling engine. What V. Rajlich reported years ago for the effective speed-up of Rybka was a series of doublings up to 16 cores looking like this:
1.8*1.7*1.6*1.5
The product is about 7.3, already below half of the 16 cores used. That's why I consider 32x for Crafty on a 64-core box very high.
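For concreteness, here is the arithmetic behind those two figures, using only the numbers quoted above:

```python
# 32x effective speedup over 6 doublings (1 -> 64 threads) implies a
# geometric-mean gain per doubling of 32 ** (1/6).
per_doubling = 32 ** (1 / 6)
print(f"per-doubling gain for 32x over 6 doublings: {per_doubling:.2f}")  # ~1.78

# The Rybka series quoted above: four doublings from 1 to 16 cores.
total = 1.0
for gain in (1.8, 1.7, 1.6, 1.5):
    total *= gain
print(f"Rybka effective speedup on 16 cores: {total:.1f}")  # ~7.3, below 16/2
```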
Would you take this bet:
If Andreas bothers to run a 32-core test on his dual Opteron, I claim Komodo will gain a 1.34 effective speed-up from 16 to 32 threads, or 38 Elo points with a log model; your claim (from your 64-core Itanium tests with the claimed 32x effective speed-up) is that it will gain 1.78 in effective speed-up, or 75 Elo points, on that hardware at Andreas' 60''+0.05'' time control. Whoever is closer in numbers wins the bet; whoever loses has to post here "I AM AN IDIOT".
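A minimal sketch of the log-model conversion behind those Elo figures. The roughly 90-Elo-per-doubling constant is an assumption: it is not stated anywhere in this thread, it is simply the value that reproduces the 38 and 75 Elo numbers quoted above.

```python
import math

# Assumed Elo gained per doubling of effective speed at 60''+0.05'';
# chosen only so the quoted 38 and 75 Elo figures come out right.
ELO_PER_DOUBLING = 90

def elo_gain(effective_speedup, elo_per_doubling=ELO_PER_DOUBLING):
    # Log model: Elo gain proportional to log2 of the effective speed-up.
    return elo_per_doubling * math.log2(effective_speedup)

print(f"1.34x effective speed-up -> {elo_gain(1.34):.0f} Elo")  # ~38
print(f"1.78x effective speed-up -> {elo_gain(1.78):.0f} Elo")  # ~75
```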
I may open a plea thread to ask Andreas to perform such a test, either Komodo 32 threads against Komodo 1 thread, or even better, Komodo 32 threads against Komodo 16 threads.
Agreed on such a bet?
Why would I bet anything on hardware I have not used? You said "dual Opteron", which has 16 cores per CPU? Not a very good machine; it has a huge cache and memory bottleneck. My original approximation was tested in the Fierz thread I mentioned, back at least 15 years ago if not more, on an 8-CPU Opteron box that was 8 individual chips, not one chip with 8 cores. Hence no cache bottleneck, just memory. Today there are a ton of variables. I have a machine that is driving me nuts, as I have mentioned: capable of 60M nodes per second (2 x 6 cores), but I can only reach 45-50M for this same problem. Can it be fixed? I don't know as of yet. So betting on hardware, particularly hardware that comes with a built-in bottleneck, doesn't make a lot of sense and wouldn't show a lot of judgement. I have run on a few odd machines where I could not get a 50% speedup. A particular quad-core laptop comes to mind that had a horrible memory interface.
I'm not a kid. I don't gamble. I bet on sure things where I know the outcome before the bet is made. Betting on unknown hardware is not exactly in that class, sorry.