4 x Intel Xeon good idea?

Discussion of chess software programming and technical issues.

Moderators: Harvey Williamson, Dann Corbit, hgm

Milos
Posts: 3971
Joined: Wed Nov 25, 2009 12:47 am

Re: 4 x Intel Xeon good idea?

Post by Milos » Wed Nov 18, 2020 11:36 pm

Laskos wrote:
Wed Nov 18, 2020 11:30 pm
Milos wrote:
Wed Nov 18, 2020 10:56 pm
Laskos wrote:
Wed Nov 18, 2020 9:54 pm
Alayan wrote:
Wed Nov 18, 2020 8:40 pm
His claim is unsupported so unless proper tests demonstrate that the scaling efficiency actually got worse, I'd just assume it was rubbish.

The proper measure of scaling efficiency of N threads is finding the time factor X needed so that 1-core with X*T time scores 50% against N-cores with T time. The resulting efficiency is X/N.
It was shown up to at least 64 threads that classical SF with improved Lazy SMP scaled amazingly well compared to YBWC. For 64 cores, the effective speedup was about 40, compared to something like 18 with YBWC. Maybe I will find that plot among the data of Andreas (the author of the excellent FGRL rating list), if it was him who showed it.
Well Alayan goes and calls BS on any result that he doesn't like.
There are quite a few convincing results in the thread that he started by basically calling CCRL results BS.
http://talkchess.com/forum3/viewtopic.p ... 3&start=50
Some of those results from mwyoung looked quite strong (you know my opinion about his testing methods). That and the other result (http://talkchess.com/forum3/viewtopic.php?f=2&t=75363) looked suspicious enough so I did myself a test of scaling of SF-NNUEdev vs Lc0.
Ofc, it's nowhere near as comprehensive as what Andreas did (http://talkchess.com/forum3/viewtopic.php?f=2&t=74188), but for me it's pretty indicative.
I played SF_1c vs Lc0, SF_2c vs Lc0, SF_4c vs Lc0, SF_8c vs Lc0, SF_16c vs Lc0. No HT, fixed multiplier for all cores, bullet match (1'+0.6''), 500 custom openings reversed colors.
And I got, respectively, the following results: 58.5%, 62.85%, 66.1%, 67.85%, 70.45%.
Then I repeated it with SF11 vs a weaker Lc0 net so that winning and draw percentages stay similar. And here I got: 51.55% 58.9% 64.65% 68.3% 72.15%.
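For context, under the standard logistic Elo model a score fraction s maps to an Elo difference of -400*log10(1/s - 1); a quick sketch (not part of Milos's post) converting the reported percentages:

```python
import math

def score_to_elo(score):
    """Elo difference implied by a match score fraction (0 < score < 1)
    under the standard logistic model."""
    return -400.0 * math.log10(1.0 / score - 1.0)

# The SF-NNUEdev vs Lc0 percentages reported above:
for s in (0.585, 0.6285, 0.661, 0.6785, 0.7045):
    print(f"{100 * s:.2f}% -> {score_to_elo(s):+.1f} Elo")
```

This only converts scores into Elo gaps; by itself it says nothing about scaling efficiency.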
Wow! And what's the matter with LazySMP and NNUE? It isn't very obvious that this should happen at all.
I don't know if YBWC would indeed help, but it would be good if someone tried. LazySMP was widening the tree and covering for mistakes of SF's aggressive pruning, but it could be (just a speculation) that move ordering improved with NNUE and that the tree widening of LazySMP is not bringing as much as before.

Alayan
Posts: 489
Joined: Tue Nov 19, 2019 7:48 pm
Full name: Alayan Feh

Re: 4 x Intel Xeon good idea?

Post by Alayan » Thu Nov 19, 2020 3:22 am

Milos wrote:
Wed Nov 18, 2020 10:56 pm
Well Alayan goes and calls BS on any result that he doesn't like.

There are quite a few convincing results in the thread that he started by basically calling CCRL results BS.
http://talkchess.com/forum3/viewtopic.p ... 3&start=50
Some of those results from mwyoung looked quite strong (you know my opinion about his testing methods). That and the other result (http://talkchess.com/forum3/viewtopic.php?f=2&t=75363) looked suspicious enough so I did myself a test of scaling of SF-NNUEdev vs Lc0.
Ofc, it's nowhere near as comprehensive as what Andreas did (http://talkchess.com/forum3/viewtopic.php?f=2&t=74188), but for me it's pretty indicative.
I played SF_1c vs Lc0, SF_2c vs Lc0, SF_4c vs Lc0, SF_8c vs Lc0, SF_16c vs Lc0. No HT, fixed multiplier for all cores, bullet match (1'+0.6''), 500 custom openings reversed colors.
And I got, respectively, the following results: 58.5%, 62.85%, 66.1%, 67.85%, 70.45%.
Then I repeated it with SF11 vs a weaker Lc0 net so that winning and draw percentages stay similar. And here I got: 51.55% 58.9% 64.65% 68.3% 72.15%.
Nice personal attack you got there. I happen to not like incorrect results and I don't shy from calling them out. The relevant question is whether I'm wrong or not.

The CCRL results were indeed incorrect when I called them out. I acknowledged the uncertainty coming from small sample size but also pointed out the deeper issue with some fundamental assumptions of the elo model. Changing the opponent mix can significantly change the elo results and the engine ordering.

As for your test, it's flawed.

I wrote:
The proper measure of scaling efficiency of N threads is finding the time factor X needed so that 1-core with X*T time scores 50% against N-cores with T time. The resulting efficiency is X/N.
You can alter the experiment by finding the time factor X so that 1-core with X*T time scores the same against a set of opponents as N-cores with T time does, but there are some important invariants that have to be preserved for the results to be valid.

Measuring the elo gap between 1 core and N core tells you almost nothing by itself.

Large Elo differences, when they occur, introduce distortions. Stockfish with high contempt doesn't have better SMP scaling than SF without contempt, unlike what a 1-core vs. N-core comparison could lead one to believe.

More importantly, without information on the single-core time-scaling properties, this test is useless as a measure of scaling efficiency. Scaling efficiency is a measure of the multi-threading losses compared to the single-thread algorithm. Not more, and not less. Big Elo gains with N threads might still mean crappy scaling if 1-core with X*T time achieves the same Elo gain with X much smaller than N. Small Elo gains with N threads might still mean perfect scaling if X=N.
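Alayan's procedure amounts to root-finding: increase the 1-core time handicap X until the score against the N-core version hits 50%, then report X/N. A sketch of that search, where `play_match` is a hypothetical callback (it would run a long match and return the 1-core side's score fraction, assumed to increase with X):

```python
def effective_speedup(play_match, n_threads, lo=1.0, hi=None, iters=20):
    """Bisect for the time factor X at which the 1-core engine with X*T
    time scores ~50% against the N-core engine with T time.

    Returns (X, X / n_threads), i.e. the effective speedup and the
    scaling efficiency. Assumes play_match(lo) < 0.5 <= play_match(hi).
    """
    if hi is None:
        hi = float(n_threads)  # beyond N-fold would be superlinear scaling
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if play_match(mid) < 0.5:
            lo = mid   # 1-core side still too weak: give it more time
        else:
            hi = mid   # 1-core side at/above 50%: tighten from above
    x = (lo + hi) / 2.0
    return x, x / n_threads
```

In practice each `play_match` call needs thousands of games for the 50% crossing to be resolved with any precision, which is part of why this measurement is so rarely done properly.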

Comparing the Elo difference of SF-NNUE N threads vs. SF-NNUE 1 thread to the Elo difference of SF 11 N threads vs. SF 11 1 thread would be bad enough; instead, you used some random Leelas, a different one for each, as the base opponent. That's the icing on the cake. That only 1K games were used is relatively minor considering the other flaws.

What could explain your results if not bad SMP scaling?

For all we know, it could just be that SF-NNUE, starting 150 Elo stronger on 1 core, finds it much harder to gain further playing strength, whether by adding more time or more threads.

Of course, just as it has not been proven that SMP scaling got worse, it has not been proven that it didn't. In the absence of actual proof, and considering the existing evidence, I don't think it likely that SMP scaling got significantly worse (or better, for that matter). I might be wrong, but it would take a methodologically sound measurement of SMP scaling (or something crazy like SF classical beating SF-NNUE at ultra-high thread counts) for me to change my mind.

Raphexon
Posts: 341
Joined: Sun Mar 17, 2019 11:00 am
Full name: Henk Drost

Re: 4 x Intel Xeon good idea?

Post by Raphexon » Thu Nov 19, 2020 1:19 pm

Milos wrote:
Wed Nov 18, 2020 10:56 pm
Laskos wrote:
Wed Nov 18, 2020 9:54 pm
Alayan wrote:
Wed Nov 18, 2020 8:40 pm
His claim is unsupported so unless proper tests demonstrate that the scaling efficiency actually got worse, I'd just assume it was rubbish.

The proper measure of scaling efficiency of N threads is finding the time factor X needed so that 1-core with X*T time scores 50% against N-cores with T time. The resulting efficiency is X/N.
It was shown up to at least 64 threads that classical SF with improved Lazy SMP scaled amazingly well compared to YBWC. For 64 cores, the effective speedup was about 40, compared to something like 18 with YBWC. Maybe I will find that plot among the data of Andreas (the author of the excellent FGRL rating list), if it was him who showed it.
Well Alayan goes and calls BS on any result that he doesn't like.
There are quite a few convincing results in the thread that he started by basically calling CCRL results BS.
http://talkchess.com/forum3/viewtopic.p ... 3&start=50
Some of those results from mwyoung looked quite strong (you know my opinion about his testing methods). That and the other result (http://talkchess.com/forum3/viewtopic.php?f=2&t=75363) looked suspicious enough so I did myself a test of scaling of SF-NNUEdev vs Lc0.
Ofc, it's nowhere near as comprehensive as what Andreas did (http://talkchess.com/forum3/viewtopic.php?f=2&t=74188), but for me it's pretty indicative.
I played SF_1c vs Lc0, SF_2c vs Lc0, SF_4c vs Lc0, SF_8c vs Lc0, SF_16c vs Lc0. No HT, fixed multiplier for all cores, bullet match (1'+0.6''), 500 custom openings reversed colors.
And I got, respectively, the following results: 58.5%, 62.85%, 66.1%, 67.85%, 70.45%.
Then I repeated it with SF11 vs a weaker Lc0 net so that winning and draw percentages stay similar. And here I got: 51.55% 58.9% 64.65% 68.3% 72.15%.
Why not test SF11 against the same Lc0 but with time odds so SF11_1c gets the same score as SF12_1c?
Right now your test is extremely flawed and also can't account for the possibility of elo-compression.
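The Elo-compression point can be illustrated with the plain logistic model: the same underlying improvement buys fewer score points the further the starting point already is from 50%. A minimal sketch (generic model, no engine specifics):

```python
def expected_score(elo_diff):
    """Logistic expected score for a given Elo advantage."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

# The same +50 Elo improvement yields a shrinking score gain as the
# baseline advantage over the opponent grows:
for base in (0, 150, 300):
    gain = expected_score(base + 50) - expected_score(base)
    print(f"base {base:+4d} Elo: +50 Elo adds {100 * gain:.1f} score points")
```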

And that "horrible" scaling was also really apparent during TCEC when SFclassical did so much better than NNUE vs Leela... oh wait.

Laskos
Posts: 10945
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: 4 x Intel Xeon good idea?

Post by Laskos » Thu Nov 19, 2020 5:55 pm

Raphexon wrote:
Thu Nov 19, 2020 1:19 pm
Milos wrote:
Wed Nov 18, 2020 10:56 pm
Laskos wrote:
Wed Nov 18, 2020 9:54 pm
Alayan wrote:
Wed Nov 18, 2020 8:40 pm
His claim is unsupported so unless proper tests demonstrate that the scaling efficiency actually got worse, I'd just assume it was rubbish.

The proper measure of scaling efficiency of N threads is finding the time factor X needed so that 1-core with X*T time scores 50% against N-cores with T time. The resulting efficiency is X/N.
It was shown up to at least 64 threads that classical SF with improved Lazy SMP scaled amazingly well compared to YBWC. For 64 cores, the effective speedup was about 40, compared to something like 18 with YBWC. Maybe I will find that plot among the data of Andreas (the author of the excellent FGRL rating list), if it was him who showed it.
Well Alayan goes and calls BS on any result that he doesn't like.
There are quite a few convincing results in the thread that he started by basically calling CCRL results BS.
http://talkchess.com/forum3/viewtopic.p ... 3&start=50
Some of those results from mwyoung looked quite strong (you know my opinion about his testing methods). That and the other result (http://talkchess.com/forum3/viewtopic.php?f=2&t=75363) looked suspicious enough so I did myself a test of scaling of SF-NNUEdev vs Lc0.
Ofc, it's nowhere near as comprehensive as what Andreas did (http://talkchess.com/forum3/viewtopic.php?f=2&t=74188), but for me it's pretty indicative.
I played SF_1c vs Lc0, SF_2c vs Lc0, SF_4c vs Lc0, SF_8c vs Lc0, SF_16c vs Lc0. No HT, fixed multiplier for all cores, bullet match (1'+0.6''), 500 custom openings reversed colors.
And I got, respectively, the following results: 58.5%, 62.85%, 66.1%, 67.85%, 70.45%.
Then I repeated it with SF11 vs a weaker Lc0 net so that winning and draw percentages stay similar. And here I got: 51.55% 58.9% 64.65% 68.3% 72.15%.
Why not test SF11 against the same Lc0 but with time odds so SF11_1c gets the same score as SF12_1c?
Right now your test is extremely flawed and also can't account for the possibility of elo-compression.

And that "horrible" scaling was also really apparent during TCEC when SFclassical did so much better than NNUE vs Leela... oh wait.
It's not extremely flawed. In normal circumstances, if the Lc0 net is not very different, the data is quite OK and seriously hints that SF11 scales somewhat better. What you and Alayan cite as possible sources of error are heavily exaggerated and, under the conditions Milos specified, unlikely. "Measuring the elo gap between 1 core and N core tells you almost nothing by itself" is a heavily exaggerated statement, and here most likely wrong.

Nobody measures multicore scaling directly the way Alayan's textbook proposition prescribes: "You can alter the experiment by finding the time factor X so that 1-core with X*T time scores the same against a set of opponents as N-cores with T time does, but there are some important invariants that have to be preserved for results to be valid." Such definitions are almost impossible to apply in practice. It's a bit like Uri's nitpicking at every empirical result with outlandish theoretical possibilities.

Your proposal is more reasonable, but it doesn't seem a big deal if one doesn't follow it. What Milos did is quite reasonable as a check of scaling, provided nothing too unexpected happens (picking wrongly behaving Lc0 nets or similar silly things). So his result does hint that SF12 might have scaling issues. And in practice, there likely won't be many results comparing scaling with a much better methodology (your proposal may indeed be what I would pick), perhaps on better hardware; almost nobody will apply theoretical definitions when measuring multicore scaling. As for Elo compression, it doesn't even seem to be a big deal here, and even if it appears, there is the Normalized Elo route to deal with it.
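For completeness, the error-bar arithmetic behind exchanges like this can be sketched from a W/D/L record. The `drift_to_noise` ratio below is only in the spirit of Normalized Elo, which rescales the score margin by the per-game standard deviation so draw-heavy matches aren't over-credited; the published definition differs by a conventional scale factor, and `match_stats` is an illustrative name:

```python
import math

def match_stats(wins, draws, losses, z=1.96):
    """Score, Elo difference with ~95% CI, and a drift-to-noise ratio
    from a W/D/L record, using the trinomial score variance."""
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n                  # score fraction
    var = (wins + 0.25 * draws) / n - s * s       # per-game score variance
    se = math.sqrt(var / n) if var > 0 else 0.0   # standard error of s
    to_elo = lambda p: -400.0 * math.log10(1.0 / p - 1.0)
    clamp = lambda p: min(max(p, 1e-3), 1.0 - 1e-3)
    return {
        "score": s,
        "elo": to_elo(clamp(s)),
        "elo_lo": to_elo(clamp(s - z * se)),
        "elo_hi": to_elo(clamp(s + z * se)),
        "drift_to_noise": (s - 0.5) / math.sqrt(var) if var > 0 else 0.0,
    }
```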
