This probably isn't the Christmas message you wanted, but over the last 3 months I've tried an experiment:
1) to collect 100 hard tactical test positions. Difficult, but solvable.
2) to wait for SF 15.1
3) to test SF 15.1 on these 100 positions at very long TC (2000 seconds per position) on a variety of hardware
HARDWARE
There are too many options to mention, but early results show an alarming trend. Focusing just on successive 8-fold increases in physical core count, all on modern hardware, we have:
1 core
8 cores
64 cores (Threadripper)
512 cores (cluster)
I get an almost perfect halving of solve time (i.e. a doubling of true search speed) for every 8-fold increase in core count; the cumulative speedups are 1x, 2x, 4x, 8x, so the 512-core cluster is only 8x faster than a single core.
I'll do more experiments, and I fully accept this is as rough and ballpark as a dog's dinner, but the trend alarmed me.
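The arithmetic behind that claim can be sketched as follows; the per-position solve times below are hypothetical placeholders chosen to reproduce the reported trend (a halving of solve time per 8-fold core increase), not the actual measurements:

```python
# Minimal sketch of the scaling arithmetic. Solve times are
# hypothetical, picked only to match the trend described above.
import math

solve_time = {1: 1600, 8: 800, 64: 400, 512: 200}  # seconds (hypothetical)

base = solve_time[1]
for cores in sorted(solve_time):
    speedup = base / solve_time[cores]
    # effective scaling exponent alpha, where speedup = cores ** alpha
    alpha = math.log(speedup) / math.log(cores) if cores > 1 else 1.0
    print(f"{cores:4d} cores: {speedup:4.1f}x, alpha = {alpha:.2f}")
```

With ideal SMP scaling the exponent alpha would be 1.0; a 2x speedup per 8x cores corresponds to alpha ≈ 0.33.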
Horrible SF scaling
Moderator: Ras
-
- Posts: 5296
- Joined: Thu Mar 09, 2006 9:40 am
- Full name: Vincent Lejeune
-
- Posts: 3410
- Joined: Sat Feb 16, 2008 7:38 am
- Full name: Peter Martan
Re: Horrible SF scaling
Hi Vincent!
Just for fun (I like it especially because it's very quickly done with a minimum of hardware time), here's the opposite trial: ultra-short TC with positions suited to it, and Ferdy's MEA tool. The 888 positions
https://www.dropbox.com/s/1m3cnrnqtq01q ... 8.epd?dl=0
I made by replacing the too-easy ones from Ed and Ferdy's new STS with not-quite-as-easy ones (Eret + Arasan); the points per position were all changed to new values by me too.
That way the "bracing" of Elo (or "native" points) is biggest, which helps to discriminate single runs, and any comparison to other tests isn't of interest here anyhow. I let SF 11 and SF 15.1 run with 1, 2, 4, 8, 16 and 32 threads of a 16x3.5GHz CPU, corresponding hash, and always 200 msec/pos.:

Peter.
-
- Posts: 3642
- Joined: Wed Mar 08, 2006 8:15 pm
- Full name: Jouni Uski
Re: Horrible SF scaling
It's difficult to test because multicore search is so unpredictable. I made a test on an i5 processor, 1 core vs 6 cores. Another problem is that SF 15.1 with 1 core crashes in my Fritz GUI. My result: 6 cores found the solution on average 4.15 times faster! And it needed less depth than 1 core for the solution.
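As a quick sanity check on those numbers (assuming the 4.15x figure from the post), a 6-core speedup of 4.15x works out to these parallel-efficiency figures:

```python
# Parallel efficiency and effective scaling exponent for the
# 6-core, 4.15x-speedup result reported above.
import math

cores, speedup = 6, 4.15
efficiency = speedup / cores                  # fraction of ideal linear scaling
alpha = math.log(speedup) / math.log(cores)   # speedup = cores ** alpha
print(f"efficiency {efficiency:.2f}, alpha {alpha:.2f}")
```

That is roughly 69% parallel efficiency, noticeably better than the trend reported in the opening post.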

Jouni
-
- Posts: 954
- Joined: Thu Aug 11, 2022 11:30 pm
- Full name: Esmeralda Pinto
Re: Horrible SF scaling
Well, that's a comparison between apples and pears, completely unsuitable for drawing conclusions. I have also made such comparisons with the wonderful MEA tool. But what about testing different engines that reach the same number of solutions?
Engine 1, 2, 3... with:
e.g. engine 1 with 1 core and 1 thread, xxx solutions, and other engines with the same solutions.
-
- Posts: 3410
- Joined: Sat Feb 16, 2008 7:38 am
- Full name: Peter Martan
Re: Horrible SF scaling
The other parallel thread here
https://talkchess.com/forum3/viewtopic. ... 60#p939460
was continued at CSS too, about the Elo loss of newer SF dev versions in MultiPV mode
https://forum.computerschach.de/cgi-bin ... #pid160252
In that thread Joerg Oster linked a special modification of SF's simple MultiPV mode
https://forum.computerschach.de/cgi-bin ... #pid160251
Direct link to GitHub:
https://github.com/joergoster/Stockfish ... pleMultiPV
Peter.
-
- Posts: 3410
- Joined: Sat Feb 16, 2008 7:38 am
- Full name: Peter Martan
Re: Horrible SF scaling
Have a second look at the part I pointed out for you, this time in bold.
To draw the conclusions this thread is about ("scaling" of single versions of one single engine, you see?), it's a fine way to test exactly like I did here.
Other questions, other threads, other tests, other conclusions, regards

Peter.
-
- Posts: 3642
- Joined: Wed Mar 08, 2006 8:15 pm
- Full name: Jouni Uski
Re: Horrible SF scaling
Peter: stop testing MV=4. It's cheating and makes Stockfish 199.3 Elo weaker



Jouni
-
- Posts: 3410
- Joined: Sat Feb 16, 2008 7:38 am
- Full name: Peter Martan
Re: Horrible SF scaling
Not with the 256 positions, 5" TC and EloStatTS

You saw my direct comparisons at CSS, I guess.
BTW, comparing single positions is of course the most exact way to measure the most important parameters of an engine's playing strength: time to solution(s), time to best lines, time to best evals. Only this way can they be compared directly, engine by engine and hardware by hardware; but to get statistically relevant results you'll have to run very many single tests, even without taking SMP into account.
Elo gain or loss by VSTC positional testing is of course yet another measurement; I'll try to get something of relevance into this posting in edit time, but I still have to finish one of the runs.
And of course it matters even much more if you use positions with unforced lines, e.g. opening positions. Engine-engine matches too will depend very much on openings, hardware-TC and especially on the sample of opponents, if the Elo differences are to get out of the error bar of the engines' own MultiPV rating lists (not to mention MultiPV matches against LC0 and MCTS-search-based engines).
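For readers unfamiliar with how a score fraction turns into an Elo difference and an error bar, here is a minimal sketch using the standard logistic Elo model; the binomial error bar ignores draws, so it is only a rough approximation, not what EloStatTS actually computes:

```python
# Standard logistic Elo model: score -> Elo difference, plus a
# rough 95% error bar treating each game as a Bernoulli trial
# (draws ignored, so this is an approximation only).
import math

def elo_diff(score: float) -> float:
    """Elo difference implied by a score fraction in (0, 1)."""
    return -400 * math.log10(1 / score - 1)

def elo_error_95(score: float, n_games: int) -> float:
    """Approximate 95% error bar on the Elo difference."""
    se = 1.96 * math.sqrt(score * (1 - score) / n_games)
    return elo_diff(score + se) - elo_diff(score)

print(elo_diff(0.75))            # ~190.8 Elo for a 75% score
print(elo_error_95(0.75, 256))   # error bar for a 256-position run
```

The error bar shrinks only with the square root of the number of positions, which is why Peter's point about needing very many single tests holds.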
Edit: got it ready just in time:

"Simple" ist Jörg's MultiPV- mode modifiation, MV again means number of MultiPVs, regards
Peter.
-
- Posts: 3410
- Joined: Sat Feb 16, 2008 7:38 am
- Full name: Peter Martan
Re: Horrible SF scaling
BTW, cheating whom? With my own tests I could only cheat myself, and knowing how to interpret the results rules that out, so let me try to make the meaning of the measurements a little clearer still.
The MEA above is again an internal version comparison only, so it is not to be compared to rating and ranking lists in which more than one engine takes part; but that other kind of list can already be seen at CSS anyhow too (see link above).
The EloStatTS list there is of course one of its own kind as well. 5" for each of the 256 positions is something in between tactical testing (longer TC with only harder, forced positions; that's a third list of mine) and positional testing (STC with more, not-so-hard positions). The 888 are for VSTC, mainly positional testing (VSTC with positions of multiple solutions, as in STS, together with easy tactical ones).
Just to give one more example of how much difference hardware-TC makes in those MEA tests of mine, with versions and settings as close to each other as here: here's the same sample of "opponents" with 300 msec/pos. single-threaded, instead of the 200 msec with 8 CPU threads per run used above.

The point of MEA for these tests, as I wrote at the very start, is to get the biggest discrimination between single runs with the shortest hardware time, to "brace or spread Elo" (or native points of the STS kind) the widest.
The picture you get by viewing all these measurements together shows that Elo isn't transitive at all, for any measurement of engine playing strength; it never was, and nowadays is less and less so. To find a statistically relevant difference (whether given in Elo, performance percentage or whatever) within an engine sample in which settings, versions and nets are compared to each other, you'll need either very much hardware time, or a very narrow restriction to very specially selected (and so somewhat "biased") samples of positions and hardware-TCs, or all those restrictions together and still much hardware time.

Peter.