Stockfish benchmark test

Philofive
Posts: 3
Joined: Sun Sep 22, 2024 9:02 pm
Full name: Philipp Enoeckl

Post by Philofive »

Hi there!

I am new here. I am a chess FM, but I never had anything to do with computer chess (well, I installed Stockfish and let it run for 5 seconds per move to analyse my games, which was more than sufficient :wink: ).
However, things have changed now. A club colleague who is into correspondence chess gave away his old hardware because he bought a new machine, and I was quite surprised when I started the old one.

It is a dual Xeon 2699 v4 with 256 GB ECC RAM, and the machine has done nothing but run Stockfish 24/7 since 2016.

OK, I wiped the hard drive, installed Debian and Scid vs. PC, and compiled Stockfish 17 and other engines. I just wanted to compare this old hardware from 2016, which was very expensive in its time, with my "normal" desktop computer from 2023 (AMD 7600, 32 GB RAM) when it comes to computer chess.

Here the trouble started. I realized that things are much more complicated than I thought and that I actually have no understanding of how things work. I also came to the conclusion that my colleague doesn't know much about benchmarking either, so I am asking here.

Now my question; I am just looking at one specific chess engine (Stockfish 17).

I thought that when comparing hardware, I just have to look at the Mn/s and that's it. Double the threads and you get nearly double the Mn/s, because Stockfish scales well under parallelization. So far so good. On my new hardware, HT increases the performance quite a bit; on the 2699 it looks like it doesn't. OK, fine, whatever.

But then I came across the Stockfish benchmark tool, because people claim you can't compare hardware just by looking at the Mn/s; you also need to look at the "time-to-depth", and the Stockfish benchmark is a good tool for both. (OK, of course the best test is to let my two machines play some matches, but let's put that aside.)

So what I did is:

stockfish bench 2048 11 25
on one machine (6 real cores) and

stockfish bench 2048 43 25
on the other (44 cores) and

stockfish bench 2048 86 25
again on the Xeon, this time with 86 threads. The parameters are:

2048 .... hash size (in MB)
11 .... threads
25 .... fixed depth (btw, what is that exactly? In my understanding it is just a depth for some relevant variations, but what is relevant and what isn't?)


Result AMD: (roughly 10 Mn/s on the starting position)

Total time (ms) : 59644
Nodes searched : 1111302912
Nodes/second : 18632266

Result Xeon with 43 Threads (roughly 20 Mn/s on the starting position):

===========================
Total time (ms) : 117966
Nodes searched : 5223385708
Nodes/second : 44278738

Result Xeon with 86 Threads (still roughly 20 Mn/s on the starting position):

Total time (ms) : 191927
Nodes searched : 10987894123
Nodes/second : 57250382

So what can I conclude now?
The Xeon takes much longer to reach the depth but analyses many more positions. HT somehow makes sense in the benchmarking tool, but not when I let it run on the starting position, and it even takes longer to reach a given depth. I don't understand it.

So what is the best setting for the Xeon, and how much stronger is it than the AMD? (Or is the latter a dumb question to ask without letting them play against each other?)

I know, a long post; thanks a lot for at least reading it.
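
The summary figures quoted in this post can be sanity-checked with a few lines of Python. This is only a sketch: the helper name `parse_bench_summary` is made up for illustration, the text is pasted from the 43-thread Xeon run above, and it assumes the reported nodes/second is simply nodes divided by elapsed time with integer division.

```python
import re

# Summary lines pasted from the 43-thread Xeon result in the post.
SUMMARY = """\
Total time (ms) : 117966
Nodes searched : 5223385708
Nodes/second : 44278738
"""


def parse_bench_summary(text: str) -> dict[str, int]:
    """Collect every 'label : integer' line of a bench summary into a dict."""
    pattern = r"^(.+?)[ \t]*:[ \t]*(\d+)[ \t]*$"
    return {
        label.strip(): int(value)
        for label, value in re.findall(pattern, text, flags=re.MULTILINE)
    }


summary = parse_bench_summary(SUMMARY)

# Assumption: nodes/second = nodes * 1000 / milliseconds, truncated to an integer.
recomputed_nps = summary["Nodes searched"] * 1000 // summary["Total time (ms)"]

print(summary)
print("recomputed nodes/second:", recomputed_nps)
```

The same check can be repeated with the other two summaries by replacing the pasted text.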
Ras
Posts: 2698
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Post by Ras »

Philofive wrote: Mon Sep 23, 2024 9:38 am I thought, when comparing hardware, I just have to have a look at the Mn/s and that's it. Double threads and you get nearly double Mn/s, because Stockfish scales well on parallelization.
Not all Mn/s are equal. The problem with parallel search is redundant calculation: you may see the same raw Mn/s, but the percentage of actually useful nodes may be quite different.

In a single-core version, the engine can use data from previous calculations. In the parallel version, that data doesn't exist yet because the workers start at the same time. So for a given total computing performance, you want it split up into as few cores as possible. A single 2699 is roughly equal to the 7600 in terms of total performance, but the 2699 splits that up into 22 cores while the 7600 only splits it up into 6 cores. Hence, a single 2699 will play worse than a 7600. However, you're also adding a second 2699, so now you have twice the raw performance, but split into 44 cores.

That's also the problem with hyperthreading. Yes, it does raise the raw Mn/s, but it also doubles the split (and the Mn/s don't even double).
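
As a rough illustration with the numbers from the first post, the split can be made visible by dividing each nodes/second figure by its thread count. This is only arithmetic on raw throughput, written as a minimal Python sketch; the dictionary labels are made up for the example, and the result says nothing about how many of those nodes are actually useful.

```python
# Nodes/second figures as quoted in the first post: (threads, nodes per second).
runs = {
    "AMD 7600, 11 threads": (11, 18_632_266),
    "Dual Xeon, 43 threads": (43, 44_278_738),
    "Dual Xeon, 86 threads": (86, 57_250_382),
}

per_thread = {}
for name, (threads, nps) in runs.items():
    per_thread[name] = nps / threads
    print(f"{name}: {nps / 1e6:.1f} Mn/s total, "
          f"{per_thread[name] / 1e6:.2f} Mn/s per thread")

# Hyperthreading on the Xeon: the thread count doubles (43 -> 86),
# so compare how much the raw speed grows in return.
ht_speedup = runs["Dual Xeon, 86 threads"][1] / runs["Dual Xeon, 43 threads"][1]
print(f"Raw speed with 86 vs 43 threads: x{ht_speedup:.2f}")
```

With these inputs the raw speed grows by well under a factor of two when the thread count doubles, while the per-thread rate drops on every step from the AMD to the 86-thread Xeon run.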
Rasmus Althoff
https://www.ct800.net
Viz
Posts: 223
Joined: Tue Apr 09, 2024 6:24 am
Full name: Michael Chaly

Post by Viz »

Philofive wrote: Mon Sep 23, 2024 9:38 am because people claim, you can't compare hardware just by looking at the Mn/s but you need to look at the "time-to-depth" as well
Time to depth is a completely useless metric that shows quite literally nothing, especially on multiple cores.
People who claim otherwise just don't have expertise on this topic.
MN/s is also a pretty mediocre metric, because at the same MN/s, the higher your thread count, the worse the engine actually plays. But it's still much better than time to depth, of course.
Jouni
Posts: 3621
Joined: Wed Mar 08, 2006 8:15 pm
Full name: Jouni Uski

Post by Jouni »

Question. The SF wiki says: "Threads type spin default 1 min 1 max 1024
The number of CPU threads used for searching a position. For best performance, set this equal to the number of CPU cores available". But why do TCEC and CCC use hyperthreaded cores?!
Jouni
Philofive
Posts: 3
Joined: Sun Sep 22, 2024 9:02 pm
Full name: Philipp Enoeckl

Post by Philofive »

Viz wrote: Mon Sep 23, 2024 10:38 am MN/s is also pretty mediocre metric
Ok, is there something like a gold standard then? Or does the "gold standard metric" simply not exist?
noobpwnftw
Posts: 694
Joined: Sun Nov 08, 2015 11:10 pm
Full name: Bojun Guo

Post by noobpwnftw »

The gold standard for chess performance is simply to have them play a statistically sound number of matches and see the results.

If that cannot be done, then it is more like trying to infer the performance of two athletes by comparing their height, weight, and perhaps their lung capacity and muscle ratio. It will not get any better than that, to be fair.

If you are looking strictly for hardware differences (the catch is that different engines, or different versions of them, do not have a comparable relation between MN/s and playing strength), then comparing the MN/s of the same engine with exactly the same settings is a scientifically sound approach. If the machines have different numbers of cores, first compare them with a reasonable number of cores that is available on both. Beyond that, more cores are better, provided the machine with more cores was equal or better in that first comparison. Under this condition, MN/s translates directly into chess performance.
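
To make "statistically sound" a bit more concrete, here is a minimal Python sketch that turns a match result into an Elo difference with an approximate 95% margin. It uses the standard logistic Elo formula and a normal approximation for the error. The function name `elo_from_match` and the win/draw/loss tallies are hypothetical and do not come from any real match between the two machines.

```python
import math


def elo_from_match(wins: int, draws: int, losses: int) -> tuple[float, float]:
    """Return (Elo difference, approximate 95% margin) for one side of a match.

    Not defined for a 0% or 100% score, where the Elo formula diverges.
    """
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games

    # Per-game variance of the result, which is 1, 0.5 or 0 points.
    variance = (
        wins * (1.0 - score) ** 2
        + draws * (0.5 - score) ** 2
        + losses * (0.0 - score) ** 2
    ) / games
    stderr = math.sqrt(variance / games)

    def to_elo(s: float) -> float:
        # Logistic Elo model: expected score s corresponds to this rating gap.
        return -400.0 * math.log10(1.0 / s - 1.0)

    low = to_elo(score - 1.96 * stderr)
    high = to_elo(score + 1.96 * stderr)
    return to_elo(score), (high - low) / 2.0


# Hypothetical tallies, for illustration only.
elo, margin = elo_from_match(wins=120, draws=300, losses=80)
print(f"Elo difference: {elo:+.1f} +/- {margin:.1f}")
```

If the margin is larger than the difference itself, the match was too short to say which machine is stronger.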
Viz
Posts: 223
Joined: Tue Apr 09, 2024 6:24 am
Full name: Michael Chaly

Post by Viz »

Philofive wrote: Mon Sep 23, 2024 12:07 pm
Viz wrote: Mon Sep 23, 2024 10:38 am MN/s is also a pretty mediocre metric
Ok, is there something like a gold standard then? Or does the "gold standard metric" simply not exist?
As noob said, the "gold standard" is more or less playing enough games to determine that one is better than the other.
But that's probably not practical, especially if you plan to do fairly long analysis. So it more or less comes down to experience and existing data on SF scalability with cores.
Philofive
Posts: 3
Joined: Sun Sep 22, 2024 9:02 pm
Full name: Philipp Enoeckl

Post by Philofive »

Thanks for all the answers, I understand it much more clearly now.

Basically, Mn/s is a bad metric because of the parallelization "overhead", but what is the theoretical worst outcome of that? Pretty bad, I guess.
I am asking because there are companies (I won't name any) that sell Mn/s, and I am pretty sure those Mn/s don't come from just 2 or 3 CPUs. Basically, what you are all saying is: you don't know how good these Mn/s are.

Has anybody evaluated these offers already?
Vinvin
Posts: 5287
Joined: Thu Mar 09, 2006 9:40 am
Full name: Vincent Lejeune

Post by Vinvin »

Philofive wrote: Wed Sep 25, 2024 3:27 pm Thanks for all the answers, I understand it much more clearly now.

Basically, Mn/s is a bad metric because of the parallelization "overhead", but what is the theoretical worst outcome of that? Pretty bad, I guess.
I am asking because there are companies (I won't name any) that sell Mn/s, and I am pretty sure those Mn/s don't come from just 2 or 3 CPUs. Basically, what you are all saying is: you don't know how good these Mn/s are.

Has anybody evaluated these offers already?
A lot of numbers in this thread: forum3/viewtopic.php?t=74188
Jouni
Posts: 3621
Joined: Wed Mar 08, 2006 8:15 pm
Full name: Jouni Uski

Post by Jouni »

But totally outdated, as it was done before NNUE.
Jouni