Stockfish benchmark test

Philofive · Post by **Philofive** » Mon Sep 23, 2024 9:38 am

Hi there!

I am new here. I am a chess FM, but I never had anything to do with computer chess (well, I installed Stockfish and let it run for 5 seconds per move to analyse my games, which was more than sufficient).
However, things have changed now. A club colleague who is into correspondence chess gave away his old hardware because he bought a new machine, and I was quite surprised when I started the old one.

It is a dual 2699v4 Xeon with 256 GB ECC RAM, and the machine has never done anything other than run Stockfish 24/7 since 2016.

OK, I wiped the hard drive, installed Debian, installed ScidvsPC, and compiled Stockfish 17 and other engines. I just wanted to compare this old hardware from 2016, which was very expensive in its time, with my "normal" desktop computer from 2023 (AMD 7600, 32 GB RAM) when it comes to computer chess.

Here the trouble started. I realized that things are much more complicated than I thought and that I actually have no understanding of how things work. I also came to the conclusion that my colleague doesn't know much about benchmarking either, so I am asking here.

Now my question and i am just looking at one specific chess engine (stockfish 17)

I thought that, when comparing hardware, I just have to look at the Mn/s and that's it. Double the threads and you get nearly double the Mn/s, because Stockfish scales well with parallelization. So far so good. On my new hardware, HT increases the performance quite a bit; on the 2699 it looks like it doesn't. OK, fine, whatever.

But then I came across the Stockfish benchmark tool, because people claim you can't compare hardware just by looking at the Mn/s; you need to look at the "time-to-depth" as well, and the Stockfish benchmark is a good tool for both. (OK, of course the best tool is to let my two machines play some matches, but let's put that aside.)

so what i did is:

Code: Select all

stockfish bench 2048 11 25

on one machine (6 real cores) and

Code: Select all

stockfish bench 2048 43 25

on the other (44 cores), and then with hyperthreading:

Code: Select all

stockfish bench 2048 86 25

2048.... hash size
11 .... threads
25 .... fixed depth (btw, what is that exactly? In my understanding, it is just a depth for some relevant variations, but what is relevant and what isn't?)

Result AMD: (roughly 10 Mn/s on the starting position)

Total time (ms) : 59644
Nodes searched : 1111302912
Nodes/second : 18632266

Result Xeon with 43 Threads (roughly 20 Mn/s on the starting position):

===========================
Total time (ms) : 117966
Nodes searched : 5223385708
Nodes/second : 44278738

Result Xeon with 86 Threads (still roughly 20 Mn/s on the starting position):

Total time (ms) : 191927
Nodes searched : 10987894123
Nodes/second : 57250382
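As a sanity check on the three results above, the reported Nodes/second should follow from the node count and the total time, and dividing by the thread count gives an average speed per thread. A minimal Python sketch using only the figures posted here; the labels and function names are chosen for illustration, and rounding down is an assumption about how the bench prints the value:

```python
# Figures copied from the three bench runs above.
# Tuple layout: (threads, total time in ms, nodes searched, reported nodes/second).
# The dictionary keys are illustrative labels only.
RUNS = {
    "AMD 7600, 11 threads": (11, 59644, 1111302912, 18632266),
    "Xeon, 43 threads": (43, 117966, 5223385708, 44278738),
    "Xeon, 86 threads": (86, 191927, 10987894123, 57250382),
}


def nps_from_totals(time_ms: int, nodes: int) -> int:
    """Recompute nodes/second from the totals, rounding down (assumed)."""
    return nodes * 1000 // time_ms


def per_thread_nps(threads: int, nps: int) -> float:
    """Average nodes/second contributed by each search thread."""
    return nps / threads


if __name__ == "__main__":
    for label, (threads, time_ms, nodes, reported) in RUNS.items():
        recomputed = nps_from_totals(time_ms, nodes)
        print(f"{label}: reported {reported}, recomputed {recomputed}, "
              f"per thread {per_thread_nps(threads, reported):,.0f}")
```

The per-thread figure is only an average over the whole bench; it says nothing about how useful those nodes are for finding good moves.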

So what can i assume now?
The Xeon takes much longer to reach the depth, but analyses many more positions. HT somehow makes sense in the benchmarking tool, but not when I let it run on the starting position, and it even takes longer to reach a certain depth. I don't understand it.

So what is the best setting for the Xeon and how much stronger is it than the AMD? (Or is the latter a dumb question to ask without letting them play against each other?)

I know, a long post, thx a lot for at least reading it

Ras · Post by **Ras** » Mon Sep 23, 2024 10:08 am

Philofive wrote: ↑Mon Sep 23, 2024 9:38 am I thought, when comparing hardware, i just have to have a look on the Mn/s and thats it. Double threads and you get nearly double Mn/s, because stockfish scales well on parallelization.

Not all Mn/s are equal. The problem with parallel computing is redundant calculations so that you may see the same raw Mn/s, but the percentage of actually useful nodes may be quite different.

In a single core version, the engine can use data from previous calculations that don't exist yet in the parallel version because the workers start at the same time. So if you have a given total computing performance, what you want is to have that split up into as few cores as possible. A single 2699 is roughly equal to the 7600 in terms of total performance, but the 2699 splits that up into 22 cores while the 7600 only splits up into 6 cores. Hence, a single 2699 will play worse than a 7600. However, you're also adding a second 2699, so now you have twice the raw performance, but also split into 44 cores.

That's also the problem with hyperthreading. Yes, it does raise the raw Mn/s, but it also doubles the split (and the Mn/s don't even double).
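The bench figures posted earlier in this thread can be used to put rough numbers on both points. The sketch below is only arithmetic on those figures; the variable names are illustrative, and the node ratio is at best a loose hint at extra work, since a search with more threads does not explore exactly the same tree.

```python
# Figures copied from the bench results in the first post of this thread.
amd_11 = {"threads": 11, "time_ms": 59644, "nodes": 1111302912, "nps": 18632266}
xeon_43 = {"threads": 43, "time_ms": 117966, "nodes": 5223385708, "nps": 44278738}
xeon_86 = {"threads": 86, "time_ms": 191927, "nodes": 10987894123, "nps": 57250382}


def ratio(a: dict, b: dict, key: str) -> float:
    """Return a[key] / b[key] for two bench result records."""
    return a[key] / b[key]


# Nodes visited to finish the same depth-25 bench, relative to the 11-thread run.
nodes_43_vs_11 = ratio(xeon_43, amd_11, "nodes")
nodes_86_vs_11 = ratio(xeon_86, amd_11, "nodes")

# Hyperthreading on the Xeon: threads double (43 -> 86), raw speed does not.
ht_thread_factor = ratio(xeon_86, xeon_43, "threads")
ht_nps_gain = ratio(xeon_86, xeon_43, "nps")
ht_per_thread = ht_nps_gain / ht_thread_factor

if __name__ == "__main__":
    print(f"nodes, 43 threads vs 11 threads: {nodes_43_vs_11:.2f}x")
    print(f"nodes, 86 threads vs 11 threads: {nodes_86_vs_11:.2f}x")
    print(f"raw nps gain from 43 to 86 threads: {ht_nps_gain:.2f}x")
    print(f"per-thread speed after doubling threads: {ht_per_thread:.2f}x")
```

On these figures the raw speed grows far less than the thread count when going from 43 to 86 threads, which matches the remark that the Mn/s don't even double.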

Viz · Post by **Viz** » Mon Sep 23, 2024 10:38 am

Philofive wrote: ↑Mon Sep 23, 2024 9:38 am because people claim, you can't compare hardware just by looking at the Mn/s but you need to look on the "time-to-depth" as well

Time to depth is a completely useless metric that shows quite literally nothing, especially with multiple cores.
People who claim otherwise just don't have expertise on this topic.
MN/s is also a pretty mediocre metric, because at the same MN/s, the higher your thread count, the worse it actually plays. But it's still much better than time to depth, of course.

Jouni · Post by **Jouni** » Mon Sep 23, 2024 11:13 am

Question. SF wiki says: "Threads type spin default 1 min 1 max 1024
The number of CPU threads used for searching a position. For best performance, set this equal to the number of CPU cores available". But why do TCEC and CCC use hyperthreaded cores?!

Philofive · Post by **Philofive** » Mon Sep 23, 2024 12:07 pm

Viz wrote: ↑Mon Sep 23, 2024 10:38 am MN/s is also pretty mediocre metric

Ok, is there something like a gold standard then? Or does the "gold standard metric" simply not exist?

noobpwnftw · Post by **noobpwnftw** » Mon Sep 23, 2024 12:25 pm

The gold standard for chess performance is simply to have them play a statistically sound number of matches and see the results.

If that cannot be done, then it is more like trying to infer the performance of two athletes by comparing their height, weight and probably their lung capacity and muscle ratio. It will not get anywhere better than that, tbf.

If you are looking strictly for hardware differences, then comparing the MN/s of the same engine with exactly the same settings is a scientifically sound approach. (The catch is that different engines, or different versions of the same engine, do not have a comparable relationship between MN/s and playing strength.) If the machines have different numbers of cores, first compare them at a reasonable thread count that is available to both. After that, more cores are better, provided the machine with more cores was equal or better in that first comparison. Under this condition, MN/s translates directly to chess performance.
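One possible reading of this two-step rule, written out as a small Python sketch. The function name, its parameters and the example figures are hypothetical, and the "undecided" outcome is an assumption for the case the rule does not cover (the machine with more cores being slower at the common thread count).

```python
def compare_hardware(nps_a: float, nps_b: float, cores_a: int, cores_b: int) -> str:
    """Return "A", "B" or "undecided".

    nps_a and nps_b are nodes/second of the same engine with the same settings,
    measured at one thread count that both machines can provide.
    """
    if cores_a == cores_b:
        # Same core count: the common-thread-count speed decides directly.
        if nps_a == nps_b:
            return "undecided"
        return "A" if nps_a > nps_b else "B"

    # Different core counts: the bigger machine wins only if it is at least
    # as fast as the smaller one at the common thread count.
    if cores_a > cores_b:
        return "A" if nps_a >= nps_b else "undecided"
    return "B" if nps_b >= nps_a else "undecided"


if __name__ == "__main__":
    # Hypothetical measurements at a common thread count, not real data.
    print(compare_hardware(nps_a=2.0e6, nps_b=1.5e6, cores_a=6, cores_b=44))
    print(compare_hardware(nps_a=1.5e6, nps_b=2.0e6, cores_a=6, cores_b=44))
```

When the result is "undecided", the rule gives no answer, and playing matches is the remaining option.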

Viz · Post by **Viz** » Mon Sep 23, 2024 12:39 pm

Philofive wrote: ↑Mon Sep 23, 2024 12:07 pm
Viz wrote: ↑Mon Sep 23, 2024 10:38 am MN/s is also pretty mediocre metric
Ok, is there something like a gold standard then? Or does the "gold standard metric" simply not exist?

As noob said, the "golden standard" is more or less playing enough games to determine that one is better than another.
But that's probably not practical, especially if you plan to do somewhat long analysis. So it more or less comes down to experience and existing data on SF scalability with cores.
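What "enough games" means can be illustrated with the usual logistic Elo formula and a normal-approximation error margin. This is a rough Python sketch, not a replacement for a proper match tool; the win/draw/loss counts in the example are made up, and the function names are illustrative.

```python
import math


def elo_from_score(score: float) -> float:
    """Elo difference implied by an average score strictly between 0 and 1."""
    return 400.0 * math.log10(score / (1.0 - score))


def match_elo(wins: int, draws: int, losses: int, z: float = 1.96):
    """Return (elo, elo_low, elo_high) for a match result.

    The interval uses a normal approximation of the mean score per game;
    z = 1.96 corresponds to roughly 95% confidence.
    """
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    variance = (wins * (1.0 - score) ** 2
                + draws * (0.5 - score) ** 2
                + losses * (0.0 - score) ** 2) / games
    margin = z * math.sqrt(variance / games)
    return (elo_from_score(score),
            elo_from_score(score - margin),
            elo_from_score(score + margin))


if __name__ == "__main__":
    # Hypothetical 200-game match between two machines, not real data.
    elo, low, high = match_elo(wins=30, draws=150, losses=20)
    print(f"Elo difference: {elo:+.1f} (about 95% interval {low:+.1f} to {high:+.1f})")
```

In this made-up example the interval still includes zero, so 200 games would not be enough to say which machine is stronger.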

Philofive · Post by **Philofive** » Wed Sep 25, 2024 3:27 pm

Thx for all the answers, i understand it much more clearly now.

Basically Mn/s is a bad metric, because of the parallelization "overhead", but what is the theoretical worst outcome of that? Pretty bad, i guess.
I am asking, because there are companies (i won't say any names), which sell Mn/s and i am pretty sure, those Mn/s won't come from 2 or 3 CPUs. Basically what you are all saying is... You don't know, how good these Mn/s are.

Did somebody do an evaluation of these offers already?

Vinvin · Post by **Vinvin** » Wed Sep 25, 2024 9:25 pm

Philofive wrote: ↑Wed Sep 25, 2024 3:27 pm Thx for all the answers, i understand it much more clearly now.

Basically Mn/s is a bad metric, because of the parallelization "overhead", but what is the theoretical worst outcome of that? Pretty bad, i guess.
I am asking, because there are companies (i won't say any names), which sell Mn/s and i am pretty sure, those Mn/s won't come from 2 or 3 CPUs. Basically what you are all saying is... You don't know, how good these Mn/s are.

Did somebody do an evaluation of these offers already?

A lot of numbers in this thread : forum3/viewtopic.php?t=74188

Jouni · Post by **Jouni** » Wed Sep 25, 2024 11:36 pm

But totally outdated as done before NNUE.
