Is it actually weaker though?
https://images.anandtech.com/graphs/gra ... 119343.png
Floating-point performance on the M1 is best in class.
BetaPro wrote: ↑Wed Feb 10, 2021 2:38 pm
If you look at George's benchmark, the M1 dropped way more performance going from classical to NNUE than Intel. So good vector instructions seem more important than floating-point performance. ARM NEON is only 128 bits, compared to at least 256 bits in the x86 AVX extensions; that's probably why ARM is slower for NNUE.

That's totally irrelevant, though! The width of the vector instructions doesn't determine performance. In fact, if you could choose, you would choose size-1 or 32-bit vectors, because they are more flexible. What matters is how many instructions you can execute at once, and their latency/throughput. You could have 512-bit vectors but take 5 times as long to process them; you would just run slower. (This was the case on AMD Bulldozer for AVX code, and Zen 1 had a similar throughput disadvantage compared to Haswell and later.)
Gian-Carlo Pascutto wrote: ↑Wed Feb 10, 2021 3:10 pm
That's totally irrelevant, though! The width of the vector instructions doesn't determine performance. [...]

Wait, the floating-point tests in SPEC, how many of them actually use vector instructions? I thought most of them don't. If they do, then the M1 indeed has good SIMD performance, and the performance loss must have come from somewhere else; it could be worth doing more optimization.
M1 seems to have 4 full 128-bit NEON pipelines, compared to the 4 full 256-bit AVX2 pipelines in Zen2+ and Haswell+.
So it sounds as if Zen2 has twice the throughput. BUT: of those 4 pipelines, all 4 of them can do a multiply-add in the same cycle on the M1, whereas Zen2 can only do 2. That's why the M1 looks so competitive in the floating-point tests, instead of running at half the speed. (Also remember Zen3 can run at 5GHz single core, while the M1 is "only" 3.2GHz, yet it's still performing better in the test I linked!)
As for why it is a bit slower on the Stockfish NNUE code, that probably requires diving more deeply into the details. The relevant code might just not be as well optimized yet, either.
SPEC is pure C/C++/Fortran source code - it needs to be 100% portable and machine independent - but modern compilers have no problem vectorizing appropriately written code. I don't know how much this applies to SPECint2017, but I know that for SPECint2006 the Intel C compiler even vectorized this loop:
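The specific loop referred to here is not preserved in this thread. As a stand-in, here is a generic example of the kind of plain, portable C that modern compilers auto-vectorize at -O2/-O3 without any intrinsics in the source (`dot_i32` is an illustrative name, not taken from SPEC):

```c
#include <stddef.h>

/* A simple integer dot-product reduction: GCC and Clang turn this loop
 * into SIMD code at -O3, even though the source is 100% portable C. */
int dot_i32(const int *a, const int *b, size_t n) {
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```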
Sjeng and Leela in SPECint are both written by you, right? I thought that in chess, even though evals are done in floating point, they aren't a big part of the performance numbers; that's why they are in SPECint instead of SPECfp, right?

The eval in chess engines tends to be pure integer; I think even NNUE uses mostly 16-bit integers. For Leela it is a bit more complicated, but the version in SPEC is the MCTS engine, so there's no neural-network evaluation, and runtime is indeed dominated by integer performance.
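The "mostly 16-bit integers" point can be illustrated with a minimal sketch. This is not Stockfish's actual NNUE code, just the shape of the inner loop (`nnue_dot` is a hypothetical name):

```c
#include <stdint.h>
#include <stddef.h>

/* NNUE-style layers accumulate products of narrow integers (here int16
 * weights and activations) into a wide int32 accumulator - which is why
 * integer SIMD throughput (NEON/AVX2), not FP, dominates this workload. */
int32_t nnue_dot(const int16_t *w, const int16_t *x, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)w[i] * (int32_t)x[i];  /* widening multiply-accumulate */
    return acc;
}
```

With 128-bit NEON a vector register holds 8 int16 lanes, versus 16 lanes with 256-bit AVX2, which is the width gap BetaPro raised earlier in the thread.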
Gian-Carlo Pascutto wrote: ↑Tue Feb 09, 2021 9:56 pm
Performance of Stockfish dev from today looks like:

Zen1 1700X 3.2GHz:     1520000
Ivy Bridge 3.5GHz:     1760000
Haswell 3.4GHz:        1860000
M1 3.2GHz:             2390000
Zen2 3900X 3.8-4.2GHz: 2490000

This is on an M1 Air. I love the fact that it's 100% fanless.

Hi Giancarlo,
syzygy wrote: ↑Wed Dec 09, 2020 12:27 pm
Or I just spend my time wisely.

Hi! My intention here was just to point out the, in my view, unwarranted negativity. It reminds me of the initial reception in this forum of AlphaZero's chess results (and many here will probably still deny that Lc0 wouldn't have existed without DeepMind's papers).
twobeer wrote: ↑Wed Feb 10, 2021 7:39 pm
In a similar price range one could just get an Asus TUF Gaming A15 with an AMD Ryzen 9 4900H and an NVIDIA GeForce RTX 2060 that runs ALL chess programs much better than an inflexible, locked-down "appliance-book" from Apple. I am truly surprised people still throw money at "brand" recognition (they buy computers in the same mind-set as buying fashion) and at hype, over quality, performance, flexibility, and price/value.

No, the M1 is just that good, regardless of the brand on it. The 4900H you compare is a 45W chip (and still slower for single-core tasks!), and the RTX 2060 (mobile) is a 60-ish W card. The M1 is ~15W, total. You might appreciate the difference if you try to put both machines on your lap while using them.