Apple M2

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Dann Corbit, Harvey Williamson

Sopel
Posts: 389
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Apple M2

Post by Sopel »

smatovic wrote: Fri Aug 05, 2022 10:12 am
dangi12012 wrote: Wed Jul 27, 2022 10:28 pm Are there any people here that have M1 and M2?

Any Stockfish NPS charts for the M2 chip?
There are M1 benches on Ipman Chess:

https://ipmanchess.yolasite.com/amd--in ... ckfish.php

and I assume you can inter/extrapolate SF NPS across M1 and M2 models by frequency (or power budget) of the power cores alone.
Those benches are from before I optimized Stockfish for M1. Those numbers are anywhere between 30% and 80% away from the current results. Still far behind Intel/AMD.
dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.

Maybe you copied your stockfish commits from someone else too?
I will look into that.
smatovic
Posts: 2572
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Apple M2

Post by smatovic »

Sopel wrote: Sat Aug 06, 2022 2:39 am ...
Sopel, people claim now and then that SF is simply not as optimized for M1 as it is for x86; others say it is the NEON SIMD unit that does not perform well for NNUE inference. Can you elaborate?

According to:

https://www.anandtech.com/show/16226/ap ... eep-dive/2
On the floating point and vector execution side of things, the new Firestorm cores are actually more impressive, as they offer a 33% increase in capabilities, enabled by Apple’s addition of a fourth execution pipeline. The FP rename registers here seem to land at 384 entries, which is again comparatively massive. The four 128-bit NEON pipelines thus on paper match the current throughput capabilities of desktop cores from AMD and Intel, albeit with smaller vectors. Floating-point operation throughput here is 1:1 with the pipeline count, meaning Firestorm can do 4 FADDs and 4 FMULs per cycle, with respectively 3 and 4 cycles latency. That’s quadruple the per-cycle throughput of Intel CPUs and previous AMD CPUs, and still double that of the recent Zen3, of course still running at lower frequency. This might be one reason why Apple does so well in browser benchmarks (JavaScript numbers are floating-point doubles).
M1 has four 128-bit NEON FP pipelines, and on paper should outperform Intel/AMD with 256-bit AVX2. I am not into the details of the INT8 instructions, cycles, and latencies, but maybe a word from you: is SF able to utilize all four NEON pipelines for NNUE inference, or maybe just one, and would the upcoming ARM SVE2 with a broader bit-width perform better?
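To make the "on paper" comparison concrete, here is a back-of-the-envelope sketch. The M1 figures are taken from the AnandTech quote above; the two-pipe 256-bit figure for AVX2 desktop cores and the `SimdConfig` type are illustrative assumptions, not measured values:

```cpp
// Back-of-the-envelope peak SIMD throughput per cycle. Pipeline counts
// and register widths are the quoted/assumed figures, not measurements.
struct SimdConfig {
    int pipelines;   // vector execution pipelines usable per cycle
    int width_bits;  // width of one vector register in bits
};

// Peak vector bits processed per cycle: pipelines times width.
constexpr int bits_per_cycle(SimdConfig c) {
    return c.pipelines * c.width_bits;
}

constexpr SimdConfig m1_neon   {4, 128};  // four 128-bit NEON pipes (quoted)
constexpr SimdConfig avx2_core {2, 256};  // two 256-bit SIMD pipes (assumed)
```

On these assumed figures both configurations land at 512 bits per cycle, which is the only sense in which M1 "matches" 256-bit AVX2 on paper; which integer instructions exist, their latencies, and how many of those pipes can actually execute the int8/int16 ops NNUE needs are exactly the open questions above.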

--
Srdja
Magnum
Posts: 162
Joined: Thu Feb 04, 2021 10:24 pm
Full name: Arnold Magnum

Re: Apple M2

Post by Magnum »

smatovic wrote: Sat Aug 06, 2022 5:52 am
[..]
M1 has four 128-bit NEON FP pipelines, and on paper should outperform Intel/AMD with 256-bit AVX2. [..]
NEON SIMD is great.

As far as I remember:
1. Stockfish is using only 2 pipelines. Developers need to change that.

2. AMD / Intel CPUs = 2x 256-bit
3. Apple M1 CPU = 4x 128-bit NEON FP pipelines.

But:
4. On AMD / Intel, Stockfish uses 2x 256-bit = 512 bits per cycle.
5. On M1, Stockfish uses 2x 128-bit = 256 bits per cycle.

6. If Stockfish could use all 4x 128-bit NEON FP pipelines, then we would also have 512, and then you can compare Apple vs AMD / Intel :D
Take a look at ipman chess :lol:

Even after that, the faster Stockfish on M1 will still not be as optimized as a heavily tuned Stockfish on AMD / Intel.
smatovic
Posts: 2572
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Apple M2

Post by smatovic »

Magnum wrote: Sat Aug 06, 2022 10:01 am NEON SIMD is great.

As far as I remember:
1. Stockfish is using only 2 pipelines. Developers need to change that.

2. AMD / Intel CPUs = 2x 256-bit
3. Apple M1 CPU = 4x 128-bit NEON FP pipelines.

But
4. Stockfish is using 2x 256-bit = 512
5. Stockfish is using 2x 128-bit = 256

6. If Stockfish could use 4x 128-bit NEON FP pipelines, then we would also have = 512 and now you can compare Apple vs AMD / Intel :D
Take a look at ipman chess :lol:

After that the much faster Stockfish on M1, than on AMD / Intel, will still not be optimized if you compare it to a heavily optimized AMD / Intel.
Well, that might fit if SF used FP (floating-point) math for NNUE inference, but AFAIK the first network layer is INT16 and the further layers are INT8. I am too lazy to dig up the specs for recent AVX2 and M1 NEON with the corresponding instructions, latencies, and throughput used for SF NNUE; maybe someone else can jump in and clarify?
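For illustration, a minimal scalar sketch of the integer data flow just described: an INT16 accumulator for the first (feature-transformer) layer, and INT8 inputs with INT32 accumulation for the later layers. The dimensions and the names `accumulate` / `dot_i8` are made up for the example; real networks are far larger and the real code is vectorized:

```cpp
#include <array>
#include <cstdint>

constexpr int kFeatures = 8;   // illustrative; real nets are much larger
constexpr int kHidden   = 4;

// First layer: add int16 weight columns for each active input feature.
std::array<int16_t, kHidden>
accumulate(const int16_t weights[kFeatures][kHidden],
           const std::array<bool, kFeatures>& active) {
    std::array<int16_t, kHidden> acc{};  // zero-initialized accumulator
    for (int f = 0; f < kFeatures; ++f)
        if (active[f])
            for (int h = 0; h < kHidden; ++h)
                acc[h] += weights[f][h];  // wide int16 additions
    return acc;
}

// Later layers: int8 activations times int8 weights, int32 accumulator.
int32_t dot_i8(const std::array<int8_t, kHidden>& x,
               const std::array<int8_t, kHidden>& w) {
    int32_t sum = 0;
    for (int i = 0; i < kHidden; ++i)
        sum += int32_t(x[i]) * int32_t(w[i]);
    return sum;
}
```

The point of the integer layout is that the SIMD question is about int16/int8 multiply-add throughput, not the FP pipes the AnandTech article benchmarks.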

--
Srdja
Sopel
Posts: 389
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Apple M2

Post by Sopel »

smatovic wrote: Sat Aug 06, 2022 5:52 am
[..]
M1 has four 128-bit NEON FP pipelines, and on paper should outperform Intel/AMD with 256-bit AVX2. [..] Is SF able to utilize all four NEON pipelines for NNUE inference, or maybe just one, and would the upcoming ARM SVE2 with a broader bit-width perform better?
Since I don't have an M1 device, I'm unable to answer this completely. The NEON optimizations are an attempt based on some profiling done by someone else, general good SIMD practice, and Apple's instruction latency/throughput reference. The instruction dependency chains are now pretty much as short as possible and should take advantage of even an impractically large number of execution units. I cannot say why exactly it's so slow, or whether the NNUE part is even the problem. One thing that stood out, though, is that for some reason the feature transformer part, which is just a bunch of very wide in-register i16 additions, appeared slow, possibly slower than the counterpart x86-64 code, but I don't have hard data on this. The floating-point units are not relevant to Stockfish.
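The dependency-chain point can be illustrated with a plain scalar reduction: with one accumulator every add waits on the previous one, while several independent accumulators let a superscalar core keep more execution pipes busy. This is a sketch of the general technique, not actual Stockfish code:

```cpp
#include <cstddef>
#include <cstdint>

// One accumulator: every addition depends on the previous result,
// forming a serial chain the CPU cannot overlap.
int32_t sum_serial(const int16_t* v, std::size_t n) {
    int32_t s = 0;
    for (std::size_t i = 0; i < n; ++i) s += v[i];
    return s;
}

// Four independent accumulators: up to four additions can be in
// flight at once on a core with enough execution units.
int32_t sum_unrolled(const int16_t* v, std::size_t n) {
    int32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; ++i) s0 += v[i];  // leftover tail
    return s0 + s1 + s2 + s3;
}
```

The same idea applies at the vector-register level: several independent SIMD accumulators per loop are what "short dependency chains" buys you on a wide core like Firestorm.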
Magnum wrote: Sat Aug 06, 2022 10:01 am
As far as I remember:
1. Stockfish is using only 2 pipelines. Developers need to change that.
That's not how any of this works
Magnum
Posts: 162
Joined: Thu Feb 04, 2021 10:24 pm
Full name: Arnold Magnum

Re: Apple M2

Post by Magnum »

Sopel wrote: Sat Aug 06, 2022 1:01 pm
smatovic wrote: Sat Aug 06, 2022 5:52 am
[..]
Since I don't have an M1 device I'm unable to answer this completely.
Solution:
Buy M1 or M1 Pro or M1 Max or M1 ULTRA or M2 or all.
wickedpotus
Posts: 136
Joined: Sun May 16, 2021 5:33 pm
Full name: Aron Rodgriges

Re: Apple M2

Post by wickedpotus »

Magnum wrote: Sat Aug 06, 2022 1:56 pm Buy M1 or M1 Pro or M1 Max or M1 ULTRA or M2 or all.
Solution: Stay far away from overpriced, locked-in, underperforming, oversold Apple stuff!!! Buy something less expensive that outperforms even the M1 Max for ALL strong chess engines, and get yourself a great GPU and some additional RAM + storage for the price difference.
smatovic
Posts: 2572
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Apple M2

Post by smatovic »

Magnum wrote: Sat Aug 06, 2022 1:56 pm
Sopel wrote: Sat Aug 06, 2022 1:01 pm [..]
Since I don't have an M1 device I'm unable to answer this completely.
[..]
Solution:
Buy M1 or M1 Pro or M1 Max or M1 ULTRA or M2 or all.
If Apple bought him a beer and sponsored team SF with some MacBooks, they could at least profile it themselves ;)

--
Srdja
Joost Buijs
Posts: 1562
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Apple M2

Post by Joost Buijs »

wickedpotus wrote: Sat Aug 06, 2022 3:23 pm
Magnum wrote: Sat Aug 06, 2022 1:56 pm Buy M1 or M1 Pro or M1 Max or M1 ULTRA or M2 or all.
Solution: Stay far away from overpriced, locked-in, underperforming, oversold Apple stuff!!! Buy something less expensive that outperforms even the M1 Max for ALL strong chess engines, and get yourself a great GPU and some additional RAM + storage for the price difference.
I fully agree!!! I will never buy anything from Apple; they lock you in a straitjacket with their software, and you have to pay double the price for under-performing hardware.
Modern Times
Posts: 3517
Joined: Thu Jun 07, 2012 11:02 pm

Re: Apple M2

Post by Modern Times »

Joost Buijs wrote: Sat Aug 06, 2022 5:04 pm
I fully agree!!! I will never buy anything from Apple, they lock you in a straitjacket with their software, and you have to pay double the price for under-performing hardware.
I totally agree as well.

What I find incredibly annoying is when Apple fanboys constantly push their fanaticism onto others, like religious zealots and extremists but far worse.

I do support freedom of choice, however. As long as you are happy with macOS, the premium pricing, and the restrictions imposed, the M1 and M2 are excellent chips for the majority of laptop use cases. In some specific tasks they are incredible performers.