Those are from before I optimized Stockfish for M1. These numbers are anywhere between 30% and 80% away from the current results. Still far behind Intel/AMD.
smatovic wrote: ↑Fri Aug 05, 2022 10:12 am There are M1 benches on Ipman Chess:
dangi12012 wrote: ↑Wed Jul 27, 2022 10:28 pm Are there any people here that have M1 and M2?
Any Stockfish NPS charts for the M2 chip?
https://ipmanchess.yolasite.com/amd--in ... ckfish.php
and I assume you can inter-/extrapolate SF NPS across M1 and M2 models from the frequency (or power budget) of the performance cores alone.
Apple M2
Moderators: hgm, Dann Corbit, Harvey Williamson
-
Sopel
- Posts: 389
- Joined: Tue Oct 08, 2019 11:39 pm
- Full name: Tomasz Sobczyk
Re: Apple M2
dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.
Maybe you copied your stockfish commits from someone else too?
I will look into that.
-
smatovic
- Posts: 2572
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: Apple M2
Sopel, people claim now and then SF is simply not that optimized for M1 as it is for x86, others say it is the NEON SIMD unit which does not perform for NNUE inference, can you elaborate?
According to:
https://www.anandtech.com/show/16226/ap ... eep-dive/2
M1 has four 128-bit NEON FP pipelines, and should outperform Intel/AMD with 256-bit AVX2 on paper. I am not into the details, INT8 instructions, cycles, latency, but maybe a word from you: is SF able to utilize all four NEON pipelines, or maybe just one, for NNUE inference, and would upcoming ARM SVE2 with a broader bit-width perform better?
On the floating point and vector execution side of things, the new Firestorm cores are actually more impressive, as they see a 33% increase in capabilities, enabled by Apple's addition of a fourth execution pipeline. The FP rename registers here seem to land at 384 entries, which is again comparatively massive. The four 128-bit NEON pipelines thus on paper match the current throughput capabilities of desktop cores from AMD and Intel, albeit with smaller vectors. Floating-point operation throughput here is 1:1 with the pipeline count, meaning Firestorm can do 4 FADDs and 4 FMULs per cycle, with respectively 3 and 4 cycles latency. That's quadruple the per-cycle throughput of Intel CPUs and previous AMD CPUs, and still double that of the recent Zen 3, of course still running at lower frequency. This might be one reason why Apple does so well in browser benchmarks (JavaScript numbers are floating-point doubles).
--
Srdja
-
Magnum
- Posts: 162
- Joined: Thu Feb 04, 2021 10:24 pm
- Full name: Arnold Magnum
Re: Apple M2
NEON SIMD is great.
smatovic wrote: ↑Sat Aug 06, 2022 5:52 am Sopel, people claim now and then SF is simply not that optimized for M1 as it is for x86, others say it is the NEON SIMD unit which does not perform for NNUE inference, can you elaborate? [...]
As far as I remember:
1. Stockfish is using only 2 pipelines. Developers need to change that.
2. AMD / Intel CPUs = 2x 256-bit
3. Apple M1 CPU = 4x 128-bit NEON FP pipelines.
But
4. On AMD / Intel, Stockfish uses 2x 256-bit = 512 bits per cycle
5. On Apple M1, Stockfish uses 2x 128-bit = 256 bits per cycle
6. If Stockfish could use all 4x 128-bit NEON FP pipelines, we would also get 512, and then you could compare Apple vs AMD / Intel on equal footing.
Take a look at ipman chess.
Even then, Stockfish on M1, while faster than on AMD / Intel, would still not be optimized if you compare it to the heavily optimized AMD / Intel build.
-
smatovic
- Posts: 2572
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: Apple M2
Well, that might fit if SF used FP (floating-point) math for NNUE inference, but AFAIK the first network layer is INT16 and the further ones INT8. I am too lazy to dig up the specs for recent AVX2 and M1 NEON with the relevant instructions, latencies and throughput used for SF NNUE. Maybe someone else can jump in and clarify?
Magnum wrote: ↑Sat Aug 06, 2022 10:01 am NEON SIMD is great. [...]
--
Srdja
-
Sopel
- Posts: 389
- Joined: Tue Oct 08, 2019 11:39 pm
- Full name: Tomasz Sobczyk
Re: Apple M2
Since I don't have an M1 device I'm unable to answer this completely. The NEON optimizations are an attempt based on some profiling done by someone else, general good SIMD practices, and Apple's instruction latency/throughput reference. The instruction dependency chains are now pretty much as short as possible and should take advantage of an impractically large number of execution units. I cannot say why exactly it's so slow, or whether it's even the NNUE part that is the problem. One thing that stood out, though, is that for some reason the feature transformer part, which is just a bunch of very wide in-register i16 additions, appeared slow, possibly slower than the counterpart x86-64 code, but I don't have hard data on this. The floating-point units are not relevant to Stockfish.
smatovic wrote: ↑Sat Aug 06, 2022 5:52 am Sopel, people claim now and then SF is simply not that optimized for M1 as it is for x86, others say it is the NEON SIMD unit which does not perform for NNUE inference, can you elaborate? [...]
That's not how any of this works
-
Magnum
- Posts: 162
- Joined: Thu Feb 04, 2021 10:24 pm
- Full name: Arnold Magnum
Re: Apple M2
Sopel wrote: ↑Sat Aug 06, 2022 1:01 pm Since I don't have an M1 device I'm unable to answer this completely. [...]
Solution:
Buy M1 or M1 Pro or M1 Max or M1 ULTRA or M2 or all.
-
wickedpotus
- Posts: 136
- Joined: Sun May 16, 2021 5:33 pm
- Full name: Aron Rodgriges
Re: Apple M2
Solution: Stay far away from overpriced, locked-in, underperforming, oversold Apple stuff!!! Buy something less expensive that outperforms even the M1 Max for ALL strong chess engines, and get yourself a great GPU and some additional RAM + storage for the price difference.
-
smatovic
- Posts: 2572
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: Apple M2
If Apple bought him a beer and sponsored team SF with some MacBooks, they could at least profile it themselves.
--
Srdja
-
Joost Buijs
- Posts: 1562
- Joined: Thu Jul 16, 2009 10:47 am
- Location: Almere, The Netherlands
Re: Apple M2
I fully agree!!! I will never buy anything from Apple; they lock you into a straitjacket with their software, and you have to pay double the price for under-performing hardware.
wickedpotus wrote: ↑Sat Aug 06, 2022 3:23 pm Solution: Stay far away from overpriced, locked-in, underperforming Apple stuff! [...]
-
Modern Times
- Posts: 3517
- Joined: Thu Jun 07, 2012 11:02 pm
Re: Apple M2
I totally agree as well.
Joost Buijs wrote: ↑Sat Aug 06, 2022 5:04 pm I fully agree!!! I will never buy anything from Apple, they lock you in a straitjacket with their software, and you have to pay double the price for under-performing hardware.
What I find incredibly annoying is when Apple fanboys constantly push their fanaticism onto others, like religious zealots and extremists but far worse.
I do support freedom of choice, however. As long as you are happy with macOS, the premium pricing and the restrictions imposed, the M1 and M2 are excellent chips for the majority of laptop use cases. In some specific tasks they are incredible performers.