M1 Apple Silicon for Chess?

acepoint_de
Posts: 86
Joined: Tue Jun 11, 2013 1:14 am

Re: M1 Apple Silicon for Chess?

Post by acepoint_de »

Milos wrote: Thu Apr 08, 2021 3:53 pm Title...
https://samanthanorth.com/8-common-onli ... ndle-them/
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: M1 Apple Silicon for Chess?

Post by mar »

Alayan wrote: Tue Apr 06, 2021 12:56 am Many claims about the superiority of the M1's as a CPU are exaggerated. I argued earlier how the power efficiency comparisons with desktop x86 CPUs are highly misleading.
I don't like apple either (from a dev's perspective), in general I don't like how the company behaves.
but I'm really impressed with M1 so far.

so - how are they exaggerated? I saw what I saw - my engine runs 1.5+ times faster on M1 (single thread) than on my 2700X and yes it's a desktop CPU and yes I'm talking 15W M1 vs 100W desktop. that's a fact - that M1 is 5nm and my 2700x is 12nm is not my concern.

the power consumption would be backed by how long the battery seems to last - much longer than my old mbp/intel-based laptop for sure, even though it's still new.

from what I saw Zen3 should be even faster than that (at least for SF) - but again it's a desktop CPU.
the "claims" are probably related to intel mobile chips, which are a pile of crap compared to M1, whether you like it or not. M1 should easily smoke my i5 laptop by a factor of 3, I'm pretty sure of that.

also - the x86 instruction set is ancient ugly bloated junk as well, aarch64 has a slim, modern instruction set (except for the encoding of bit mask immediates, that's just silly - also a lack of ctz where you have to rbit + clz instead seems curious, but probably not a big deal).
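For illustration, the rbit + clz combination mentioned above can be sketched as follows. This is only a Python model of the idea, not real AArch64 code; the helper names `rbit64`, `clz64` and `ctz64` are made up for this sketch, and a 64-bit operand (e.g. a bitboard) is assumed.

```python
MASK64 = (1 << 64) - 1  # assumption: we model 64-bit registers


def rbit64(x):
    """Reverse the bit order of a 64-bit value (models what RBIT does)."""
    return int(format(x & MASK64, "064b")[::-1], 2)


def clz64(x):
    """Count leading zero bits of a 64-bit value (models what CLZ does)."""
    return 64 - (x & MASK64).bit_length()


def ctz64(x):
    """Count trailing zeros: reverse the bits, then count leading zeros."""
    return clz64(rbit64(x))


# The lowest set bit of a bitboard is the usual use case in an engine.
bitboard = 0b101000
lowest_square = ctz64(bitboard)
```

Reversing the bits turns the lowest set bit into the highest one, so counting leading zeros afterwards gives the trailing-zero count; for a zero input this model returns 64.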

I'm also impressed with the performance of Rosetta 2, because most of the apps run nearly at full speed, which is something, I predicted a factor of 2+ slower because of CPU emulation, but I was wrong. also note that ARM has a weaker memory model than x86, so the emulation should suffer from that as well (also: no long immediates, limited branch range etc.)

as for C++ compile times, well... my 200k sloc project compiles in ~30 seconds on my desktop and about the same time on M1. however, that's not a fair comparison.
first - because that'd be comparing msvc vs clang and second - because xcode actually builds two binaries at once, embedded in one: x64 and arm64, so that's essentially 2x the amount of work! (and no, you can't even reuse the AST due to different preprocessor conditionals)

I'm much less impressed with M1 GPU though - I suspect apple is deliberately throttling GL apps (conspiracy :)
surely fillrate/pixel shading perf has exactly zero to do with low level graphics API, yet even with shadertoy I only get half the expected perf (small wonder because WebGL is still GL)
still - way better than any intel integrated GPUs for sure
Martin Sedlak
Ras
Posts: 2487
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: M1 Apple Silicon for Chess?

Post by Ras »

mar wrote: Thu Apr 08, 2021 5:07 pm I'm also impressed with the performance of Rosetta 2, because most of the apps run nearly at full speed, which is something, I predicted a factor of 2+ slower because of CPU emulation, but I was wrong.
That's because Rosetta 2 is not an emulation, but a translation. Means, this is done once at installation, not during the program's runtime.
Rasmus Althoff
https://www.ct800.net
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: M1 Apple Silicon for Chess?

Post by mar »

Ras wrote: Thu Apr 08, 2021 6:55 pm
mar wrote: Thu Apr 08, 2021 5:07 pm I'm also impressed with the performance of Rosetta 2, because most of the apps run nearly at full speed, which is something, I predicted a factor of 2+ slower because of CPU emulation, but I was wrong.
That's because Rosetta 2 is not an emulation, but a translation. Means, this is done once at installation, not during the program's runtime.
I'm not so sure, it worked for my program that I compiled on my intel-based mac and it ran flawlessly, no installation.
note that my program does some JITting (=generates x86 machine code at runtime), so it had to be emulated for sure.

I've seen an interesting project however (forgot its name) where they ran an analysis on an x86 binary, converted it to LLVM bitcode and retargeted it. won't work for dynamic codegen, but still pretty impressive
Martin Sedlak
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: M1 Apple Silicon for Chess?

Post by Milos »

mar wrote: Thu Apr 08, 2021 5:07 pm
Alayan wrote: Tue Apr 06, 2021 12:56 am Many claims about the superiority of the M1's as a CPU are exaggerated. I argued earlier how the power efficiency comparisons with desktop x86 CPUs are highly misleading.
I don't like apple either (from a dev's perspective), in general I don't like how the company behaves.
but I'm really impressed with M1 so far.

so - how are they exaggerated? I saw what I saw - my engine runs 1.5+ times faster on M1 (single thread) than on my 2700X and yes it's a desktop CPU and yes I'm talking 15W M1 vs 100W desktop. that's a fact - that M1 is 5nm and my 2700x is 12nm is not my concern.

the power consumption would be backed by how long the battery seems to last - much longer than my old mbp/intel-based laptop for sure, even though it's still new.

from what I saw Zen3 should be even faster than that (at least for SF) - but again it's a desktop CPU.
the "claims" are probably related to intel mobile chips, which are a pile of crap compared to M1, whether you like it or not. M1 should easily smoke my i5 laptop by a factor of 3, I'm pretty sure of that.
Of course there is an equivalent Zen3 for mobile with comparable if not better power consumption (same TDP and even lower peak power): the Cezanne APUs. Take a look at the Ryzen 5 5600U and Ryzen 7 5800U.
For single-core chess they are at worst a tad slower, but in multicore even the 6-core 5600U is much faster, while the 8-core 5800U is in a totally different category.
On top of that, the GPU performance of the Cezanne chips is much better than that of the M1 chip.
The only real advantage Apple has is preferential treatment by TSMC. They are at 5nm node compared to AMD's 7nm and they managed to secure a sufficient number of wafers for their products, which seems to be almost impossible these days.
Believing that there is some magical advantage of ARM architecture compared to x86 is pure BS. Architecturally I believe AMD's design is much more advanced. If they were on the same process node they'd be far superior in terms of processing power/W that today is probably equal.
MikeB
Posts: 4889
Joined: Thu Mar 09, 2006 6:34 am
Location: Pen Argyl, Pennsylvania

Re: M1 Apple Silicon for Chess?

Post by MikeB »

I'm fine with the way this thread has gone so far. Some engine stuff, some Apple discussion.

I was all set to get a new Apple with the M1, but the more I thought about it, the more of a pain it looked like it would be for me and how I would want to use it. New chip, new OS: being an early adopter here could be a royal pain, so I'll wait for the m1-ver2 chip.

My $.02
Apple makes very good products, but they are a little pricey
very easy to use for most non-technical people; that's not always true of other manufacturers or OSes
the hardware is usually exceptionally well built and is meant to last, but they drop support way too early
Apple support is very good; I have used it for both business and personal items
they stand by their products, and the support is much better than that of wireless carriers - I always buy from Apple direct
resale value holds very well...

my biggest pet peeve: they keep a really tight grip on their hardware and software, which is not as fun/friendly for the more technically sophisticated user as more open systems are
their high-end desktop, the Mac Pro, is ridiculously priced
their fanboy base can be as obnoxious as any...

with that said, if they ever come out with an all-electric car, I will be tempted, but probably priced out ...
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: M1 Apple Silicon for Chess?

Post by mar »

Milos wrote: Thu Apr 08, 2021 8:00 pm Believing that there is some magical advantage of ARM architecture compared to x86 is pure BS. Architecturally I believe AMD's design is much more advanced. If they were on the same process node they'd be far superior in terms of processing power/W that today is probably equal.
why bs? 2x more regs, trivial to decode instructions that never span pages, some fancy ideas like calls via link register (why touch stack at all),
ldp/stp that load/store register pairs, special addressing modes for preincrement and postincrement and so on.

plus I assume that the overall complexity of the chip should be much lower for arm64? (am I wrong?)

I'm not sure if all this helps in HW design, but I believe it might?

x86 is heinously complicated while dragging a lot of historically useless baggage around, like XLAT, AAM, ...
do modern x86-64 chips still support legacy 16-bit mode? I wouldn't be surprised if they did.
Martin Sedlak
Ras
Posts: 2487
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: M1 Apple Silicon for Chess?

Post by Ras »

mar wrote: Thu Apr 08, 2021 7:26 pm note that my program does some JITting (=generates x86 machine code at runtime), so it had to be emulated for sure.
Rosetta 2 does at installation time what's possible to do then, but it has also a JIT part that avoids runtime emulation at the cost of having to do that upon each start of the program.

An actual emulation would be quite a bit slower. I remember when Intel tried to push x86 Android devices which had a software emulation (not translation like Rosetta 2) to run ARM binaries, i.e. the other way around. I measured the emulated performance at 50% of using the same binary (my engine) on the same device but compiled directly for x86.

Milos wrote: Thu Apr 08, 2021 8:00 pm The only real advantage Apple has is preferential treatment by TSMC. They are at 5nm node compared to AMD's 7nm
Also remember that the M1 uses 4600 MHz RAM - which has nothing to do with the CPU architecture. My 4700U laptop scores pretty much on the high end among other 4700U machines on Geekbench, and that's because mine is built to order with 3200 MHz RAM instead of being ready-built by a vendor who might cheap out. (Note: what exactly Geekbench measures is debatable, but that's irrelevant for this point.)

This also explains part of the M1 GPU performance. AMD doesn't put in more iGPU horsepower such as RDNA2 because it would be pointless, given that Vega is already memory bottlenecked, and AMD doesn't officially support 4600 RAM speed because few people even buy that, given the memory cost, and it would make the CPU's memory controller more expensive.
Milos wrote: Thu Apr 08, 2021 8:00 pm Believing that there is some magical advantage of ARM architecture compared to x86 is pure BS.
Especially when considering that the decoder already took up an irrelevant part of the die 20 years ago, and that x86 CPUs haven't had an actual x86 CPU architecture since the 80486 or so. x86 assembly is indeed ugly, but that's just irrelevant. Also, the x86 ISA does not cost more energy. What does cost more energy is the rich set of features such as PCIe support and stuff.

Also, the x86 Android devices I mentioned above were actually competitive with the ARM ones both in terms of native performance and energy. What killed them is that they were "only" on a par, but that doesn't cut it when you're that late to the party, and the real world performance with the necessary software emulation dropped to 50%. That triggered a vicious cycle where nobody bought x86 Android devices and SW devs didn't bother to do native compiles.
MikeB wrote: Thu Apr 08, 2021 8:21 pm the hardware is usually exceptionally well built
Actually not. Misdesigned hardware that can't even be easily repaired is common with Apple, and after warranty, their customer support is even worse. If you don't believe it, check out Louis Rossmann on YT who repairs these things for a living (https://www.youtube.com/watch?v=AUaJ8pDlxi8).
Rasmus Althoff
https://www.ct800.net
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: M1 Apple Silicon for Chess?

Post by Milos »

mar wrote: Thu Apr 08, 2021 8:37 pm
Milos wrote: Thu Apr 08, 2021 8:00 pm Believing that there is some magical advantage of ARM architecture compared to x86 is pure BS. Architecturally I believe AMD's design is much more advanced. If they were on the same process node they'd be far superior in terms of processing power/W that today is probably equal.
why bs? 2x more regs, trivial to decode instructions that never span pages, some fancy ideas like calls via link register (why touch stack at all),
ldp/stp that load/store register pairs, special addressing modes for preincrement and postincrement and so on.

plus I assume that the overall complexity of the chip should be much lower for arm64? (am I wrong?)

I'm not sure if all this helps in HW design, but I believe it might?

x86 is heinously complicated while dragging a lot of historically useless baggage around, like XLAT, AAM, ...
do modern x86-64 chips still support legacy 16-bit mode? I wouldn't be surprised if they did.
I think you are looking at things too much from a software point of view and reading too much into the instruction set. It's the mindset of most ppl that still think of RISC and CISC architectures as two monolithic concepts. What really matters is the microarchitecture. Because under the hood, when you decode your x86 instructions into micro-ops, you end up with something very similar to ARM. What then makes a difference is branch prediction bandwidth and quality (misprediction recovery), optimization of prefetching, depth of pipelines, micro-op cache size, dedicated branch pipes, load/store bandwidth (and their respective micro-ops optimization), memory dependence detection, SIMD support and bandwidth, all those floating point and integer unit implementation tricks (INT/FP dispatch size, physical register file size, internal schedulers and branch prediction, hardware support for special dedicated ops, etc, etc).
The assumption that a compiler (or a programmer) would do a better job with the simpler ARM instructions than the hardware-automated internal micro-op prefetching, decoding, parallelization and optimization in modern x86 is IMO inherently problematic.
Alayan
Posts: 550
Joined: Tue Nov 19, 2019 8:48 pm
Full name: Alayan Feh

Re: M1 Apple Silicon for Chess?

Post by Alayan »

mar wrote: Thu Apr 08, 2021 5:07 pm
Alayan wrote: Tue Apr 06, 2021 12:56 am Many claims about the superiority of the M1's as a CPU are exaggerated. I argued earlier how the power efficiency comparisons with desktop x86 CPUs are highly misleading.
I don't like apple either (from a dev's perspective), in general I don't like how the company behaves.
but I'm really impressed with M1 so far.

so - how are they exaggerated? I saw what I saw - my engine runs 1.5+ times faster on M1 (single thread) than on my 2700X and yes it's a desktop CPU and yes I'm talking 15W M1 vs 100W desktop. that's a fact - that M1 is 5nm and my 2700x is 12nm is not my concern.
You compare the 1-thread performance, for which the number of cores is irrelevant, but you compare the full-chip TDP, which covers 4 Firestorm cores (plus 4 Icestorm cores that are power efficient but irrelevant to the performance discussion), with that of an 8-core CPU. Because the 2700X has to drive twice as many cores as the M1 (it also needs to handle a full SMT load with 16 threads, which is about 30% more power & perf at the cost of peak per-thread perf; for the sake of the discussion let's say that cancels out with the Icestorm cores), your "100W desktop vs 15W M1" comparison is skewed by a factor of 2.
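A rough sketch of that normalization, using only the round numbers from this thread (100 W for the 8-core 2700X, 15 W for the M1 with its 4 Firestorm cores). These are the figures quoted in the discussion, not measurements, and the helper name `watts_per_core` is made up for the example; SMT and the Icestorm cores are ignored, as assumed above.

```python
def watts_per_core(chip_watts, big_cores):
    """Naive per-core power budget: full-chip power split evenly over the big cores."""
    return chip_watts / big_cores


# Figures as quoted in the thread (assumptions, not measurements).
desktop_watts, desktop_cores = 100, 8   # 2700X
m1_watts, m1_cores = 15, 4              # M1, Firestorm cores only

# Comparing full-chip power for a single-thread workload:
raw_ratio = desktop_watts / m1_watts

# Comparing the power budget available per big core instead:
per_core_ratio = (watts_per_core(desktop_watts, desktop_cores)
                  / watts_per_core(m1_watts, m1_cores))

# How much the full-chip comparison overstates the gap:
skew = raw_ratio / per_core_ratio
```

With these inputs the full-chip ratio is about 6.7x while the per-core ratio is about 3.3x, so the skew comes out at 2, which is just the ratio of the core counts.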

A factor of 2 in power efficiency is huge.

The other major issue is that laptop chips and desktop chips have vastly different design priorities. Desktop designs are tuned for performance, power be damned, because for most desktop buyers power consumption is not a major consideration. For laptops, power consumption is extremely important. Laptop chips are always going to come out more power efficient when it comes to "perf at TDP".

Here is my original post on the topic of comparing power efficiency:
Alayan wrote: Tue Dec 08, 2020 1:14 am Even Apple apologists don't claim more than the 3-4x perf-per-watt that you get when comparing a mobile M1 designed for low power consumption with a stock desktop x86 CPU designed for max performance. 10x is fantasy.

And this way of comparing perf-per-watt is completely wrong when it comes to architectural comparisons.

This is like when Nvidia compared an undervolted and underclocked RTX 3080 to a RTX 2080 at iso-performance, putting them at very different points of their efficiency curve, to claim a 2x perf-per-watt advantage for Ampere compared to Turing. The real improvement from architectural changes and process nodes was somewhere around 20% (GPUs going wider make it hard to normalize).

The proper measure of performance per watt is performance at iso-wattage. A laptop chip or a server chip, for example, isn't designed around a target level of performance with the wattage adjusted afterwards. It is designed around a target wattage and tries to get the highest level of performance possible within that power consumption limit, which often involves finding ways to reduce power consumption.

This is even more relevant for CPUs where normalization is easy by looking at single-core performance. Comparison at iso-wattage is much more resilient to the distortions of perf/watt introduced by the efficiency curve. Increasing clocks increases power consumption by itself, but also increases the required voltage increasing power consumption once again.
Power consumption is proportional to clocks and quadratic to voltage. If you have to increase voltage by 10% to get clocks 10% higher, that's less than 10% perf (clock scaling is always less than perfect because of RAM) for over 33% power consumption increase.
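The arithmetic behind that 33% figure, as a small sketch. It assumes the simple scaling rule stated above (power proportional to clock and to voltage squared) and nothing else; `relative_power` is a name invented for the example.

```python
def relative_power(clock_scale, voltage_scale):
    """Relative dynamic power under the rule P ~ f * V^2 (simplified model)."""
    return clock_scale * voltage_scale ** 2


# +10% clocks that require +10% voltage:
power_scale = relative_power(1.10, 1.10)
power_increase_percent = (power_scale - 1.0) * 100
```

Here `power_scale` works out to 1.1 * 1.21 = 1.331, i.e. roughly 33% more power for at most 10% more clocks.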
Alayan wrote: Tue Dec 08, 2020 1:14 am Chasing the last few percent of performance makes power consumption explode.

A Zen 3 core with reduced clocks and voltage to use similar power to a M1 core ends up having 20-25% less performance (25-33% perf advantage for the M1). A sizable advantage for Apple, but not a mindblowing difference that would condemn x86 CPUs to irrelevance. A lot of Apple's advantage is the fruit of them hiring a lot of very highly skilled engineers and giving them the transistor budget to do something great (Apple designs are not focused on minimizing die space), this is not by itself a proof that ARM is significantly better than x86.
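How "20-25% less performance" turns into a "25-33% advantage" for the other side can be checked with a couple of lines. The function name `advantage_from_deficit` is hypothetical; the only inputs are the percentages from the paragraph above.

```python
def advantage_from_deficit(deficit):
    """If chip A has `deficit` (a fraction) less performance than chip B,
    return B's advantage over A as a fraction."""
    return 1.0 / (1.0 - deficit) - 1.0


# Zen 3 core 20% and 25% slower than an M1 core at similar power:
m1_advantage_low = advantage_from_deficit(0.20)
m1_advantage_high = advantage_from_deficit(0.25)
```

The two results are 0.25 and about 0.333: the same gap looks bigger when it is expressed relative to the slower chip.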

It would be tempting to correct for Apple's 5nm vs 7nm node advantage by applying the TSMC claim of 30% power reduction to the Zen3 core, getting a similar perf/watt, but that would be a mistake because it would break the iso-power comparison. The performance of a Zen3 core with a third more power than a M1 core is a better comparison, and the Zen3 core still loses, but by a thinner margin than when not correcting for process node.

You'll notice that Apple's own Icestorm cores have a better perf/watt than their Firestorm cores, yet the hype comes from the big Firestorm cores. This is also a reminder that the absolute level of performance reached is, for many workloads, very important. Perf/watt is not the only relevant factor for a responsiveness-oriented workload, as opposed to an instance-parallelizable throughput-oriented workload like many server tasks are.

If one design peaks at 1.00 perf at 5W but can't reach more performance at 15W, and another design gets 0.9 perf at 5W but peaks at 1.1 perf at 15W, the first design is the better choice if you have a 5W budget but the second one is better if you have a 15W budget.
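The two hypothetical designs from that example, written out as a tiny lookup. The labels "A" and "B" and the helper `best_design` are invented for the sketch; the perf numbers are the made-up ones from the paragraph, not data for any real chip.

```python
# Perf reached by each hypothetical design at a given power budget (watts).
designs = {
    "A": {5: 1.00, 15: 1.00},  # peaks early, gains nothing from more power
    "B": {5: 0.90, 15: 1.10},  # slower at 5 W, keeps scaling up to 15 W
}


def best_design(budget_watts):
    """Pick the design with the highest perf at the given power budget."""
    return max(designs, key=lambda name: designs[name][budget_watts])


choice_at_5w = best_design(5)
choice_at_15w = best_design(15)
```

The ranking flips with the budget, which is why a single perf/watt number taken at one operating point says little about which design is "better".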
Now that the Zen 3 mobile chips (5600U/5800U) are coming out, you should soon be able to look at benchmarks for 15W parts to compare with the M1, and the M1 is not going to be 3 times as fast as those in any workload. It will have a perf advantage when few threads are loaded (but enough of them that the Cezanne chips can't use most of their TDP to boost 1 core high), and it will lose on peak throughput (say, running games for fishtest) because it has fewer cores.

Regarding this:
mar wrote: Thu Apr 08, 2021 5:07 pm I'm also impressed with the performance of Rosetta 2, because most of the apps run nearly at full speed, which is something, I predicted a factor of 2+ slower because of CPU emulation, but I was wrong. also note that ARM has a weaker memory model than x86, so the emulation should suffer from that as well (also: no long immediates, limited branch range etc.)
The M1 chip has special hardware to also support strong memory ordering like x86's, which is a big part of why the emulation performance is quite good. That was a good design choice by Apple engineers.