AVX v AVX 2 vs AVX 512 - Engine Analysis

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

CornfedForever
Posts: 646
Joined: Mon Jun 20, 2022 4:08 am
Full name: Brian D. Smith

AVX v AVX 2 vs AVX 512 - Engine Analysis

Post by CornfedForever »

As the upcoming 13th-generation Intel CPUs are doing away with AVX-512 altogether, and the upcoming AMD Ryzen chips (both to be released around the same time) are set to include the AVX-512 instruction set, I am wondering just what the difference is in strength of analysis (3 sec, 5 min, overnight, etc.) of a move as concerns engines like Dragon / Komodo / Stockfish.

I seem to recall Larry Kaufman (a year or so ago) thinking the difference between 256 and 512 was not very much, but I do not think I have ever seen anyone try to quantify what the differences might be. I know 'more is better', but I'm not sure how that corresponds to real-world chess analysis.

Curious, as I am about to get a new computer and expect to run engines on it quite often for analysis of lines (primarily) or games. I am currently looking at a setup with a Ryzen 5950X CPU.
Raphexon
Posts: 476
Joined: Sun Mar 17, 2019 12:00 pm
Full name: Henk Drost

Re: AVX v AVX 2 vs AVX 512 - Engine Analysis

Post by Raphexon »

Use whatever's fastest.
Chessqueen
Posts: 5685
Joined: Wed Sep 05, 2018 2:16 am
Location: Moving
Full name: Jorge Picado

Re: AVX v AVX 2 vs AVX 512 - Engine Analysis

Post by Chessqueen »

Raphexon wrote: Sun Jul 10, 2022 9:47 pm Use whatever's fastest.
And which is the fastest ?
smatovic
Posts: 3169
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: AVX v AVX 2 vs AVX 512 - Engine Analysis

Post by smatovic »

We won't be able to tell until benchmarks with optimized engines are out, because in the past there were downclocking issues with broader bit-widths, and we don't know for sure whether the AVX-512 instruction set is supported only via AVX2-style 256-bit units or via a real 512-bit wide vector unit; time will tell. Sopel et al. could estimate the max NPS speedup possible with AVX-512, maybe 1.25x to 1.5x, depending on engine and implementation.

--
Srdja
Magnum
Posts: 195
Joined: Thu Feb 04, 2021 10:24 pm
Full name: Arnold Magnum

Re: AVX v AVX 2 vs AVX 512 - Engine Analysis

Post by Magnum »

Chessqueen wrote: Mon Jul 11, 2022 12:07 am
Raphexon wrote: Sun Jul 10, 2022 9:47 pm Use whatever's fastest.
And which is the fastest ?
It’s your self compiled version.

Or use a simple MacBook Pro M1 Max.
smatovic
Posts: 3169
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: AVX v AVX 2 vs AVX 512 - Engine Analysis

Post by smatovic »

Anandtech on AMD Zen 4 Ryzen and AVX-512:
https://www.anandtech.com/show/17552/am ... ng-sept-27
[...]
Papermaster also confirmed for the first time that Zen 4 – including the Ryzen 7000 series – will support AVX-512 instructions. AVX-512 is a bit of a mess of standards, so besides the foundation (AVX-512F) instructions, it’s still not entirely clear which subsets of AVX-512 AMD will support. But Papermaster did explicitly mention Vector Neural Network Instructions (VNNI) as among the additional subsets supported.

Critically, however, AMD is diverging from Intel in one important aspect: whereas Intel built a true, 512-bit wide SIMD machine for executing AVX-512 instructions, AMD did not. Instead, AMD will be executing these instructions over two cycles. This means AMD’s implementation still benefits from all of the additional instructions, register file space, and other technical improvements that came as part of AVX-512, but they won’t gain the innate doubling in SIMD throughput.

In discussing the rationale for AMD’s decision, Papermaster cited the extreme power requirements for a true 512-bit SIMD block as the biggest impetus for keeping AMD’s SIMD design at 256-bits. As we’ve already seen in Intel chips with AVX-512 support, the massive throughput of a 512-bit SIMD combined with its high density results in a hard spike in power consumption when using it, requiring Intel’s chips to downclock on AVX-512 workloads (sometimes severely) in order to keep power and thermals in check. Using a narrower 256-bit SIMD means that AMD won’t need to light up nearly as many transistors at once, which will in turn make it easier to keep clockspeeds and power consumption more consistent. At the same time, I don’t think AMD minds that the die space requirements for a 256-bit SIMD are significantly less than a 512-bit SIMD; a full 512-bit SIMD is a lot of transistors to build, and a lot of transistors to fire up during heavy workloads.
[...]
As the article mentions, NNUE inference might profit from the new instructions, but the SIMD width did not double.

--
Srdja
CornfedForever
Posts: 646
Joined: Mon Jun 20, 2022 4:08 am
Full name: Brian D. Smith

Re: AVX v AVX 2 vs AVX 512 - Engine Analysis

Post by CornfedForever »

smatovic wrote: Tue Aug 30, 2022 9:21 am Anandtech on AMD Zen 4 Ryzen and AVX-512:
https://www.anandtech.com/show/17552/am ... ng-sept-27
[...]
Papermaster also confirmed for the first time that Zen 4 – including the Ryzen 7000 series – will support AVX-512 instructions. AVX-512 is a bit of a mess of standards, so besides the foundation (AVX-512F) instructions, it’s still not entirely clear which subsets of AVX-512 AMD will support. But Papermaster did explicitly mention Vector Neural Network Instructions (VNNI) as among the additional subsets supported.

Critically, however, AMD is diverging from Intel in one important aspect: whereas Intel built a true, 512-bit wide SIMD machine for executing AVX-512 instructions, AMD did not. Instead, AMD will be executing these instructions over two cycles. This means AMD’s implementation still benefits from all of the additional instructions, register file space, and other technical improvements that came as part of AVX-512, but they won’t gain the innate doubling in SIMD throughput.

In discussing the rationale for AMD’s decision, Papermaster cited the extreme power requirements for a true 512-bit SIMD block as the biggest impetus for keeping AMD’s SIMD design at 256-bits. As we’ve already seen in Intel chips with AVX-512 support, the massive throughput of a 512-bit SIMD combined with its high density results in a hard spike in power consumption when using it, requiring Intel’s chips to downclock on AVX-512 workloads (sometimes severely) in order to keep power and thermals in check. Using a narrower 256-bit SIMD means that AMD won’t need to light up nearly as many transistors at once, which will in turn make it easier to keep clockspeeds and power consumption more consistent. At the same time, I don’t think AMD minds that the die space requirements for a 256-bit SIMD are significantly less than a 512-bit SIMD; a full 512-bit SIMD is a lot of transistors to build, and a lot of transistors to fire up during heavy workloads.
[...]
As the article mentions, NNUE inference might profit from the new instructions, but the SIMD width did not double.

--
Srdja
Even then... isn't letting a non-AVX build run a couple of seconds longer on a position essentially the same as using AVX-512? I ask because I've never seen any data on how much 'faster' 512 really is.
In the end, for analysis, you are not in a 'race' situation... so the extra seconds really do not matter, only the engine software does.
smatovic
Posts: 3169
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: AVX v AVX 2 vs AVX 512 - Engine Analysis

Post by smatovic »

CornfedForever wrote: Tue Aug 30, 2022 2:46 pm Even then... isn't letting a non-AVX build run a couple of seconds longer on a position essentially the same as using AVX-512? I ask because I've never seen any data on how much 'faster' 512 really is.
In the end, for analysis, you are not in a 'race' situation... so the extra seconds really do not matter, only the engine software does.
Hmm, if you run your chess analysis for only a couple of seconds, it really does not matter whether AVX-512 gives an NPS speedup of maybe 1.25x or not.

--
Srdja
smatovic
Posts: 3169
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: AVX v AVX 2 vs AVX 512 - Engine Analysis

Post by smatovic »

Just for the files:

AMD Zen 3 supports AVX2 via a 256-bit vector unit; Idk how many pipelines (i.e. for HyperThreading/SMT) per core.

AMD Zen 4 supports AVX-512 via 2x 256-bit vector units:
Zen 4 is the first AMD microarchitecture to support AVX-512 instruction set extension. Most 512-bit vector instructions are split in two and executed by the 256-bit SIMD execution units internally. The two halves execute in parallel on a pair of execution units and are still tracked as a single micro-OP (except for stores), which means the execution latency isn't doubled compared to 256-bit vector instructions. There are four 256-bit execution units, which gives a maximum throughput of two 512-bit vector instructions per clock cycle, e.g. one multiplication and one addition. The maximum number of instructions per clock cycle is doubled for vectors of 256 bits or less. Load and store units are also 256 bits each, retaining the throughput of up to two 256-bit loads or one store per cycle that was supported by Zen 3. This translates to up to one 512-bit load per cycle or one 512-bit store per two cycles.[10][12][13]
https://en.wikipedia.org/wiki/Zen_4

With four 256b vector units per core, probably for HyperThreading/SMT.


AMD Zen 5 has an additional 512-bit data path:
Zen 4 introduced AVX-512 instructions. AVX-512 capabilities have been expanded with Zen 5 with a doubling of the floating point pipe width to a native 512-bit floating point datapath. The AVX-512 datapath is configurable depending on the product. Ryzen 9000 series desktop processors and EPYC 9005 server processors feature the full 512-bit datapath but Ryzen AI 300 mobile processors feature a 256-bit datapath in order to reduce power consumption. AVX-512 instruction has been extended to VNNI/VEX instructions. Additionally, there is greater bfloat16 throughput which is beneficial for AI workloads.
https://en.wikipedia.org/wiki/Zen_5

Idk with how many pipelines in total.

And Stockfish 16.1 gains more from a wider vector unit than SF 14.1 did; here Amdahl's Law probably steps in, because of the bigger net size:

https://ipmanchess.yolasite.com/amd--in ... ckfish.php
viewtopic.php?p=967471#p967471

--
Srdja