AVX v AVX 2 vs AVX 512 - Engine Analysis

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

CornfedForever
Posts: 646
Joined: Mon Jun 20, 2022 4:08 am
Full name: Brian D. Smith

AVX v AVX 2 vs AVX 512 - Engine Analysis

Post by CornfedForever »

As the upcoming 13th-generation Intel CPUs are doing away with AVX-512 altogether, and the upcoming AMD Ryzen chips (both to be released around the same time) are set to include the AVX-512 instruction set, I am wondering just what the difference is in strength of analysis (3 sec, 5 min, overnight, etc.) of a move as concerns engines like Dragon / Komodo / Stockfish.

I seem to recall Larry Kaufman (a year or so ago) thinking the difference between 256 and 512 was not very much, but I do not think I have ever seen anyone try to quantify what the differences might be. I know 'more is better', but I'm not sure how that corresponds to real-world chess analysis.

Curious, as I am about to get a new computer and expect to run engines on it quite often for analysis of lines (primarily) or games. I am currently looking at a setup with a Ryzen 5950X CPU.
Raphexon
Posts: 476
Joined: Sun Mar 17, 2019 12:00 pm
Full name: Henk Drost

Re: AVX v AVX 2 vs AVX 512 - Engine Analysis

Post by Raphexon »

Use whatever's fastest.
Chessqueen
Posts: 5685
Joined: Wed Sep 05, 2018 2:16 am
Location: Moving
Full name: Jorge Picado

Re: AVX v AVX 2 vs AVX 512 - Engine Analysis

Post by Chessqueen »

Raphexon wrote: Sun Jul 10, 2022 9:47 pm Use whatever's fastest.
And which is the fastest ?
smatovic
Posts: 3169
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: AVX v AVX 2 vs AVX 512 - Engine Analysis

Post by smatovic »

We won't be able to tell until benchmarks with optimized engines are out, because in the past there were downclocking issues with broader bit-widths, and we don't know for sure whether the AVX-512 instruction set is supported only via AVX2-style 256-bit units or via a real 512-bit wide vector unit; time will tell. Sopel et al. could estimate the max NPS speedup possible with AVX-512, maybe 1.25x to 1.5x, depending on engine and implementation.

--
Srdja
Magnum
Posts: 195
Joined: Thu Feb 04, 2021 10:24 pm
Full name: Arnold Magnum

Re: AVX v AVX 2 vs AVX 512 - Engine Analysis

Post by Magnum »

Chessqueen wrote: Mon Jul 11, 2022 12:07 am
Raphexon wrote: Sun Jul 10, 2022 9:47 pm Use whatever's fastest.
And which is the fastest ?
It’s your self compiled version.

Or use a simple MacBook Pro M1 Max.
smatovic
Posts: 3169
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: AVX v AVX 2 vs AVX 512 - Engine Analysis

Post by smatovic »

Anandtech on AMD Zen 4 Ryzen and AVX-512:
https://www.anandtech.com/show/17552/am ... ng-sept-27
[...]
Papermaster also confirmed for the first time that Zen 4 – including the Ryzen 7000 series – will support AVX-512 instructions. AVX-512 is a bit of a mess of standards, so besides the foundation (AVX-512F) instructions, it’s still not entirely clear which subsets of AVX-512 AMD will support. But Papermaster did explicitly mention Vector Neural Network Instructions (VNNI) as among the additional subsets supported.

Critically, however, AMD is diverging from Intel in one important aspect: whereas Intel built a true, 512-bit wide SIMD machine for executing AVX-512 instructions, AMD did not. Instead, AMD will be executing these instructions over two cycles. This means AMD’s implementation still benefits from all of the additional instructions, register file space, and other technical improvements that came as part of AVX-512, but they won’t gain the innate doubling in SIMD throughput.

In discussing the rationale for AMD’s decision, Papermaster cited the extreme power requirements for a true 512-bit SIMD block as the biggest impetus for keeping AMD’s SIMD design at 256-bits. As we’ve already seen in Intel chips with AVX-512 support, the massive throughput of a 512-bit SIMD combined with its high density results in a hard spike in power consumption when using it, requiring Intel’s chips to downclock on AVX-512 workloads (sometimes severely) in order to keep power and thermals in check. Using a narrower 256-bit SIMD means that AMD won’t need to light up nearly as many transistors at once, which will in turn make it easier to keep clockspeeds and power consumption more consistent. At the same time, I don’t think AMD minds that the die space requirements for a 256-bit SIMD are significantly less than a 512-bit SIMD; a full 512-bit SIMD is a lot of transistors to build, and a lot of transistors to fire up during heavy workloads.
[...]
As the article mentions, NNUE inference might profit from the new instructions, but the SIMD width did not double.

--
Srdja
CornfedForever
Posts: 646
Joined: Mon Jun 20, 2022 4:08 am
Full name: Brian D. Smith

Re: AVX v AVX 2 vs AVX 512 - Engine Analysis

Post by CornfedForever »

smatovic wrote: Tue Aug 30, 2022 9:21 am Anandtech on AMD Zen 4 Ryzen and AVX-512:
https://www.anandtech.com/show/17552/am ... ng-sept-27
[...]
Papermaster also confirmed for the first time that Zen 4 – including the Ryzen 7000 series – will support AVX-512 instructions. AVX-512 is a bit of a mess of standards, so besides the foundation (AVX-512F) instructions, it’s still not entirely clear which subsets of AVX-512 AMD will support. But Papermaster did explicitly mention Vector Neural Network Instructions (VNNI) as among the additional subsets supported.

Critically, however, AMD is diverging from Intel in one important aspect: whereas Intel built a true, 512-bit wide SIMD machine for executing AVX-512 instructions, AMD did not. Instead, AMD will be executing these instructions over two cycles. This means AMD’s implementation still benefits from all of the additional instructions, register file space, and other technical improvements that came as part of AVX-512, but they won’t gain the innate doubling in SIMD throughput.

In discussing the rationale for AMD’s decision, Papermaster cited the extreme power requirements for a true 512-bit SIMD block as the biggest impetus for keeping AMD’s SIMD design at 256-bits. As we’ve already seen in Intel chips with AVX-512 support, the massive throughput of a 512-bit SIMD combined with its high density results in a hard spike in power consumption when using it, requiring Intel’s chips to downclock on AVX-512 workloads (sometimes severely) in order to keep power and thermals in check. Using a narrower 256-bit SIMD means that AMD won’t need to light up nearly as many transistors at once, which will in turn make it easier to keep clockspeeds and power consumption more consistent. At the same time, I don’t think AMD minds that the die space requirements for a 256-bit SIMD are significantly less than a 512-bit SIMD; a full 512-bit SIMD is a lot of transistors to build, and a lot of transistors to fire up during heavy workloads.
[...]
As the article mentions, NNUE inference might profit from the new instructions, but the SIMD width did not double.

--
Srdja
Even then... isn't letting a non-AVX build run a couple of seconds longer on a position essentially the same as using AVX-512? I ask because I've never seen any data on how much 'faster' 512 really is.
In the end, for analysis, you are not in a 'race' situation... so the extra seconds really do not matter, only the engine software does.
smatovic
Posts: 3169
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: AVX v AVX 2 vs AVX 512 - Engine Analysis

Post by smatovic »

CornfedForever wrote: Tue Aug 30, 2022 2:46 pm Even then... isn't letting a non-AVX build run a couple of seconds longer on a position essentially the same as using AVX-512? I ask because I've never seen any data on how much 'faster' 512 really is.
In the end, for analysis, you are not in a 'race' situation... so the extra seconds really do not matter, only the engine software does.
Hmm, if you run your chess analysis for only a couple of seconds, it really does not matter whether AVX-512 gives an NPS speedup of maybe 1.25x or not.

--
Srdja
smatovic
Posts: 3169
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: AVX v AVX 2 vs AVX 512 - Engine Analysis

Post by smatovic »

Just for the files:

AMD Zen 3 supports AVX2 via a 256-bit vector unit; Idk how many pipelines (i.e. for HyperThreading/SMT) per core.

AMD Zen 4 supports AVX-512 via 2x 256-bit vector units:
Zen 4 is the first AMD microarchitecture to support AVX-512 instruction set extension. Most 512-bit vector instructions are split in two and executed by the 256-bit SIMD execution units internally. The two halves execute in parallel on a pair of execution units and are still tracked as a single micro-OP (except for stores), which means the execution latency isn't doubled compared to 256-bit vector instructions. There are four 256-bit execution units, which gives a maximum throughput of two 512-bit vector instructions per clock cycle, e.g. one multiplication and one addition. The maximum number of instructions per clock cycle is doubled for vectors of 256 bits or less. Load and store units are also 256 bits each, retaining the throughput of up to two 256-bit loads or one store per cycle that was supported by Zen 3. This translates to up to one 512-bit load per cycle or one 512-bit store per two cycles.[10][12][13]
https://en.wikipedia.org/wiki/Zen_4

With four 256b vector units per core, probably for HyperThreading/SMT.


AMD Zen 5 has an additional 512-bit data path:
Zen 4 introduced AVX-512 instructions. AVX-512 capabilities have been expanded with Zen 5 with a doubling of the floating point pipe width to a native 512-bit floating point datapath. The AVX-512 datapath is configurable depending on the product. Ryzen 9000 series desktop processors and EPYC 9005 server processors feature the full 512-bit datapath but Ryzen AI 300 mobile processors feature a 256-bit datapath in order to reduce power consumption. AVX-512 instruction has been extended to VNNI/VEX instructions. Additionally, there is greater bfloat16 throughput which is beneficial for AI workloads.
https://en.wikipedia.org/wiki/Zen_5

Idk with how many pipelines in total.

And Stockfish 16.1 gains more from a wider vector unit than SF 14.1 did; here Amdahl's Law probably steps in, because of the bigger net size:

https://ipmanchess.yolasite.com/amd--in ... ckfish.php
viewtopic.php?p=967471#p967471

--
Srdja