Since the implementation of AffineTransformSparseInput for armv8 and the addition of specific compilation for armv8-dotprod architectures, Android users have seen a huge speed improvement.
This shows that a small change in code and/or compilation can have a large impact on speed on modern architectures.
Using Neural or Graphical Processor Units in NNUE engines
Moderators: hgm, Rebel, chrisw
-
- Posts: 41
- Joined: Tue Oct 29, 2019 8:33 pm
- Location: French Polynesia
- Full name: Roger C.
-
- Posts: 195
- Joined: Thu Feb 04, 2021 10:24 pm
- Full name: Arnold Magnum
Re: Using Neural or Graphical Processor Units in NNUE engines
RogerC wrote: ↑Wed Aug 09, 2023 3:48 am Since the implementation of AffineTransformSparseInput for armv8 and specific compilation for armv8-dotprod architectures, Android users have seen a huge speed improvement.
This shows that a small change in code and/or compilation can have a large impact on speed on modern architectures.
That’s not a huge speed improvement.
+75.63% on Apple devices. That is a huge speed improvement.
It’s about 3x more than the armv8-dotprod gain on a Cortex-X1: a 27.1% speed-up.
https://en.wikipedia.org/wiki/ARM_Cortex-X1
ARMv8.2-A
https://en.wikipedia.org/wiki/AArch64#ARMv8.2-A
https://en.wikipedia.org/wiki/ARM_architecture_family
Stockfish developers should try: ARMv9.2
https://www.anandtech.com/show/18871/ar ... -exclusive
-
- Posts: 2556
- Joined: Tue Aug 30, 2016 8:19 pm
- Full name: Rasmus Althoff
Re: Using Neural or Graphical Processor Units in NNUE engines
How would they try it? With what hardware as of now? ARM themselves design the CPUs without actually manufacturing them.
Rasmus Althoff
https://www.ct800.net
-
- Posts: 2858
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: Using Neural or Graphical Processor Units in NNUE engines
dangi12012 wrote: ↑Sun Aug 06, 2023 11:04 pm [...]
Open question:
Whether we give 32 threads to one NNUE eval or run 32 evals concurrently in lockstep is an open question. I reckon the second approach will win, since the later parts of NNUE work on 15 elements, leaving half of the threads idle with the first approach.
[...]
My take: you will need to couple a Warp with 32 threads or a Wavefront with 64 threads to compete with AVX2, and you might want to use vector-packed math with char4, a vector of 4x8-bit values. Hence, to utilize a Warp with 32xchar4 you will need an optimized neural net architecture, which differs from current SF. Generally, Lc0 has different net sizes, and I can imagine that it would make sense to use 128b, 256b, 512b...2048b optimized net archs for NNUE to run on different SIMD units with different bit-widths.
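As a rough illustration of the char4 packed-math idea mentioned above: GPUs can multiply four packed signed 8-bit lanes and accumulate the products into a 32-bit sum in a single instruction (CUDA exposes this as __dp4a). A scalar C++ emulation of that operation, for illustration only:

```cpp
#include <cstdint>

// dp4a-style operation: multiply four packed signed 8-bit lanes of
// a and b, accumulating the products into a 32-bit sum. GPUs do this
// in one instruction; here it is emulated lane by lane.
int32_t dp4a_emulated(uint32_t a, uint32_t b, int32_t acc) {
    for (int lane = 0; lane < 4; ++lane) {
        int8_t x = int8_t((a >> (8 * lane)) & 0xFF);
        int8_t y = int8_t((b >> (8 * lane)) & 0xFF);
        acc += int32_t(x) * int32_t(y);
    }
    return acc;
}

// Pack four signed 8-bit values into one 32-bit word (char4-style).
uint32_t pack4(int8_t a, int8_t b, int8_t c, int8_t d) {
    return uint32_t(uint8_t(a)) | (uint32_t(uint8_t(b)) << 8) |
           (uint32_t(uint8_t(c)) << 16) | (uint32_t(uint8_t(d)) << 24);
}
```

With 32 threads each holding one such packed word, a warp processes 128 int8 values per step, which is the throughput a net architecture would need to be shaped around to compete with AVX2.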
--
Srdja
-
- Posts: 389
- Joined: Tue Oct 08, 2019 11:39 pm
- Full name: Tomasz Sobczyk
Re: Using Neural or Graphical Processor Units in NNUE engines
We tried using it. It's trash for level-2 BLAS routines. It's so bad that it's worse than SSSE3 for NNUE. Because of this, I don't have much hope in matmul accelerators being helpful for NNUE, apart from very specialized networks that would be useless on other hardware.
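One way to see why level-2 BLAS is a poor fit for matmul accelerators: a matrix-vector product does only about two operations per matrix element touched, regardless of size, whereas a matrix-matrix product's data reuse grows with the dimensions, which is what such accelerators rely on. A back-of-the-envelope sketch (illustrative helper names, not from any library):

```cpp
// Arithmetic intensity (flops per element touched) of GEMV vs GEMM.
// GEMV (level-2 BLAS): 2*m*n flops over m*n weights + n inputs + m
// outputs, so intensity stays below 2 at any size. GEMM (level-3):
// 2*n^3 flops over ~3*n^2 elements, so intensity grows as ~2n/3.
double gemv_intensity(int m, int n) {
    return (2.0 * m * n) / (double(m) * n + n + m);
}

double gemm_intensity(int n) {   // square matrices for simplicity
    return (2.0 * n * n * n) / (3.0 * double(n) * n);
}
```

NNUE's hidden layers are small matrix-vector products, so they sit permanently on the memory-bound side of this gap, and the fixed overhead of feeding a matmul accelerator never amortizes.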
dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.
Maybe you copied your stockfish commits from someone else too?
I will look into that.