Since the implementation of AffineTransformSparseInput for armv8 and the addition of specific compilation for armv8-dotprod architectures, Android users have seen a huge speed improvement.
This shows that a small change in code and/or compilation can have a large impact on speed on modern architectures.
Using Neural or Graphical Processor Units in NNUE engines
Moderators: hgm, Rebel, chrisw
-
- Posts: 41
- Joined: Tue Oct 29, 2019 8:33 pm
- Location: French Polynesia
- Full name: Roger C.
-
- Posts: 195
- Joined: Thu Feb 04, 2021 10:24 pm
- Full name: Arnold Magnum
Re: Using Neural or Graphical Processor Units in NNUE engines
RogerC wrote: ↑Wed Aug 09, 2023 3:48 am Since the implementation of AffineTransformSparseInput for armv8 and specific compilation for armv8-dotprod architectures, Android users have seen a huge speed improvement.
This shows that a small change in code and/or compilation can have a large impact on speed on modern architectures.
That’s not a huge speed improvement.
+75.63% on Apple devices. That is a huge speed improvement.
It’s about 3x more than the armv8-dotprod gain on a Cortex-X1: a 27.1% speed-up.
https://en.wikipedia.org/wiki/ARM_Cortex-X1
ARMv8.2-A
https://en.wikipedia.org/wiki/AArch64#ARMv8.2-A
https://en.wikipedia.org/wiki/ARM_architecture_family
Stockfish developers should try: ARMv9.2
https://www.anandtech.com/show/18871/ar ... -exclusive
-
- Posts: 2556
- Joined: Tue Aug 30, 2016 8:19 pm
- Full name: Rasmus Althoff
Re: Using Neural or Graphical Processor Units in NNUE engines
How would they try it? With what hardware as of now? ARM themselves design the CPUs without actually manufacturing them.
Rasmus Althoff
https://www.ct800.net
-
- Posts: 2858
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: Using Neural or Graphical Processor Units in NNUE engines
dangi12012 wrote: ↑Sun Aug 06, 2023 11:04 pm [...]
Open question:
Whether we give 32 threads to one NNUE eval or run 32 evals concurrently in lockstep is an open question. I reckon the second approach will win, since the later parts of NNUE work on 15 elements, leaving half of the threads idle with the first approach.
[...]
My take: you will need to couple a Warp with 32 threads or a Wavefront with 64 threads to compete with AVX2, and you might want to use vector-packed math with char4, a vector of 4x8-bit values. Hence, to utilize a Warp with 32xchar4 you will need an optimized neural net architecture, which differs from current SF. Generally, Lc0 has different net sizes, and I can imagine that it would make sense to use 128b, 256b, 512b...2048b optimized net archs for NNUE to run on different SIMD units with different bit-widths.
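As a rough illustration of the char4 packed-math idea mentioned above: GPUs can multiply four packed signed 8-bit lanes and accumulate the products into a 32-bit sum in a single instruction (CUDA exposes this as __dp4a). A scalar C++ emulation of that operation, for illustration only:

```cpp
#include <cstdint>

// dp4a-style operation: multiply four packed signed 8-bit lanes of
// a and b, accumulating the products into a 32-bit sum. GPUs do this
// in one instruction; here it is emulated lane by lane.
int32_t dp4a_emulated(uint32_t a, uint32_t b, int32_t acc) {
    for (int lane = 0; lane < 4; ++lane) {
        int8_t x = int8_t((a >> (8 * lane)) & 0xFF);
        int8_t y = int8_t((b >> (8 * lane)) & 0xFF);
        acc += int32_t(x) * int32_t(y);
    }
    return acc;
}

// Pack four signed 8-bit values into one 32-bit word (char4-style).
uint32_t pack4(int8_t a, int8_t b, int8_t c, int8_t d) {
    return uint32_t(uint8_t(a)) | (uint32_t(uint8_t(b)) << 8) |
           (uint32_t(uint8_t(c)) << 16) | (uint32_t(uint8_t(d)) << 24);
}
```

With 32 threads each holding one such packed word, a warp processes 128 int8 values per step, which is the throughput a net architecture would need to be shaped around to compete with AVX2.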
--
Srdja
-
- Posts: 389
- Joined: Tue Oct 08, 2019 11:39 pm
- Full name: Tomasz Sobczyk
Re: Using Neural or Graphical Processor Units in NNUE engines
We tried using it. It's trash for level-2 BLAS routines. It's so bad that it's worse than SSSE3 for NNUE. Because of this, I don't have much hope in matmul accelerators being helpful for NNUE, apart from very specialized networks that would be useless on other hardware.
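One way to see why level-2 BLAS is a poor fit for matmul accelerators: a matrix-vector product does only about two operations per matrix element touched, regardless of size, whereas a matrix-matrix product's data reuse grows with the dimensions, which is what such accelerators rely on. A back-of-the-envelope sketch (illustrative helper names, not from any library):

```cpp
// Arithmetic intensity (flops per element touched) of GEMV vs GEMM.
// GEMV (level-2 BLAS): 2*m*n flops over m*n weights + n inputs + m
// outputs, so intensity stays below 2 at any size. GEMM (level-3):
// 2*n^3 flops over ~3*n^2 elements, so intensity grows as ~2n/3.
double gemv_intensity(int m, int n) {
    return (2.0 * m * n) / (double(m) * n + n + m);
}

double gemm_intensity(int n) {   // square matrices for simplicity
    return (2.0 * n * n * n) / (3.0 * double(n) * n);
}
```

NNUE's hidden layers are small matrix-vector products, so they sit permanently on the memory-bound side of this gap, and the fixed overhead of feeding a matmul accelerator never amortizes.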
dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.
Maybe you copied your stockfish commits from someone else too?
I will look into that.