Ryzen optimization

cdani · Post by **cdani** » Wed Jul 05, 2017 1:51 pm

I'm trying to add an additional prefetch, initially on fail high, so that the hash entry can be saved faster. I found that this is bad on Ryzen. Searching Google, I found that the use of prefetch is discouraged:

https://community.amd.com/thread/213045
http://32ipi028l5q82yhj72224m8j.wpengin ... -Ryzen.pdf

So I think I will do a special compile for Ryzen CPUs, without prefetch and BMI2 instructions, plus whatever else I or we can find. Maybe the Stockfish guys or other engine developers will want to do the same for their engines.
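A special compile like that could be as small as one build flag around the prefetch call. Below is a minimal sketch, assuming GCC or Clang. `NO_PREFETCH`, `prefetch_hint`, `TTEntry` and `TranspositionTable` are names made up for illustration, not Andscacs or Stockfish code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical build flag: compile the Ryzen binary with -DNO_PREFETCH.
#if defined(NO_PREFETCH)
inline void prefetch_hint(const void*) {}                 // hint removed in this build
#elif defined(__GNUC__) || defined(__clang__)
inline void prefetch_hint(const void* p) { __builtin_prefetch(p); }
#else
inline void prefetch_hint(const void*) {}                 // unknown compiler: skip the hint
#endif

// Hypothetical 16-byte hash entry.
struct TTEntry {
    std::uint64_t key;
    std::uint64_t data;
};

class TranspositionTable {
public:
    explicit TranspositionTable(std::size_t entries) : table_(entries) {
        // Size must be a power of two so the index can be taken with a mask.
        assert(entries != 0 && (entries & (entries - 1)) == 0);
    }

    // Entry that a position key maps to.
    TTEntry* slot(std::uint64_t key) { return &table_[key & (table_.size() - 1)]; }

    // Meant to be called from make-move as soon as the new key is known,
    // so the cache line may already be on its way when the probe happens.
    void prefetch(std::uint64_t key) { prefetch_hint(slot(key)); }

private:
    std::vector<TTEntry> table_;
};
```

With the flag set, `prefetch` compiles to nothing, so the same source can produce both the generic and the Ryzen binary.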

Ozymandias · Post by **Ozymandias** » Wed Jul 05, 2017 3:57 pm

Nice tip, so you got yourself one of those babies.

cdani · Post by **cdani** » Wed Jul 05, 2017 8:25 pm

Ozymandias wrote:Nice tip

The testing is conclusive. Prefetching on Ryzen is not just bad, it is disastrous.

Adding a secondary prefetch was a slight win on an AMD FX-8350 and on an i7-5820K, and was -10 (minus ten) Elo on the Ryzen machine.

Obviously my next attempt was to remove the primary prefetch that Andscacs and most engines have in the make-move function, and guess what? It's winning about 10 Elo on the Ryzen machine.

I'm sure it will be the same for Stockfish and others.

AlvaroBegue · Post by **AlvaroBegue** » Wed Jul 05, 2017 8:56 pm

cdani wrote:
Ozymandias wrote:Nice tip
The testing is conclusive. Prefetching on Ryzen is not just bad, it is disastrous.

Adding a secondary prefetch was a slight win on an AMD FX-8350 and on an i7-5820K, and was -10 (minus ten) Elo on the Ryzen machine.

Obviously my next attempt was to remove the primary prefetch that Andscacs and most engines have in the make-move function, and guess what? It's winning about 10 Elo on the Ryzen machine.

I'm sure it will be the same for Stockfish and others.

This must all be measurable in nodes per second, right? Testing something like this using games means dealing with noise for no good reason.

cdani · Post by **cdani** » Wed Jul 05, 2017 9:01 pm

AlvaroBegue wrote:This must all be measurable in nodes per second, right? Testing something like this using games means dealing with noise for no good reason.

I understand, but I always end up testing almost everything with games. For example, the secondary prefetch change looked good on the main computer I use (the i7). If I had stopped there and not tested it on other computers as well, I would have overlooked the Ryzen problem.

DustyMonkey · Post by **DustyMonkey** » Thu Jul 06, 2017 2:02 am

Everybody should keep their eyes on Agner Fog's blog as well:

http://www.agner.org/optimize/blog/

..and his optimization manuals:

http://www.agner.org/optimize/?e=0,34#manuals

When he finally does release for Ryzen, the thing engine authors should pay the most attention to is the microarchitecture docs rather than the "optimizing for..." docs (except for the author of asmfish) because the general ideas hold sway when you can't schedule instructions yourself.

DustyMonkey · Post by **DustyMonkey** » Thu Jul 06, 2017 3:00 am

Actually it appears he has already dived into Ryzen completely... and inside the micro-architecture docs:

"Automatic hardware prefetching is more efficient than explicit software prefetching in most cases."

And some other interesting portions:

"... But the capacity of each core is higher than what a single-threaded application is likely to need. Therefore, the Ryzen gets more advantage out of simultaneous multithreading than similar Intel processors do. Inter-thread communication should be kept within the same 4-core CPU complex if possible."

This says several things to me:

AMD's Ryzen "hyper-threading" is more likely to be of benefit to an engine than current-gen Intels, and it seems to me that at least some of the latest Intels are almost indifferent w.r.t. HT On + 2x threads vs HT Off (based on anecdotal TTD numbers I have seen)

When only 1 thread is executing on a core, Ryzen typically has idle execution units even with highly optimized code, therefore instruction scheduling is significantly less important on Ryzen than on Intel.

Any top-down "bull by the horns" SMT strategy for Ryzen ideally needs to take into account which "core complex" threads are on if possible.
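To make the "core complex" point concrete, here is one possible policy for assigning search threads to logical CPUs so that the first threads all land in the same complex. It is only a sketch. The `Topology` values and the assumption that SMT siblings have adjacent logical CPU numbers are guesses that must be checked against what the OS actually reports, and `logical_cpu_for_thread` is a hypothetical helper.

```cpp
#include <cassert>

// All values are assumptions about the machine, not something queried here.
struct Topology {
    int cores_per_ccx    = 4;  // one Ryzen core complex
    int threads_per_core = 2;  // SMT enabled
    int ccx_count        = 2;  // e.g. an 8-core part
};

// Assumed numbering: SMT siblings are adjacent (cpu 0,1 = core 0; cpu 2,3 = core 1; ...)
// and each complex owns one contiguous block of logical CPUs.
//
// Policy: fill the physical cores of one complex first, then their SMT
// siblings, and only then spill over into the next complex.
int logical_cpu_for_thread(int thread_index, const Topology& t) {
    assert(thread_index >= 0);
    assert(t.cores_per_ccx > 0 && t.threads_per_core > 0 && t.ccx_count > 0);

    const int per_ccx = t.cores_per_ccx * t.threads_per_core;
    const int total   = per_ccx * t.ccx_count;

    const int idx    = thread_index % total;        // wrap if oversubscribed
    const int ccx    = idx / per_ccx;               // which complex
    const int within = idx % per_ccx;               // position inside the complex
    const int smt    = within / t.cores_per_ccx;    // 0 = first hardware thread
    const int core   = within % t.cores_per_ccx;    // core inside the complex

    return ccx * per_ccx + core * t.threads_per_core + smt;
}
```

On Linux the returned index could then be handed to `pthread_setaffinity_np` for each search thread. Whether using SMT siblings before the second complex is actually better for a given engine is something only measurement can tell.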

"It is important to avoid long dependency chains if you want to even get close to the maximum throughput of five instructions per clock cycle."

The general idea has held since the Pentium 3 and Athlon days, but with 5 uops per cycle on the table it isn't just a matter of interleaving a couple of dependency chains together.

Instead of interleaving whatever dependency chains happen to be next, it's now so beneficial that you want to actively create chances for interleaving. Ideal interleaving now dominates the structure of the code itself.
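A toy illustration of "creating chances for interleaving", written as a sketch rather than engine code: the same sum computed as one long dependency chain and as four independent chains that the core can overlap. The function names are made up for this example, and an optimizing compiler may unroll or vectorize either loop on its own, so any speed difference has to be measured on the target CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One chain: every addition has to wait for the previous result.
std::uint64_t sum_single_chain(const std::vector<std::uint64_t>& v) {
    std::uint64_t s = 0;
    for (std::uint64_t x : v) s += x;
    return s;
}

// Four chains: s0..s3 do not depend on each other, so up to four
// additions can be in flight at once; they are combined only at the end.
std::uint64_t sum_four_chains(const std::vector<std::uint64_t>& v) {
    std::uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; ++i) s0 += v[i];   // leftover elements
    return (s0 + s1) + (s2 + s3);
}

// Sanity check: unsigned addition wraps, so both orders give the same total.
inline void check_sums_agree(const std::vector<std::uint64_t>& v) {
    assert(sum_single_chain(v) == sum_four_chains(v));
}
```

The restructuring is the point here: the second version changes the shape of the loop purely to give the scheduler independent work.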

"Branch prediction in Ryzen is based on perceptrons."

I just thought that this was very very interesting. Never thought I would see the day...

"The instruction fetcher is shared between the two cores of an execution unit. The instruction fetcher can fetch 32 aligned bytes of code per clock cycle from the level-1 code cache, according to AMD documents, but the maximum measured throughput is only slightly more than 16 bytes per clock, rarely exceeding 17."

With only 16 bytes/cycle of fetch, it's going to be very hard to hit 5 uops/cycle with many of the longer instructions. Remember that many instruction forms are long by necessity; for instance, memory addressing of the form REG+offset requires that the offset be encoded. 5 uops per cycle would be almost impossible* if a second thread is on the core.

*contrived code could probably still hit 5 uops/cycle

"The integer register file has 168 physical registers of 64 bits each. The floating point register file has 160 registers of 128 bits each."

So there is literally no chance of being confronted with a bottleneck on the available shadow registers with only 1 thread per core.

"Register-to-register move instructions are resolved at the register rename stage without using any execution units. These instructions have zero latency. It is possible to do six such register renamings per clock cycle, and it is even possible to rename the same register several times in one clock cycle."

This could help with interleaving dependency chains for the register-specific instruction forms that typically target rdx:rax, eax:edx, or ax:dx in some way. These instructions can be very short, and so are the register-to-register instructions. This is possibly how you can get to 5 uops/cycle when it otherwise doesn't seem possible.

cdani · Post by **cdani** » Thu Jul 06, 2017 8:44 am

I got the second test wrong. In the first one, adding a second prefetch on fail high was -10 Elo after 10000 games, only on the Ryzen, as I said. But the second one, removing the prefetch from the make-move function, has given a win of only about 1 Elo after 15000 games. So I suppose that the bad effect depends on the immediately surrounding code.
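To put a number like "1 Elo after 15000 games" in context, a rough error bar helps. The sketch below uses the usual logistic Elo formula and a normal approximation around an even score. The draw ratio is an assumed input, since it is not given in this thread, and `elo_margin_95` is a hypothetical helper rather than part of any testing tool.

```cpp
#include <cassert>
#include <cmath>

// Elo difference implied by a score fraction s in (0, 1), logistic model.
double elo_from_score(double s) {
    assert(s > 0.0 && s < 1.0);
    return -400.0 * std::log10(1.0 / s - 1.0);
}

// Approximate 95% half-width, in Elo, for a match between equal engines.
// draw_ratio is an assumption supplied by the caller.
double elo_margin_95(int games, double draw_ratio) {
    assert(games > 0 && draw_ratio >= 0.0 && draw_ratio < 1.0);
    // At an even score, wins and losses each occur with probability
    // (1 - draw_ratio) / 2 and deviate 0.5 from the mean; draws deviate 0.
    const double per_game_variance = (1.0 - draw_ratio) * 0.25;
    const double sd = std::sqrt(per_game_variance / games);
    return elo_from_score(0.5 + 1.96 * sd);
}
```

Under this approximation, 15000 games still leave a margin of a few Elo either way for any plausible draw ratio, so a measured gain of about 1 Elo cannot be told apart from zero.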

syzygy · Post by **syzygy** » Fri Jul 07, 2017 9:41 am

As Álvaro already said, a pure speed change can be tested far more easily, quickly and accurately by measuring nps on a few test positions. No need for thousands of games after which you still don't really know.
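A speed check of that kind can be sketched in a few lines. The harness below times a fixed-depth search over a list of positions and reports nodes per second. The `search` callback stands in for the engine's own search entry point and is an assumption, as are the names `BenchResult` and `run_bench`.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

struct BenchResult {
    std::uint64_t nodes;   // total nodes searched over all positions
    double seconds;        // wall-clock time for the whole run
    double nps() const { return seconds > 0.0 ? nodes / seconds : 0.0; }
};

// search: hypothetical callback that searches one FEN to a fixed depth
// and returns the number of nodes it visited.
BenchResult run_bench(
    const std::vector<std::string>& fens, int depth,
    const std::function<std::uint64_t(const std::string&, int)>& search) {
    assert(depth > 0);
    const auto start = std::chrono::steady_clock::now();
    std::uint64_t nodes = 0;
    for (const std::string& fen : fens) nodes += search(fen, depth);
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return {nodes, elapsed.count()};
}
```

With a single thread and a fixed depth, the node count should be identical for the build with prefetch and the build without it, so only the elapsed time differs. Repeating the run a few times and comparing the best or median nps keeps timer noise down.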
