prefetch questions

lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

prefetch questions

Post by lucasart »

1/ When I call the GCC intrinsic

__builtin_prefetch(void *addr,...)
and addr is 64-byte aligned, will a 64-byte cache line be prefetched from that address? I absolutely need 64 bytes (from addr to addr+63 inclusive).

2/ What kind of speed increase should I expect from prefetching? I've done it for the TT and Pawn Cache entries, and only measured a tiny +0.4% speed increase :cry:
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
hgm
Posts: 27703
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: prefetch questions

Post by hgm »

On modern x86 hardware, cache fetches are always 64 bytes.

How much you will gain depends very much on whether the CPU has something useful to do between the prefetch and the actual use of the data. That work must be significantly more than the look-ahead span of the CPU; otherwise you might as well have done the real read of the data, since whatever followed would have been executed anyway before the data is really used in a calculation (or branched on).

There are also conceivable situations where you would not gain anything, namely when memory bandwidth is the current bottleneck.
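
To make the "useful work in between" point concrete, here is a minimal sketch of a table with a separate prefetch call. The Entry layout, the Table class and its member names are hypothetical and not taken from any engine in this thread; the sketch also assumes GCC or Clang for __builtin_prefetch and makes no promise that the vector's storage is 64-byte aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 64-byte entry; the field layout is illustrative only.
struct Entry {
    std::uint64_t key;
    std::uint64_t data[7];
};
static_assert(sizeof(Entry) == 64, "entry should fill exactly one cache line");

class Table {
public:
    // n must be a power of two so that the index can be computed with a mask.
    explicit Table(std::size_t n) : entries_(n) {
        assert(n != 0 && (n & (n - 1)) == 0);
    }

    // Hint that the slot for `key` will be read soon. This is only a hint:
    // it does not change what probe() returns.
    void prefetch(std::uint64_t key) const {
#if defined(__GNUC__) || defined(__clang__)
        __builtin_prefetch(&entries_[index(key)]);
#else
        (void)key; // no-op on compilers without the builtin
#endif
    }

    void store(std::uint64_t key, std::uint64_t value) {
        Entry& e = entries_[index(key)];
        e.key = key;
        e.data[0] = value;
    }

    // Returns the entry on a key match, nullptr otherwise.
    const Entry* probe(std::uint64_t key) const {
        const Entry& e = entries_[index(key)];
        return e.key == key ? &e : nullptr;
    }

private:
    std::size_t index(std::uint64_t key) const {
        return static_cast<std::size_t>(key & (entries_.size() - 1));
    }
    std::vector<Entry> entries_;
};
```

The intended use would be to call prefetch(key) as soon as the hash key of the next position is known, do unrelated work such as move generation or evaluation, and only then call probe(key). If probe() follows the prefetch immediately, the hint buys nothing over a plain read.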
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: prefetch questions

Post by bob »

lucasart wrote:1/ When I call the GCC intrinsic

__builtin_prefetch(void *addr,...)
and addr is 64-byte aligned, will a 64-byte cache line be prefetched from that address? I absolutely need 64 bytes (from addr to addr+63 inclusive).

2/ What kind of speed increase should I expect from prefetching? I've done it for the TT and Pawn Cache entries, and only measured a tiny +0.4% speed increase :cry:
You can't prefetch anything OTHER than 64-byte cache lines, unless you are on hardware that uses a different cache line/block size...

Speed increases are difficult to predict. When you prefetch A, you replace B. If you need B before you need A, you get yet another cache line fill and hurt performance. If you don't need A, but prefetch it anyway, you replaced something you might need, and burned some memory bandwidth unnecessarily. It can be a mixed bag unless done carefully.
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: prefetch questions

Post by lucasart »

bob wrote:
lucasart wrote:1/ When I call the GCC intrinsic

__builtin_prefetch(void *addr,...)
and addr is 64-byte aligned, will a 64-byte cache line be prefetched from that address? I absolutely need 64 bytes (from addr to addr+63 inclusive).

2/ What kind of speed increase should I expect from prefetching? I've done it for the TT and Pawn Cache entries, and only measured a tiny +0.4% speed increase :cry:
You can't prefetch anything OTHER than 64-byte cache lines, unless you are on hardware that uses a different cache line/block size...
Thank you. That confirms my intuition. I just couldn't find an answer in the GCC doc, so I just assumed it would do that. The thing that confused me is Stockfish does

__builtin_prefetch(addr);
__builtin_prefetch(addr+64);
and explicitly comments this as a way to fetch 64 bytes. But the second prefetch is useless if addr is already 64-byte aligned (which it clearly should be in this case, otherwise a TT entry would straddle two cache lines). So it's probably a mistake in Stockfish.
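
Whether a second prefetch is needed comes down to how many cache lines the object touches. A minimal sketch of that arithmetic, assuming a 64-byte line size; kCacheLine, lines_spanned and prefetch_object are hypothetical names, and this is not Stockfish's code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64; // assumption: typical x86 line size

// Number of cache lines touched by the bytes [addr, addr + size).
inline std::size_t lines_spanned(const void* addr, std::size_t size) {
    assert(size > 0);
    const auto a = reinterpret_cast<std::uintptr_t>(addr);
    const std::uintptr_t first = a / kCacheLine;
    const std::uintptr_t last = (a + size - 1) / kCacheLine;
    return static_cast<std::size_t>(last - first + 1);
}

// Issue one prefetch per cache line covering the object.
inline void prefetch_object(const void* addr, std::size_t size) {
    const auto a = reinterpret_cast<std::uintptr_t>(addr);
    // Round down to the start of the first line.
    const std::uintptr_t first = a & ~static_cast<std::uintptr_t>(kCacheLine - 1);
    const std::size_t n = lines_spanned(addr, size);
    for (std::size_t i = 0; i < n; ++i) {
#if defined(__GNUC__) || defined(__clang__)
        __builtin_prefetch(reinterpret_cast<const void*>(first + i * kCacheLine));
#endif
    }
}
```

For a 64-byte object at a 64-byte aligned address, lines_spanned returns 1, so a single prefetch covers it. The same object at an unaligned address spans two lines, which is the only case where prefetching addr+64 as well would help.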
bob wrote: Speed increases are difficult to predict. When you prefetch A, you replace B. If you need B before you need A, you get yet another cache line fill and hurt performance. If you don't need A, but prefetch it anyway, you replaced something you might need, and burned some memory bandwidth unnecessarily. It can be a mixed bag unless done carefully.
The problem I have is that it's very hard to measure. Prefetching TT and Pawn Cache entries gains 0.4% speed in total. But I can't isolate the prefetch effect (and if I tried, the result would be meaningless, as it depends on how much work is done between the prefetch and the use of the cache line).
Basically you need to do enough work between the prefetch and the use of the cache line so that the CPU has had time to fetch the line before you use it, but with too much work the line might already have been evicted. So it's hard to know how much without being able to measure precisely...
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: prefetch questions

Post by Evert »

Isn't adding prefetch instructions where it makes sense typically something that the compiler can do with profile-guided optimisation?
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: prefetch questions

Post by lucasart »

Another question related to 64-byte alignment. Actually this is more a C++ question.

Suppose I define a struct TTable::Entry that is 64 bytes long. If I do something like

TTable::Entry *p = new TTable::Entry[count];
Can I assume (void *)p to be divisible by 64? Is this specified by the C++ standard, or is it compiler-specific? If the latter, is there a portable way to make sure? 64-byte alignment is crucial here; not doing it defeats the purpose of prefetching in the first place.
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: prefetch questions

Post by Evert »

lucasart wrote: Can I assume (void *)p to be divisible by 64? Is this specified by the C++ standard, or is it compiler-specific? If the latter, is there a portable way to make sure? 64-byte alignment is crucial here; not doing it defeats the purpose of prefetching in the first place.
You cannot assume that. In practice it is often, but not always, the case (in my experience, particularly with that other operating system).

The only portable way that I know of (and this is C rather than C++) is to allocate the memory, with a bit extra, then test whether the pointer is aligned properly. If it isn't, increment the pointer to the point where it is aligned (this is why you allocate extra memory).

As I said, this works with C malloc'ed memory. I don't think you can (should) do the same with memory allocated with C++'s new.

Having said that, there are probably compiler pragmas or platform-specific functions that do it for you.
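
The over-allocate-and-adjust approach described above can be sketched roughly as follows. AlignedBlock and alloc_aligned are hypothetical names, and the sketch assumes the alignment is a power of two. The original pointer has to be kept, because that is the one free() expects.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

struct AlignedBlock {
    void* raw;     // what malloc returned; pass this to std::free
    void* aligned; // first suitably aligned address inside the block; use this
};

// Allocate `bytes` usable bytes starting at an address divisible by `alignment`.
inline AlignedBlock alloc_aligned(std::size_t bytes, std::size_t alignment = 64) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    // Worst case the pointer has to move forward by alignment - 1 bytes.
    void* raw = std::malloc(bytes + alignment - 1);
    if (raw == nullptr) return {nullptr, nullptr};
    const auto addr = reinterpret_cast<std::uintptr_t>(raw);
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment - 1);
    const std::uintptr_t up = (addr + mask) & ~mask; // round up to the boundary
    return {raw, static_cast<char*>(raw) + (up - addr)};
}
```

The cost is at most alignment - 1 wasted bytes per allocation, which is negligible for one large hash table.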
hgm
Posts: 27703
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: prefetch questions

Post by hgm »

It is easy enough to figure this out: just print out the address of an object that you thus allocated.

The problem with malloc() is that the allocator typically uses the first two words of any allocated area to store bookkeeping info such as the size, which is needed by free(), and then returns the address just after that. The actual allocations are usually aligned to very high powers of two, but the offset of 8 or 16 bytes is guaranteed to spoil that.
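
A small sketch of the "just print the address" check, with hypothetical helper names (misalignment, report_alignment). It only reports what one particular allocation happened to return, so a zero offset on one run or platform proves nothing about the next.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Offset of p from the previous multiple of `alignment` (0 means aligned).
inline std::size_t misalignment(const void* p, std::size_t alignment = 64) {
    assert(alignment != 0);
    return static_cast<std::size_t>(reinterpret_cast<std::uintptr_t>(p) % alignment);
}

// Print the address and its offset within a 64-byte line.
inline void report_alignment(const char* label, const void* p) {
    std::printf("%s: %p, offset within 64-byte line = %zu\n",
                label, const_cast<void*>(p), misalignment(p));
}
```

Calling report_alignment on the pointer returned by new[] or malloc() shows directly whether the header offset described above is present on a given platform.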
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: prefetch questions

Post by lucasart »

Evert wrote:
lucasart wrote: Can I assume (void *)p to be divisible by 64? Is this specified by the C++ standard, or is it compiler-specific? If the latter, is there a portable way to make sure? 64-byte alignment is crucial here; not doing it defeats the purpose of prefetching in the first place.
You cannot assume that. In practice it is often, but not always, the case (in my experience, particularly with that other operating system).

The only portable way that I know of (and this is C rather than C++) is to allocate the memory, with a bit extra, then test whether the pointer is aligned properly. If it isn't, increment the pointer to the point where it is aligned (this is why you allocate extra memory).

As I said, this works with C malloc'ed memory. I don't think you can (should) do the same with memory allocated with C++'s new.

Having said that, there are probably compiler pragmas or platform-specific functions that do it for you.
Yes, that would work. It can be done the same way with new and delete, although the amount of pointer casting makes it verbose and ugly in C++. Possible, but horrible... I'll have a look at the pragma option.
jdart
Posts: 4361
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: prefetch questions

Post by jdart »

See Arasan's types.h file for how to do this portably (there are also variants of malloc that do alignment).

http://www.arasanchess.org/

--Jon
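
For the "variants of malloc that do alignment", a minimal sketch that wraps the two common platform calls: posix_memalign on POSIX systems and _aligned_malloc on Windows. The wrapper names alloc_cache_aligned and free_cache_aligned are hypothetical, and this is not the code from Arasan's types.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>
#ifdef _WIN32
#include <malloc.h>
#endif

// Allocate `bytes` bytes at a 64-byte aligned address, or return nullptr.
inline void* alloc_cache_aligned(std::size_t bytes) {
    assert(bytes != 0);
#ifdef _WIN32
    return _aligned_malloc(bytes, 64);
#else
    void* p = nullptr;
    // posix_memalign returns 0 on success and leaves the result in p.
    return posix_memalign(&p, 64, bytes) == 0 ? p : nullptr;
#endif
}

// Release memory obtained from alloc_cache_aligned.
inline void free_cache_aligned(void* p) {
#ifdef _WIN32
    _aligned_free(p); // memory from _aligned_malloc must not go to free()
#else
    free(p);
#endif
}
```

Keeping allocation and release behind one pair of functions avoids mixing the Windows and POSIX free routines by accident.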