prefetch questions

Discussion of chess software programming and technical issues.

Moderator: Ras

Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: prefetch questions

Post by Evert »

lucasart wrote: Yes, that would work (and can be done the same way with new and delete, but there's an awful lot of pointer casting that is verbose and ugly in C++, but still possible). Possible, but horrible... I'll have a look at the pragma option.
I guess it depends on how C++ your code is. If you're relying on constructors to initialise the data, or use things like virtual functions (actually not sure about the latter), things may not work as intended. I haven't tried it, but it's something that has always nagged me.

Of course if the code is really just "C with extras" it's probably fine. But then I'd use malloc() myself. ;)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: prefetch questions

Post by bob »

lucasart wrote:
bob wrote:
lucasart wrote:1/ When I call the GCC intrinsic

Code: Select all

__builtin_prefetch(void *addr,...)
and addr is 64-byte aligned, will a 64-byte cache line be prefetched from that address? I absolutely need 64 bytes (from addr to addr+63 included)

2/ What kind of speed increase should I expect from prefetching. I've done it for the TT and Pawn Cache entries, and only measured a tiny +0.4% speed increase :cry:
You can't prefetch anything OTHER than 64 byte cache lines, unless you are on hardware that uses a different cache line/block size...
Thank you. That confirms my intuition. I just couldn't find an answer in the GCC docs, so I assumed it would do that. The thing that confused me is that Stockfish does

Code: Select all

__builtin_prefetch(addr)
__builtin_prefetch(addr+64)
and explicitly comments this as a way to fetch 64 bytes. But this is useless if addr is already 64-byte aligned (which it clearly should be in this case, otherwise a TT entry would be split across two cache lines). So it's probably a mistake in Stockfish.
bob wrote: Speed increases are difficult to predict. When you prefetch A, you replace B. If you need B before you need A, you get yet another cache line fill and hurt performance. If you don't need A, but prefetch it anyway, you replaced something you might need, and burned some memory bandwidth unnecessarily. It can be a mixed bag unless done carefully.
The problem I have is that it's very hard to measure. Prefetching TT and Pawn Cache entries gains 0.4% speed in total. But I can't isolate the prefetch effect (and if I tried, the result would be meaningless, as it depends on how much work is done between the prefetch and the use of the cache line).
Basically you need to do enough work between the prefetch and the use of the cache line so that the CPU has had time to fetch the line before you use it, but with too much work the cache line might already have been evicted. So it's hard to know how much without being able to measure precisely...
That comment is wrong. That is a way to prefetch TWO 64-byte blocks of memory that are contiguous in the virtual address space. You only need to prefetch one byte within a 64-byte block to get the entire block copied into cache. Not only is that comment misleading, but you are accessing two bytes that are 64 bytes apart, which guarantees two consecutive cache blocks. No idea why that would be written that way, unless their intent was to prefetch 128 bytes.

BTW the gcc docs are the wrong place to look. This is architecture-specific information, which means you need to go to the Intel docs for the processor you are using. That's where you will discover cache characteristics.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: prefetch questions

Post by bob »

lucasart wrote:
bob wrote:
lucasart wrote:1/ When I call the GCC intrinsic

Code: Select all

__builtin_prefetch(void *addr,...)
and addr is 64-byte aligned, will a 64-byte cache line be prefetched from that address? I absolutely need 64 bytes (from addr to addr+63 included)

2/ What kind of speed increase should I expect from prefetching. I've done it for the TT and Pawn Cache entries, and only measured a tiny +0.4% speed increase :cry:
You can't prefetch anything OTHER than 64 byte cache lines, unless you are on hardware that uses a different cache line/block size...
Thank you. That confirms my intuition. I just couldn't find an answer in the GCC docs, so I assumed it would do that. The thing that confused me is that Stockfish does

Code: Select all

__builtin_prefetch(addr)
__builtin_prefetch(addr+64)
and explicitly comments this as a way to fetch 64 bytes. But this is useless if addr is already 64-byte aligned (which it clearly should be in this case, otherwise a TT entry would be split across two cache lines). So it's probably a mistake in Stockfish.
bob wrote: Speed increases are difficult to predict. When you prefetch A, you replace B. If you need B before you need A, you get yet another cache line fill and hurt performance. If you don't need A, but prefetch it anyway, you replaced something you might need, and burned some memory bandwidth unnecessarily. It can be a mixed bag unless done carefully.
The problem I have is that it's very hard to measure. Prefetching TT and Pawn Cache entries gains 0.4% speed in total. But I can't isolate the prefetch effect (and if I tried, the result would be meaningless, as it depends on how much work is done between the prefetch and the use of the cache line).
Basically you need to do enough work between the prefetch and the use of the cache line so that the CPU has had time to fetch the line before you use it, but with too much work the cache line might already have been evicted. So it's hard to know how much without being able to measure precisely...
The Intel MSR performance counters can be used to measure this accurately. You can program one counter to count cache misses; then you can tell what your prefetch does. You don't have to have enough code between the prefetch and the actual usage; the idea is that any code between those two operations is effectively "free". Normally you'd stall at the point where you actually use the data, but now when you get there, some cycles have elapsed and some progress toward bringing in the memory has already been made.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: prefetch questions

Post by bob »

hgm wrote:It is easy enough to figure this out: just print out the address of an object that you thus allocated.

The problem with malloc() is that the heap manager uses the first two words of any allocated area to store info about the size, which is needed by free(), and then returns the address just after that. The actual allocations are usually aligned to very high powers of two, but the offset of 8 or 16 bytes is guaranteed to spoil that.
This is the reason that you need to save the ORIGINAL address malloc() returns, as well as the (addr+63) & ~63 aligned address (use a pointer-sized mask; a 32-bit constant like 0xffffffc0 would truncate 64-bit addresses). You need the original if you ever want to free() that block or change its size.
hgm
Posts: 28454
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: prefetch questions

Post by hgm »

bob wrote: You don't have to have enough code between the prefetch and the actual usage, the idea is that any code between those two operations is effectively "free". Normally you'd stall at the point where you actually use the data, but now when you get there, some cycles have elapsed and some progress to bringing in the memory has already been made.
That is not entirely true, as out-of-order execution would allow your code to continue some 100 instructions past the point where you use the data (and loading it into a register already counts as usage).

If you have only 50 instructions worth of work between the prefetch and the point where your program cannot sensibly predict control flow without knowing the data, you might as well have done the load directly instead of the prefetch. In both cases the 50 instructions after it would be executed before the CPU starts to work on a mispredicted nonsense path, which will later be flushed. So moving them to before the real load would not make them 'free'.

And testing a number of entries in your hash bucket is a very effective way to cause a branch misprediction.
petero2
Posts: 733
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: prefetch questions

Post by petero2 »

lucasart wrote:Another question related to 64-byte alignment. Actually this is more a C++ question.

Suppose I define a struct TTable::Entry that is 64-bytes long. If I do something like

Code: Select all

TTable::Entry *p = new TTable::Entry[count];
Can I assume (void *)p to be divisible by 64 ? Is this specified by the C++ standard, or compiler specific ? If the latter, is there a portable way to make sure ? 64-byte alignment is crucial here, not doing it defeats the purpose of prefetching in the first place.
If you use STL containers you could write a custom allocator that ensures 64-byte alignment. In my current development version of texel I use this: alignedAlloc.hpp. You could then write:

Code: Select all

template <typename T> class vector_aligned : public std::vector<T, AlignedAllocator<T> > { };
vector_aligned<int> someVector;
Or if you use a new enough C++11 compiler, you could write:

Code: Select all

template <typename T> using vector_aligned = std::vector<T, AlignedAllocator<T>>;
vector_aligned<int> someVector;
I am not sure my AlignedAllocator class is currently portable because I have only tested it in 64 bit linux so far. However, custom allocators are part of the C++ standard, so it should be quite easy to make it portable if it is not already.
syzygy
Posts: 5868
Joined: Tue Feb 28, 2012 11:56 pm

Re: prefetch questions

Post by syzygy »

bob wrote:
lucas wrote:and explicitely comments this as a way to fetch 64-bytes.
That comment is wrong. That is a way to prefetch TWO 64-byte blocks of memory that are contiguous in the virtual address space. You only need to prefetch one byte within a 64-byte block to get the entire block copied into cache. Not only is that comment misleading, but you are accessing two bytes that are 64 bytes apart, which guarantees two consecutive cache blocks. No idea why that would be written that way, unless their intent was to prefetch 128 bytes.
The full code fragment is this:

Code: Select all

#  if defined(__INTEL_COMPILER) || defined(_MSC_VER)
  _mm_prefetch(addr, _MM_HINT_T0);
  _mm_prefetch(addr+64, _MM_HINT_T0); // 64 bytes ahead
#  else
  __builtin_prefetch(addr);
  __builtin_prefetch(addr+64);
#  endif
That second prefetch reads (another) 64 bytes ahead, so the comment on that line is strictly speaking not wrong. Maybe it is done intentionally in order to prefetch whole pawn hash entries which might be larger than 64 bytes (I did not check). It is a bit strange though that the same routine is also used for prefetching TT entries.
lucasart
Posts: 3243
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: prefetch questions

Post by lucasart »

Evert wrote:
lucasart wrote: Can I assume (void *)p to be divisible by 64 ? Is this specified by the C++ standard, or compiler specific ? If the latter, is there a portable way to make sure ? 64-byte alignment is crucial here, not doing it defeats the purpose of prefetching in the first place.
You cannot assume that. In practice it's often, but not always, the case (in my experience, particularly with that other operating system).

The only portable way that I know of (and this is C rather than C++) is to allocate the memory, with a bit extra, then test whether the pointer is aligned properly. If it isn't, increment the pointer to the point where it is aligned (this is why you allocate extra memory).

As I said, this works with C malloc'ed memory. I don't think you can (should) do the same with memory allocated with C++'s new.

Having said that, there are probably compiler pragma's or platform specific functions that do it for you.
new/delete can be replaced by malloc/free in this context. I'm just allocating plain old data, and there are no constructors or destructors involved in TTable::Entry elements. The C++ kool-aid would have you write a default constructor for each TTable::Entry element that zeroes the data. But I find that zeroing out the memory explicitly with a memset() is not only more efficient but also more self-documenting than relying on C++ default constructors being called in a loop, behind the scenes, by operator new[].

I've found what seems to be a POSIX standard function:

Code: Select all

#include <stdlib.h>
int posix_memalign(void **memptr, size_t alignment, size_t size);
(it returns an error code instead of a pointer; the pointer comes back through an output argument modified by the function, which is a bit unusual, but anyway).

Is there a Windows equivalent? If I cover Windows and POSIX, I should have all modern operating systems covered.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
ZirconiumX
Posts: 1361
Joined: Sun Jul 17, 2011 11:14 am
Full name: Hannah Ravensloft

Re: prefetch questions

Post by ZirconiumX »

posix_memalign() is not even declared in some versions of Mac OS X headers, so that won't work.

I know that that is broken at least on Mac OS X Tiger. Newer versions may have it though - Julien, we need you!

http://lmgtfy.com/?q=posix_memalign+mac+os+x contains 9 bug reports about that function on the first page alone.

Matthew:out
tu ne cede malis, sed contra audentior ito
lucasart
Posts: 3243
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: prefetch questions

Post by lucasart »

ZirconiumX wrote:posix_memalign() is not even declared in some versions of Mac OS X headers, so that won't work.

I know that that is broken at least on Mac OS X Tiger. Newer versions may have it though - Julien, we need you!

http://lmgtfy.com/?q=posix_memalign+mac+os+x contains 9 bug reports about that function on the first page alone.

Matthew:out
If some versions of Mac OS X are not POSIX-compliant, I really don't care. I write code according to the standard, and it's Apple's job to comply with the POSIX standard. If they decide not to, then DiscoCheck won't run on Mac OS X: I don't think the world will stop spinning over that :lol:
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.