cache alignment of tt

Daniel Shawul · Post by **Daniel Shawul** » Mon Mar 12, 2012 5:49 am

I was actually looking at the post you made sometime back comparing tt probes with and without alignment.
I was comparing nps of a 18 ply search since I didn't have a profiler but now I did QueryPerformanceCounter() and got a tiny tiny speed up BUT it still is slower nps wise when I take out the timing routine which is actually what I saw before.

Query performance of hash_probe 3 runs each

Code: Select all

No alignment&#58;

microsecond/probe         NPS
209989                         999109
210978                         998072
210551                        1000685

Alignment

206678                        1002805
206402                        1001222
205490                        1003311

That gives an improvement of 0.2% if you average the three runs but without the timing routing

Code: Select all

Without alignment
1275611
1275611

With alignment
1262112
1263824

Now the one without alignment is faster. And the behavior is pretty consistent. It is also better to compare overall behavior but I really don't have any idea what is changing when I force the alignment.

Anyway it is not a very good optimization.

cheers

bob · Post by **bob** » Mon Mar 12, 2012 6:21 am

Daniel Shawul wrote:Do you use aligned malloc for your hash tables (main,pawn and eval) to put the starting address at 64 byte boundaries. Even though I had a bucket with 4 slots each 16 byte in size, I forgot to actually do _aligned_malloc() !? weird. But after I make the change it actually got slower nps wise on windows. On linux systems profiling shows there is some gain for the main tt but not for the eval cache (16 bytes entries). I don't use buckets for the eval cache so I guess there is not much to gain from there...
Also does pre-fetching gain anything or is it another optimization myths ?
Why I am asking this is these things are crucial for gpu computation even though I never bothered to implement them for scorpio.

I've always done it, but I did not use aligned_malloc() since it is not portable. I did my own. The gain is a very small one. And the gain comes in two ways.

(1) a single entry won't split two cache lines, which would not be very efficient.

(2) Assuming you use buckets, the first entry in the bucket has latency, but the other three (assuming 4 TT entries per bucket/cache-line) will be free.

Pre-fetching is an unclear issue. If you use the prefetch opcode to load a hash bucket prior to it being used, you hide part of the fetch latency. But usually a very small part because you don't know where the entry is until just before you need it. And the down-side is that when you bring something in, you have to kick something out. Perhaps right before you use it...

rbarreira · Post by **rbarreira** » Mon Mar 12, 2012 9:25 am

bob wrote: Pre-fetching is an unclear issue.

From what I've heard, it's quite clear that prefetch speeds up many engines. I guess Crafty might not benefit much from it since it doesn't access the tt in Qsearch?

Daniel Shawul · Post by **Daniel Shawul** » Mon Mar 12, 2012 1:49 pm

I guess it depends also on whether you probe in qsearch as Ricardo pointed out. I forgot about that thinking I have an 85% tt miss on all nodes when it actually is on small subset. I see what you mean with not having enough time after prefecthing. There is really not much code for me after my do_move and a tt probe. I have static pruning testing, clearing node and testing for draws (probably biggest consumer). I maybe able to add legality testing in there too.
Multiple probes for the eval cache and using a smaller pawn tt that could fit in L3 is something I need to test. Eval cache takes more time than the main cache in my case but I didn't use buckets.

bob · Post by **bob** » Mon Mar 12, 2012 4:54 pm

rbarreira wrote:
bob wrote: Pre-fetching is an unclear issue.
From what I've heard, it's quite clear that prefetch speeds up many engines. I guess Crafty might not benefit much from it since it doesn't access the tt in Qsearch?

Hard to say. I'd expect prefetch to work in the large majority of the cases, unless it is a speculative pre-fetch. For example, if you do a repetition test before a hash probe, there is a risk in pre-fetching the tt entry because you might not need it. And in the right kinds of positions, it will certainly hurt. Probably not when averaged over many games of course...

However, most prefetch at a sub-optimal place, such as at the top of search. Ideally you would prefetch right after making a move, where you just updated the hash signature. And you could perhaps improve that with a hash signature update done outside of make move, where the hash signature update is responsible for the prefetch, as early as is possible... But now you have changed the actual structure of the program and I have not noticed anyone going that far, although it is possible.

I'll try to run some tests to see if it helps today...

Cardoso · Post by **Cardoso** » Tue Mar 13, 2012 1:33 am

Hi Bob, sorry, to go back to the AlignedMalloc.
As you might remember I posted here a question about your AlignedMalloc not working on Windows 7, x64 using the MS visual C++ 2010.
I made a compile of crafty 23.4 and couldn't allocante more than 1Gb of hash.
I wonder if the bug is on the "(long)" casting in your code.
Shouldn't in x64 Windows be an _int64?

best regards,
Alvaro

bob · Post by **bob** » Thu Mar 15, 2012 3:15 pm

Cardoso wrote:Hi Bob, sorry, to go back to the AlignedMalloc.
As you might remember I posted here a question about your AlignedMalloc not working on Windows 7, x64 using the MS visual C++ 2010.
I made a compile of crafty 23.4 and couldn't allocante more than 1Gb of hash.
I wonder if the bug is on the "(long)" casting in your code.
Shouldn't in x64 Windows be an _int64?

best regards,
Alvaro

I believe you are correct. I think that "long" on 64 bit windows is STILL 32 bits for reasons I can't fathom...

I will fix this in 23.5 since I use the int64_t type anyway... Will break things for 32 bit machines it would appear, but those machines are pretty much going away anyway...

Rémi Coulom · Post by **Rémi Coulom** » Thu Mar 15, 2012 5:42 pm

bob wrote:
Cardoso wrote:Hi Bob, sorry, to go back to the AlignedMalloc.
As you might remember I posted here a question about your AlignedMalloc not working on Windows 7, x64 using the MS visual C++ 2010.
I made a compile of crafty 23.4 and couldn't allocante more than 1Gb of hash.
I wonder if the bug is on the "(long)" casting in your code.
Shouldn't in x64 Windows be an _int64?

best regards,
Alvaro
I believe you are correct. I think that "long" on 64 bit windows is STILL 32 bits for reasons I can't fathom...

I will fix this in 23.5 since I use the int64_t type anyway... Will break things for 32 bit machines it would appear, but those machines are pretty much going away anyway...

I am not sure exactly what this discussion is about, but it seems to me that using size_t would be the proper way.

Rémi

bob · Post by **bob** » Thu Mar 15, 2012 6:03 pm

Rémi Coulom wrote:
bob wrote:
Cardoso wrote:Hi Bob, sorry, to go back to the AlignedMalloc.
As you might remember I posted here a question about your AlignedMalloc not working on Windows 7, x64 using the MS visual C++ 2010.
I made a compile of crafty 23.4 and couldn't allocante more than 1Gb of hash.
I wonder if the bug is on the "(long)" casting in your code.
Shouldn't in x64 Windows be an _int64?

best regards,
Alvaro
I believe you are correct. I think that "long" on 64 bit windows is STILL 32 bits for reasons I can't fathom...

I will fix this in 23.5 since I use the int64_t type anyway... Will break things for 32 bit machines it would appear, but those machines are pretty much going away anyway...
I am not sure exactly what this discussion is about, but it seems to me that using size_t would be the proper way.

Rémi

I took a quick look at stdint.h, and it looks like "uintptr_t" is the correct thing to portably declare a pointer. I saw examples of 32 bit processors with > 32 bit address spaces, and even a 64 bit processor with a < 32 bit address space (Cray 1 series through the T90 in fact).

size_t isn't guaranteed to be as big as a pointer, just big enough to hold the largest sizeof() value that can be returned... There is also a ptrdiff.t that might work as well as it is guaranteed to be able to hold the difference between any two pointers, but it is signed and might cause an issue...

I tried uintptr_t and it worked on my linux box...

Actually after looking at this, I am not sure what I am currently doing is actually correct for all platforms. Here's my "AlignedMalloc()" function:

void AlignedMalloc(void **pointer, int alignment, size_t size) {
segments[nsegments][0] = malloc(size + alignment - 1);
segments[nsegments][1] =
(void *) (((uintptr_t) segments[nsegments][0] + alignment -
1) & ~(alignment - 1));
*pointer = segments[nsegments][1];
nsegments++;
}

You pass it 3 values. A pointer to a pointer where it should store the address of the malloc'ed memory, an alignment value (64 is currently used, and a size (how much memory to malloc).

I am not sure size_t is correct for the third argument after reading a bit. It almost seems that this should be uintptr_t as well, or at least ptrdiff_t. But a negative value makes no sense...

Will have to do some research on this (again)... but for the moment, the above works on a 12 gig 64 bit processor and I tested with hash=8192M with no problems...

Daniel Shawul · Post by **Daniel Shawul** » Thu Mar 15, 2012 6:08 pm

Yes uintptr_t is what I used too.

cache alignment of tt

Re: cache alignment of tt

Re: cache alignment of tt

Re: cache alignment of tt

Re: cache alignment of tt

Re: cache alignment of tt

Re: cache alignment of tt

Re: cache alignment of tt

Re: cache alignment of tt

Re: cache alignment of tt

Re: cache alignment of tt