cache alignment of tt

Daniel Shawul · Post by **Daniel Shawul** » Sun Mar 11, 2012 11:22 pm

Do you use aligned malloc for your hash tables (main,pawn and eval) to put the starting address at 64 byte boundaries. Even though I had a bucket with 4 slots each 16 byte in size, I forgot to actually do _aligned_malloc() !? weird. But after I make the change it actually got slower nps wise on windows. On linux systems profiling shows there is some gain for the main tt but not for the eval cache (16 bytes entries). I don't use buckets for the eval cache so I guess there is not much to gain from there...
Also does pre-fetching gain anything or is it another optimization myths ?
Why I am asking this is these things are crucial for gpu computation even though I never bothered to implement them for scorpio.

hgm · Post by **hgm** » Sun Mar 11, 2012 11:39 pm

Are you sure the alignment works on Windows? Whether your appliation runs faster is a CPU matter, and should not be affected by what OS you use.

I guess even a non-aligned malloc would be aligned on a 16-byte boundary, so that you dannot see an effect on the eval cache would be logical.

Daniel Shawul · Post by **Daniel Shawul** » Mon Mar 12, 2012 12:06 am

Posted: Sun Mar 11, 2012 10:39 pm Post subject: Re: cache alignment of tt
Are you sure the alignment works on Windows? Whether your appliation runs faster is a CPU matter, and should not be affected by what OS you use.

I guess even a non-aligned malloc would be aligned on a 16-byte boundary, so that you dannot see an effect on the eval cache would be logical.

Yes it works. The alignment returned by malloc/new gurantees aligned access for the largest member variables (hashkey 8 byte), so I get multiples of 8. When I do alignement the remainder is 0 always. Also declaring some big tables like zobrist keys array and history table as cache aligned makes things slower on windows. I do not have a profiler there so I can not measure performance accurately.

Code: Select all


template<typename T>
void aligned_reserve&#40;T*& mem,const size_t& size&#41; &#123;
	if&#40;&#40;sizeof&#40;T&#41; & &#40;sizeof&#40;T&#41; - 1&#41;) == 0&#41; &#123;
#ifdef _MSC_VER
		mem = &#40;T*)_aligned_recalloc&#40;mem,size,sizeof&#40;T&#41;,CACHE_LINE_SIZE&#41;;
#else
		if&#40;mem&#41; free&#40;mem&#41;;
		posix_memalign&#40;&#40;void**)&mem,CACHE_LINE_SIZE,size * sizeof&#40;T&#41;);
		memset&#40;mem,0,size * sizeof&#40;T&#41;);
#endif
	&#125; else &#123;
		if&#40;mem&#41; free&#40;mem&#41;;
		mem = &#40;T*) calloc&#40;size,sizeof&#40;T&#41;);
	&#125;
	print&#40;"address %d remainder %d\n",&#40;uintptr_t&#41;mem,&#40;uintptr_t&#41;mem % 64&#41;;
&#125;

After alignment for main & eval cache. Pawn cache has a size of 24 so I do not align it even though doing so should not hurt I think.

Code: Select all

feature done=0
address 40697920 remainder 0
address 74317888 remainder 0
address 82772000 remainder 32
ht 1048576 X 64 = 64MB
eht 524288 X 16 = 8MB
pht 349525 X 24 = 7MB

jdart · Post by **jdart** » Mon Mar 12, 2012 1:43 am

I do align hash tables and other large structures.

I think it helps but you should not expect a large gain. Most times your threads are not accessing overlapping cache lines in the hash table, anyway.

--Jon

Daniel Shawul · Post by **Daniel Shawul** » Mon Mar 12, 2012 3:53 am

@Jon
Thanks. It probably is not worth it to cache align arrays. But I expected aligning the main hash table to be a significant improvement. I think the main rationale behaind aligning the tt is not to reduce cache contention but to get more probes at a cost of 1 read (f.i 3 free probes in a 4 slots bucket). If it is not cache aligned two or more reads are needed. Most of the time we have a miss on tt probes at a node (say 85%) so I expected some speed up there.

@Hgm
I missed some probably crucial info. My unix systems is 64 bits so it does 16 byte alignment. The 32 bit windows malloc does 8-byte alignment.

micron · Post by **micron** » Mon Mar 12, 2012 5:10 am

You don't need a profiler. The best and simplest method is time-to-depth, obtained with a few lines of test-harness code in main(). Reproducibility is greatly enhanced by taking the fastest of 3--5 searches.

The combination of alignment and prefetching gives my engine a speedup of about 1%.

Code: Select all

__builtin_prefetch&#40; &sTT&#91;hashCode & sTTIndexMask&#93; );

Daniel Shawul · Post by **Daniel Shawul** » Mon Mar 12, 2012 5:49 am

I was actually looking at the post you made sometime back comparing tt probes with and without alignment.
I was comparing nps of a 18 ply search since I didn't have a profiler but now I did QueryPerformanceCounter() and got a tiny tiny speed up BUT it still is slower nps wise when I take out the timing routine which is actually what I saw before.

Query performance of hash_probe 3 runs each

Code: Select all

No alignment&#58;

microsecond/probe         NPS
209989                         999109
210978                         998072
210551                        1000685

Alignment

206678                        1002805
206402                        1001222
205490                        1003311

That gives an improvement of 0.2% if you average the three runs but without the timing routing

Code: Select all

Without alignment
1275611
1275611

With alignment
1262112
1263824

Now the one without alignment is faster. And the behavior is pretty consistent. It is also better to compare overall behavior but I really don't have any idea what is changing when I force the alignment.

Anyway it is not a very good optimization.

cheers

bob · Post by **bob** » Mon Mar 12, 2012 6:21 am

Daniel Shawul wrote:Do you use aligned malloc for your hash tables (main,pawn and eval) to put the starting address at 64 byte boundaries. Even though I had a bucket with 4 slots each 16 byte in size, I forgot to actually do _aligned_malloc() !? weird. But after I make the change it actually got slower nps wise on windows. On linux systems profiling shows there is some gain for the main tt but not for the eval cache (16 bytes entries). I don't use buckets for the eval cache so I guess there is not much to gain from there...
Also does pre-fetching gain anything or is it another optimization myths ?
Why I am asking this is these things are crucial for gpu computation even though I never bothered to implement them for scorpio.

I've always done it, but I did not use aligned_malloc() since it is not portable. I did my own. The gain is a very small one. And the gain comes in two ways.

(1) a single entry won't split two cache lines, which would not be very efficient.

(2) Assuming you use buckets, the first entry in the bucket has latency, but the other three (assuming 4 TT entries per bucket/cache-line) will be free.

Pre-fetching is an unclear issue. If you use the prefetch opcode to load a hash bucket prior to it being used, you hide part of the fetch latency. But usually a very small part because you don't know where the entry is until just before you need it. And the down-side is that when you bring something in, you have to kick something out. Perhaps right before you use it...

rbarreira · Post by **rbarreira** » Mon Mar 12, 2012 9:25 am

bob wrote: Pre-fetching is an unclear issue.

From what I've heard, it's quite clear that prefetch speeds up many engines. I guess Crafty might not benefit much from it since it doesn't access the tt in Qsearch?

Daniel Shawul · Post by **Daniel Shawul** » Mon Mar 12, 2012 1:49 pm

I guess it depends also on whether you probe in qsearch as Ricardo pointed out. I forgot about that thinking I have an 85% tt miss on all nodes when it actually is on small subset. I see what you mean with not having enough time after prefecthing. There is really not much code for me after my do_move and a tt probe. I have static pruning testing, clearing node and testing for draws (probably biggest consumer). I maybe able to add legality testing in there too.
Multiple probes for the eval cache and using a smaller pawn tt that could fit in L3 is something I need to test. Eval cache takes more time than the main cache in my case but I didn't use buckets.