allowing more memory for transposition table always better?


cyberfish

allowing more memory for transposition table always better?

Post by cyberfish »

This thought just hit me while I was tuning the transposition table for my engine - is it always better to let it use as much memory as possible?

What I have in mind is the CPU cache issue. For instance, I am currently allowing my engine 1.5GB of memory for its transposition table, which, I imagine, involves many slow main-memory accesses due to the essentially random access pattern of hash tables. On the other hand, I could make the transposition table just a bit smaller than my CPU's L2 cache (4MB, I believe), so that almost no main-memory accesses would be needed. Can someone please shed some light on this issue?

Thank you
hgm
Posts: 27860
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: allowing more memory for transposition table always better?

Post by hgm »

Funny you ask, as I just tried this this morning!

For micro-Max 4.8, limiting the hash-table size to 4MB (fitting in the L2 of a Core 2 Duo) slowed it down by nearly a factor of 2 compared to using 128MB (on a 10-ply search from the opening position).

But uMax does have a very simple (and thus inefficient) replacement scheme (namely 'always replace'), and it relies entirely on the hash table, even for finding the PV. Engines specifically written to make optimal use of the hash table under huge overload conditions might do a lot better.

I did notice the following, though:

When doing a perft with hashing, hashing the counts in the frontier nodes was counter-productive. (A DRAM hash probe apparently took longer than generating and counting the moves.) But when I used a separate, small hash table (easily fitting in L2) to hash the frontier nodes (all other nodes going to the big DRAM hash table, as usual), I got a significant speed-up compared to not hashing them at all.
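
In code, the scheme looks roughly like this. (A minimal sketch only: the Position type and the move helpers are hypothetical placeholders, not uMax's actual code.)

Code:

#include <stdint.h>

/* Hypothetical engine interface -- placeholder names for illustration. */
typedef struct Position Position;
extern uint64_t hash_key(const Position *pos);          /* Zobrist key        */
extern int      gen_moves(Position *pos, int moves[]);  /* returns move count */
extern void     make_move(Position *pos, int mv);
extern void     unmake_move(Position *pos, int mv);

typedef struct { uint64_t key; uint64_t count; } PerftEntry;

/* Frontier (depth==1) nodes go to a small table: 2^16 entries * 16 bytes
   = 1MB, which stays resident in a 4MB L2. Everything else goes to the
   big DRAM table (2^23 entries = 128MB; a real engine would allocate it
   dynamically). */
#define SMALL_MASK ((1u << 16) - 1)
#define BIG_MASK   ((1u << 23) - 1)
static PerftEntry small_tab[1u << 16];
static PerftEntry big_tab[1u << 23];

uint64_t perft(Position *pos, int depth)
{
    uint64_t key = hash_key(pos) ^ (uint64_t)depth;  /* fold depth into key */
    PerftEntry *e = (depth == 1) ? &small_tab[key & SMALL_MASK]
                                 : &big_tab[key & BIG_MASK];
    if (e->key == key)
        return e->count;                 /* transposition: reuse the count */

    int moves[256];
    int n = gen_moves(pos, moves);
    uint64_t total = 0;
    if (depth == 1) {
        total = (uint64_t)n;             /* frontier: counting moves is enough */
    } else {
        for (int i = 0; i < n; i++) {
            make_move(pos, moves[i]);
            total += perft(pos, depth - 1);
            unmake_move(pos, moves[i]);
        }
    }
    e->key = key;                        /* always-replace, as in uMax */
    e->count = total;
    return total;
}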
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: allowing more memory for transposition table always better?

Post by wgarvin »

Yes, the cache misses are expensive, but the redundant searching of all but the smallest subtrees is probably more expensive.
hgm
Posts: 27860
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: allowing more memory for transposition table always better?

Post by hgm »

Indeed. For a 2-ply subtree it already paid to hash them in the main table. But probes into a small in-cache table are very fast, and that paid off even for the 1-ply subtrees (even when the moves were not made, so just move generation), although the hit rate was only half of what it was when I stored those nodes in the main table (20% instead of 40%).
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: allowing more memory for transposition table always better?

Post by Edmund »

As I was reading this, I wondered whether it would be possible to have two hash tables: a small one that only holds entries from the first 4 plies or so (for more frequent access), and a larger one for the rest of the tree.
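
(Roughly like the following sketch, where the probe routes on the ply from the root; all names here are invented for illustration.)

Code:

#include <stdint.h>

/* Illustrative only -- HashEntry and both tables are made-up names. */
typedef struct { uint64_t key; int16_t score; uint8_t depth, flags; } HashEntry;

#define NEAR_MASK ((1u << 15) - 1)      /* 32K entries: small enough for L2 */
#define MAIN_MASK ((1u << 24) - 1)      /* 16M entries: lives in DRAM       */
static HashEntry near_tab[1u << 15];
static HashEntry main_tab[1u << 24];    /* allocate dynamically in practice */

HashEntry *select_entry(uint64_t key, int ply)
{
    if (ply < 4)                        /* first 4 plies from the root */
        return &near_tab[key & NEAR_MASK];
    return &main_tab[key & MAIN_MASK];
}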
pedrox
Posts: 1056
Joined: Fri Mar 10, 2006 6:07 am
Location: Basque Country (Spain)

Re: allowing more memory for transposition table always better?

Post by pedrox »

I guess it also has to do with the time control: playing bullet or blitz is not the same as playing at standard time controls, and it also matters whether the engine can ponder or not.

A few months ago I ran a blitz test (I think the time control was 40/2) in which a version with 128 MB played better than one with 256 MB. On the other hand, I see that my engine does well with 256 MB in the CCRL 40/40 list.

In the last CCT tournament I had doubts about how much memory to use (50'+3"). I played the first game with 256 MB and decided to play the rest with 128 MB; I am not sure whether that was the right choice.
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: allowing more memory for transposition table always better?

Post by wgarvin »

Codeman wrote: As I was reading this, I wondered whether it would be possible to have two hash tables: a small one that only holds entries from the first 4 plies or so (for more frequent access), and a larger one for the rest of the tree.
I think some programs already do this. The problem, of course, is that the same L2 cache has to service all of your hash tables, as well as all other memory accesses.

It may be possible to access main memory without touching the cache, or to "write through" in a way that flushes data from the L2 cache... but situations where it makes sense to do those kinds of things are very rare (mostly streaming through large linear regions of memory?). Chess programs probably should not be doing such things.

Since the tables can't realistically fit in L2, the most plausible way to speed up memory accesses is to figure out in advance the address you need to access, and prefetch it.
cyberfish

Re: allowing more memory for transposition table always better?

Post by cyberfish »

How do you prefetch (preferably in C)? That is compiler-specific (non-portable).
cyberfish

Re: allowing more memory for transposition table always better?

Post by cyberfish »

I just experimented with my engine a bit and found the following results:

a 7-ply search on the position
r1bq1rk1/pp3pp1/n3pb1p/2p5/2BP4/P1N1PN2/1PQ2PPP/3RK2R b K - 0 11

(transposition table size) - (time in secs)
0k - 33.61
128k - 36.05
256k - 34.51
512k - 32.42
1m - 30.20
2m - 31.51
4m - 27.87
8m - 30.31
16m - 26.16
32m - 24.14
64m - 22.10
128m - 23.52
256m - 21.78
512m - 24.47
1024m - 23.58

My CPU is a 2GHz Core 2 Duo with a 4MB L2 cache.
I have 2GB of main memory (DDR2-667).

The fluctuations are so big that I am reluctant to draw any conclusion. However, it seems like the best size is somewhere around 128-256MB for my engine.
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: allowing more memory for transposition table always better?

Post by wgarvin »

cyberfish wrote: How do you prefetch (preferably in C)? That is compiler-specific (non-portable).
For x86 compilers, look up _mm_prefetch with Google. (Is it the same for x86-64? I'm not sure). For PowerPC, look up __dcbt. You'd want to hide this behind some common interface (a macro or inline function) and have a default implementation for other platforms that does nothing.
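
For example, something like this could serve as the common interface (the name PREFETCH is my own invention; treat it as a sketch, and a PowerPC port would slot __dcbt in the same way):

Code:

/* Possible portable prefetch wrapper. __builtin_prefetch covers
   gcc/clang on any target; _mm_prefetch is the x86 intrinsic (MSVC). */
#if defined(__GNUC__)
  #define PREFETCH(p) __builtin_prefetch(p)
#elif defined(_MSC_VER)
  #include <xmmintrin.h>
  #define PREFETCH(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else
  #define PREFETCH(p) ((void)0)        /* default: do nothing */
#endif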

The basic idea is that L1 cache is really fast, L2 is slower but still tolerable (dozens of cycles?), but main memory is horrible (hundreds?). So you tell the CPU well in advance that you will need the cache line containing a certain address, and it will try to speculatively read it into the appropriate cache(s) for you. Then a little later, when you use it for real, it will be nice and fast because it's there in the L1 cache.

The problem is that the L1 cache is very small, and even L2 is small compared to main memory. Every cache line you fetch (whether for a prefetch, a regular read, or anything else) causes some *other* data to be discarded from the cache. If your program still needs that data, you've just caused *another* cache miss somewhere else. If your prefetch occurs way too early, it is likely to cause extra cache misses like this. On the other hand, if you prefetch too late, the prefetched data won't be ready in the L1 cache by the time you need it, so the CPU will still have to wait. On some types of CPU this is still better than not prefetching at all, but it's not ideal.

Using these prefetch intrinsics correctly is a bit of a black art. Using them wrongly can actually hurt performance quite a bit, so make sure you know what you're doing, and measure performance carefully before and after.

The easiest place to use them is a loop that processes a lot of data in linear fashion: iterating over an array, or a pair of arrays, and doing something to each element. Then you can prefetch 6 or 8 or 12 (or however many) elements "ahead" of your current position, so that 6 or 8 or 12 loop iterations later the data is ready to actually be used.
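
For instance (a sketch; the distance of 8 is arbitrary and has to be tuned by measuring):

Code:

#include <stddef.h>

/* Walk an array, prefetching a fixed distance ahead. Uses the PREFETCH
   wrapper sketched above; the right AHEAD depends on the CPU. */
enum { AHEAD = 8 };

void scale(float *a, size_t n, float factor)
{
    for (size_t i = 0; i < n; i++) {
        if (i + AHEAD < n)
            PREFETCH(&a[i + AHEAD]);   /* fetch for iteration i + AHEAD */
        a[i] *= factor;
    }
}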

I don't know whether they can really help for transposition table lookups; you probably don't know the hash key early enough to make much difference?
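
That said, if the Zobrist key is updated incrementally in make-move, the child's table address is known slightly before the recursive call probes it, so a fragment like this might buy a little overlap (purely speculative, and every name here is a placeholder):

Code:

/* Speculative fragment -- assumes make_move() updates pos->key
   (the child's Zobrist key) incrementally. */
make_move(pos, move);
PREFETCH(&tt[pos->key & TT_MASK]);      /* start the DRAM fetch early... */
score = -search(pos, depth - 1, -beta, -alpha);  /* ...child probes tt   */
unmake_move(pos, move);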