tablebase caching / mmap() / page cache

Discussion of chess software programming and technical issues.

Moderator: Ras

syzygy
Posts: 5714
Joined: Tue Feb 28, 2012 11:56 pm

Re: tablebase caching / mmap() / page cache

Post by syzygy »

Rein Halbersma wrote:
syzygy wrote:I might have a look into this. At some point I was considering adding a run-length encoding option, but my first experiments weren't very promising.

A complication is that my WDL tables can have up to 5 values (because I distinguish between real wins/losses and wins/losses that are drawn by the 50-move rule). But of course not all tables are affected by the 50-move rule.
The Chinook project also pioneered partial-information databases (at least a draw, at most a draw), which also have values of this type. There is a paper on their website about this. IIRC, compression suffers a bit, but it's the same principle.
Not really the same principle, but from the compression point of view it might indeed be the same.
Maybe I should read your code in more detail, but can you explain how you use Huffman coding to get a fixed-length 64 byte code block? I thought that Huffman is a form of fixed-to-variable encoding (each fixed-sized input character to a variable-sized bitstring)?
I just encode symbols into Huffman codes until the 64-byte block runs out of space.

I do use some additional tweaks. For example, the symbol that does not fit usually corresponds to the concatenation of a pair of other symbols. If the Huffman code for the first symbol of that pair does fit, and the Huffman code for the second symbol of the pair is shorter than the Huffman code of the original symbol, then I add the (Huffman code of the) first symbol of the pair to the block I am filling and start the next block with the (Huffman code of the) second symbol. Another tweak is checking whether the symbol that no longer fits is the first half of a pair whose combined symbol does fit (which in effect adds random data to the end of the block, but that does not hurt). These tweaks only marginally improve compression, but the improvement comes for free, so I take it.

But you're right that this does give some truncation overhead. I think it's something like 5 bits per block on average.
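The basic packing loop (without the pair-splitting tweaks) can be sketched like this, with hypothetical names: emit Huffman codes into a fixed-size block until the next code no longer fits, then start a new block.

```c
#include <stddef.h>

/* Sketch of packing variable-length Huffman codes into fixed-size
 * blocks (e.g. 64 bytes = 512 bits).  code_len[i] is the length in
 * bits of the i-th symbol's Huffman code.  Returns the number of
 * blocks used; the "truncation overhead" is the unused tail of each
 * block. */
size_t pack_blocks(const int *code_len, size_t nsyms, int block_bits)
{
    size_t blocks = 0;
    int used = 0;                     /* bits used in the current block */
    for (size_t i = 0; i < nsyms; i++) {
        if (blocks == 0 || used + code_len[i] > block_bits) {
            blocks++;                 /* open a fresh block */
            used = 0;
        }
        used += code_len[i];
    }
    return blocks;
}
```

This only counts blocks; the real compressor also records per-block sizes for the index and applies the pair-splitting tweaks described above.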
syzygy
Posts: 5714
Joined: Tue Feb 28, 2012 11:56 pm

Re: tablebase caching / mmap() / page cache

Post by syzygy »

For example KQRvKQ. The WDL tablebase file contains two tables: wtm and btm.
For wtm the generator outputs:

Code: Select all

tb_size = 114675120
(...)
compressing data for wtm / wdl
calc_size: 843828
bits = 6; blocks = 13309 (1); size = 881024
real_num_blocks = 13284; num_blocks = 13285
idxbits = 18
num_indices = 438
This means the raw compressed size is 843828 bytes. With 64 bytes per block (bits = 6), a first quick run (without tweaks) counts 13309 blocks, i.e. on average 843828/13309 = 63.40 effective bytes per 64-byte block.

A simple calculation shows that it is not necessary to try 32-byte blocks with their extra indexing overhead (it would be different if the average were, say, 50 bytes, which would indicate that the limit of 65536 positions per block is too restrictive).

The compressor therefore settles on 64-byte blocks. It does another run, this time with tweaks. The slightly more efficient packing of symbols into blocks now leads to 13284 blocks, i.e. on average 843828/13284 = 63.52 effective bytes per 64-byte block.

The total number of positions (with the indexing scheme used) is 114675120, so on average 114675120/13284 = 8632.6 positions per 64-byte block.

idxbits = 18 means that there are index entries for positions 2^17 + k * 2^18 for k >= 0. Each index entry contains the block number for the position (4 bytes) and the location of the position's value within this block (2 bytes). Given a position to probe, we round to the nearest index entry and then add or subtract block sizes to find the right block to probe and the location within that block.

Each position is within 2^17 positions from a position with an index entry, i.e. the average distance from a position with an index entry is 2^16 = 65536, which means that the average number of additions or subtractions is 65536/8632.6 = 7.6.
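That walk might be sketched as follows (hypothetical names; the real code differs): `block_npos[]` holds the number of positions each block decodes to, and the index entry supplies a starting block plus the anchor position's offset within it.

```c
/* Sketch of the index walk: starting from the block and offset given
 * by the nearest index entry, add or subtract per-block position
 * counts until the target position falls inside a block.  Returns the
 * block number containing the target. */
long find_block(const long *block_npos, long block, long offset,
                long target_minus_anchor)
{
    long pos = offset + target_minus_anchor;  /* relative to block start */
    while (pos < 0)
        pos += block_npos[--block];           /* walk backward */
    while (pos >= block_npos[block])
        pos -= block_npos[block++];           /* walk forward */
    return block;                             /* 'pos' is the in-block offset */
}
```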

The "(1)" and real_num_blocks vs num_blocks have to do with adding extra dummy block sizes in order for the last main index entry to work correctly. We can safely ignore that now.

Total indexing overhead = 2 * 13285 + 6 * 438 = 29198. Total table size (excluding the dictionary of at most 3*4095 bytes) = 29198 + 64 * 13284 = 879374 bytes, so 4.2% overhead.
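The arithmetic can be checked mechanically; a small sketch using the numbers from the log (2 bytes per block-size entry, 6 bytes per main index entry, 64 bytes per block):

```c
/* Reproduce the indexing-overhead arithmetic from the log above. */
long index_overhead(long num_blocks, long num_indices)
{
    return 2 * num_blocks + 6 * num_indices;   /* size table + main index */
}

long table_size(long real_num_blocks, long num_blocks, long num_indices)
{
    return index_overhead(num_blocks, num_indices) + 64 * real_num_blocks;
}
```

With num_blocks = 13285, num_indices = 438 and real_num_blocks = 13284 this gives 29198 and 879374 bytes, matching the 4.2% overhead relative to the 843828 raw compressed bytes.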
Cardoso
Posts: 363
Joined: Thu Mar 16, 2006 7:39 pm
Location: Portugal
Full name: Alvaro Cardoso

Re: tablebase caching / mmap() / page cache

Post by Cardoso »

This is an eye-opening post, many thanks for sharing.
I don't have experience with this technique. Do you think the Windows counterpart has performance comparable to mmap() on Linux?
Another question: do you think it is feasible, and superior, to use memory-mapped files in EGTB generators?

best regards,
Alvaro
syzygy
Posts: 5714
Joined: Tue Feb 28, 2012 11:56 pm

Re: tablebase caching / mmap() / page cache

Post by syzygy »

Cardoso wrote:I don't have experience with this technique. Do you think the Windows counterpart has performance comparable to mmap() on Linux?
They probably have similar advantages in similar situations.
Another question: do you think it is feasible, and superior, to use memory-mapped files in EGTB generators?
If you mean generating a tablebase that does not fit in RAM, then the first three options that come to mind are:
1. Allocating an amount of virtual memory larger than physical memory, backed by the swap file or swap partition. When the generation has finished, the memory area is written to disk.
2. Creating a file of the right size on disk, mapping the file to memory, and working on the mapped memory area. This is similar to (1) but uses a regular disk file to back the area of virtual memory instead of the swap file or the swap partition, and when the generation has finished there is no need to explicitly write the area to disk.
3. Creating a file of the right size on disk and "manually" reading parts of the file into memory and writing parts from memory back to the file.

Options (1) and (2) are similar. It might be that using a swap partition as backing store is more efficient than using a regular disk file and that this outweighs the advantage of not having to write the generated table back to disk. Both options rely on the OS to handle most of the work.

Option (3) requires more work from the programmer, but also gives more control. When generating a tablebase of a size much greater than available RAM, it seems necessary to carefully plan the order in which various entries are accessed and which parts of the tablebase are preloaded so as to minimise disk I/O. Option (3) seems most suitable for this. But an algorithm that is designed to have good locality of access might do well with (1) and (2) as well.
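A minimal sketch of option (2) under POSIX, with hypothetical names: create the file at full size, map it shared, and let the OS page it in and out.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a disk file of the given size as the working array for table
 * generation.  With MAP_SHARED, stores through the returned pointer
 * are written back to the file by the kernel; no explicit final write
 * is needed beyond msync()/munmap().  Returns NULL on failure. */
uint8_t *map_table(const char *path, size_t size, int *fd_out)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)size) < 0) {  /* create file at full size */
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return NULL;
    }
    *fd_out = fd;
    return p;                              /* generator writes through p[] */
}
```

The generator then treats the returned pointer as an ordinary in-memory array; the OS decides which pages stay resident.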
Rein Halbersma
Posts: 751
Joined: Tue May 22, 2007 11:13 am

Re: tablebase caching / mmap() / page cache

Post by Rein Halbersma »

syzygy wrote:
Option (3) requires more work from the programmer, but also gives more control. When generating a tablebase of a size much greater than available RAM, it seems necessary to carefully plan the order in which various entries are accessed and which parts of the tablebase are preloaded so as to minimise disk I/O. Option (3) seems most suitable for this. But an algorithm that is designed to have good locality of access might do well with (1) and (2) as well.
Regarding programmer control in probing during the search: with a manual cache you can do conditional lookups. I.e., if a position is present in the cache, you take its value, but if it's not, you can decide then and there (based on search depth, static eval and whatnot) whether or not to go to disk. With mmap(), every probe is done unconditionally. I'm not saying this flexibility should be bought at any price (clearly a manual cache is a lot of code to maintain), but at least in draughts it has been reported to matter. If mmap() offered an is_in_physical_ram() function, that would be awesome, but I don't think that's in the cards.
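A minimal sketch of such a conditional (soft) probe, with a toy in-memory cache standing in for the real one (all names hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>

#define PROBE_MISS (-1)

/* Toy direct-mapped cache: 0 means empty, otherwise value+1. */
static int cache[256];

static int cache_probe(uint32_t key)        /* never touches disk */
{
    int e = cache[key & 255];
    return e ? e - 1 : PROBE_MISS;
}

static int disk_probe(uint32_t key)         /* stands in for slow disk I/O */
{
    int v = (int)(key % 3);                 /* pretend WDL value */
    cache[key & 255] = v + 1;
    return v;
}

/* Soft probe: a cached value is free; the disk read is only paid for
 * when the caller (the search) decides the node is important enough,
 * here represented by a simple depth threshold. */
int soft_probe(uint32_t key, int depth, int min_depth, bool *ok)
{
    int v = cache_probe(key);
    if (v != PROBE_MISS) { *ok = true; return v; }
    if (depth < min_depth) { *ok = false; return PROBE_MISS; }
    *ok = true;
    return disk_probe(key);
}
```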
syzygy
Posts: 5714
Joined: Tue Feb 28, 2012 11:56 pm

Re: tablebase caching / mmap() / page cache

Post by syzygy »

Rein Halbersma wrote:Regarding programmer control in probing during the search: with a manual cache you can do conditional lookups. I.e., if a position is present in the cache, you take its value, but if it's not, you can decide then and there (based on search depth, static eval and whatnot) whether or not to go to disk.
True, I also mentioned this in the opening post.
With mmap(), every probe is done unconditionally. I'm not saying this flexibility should be bought at any price (clearly a manual cache is a lot of code to maintain), but at least in draughts it has been reported to matter. If mmap() offered an is_in_physical_ram() function, that would be awesome, but I don't think that's in the cards.
I think it should somehow be possible by installing a page fault handler, but the complexity and overhead might well be prohibitive.
Rein Halbersma
Posts: 751
Joined: Tue May 22, 2007 11:13 am

Re: tablebase caching / mmap() / page cache

Post by Rein Halbersma »

syzygy wrote:
Rein Halbersma wrote:Regarding programmer control in probing during the search: with a manual cache you can do conditional lookups. I.e., if a position is present in the cache, you take its value, but if it's not, you can decide then and there (based on search depth, static eval and whatnot) whether or not to go to disk.
True, I also mentioned this in the opening post.
Ah now I see that you meant Gaviota's soft probes with that, I hadn't made the connection upon first reading.
Cardoso
Posts: 363
Joined: Thu Mar 16, 2006 7:39 pm
Location: Portugal
Full name: Alvaro Cardoso

Re: tablebase caching / mmap() / page cache

Post by Cardoso »

Rein Halbersma wrote:
syzygy wrote:
Rein Halbersma wrote:Regarding programmer control in probing during the search: with a manual cache you can do conditional lookups. I.e., if a position is present in the cache, you take its value, but if it's not, you can decide then and there (based on search depth, static eval and whatnot) whether or not to go to disk.
True, I also mentioned this in the opening post.
Ah now I see that you meant Gaviota's soft probes with that, I hadn't made the connection upon first reading.
So with mmap(), since we can't know in advance whether the data is cached, we can only condition on depth, score, static eval...
Let's hope OS designers improve this, but we have to tell them, otherwise they might never know about this need.
Maybe Nalimov can pass the word inside MS (sorry, I don't program on Linux).

best regards,
Alvaro
syzygy
Posts: 5714
Joined: Tue Feb 28, 2012 11:56 pm

Re: tablebase caching / mmap() / page cache

Post by syzygy »

Cardoso wrote:So with mmap(), since we can't know in advance whether the data is cached, we can only condition on depth, score, static eval...
Actually, we can. I just discovered that Linux has mincore():

Code: Select all

       mincore() returns a vector that indicates whether pages of the calling
       process's virtual memory are resident in core (RAM), and so will not
       cause a disk access (page fault) if referenced.  The kernel returns
       residency information about the pages starting at the address addr,
       and continuing for length bytes.

       The addr argument must be a multiple of the system page size.  The
       length argument need not be a multiple of the page size, but since
       residency information is returned for whole pages, length is
       effectively rounded up to the next multiple of the page size.  One
       may obtain the page size (PAGE_SIZE) using sysconf(_SC_PAGESIZE).

       The vec argument must point to an array containing at least
       (length+PAGE_SIZE-1) / PAGE_SIZE bytes.  On return, the least
       significant bit of each byte will be set if the corresponding page is
       currently resident in memory, and be clear otherwise.  (The settings
       of the other bits in each byte are undefined; these bits are reserved
       for possible later use.)  Of course the information returned in vec
       is only a snapshot: pages that are not locked in memory can come and
       go at any moment, and the contents of vec may already be stale by the
       time this call returns.
Windows has VirtualQuery().

These functions are undoubtedly too expensive to call on every probe, but maybe some balance can be found.
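For example, a minimal is_in_physical_ram()-style check on Linux might look like this (the result is only a snapshot, as the man page warns):

```c
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if the page containing addr is resident in RAM, 0 if it
 * would fault to disk, -1 on error (e.g. addr not mapped at all).
 * mincore() requires a page-aligned address, so round down first. */
int page_is_resident(const void *addr)
{
    long psize = sysconf(_SC_PAGESIZE);
    uintptr_t page = (uintptr_t)addr & ~((uintptr_t)psize - 1);
    unsigned char vec;
    if (mincore((void *)page, 1, &vec) < 0)
        return -1;
    return vec & 1;                  /* bit 0 = resident */
}
```

A tablebase probe could call this first and fall back to a "soft miss" when the page is not resident, instead of blocking on disk I/O.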

What would be ideal is an x86-64 load instruction that would not result in a page fault if the required memory page is not in RAM, but that would set an error flag. I don't think such an instruction exists in the x86-64 instruction set, but I see no reason why it could not be added. So we have to ask Intel and AMD.
Cardoso
Posts: 363
Joined: Thu Mar 16, 2006 7:39 pm
Location: Portugal
Full name: Alvaro Cardoso

Re: tablebase caching / mmap() / page cache

Post by Cardoso »

Just an additional question:
Since there can be many TB files: when you mmap() a subdatabase, do you keep it mapped until the engine is shut down or a new game is started, or do you use an LRU policy to munmap() them and free up memory during normal engine play?

Also, when mmap()ing a subdatabase, is the size of the mapped memory region the same for all subdatabases, or is it scaled according to the subdatabase size?

best regards,
Alvaro