New Database Storage Method

Discussion of chess software programming and technical issues.

Moderator: Ras

Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: New Database Storage Method

Post by Edmund »

Thanks for the clarification.

As always I really appreciate reading your comments. In this case I very much like your idea of indexing by hash-key.

You could trade off the complexity of the overflowtable by some additional computation.
You use the 4byte per entry like this: the upper flag informs whether > 1 games pass through this position. The rest is an index to one particular game passing through this position.
You could then quickly build a list of matching games by doing a treesearch collecting game indexes as long as the flag is set. Each game needs to be verified however.
User avatar
hgm
Posts: 28443
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: New Database Storage Method

Post by hgm »

Indeed, on a hit you would have to verify if the position actually occurs in the game, because there is a 50% chance that you will hit something, while in fact it is likely that you are searching a position that is not in the tablebase at all. If every user query would have to search a big tree of possibly similar positions, this could become a problem.

The amount of workload going into this could be reduced by storing a few bits of signature. If the number of games is significantly smaller than 2^31, any bits not needed for the game number could be used for that. Both when the game number is directly stored in the entry, as in overflow cases (where they would be stored in the overflow area). What I did in WinBoard in cases of overflow was to allocate a cell of three 4-byte words, which could hold either three game numbers, or two game numbers plus a pointer to another cell, again the sign bit indicating which was the case.

Of course the 2^31 is a quite arbitrarily chosen number; due to machine word-length there will always be database sizes for which it works more efficiently as with others. Databases encountered in practice are quite likely to have 10-100M games, which does naturally leave some 4-6 spare bits if you want to byte-align the entries.
User avatar
jshriver
Posts: 1371
Joined: Wed Mar 08, 2006 9:41 pm
Location: Morgantown, WV, USA

Re: New Database Storage Method

Post by jshriver »

I would be interested in working on this. Maybe the one thing coding wise I could contribute to the community.

Have been developing a subset of your wants as part of my golden tree project.
User avatar
hgm
Posts: 28443
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: New Database Storage Method

Post by hgm »

I was thinking that if you are really designing this for being disk-based rather than RAM based, it would be advantageous to make way-larger buckets, adapted to the system cluster size. The disadvantage of large buckets is that you have to count bits in the hit/miss bitmap to know where the actual data is stored in the bucket, in case of a hit. But in case of disk access even counting the set bits in a bitmap of 1000 bits is negligible work compared to reading the data. (Especially if you can use POPCNT.) Another disavantage is that you have to sort new entries in between, and thus move the tail of the bucket to create space, but that applies only during building, which is not the main concern.

The advantage of large buckets is that the standard deviation in their filling fraction is smaller, so that it is easier to create a 2-sigma or even 5-sigma above-average capacity without reserving too much extra size.

E.g. if your count on disk clusters of 4KB, (1024 entries with 4-byte game numbers), you could aim for 841 (= 29^2) filled entries, and the SD of the number of entries mapping in them would be 29. You would need a bitmap for 2x841 bits (half of them set, on average), which would take 52 32-bit words. That would leave on average 131 unoccupied entries in the bucket, about 4.5 sigma, so that the chances to overflow are very small.

A way to handle overflow could be to just wrap around inside the bucket. E.g. when we would have 24-entry buckets (RAM-based design again), and a 32-bit hit/miss bitmap, and there happen to map 26 positions to the bucket, the 25th set bit could correspond to the first entry again. The linked list for this entry will then contain both positions with indexes corresponding to the first and the 25th set bit of the hit/miss bitmap. This would be a problem during building, if another entry in between gets populated, because then the positions corresponding to the key of the 25th set bit, which now becomes the 26th set bit, should be moved from entry 0 to entry 1, but those corresponding to the first bit should stay.

But a linked-list format with 3 entries per cell can easily accommodate that. The assumption is that a single bit in each entry indicates if the entry contains a game number or a pointer to another cell, and a single code (presumably 0) is reserved for indicating an empty entry (implying list end). If there are entries from two different keys in the list, the entry for the key that mapped there through wrapping around can by convention be stored in the third element of the cell. So three game numbers in the primary cell would all be taken as belonging to the principal key for that list (and imply no wrapping occurred). A cell (number, number, empty) would imply a list of two for the primary key and no wrapping, while (number, empty, number) would imply one item for the primary and one for the wrapped. The wrapped key would never get more than a single entry in the first cell; if it has more than one game, the third element of the first cell would be a pointer. Like (number, empty, pointer) if the list for the primary key had only a single element, and (number, pointer, pointer) if the it had more. Finally (number, pointer, number) would be used for a primary list of more than one, and a single wrapped entry.

The retrieval would still be very simple: you just loop through the entries in the cells, until you reach cell end or empty (which both terminates the list), but if you encounter a number, you continue at the start of the cell it points to. The only difference between a primary hit and a wrapped hit is that in the latter case the entry in the bucket will always contain a pointer, and you have to start the list-walk at the 3rd element of the cell it points to, rather than in the first.