New Database Storage Method

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

New Database Storage Method

Post by Edmund »

I want to propose the following theoretical model for a new chess database format.

I would appreciate your thoughts and suggestions for optimization.

The criteria I had in mind when designing this format:
  • Fast import/export of a large number of games
  • Instant lookup of certain positions
  • Compactness, compared to other formats that feature instant position lookup
The format has the following drawbacks:
  • Slower than other formats in exporting individual games
  • Slower than other formats in (fully) deleting individual games
  • Slower than other formats in answering questions like: “How often does Player x castle queenside?”
The format uses three main tables:
  • Game Metadata
  • Position-Table
  • Endgame-Table
Game Metadata
One line per game; each line holds the data from the PGN tags (players, result, Elo, etc.).
Additionally it stores an Opening ID. This ID references into a general table with the most common opening lines.
It stores the number of moves in the game.
It stores the position where the game converted into a 5-men ending, or the final position if the game terminated earlier.
It stores a game hash, generated by XORing ror(hash(move), ply) over every move of the game.

Position-Table
The position table is a collection of zobrist keys of positions that have been reached by any game in the database between leaving the opening table and entering the 5-men ending.

Endgame-Table
It holds a collection of move-lines of games that go beyond a 5-men ending. It references into the Game Metadata table.

Additional tables might be needed for variation-pointers and comments.

Use-case 1: Find all games that feature a certain position:
  1. Check if the position is present in the Position-Table
  2. Check if an entry from the Game Metadata table points to the position
  3. Generate all moves from the current position
  4. For each move continue at 1 (guard against endless loops)
Use-case 2: Load an individual game
  1. Load the ending from the Endgame-Table
  2. Start with the final position
  3. Check if the position is present in the Position Table
  4. Check if the position equals the opening position
  5. Check if the game hash is correct
  6. Generate all retro-moves from the current position
  7. For each move continue at 3 (guard against endless loops)
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: New Database Storage Method

Post by hgm »

What size hash key did you have in mind? I am a bit skeptical as to whether this could ever be compact, and it also seems to stretch the meaning of 'instant': it seems you have to do a sequential search through the game metadata multiple times to identify the game.

Wouldn't it be better to just store the games in a 1-byte/move format, and then make a table hashed by position that contains the list of game numbers in which the position occurs?
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: New Database Storage Method

Post by Edmund »

64 bit hash keys should do for the position table. For the game hash probably longer.

The game metadata would be sorted by end position.

Indeed, I am adding some computational overhead for the benefit of reducing storage requirements.
Storing the 1-byte per move sequence takes up much more space. Additionally adding pointers to the position table also increases storage requirements considerably (these pointers won't compress very well). The 64 bit hash values in an ordered list would compress significantly better.

Having said that, I would consider it much more important to quickly determine whether a position is present in the database at all than to instantly retrieve all the game lines.

I have the following use case in mind:
You have a certain position on the board and want to find all games in the database with minor changes in the piece configuration.
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: New Database Storage Method

Post by hgm »

Edmund wrote:Storing the 1-byte per move sequence takes up much more space.
Uhh? Your hash keys are 8 bytes, and 8 > 1.
Additionally adding pointers to the position table also increases storage requirements considerably (these pointers won't compress very well).
Why would you add such pointers if the games already have the moves in 1-byte format?
The 64 bit hash values in an ordered list would compress significantly better.
But not storing them at all would take less space than even the best compression...
I have the following use case in mind:
You have a certain position on the board and want to find all games in the database with minor changes in the piece configuration.
That is a bit fuzzy. What do you consider a 'minor change'? One piece in a different location? Two pieces? A Pawn on d6 instead of e6? A Pawn missing?
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: New Database Storage Method

Post by Edmund »

hgm wrote:
Edmund wrote:Storing the 1-byte per move sequence takes up much more space.
Uhh? Your hash keys are 8 bytes, and 8 > 1.
Additionally adding pointers to the position table also increases storage requirements considerably (these pointers won't compress very well).
Why would you add such pointers if the games already have the moves in 1-byte format?
The 64 bit hash values in an ordered list would compress significantly better.
But not storing them at all would take less space than even the best compression...
Let's take an average game of 40 moves (let's disregard for the moment that I suggest treating opening and ending moves differently).

I suggest storing a hash key for the whole game (16 bytes) and, for each position, an entry in the position table (40 * 8 bytes). A total of 336 bytes.

Now there is the alternative option of only storing the movelist (40 * 1 byte), for a total of 40 bytes.

And then there is the option you suggested, which allows position lookup: "Wouldn't it be better to just store the games in a 1-byte/move format, and then make a table hashed by position that contains the list of game numbers in which the position occurs?" You have the movelist (again 40 bytes) and the position table, where each entry consists of the position hash and the index of the game. Let's say there is a maximum of 2^32 games in the database. The size of the position table would be 40 * (8 + 4) bytes. In total this system requires 520 bytes.
hgm wrote:
I have the following use case in mind:
You have a certain position on the board and want to find all games in the database with minor changes in the piece configuration.
That is a bit fuzzy. What do you consider a 'minor change'? One piece in a different location? Two pieces? A Pawn on d6 instead of e6? A Pawn missing?
I define this as a position that can be reached from the other position within n consecutive alterations. An alteration can be:
- any move
- any nullmove
- any retro-move that is not an uncapture
- removal of pawns and pieces that are not moved for m consecutive moves
- addition of pawns that are backward, not passed, not threatening any opponent pieces and not adjacent to their own king
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: New Database Storage Method

Post by hgm »

Edmund wrote: where each entry consists of the position hash and the index of the game. Let's say there is a maximum of 2^32 games in the database.
Well, I would not store the hash key. Just 4 bytes for the game number. I was designing for a maximum of 10^31 games. Because that is only half the size of the hash-keys, I could make it a hash table with 50% load factor and still use the same space as a sorted list of hash keys. Only 9% of the entries would suffer collisions, and I would put those in an overflow table.

This is how I first did it in WinBoard (before I abandoned the idea of searching by hash keys, as it would not work in most search modes). It was dimensioned a bit smaller, of course, as I intended it to be contained entirely in RAM. (So something like 8M games and 500M positions.)

I might still use that scheme if I am going to equip WinBoard with a book-building function.
Dave_N
Posts: 153
Joined: Fri Sep 30, 2011 7:48 am

Re: New Database Storage Method

Post by Dave_N »

I think I'm planning to store, in the hash node, the range of the sorted list of games that pass through the position, plus pointers. I don't think it would fit entirely in RAM.
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: New Database Storage Method

Post by Edmund »

hgm wrote:
Edmund wrote: where each entry consists of the position hash and the index of the game. Let's say there is a maximum of 2^32 games in the database.
Well, I would not store the hash key. Just 4 bytes for the game number. I was designing for a maximum of 10^31 games. Because that is only half the size of the hash-keys, I could make it a hash table with 50% load factor and still use the same space as a sorted list of hash keys. Only 9% of the entries would suffer collisions, and I would put those in an overflow table.

This is how I first did it in WinBoard (before I abandoned the idea of searching by hash keys, as it would not work in most search modes). It was dimensioned a bit smaller, of course, as I intended it to be contained entirely in RAM. (So something like 8M games and 500M positions.)

I might still use that scheme if I am going to equip WinBoard with a book-building function.
I don't understand how you get to 10^31 games.
I also don't understand why you would store a "sorted list of hash keys" when you first say "I would not store the hash key".

Do I understand correctly, that your hash keys are only 4 bytes long?

So you are suggesting to use a fixed-size position table with 2^31 entries of 4 bytes each, which holds the game index and is itself indexed by the hash key?
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: New Database Storage Method

Post by Edmund »

Dave_N wrote:I think I'm planning to store, in the hash node, the range of the sorted list of games that pass through the position, plus pointers. I don't think it would fit entirely in RAM.
How do you deal with multiple games passing through the same node? What sizes do you choose for the hash and the pointers?
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: New Database Storage Method

Post by hgm »

Edmund wrote:I don't understand how you get to 10^31 games.
I need a spare bit to indicate whether the entry contains a single game number or whether there is a collision, and the remaining 31 bits contain some sort of pointer to the overflow area where the game numbers are.
I don't understand that you store a "sorted list of hash keys", when you first say "I would not store the hash key".
I don't. But I thought you did. I just use a hash table of the same size as if I did that.
Do I understand correctly, that your hash keys are only 4 bytes long?
No, the hash keys need to be longer, because presumably with 2^31 games there will be more than 2^32 positions. The hash key needs to be at least long enough to derive an index from it that can address the entire hash table.
So you are suggesting to use a fixed size position table with 2^31 entries á 4 bytes that holds the game-index and is itself indexed by the hash key?
Not 2^31, but bigger: namely, twice the number of moves in those games.

In fact I think one could even save on the storage needed for that, because in a half-full table the empty entries can be Huffman-coded as a single bit. So you could add a bitmap with a single bit per entry, indicating whether the entry is empty or not, and then pack short stretches of filled entries into a smaller space. E.g. if you work in groups of 32 entries, there would be on average 16 filled, with a standard deviation of 4. So when you reserve 24 4-byte words for storing those 32 entries, that would be 2 sigma, and in 97.5% of those 'buckets' this would be enough. In the other 2.5% you could use an overflow bucket shared by, say, 20 buckets.