I couldn't find anything talking about the theory behind how to implement a chess database. I know that there are open source implementations such as Scid already out there.
But if you were to implement a modern chess database from scratch today what would be the architectural decisions and why? And how would it be different and better from just trying to shoehorn MySql or Postgresql to work.
How to implement a chess games database?
Moderator: Ras
-
hgm
- Posts: 28480
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: How to implement a chess games database?
I had the impression that these so-called Chess databases are in fact nothing more than just a collection of games, in some compressed format tailored to Chess. Through which you then simply do a linear search. The number of games you are likely to encounter is in the millions, not billions. So with present-day hardware you can afford to step through all of them (e.g. to search a position that satisfies some criteria) without causing unacceptable delays.
But I am no expert in this.
But I am no expert in this.
-
ccherng
- Posts: 11
- Joined: Thu Sep 13, 2012 3:18 am
Re: How to implement a chess games database?
So nothing based around Btrees or anything to make searching faster? Its really the case that Chessbase's database product does a simple linear scan?
-
hgm
- Posts: 28480
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: How to implement a chess games database?
I don't know how Chessbase works (and it is probably company secret), but I though a little bit about the problem for the 'database' functionality of WinBoard. I couldn't really think of a way to make it more efficient than linear search. The problem is that the user can search for almost any characteristic of a position. If he was just looking for exact positions, it would be easy, and you would just compute hash keys and sort the positions by those. But he can also search on material, Pawn structure + material, material range, material difference and partial position. Especially the latter is a pain. (Where he specifies the pieces on some squares, but the content of other squares ahould be ignored. Such as w: Kg1, Rf1 b: Kc8, Rd8 to retrieve all games where white castled King-side, and black Queen-side.)
So what I do is store the moves of all games in a machine-friendly format, and then run through all the games to play through them, and subject the positions to the comparison routine.
I guess it should be possible to speed up some of the search modes by sorting positions by material composition. Inexact material (i.e. range or fixed difference) can always be done by looping over all material compositions that satisfy the requirements, and you could retrieve them one at the time. The problem, however, is that I don't really store positions, but moves (as these can be stored so much more compactly), so that knowing that position number M in game number N satisfies the material requirement still requires you to 'expand' the game to get the position (e.g. to see if it has the sought Pawn structure). You would not want to expand the same game multiple times when it satisfies several material compositions in a requested range.
The problem is that any form of tabulating (e.g. by position hash key, Pawn hash key, material key) takes up far more space than just storing the moves (which WinBoard does in two bytes, but for a dedicated Chess encoding can be done in a single byte). The game number alone would not fit in 3 bytes, so likely 6 bytes would be required to point to a position in a game. Tabulating several aspects thus easily drives up the storage requirement by two orders of magnitude.
So what I do is store the moves of all games in a machine-friendly format, and then run through all the games to play through them, and subject the positions to the comparison routine.
I guess it should be possible to speed up some of the search modes by sorting positions by material composition. Inexact material (i.e. range or fixed difference) can always be done by looping over all material compositions that satisfy the requirements, and you could retrieve them one at the time. The problem, however, is that I don't really store positions, but moves (as these can be stored so much more compactly), so that knowing that position number M in game number N satisfies the material requirement still requires you to 'expand' the game to get the position (e.g. to see if it has the sought Pawn structure). You would not want to expand the same game multiple times when it satisfies several material compositions in a requested range.
The problem is that any form of tabulating (e.g. by position hash key, Pawn hash key, material key) takes up far more space than just storing the moves (which WinBoard does in two bytes, but for a dedicated Chess encoding can be done in a single byte). The game number alone would not fit in 3 bytes, so likely 6 bytes would be required to point to a position in a game. Tabulating several aspects thus easily drives up the storage requirement by two orders of magnitude.
-
jdart
- Posts: 4429
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: How to implement a chess games database?
Chessbase databases are indexed for faster searches, like most databases.
I don't know much about the internal representation though.
If you were starting from scratch you should probably look at existing commercial and open-source embeddable databases (see http://en.wikipedia.org/wiki/Embedded_database).
--Jon
I don't know much about the internal representation though.
If you were starting from scratch you should probably look at existing commercial and open-source embeddable databases (see http://en.wikipedia.org/wiki/Embedded_database).
--Jon
-
hgm
- Posts: 28480
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: How to implement a chess games database?
Well, I guess indexing games by the various PGN tags is not very costly, as games requires some hundreds of bytes in storage anyway. OTOH, a stupid linear search for matches through 10 million tags would only take the blink of an eye anyway, so why bother to make it 10 microseconds in stead of 100 msec?
The problem is thow to index the positions, so that all possible flavors of position search become easy. Perhaps it would be an idea to make an index for the occupant of every square: a list of positions that have a Pawn on a2, a list of positions that have a Knight there, etc. Then searches for a partial position could simply intersect the lists of positions for each of the given square occupants. The storage requirement would be enormous, however: a position number needs to be at least 4 bytes, and about 25% of the positions will have something on a particular square (assuming the filling fraction of the board ranges from 0% to 50%). So that means every position will occur in 16 index lists, requiring 64 bytes.
What is optimal depends a lot on the typical usage that you target. If the entire database would fit in memory with 1 byte per position, but not with 64 bytes per position, you would probably be better off using a stupid search algorithm that is fast because it uses only memory, than a smart algorithm that has to search through a disk file.
The problem is thow to index the positions, so that all possible flavors of position search become easy. Perhaps it would be an idea to make an index for the occupant of every square: a list of positions that have a Pawn on a2, a list of positions that have a Knight there, etc. Then searches for a partial position could simply intersect the lists of positions for each of the given square occupants. The storage requirement would be enormous, however: a position number needs to be at least 4 bytes, and about 25% of the positions will have something on a particular square (assuming the filling fraction of the board ranges from 0% to 50%). So that means every position will occur in 16 index lists, requiring 64 bytes.
What is optimal depends a lot on the typical usage that you target. If the entire database would fit in memory with 1 byte per position, but not with 64 bytes per position, you would probably be better off using a stupid search algorithm that is fast because it uses only memory, than a smart algorithm that has to search through a disk file.
-
kbhearn
- Posts: 411
- Joined: Thu Dec 30, 2010 4:48 am
Re: How to implement a chess games database?
Consecutive positions however are usually almost the same. if you use a byte array for your index, 13 states are used for the pieces and an empty square. that leaves 243 states to specify run lengths of 3-245. so an index might be something like: whitepawn run20 empty run30 whiteknight blackknight whitepawn whitepawn empty run60
the end result being that a new position for a normal move (not en passant or castling) usually adds 2 bytes to 2 indexes for the squares that actually changed for a total of 4 bytes.
this still of course requires some way to convert integer offsets from your search of the indices into positions in however you're storing the actual games. accordingly i tend to agree that for current typical database sizes and search needs a format that sequentially searches games is probably as efficient as anything.
edit: fixed basic math
the end result being that a new position for a normal move (not en passant or castling) usually adds 2 bytes to 2 indexes for the squares that actually changed for a total of 4 bytes.
this still of course requires some way to convert integer offsets from your search of the indices into positions in however you're storing the actual games. accordingly i tend to agree that for current typical database sizes and search needs a format that sequentially searches games is probably as efficient as anything.
edit: fixed basic math
-
phhnguyen
- Posts: 1541
- Joined: Wed Apr 21, 2010 4:58 am
- Location: Australia
- Full name: Nguyen Hong Pham
Re: How to implement a chess games database?
You don't need any complicated things such as btree. Few simple lists are enough. The largest chess database (for human games) just around 5 million games, quite small for searching even on an iphone (you can check my apps on AppStore named Banksia chess database).
I have spent majority of time for compression and management instead.
I have spent majority of time for compression and management instead.
-
rreagan
- Posts: 102
- Joined: Sun Sep 09, 2007 6:32 am
Re: How to implement a chess games database?
You would have to make a space-time trade off based on the requirements you decide upon beforehand. You can't offer every search parameter AND have all of it be fast. It's not hard to do what you want if you use extra space. Add 64 columns to your database table, one for each square, then it's a simple SQL SELECT to get exactly what you want (Kg1, Rf1, kc8, rd8, ignoring all other squares).hgm wrote:The problem is that the user can search for almost any characteristic of a position... But he can also search on material, Pawn structure + material, material range, material difference and partial position. Especially the latter is a pain. (Where he specifies the pieces on some squares, but the content of other squares ahould be ignored. Such as w: Kg1, Rf1 b: Kc8, Rd8 to retrieve all games where white castled King-side, and black Queen-side.)
-
hgm
- Posts: 28480
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: How to implement a chess games database?
Well, using run-length coding on that extra information was definitely a good idea, and I will consider it for WinBoard.