Speed up arbitrary position search in database

Discussion of chess software programming and technical issues.

Moderators: bob, hgm, Harvey Williamson

Forum rules
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.
Post Reply
dkl
Posts: 13
Joined: Wed Jan 14, 2015 4:55 pm

Speed up arbitrary position search in database

Post by dkl » Sat Mar 17, 2018 11:25 pm

Suppose we have a chess database, like SCID or Chessbase, i.e. where games are stored as a sequence of moves in a binary file on disk.

Suppose further, a user wants to search for a specific chess position (maybe parametrized, but let's stick to the simple case for now).

The naive search strategy would then be to linearly search the binary file, i.e. load game #1, execute the game by applying move by move, and comparing each position to the search pattern, load game #2, and continue until all games have been visited.

Obviously only very few games will match. So to speed up search, it makes sense to store additional information, to rule out non-matching games as quickly as possible. I am aware that Chessbase uses a "search booster" file, but no information about the stored information is available.

SCID apprently stores two kind of information: 1.) The sequence on how pawns left their home fields (apparently to speed up opening position search, i.e. if 1e4 e5 is indexed information stored for a game and we search for a position where black's e-pawn is still at e7, then we can immeditately rule out this game. 2.) The material of the final position. Suppose we search for a position where there is no black bishop, but in the final position of a game there is (still) a black bishop. Skipping the case for previous pawn promotions for a moment, we can again rule out this game immediately.

So to come to my question: Is this optimal? What other information should be stored to speed up position search? Are there any investigations/material/statistics/practical experience on what works, and what doesn't work? Also considering from a perspective of tradeoff of required storage/implementational complexity vs gained speed.

Any input is greatly appreciated!

jdart
Posts: 3842
Joined: Fri Mar 10, 2006 4:23 am
Location: http://www.arasanchess.org

Re: Speed up arbitrary position search in database

Post by jdart » Sun Mar 18, 2018 12:49 am

The solution that immediately comes to mind is: hash code the position (Zobrist hash) and index the hash column in the DB.

--Jon

CheckersGuy
Posts: 273
Joined: Wed Aug 24, 2016 7:49 pm

Re: Speed up arbitrary position search in database

Post by CheckersGuy » Sun Mar 18, 2018 2:09 am

he solution that immediately comes to mind is: hash code the position (Zobrist hash) and index the hash column in the DB.

--Jon
Would be my first choice too. I might be wrong but the reason this may not be used is that hashtables arent well suited for storing something on a harddrive/disk.
Suppose we have a chess database, like SCID or Chessbase, i.e. where games are stored as a sequence of moves in a binary file on disk.

Suppose further, a user wants to search for a specific chess position (maybe parametrized, but let's stick to the simple case for now).

The naive search strategy would then be to linearly search the binary file, i.e. load game #1, execute the game by applying move by move, and comparing each position to the search pattern, load game #2, and continue until all games have been visited.

Obviously only very few games will match. So to speed up search, it makes sense to store additional information, to rule out non-matching games as quickly as possible. I am aware that Chessbase uses a "search booster" file, but no information about the stored information is available.

SCID apprently stores two kind of information: 1.) The sequence on how pawns left their home fields (apparently to speed up opening position search, i.e. if 1e4 e5 is indexed information stored for a game and we search for a position where black's e-pawn is still at e7, then we can immeditately rule out this game. 2.) The material of the final position. Suppose we search for a position where there is no black bishop, but in the final position of a game there is (still) a black bishop. Skipping the case for previous pawn promotions for a moment, we can again rule out this game immediately.

So to come to my question: Is this optimal? What other information should be stored to speed up position search? Are there any investigations/material/statistics/practical experience on what works, and what doesn't work? Also considering from a perspective of tradeoff of required storage/implementational complexity vs gained speed.

Any input is greatly appreciated!

I had a similiar problem for a different dataset and used a B-Tree to store the information https://en.wikipedia.org/wiki/B-tree. The set needs to be ordered (position A < position B). Some information, like the things u mentioned, could be used for that I guess[/quote]

syzygy
Posts: 4458
Joined: Tue Feb 28, 2012 10:56 pm

Re: Speed up arbitrary position search in database

Post by syzygy » Sun Mar 18, 2018 2:33 am

dkl wrote:Suppose further, a user wants to search for a specific chess position (maybe parametrized, but let's stick to the simple case for now).
The simple case might be too simple. Being able to look up a specific position quickly is of course useful, but that ability is not going to be of any help once you want to support slightly more complicated queries.

User avatar
phhnguyen
Posts: 457
Joined: Wed Apr 21, 2010 2:58 am
Location: Australia
Full name: Nguyen Hong Pham
Contact:

Re: Speed up arbitrary position search in database

Post by phhnguyen » Sun Mar 18, 2018 2:48 am

jdart wrote:The solution that immediately comes to mind is: hash code the position (Zobrist hash) and index the hash column in the DB.

--Jon
Hash code is perfect for searching "exact" positions. However, from my experience, finding exact positions is not useful and there are so few games have exact positions to other games, specially from middle games. Users prefer to search similar positions which few pieces may be omitted and/or in diffident cells. Other problem is that hash data may take too much space to store themselves and some pointers to point back games.

For me, I have used material counters / signatures. They don't take too much space and can help to limit number of involving games and positions.

Dann Corbit
Posts: 10204
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA
Contact:

Re: Speed up arbitrary position search in database

Post by Dann Corbit » Sun Mar 18, 2018 6:48 am

Use an actual SQL database.
Store the games as a sequence of EPD positions that are uniquely indexed.
FastDB and Gigabase are good at caching.
Or MonetDB is another high speed option.

Even standard SQL stores like SQL Server, PostgreSQL would work well.

Eg:
Game has game header and then a child table like this:
unsigned short ply;
unsigned long long PositionID;
<whatever else you want to store for the position>
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.

AlvaroBegue
Posts: 922
Joined: Tue Mar 09, 2010 2:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: Speed up arbitrary position search in database

Post by AlvaroBegue » Sun Mar 18, 2018 7:09 am

The first thing that comes to mind is to use a Bloom filter to encode the positions a game has gone through, and quickly discard games where the position didn't appear.

The idea is to compute m bits (say m=1024) per game. Start with all bits set to 0, compute k different hash functions (say k=7) of each position and mark the bits that correspond to the values of the hash functions modulo m for all the positions the game goes through. When you are looking for a specific position, you just need to compute the k hash functions modulo m and go through the games, checking if the corresponding k bits are set.

This procedure will not miss any games where the position did appear, and it will give a rate of false positives which is

p = (1 - (1 - 1/m)^(k * n))^k

for a game that goes through n positions. If n=100, m=1024 and k=7, then p=.00732. So of the games that don't contain the position, you only have to check move by move one in 137 or so (I don't know if n=100 is typical or not; it probably depends on the source of the games in your database).

Post Reply