What to do about hash collisions?
64 bit hash values are not large enough for a perfect hash because a chess position cannot be encoded in that many bits.
I think you must key on the EPD position, though you would not have to use a character string if you are pushed for space.
I would want the system to be able to store 20 years of lichess data and guaranty correct results.
A 32 or 64 bit hash cannot do that.
Open Chess Game Database Standard
Moderator: Ras
-
- Posts: 12777
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: Open Chess Game Database Standard
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
-
- Posts: 391
- Joined: Tue Oct 08, 2019 11:39 pm
- Full name: Tomasz Sobczyk
Re: Open Chess Game Database Standard
I was using 72 bit hashes for my db software. With 72 bit hashes the expected number of collisions with 100B positions is 1 (around the number I got last time I was doing lichess, though it was ~1.3B games only), for 1T positions 105. I think it's a reasonable tradeoff. Good luck storing 72 bits in an sql column though. 64 bits might be okay for dbs up to ~20B positions.
dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.
Maybe you copied your stockfish commits from someone else too?
I will look into that.
-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Open Chess Game Database Standard
Yes we had that discussion too. Just add a second zobrist hash with a different base (IE random seeds for each piece and square).Dann Corbit wrote: ↑Wed Dec 01, 2021 12:52 am What to do about hash collisions?
64 bit hash values are not large enough for a perfect hash because a chess position cannot be encoded in that many bits.
I think you must key on the EPD position, though you would not have to use a character string if you are pushed for space.
I would want the system to be able to store 20 years of lichess data and guaranty correct results.
A 32 or 64 bit hash cannot do that.
As this will only affect INSERT speed everything is fine! With the birthday paradox it is 50% fine for up to 2^64 positions. So realistically 2^40+ positions are as safe as 0.5^14 so universally good enough.
So you have hash1 and hash 2 with 128 bit. And positions stay at 32bit and can grow to 48 if needed.
Quote for notification:
What do you say phhnguyen?phhnguyen wrote:
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Daniel Inführ - Software Developer
-
- Posts: 12777
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: Open Chess Game Database Standard
It is possible though very unlikely, for the problem to occur on the second position.
It should be possible to make a dictionary where the actual EPD position is stored, and assigned an 8 bit unsigned int, because 1.8quadrillion positions are plenty for anyone, even me. The ID value from this table defines the position and is an alternate index.
It should be possible to make a dictionary where the actual EPD position is stored, and assigned an 8 bit unsigned int, because 1.8quadrillion positions are plenty for anyone, even me. The ID value from this table defines the position and is an alternate index.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Open Chess Game Database Standard
I dont know how many bits does one full EPD take?Dann Corbit wrote: ↑Wed Dec 01, 2021 1:44 am It is possible though very unlikely, for the problem to occur on the second position.
It should be possible to make a dictionary where the actual EPD position is stored, and assigned an 8 bit unsigned int, because 1.8quadrillion positions are plenty for anyone, even me. The ID value from this table defines the position and is an alternate index.
Zobrist would take 128 and collision for 1.0995116e+12 games is very very very unlikely - and with 1.8446744e+19 it is 50%.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Daniel Inführ - Software Developer
-
- Posts: 1524
- Joined: Wed Apr 21, 2010 4:58 am
- Location: Australia
- Full name: Nguyen Hong Pham
Re: Open Chess Game Database Standard
I can’t say much since I don’t have enough experience with this issue. Thus I mainly listen to you all.dangi12012 wrote: ↑Wed Dec 01, 2021 1:29 am
Quote for notification:What do you say phhnguyen?phhnguyen wrote:
On one hand, in the present as well as coming time I have been focusing on “medium” database sizes, say, between 3-10 million games since those sizes are the most popular for users. I think they (those sizes) won’t bring much trouble about collisions.
On the other hand, one of the main advantages of SQL databases over binary ones is that it is SO EASY to change data structures and types. We don’t have much trouble adding, removing fields or tables. Changing data types/sizes is just a matter of a few code lines too. I am quite happy with the speed of converting from the PGN file - it can convert 3.45 million games within 0.5 minutes. With that speed, we may change a lot of things then re-create the new database almost without suffering/penalty.
Finally, it is an open-working project. We all can try our own ideas. Just create some new branches, PRs and push them to the repository. Other people can examine/verify.
Last edited by phhnguyen on Wed Dec 01, 2021 2:36 am, edited 1 time in total.
https://banksiagui.com
The most features chess GUI, based on opensource Banksia - the chess tournament manager
The most features chess GUI, based on opensource Banksia - the chess tournament manager
-
- Posts: 12777
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: Open Chess Game Database Standard
Here is a thread on EPD/FEN compression:
forum3/viewtopic.php?f=7&t=76892&hilit=epd+compression
I am not sure how important it is to minimize, since I would create a dictionary.
One 8 byte serial unsigned integer alternate key for each Epd position.
That gives 1.8 quadrillion of them, which is unlikely to become exhausted in real world usage.
forum3/viewtopic.php?f=7&t=76892&hilit=epd+compression
I am not sure how important it is to minimize, since I would create a dictionary.
One 8 byte serial unsigned integer alternate key for each Epd position.
That gives 1.8 quadrillion of them, which is unlikely to become exhausted in real world usage.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Open Chess Game Database Standard
Ah now after sleeping on it I see what you mean. Its better to have a non colliding key column like EPD.Dann Corbit wrote: ↑Wed Dec 01, 2021 2:35 am Here is a thread on EPD/FEN compression:
forum3/viewtopic.php?f=7&t=76892&hilit=epd+compression
I am not sure how important it is to minimize, since I would create a dictionary.
One 8 byte serial unsigned integer alternate key for each Epd position.
That gives 1.8 quadrillion of them, which is unlikely to become exhausted in real world usage.
SQLite does use the right size automatically - so a INTEGER PRIMARY KEY AUTOINCREMENT for position key is perfect.
Code: Select all
CREATE TABLE Position (
pos_id INTEGER PRIMARY KEY AUTOINCREMENT,
pos_board text NOT NULL
pos_eval REAL
);
I would recommend pos_board - with some format behind it that needs the least amount of characters to describe a full position.
Maybe a uint64_t for occupancy - and then a 4bit character string to descibe each squares occupant. A binary blob is also ok.
To be clear - 50 move and halfmoveclock are not part of a position - they are part of the game. Since you can link to one position from multiple games.
The pos_board only is used during inserting and all existing games refer to the pos_id (and in fact are a list of positions)
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Daniel Inführ - Software Developer
-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Open Chess Game Database Standard
Yes but in the moment you dont parse the SAN but copy it over to the db? What would be needed for that extensions is to parse that and make positions its own table.phhnguyen wrote: ↑Wed Dec 01, 2021 2:34 amOn the other hand, one of the main advantages of SQL databases over binary ones is that it is SO EASY to change data structures and types. We don’t have much trouble adding, removing fields or tables.dangi12012 wrote: ↑Wed Dec 01, 2021 1:29 am
Quote for notification:What do you say phhnguyen?phhnguyen wrote:
So it add one table:
mDb->exec("DROP TABLE IF EXISTS Info");
mDb->exec("CREATE TABLE Info (Name TEXT UNIQUE NOT NULL, Value TEXT)");
mDb->exec("INSERT INTO Info (Name, Value) VALUES ('Version', '0.1')");
mDb->exec("INSERT INTO Info (Name, Value) VALUES ('Variant', 'standard')");
mDb->exec("INSERT INTO Info (Name, Value) VALUES ('License', 'free')");
mDb->exec("DROP TABLE IF EXISTS Events");
mDb->exec("CREATE TABLE Events (ID INTEGER PRIMARY KEY AUTOINCREMENT, Name TEXT UNIQUE)");
mDb->exec("INSERT INTO Events (Name) VALUES (\"\")"); // default empty
mDb->exec("DROP TABLE IF EXISTS Sites");
mDb->exec("CREATE TABLE Sites (ID INTEGER PRIMARY KEY AUTOINCREMENT, Name TEXT UNIQUE)");
mDb->exec("INSERT INTO Sites (Name) VALUES (\"\")"); // default empty
mDb->exec("DROP TABLE IF EXISTS Players");
mDb->exec("CREATE TABLE Players (ID INTEGER PRIMARY KEY, Name TEXT UNIQUE, Elo INTEGER)");
mDb->exec("INSERT INTO Players (ID, Name) VALUES (1, \"\")"); // default empty
mDb->exec("DROP TABLE IF EXISTS Games");
mDb->exec("CREATE TABLE Games (ID INTEGER PRIMARY KEY AUTOINCREMENT, EventID INTEGER, SiteID INTEGER, Date TEXT, Round INTEGER, WhiteID INTEGER, WhiteElo INTEGER, BlackID INTEGER, BlackElo INTEGER, Result INTEGER, Timer TEXT, ECO TEXT, PlyCount INTEGER, FEN TEXT, Moves TEXT, FOREIGN KEY(EventID) REFERENCES Events, FOREIGN KEY(SiteID) REFERENCES Sites, FOREIGN KEY(WhiteID) REFERENCES Players, FOREIGN KEY(BlackID) REFERENCES Players)");
...
Moves TEXT will die and a new table is created that contains all the moves in the form:
CREATE TABLE MOVES ...
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Daniel Inführ - Software Developer
-
- Posts: 1524
- Joined: Wed Apr 21, 2010 4:58 am
- Location: Australia
- Full name: Nguyen Hong Pham
Re: Open Chess Game Database Standard
No, since it depends on the purposes. Sometimes, users just want to read PGN files, sort, search for players, events, results, calculate points, Elo for players, and/or watch only a few games. It is no point to parse all moves of all games.dangi12012 wrote: ↑Wed Dec 01, 2021 11:44 am Moves TEXT will die and a new table is created that contains all the moves in the form:
CREATE TABLE MOVES ...
I think the field Moves has some reasons to exist. We may add some more fields (or new tables) when needed, say, MovesInCoordinate, BinaryMoves,...
BTW, all are possible, nothing fixed! All decisions are based on testing/benchmarks.
https://banksiagui.com
The most features chess GUI, based on opensource Banksia - the chess tournament manager
The most features chess GUI, based on opensource Banksia - the chess tournament manager