SCID4 database

Fulvio · Post by **Fulvio** » Wed Oct 13, 2021 5:07 pm

I noticed that I have forgotten two limits:

8) The variations/comments/NAGs counts are approximations, stored in only 4-bits and mapped to only 16 values: 0,1,2,3,4,5,6,7,8,9,10,15,20,30,40,50.
( https://sourceforge.net/p/scid/code/ci/ ... try.h#l293 )
The real value is rounded to the nearest available.
For example:
16 is rounded to 15
12 is rounded to 10
26 to 30

9) The half moves count is stored in 10-bits, so with a max value of 1023.

dangi12012 · Post by **dangi12012** » Wed Oct 13, 2021 7:05 pm

R. Tomasi wrote: ↑Wed Oct 13, 2021 4:41 pm
dangi12012 wrote: ↑Wed Oct 13, 2021 4:30 pm
Rein Halbersma wrote: ↑Wed Oct 13, 2021 2:04 pm
dangi12012 wrote: ↑Mon Oct 11, 2021 10:38 pm As I told you - with SQL you get O(1) lookup performance on primary keys. You dont have to take it from me - literally anyone working in IT industry with some SQL experience can verify.
Lookup on indices is O(log N), not O(1).
100% Wrong.
https://medium.com/@JasonWyatt/squeezin ... e175f3c346

You are wrong also. Test it please if you dont believe it...
https://www.geeksforgeeks.org/sql-query-complexity/

O(log(N)) is for a binary search. Which you would have to do for a primary STRING key.
But not nor a primary INTEGER key. Which is a hashtable lookup like a C++ map. Its O(1).

Please dont invent a binary database format. It will be slower - debugging will take forever and at the end of the day you dont have a standard like sql everyone can rely on - and have experience with. Im not even starting with data integrity even during unexpected crashes... Thats what databases do for you

R. Tomasi · Post by **R. Tomasi** » Wed Oct 13, 2021 7:21 pm

dangi12012 wrote: ↑Wed Oct 13, 2021 7:05 pm
R. Tomasi wrote: ↑Wed Oct 13, 2021 4:41 pm
dangi12012 wrote: ↑Wed Oct 13, 2021 4:30 pm
Rein Halbersma wrote: ↑Wed Oct 13, 2021 2:04 pm
dangi12012 wrote: ↑Mon Oct 11, 2021 10:38 pm As I told you - with SQL you get O(1) lookup performance on primary keys. You dont have to take it from me - literally anyone working in IT industry with some SQL experience can verify.
Lookup on indices is O(log N), not O(1).
100% Wrong.
https://medium.com/@JasonWyatt/squeezin ... e175f3c346
You are wrong also. Test it please if you dont believe it...
https://www.geeksforgeeks.org/sql-query-complexity/

O(log(N)) is for a binary search. Which you would have to do for a primary STRING key.
But not nor a primary INTEGER key. Which is a hashtable lookup like a C++ map. Its O(1).

You suggest encoding positions as integers?

Thank's for the link, though. Didn't know integral keys were O(1)!

Fulvio · Post by **Fulvio** » Wed Oct 13, 2021 7:59 pm

dangi12012 wrote: ↑Wed Oct 13, 2021 7:05 pm You are wrong also. Test it please if you dont believe it...
https://www.geeksforgeeks.org/sql-query-complexity/

O(log(N)) is for a binary search. Which you would have to do for a primary STRING key.
But not nor a primary INTEGER key. Which is a hashtable lookup like a C++ map. Its O(1).

Please dont invent a binary database format. It will be slower - debugging will take forever and at the end of the day you dont have a standard like sql everyone can rely on - and have experience with. Im not even starting with data integrity even during unexpected crashes... Thats what databases do for you

Directly from the sqlite website:
"One technique for avoiding a full table scan is to do lookups by rowid (or by the equivalent INTEGER PRIMARY KEY)."
"Since the information is stored in the table in rowid order, SQLite can find the correct row using a binary search. If the table contains N elements, the time required to look up the desired row is proportional to logN rather than being proportional to N as in a full table scan."
https://www.sqlite.org/queryplanner.html

Now, your posts are useless, offtopic and full of unbacked claims.
So, can you please stop trolling? (and for the others please do not feed the troll)
Didn't you already wrote: "I wont post here anymore".
So, please, stick to that.

dangi12012 · Post by **dangi12012** » Wed Oct 13, 2021 8:23 pm

R. Tomasi wrote: ↑Wed Oct 13, 2021 7:21 pm
dangi12012 wrote: ↑Wed Oct 13, 2021 7:05 pm
R. Tomasi wrote: ↑Wed Oct 13, 2021 4:41 pm
dangi12012 wrote: ↑Wed Oct 13, 2021 4:30 pm
Rein Halbersma wrote: ↑Wed Oct 13, 2021 2:04 pm
dangi12012 wrote: ↑Mon Oct 11, 2021 10:38 pm As I told you - with SQL you get O(1) lookup performance on primary keys. You dont have to take it from me - literally anyone working in IT industry with some SQL experience can verify.
Lookup on indices is O(log N), not O(1).
100% Wrong.
https://medium.com/@JasonWyatt/squeezin ... e175f3c346
You are wrong also. Test it please if you dont believe it...
https://www.geeksforgeeks.org/sql-query-complexity/

O(log(N)) is for a binary search. Which you would have to do for a primary STRING key.
But not nor a primary INTEGER key. Which is a hashtable lookup like a C++ map. Its O(1).
You suggest encoding positions as integers?

Thank's for the link, though. Didn't know integral keys were O(1)!

Excatly OP wants good performance. You store the zobrist hash as Integer primary key - and a zobrist with a different seed to eliminate any and all collisions. This gives you O(1)

I know it works because I used it in my last years project to have a lookup for any and all positions from 2013 - 2020 that have ever been played on lichess. Including SF rating - outcome and all usernames.
Everything at once doesnt fit but up to 120GB or so sqlite is blazingly fast to lookup any position + how often it was reached..

I made my point clear that reinventing a custom DB will be a painful slow process and at the end of the day data integrity and features like deduplication and compression are much much harder to implement not to mention the advanced B tree implentation you need when looking stuff up.

What you will end up will be slower - error prone and no developer on the world can maintain it for you.

Maybe I will release my parser - since it transforms any of these .pgn.bz2 into batched highly compatible sql insert statements.
https://database.lichess.org/

mvanthoor · Post by **mvanthoor** » Thu Oct 14, 2021 1:50 pm

R. Tomasi wrote: ↑Wed Oct 13, 2021 7:21 pm You suggest encoding positions as integers?

No; he suggested to just save the Zobrist of each position in each game. An 80 move game will have 160 positions. An 8 million game database will have 160 * 8.000.000 = 1.28 billion position rows. Storing the game_id and the zobrist_key (both 64-bit, or 16 bytes total) will come down to 20.48 billion bytes for this table, or about 19 GB if I'm not miscounting.

Then I haven't mentioned relations, or other information needed. That's the reason why chess databases mostly use binary formats. I think storing 8 million games in an SQL database will work fine, and it'll probably be fast enough for most tasks, but I'm not sure if it will be usable for a database containing millions of games. Chessbase databases (also a binary format) are definitely much smaller.

So it could be that performance is fine, but the space requirement will be comparatively huge.

dangi12012 · Post by **dangi12012** » Thu Oct 14, 2021 3:32 pm

mvanthoor wrote: ↑Thu Oct 14, 2021 1:50 pm
R. Tomasi wrote: ↑Wed Oct 13, 2021 7:21 pm You suggest encoding positions as integers?
No; he suggested to just save the Zobrist of each position in each game. An 80 move game will have 160 positions. An 8 million game database will have 160 * 8.000.000 = 1.28 billion position rows. Storing the game_id and the zobrist_key (both 64-bit, or 16 bytes total) will come down to 20.48 billion bytes for this table, or about 19 GB if I'm not miscounting.

Then I haven't mentioned relations, or other information needed. That's the reason why chess databases mostly use binary formats. I think storing 8 million games in an SQL database will work fine, and it'll probably be fast enough for most tasks, but I'm not sure if it will be usable for a database containing millions of games. Chessbase databases (also a binary format) are definitely much smaller.

So it could be that performance is fine, but the space requirement will be comparatively huge.

Your math is totally off. Its not about how many total positions but DISTINCT positions thats why we have a zobrist hash in the first place. 8 Million games dont need much storage at all. A few 100 MB. You dont store every position of every game. You store every distinct position once.
For the OP: Lookup the lichess sourcecode its open source.

For a good binary implementation read the github for Syzygy
"The number of unique legal 7-piece positions is 423,836,835,667,331. Syzygy tablebases store all their information in 18.4 TB, so at around
0.35 bits per position. This is much more compact than the proprietary 100 TB Lomonosov tablebases."

mvanthoor · Post by **mvanthoor** » Thu Oct 14, 2021 3:41 pm

dangi12012 wrote: ↑Thu Oct 14, 2021 3:32 pm Your math is totally off. Its not about how many total positions but DISTINCT positions thats why we have a zobrist hash in the first place. 8 Million games dont need much storage at all. A few 100 MB. You dont store every position of every game. You store every distinct position once.

But you do have to store or reference a collection of those positions to know which of them occured in a game. That will still take a fair amount of storage. Or, you'll have to play through the game, and then look into that position table for every move, which will not be very fast if you have 8 million games to go.

Don't get me wrong; in the future I want to look into writing my own user interface (to use with my own engine), and I'd rather use an SQL-database than my own binary format, or even SCID's format, because I already hit some of SCID's limits when testing it. But I'm still not convinced that SQL is the best choice.

odomobo · Post by **odomobo** » Thu Oct 14, 2021 6:18 pm

As far as I'm concerned, a custom solution can always be made to outperform a generic one, and this is a case where performance matters a lot.

Instead of trying to make an open standard that works for everyone, I think it's more practical to make a very permissive standalone library which can then be used in other projects. That should lower its barrier of use (without sacrificing performance), which I think is the most important concern.

dangi12012 · Post by **dangi12012** » Thu Oct 14, 2021 6:44 pm

odomobo wrote: ↑Thu Oct 14, 2021 6:18 pm As far as I'm concerned, a custom solution can always be made to outperform a generic one, and this is a case where performance matters a lot.

Instead of trying to make an open standard that works for everyone, I think it's more practical to make a very permissive standalone library which can then be used in other projects. That should lower its barrier of use (without sacrificing performance), which I think is the most important concern.

Endgame Tablebases in SQL? Hell no. You are correct there.

OP asked about "similar performance to how lichess can lookup a position" for 8 Million games. This is a perfect job for a sql database. Not only is it fast enough for user interaction - but if you think about a new statistic you want evaluated you can write a query in 1 minute. ie. what is the most common opening etc.
Writing a new query that is efficient in a custom binary format - requires at least expert knowledge about the layout etc - and extra code just for a simple question.

SCID4 database

Re: SCID4 database

Re: SCID4 database

Re: SCID4 database

Re: SCID4 database

Re: SCID4 database

Re: SCID4 database

Re: SCID4 database

Re: SCID4 database

Re: SCID4 database

Re: SCID4 database