Open Chess Game Database Standard

Discussion of chess software programming and technical issues.

Moderator: Ras

User avatar
phhnguyen
Posts: 1524
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Open Chess Game Database Standard

Post by phhnguyen »

dangi12012 wrote: Mon Nov 01, 2021 2:00 pm
PGNs may contain the evaluation, and it would be great to be able to look up every move in the db.
For example lichess has something like this:
1. e4 { [%eval 0.24] [%clk 0:03:00] } 1... e5 { [%eval 0.12] [%clk 0:03:00] } 2. Nf3 { [%eval 0.17] [%clk 0:02:59] } 2..
This can be used to select all games of a certain player and find the openings where he consistently plays a slightly mistaken move.
On one hand, I agree we can use those fields (evals, scores) for selecting moves as an opening book. A plain database is not good enough for that task, so we need extra fields. More data could help solve more problems.

However, there are questions about consistency, efficiency, and whether it is worth doing. Clearly, databases with games that are not from Lichess or not from human players don't have that info for each move. Speed is another problem: to pick a good move for an opening book you typically have to query several times, which is slow, inefficient, and may take several seconds in total on huge databases - too long for that task. Relying only on computed scores for opening books is another problem, since engines may be good at tactics but not at strategy.

The simple question is: why not just convert databases into “normal” opening books in the traditional way?
dangi12012 wrote: Mon Nov 01, 2021 2:00 pm
Do you plan to write a wrapper as well?
So that we have a C++ header that we can use and get some already implemented methods?

For example "Chess_Database.h" - which already countains a Load(char* path) and Count() Methods.
I want to add that sqlites slower count(*) is deliberate and can be done fast by:

Code: Select all

SELECT MAX(_ROWID_) FROM "table" LIMIT 1;
I don't know if there is a shared language that can emit code for Python, C++, C# and Java with SQLite backends. Then we would only need to write the simple query code once - it would generate code for all languages and we could have a central GitHub repo for that.
That approach may not be useful, for several reasons:
- It still takes time (even if faster than COUNT) because of the LIMIT clause
- It assumes the numbers in the ID field are contiguous. That may not be the case if some records have been removed
- Counting all records in a table (so far we have only a few tables) covers only a few of the queries we need. There are many other COUNT questions, involving one to all tables, for example: count all games of Carlsen in 2021, count all games of event ABC… To answer them, SQL engines usually scan the whole database, which takes time when the number of records is large (see the sketch below)
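As an illustration of such a query, here is a hedged sketch using the SQLite C API. The table and column names (game, player, white_id, black_id, name, date) are assumptions for the example, not necessarily the project's exact schema; without indexes on the joined/filtered columns, SQLite will indeed scan the whole game table.

Code: Select all

#include <sqlite3.h>

// Sketch: count all games of a given player in a given year.
// Without suitable indexes, SQLite scans the whole game table for this.
long long countGamesByPlayerYear(sqlite3* db, const char* player, const char* year)
{
    const char* sql =
        "SELECT COUNT(*) FROM game g "
        "JOIN player w ON w.id = g.white_id "
        "JOIN player b ON b.id = g.black_id "
        "WHERE (w.name = ?1 OR b.name = ?1) AND g.date LIKE ?2 || '%'";

    sqlite3_stmt* stmt = nullptr;
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr) != SQLITE_OK) return -1;

    sqlite3_bind_text(stmt, 1, player, -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(stmt, 2, year, -1, SQLITE_TRANSIENT);

    long long count = -1;
    if (sqlite3_step(stmt) == SQLITE_ROW)
        count = sqlite3_column_int64(stmt, 0);
    sqlite3_finalize(stmt);
    return count;
}

// usage sketch: countGamesByPlayerYear(db, "Carlsen, Magnus", "2021");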
https://banksiagui.com
The chess GUI with the most features, based on the open-source Banksia - the chess tournament manager
Dann Corbit
Posts: 12777
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Open Chess Game Database Standard

Post by Dann Corbit »

I had to make some changes to get the code to compile on Windows, so I include the code I used in the archive.
I took the Lichess database files and filtered for games with both players at or above 2500 Elo.
The result was 6,471,823 games (out of the raw total of 2,728,370,048 games). About 18.5 hours to build the book.
I am not coming back. I saw the project on GitHub but got no feedback there, so I thought I would contribute here, because it is an interesting project.
I won't leave the archives up forever, so if you want them, come and get them.

code:


Book:


Creation log:
C:\lichess>ocgdb -pgn li.pgn -db li.d3
Open Chess Game Database Standard, (C) 2021
SQLite database file 'li.d3' opened successfully
Processing PGN file: 'li.pgn'
#games: 65536, #errors: 0, elapsed: 398102 ms, 06:38, speed: 164 games/s
#games: 131072, #errors: 0, elapsed: 1073303 ms, 17:53, speed: 122 games/s
#games: 196608, #errors: 0, elapsed: 1746922 ms, 29:06, speed: 112 games/s
#games: 262144, #errors: 0, elapsed: 2426217 ms, 40:26, speed: 108 games/s
#games: 327680, #errors: 0, elapsed: 3101821 ms, 51:41, speed: 105 games/s
#games: 393216, #errors: 0, elapsed: 3801506 ms, 1:03:21, speed: 103 games/s
#games: 458752, #errors: 0, elapsed: 4471533 ms, 1:14:31, speed: 102 games/s
#games: 524288, #errors: 0, elapsed: 5156917 ms, 1:25:56, speed: 101 games/s
#games: 589824, #errors: 0, elapsed: 5838528 ms, 1:37:18, speed: 101 games/s
#games: 655360, #errors: 0, elapsed: 6536315 ms, 1:48:56, speed: 100 games/s
#games: 720896, #errors: 0, elapsed: 7207051 ms, 2:00:07, speed: 100 games/s
#games: 786432, #errors: 0, elapsed: 7873655 ms, 2:11:13, speed: 99 games/s
#games: 851968, #errors: 72, elapsed: 8439536 ms, 2:20:39, speed: 100 games/s
#games: 917504, #errors: 72, elapsed: 9113979 ms, 2:31:53, speed: 100 games/s
#games: 983040, #errors: 72, elapsed: 9896626 ms, 2:44:56, speed: 99 games/s
#games: 1048576, #errors: 72, elapsed: 10591610 ms, 2:56:31, speed: 99 games/s
#games: 1114112, #errors: 72, elapsed: 11272166 ms, 3:07:52, speed: 98 games/s
#games: 1179648, #errors: 72, elapsed: 11968237 ms, 3:19:28, speed: 98 games/s
#games: 1245184, #errors: 72, elapsed: 12649853 ms, 3:30:49, speed: 98 games/s
#games: 1310720, #errors: 72, elapsed: 13326645 ms, 3:42:06, speed: 98 games/s
#games: 1376256, #errors: 73, elapsed: 13894009 ms, 3:51:34, speed: 99 games/s
#games: 1441792, #errors: 73, elapsed: 14574325 ms, 4:02:54, speed: 98 games/s
#games: 1507328, #errors: 127, elapsed: 15243624 ms, 4:14:03, speed: 98 games/s
#games: 1572864, #errors: 128, elapsed: 15915758 ms, 4:25:15, speed: 98 games/s
#games: 1638400, #errors: 129, elapsed: 16590066 ms, 4:36:30, speed: 98 games/s
#games: 1703936, #errors: 150, elapsed: 17186459 ms, 4:46:26, speed: 99 games/s
#games: 1769472, #errors: 173, elapsed: 17772061 ms, 4:56:12, speed: 99 games/s
#games: 1835008, #errors: 209, elapsed: 18449052 ms, 5:07:29, speed: 99 games/s
#games: 1900544, #errors: 232, elapsed: 19127577 ms, 5:18:47, speed: 99 games/s
#games: 1966080, #errors: 265, elapsed: 19801247 ms, 5:30:01, speed: 99 games/s
#games: 2031616, #errors: 313, elapsed: 20370950 ms, 5:39:30, speed: 99 games/s
#games: 2097152, #errors: 353, elapsed: 21042069 ms, 5:50:42, speed: 99 games/s
#games: 2162688, #errors: 377, elapsed: 21716208 ms, 6:01:56, speed: 99 games/s
#games: 2228224, #errors: 441, elapsed: 22392387 ms, 6:13:12, speed: 99 games/s
#games: 2293760, #errors: 472, elapsed: 23069720 ms, 6:24:29, speed: 99 games/s
#games: 2359296, #errors: 508, elapsed: 23746480 ms, 6:35:46, speed: 99 games/s
#games: 2424832, #errors: 591, elapsed: 24421036 ms, 6:47:01, speed: 99 games/s
#games: 2490368, #errors: 613, elapsed: 25095505 ms, 6:58:15, speed: 99 games/s
#games: 2555904, #errors: 663, elapsed: 25775527 ms, 7:09:35, speed: 99 games/s
#games: 2621440, #errors: 693, elapsed: 26450804 ms, 7:20:50, speed: 99 games/s
#games: 2686976, #errors: 714, elapsed: 27130592 ms, 7:32:10, speed: 99 games/s
#games: 2752512, #errors: 732, elapsed: 27808160 ms, 7:43:28, speed: 98 games/s
#games: 2818048, #errors: 761, elapsed: 28484349 ms, 7:54:44, speed: 98 games/s
#games: 2883584, #errors: 789, elapsed: 29162518 ms, 8:06:02, speed: 98 games/s
#games: 2949120, #errors: 826, elapsed: 29846404 ms, 8:17:26, speed: 98 games/s
#games: 3014656, #errors: 853, elapsed: 30523011 ms, 8:28:43, speed: 98 games/s
#games: 3080192, #errors: 867, elapsed: 31209665 ms, 8:40:09, speed: 98 games/s
#games: 3145728, #errors: 880, elapsed: 31949305 ms, 8:52:29, speed: 98 games/s
#games: 3211264, #errors: 1062, elapsed: 32631659 ms, 9:03:51, speed: 98 games/s
#games: 3276800, #errors: 1079, elapsed: 33313481 ms, 9:15:13, speed: 98 games/s
#games: 3342336, #errors: 1094, elapsed: 33994066 ms, 9:26:34, speed: 98 games/s
#games: 3407872, #errors: 1192, elapsed: 34673492 ms, 9:37:53, speed: 98 games/s
#games: 3473408, #errors: 1207, elapsed: 35356877 ms, 9:49:16, speed: 98 games/s
#games: 3538944, #errors: 1215, elapsed: 36037082 ms, 10:00:37, speed: 98 games/s
#games: 3604480, #errors: 1234, elapsed: 36719751 ms, 10:11:59, speed: 98 games/s
#games: 3670016, #errors: 1249, elapsed: 37401816 ms, 10:23:21, speed: 98 games/s
#games: 3735552, #errors: 1265, elapsed: 38078877 ms, 10:34:38, speed: 98 games/s
#games: 3801088, #errors: 1288, elapsed: 38754382 ms, 10:45:54, speed: 98 games/s
#games: 3866624, #errors: 1309, elapsed: 39432327 ms, 10:57:12, speed: 98 games/s
#games: 3932160, #errors: 1324, elapsed: 40108755 ms, 11:08:28, speed: 98 games/s
#games: 3997696, #errors: 1336, elapsed: 40785585 ms, 11:19:45, speed: 98 games/s
#games: 4063232, #errors: 1350, elapsed: 41474393 ms, 11:31:14, speed: 97 games/s
#games: 4128768, #errors: 1360, elapsed: 42152638 ms, 11:42:32, speed: 97 games/s
#games: 4194304, #errors: 1363, elapsed: 42829281 ms, 11:53:49, speed: 97 games/s
#games: 4259840, #errors: 1383, elapsed: 43504469 ms, 12:05:04, speed: 97 games/s
#games: 4325376, #errors: 1390, elapsed: 44182156 ms, 12:16:22, speed: 97 games/s
#games: 4390912, #errors: 1567, elapsed: 44852579 ms, 12:27:32, speed: 97 games/s
#games: 4456448, #errors: 1582, elapsed: 45524093 ms, 12:38:44, speed: 97 games/s
#games: 4521984, #errors: 1597, elapsed: 46194895 ms, 12:49:54, speed: 97 games/s
#games: 4587520, #errors: 1622, elapsed: 46865679 ms, 13:01:05, speed: 97 games/s
#games: 4653056, #errors: 1651, elapsed: 47538553 ms, 13:12:18, speed: 97 games/s
#games: 4718592, #errors: 1883, elapsed: 48208785 ms, 13:23:28, speed: 97 games/s
#games: 4784128, #errors: 1899, elapsed: 48882032 ms, 13:34:42, speed: 97 games/s
#games: 4849664, #errors: 1913, elapsed: 49560679 ms, 13:46:00, speed: 97 games/s
#games: 4915200, #errors: 1938, elapsed: 50251596 ms, 13:57:31, speed: 97 games/s
#games: 4980736, #errors: 1956, elapsed: 50929071 ms, 14:08:49, speed: 97 games/s
#games: 5046272, #errors: 1965, elapsed: 51609620 ms, 14:20:09, speed: 97 games/s
#games: 5111808, #errors: 2160, elapsed: 52254743 ms, 14:30:54, speed: 97 games/s
#games: 5177344, #errors: 2182, elapsed: 52933593 ms, 14:42:13, speed: 97 games/s
#games: 5242880, #errors: 2194, elapsed: 53611943 ms, 14:53:31, speed: 97 games/s
#games: 5308416, #errors: 2213, elapsed: 54284927 ms, 15:04:44, speed: 97 games/s
#games: 5373952, #errors: 2240, elapsed: 54960221 ms, 15:16:00, speed: 97 games/s
#games: 5439488, #errors: 2254, elapsed: 55641717 ms, 15:27:21, speed: 97 games/s
#games: 5505024, #errors: 2274, elapsed: 56333271 ms, 15:38:53, speed: 97 games/s
#games: 5570560, #errors: 2278, elapsed: 57008077 ms, 15:50:08, speed: 97 games/s
#games: 5636096, #errors: 2288, elapsed: 57697688 ms, 16:01:37, speed: 97 games/s
#games: 5701632, #errors: 2310, elapsed: 58377538 ms, 16:12:57, speed: 97 games/s
#games: 5767168, #errors: 2317, elapsed: 59081800 ms, 16:24:41, speed: 97 games/s
#games: 5832704, #errors: 2333, elapsed: 59756076 ms, 16:35:56, speed: 97 games/s
#games: 5898240, #errors: 2340, elapsed: 60431986 ms, 16:47:11, speed: 97 games/s
#games: 5963776, #errors: 2353, elapsed: 61101629 ms, 16:58:21, speed: 97 games/s
#games: 6029312, #errors: 2681, elapsed: 61770305 ms, 17:09:30, speed: 97 games/s
#games: 6094848, #errors: 2693, elapsed: 62444472 ms, 17:20:44, speed: 97 games/s
#games: 6160384, #errors: 2702, elapsed: 63121631 ms, 17:32:01, speed: 97 games/s
#games: 6225920, #errors: 2708, elapsed: 63799645 ms, 17:43:19, speed: 97 games/s
#games: 6291456, #errors: 2720, elapsed: 64474163 ms, 17:54:34, speed: 97 games/s
#games: 6356992, #errors: 2733, elapsed: 65148550 ms, 18:05:48, speed: 97 games/s
#games: 6422528, #errors: 2741, elapsed: 65822507 ms, 18:17:02, speed: 97 games/s
#games: 6471823, #errors: 2750, elapsed: 66333833 ms, 18:25:33, speed: 97 games/s
Completed!
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Open Chess Game Database Standard

Post by dangi12012 »

Dann Corbit wrote: Sat Nov 06, 2021 10:30 pm I had to make some changes to get the code to compile on windows, so I include the code I used in the archive.
I took the Lichess database files and filtered for games with both players at or above 2500 Elo.
Thank you, that looks very cool. One thing to keep in mind is that SQLite does not do bulk operations by default. You can wrap all inserts in a few bulk transactions and reuse the bound parameters too.

That way inserts would be ~50x faster. I have a similar tool (which converts PGN to INSERT INTO statements) that works at 200k games/s on the Lichess db and takes about 120 seconds per 80 GB file.
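As an illustration of that approach, a hedged sketch with the SQLite C API (not the actual tool; the table and column names are placeholders): one transaction around the loop, one prepared statement reused for every row.

Code: Select all

#include <sqlite3.h>
#include <string>
#include <vector>

// Sketch: bulk insert inside one transaction, reusing a single prepared statement.
bool bulkInsertGames(sqlite3* db, const std::vector<std::string>& moveTexts)
{
    sqlite3_exec(db, "BEGIN TRANSACTION", nullptr, nullptr, nullptr);

    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, "INSERT INTO game (moves) VALUES (?1)", -1, &stmt, nullptr);

    for (const auto& moves : moveTexts) {
        sqlite3_bind_text(stmt, 1, moves.c_str(), -1, SQLITE_TRANSIENT);
        sqlite3_step(stmt);           // run the insert
        sqlite3_reset(stmt);          // reuse the same statement for the next row
        sqlite3_clear_bindings(stmt);
    }

    sqlite3_finalize(stmt);
    return sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr) == SQLITE_OK;
}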

Could you please, if at all possible, also include the EVAL tag for each move? That would be great!

Thank you for sharing. I recommend also using this tool:
https://sqlitestudio.pl/
I also mirrored this, compressed it into a smaller format, and will keep it up longer:
https://1drv.ms/u/s!AoDIyNlDG2QTg8YR_DX ... Q?e=rAqNr5

There seems to be a problem with the date field:
select date from game yields ??????
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
User avatar
phhnguyen
Posts: 1524
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Open Chess Game Database Standard

Post by phhnguyen »

Thanks all for the feedback.

I have started working on the speed of the converter. Will make a detailed report when finished.
https://banksiagui.com
The chess GUI with the most features, based on the open-source Banksia - the chess tournament manager
User avatar
phhnguyen
Posts: 1524
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Open Chess Game Database Standard

Post by phhnguyen »

I have started the next lap of improving the project. At a quick glance, the focus should be the conversion speed.

Compared with SCID4 (stats by Fulvio), the speed of converting from PGN into SQL databases is about 247 times slower. I have to wait nearly 4 hours to convert 3.45 million games, at 259 games/s. @Dann Corbit spent 18.5 hours to convert about 6.5 million games (filtered from a raw total of 2.7 billion) at under 100 games/s. Too long, and the speed is far too low! It may make someone give up before even trying, so I decided to improve it first.

I didn't yet know what made the converter that slow, nor the limits of some of the libraries I used. I don't think I can reach the speed of SCID4, at least in this lap, since that is a binary format and has been optimized for a long time; I use the speed of SCID4 as the upper bound anyway. For this attempt, I think it is reasonable to target a speed about 3-5 times faster.

The converter has two main parts: the first is the PGN parser and the second is the database inserter. I will profile them independently to find the hot spots.

Again, I test with the MillionBase 3.45 database (mb-3.45.pgn, downloaded from the Rebel website https://rebel13.nl/download/data.html). The database contains over 3.45 million games, 284 thousand players, and 31 thousand events. Last time it took 3h42’ to convert on my machine (quad-core i7, 3.6 GHz, 16 GB), using 1 thread only.

1. PGN parser
The parser reads text from a PGN file, splits it into games, finds tags and moves, converts and checks the moves one by one, then stores each game in a record. The code is C++, “borrowed” from the Banksia project (open source), so it has been tested for a while, has a low chance of bugs, and its speed should not be too bad.

This part is the prime suspect for the slowness. It frequently allocates many small objects (strings). When parsing a game, it parses move by move and verifies that each move is legal. If there are comments, it parses them for computing info (such as depth/time/score/nodes). For the current SQL structure, game bodies (move lists) are just strings (stored in the field "moves"), so at the end this part has to convert the move list (in its internal binary format) back into strings. That sounds redundant, but I want to make sure all game moves are valid and to be able to use that computation for later tasks.

I think this time I can optimize some steps to speed it up. I know many PGN parsers parse only the tags and keep pointers to the game bodies without parsing them; whenever the user wants to view a game, they use those pointers to read and parse the game body on the fly.

To profile this part, I turn off the second part. The parser just works without calling the inserter. Games are read and parsed without writing out the database.

1.1 Parse games with parsing/checking of legal moves and comments (convert move strings from the PGN into internal move lists, then convert those lists back into move strings)

Code: Select all

#games: 3456762, elapsed: 1920397 ms, 32:00, speed: 1800 games/s
1.2 Parse games without parsing moves or comments (keep move strings from the PGN without any converting/checking)

Code: Select all

#games: 3456762, elapsed: 31361 ms, 00:31, speed: 110224 games/s

(1.2) is really fast, taking only 31 seconds; it is comparable with SCID4. (1.1) is much slower, 32 minutes. However, that is still only a small part of the total 3h42’.

The parser still has room for improvement, but from the above results it is not the main reason for the low speed, so I am not rushing to improve it now.

2. Inserter

This part converts a game record into some SQL insert statements and runs them to insert the game into the SQL database. For example, it inserts event names into the event table, player names into the player table, and the game record itself, with the event and player names replaced by their IDs, into the game table.
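As a rough illustration of such a three-table layout, a hedged sketch follows; the column names are illustrative only and the project's real schema may differ.

Code: Select all

#include <sqlite3.h>

// Sketch of a possible event/player/game layout (illustrative column names).
static const char* kSchemaSketch =
    "CREATE TABLE IF NOT EXISTS event  (id INTEGER PRIMARY KEY, name TEXT UNIQUE);"
    "CREATE TABLE IF NOT EXISTS player (id INTEGER PRIMARY KEY, name TEXT UNIQUE);"
    "CREATE TABLE IF NOT EXISTS game   ("
    "  id INTEGER PRIMARY KEY,"
    "  event_id INTEGER REFERENCES event(id),"
    "  white_id INTEGER REFERENCES player(id),"
    "  black_id INTEGER REFERENCES player(id),"
    "  result TEXT, date TEXT,"
    "  moves TEXT);";  // the move list is kept as a plain string

bool createSchemaSketch(sqlite3* db)
{
    return sqlite3_exec(db, kSchemaSketch, nullptr, nullptr, nullptr) == SQLITE_OK;
}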

Since part 1 turned out not to take much time, this part now becomes the main suspect.

I profile this part without using the first one. To do that, I use fake game data: several game bodies (move lists) are taken from real ones, kept in an array, and picked randomly. Names (players, events) are just prefixes with random numbers in specific ranges (to make sure there are no more than 284 thousand player names and 31 thousand event names - similar in numbers to the real ones).

Below is the profile of this part:

Code: Select all

#games: 3450777, elapsed: 11468046 ms, 3:11:08, speed: 300 games/s 
Bingo! This part takes almost exactly the missing time: 3h11’. So slow! The hot spot is found. I was surprised, since some reports show that the insert speed of SQLite should be very high - say a million inserts per minute - so 3.45 million games should take minutes, not hours. I must have been doing something wrong!

Anyway, it was time to continue and see what I could do.

Reading some tips here and there and doing some homework, I found the post below about speeding up insert statements in SQLite:

https://stackoverflow.com/questions/171 ... -of-sqlite

Following the post, I decided to try three main things: transactions, prepared statements, and in-memory databases.

2.1 Fake input data, transaction
Last time I had already tried using a transaction for each insert statement. I did not see any speed improvement, so in the end I commented out that code. This time I wrap the whole main loop inside a single transaction, as suggested in the above link. I continue using fake data to test:

Code: Select all

#games: 3450777, elapsed: 100899 ms, 01:40, speed: 34200 games/s
Oh, what? Only 1:40 (nearly 2 minutes)! 114 times faster! Much more than I expected! How dumb I was with the first implementation!

A bit overjoyed, I went on to complete the rest.


2.2 Fake input data, a transaction with prepared statements (using parameter binding):

Code: Select all

#games: 3450777, elapsed: 36857 ms, 00:36, speed: 93626 games/s
Another improvement of about 2.7 times, for a total of 311 times faster than the original.


OK, enough with the fake data. Now it is time to change the converter to use the new methods and test with the real data (MillionBase). Below are the results with the new implementations:

2.3 Real input data, transaction with prepared statements, parse and verify all moves of games

Code: Select all

#games: 3456762, #errors: 5985, elapsed: 2256903 ms, 37:36, speed: 1531 games/s
This is 6 times faster.

2.4 Real input data, transaction with prepared statements, without parsing or verifying game moves

Code: Select all

#games: 3456762, #errors: 372, elapsed: 102455 ms, 01:42, speed: 33739 games/s
This is 131 times faster.


2.5 Real input data, transaction with prepared statements with parsing/verifying game moves, in-memory database

Code: Select all

#games: 3456762, #errors: 5985, elapsed: 1905023 ms, 31:45, speed: 1814 games/s
7 times faster.

2.6 Real input data, a transaction with prepared statements without parsing/verifying game moves, in-memory database

Code: Select all

#games: 3456762, #errors: 372, elapsed: 57998 ms, 00:57, speed: 59601 games/s
This is 247 times faster.


(2.3) is 6 times faster than the old implementation, while (2.4) is 131 times faster. (2.4) is very close to SCID4.

(2.5) and (2.6) are extremely fast. The database is created and stored in memory without writing to files. In 2.5, almost all the time was spent parsing and verifying moves. Without that task, 2.6 needs under a minute to convert all 3.45 million games. That speed is fully comparable with SCID4. Funnily enough, 2.6 gave me the exact SCID4 figure: 54s, 247 times faster. Of course, my computer differs from Fulvio's, so if we ran them all on the same computer the numbers would not be identical - but they should be close.

I can imagine how in-memory databases could be used: whenever an app works with a PGN database, it may convert it into an in-memory SQLite database. The conversion is very fast, comparable with popular PGN readers/viewers. Developers can save a lot of time: instead of developing new internal database structures and writing a lot of functions just for those structures, they can let SQLite manage the database with its tons of features and functions. Nothing beats SQL's abilities for querying, searching, indexing...
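A hedged sketch of that workflow (an assumption, not the project's current code): build the database against ":memory:" and, if a file copy is wanted afterwards, flush it to disk with SQLite's online backup API.

Code: Select all

#include <sqlite3.h>

// Sketch: copy a finished in-memory database into a file database.
bool saveMemoryDbToFile(sqlite3* memDb, const char* path)
{
    sqlite3* fileDb = nullptr;
    if (sqlite3_open(path, &fileDb) != SQLITE_OK) return false;

    // Copy every page of the in-memory "main" database into the file database.
    sqlite3_backup* backup = sqlite3_backup_init(fileDb, "main", memDb, "main");
    bool ok = false;
    if (backup) {
        sqlite3_backup_step(backup, -1);                 // -1 = copy all remaining pages
        ok = sqlite3_backup_finish(backup) == SQLITE_OK;
    }
    sqlite3_close(fileDb);
    return ok;
}

// usage sketch:
//   sqlite3* memDb; sqlite3_open(":memory:", &memDb);
//   ... run the converter against memDb ...
//   saveMemoryDbToFile(memDb, "out.db3");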

If you look at the details, the number of errors with parsing/verifying moves turned off is much smaller than with it turned on. Simply put, many games contain illegal moves.

All the above tests run with 1 thread only, and I see we can still improve, say by using more threads, loading data in blocks, or using C strings instead of C++ strings (to avoid reallocating)...

The new code has already been pushed to GitHub (https://github.com/nguyenpham/ocgdb).

All is looking good now :D
https://banksiagui.com
The chess GUI with the most features, based on the open-source Banksia - the chess tournament manager
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Open Chess Game Database Standard

Post by dangi12012 »

phhnguyen wrote: Mon Nov 08, 2021 5:36 am I have started the next lap of improving the project. At a quick glance, the focus should be the conversion speed. [...]
Very impressive - three more things to speed it up by at least 10x:
1)
Use memory-mapped files to access the data (the operating system will map page faults to the file and you can work with the file as if it were in memory).
Use the 3rd updated implementation here:
https://stackoverflow.com/questions/179 ... ading-in-c
The trick is that memchr is extremely well optimized and the OS abstracts the file away completely.

2)
You can easily disable all journaling etc. and go for bulk inserts with this:
#define DATABASE ":memory:" -> then bulk-copy to the disk DB every 100 MB or so

3)
Define the database without indices initially - create them only after the inserts.

Code: Select all

sqlite3_exec(db, "BEGIN TRANSACTION", NULL, NULL, &sErrMsg);
//Insert loop using sqlite3_bind_text
sqlite3_exec(db, "END TRANSACTION", NULL, NULL, &sErrMsg);
sqlite3_exec(db, "CREATE  INDEX 'TTC_Stop_Index' ON 'TTC' ('Stop')", NULL, NULL, &sErrMsg);

You know the best part about SQLite? You get almost the same speed with all backends. So if someone uses this DB from Java - it's exactly as fast. C#? Same. Python? Same.
I am very impressed, keep up the good work - it helps the community a lot!
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Dann Corbit
Posts: 12777
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Open Chess Game Database Standard

Post by Dann Corbit »

The result field seems to be a single character in width.

200 million rows per minute, using Rust:
https://github.com/avinassh/fast-sqlite3-inserts
blog:
https://avi.im/blag/2021/fast-sqlite-inserts/
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
User avatar
phhnguyen
Posts: 1524
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Open Chess Game Database Standard

Post by phhnguyen »

This time I focus on the speed of reading text files (PGN). After removing all processing functions, the reading code becomes simple, straightforward C++, as below:

Code: Select all

    std::ifstream inFile(path);   // path is the PGN file
    std::string str;
    int64_t lineCnt = 0;
    while (getline(inFile, str)) {
        lineCnt++;
    }
The test PGN file is MillionBase 3.45 (mb-3.45.pgn), 2.4 GB. Reading it all, line by line, the above code takes over 12 seconds:

Code: Select all

Method: 0, C++ simple, #lineCnt: 68342530, elapsed: 12369 ms, speed: 5525307 lines/s
I have tried 6 variants: a simple C style (1), memory-mapped files (2), reading the whole file into one (huge) memory block (3), the same as 3 but C style (4), reading the file repeatedly into a small block of 4 MB (5), and the same as 5 but C style (6).

Below is the test code:

Code: Select all

void Builder::testReadTextFile(const std::string& path, int method)
{
    startTime = getNow();

    int64_t lineCnt = 0;
    
    std::string methodName;
    
    switch (method) {
        case 0:
        {
            methodName = "C++ simple";
            std::string str;
            std::ifstream inFile(path);
            while (getline(inFile, str)) {
                lineCnt++;
            }
            inFile.close();
            break;
        }
        case 1:
        {
            methodName = "C way";
            std::ifstream inFile(path);
            const int MAX_LENGTH = 1024 * 16;
            char* line = new char[MAX_LENGTH];
            while (inFile.getline(line, MAX_LENGTH) && strlen(line) > 0) {
                lineCnt++;
            }
            delete[] line;
            inFile.close();
            break;
        }
        case 2:
        {
            methodName = "mmap (memory mapped files)";
            size_t length;
            auto f = map_file(path.c_str(), length);
            auto l = f + length;

            while (f && f != l) {
                if ((f = static_cast<const char*>(memchr(f, '\n', l - f)))) {
                    lineCnt++;
                    f++;
                }
            }
            break;
        }

        case 3:
        {
            methodName = "all to one block, C++ way";
            std::ifstream file(path, std::ios::binary | std::ios::ate);
            std::streamsize size = file.tellg();
            file.seekg(0, std::ios::beg);

            std::vector<char> buffer(size);
            if (file.read(buffer.data(), size)) {
                for(auto && ch : buffer) {
                    if (ch == '\n') lineCnt++;
                }
            }
            break;
        }
            
        case 4:
        {
            methodName = "all to one block, C way";
            FILE *stream = fopen(path.c_str(), "r");
            assert(stream != NULL);
            fseek(stream, 0, SEEK_END);
            long stream_size = ftell(stream);
            fseek(stream, 0, SEEK_SET);
            char *buffer = (char*)malloc(stream_size);
            fread(buffer, stream_size, 1, stream);
            assert(ferror(stream) == 0);
            fclose(stream);
            
            for(long i = 0; i < stream_size; i++) {
                if (buffer[i] == '\n') lineCnt++;
            }

            free((void *)buffer);
            break;
        }

        case 5:
        {
            methodName = "blocks of 4 MB";
            const size_t blockSz = 4 * 1024 * 1024;

            std::ifstream file(path, std::ios::binary | std::ios::ate);
            size_t size = file.tellg();
            file.seekg(0, std::ios::beg);

            size_t sz = 0;
            
            std::vector<char> buffer(blockSz);
            while (sz < size) {
                auto k = std::min(blockSz, size - sz);
                if (k == 0) {
                    break;
                }
                if (file.read(buffer.data(), k)) {
                    for(size_t i = 0; i < k; i++) {
                        if (buffer[i] == '\n') lineCnt++;
                    }
                }
                
                sz += k;
            }
            break;
        }
        case 6:
        {
            methodName = "blocks of 4 MB, C way";
            const size_t blockSz = 4 * 1024 * 1024;
            char *buffer = (char*)malloc(blockSz + 16);

            FILE *stream = fopen(path.c_str(), "r");
            assert(stream != NULL);
            fseek(stream, 0, SEEK_END);
            size_t size = ftell(stream);
            fseek(stream, 0, SEEK_SET);

            
            size_t sz = 0;
            
            while (sz < size) {
                auto k = std::min(blockSz, size - sz);
                if (k == 0) {
                    break;
                }
                if (fread(buffer, k, 1, stream)) {
                    for(size_t i = 0; i < k; i++) {
                        if (buffer[i] == '\n') lineCnt++;
                    }
                }
                sz += k;
            }

            free(buffer);
            break;
        }

        default:
            break;
    }
    
    int64_t elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(getNow() - startTime).count() + 1;

    std::cout << "Method: " << method << ", " << methodName
              << ", #lineCnt: " << lineCnt
              << ", elapsed: " << elapsed << " ms"
              << ", speed: " << lineCnt * 1000 / elapsed << " lines/s"
              << std::endl;

}
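The map_file helper used in method 2 is not shown above; here is a minimal POSIX-only sketch of such a helper (my assumption - the real one is presumably based on the Stack Overflow answer linked earlier in the thread and may differ; a production version also needs error handling and a matching munmap).

Code: Select all

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Sketch: map a whole file read-only and return a pointer to its bytes.
const char* map_file(const char* path, size_t& length)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) { length = 0; return nullptr; }

    struct stat sb;
    if (fstat(fd, &sb) < 0) { close(fd); length = 0; return nullptr; }
    length = static_cast<size_t>(sb.st_size);

    // Pages are loaded lazily on first access; the OS handles the IO.
    void* addr = mmap(nullptr, length, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);   // the mapping stays valid after closing the descriptor
    return addr == MAP_FAILED ? nullptr : static_cast<const char*>(addr);
}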
And here are the results.

Code: Select all

Method: 0, C++ simple, #lineCnt: 68342530, elapsed: 12369 ms, speed: 5525307 lines/s
Method: 1, C way, #lineCnt: 68342530, elapsed: 8633 ms, speed: 7916428 lines/s
Method: 2, mmap (memory mapped files), #lineCnt: 68342530, elapsed: 1660 ms, speed: 41170198 lines/s
Method: 3, all to one block, C++ way, #lineCnt: 68342530, elapsed: 2716 ms, speed: 25162934 lines/s
Method: 4, all to one block, C way, #lineCnt: 68342530, elapsed: 1819 ms, speed: 37571484 lines/s
Method: 5, blocks of 4 MB, C++ way, #lineCnt: 68342530, elapsed: 1462 ms, speed: 46745916 lines/s
Method: 6, blocks of 4 MB, C way, #lineCnt: 68342530, elapsed: 866 ms, speed: 78917471 lines/s
The memory-mapped file (method 2) is among the fastest - about 7 times faster than the original (method 0). That is a surprise, since I know sequential reading is not the strong point of this method. The implementation is still simple and there is potential to improve, but I won't go further because some improvements may depend on the OS and I prefer simpler code.

The fastest way is reading in small blocks of 4 MB, using C-style memory allocation (method 6, the last one): 14 times faster. It can read the whole 2.4 GB file in under 1 s. It is simple, easy to understand and maintain, and easy to improve with multiple threads. Perhaps I will adopt it as the main method later.

The code has been pushed to the project's GitHub, branch “testreadingtextfile”.
https://banksiagui.com
The chess GUI with the most features, based on the open-source Banksia - the chess tournament manager
Sopel
Posts: 391
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Open Chess Game Database Standard

Post by Sopel »

You want to read the PGNs asynchronously. So either use a dedicated IO thread pool or mmap.
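One way to overlap the IO with the processing is double buffering (a minimal sketch under my own assumptions, not Sopel's or the project's code): a background task reads the next block while the current block is being processed.

Code: Select all

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <future>
#include <string>
#include <vector>

// Sketch: count newlines while the next 4 MB block is read asynchronously.
int64_t countLinesOverlapped(const std::string& path)
{
    const size_t blockSz = 4 * 1024 * 1024;
    FILE* f = std::fopen(path.c_str(), "rb");
    if (!f) return -1;

    auto readBlock = [f, blockSz]() {
        std::vector<char> buf(blockSz);
        size_t n = std::fread(buf.data(), 1, blockSz, f);
        buf.resize(n);
        return buf;
    };

    int64_t lineCnt = 0;
    auto pending = std::async(std::launch::async, readBlock);    // start first read
    for (;;) {
        std::vector<char> block = pending.get();                 // wait for the read
        if (block.empty()) break;
        pending = std::async(std::launch::async, readBlock);     // read next block...
        lineCnt += std::count(block.begin(), block.end(), '\n'); // ...while processing this one
    }
    std::fclose(f);
    return lineCnt;
}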
dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.

Maybe you copied your stockfish commits from someone else too?
I will look into that.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Open Chess Game Database Standard

Post by dangi12012 »

Sopel wrote: Sun Nov 14, 2021 10:23 pm You want to read the PGNs asynchronously. So either use a dedicated IO thread pool or mmap.
A thread pool for IO? Asynchronous IO runs on a single thread.
If you look at Sopel's history - he is a forum troll, so better not to engage with him.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer