dangi12012 wrote: ↑Mon Nov 01, 2021 2:00 pm
PGNs may contain the evaluation and it would be great to have a lookup for every move in the db possible.
For example lichess has something like this:
1. e4 { [%eval 0.24] [%clk 0:03:00] } 1... e5 { [%eval 0.12] [%clk 0:03:00] } 2. Nf3 { [%eval 0.17] [%clk 0:02:59] } 2..
This can be used to select all games of a certain player and find the openings where he consistently plays a slightly mistaken move.
On one hand, I agree we can use those fields (evals, scores) for selecting moves as an opening book. A plain game database is not good enough for that task, so we need the extra fields; more data could help solve more problems.
However, there are questions about consistency, efficiency, and whether it is worth doing. Clearly, with other databases whose games are not from Lichess or not from human players, we don't have that info for each move. Speed is another problem: to pick a good book move you typically have to run several queries, which is slow, inefficient, and may take several seconds in total on a huge database - too long for that task. Relying only on computed scores for an opening book is another problem, since engines may be good at tactics but not at strategy.
The simple question is: why not just convert the databases into "normal" opening books, using traditional methods?
dangi12012 wrote: ↑Mon Nov 01, 2021 2:00 pm
Do you plan to write a wrapper as well?
So that we have a C++ header that we can use and get some already implemented methods?
For example "Chess_Database.h" - which already countains a Load(char* path) and Count() Methods.
I want to add that SQLite's slower count(*) is deliberate and can be done fast by:
I don't know if there is a shared language that can emit code for Python, C++, C# and Java with SQLite backends. Then we would only need to write the simple query code once - it would generate code for all languages and we could have a central GitHub repo for that.
That approach may not be useful, for a few reasons:
- It still takes some time (even if faster than COUNT) because of the LIMIT command
- It assumes the numbers in the ID field are contiguous. That may not hold if some records have been removed during use (see the sketch below)
- Counting all records in a table (so far we have only a few tables) covers only a few of the queries we need. We still have many COUNT questions, involving anywhere from one to all tables: for example, count all games of Carlsen in 2021, count all games of event ABC… To answer those, SQL engines usually scan the whole database, which takes time when the number of records is large
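For reference, the kind of trick being discussed above presumably looks something like the sketch below (my assumption, since the original snippet is not quoted here; the games table and its id column are illustrative names): read the largest id instead of scanning every row, which matches COUNT(*) only while the ids stay contiguous.

#include <sqlite3.h>
#include <cstdint>

// Hypothetical fast "count": fetch only the largest id via the primary-key index.
int64_t approximateGameCount(sqlite3* db) {
    sqlite3_stmt* st = nullptr;
    // "SELECT MAX(rowid) FROM games" is an equivalent, specially optimized form.
    sqlite3_prepare_v2(db, "SELECT id FROM games ORDER BY id DESC LIMIT 1", -1, &st, nullptr);
    int64_t n = (sqlite3_step(st) == SQLITE_ROW) ? sqlite3_column_int64(st, 0) : 0;
    sqlite3_finalize(st);
    return n;  // equals COUNT(*) only if no rows have ever been deleted
}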
https://banksiagui.com
The most feature-rich chess GUI, based on the open-source Banksia - the chess tournament manager
I had to make some changes to get the code to compile on windows, so I include the code I used in the archive.
I took the Lichess database files and filtered for games with both players at or above 2500 Elo.
The result was 6,471,823 games (out of the raw total of 2,728,370,048 games). About 18.5 hours to build the book.
I am not coming back. I saw the project on github but got no feedback there, so I thought I would contribute because it is an interesting project.
I won't leave the archives up forever, so if you want them, come and get it.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
Dann Corbit wrote: ↑Sat Nov 06, 2021 10:30 pm
I had to make some changes to get the code to compile on windows, so I include the code I used in the archive.
I took the Lichess database files and filtered for games with both players at or above 2500 Elo.
Thank you, that looks very cool. One thing to keep in mind is that SQLite does not do bulk operations by default. You can wrap all inserts in a few bulk transactions and reuse the bound parameters too.
That way inserts would be ~50x faster. I have a similar tool (which converts PGN into INSERT INTO statements) that runs at 200k games/s on the Lichess db - about 120 seconds per 80 GB file or so.
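A minimal sketch of that suggestion (the table and column names are illustrative, and this is not dangi12012's actual tool): prepare the INSERT once, bind and reuse it for every game, and wrap the whole loop in a single transaction so SQLite does not sync the journal after every row.

#include <sqlite3.h>
#include <string>
#include <vector>

void bulkInsertGames(sqlite3* db, const std::vector<std::string>& moveTexts) {
    char* err = nullptr;
    sqlite3_exec(db, "BEGIN TRANSACTION", nullptr, nullptr, &err);

    sqlite3_stmt* st = nullptr;
    sqlite3_prepare_v2(db, "INSERT INTO games (moves) VALUES (?1)", -1, &st, nullptr);

    for (const auto& moves : moveTexts) {
        sqlite3_bind_text(st, 1, moves.c_str(), -1, SQLITE_STATIC);
        sqlite3_step(st);        // run the insert
        sqlite3_reset(st);       // reuse the same prepared statement
        sqlite3_clear_bindings(st);
    }
    sqlite3_finalize(st);

    sqlite3_exec(db, "COMMIT", nullptr, nullptr, &err);
}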
Could you please, if possible, also include the EVAL tag for each move? That would be so very great!
I have started the next lap to improve the project. At a quick glance, the focus should be the speed of converting.
Compared with SCID4 (stats by Fulvio), converting from PGN into SQL databases is about 247 times slower. I had to wait nearly 4 hours to convert 3.45 million games, a speed of 259 games/s. @Dann Corbit spent 18.5 hours converting his filtered Lichess set (about 6.5 million games), a speed of under 100 games/s. That is too long and the speed is too low! It may make someone give up before even trying, so I decided to improve it first.
I don't know yet what makes the converter that slow, nor the limits of some libraries I use. I don't think I can reach the speed of SCID4, at least in this lap, since that is a binary format which has been optimized for a long time; I use its speed as the upper bound anyway. For this attempt, I think a reasonable target is about 3-5 times faster.
The converter has two main parts. The first one is the PGN parser and the second one is the database inserter. I will profile them independently to find out hot spots.
Again, I test with the database MillionBase 3.45 (mb-3.45.pgn, downloaded from Rebel website https://rebel13.nl/download/data.html). The database contains over 3.45 million games, 284 thousand players, 31 thousand events. Last time it took 3h42’ to convert on my machine (Quad-Core i7, 3.6 GHz, 16 GB), using 1 thread only.
1. PGN parser
The parser reads text from a PGN file, splits it into games, finds tags and moves, converts and checks them move by move, then stores each game in a record. The code is C++, "borrowed" from the open-source Banksia project, so it has been tested for a while, has a low chance of bugs, and its speed should not be too bad.
This part is the most suspicious cause of the slowness. It frequently allocates many small objects (strings). When parsing a game, it goes move by move and verifies that each move is legal. If there are comments, it parses them for computing info (such as depth, time, score, nodes). For the current SQL structure, game bodies (move lists) are just strings (stored in the field "moves"), so at the end this part has to convert the move list (in its internal binary format) back into strings. That sounds redundant, but I want to make sure all game moves are valid and to reuse that computation for later tasks.
I think this time I can optimize some steps to speed things up. I know many PGN parsers parse only the tags and keep pointers to the game bodies without parsing them; whenever the user wants to view a game, those pointers are used to read and parse that game body on the spot.
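As a sketch of that lazy approach (illustrative only, not the Banksia parser): parse the tag pairs eagerly and remember where the move text sits in the loaded buffer, deferring the move parsing until a game is actually viewed.

#include <cstddef>
#include <map>
#include <string>
#include <string_view>

struct GameEntry {
    std::map<std::string, std::string> tags;  // [Event "..."], [White "..."], ...
    std::string_view body;                    // unparsed move text, points into the PGN buffer
};

// 'pgn' is the whole file kept in memory; 'pos' is advanced past the parsed game.
GameEntry parseHeaderOnly(std::string_view pgn, std::size_t& pos) {
    GameEntry g;
    while (pos < pgn.size() && pgn[pos] == '[') {                    // tag-pair section
        std::size_t close = pgn.find(']', pos);
        if (close == std::string_view::npos) break;
        std::string_view tag = pgn.substr(pos + 1, close - pos - 1); // e.g. Event "Wch"
        std::size_t space = tag.find(' ');
        if (space != std::string_view::npos && tag.size() > space + 3)
            g.tags[std::string(tag.substr(0, space))] =
                std::string(tag.substr(space + 2, tag.size() - space - 3)); // strip the quotes
        pos = close + 1;
        while (pos < pgn.size() && (pgn[pos] == '\r' || pgn[pos] == '\n')) ++pos;
    }
    std::size_t bodyStart = pos;
    std::size_t next = pgn.find("\n\n[", bodyStart);                 // crude end-of-game scan
    if (next == std::string_view::npos) next = pgn.size();
    g.body = pgn.substr(bodyStart, next - bodyStart);                // parsed only when viewed
    pos = (next == pgn.size()) ? next : next + 2;                    // skip the blank line
    return g;
}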
To profile this part, I turn off the second part. The parser just works without calling the inserter. Games are read and parsed without writing out the database.
1.1. Parse games with parsing/checking for legal moves and comments (convert move strings from PGN into internal move lists, then convert those lists back into move strings)
(1.2), the same run but without parsing/checking moves, is really fast: it took only 31 seconds, comparable with SCID4. (1.1) is much slower, 32 minutes. However, that is still only a small part of the total 3h42'.
This parser still has room to improve. But from the above result, it is not the main reason for the low speed, so I am not rushing to improve it now.
2. Inserter
This part takes a game record, converts it into SQL insert statements, and runs them to insert the game into the SQL database. For example, it inserts event names into the event table and player names into the player table, then inserts the game record into the game table with the event and player names replaced by their IDs.
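A rough sketch of that name-to-ID step (the table and column names here are assumptions, not necessarily the project's real schema): insert the name if it is new, then look up its id so the game row can reference it.

#include <sqlite3.h>
#include <cstdint>
#include <string>

// e.g. insertSql = "INSERT OR IGNORE INTO players (name) VALUES (?1)"
//      selectSql = "SELECT id FROM players WHERE name = ?1"
int64_t getOrInsertName(sqlite3* db, const char* insertSql, const char* selectSql,
                        const std::string& name) {
    sqlite3_stmt* st = nullptr;

    sqlite3_prepare_v2(db, insertSql, -1, &st, nullptr);
    sqlite3_bind_text(st, 1, name.c_str(), -1, SQLITE_TRANSIENT);
    sqlite3_step(st);                        // no-op if the name already exists
    sqlite3_finalize(st);

    sqlite3_prepare_v2(db, selectSql, -1, &st, nullptr);
    sqlite3_bind_text(st, 1, name.c_str(), -1, SQLITE_TRANSIENT);
    int64_t id = (sqlite3_step(st) == SQLITE_ROW) ? sqlite3_column_int64(st, 0) : -1;
    sqlite3_finalize(st);
    return id;                               // used as the event/player id in the game row
}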
Since part 1 turned out not to take much time, this part is now the prime suspect.
I profile this part without using the first one. To do that, I use fake game data: several game bodies (move lists) are taken from real games, kept in an array, and picked at random. Names (players, events) are just prefixes plus random numbers in specific ranges (to make sure there are no more than 284 thousand player names and 31 thousand event names - numbers similar to the real ones).
Bingo! This part takes almost exactly the missing period: 3h11'. So slow! The hot spot is found. I was surprised, since some reports show the insert speed of SQLite should be very high, say a million inserts per minute, so 3.45 million games should take minutes, not hours. I must be doing something wrong!
Anyway, it is time to continue and see what I can do.
Reading some tips here and there and doing some homework, I found the post below about speeding up insert statements in SQLite:
Following the post, I decided to try three main things: transactions, prepared statements, and in-memory databases.
2.1 Fake input data, transaction
Last time I had already tried using a transaction for each insert statement. I did not see any improvement in speed, so in the end I commented the code out. This time I wrap the whole main loop inside one pair of transaction statements, as suggested by the above link! I continue using fake data for the test:
Another improvement of about 2.7 times, bringing the total to 311 times faster.
OK, enough with the fake data. Now it is time to change the converter to use the new method and test with the real data (MillionBase). Below are results with new implementations:
2.3 Real input data, transaction with prepared statements, parse and verify all moves of games
(2.3) is 6 times faster than the old implementation, while (2.4), the same run without parsing/verifying moves, is 131 times faster and quite close to SCID4.
(2.5) and (2.6) are extremely fast: the database is created and kept in memory without writing any files. In (2.5), almost all the time was spent parsing and verifying moves. Without doing that, (2.6) needs under a minute to convert all 3.45 million games, a speed fully comparable with SCID4. A bit funny is that (2.6) gave me the exact time reported for SCID4: 54s, 247 times faster. Of course, my computer differs from Fulvio's, so if everything were tested on the same machine the numbers would not be identical - but they should be close.
I can imagine how in-memory databases could be used. Whenever an app works with a PGN database, it may convert it into an in-memory SQLite database. The conversion is very fast, comparable with popular PGN readers/viewers. Developers can save a lot of the time otherwise spent designing new internal database structures and writing piles of functions just for those structures: just let SQLite manage the database with its tons of features and functions. Little can beat SQL's abilities for querying, searching, indexing...
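A minimal sketch of that idea (not the project's actual code): opening the database at the special ":memory:" path is all SQLite needs, and from then on every query, index or insert works exactly as on a file-based database.

#include <sqlite3.h>

sqlite3* openInMemoryDb() {
    sqlite3* db = nullptr;
    if (sqlite3_open(":memory:", &db) != SQLITE_OK) {  // nothing is ever written to disk
        sqlite3_close(db);
        return nullptr;
    }
    // The app can now create the tables, bulk-insert the converted games and
    // run any SQL it likes while the whole database lives in RAM.
    return db;
}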
If you look at the details, the number of errors reported with parsing/verifying moves turned off is much smaller than with it turned on. Simply put, many games contain illegal moves.
All the above tests run with 1 thread only, and I see we can still improve, say by using more threads, loading data in blocks, or using C strings instead of C++ strings (to avoid reallocating)...
phhnguyen wrote: ↑Mon Nov 08, 2021 5:36 am
I have started the next lap to improve the project. At a quick glance, the focus should be the speed of converting.
Very impressive - three more things to speed it up 10x at least:
1)
Use memory-mapped files to access the data (the operating system will map page faults to the file and you can work with the file as if it were in memory).
Use the 3rd updated implementation here: https://stackoverflow.com/questions/179 ... ading-in-c
The trick is that memchr is extremely well optimized and the OS abstracts the file away completely.
2)
You can easily disable all journaling etc and go for bulk inserts with this:
#define DATABASE ":memory:" -> and bulk copy to the disk DB every 100 MB or so (see the sketch after this list)
3)
Define the database without indices initially - create them only after the inserts:
sqlite3_exec(db, "BEGIN TRANSACTION", NULL, NULL, &sErrMsg);
//Insert loop using sqlite3_bind_text
sqlite3_exec(db, "END TRANSACTION", NULL, NULL, &sErrMsg);
sqlite3_exec(db, "CREATE INDEX 'TTC_Stop_Index' ON 'TTC' ('Stop')", NULL, NULL, &sErrMsg);
You know the best part about SQLite? You get almost the same speed with all language bindings. So if someone uses this DB from Java - it's exactly as fast. C#? Same. Python? Same.
I am very impressed, keep up the good work - it helps the community a lot!
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
This time I focus on the speed of reading text files (PGN). After removing all the processing functions, the reading code becomes very simple, straightforward C++, as below:
I have tried 6 improvements: simple C style (1), memory-mapped files (2), reading the whole file into one (huge) memory block (3), the same as 3 but in C style (4), reading the file in small blocks of 4 MB (5), and the same as 5 but in C style (6).
Method: 0, C++ simple, #lineCnt: 68342530, elapsed: 12369 ms, speed: 5525307 lines/s
Method: 1, C way, #lineCnt: 68342530, elapsed: 8633 ms, speed: 7916428 lines/s
Method: 2, mmap (memory mapped files), #lineCnt: 68342530, elapsed: 1660 ms, speed: 41170198 lines/s
Method: 3, all to one block, C++ way, #lineCnt: 68342530, elapsed: 2716 ms, speed: 25162934 lines/s
Method: 4, all to one block, C way, #lineCnt: 68342530, elapsed: 1819 ms, speed: 37571484 lines/s
Method: 5, blocks of 4 MB, C++ way, #lineCnt: 68342530, elapsed: 1462 ms, speed: 46745916 lines/s
Method: 6, blocks of 4 MB, C way, #lineCnt: 68342530, elapsed: 866 ms, speed: 78917471 lines/s
The memory-mapped file (method 2) is the second fastest - 7 times faster than the original (method 0). That is a surprise, since sequential reading is not supposed to be this method's strong point. The implementation is still simple and there is potential to improve, but I don't go further because some improvements may depend on the OS and I prefer simpler code.
The fastest way is reading in small blocks of 4 MB, allocating memory the C way (method 6 - the last one), 14 times faster. It can read the whole 2.4 GB file in under 1s. It is simple, easy to understand and maintain, and easy to improve with multiple threads. Perhaps I will adopt it as the main method for later use.
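For readers who don't want to dig through the branch mentioned below, method 6 as I understand it boils down to something like this sketch (not the exact test code): read fixed 4 MB chunks with plain C file I/O and count the newlines with memchr.

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int64_t countLines(const char* path) {
    const std::size_t blockSize = 4 * 1024 * 1024;        // 4 MB per read
    char* buf = (char*)std::malloc(blockSize);
    std::FILE* f = std::fopen(path, "rb");
    if (!f || !buf) { std::free(buf); if (f) std::fclose(f); return -1; }

    int64_t lines = 0;
    std::size_t n;
    while ((n = std::fread(buf, 1, blockSize, f)) > 0) {
        const char* p = buf;
        const char* end = buf + n;
        while ((p = (const char*)std::memchr(p, '\n', end - p)) != nullptr) {
            ++lines;                                       // one newline = one line
            ++p;
        }
    }
    std::fclose(f);
    std::free(buf);
    return lines;
}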
Code has been pushed into the GitHub of the project, branch “testreadingtextfile”.
https://banksiagui.com
The most feature-rich chess GUI, based on the open-source Banksia - the chess tournament manager
You want to read the PGNs asynchronously. So either use a dedicated IO thread pool or mmap.
dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.
Maybe you copied your stockfish commits from someone else too?
I will look into that.