Open Chess Game Database Standard

mvanthoor · Post by **mvanthoor** » Mon Nov 15, 2021 12:58 am

dangi12012 wrote: ↑Mon Nov 15, 2021 12:35 am A Thread pool for IO? Asynchronous IO runs on a single thread.
If you look at Sopel's history - he is a forum troll so better not engage with him.

Maybe, before labeling someone a troll, you should also take into account that he has contributed a great deal to Stockfish's NNUE development, especially with regard to optimization.

Sopel · Post by **Sopel** » Mon Nov 15, 2021 1:10 am

dangi12012 wrote: ↑Mon Nov 15, 2021 12:35 am
Sopel wrote: ↑Sun Nov 14, 2021 10:23 pm You want to read the PGNs asynchronously. So either use a dedicated IO thread pool or mmap.
A Thread pool for IO? Asynchronous IO runs on a single thread.
If you look at Sopel's history - he is a forum troll so better not engage with him.

Real async IO is a pain in the ass to do in a portable way. A dedicated thread pool for IO works well enough and is simple to set up.

Guenther · Post by **Guenther** » Mon Nov 15, 2021 10:20 am

dangi12012 wrote: ↑Mon Nov 15, 2021 12:35 am
Sopel wrote: ↑Sun Nov 14, 2021 10:23 pm You want to read the PGNs asynchronously. So either use a dedicated IO thread pool or mmap.
...
If you look at Sopel's history - he is a forum troll so better not engage with him.

Quite the opposite, which has been clear since your first posts here...

phhnguyen · Post by **phhnguyen** » Mon Nov 15, 2021 12:31 pm

Sopel wrote: ↑Sun Nov 14, 2021 10:23 pm You want to read the PGNs asynchronously. So either use a dedicated IO thread pool or mmap.

Could you tell me more? Do you have ideas on how to use them efficiently? It would be best if you could provide some code or pull requests (on GitHub).

I have already tested memory-mapped files (mmap). It performs well, which surprised me, but it is only second-best: it trails the fastest method by a significant margin, and its code is more complicated.

I don’t have problems coming up with ideas or implementing them in general. However, I sometimes struggle with how to apply them efficiently. In this case, it is multi-threading. Both reading the input (the PGN file) and writing to the database (inserting records) are almost entirely sequential. Thus, logically, multi-threading won't help much, especially when the other steps (such as parsing PGN tags) are very fast too (not much work to share between threads).

Sopel · Post by **Sopel** » Mon Nov 15, 2021 2:47 pm

phhnguyen wrote: ↑Mon Nov 15, 2021 12:31 pm
Sopel wrote: ↑Sun Nov 14, 2021 10:23 pm You want to read the PGNs asynchronously. So either use a dedicated IO thread pool or mmap.
Could you tell me more? Do you have ideas on how to use them efficiently? It would be best if you could provide some code or pull requests (on GitHub).

I have already tested memory-mapped files (mmap). It performs well, which surprised me, but it is only second-best: it trails the fastest method by a significant margin, and its code is more complicated.

I don’t have problems coming up with ideas or implementing them in general. However, I sometimes struggle with how to apply them efficiently. In this case, it is multi-threading. Both reading the input (the PGN file) and writing to the database (inserting records) are almost entirely sequential. Thus, logically, multi-threading won't help much, especially when the other steps (such as parsing PGN tags) are very fast too (not much work to share between threads).

The thing is, your benchmark does no work other than reading, so synchronous solutions will not look worse than asynchronous ones. You should add some processing between the reads to see the issue. For asynchronous IO, std::async (https://en.cppreference.com/w/cpp/thread/async) is an easy solution that's good enough.
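
To illustrate overlapping reads with processing, here is a minimal sketch of read-ahead with std::async. The helper names (read_chunk, process, count_lines_async) and the newline counting that stands in for real work are assumptions made for this example, not code from the project under discussion.

```cpp
#include <cstddef>
#include <fstream>
#include <future>
#include <string>
#include <vector>

// Read up to `size` bytes starting at `offset`; returns fewer bytes at EOF.
// Each call opens its own stream, so it can safely run on another thread.
std::vector<char> read_chunk(const std::string& path, std::size_t offset,
                             std::size_t size) {
    std::ifstream in(path, std::ios::binary);
    in.seekg(static_cast<std::streamoff>(offset));
    std::vector<char> buf(size);
    in.read(buf.data(), static_cast<std::streamsize>(size));
    buf.resize(static_cast<std::size_t>(in.gcount()));
    return buf;
}

// Stand-in for real work (parsing, building statements): counts newlines.
std::size_t process(const std::vector<char>& chunk) {
    std::size_t n = 0;
    for (char c : chunk) n += (c == '\n');
    return n;
}

// Double buffering: while chunk i is processed, chunk i+1 is already being read.
std::size_t count_lines_async(const std::string& path, std::size_t chunk_size) {
    std::size_t total = 0, offset = 0;
    auto next = std::async(std::launch::async, read_chunk, path, offset, chunk_size);
    for (;;) {
        std::vector<char> chunk = next.get();   // wait for the pending read
        if (chunk.empty()) break;               // EOF (or unreadable file)
        offset += chunk.size();
        next = std::async(std::launch::async, read_chunk, path, offset, chunk_size);
        total += process(chunk);                // overlaps with the next read
    }
    return total;
}
```

With a trivial process() the synchronous and asynchronous versions should perform about the same; the difference should only show up once the processing step takes a noticeable share of the time.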

In this case, it is multi-threading. Both reading the input (the PGN file) and writing to the database (inserting records) are almost entirely sequential.

I highly doubt this is the case. You have a pipeline with 4 stages. 1. Reading the file. 2. Parsing the file. 3. Creating the import statements. 4. Importing into DB
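
A rough sketch of how the four stages could overlap, assuming C++17 and a small hand-written bounded queue. BoundedQueue, run_pipeline, and the Games table name are hypothetical; the string concatenation is only a placeholder for real parsing and real (escaped or prepared) statements.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Minimal bounded blocking queue; close() signals that no more items will come.
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    void push(T item) {
        std::unique_lock<std::mutex> lock(m_);
        not_full_.wait(lock, [&] { return q_.size() < capacity_; });
        q_.push(std::move(item));
        not_empty_.notify_one();
    }

    // Returns std::nullopt once the queue is closed and drained.
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lock(m_);
        not_empty_.wait(lock, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        T item = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return item;
    }

    void close() {
        std::lock_guard<std::mutex> lock(m_);
        closed_ = true;
        not_empty_.notify_all();
    }

private:
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
    std::queue<T> q_;
    std::size_t capacity_;
    bool closed_ = false;
};

// `games` stands in for stage 1 (reading the file); the returned vector
// stands in for stage 4 (the database).
std::vector<std::string> run_pipeline(const std::vector<std::string>& games) {
    BoundedQueue<std::string> raw(8), statements(8);

    std::thread reader([&] {            // stage 1: produce raw game text
        for (const auto& g : games) raw.push(g);
        raw.close();
    });
    std::thread parser([&] {            // stages 2 and 3: parse, build statement
        while (auto g = raw.pop())
            statements.push("INSERT INTO Games VALUES ('" + *g + "');");
        statements.close();
    });

    std::vector<std::string> imported;  // stage 4: single writer
    while (auto s = statements.pop()) imported.push_back(std::move(*s));

    reader.join();
    parser.join();
    return imported;
}
```

The bounded capacity is what keeps memory use flat: a fast reader blocks in push() instead of buffering the whole file ahead of a slower stage.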

dangi12012 · Post by **dangi12012** » Mon Nov 15, 2021 3:12 pm

phhnguyen wrote: ↑Mon Nov 15, 2021 12:31 pm I don’t have problems coming up with ideas or implementing them in general. However, I sometimes struggle with how to apply them efficiently. In this case, it is multi-threading. Both reading the input (the PGN file) and writing to the database (inserting records) are almost entirely sequential. Thus, logically, multi-threading won't help much, especially when the other steps (such as parsing PGN tags) are very fast too (not much work to share between threads).

As usual, very bad trolling advice from Sopel:

Sopel wrote: ↑Mon Nov 15, 2021 2:47 pm I highly doubt this is the case. You have a pipeline with 4 stages. 1. Reading the file. 2. Parsing the file. 3. Creating the import statements. 4. Importing into DB

Here's the answer:
Of course you can parse a file multithreaded and very fast, and here is how:
You have to index the offsets and lengths of all games in a PGN without parsing it: just the raw offsets, found with a JSON or plain-text seeker. This can be insanely fast because you may only need to search for {} tokens or double newlines.
For the Lichess DB I already did this, and there I just search for double newlines with C++ memchr() on a memory-mapped file.

Once you have this vector of offsets and lengths, you can easily spawn 32 threads and parse each offset and length separately via the mapped file.

The trick is that seeking for simple tokens in phase 0 is faster than fully parsing the PGN, so you just remember the pointer offset where each game starts, and on a second pass you parse in parallel. If your game is a class, you can even generate functions to prepare an SQL statement for the insert. This can also be done in parallel.

DB inserts cannot be done in parallel with SQLite (with proper server SQL DBs it's faster), but SQLite is thread-safe anyway, so no worries: just commit transactions from multiple threads. It should also be a little bit faster because inserts don't stall.

Sopel · Post by **Sopel** » Mon Nov 15, 2021 3:49 pm

dangi12012 wrote: ↑Mon Nov 15, 2021 3:12 pm
phhnguyen wrote: ↑Mon Nov 15, 2021 12:31 pm I don’t have problems coming up with ideas or implementing them in general. However, I sometimes struggle with how to apply them efficiently. In this case, it is multi-threading. Both reading the input (the PGN file) and writing to the database (inserting records) are almost entirely sequential. Thus, logically, multi-threading won't help much, especially when the other steps (such as parsing PGN tags) are very fast too (not much work to share between threads).
As usual, very bad trolling advice from Sopel:
Sopel wrote: ↑Mon Nov 15, 2021 2:47 pm I highly doubt this is the case. You have a pipeline with 4 stages. 1. Reading the file. 2. Parsing the file. 3. Creating the import statements. 4. Importing into DB
Here's the answer:
Of course you can parse a file multithreaded and very fast, and here is how:
You have to index the offsets and lengths of all games in a PGN without parsing it: just the raw offsets, found with a JSON or plain-text seeker. This can be insanely fast because you may only need to search for {} tokens or double newlines.
For the Lichess DB I already did this, and there I just search for double newlines with C++ memchr() on a memory-mapped file.

Once you have this vector of offsets and lengths, you can easily spawn 32 threads and parse each offset and length separately via the mapped file.

The trick is that seeking for simple tokens in phase 0 is faster than fully parsing the PGN, so you just remember the pointer offset where each game starts, and on a second pass you parse in parallel. If your game is a class, you can even generate functions to prepare an SQL statement for the insert. This can also be done in parallel.

DB inserts cannot be done in parallel with SQLite (with proper server SQL DBs it's faster), but SQLite is thread-safe anyway, so no worries: just commit transactions from multiple threads. It should also be a little bit faster because inserts don't stall.

You really like to assume that memory is infinite and that IO is free, don't you? You DO REQUIRE asynchronicity in addition to parallelism to make it work efficiently for files of all sizes.

dangi12012 · Post by **dangi12012** » Mon Nov 15, 2021 4:24 pm

Sopel wrote: ↑Mon Nov 15, 2021 3:49 pm You really like to assume that memory is infinite and that IO is free, don't you? You DO REQUIRE asynchronicity in addition to parallelism to make it work efficiently for files of all sizes.

Last time I will reply, since it's obvious you are a forum troll. Please look up how memory-mapped files work and why "memory is infinite" is wrong. You obviously don't know that memory-mapped files don't get loaded into RAM up front. Each thread can read via a pointer, and all IO is handled via page faults. You don't even need to copy from/to any buffers as with streaming IO: the OS maps the "buffer" (one page of memory at a time) directly into your virtual address space.
It's the most efficient way to read files on Windows/Linux, and you can even give the OS hints about whether you will access the file randomly or sequentially.

Read both carefully, but I won't reply to your troll attempts anymore.
https://en.wikipedia.org/wiki/Memory_management_unit
https://docs.microsoft.com/en-us/dotnet ... pped-files

Sopel · Post by **Sopel** » Mon Nov 15, 2021 4:34 pm

dangi12012 wrote: ↑Mon Nov 15, 2021 4:24 pm
Sopel wrote: ↑Mon Nov 15, 2021 3:49 pm You really like to assume that memory is infinite and that IO is free, don't you? You DO REQUIRE asynchronicity in addition to parallelism to make it work efficiently for files of all sizes.
Last time I will reply, since it's obvious you are a forum troll. Please look up how memory-mapped files work and why "memory is infinite" is wrong. You obviously don't know that memory-mapped files don't get loaded into RAM up front. Each thread can read via a pointer, and all IO is handled via page faults. You don't even need to copy from/to any buffers as with streaming IO: the OS maps the "buffer" (one page of memory at a time) directly into your virtual address space.
It's the most efficient way to read files on Windows/Linux, and you can even give the OS hints about whether you will access the file randomly or sequentially.

Read both carefully, but I won't reply to your troll attempts anymore.
https://en.wikipedia.org/wiki/Memory_management_unit
https://docs.microsoft.com/en-us/dotnet ... pped-files

I'm done with your straw man arguments. You're unable to hold a discussion.

Fulvio · Post by **Fulvio** » Mon Nov 15, 2021 6:55 pm

Sopel wrote: ↑Mon Nov 15, 2021 2:47 pm I highly doubt this is the case. You have a pipeline with 4 stages. 1. Reading the file. 2. Parsing the file. 3. Creating the import statements. 4. Importing into DB

In my experience (SCID reads PGN files in 128 KB chunks, automatically doubling the buffer up to 128 MB if it encounters larger games), operating systems are pretty good at optimizing point 1. Moving it to a separate thread may increase complexity without improving performance.
