handling huge pgn databases

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

Jonathan003
Posts: 244
Joined: Fri Jul 06, 2018 4:23 pm
Full name: Jonathan Cremers

Re: handling huge pgn databases

Post by Jonathan003 »

KLc wrote: Wed Feb 02, 2022 8:38 am There's the Lichess Elite Database https://database.nikonoel.fr. Otherwise, you have to filter yourself (e.g. with pgn-extract).
I have downloaded the SCID files from here:
https://database.nikonoel.fr/#:~:text=A ... ith%20scid.
There is a problem that tournament games are not included for some reason:
https://database.nikonoel.fr/#:~:text=I ... cluded)%3A
I have to check whether the tournament games from Lichess are included in the SCID download file that runs until the March 2021 issue.
Jonathan003
Posts: 244
Joined: Fri Jul 06, 2018 4:23 pm
Full name: Jonathan Cremers

Re: handling huge pgn databases

Post by Jonathan003 »

phhnguyen wrote: Wed Feb 02, 2022 2:21 pm
Jonathan003 wrote: Wed Feb 02, 2022 1:13 am
phhnguyen wrote: Tue Feb 01, 2022 5:58 am The program can create databases in SQL format that are the same size and speed as SCID's, but it can work with very large numbers of games. In this test, it works very well with 94 million Lichess games, needing about 1.5 hours to process that file (on my 5-year-old 4-core computer). We estimate it could work with a billion games too.
Thanks for the information, I will try this 'Open Chess Game Database Standard (OCGDB)' tool.
Can I download these 94 million games database from Lichess somewhere?
I know I can download Lichess databases here: https://database.lichess.org/
But I'm looking for collections of high quality Lichess databases.
Like all standard human games played on Lichess.org where at least one of the players has a minimum rating of 2000 Elo.
I have also used that link to download Lichess games and don't know of anywhere better.

You can create a high-quality database yourself with OCGDB by adding the parameter -elo 2000 when creating it, as below:

Code:

ocgdb -db bigdb.ocgdb.db3 -pgn file1.pgn -pgn file2.pgn -pgn file3.pgn -o moves2 -cpu 4 -elo 2000
OCGDB could run much faster if you filter out more games.
Can I somehow specify that only one player has to have a rating of at least 2000 Elo?
So games where one player is rated, for example, 1400 Elo and the other 2100 Elo would also be included.
And can I know for sure that it will be all human games without any engine games included?
Do you have any idea, if I do a search with these settings, how long it is going to take to download the games? And approximately how many games will it be?
KLc
Posts: 140
Joined: Wed Jun 03, 2020 6:46 am
Full name: Kurt Lanc

Re: handling huge pgn databases

Post by KLc »

Jonathan003 wrote: Wed Feb 02, 2022 4:40 pm There is a problem that tournament games are not included for some reason:
https://database.nikonoel.fr/#:~:text=I ... cluded)%3A
I have to check whether the tournament games from Lichess are included in the SCID download file that runs until the March 2021 issue.
As far as I know, the Lichess databases do not contain tournament games. I don't know why.

Edit: Ah, on the Elite website it's said: "I published the Lichess Elite Database from 2013 until May 2020 as a torrent which you can get below (known issue: tournament games from lichess are not included)". I don't know if this only holds for the Elite or the whole Lichess base.
phhnguyen
Posts: 1526
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: handling huge pgn databases

Post by phhnguyen »

Jonathan003 wrote: Wed Feb 02, 2022 5:12 pm Can I somehow specify that only one player has to have a rating of at least 2000 Elo?
So games where one player is rated, for example, 1400 Elo and the other 2100 Elo would also be included.
OCGDB doesn’t have that feature yet. We may implement it in the future.

BTW, if you know SQL, it is something you can do yourself. An OCGDB database is just a typical SQL/SQLite database. You can study it with any SQLite browser/studio/GUI. You can also select games and create new sub-databases with your own conditions, using a few SQL statements.

[Image: using a SQL browser to query with SQL statements]
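
The "at least one player rated 2000 or more" condition asked about above could be expressed in SQL roughly as follows. This is only a minimal sketch using Python's standard sqlite3 module. The table and column names (Games, WhiteElo, BlackElo), the function name, and the file paths are assumptions for illustration, so check the real schema of your OCGDB file in an SQLite browser first.

```python
import sqlite3


def export_one_strong_player(src_path, dst_path, min_elo=2000):
    """Copy games where at least one player is rated >= min_elo into a new file.

    Assumes a table named Games with integer columns WhiteElo and BlackElo
    (hypothetical names; adapt them to the actual schema).
    Returns the number of games copied.
    """
    # isolation_level=None puts the connection in autocommit mode, so
    # ATTACH/DETACH are never issued inside an open transaction.
    con = sqlite3.connect(src_path, isolation_level=None)
    try:
        # ATTACH lets one connection read the source and write the sub-database.
        con.execute("ATTACH DATABASE ? AS sub", (dst_path,))
        con.execute(
            "CREATE TABLE sub.Games AS "
            "SELECT * FROM main.Games "
            "WHERE WhiteElo >= ? OR BlackElo >= ?",
            (min_elo, min_elo),
        )
        (count,) = con.execute("SELECT COUNT(*) FROM sub.Games").fetchone()
        con.execute("DETACH DATABASE sub")
        return count
    finally:
        con.close()
```

Note that this copies only the one table. If player or event names are stored in separate tables referenced by ID, those tables would need to be copied as well.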

Jonathan003 wrote: Wed Feb 02, 2022 5:12 pm And can I know for sure that it will be all human games without any engine games included?
I think bots may have the prefix BOT. Of course, you can't be sure about cheating accounts that use engines under human names.
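
Both conditions could also be applied to the PGN file itself, before any conversion. The sketch below streams a PGN file and keeps a game only if neither side is marked as a bot and at least one side is rated 2000 or more. It assumes that bot accounts appear as a WhiteTitle or BlackTitle tag with the value BOT, which is how Lichess exports mark them as far as I know, and that every game starts with its tag lines. The function names are made up for this example.

```python
import re

# Matches a full PGN tag line such as [WhiteElo "2100"].
TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')


def iter_games(lines):
    """Yield (tags, raw_text) for each game in an iterable of PGN lines.

    A new game is assumed to start at the first tag line that follows
    movetext, which holds for standard PGN exports.
    """
    tags, raw, in_moves = {}, [], False
    for line in lines:
        m = TAG_RE.match(line)
        if m and in_moves:
            yield tags, "".join(raw)
            tags, raw, in_moves = {}, [], False
        if m:
            tags[m.group(1)] = m.group(2)
        elif line.strip():
            in_moves = True
        raw.append(line)
    if tags:
        yield tags, "".join(raw)


def elo(tags, key):
    """Return the rating as int, or 0 when missing or non-numeric (e.g. "?")."""
    value = tags.get(key, "")
    return int(value) if value.isdigit() else 0


def keep(tags, min_elo=2000):
    """True if neither side is tagged BOT and at least one side is rated >= min_elo."""
    if "BOT" in (tags.get("WhiteTitle"), tags.get("BlackTitle")):
        return False
    return max(elo(tags, "WhiteElo"), elo(tags, "BlackElo")) >= min_elo


def filter_pgn(src_path, dst_path, min_elo=2000):
    """Write the kept games to dst_path and return how many were kept."""
    kept = 0
    with open(src_path, encoding="utf-8", errors="replace") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for tags, raw in iter_games(src):
            if keep(tags, min_elo):
                dst.write(raw)
                kept += 1
    return kept
```

This reads one line at a time, so memory use stays small even for very large files. It cannot detect human accounts that secretly use an engine.
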
Jonathan003 wrote: Wed Feb 02, 2022 5:12 pm Do you have any idea, if I do a search with these settings, how long it is going to take to download the games? And approximately how many games will it be?
Not sure about this question. Do you mean downloading PGN files from Lichess? It depends a lot on Internet speed. The file of 94 million games is about 25 GB in zip format; it took me about a night (or a day, I can't remember exactly) to download. I only learned the number of games (94 million) after converting the file into an SQL database.

Now you can guess the number of games based on file sizes ;)
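
As a rough worked example of that guess, the figures above (94 million games in about 25 GB compressed) give a bytes-per-game ratio that can be applied to other files. The sketch assumes 1 GB = 10^9 bytes and that other Lichess files compress about as well, which will vary with the month and with how many games carry annotations.

```python
GAMES = 94_000_000        # games in the file mentioned above
COMPRESSED_BYTES = 25e9   # about 25 GB compressed (taking 1 GB = 10**9 bytes)


def estimate_games(compressed_bytes, bytes_per_game=COMPRESSED_BYTES / GAMES):
    """Estimate the game count of a compressed Lichess PGN file from its size."""
    return round(compressed_bytes / bytes_per_game)


print(f"{COMPRESSED_BYTES / GAMES:.0f} bytes per game")  # about 266
print(estimate_games(10e9))  # a 10 GB file: roughly 37.6 million games
```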

More information:

As I have mentioned, OCGDB databases could be as small and as fast as SCID’s similar ones, plus it could work with much larger numbers of games.

For medium-sized, popular databases such as MillionBase 3.45 (3.45 million games), OCGDB can do all tasks within 2 minutes on my 5-year-old 4-core computer, including creating/converting from a PGN file into a new database, approximate position searching (scanning all records), and checking for/removing duplicates.

The 94 million Lichess games require 1 hour 30 minutes to convert from a PGN file into an SQLite database. Tasks that scan all records (such as approximate position searching) need 1-2 hours.

Of course, many other tasks such as finding a game based on ID could run instantly. If your computer is faster and has more cores, it could run faster too.

IMHO, converting from a Lichess PGN file into an SQLite database is a win already:
  • You could save a lot of space: the Lichess PGN file takes over 200 GB (uncompressed) on my hard disk. Working with that file is a nightmare: slow and inconvenient. In contrast, after converting (with all comments and site information removed, which are not really useful for users - read more here) the SQLite file takes only 13 GB, which is reasonable for storage and much faster and more convenient for any action
  • You could work with it without using our program or any specific chess program. Just use typical SQL GUIs. You never have to worry about whether we or other developers stop supporting our chess database programs
  • SQLite is built on a SQL engine that is very strong at databases and querying. Standing on the shoulders of a giant typically has many benefits :D
https://banksiagui.com
The most features chess GUI, based on opensource Banksia - the chess tournament manager
phhnguyen
Posts: 1526
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: handling huge pgn databases

Post by phhnguyen »

We have just done a small study of algorithms for detecting duplicates. Some programs implement this in a simple, straightforward way. However, our study may surprise some people, since some popular methods could be wrong in over 50% of cases in some tests.

You may read about it here.

A new version (Beta 5) of OCGDB with the new algorithm for detecting duplicates has been released. It runs correctly (according to our tests) and very fast.
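
For readers unfamiliar with the problem, here is a minimal sketch of one simple way to flag duplicates: treat two games as duplicates when their move lists are identical after stripping comments, NAGs, move numbers and the result. This is not the OCGDB algorithm, which the post does not describe, and the function names are made up for this example. Variations in parentheses are not handled.

```python
import hashlib
import re


def normalize_moves(movetext):
    """Reduce PGN movetext to a plain space-separated list of SAN moves."""
    text = re.sub(r"\{[^}]*\}", " ", movetext)             # drop {comments}
    text = re.sub(r"\$\d+", " ", text)                     # drop NAGs such as $1
    text = re.sub(r"\d+\.(\.\.)?", " ", text)              # drop "1." and "1..."
    text = re.sub(r"(1-0|0-1|1/2-1/2|\*)\s*$", " ", text)  # drop the result token
    return " ".join(text.split())


def find_duplicates(games):
    """Return lists of game IDs whose normalized move lists are identical.

    `games` is an iterable of (game_id, movetext) pairs.
    """
    groups = {}
    for game_id, movetext in games:
        key = hashlib.sha1(normalize_moves(movetext).encode("utf-8")).hexdigest()
        groups.setdefault(key, []).append(game_id)
    return [ids for ids in groups.values() if len(ids) > 1]
```

Even this strict rule can misfire: two different pairs of players can play the same short game, so a real tool would probably also need to compare players, dates or other tags.
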