search huge database for lines in repertoire

Jonathan003
Posts: 239
Joined: Fri Jul 06, 2018 4:23 pm
Full name: Jonathan Cremers

search huge database for lines in repertoire

Post by Jonathan003 »

Chessbase 16 has an option to search a big CBH database with millions of games for games that follow opening lines in your White or Black repertoire.
It isn't perfect, but it does a reasonably good job.
The problem is that it is very slow: for a database with 20,000,000 games, a search like this takes about 48 hours on my PC.
Here is what the Chessbase manual says about this function:
https://help.chessbase.com/CBase/16/Eng ... 000063.htm
See at the bottom of this page, next to "In Repertoire".

And this is how to create these white and black repertoires in Chessbase 16:
https://help.chessbase.com/CBase/16/Eng ... tabase.htm

This function creates merged games for every separate opening, and you can define these databases as your White or your Black repertoire.

I wonder whether any other tool or software offers this functionality.
I want to do this with a database of 100,000,000 games, so I want it to be fast and accurate.
Cornfed
Posts: 511
Joined: Sun Apr 26, 2020 11:40 pm
Full name: Brian D. Smith

Re: search huge database for lines in repertoire

Post by Cornfed »

Jonathan003 wrote: Wed Feb 02, 2022 5:31 pm I want to do this with a database of 100,000,000 games, so I want it to be fast and accurate.
100 Million games?? Isn't that going to give you a lot of 'crap'?
I ask because I do mine only after curating a 'Quality Base'...games I've intentionally fed into it from different sources (usually with notes e-books and such) and games where either one or both players are over a certain rating.
So I'm just curious what the benefit might be over my approach.
Jonathan003
Posts: 239
Joined: Fri Jul 06, 2018 4:23 pm
Full name: Jonathan Cremers

Re: search huge database for lines in repertoire

Post by Jonathan003 »

I want to use the databases to search for tactics that can arise from the openings I play. That's why I need many, many games. I'm planning to filter the databases. Filtering by ECO code alone is not sufficient because of the many possible transpositions, and because there is still a lot of variety within a given ECO code.
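
To illustrate why transpositions defeat filtering by move order, here is a minimal sketch that compares games by the position they reach rather than by the moves played. It assumes moves are given in coordinate notation ("d2d4"), and it uses a deliberately simplified board that does no legality checking and ignores castling, en passant and promotion. The names `positions_along` and `reaches_repertoire` are hypothetical, not taken from any tool mentioned in this thread.

```python
def start_position():
    """Return the initial position as a dict mapping square -> piece letter."""
    board = {}
    back_rank = "rnbqkbnr"
    for i, file in enumerate("abcdefgh"):
        board[file + "1"] = back_rank[i].upper()
        board[file + "2"] = "P"
        board[file + "7"] = "p"
        board[file + "8"] = back_rank[i]
    return board


def positions_along(moves):
    """Yield a hashable key (piece placement, side to move) after every ply.

    Simplification (assumption): a move just relocates the piece on the
    source square; castling, en passant and promotion are not handled.
    """
    board = start_position()
    for ply, move in enumerate(moves, start=1):
        src, dst = move[:2], move[2:4]
        board[dst] = board.pop(src)
        side_to_move = "w" if ply % 2 == 0 else "b"
        yield frozenset(board.items()), side_to_move


def final_position(moves):
    """Return the key of the position reached after the last move."""
    key = None
    for key in positions_along(moves):
        pass
    return key


def reaches_repertoire(game_moves, target_positions):
    """True if the game passes through any repertoire target position."""
    return any(key in target_positions for key in positions_along(game_moves))


# Two move orders that reach the same Queen's Gambit Declined position.
repertoire_line = "d2d4 d7d5 c2c4 e7e6 b1c3 g8f6".split()
transposed_game = "c2c4 e7e6 b1c3 d7d5 d2d4 g8f6".split()
unrelated_game = "e2e4 e7e5 g1f3".split()

targets = {final_position(repertoire_line)}

print(reaches_repertoire(transposed_game, targets))  # same position, different order
print(reaches_repertoire(unrelated_game, targets))
```

A real tool would need a full move generator and a proper position hash; the point here is only that the lookup key is the position, so the order of the moves no longer matters.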
phhnguyen
Posts: 1504
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: search huge database for lines in repertoire

Post by phhnguyen »

For creating huge databases of over 100 million games, AFAIK, you have only one choice: OCGDB ;) . None of the other chess database programs could create even close to that number!

To match a huge database against a set of games (a repertoire), I could make it run very fast, in almost linear time. The new feature for detecting duplicates (read here) has to make a lot of comparisons between games (each game has to be compared with all the others), and I could make it run in linear O(n) time. I could easily do the same for a small set of games. On my old computer, it takes about 1-2 hours to scan all games of a 94-million-game database, so the new feature may take a similar amount of time.

The problem here is that it is not implemented yet. I don’t have any problem creating that function. However, I may have problems with the interface (how users would use it) as well as being… lazy.

If you are keen, join us, think about the interface and help me test it, and I will implement it.
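
As a rough illustration of how a repertoire match can stay close to linear in the number of games, the sketch below stores the repertoire lines in a trie and walks each game down it once. This is not OCGDB's implementation (which, as stated above, does not exist yet); `build_trie`, `in_repertoire` and the `"$end"` marker are hypothetical names, and the sample lines are made up for the example.

```python
END = "$end"  # hypothetical marker for "a repertoire line ends here"


def build_trie(lines):
    """Build a nested-dict trie from repertoire lines (lists of SAN moves)."""
    root = {}
    for line in lines:
        node = root
        for move in line:
            node = node.setdefault(move, {})
        node[END] = True
    return root


def in_repertoire(game_moves, trie):
    """True if the game follows some repertoire line all the way to its end.

    The walk stops at the first deviation, so the cost per game is bounded
    by the depth of the repertoire, not by the length of the game.
    """
    node = trie
    for move in game_moves:
        if END in node:
            return True
        node = node.get(move)
        if node is None:
            return False
    return END in node


repertoire = [
    ["e4", "c5", "Nf3", "d6"],
    ["e4", "e5", "Nf3", "Nc6", "Bb5"],
]
trie = build_trie(repertoire)

games = [
    ["e4", "c5", "Nf3", "d6", "d4", "cxd4"],  # follows the first line
    ["e4", "e5", "Nf3", "Nc6", "Bc4"],        # deviates on the last move
    ["e4", "c5", "Nf3"],                      # too short to complete a line
]
matches = [g for g in games if in_repertoire(g, trie)]
print(len(matches))
```

Because each game is visited once and compared only against the trie, the total work grows with the number of games rather than with games times repertoire lines. Note that this version matches by move order only; it does not catch transpositions.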
https://banksiagui.com
The most features chess GUI, based on opensource Banksia - the chess tournament manager
Jonathan003
Posts: 239
Joined: Fri Jul 06, 2018 4:23 pm
Full name: Jonathan Cremers

Re: search huge database for lines in repertoire

Post by Jonathan003 »

phhnguyen wrote: Fri Feb 04, 2022 3:46 am For creating huge databases of over 100 million games, AFAIK, you have only one choice: OCGDB ;) . None of the other chess database programs could create even close to that number!

To match a huge database with a set of games (repertoire) I could make it run very fast, almost linear time. The new feature for detecting duplicates (read here) has to make a lot of comparisons between games (each game has to compare with all other ones) and I could make it run with linear O(n) time. I could do the same easily for a small set of games. In my old computer, it needs about 1-2 hours to scan all games of a 94 million game database with my old computer, thus it may take a similar amount of time for the new feature.

The problem here is that it is not implemented yet. I don’t have any problem creating that function. However, I may have problems with the interface (how users use that) as well as being… lazy.

If you are keen, join us, think about the interface and help me to test it, I will implement it.
I'm certainly interested in testing the tool, but I have no programming knowledge at all.
It looks like someone has already made a chess GUI that converts PGN databases to SQL:
http://software-tecnico-libre.es/en/art ... er-guide-1
phhnguyen
Posts: 1504
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: search huge database for lines in repertoire

Post by phhnguyen »

Jonathan003 wrote: Fri Feb 04, 2022 8:34 am It looks like someone already made a chess GUI that converts pgn databases to SQL
http://software-tecnico-libre.es/en/art ... er-guide-1
Nice to know another program using SQL for chess databases!!!

Unfortunately, I can’t run that program (it keeps complaining about a missing Oracle library), so I don't know much about it. Perhaps you or someone else who can run it could test whether it works with huge databases (say, over 50 million games) and measure how long it takes for vital actions such as creating, loading, getting a game, searching for a position…

Theoretically, databases in SQL could hold huge numbers of games, say, billions.

On the one hand, building an SQL database is an easy, straightforward task. Many developers have tried and quickly succeeded.

On the other hand, it is not easy to develop it further into a full-featured, useful program, or to go deeper and solve some chess-specific problems. I have read that some very highly skilled developers tried SQL and gave up because they could not solve problems such as reasonable speed, position searching…

A chess database program may run smoothly and perfectly with a small number of games, but the majority of them cannot handle huge numbers of games at all, or their processing time is not acceptable. We must test each program to know its limits.

As for the program you linked above, I doubt it could work with huge numbers of games in a reasonable time. Even though I couldn’t run it, I found a few scripts of SQL statements. IMO, it may create too many records for each game. I guess it needs them for searching positions (as the name of the program suggests). However, from my experience, that can take too much space and time, making the program crawl or even hang forever when working with huge numbers of games.

BTW, we can’t know without testing. I may be wrong. I hope someone (or the authors) will correct me!
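
To make the "too many records per game" concern concrete, here is a small sketch using Python's built-in `sqlite3` module. It contrasts one row per game with one row per position. The table and column names are assumptions for illustration only; they are neither OCGDB's schema nor the schema of the linked program. The `pos_key` column uses the move prefix as a stand-in for a real position hash, and the average game length of 80 plies is likewise an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Layout A: one row per game, moves kept in a single text field.
    CREATE TABLE games (
        id INTEGER PRIMARY KEY,
        white TEXT, black TEXT, result TEXT, moves TEXT
    );
    -- Layout B: in addition, one row per position reached in each game.
    CREATE TABLE positions (
        game_id INTEGER, ply INTEGER, pos_key TEXT
    );
    """
)

sample_games = [
    ("Player A", "Player B", "1-0", "e4 e5 Nf3 Nc6 Bb5 a6"),
    ("Player C", "Player D", "0-1", "d4 d5 c4 e6"),
]

for white, black, result, moves in sample_games:
    cur = conn.execute(
        "INSERT INTO games (white, black, result, moves) VALUES (?, ?, ?, ?)",
        (white, black, result, moves),
    )
    game_id = cur.lastrowid
    tokens = moves.split()
    # Placeholder key: the move prefix stands in for a real position hash.
    rows = [
        (game_id, ply, " ".join(tokens[:ply]))
        for ply in range(1, len(tokens) + 1)
    ]
    conn.executemany("INSERT INTO positions VALUES (?, ?, ?)", rows)

game_rows = conn.execute("SELECT COUNT(*) FROM games").fetchone()[0]
position_rows = conn.execute("SELECT COUNT(*) FROM positions").fetchone()[0]
print(game_rows, position_rows)

# Back-of-the-envelope scaling under the assumed average game length.
assumed_avg_plies = 80
estimated_position_rows = 100_000_000 * assumed_avg_plies
print(estimated_position_rows)
```

Under those assumptions, a 100-million-game database would need on the order of 8 billion position rows, compared with 100 million game rows, which is the kind of growth described above.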
Jonathan003
Posts: 239
Joined: Fri Jul 06, 2018 4:23 pm
Full name: Jonathan Cremers

Re: search huge database for lines in repertoire

Post by Jonathan003 »

The new HIARCS Chess Explorer Pro claims to handle huge databases without problems, with their new HCE database format.
But the software is very expensive.
Jonathan003
Posts: 239
Joined: Fri Jul 06, 2018 4:23 pm
Full name: Jonathan Cremers

Re: search huge database for lines in repertoire

Post by Jonathan003 »

phhnguyen wrote: Fri Feb 04, 2022 3:46 am The new feature for detecting duplicates (read here) has to make a lot of comparisons between games (each game has to compare with all other ones) and I could make it run with linear O(n) time.
I have a question about the detection of duplicates with your tool.
I would like an option to delete duplicates based only on exactly the same moves (or moves that are a subset of another game's) and the results of the games. Otherwise, identical games where one copy has a misspelled player name will not be detected (if one game has the correct spelling and the duplicate doesn't). If you search for duplicates with SCID, these duplicates don't get detected either. Chessbase 16 detects them but is very slow. pgn-extract also detects these duplicates, and so does the Eman tool PgnTools 1.40. Only Chessbase 16 keeps the game with the correct spelling.
pgn-extract and Eman PgnTools detect the duplicates but keep the games with the misspelled names for some reason.
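
The kind of check described above could look roughly like the sketch below: duplicates are grouped by moves and result only, ignoring names, and games whose moves are a proper prefix of another game are flagged separately. Choosing which copy to keep by how often its player names occur in the whole database is a hypothetical heuristic for "correct spelling", not how Chessbase, pgn-extract or PgnTools decide. The player names and games are invented for the example.

```python
from collections import Counter, defaultdict

games = [
    {"id": 1, "white": "Smith, John", "black": "Jones, Ann", "result": "1-0",
     "moves": ("e4", "c5", "Nf3", "d6", "d4")},
    {"id": 2, "white": "Smiht, John", "black": "Jones, Ann", "result": "1-0",
     "moves": ("e4", "c5", "Nf3", "d6", "d4")},  # same game, misspelled name
    {"id": 3, "white": "Smith, John", "black": "Brown, Lee", "result": "0-1",
     "moves": ("d4", "d5", "c4")},
    {"id": 4, "white": "Smith, John", "black": "Jones, Ann", "result": "1-0",
     "moves": ("e4", "c5", "Nf3")},              # truncated copy of game 1
]

# How often each spelling of a name occurs across the whole database.
name_counts = Counter()
for game in games:
    name_counts.update([game["white"], game["black"]])


def name_score(game):
    """Hypothetical heuristic: common spellings are more likely correct."""
    return name_counts[game["white"]] + name_counts[game["black"]]


# Step 1: exact duplicates share the same moves and result; names are ignored.
groups = defaultdict(list)
for game in games:
    groups[(game["moves"], game["result"])].append(game)

kept = [max(group, key=name_score) for group in groups.values()]

# Step 2: flag games whose moves are a proper prefix of another kept game.
prefixes = set()
for game in kept:
    moves = game["moves"]
    for length in range(1, len(moves)):
        prefixes.add(moves[:length])

unique_games = [g for g in kept if g["moves"] not in prefixes]

print([g["id"] for g in kept])
print([g["id"] for g in unique_games])
```

Grouping through a dictionary keyed on the moves means each game is looked at once, so no game has to be compared directly with every other game. Storing every prefix is memory-hungry, so a tool working on 100 million games would likely use a more compact structure for the subset check.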
To test this duplicate detection, I added the database I found here:
https://database.nikonoel.fr/
I downloaded the SCID database and exported it to PGN with SCID:
https://lichess-elite-db.s3.eu-west-3.a ... 2021-03.7z

together with the human vs. human database of games between players rated at least 2200 Elo, which I found here:
https://rebel13.nl/download/data.html