Duplicates
Typically a game is considered a duplicate of another game if the two have identical move lists. The straightforward way is to compare the move lists of all games against each other. However, that way is too slow: it loads and retrieves whole move lists many times, and the pairwise comparison has O(n²) complexity. We have read that some applications are probably implemented that way and run painfully slow.
Detecting duplicate games in a database is a new feature of OCGDB. It is a “dangerous” one since it comes with the ability to remove duplicate games. Any mistake may lead to losing data.
In this experiment, we study some methods/solutions for detecting duplicates; a solution should work correctly and in a reasonable time. We focus on matching move lists and ignore other factors such as game results, event names, player names... We won't talk about the correctness of the games either.
As usual, we use the MillionBase 3.45 database (3.45 million games) as a test set. It has already been converted into an SQLite database with the column Moves2 (2-byte encoding per move). Because the moves are encoded, tests run very fast, saving a lot of time for our experiment.
1. Hash keys of last positions (last hash keys)
For each game, we retrieve and store the hash key of the last position to compare with those of other games. If two games have the same last hash key, they are considered duplicates.
This way requires a small amount of memory because each game needs only a hash key (and perhaps some extra info). Computing with hash keys (searching, storing) can be very fast too.
This method is quite popular. People have mentioned it many times. Some programs used it already.
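To make the idea concrete, here is a minimal sketch (our illustration, not OCGDB's actual code) that groups games by their last-position hash key; how the key itself is computed is left out:
Code:
#include <cstdint>
#include <unordered_map>
#include <vector>

// maps a last-position hash key to the IDs of games that ended there
std::unordered_map<uint64_t, std::vector<int>> keyToGameIds;

// Returns IDs of earlier games sharing the same last hash key, then
// records this game so later games can be checked against it.
std::vector<int> findCandidates(int gameId, uint64_t lastHashKey) {
    auto& ids = keyToGameIds[lastHashKey];
    std::vector<int> candidates = ids;
    ids.push_back(gameId);
    return candidates;
}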
We implemented it and a brief of the result is below:
Code:
#games: 3456399, elapsed: 306071ms 05:06, speed: 11292 games/s, #duplicates: 56626
Wow, there are too many games reported as duplicates: over 56 thousand in a 3.45-million-game database. It is a big surprise! Frankly speaking, we expected zero for such a good database. We have read that the author of that database (@Rebel) checked and verified it carefully before releasing it. Perhaps we did something wrong?
We verified each pair of reported duplicate games. OCGDB reports them with their IDs, and it has a useful feature to extract and display game PGNs by their IDs:
Code:
ocgdb -db 345.ocgdb.db3 -o printpgn -g 33 -g 35
Below is one reported duplicate pair:
ID: 93794
ID: 3453793
The two games above have identical move lists; they are true duplicates. However, we typically would not treat such games as duplicates since they are too short.
2. Last hash keys, ignore short games
We implemented the parameter -plycount to ignore any game whose move list is shorter than plycount, as sketched below.
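A tiny sketch of what we mean by the filter (assuming moves are the 2-byte encoded values of the Moves2 column; an illustration, not OCGDB's exact code):
Code:
#include <cstdint>
#include <vector>

bool shouldCheck(const std::vector<uint16_t>& moves, int plyCountThreshold) {
    // games shorter than the threshold are never reported as duplicates
    return (int)moves.size() >= plyCountThreshold;
}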
The result with PlyCount 20:
Code:
#games: 3456399, elapsed: 356036ms 05:56, speed: 9708 games/s, #duplicates: 46774
The number of duplicates was reduced significantly but is still too large IMHO.
We found some reportedly duplicate games such as the pair below:
ID: 759
ID: 67163
The above games have the same last hash key/position. However, they are completely different games. If we deleted either of them, we would lose it for no good reason.
Perhaps, if we count game lengths, the results may be better?
3. Last hash keys + game lengths
We encode the game length (using Zobrist random numbers) and then combine it with the last hash key (using the XOR operator) to create a new key for comparison. The plycount is still set to 20.
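Here is a minimal sketch of such a combination, with an assumed table of pre-generated random 64-bit numbers, one per possible game length; it is an illustration, not OCGDB's actual code:
Code:
#include <cstdint>
#include <random>
#include <vector>

// one pre-generated random number per possible game length (in plies)
static const std::vector<uint64_t> lengthZobrist = [] {
    std::mt19937_64 rng(2022);            // fixed seed, filled once
    std::vector<uint64_t> v(1024);
    for (auto& x : v) x = rng();
    return v;
}();

uint64_t makeKey(uint64_t lastHashKey, int plyCount) {
    // two games collide only if they share both the final position
    // and the game length
    return lastHashKey ^ lengthZobrist[plyCount % lengthZobrist.size()];
}
The result: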
Code:
#games: 3456399, elapsed: 409704ms 06:49, speed: 8436 games/s, #duplicates: 29524
The program reported a significantly reduced number of duplicates.
However, we still found some pairs of games such as the one below:
ID: 2953847
ID: 3423761
Those games have the same length, the same last position, and even the same result. But they are clearly not duplicates.
We should take all the middle moves/positions into account.
4. Combine all hash keys
For each game, we combine the hash keys of all its moves/positions (using XOR) into a new key for comparison.
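A small sketch of the combination (the position hashes are assumed to come from replaying the game; hypothetical names, not OCGDB's code):
Code:
#include <cstdint>
#include <vector>

uint64_t combineAll(const std::vector<uint64_t>& positionHashes) {
    uint64_t key = 0;
    for (auto h : positionHashes) {
        key ^= h;    // note: XOR is commutative, so move order is lost
    }
    return key;
}
The result: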
Code:
#games: 3456399, elapsed: 376612ms 06:16, speed: 9177 games/s, #duplicates: 33706
The problem is that the reported number of duplicates increased. Below is a pair:
ID: 1780
ID: 3426054
Those games have some similar moves but in a different order; they are not duplicates. Since XOR is commutative, the combined key does not depend on the order in which the positions occur, so transposed games can collide.
In one of the previous posts, someone mentioned that the program PGN-extract uses the method of the last hash key plus an additional one computed from the move lists. Let us quote it again:
Code:
Duplicates are identified by comparing a hash value for the board of the end positions of extracted games and an additional cumulative hash value generated from the move sequence. If both values match then games are considered to be duplicates.
We believe it is somewhat similar to this step.
Perhaps using hash keys from all positions causes some kind of bit overflow/saturation. We are not really sure, but we tried reducing the number of hash keys involved.
5. Combination of hash keys from one of every 5 moves
We create a new key from the last hash key and the hash of one position out of every 5 moves.
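A small sketch of what we mean (assuming, as before, that the position hashes come from replaying the game; an illustration only):
Code:
#include <cstdint>
#include <vector>

uint64_t sampledKey(const std::vector<uint64_t>& positionHashes) {
    if (positionHashes.empty()) return 0;
    uint64_t key = positionHashes.back();      // last position's hash
    for (size_t ply = 0; ply < positionHashes.size(); ply += 5) {
        key ^= positionHashes[ply];            // one of every 5 plies
    }
    return key;
}
The result is as below: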
Code:
#games: 3456399, elapsed: 361131ms 06:01, speed: 9571 games/s, #duplicates: 21772
The number of reported duplicates was reduced. However, we still found that some of the reported pairs are not duplicates.
6. Compare move lists
Hash keys are good for matching positions or combinations of positions. However, we can't encode the order of positions into hash keys. In other words, using hash keys alone can't solve the problem of different move orders. Thus, in the end, we decided to compare whole move lists to make sure a pair of games is really a duplicate.
The algorithm now works as follows:
- Compare the combinations of hash keys as in part (5) above to find all potential duplicates
- Verify each reported pair of games by comparing their whole move lists
When checking whether a given game is a duplicate, we have already extracted its move list. If it is reported as a duplicate of some other games, we load all those games to verify further. This slows things down a bit, but because the number of such checks is small, the whole process is still fast in general.
Basically, we need to read and retrieve all games' move lists only once, so the complexity is linear, O(n). Only a small number of games become suspects and are reloaded for verification, too few to affect the overall complexity.
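Below is a condensed sketch of the whole two-phase check as we understand it; names such as loadMoves are placeholders, not OCGDB's real functions:
Code:
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

using MoveList = std::vector<uint16_t>;    // 2-byte encoded moves (Moves2)

struct DuplicateDetector {
    // phase 1: group games by the combined hash key
    std::unordered_map<uint64_t, std::vector<int>> keyToIds;
    std::function<MoveList(int)> loadMoves;  // re-reads a game from the DB

    // Called once per game while scanning the database (O(n) scans total).
    std::vector<int> check(int gameId, uint64_t combinedKey,
                           const MoveList& moves) {
        std::vector<int> duplicates;
        auto& ids = keyToIds[combinedKey];
        for (int otherId : ids) {
            // phase 2: only hash collisions reach this reload-and-compare
            if (loadMoves(otherId) == moves) duplicates.push_back(otherId);
        }
        ids.push_back(gameId);
        return duplicates;
    }
};
The result: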
Code:
#games: 3456399, elapsed: 122816ms 02:02, speed: 28142 games/s, #duplicates: 17344
The program now needs only 2 minutes to check the whole database.
After verifying, we think 17344 is the correct number of duplicates with a plyCount of 20.
Looking back at part 2 (which used only last hash keys) with its result of 46774, there were 29430 games (or 63%) wrongly reported as duplicates.
Part 4 has a smaller number, 33706, which still means 16362 games (49%) would have been deleted for nothing.
Below is the result when we increase the plycount from 20 to 40:
Code:
#games: 3456399, elapsed: 130847ms 02:10, speed: 26415 games/s, #duplicates: 15162
Below is a detected duplicate pair:
ID: 19103
ID: 1505936
They are real duplicates and we could guess which one is the wrong clone (but that is another story)!
7. Source code
All source code has already been pushed to the main branch, with a new release (Beta 5).
8. Quick conclusion
Using only last hash keys is not good enough to detect duplicate games, since it may wrongly flag a large proportion of games (over 63% of the reported duplicates in our test). We should compare whole move lists to be correct.
In this experiment we developed a method that reduces the loading and retrieval of move lists to linear complexity, O(n). The implementation runs very fast in practice.