CookieCat's opening book implementation

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: CookieCat's opening book implementation

Post by Sven »

Dave_N wrote:
mar wrote:
Sven Schüle wrote:Martin and David,

can one of you please enlighten me why this "memory mapped file" issue is related to the opening book creation? For the latter you open a PGN file and scan it once from top to bottom, no need to store the whole PGN somehow, so I don't quite get the point. Why would you need any kind of indexing into the PGN file to create a book?

Sven
Hi Sven,
I was simply sharing my experience with memory mapping with David. I used to memory map the whole PGN into memory and then create a book from it. I sure don't need any index here and I think it's better not to use memory mapping at all. But I thought David is perhaps writing a PGN viewer or something like that. Sure it's a bit off topic so I'm sorry for that.

Martin
Yes, I am writing (completing) a viewer project that I started 6 months ago; the emphasis now is to optimize for larger .pgn files and to convert them to opening books. I still need to implement "text before move" (this is currently all stored after the previous move) and the "TN" tag, and also the treatment of the "%" symbol, although these are trivial compared to the code I needed for accurately parsing the variations and traversing the tree while validating the SAN notation and converting it into "e1g1" (e0g0 in the internal case) format. I have completed a memory mapping effort for some larger .pgn files and stored indexes; however, I am now having problems with '\r\n' characters in some non-Windows-created files, although this might be a simple syntactic error in my memory map code.
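As to the '\r\n' problem: the usual fix is simply to treat '\r' as whitespace wherever the scanner skips blanks, so that Windows, Unix and old Mac line endings all tokenize the same way. A minimal sketch only (the helper names are made up, not taken from your code):

Code: Select all

// Sketch only: treat '\r' as whitespace so that "\r\n", "\n" and bare "\r"
// line endings all tokenize the same way when scanning the mapped buffer.
inline bool is_pgn_space(char c)
{
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

// Advance a pointer past whitespace in a [p, end) buffer.
const char* skip_space(const char* p, const char* end)
{
    while (p != end && is_pgn_space(*p))
        ++p;
    return p;
}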
For a PGN viewer it may of course be reasonable to have such a concept. But at least for creating opening books you will most probably not need all the information that can be part of a PGN file; just the moves that were actually played in each game are necessary. If creating a book directly from the viewer application is supposed to be possible, then I can understand that you want to maintain the book creation code in only one place.

I still believe, though, that book creation itself can be done much more efficiently than through a viewer application that uses memory mapped files, for instance with an algorithm like the one Steven showed above for CookieCat.
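A minimal sketch of what I mean, with Move, Position, parse_game(), make_move() and zobrist_key() standing in for whatever your viewer already provides (so not real code, just the shape of the single pass):

Code: Select all

// Sketch only: single-pass book building.  Position, Move, parse_game(),
// make_move() and zobrist_key() are placeholders for the viewer's own code;
// the point is just the control flow.
#include <cstdint>
#include <istream>
#include <unordered_map>
#include <vector>

std::unordered_map<std::uint64_t, unsigned>
build_book(std::istream& pgn, unsigned max_plies = 30)
{
    std::unordered_map<std::uint64_t, unsigned> counts;
    std::vector<Move> moves;                    // mainline moves of one game
    while (parse_game(pgn, moves))              // read the PGN top to bottom, once
    {
        Position pos = start_position();
        unsigned ply = 0;
        for (const Move& m : moves)
        {
            if (++ply > max_plies)              // a book only needs the opening
                break;
            make_move(pos, m);
            ++counts[zobrist_key(pos)];         // count each reached position
        }
    }
    return counts;                              // write these counts out as the book
}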

Sven
jshriver
Posts: 1342
Joined: Wed Mar 08, 2006 9:41 pm
Location: Morgantown, WV, USA

Re: CookieCat's opening book implementation

Post by jshriver »

Curious what kind of hash you are using and how you are dealing with collisions? This has been my biggest problem recently when dealing with large volumes of games (a 110 GB PGN data set that I usually trim down using pgn-extract).
Dave_N
Posts: 153
Joined: Fri Sep 30, 2011 7:48 am

Re: CookieCat's opening book implementation

Post by Dave_N »

I have not actually finished implementing the hash; I was thinking about attempting perfect hashing for smaller books and less-than-perfect hashing for larger books.
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Rule of thumb

Post by sje »

The rule of thumb, backed up by much empirical evidence, is that the size of the hash space should be the square of the size of the domain space.

32 bit hash: 65,536 positions
48 bit hash: 16,777,216 positions
64 bit hash: 4,294,967,296 positions
128 bit hash: 1.84e+19 positions

I use a 64 bit hash for every domain except for big perft runs. The latter use a 128 bit hash.
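The arithmetic behind the table is the birthday bound: hashing n positions into a b bit space gives roughly n(n-1)/2^(b+1) expected colliding pairs, so keeping n near 2^(b/2) keeps that expectation around one half. A quick check:

Code: Select all

// Quick check of the birthday bound: expected colliding pairs is about
// n*(n-1) / 2^(b+1) for n random keys hashed into b bits.
#include <cmath>
#include <cstdio>

int main()
{
    const int bits[] = {32, 48, 64, 128};
    for (int b : bits)
    {
        double n = std::ldexp(1.0, b / 2);                    // n = 2^(b/2) positions
        double expected = n * (n - 1.0) / std::ldexp(1.0, b + 1);
        std::printf("%3d-bit hash, %.0f positions: ~%.2f expected collisions\n",
                    b, n, expected);
    }
    return 0;
}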
Dave_N
Posts: 153
Joined: Fri Sep 30, 2011 7:48 am

Re: Rule of thumb

Post by Dave_N »

I don't completely understand the numbers ...

for 64 bit hash the best table size is 2^32 ?
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Rule of thumb

Post by sje »

Dave_N wrote:for 64 bit hash the best table size is 2^32 ?
If your opening book has less than 2^32 distinct positions, then a 64 bit hash should be sufficient.
Dave_N
Posts: 153
Joined: Fri Sep 30, 2011 7:48 am

Re: Rule of thumb

Post by Dave_N »

Ok, thanks ... I am thinking about avoiding bugs by using std::unordered_map. Basically I am not sure if the structure is suitable. There is probably a constructor that will set the size, and I can easily make another table if I want to use 32 bit or 128 bit keys.

I read somewhere that the SGI implementation definitely used chaining, and I believe the std version is strongly based on the SGI one. The test is pretty simple (although code completion in VC10 is mostly useless). Here is some code:

Code: Select all


// I just typed these in because there is a large object at the top of my
// console testbed app.

#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>

using namespace std;

typedef uint64_t uint64;

// Michel Van den Bergh's FEN-to-key routine (external, not shown here);
// the declaration is assumed from the way it is called below.
uint64 hash_from_fen(char* fen);

string list_of_strings[] = { /* ... a list of 77 FEN strings ... */ };

class HashEntry
{
public:
    string data; // for testing purposes.
};

int main()
{
    std::unordered_map<uint64, HashEntry> hash_table;

    // First pass: hash every FEN and insert it into the table.
    for (int i = 0; i < 77; i++)
    {
        cout << list_of_strings[i] << endl;
        uint64 hashNum = hash_from_fen((char*)list_of_strings[i].c_str());
        HashEntry hashEntry;
        hashEntry.data = list_of_strings[i];
        hash_table.insert(std::make_pair(hashNum, hashEntry));
    }

    // Second pass: rehash the same FENs and look them up again.
    for (int i = 0; i < 77; i++)
    {
        uint64 hashNum = hash_from_fen((char*)list_of_strings[i].c_str());
        cout << hash_table[hashNum].data << "   HASH FOUND" << endl;
    }

    return 0;
}
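Regarding setting the size up front: std::unordered_map does take an initial bucket count in its constructor, and reserve() / max_load_factor() work as well. A minimal sketch (the element count and load factor here are arbitrary values for illustration):

Code: Select all

// Minimal sketch: pre-sizing a std::unordered_map so the first inserts do
// not trigger rehashing.  The expected element count and the load factor
// are arbitrary illustration values.
#include <cstdint>
#include <iostream>
#include <unordered_map>

int main()
{
    const std::size_t expected_positions = 1u << 20;   // ~1M book positions

    // Constructor form: pass an initial bucket count...
    std::unordered_map<std::uint64_t, unsigned> book(expected_positions);

    // ...or size an existing table: reserve() arranges the buckets so that
    // expected_positions elements fit without exceeding max_load_factor().
    book.max_load_factor(0.75f);
    book.reserve(expected_positions);

    std::cout << "buckets: " << book.bucket_count() << '\n';
    return 0;
}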

Dave_N
Posts: 153
Joined: Fri Sep 30, 2011 7:48 am

Re: Rule of thumb

Post by Dave_N »

After adding this code to the second loop...

Code: Select all

		if ( hash_table[hashNum].data == list_of_strings[i] )
		{
			cout << hash_table[hashNum].data << "   HASH FOUND" << endl;
			numFound++;
		}
with the int numFound = 0; before the loop, and this ...

Code: Select all

	cout<< "Num Found&#58; " << numFound << endl;
	//testing hashing 0

	cout<< "Testing Zero&#58; "<<hash_table&#91;0&#93;.data << endl;
I notice that it works, and the "Testing Zero" line prints nothing without crashing (operator[] simply default-constructs an entry for a missing key).

So the table works; however, now (a) cleaning up is an issue (again), (b) saving the hash table to disk is also a problem, and (c) I might need to generate my own Zobrist keys.
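For (c), generating my own keys should just be a matter of filling a table from std::mt19937_64. A minimal sketch (the table layout and the seed are arbitrary choices for illustration, not the Polyglot or CookieCat keys):

Code: Select all

// Minimal sketch: generate 64-bit Zobrist keys with std::mt19937_64.
// The layout (12 piece types x 64 squares, plus castling / en passant /
// side to move) and the fixed seed are arbitrary illustration choices.
#include <cstdint>
#include <random>

struct ZobristKeys
{
    std::uint64_t piece_on_square[12][64];
    std::uint64_t castling[16];
    std::uint64_t ep_file[8];
    std::uint64_t side_to_move;
};

ZobristKeys make_zobrist_keys(std::uint64_t seed = 0x5EEDC0DEULL)
{
    std::mt19937_64 rng(seed);          // fixed seed -> reproducible keys
    ZobristKeys z;
    for (auto& per_piece : z.piece_on_square)
        for (auto& key : per_piece)
            key = rng();
    for (auto& key : z.castling) key = rng();
    for (auto& key : z.ep_file)  key = rng();
    z.side_to_move = rng();
    return z;
}

The position key is then just the XOR of the keys for every piece on its square plus the castling, en passant and side-to-move terms, which is also what makes incremental update cheap.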