Can one of you please enlighten me as to why this "memory mapped file" issue is related to opening book creation? For the latter you open a PGN file and scan it once from top to bottom; there is no need to store the whole PGN anywhere, so I don't quite get the point. Why would you need any kind of indexing into the PGN file to create a book?
Sven
Hi Sven,
I was simply sharing my experience with memory mapping with David. I used to memory map the whole PGN into memory and then create a book from it. I certainly don't need any index for that, and I now think it's better not to use memory mapping at all. But I thought David might be writing a PGN viewer or something like that. Sure, it's a bit off topic, so I'm sorry for that.
Martin
Yes, I am writing (completing) a viewer project that I started six months ago; the emphasis now is on optimizing for larger .pgn files and on converting them to opening books. I still need to implement "text before move" (this is currently all stored after the previous move), the "TN" tag, and the treatment of the "%" escape symbol, although these are trivial compared to the code I needed for accurately parsing the variations and traversing the tree while validating the SAN notation and converting it into "e1g1" (e0g0 in the internal case) format. I have completed a memory mapping effort for some larger .pgn files and stored indexes; however, I am now having problems with '\r\n' versus '\n' line endings in some files not created on Windows, although this might be a simple syntactic error in my memory map code.
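For what it's worth, the usual way to sidestep the line-ending problem is to scan on '\n' alone and strip a trailing '\r' if one is present, so the same code handles both Windows and Unix files. A minimal sketch (the function name `split_lines` is made up; `buf`/`len` are assumed to come from your mmap/MapViewOfFile call):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split a raw (possibly memory-mapped) buffer into lines, accepting both
// "\n" and "\r\n" endings.
std::vector<std::string> split_lines(const char* buf, std::size_t len)
{
    std::vector<std::string> lines;
    std::size_t start = 0;
    for (std::size_t i = 0; i < len; ++i) {
        if (buf[i] == '\n') {
            std::size_t end = i;
            if (end > start && buf[end - 1] == '\r')
                --end;                      // drop the '\r' of a "\r\n" pair
            lines.emplace_back(buf + start, end - start);
            start = i + 1;
        }
    }
    if (start < len)                        // final line with no newline
        lines.emplace_back(buf + start, len - start);
    return lines;
}
```

The same trick works for a streaming reader: getline on '\n', then pop_back a trailing '\r'.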
For a PGN viewer it may of course be reasonable to have such a concept. But at least for creating opening books you will most probably not need all the information that can be part of a PGN file; only the moves that were actually played in a game are necessary. If it should be possible to create a book directly from the viewer application, then I can understand that you want to maintain the book creation code only once.
I still believe, though, that book creation itself can be done much more efficiently than through a viewer application that uses memory mapped files, for instance with an algorithm like the one Steven showed above for CookieCat.
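A one-pass, no-mmap build along those lines might look like the following sketch. The names `build_book` and `max_plies` are made up, and std::hash of the accumulated move path stands in for a real Zobrist position hash (a real builder would play each SAN move on a board and hash the resulting position); it only shows the streaming shape, and it ignores variations and comments.

```cpp
#include <cstdint>
#include <functional>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>

// One streaming pass over a PGN stream: no memory mapping, no index.
std::unordered_map<std::uint64_t, int>
build_book(std::istream& pgn, int max_plies = 20)
{
    std::unordered_map<std::uint64_t, int> counts;
    std::string line, path;
    int plies = 0;
    while (std::getline(pgn, line)) {
        if (line.empty())
            continue;
        if (line[0] == '[') {            // tag pair: a new game is starting
            path.clear();                // (fires once per tag, harmless)
            plies = 0;
            continue;
        }
        std::istringstream tokens(line);
        std::string tok;
        while (tokens >> tok) {
            if (tok.back() == '.' || tok == "1-0" || tok == "0-1" ||
                tok == "1/2-1/2" || tok == "*")
                continue;                // skip move numbers and results
            if (plies++ >= max_plies)
                continue;                // only book the opening plies
            path += tok;
            path += ' ';
            ++counts[std::hash<std::string>{}(path)];  // stand-in position key
        }
    }
    return counts;
}
```

Nothing here ever holds more than one line of the file in memory, which is the point: the working set is the book itself, not the PGN.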
I'm curious: what kind of hash are you using, and how are you dealing with collisions? This has been my biggest problem recently when dealing with large volumes of games (a 110 GB PGN data set that I usually trim down using pgn-extract).
I have not actually finished implementing the hash. I was thinking about attempting perfect hashing for smaller books, and less-than-perfect hashing for larger books.
OK, thanks ... I am thinking about how to avoid bugs with std::unordered_set. Basically I am not sure whether the structure is suitable. There is probably a constructor that will set the size, and I can easily make another table if I want to use 32-bit or 128-bit keys.
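There is indeed such a constructor: the unordered containers take a minimum bucket count, and reserve()/max_load_factor() control when they rehash. A minimal sketch (the helper name `make_key_set` is made up):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Pre-size the container so inserting a known number of keys never rehashes.
std::unordered_set<std::uint64_t> make_key_set(std::size_t expected)
{
    std::unordered_set<std::uint64_t> keys(expected); // minimum bucket count
    keys.max_load_factor(0.5f);   // shorter chains at the cost of memory
    keys.reserve(expected);       // room for 'expected' elements up front
    return keys;
}
```

Switching the key width is then just a matter of changing the template argument (std::uint32_t, or a pair of std::uint64_t for 128 bits with a custom hasher).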
I read somewhere that the SGI implementation definitely used chaining, and I believe the std version is strongly based on the SGI one. The implementation is pretty simple (although code completion in VC10 is mostly useless). Here is some code:
// I just typed this here because there is a large object at the top of my console testbed app.
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>

using namespace std;

typedef uint64_t uint64;

// Polyglot-style FEN hash by Michel Van den Bergh; defined elsewhere.
uint64 hash_from_fen(const char* fen);

string list_of_strings[] = {/*... a list of 77 FEN strings ... */};

class HashEntry
{
public:
    string data; // for testing purposes.
};

int main()
{
    unordered_map<uint64, HashEntry> hash_table;
    for (int i = 0; i < 77; i++)
    {
        cout << list_of_strings[i] << endl;
        uint64 hashNum = hash_from_fen(list_of_strings[i].c_str());
        HashEntry hashEntry;
        hashEntry.data = list_of_strings[i];
        hash_table.insert(make_pair(hashNum, hashEntry));
    }
    for (int i = 0; i < 77; i++)
    {
        uint64 hashNum = hash_from_fen(list_of_strings[i].c_str());
        cout << hash_table[hashNum].data << " HASH FOUND" << endl;
    }
    return 0;
}
I notice that it works, and the "Testing Zero" case will print nothing without crashing.
So the table works. However, now (a) cleaning up is an issue (again), (b) saving the hash table is also a problem, and (c) I might need to generate my own Zobrist keys.
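For (b) and (c), one common approach is to generate the Zobrist keys once from a seeded PRNG (so every build produces the same keys) and to save the table as fixed-width records sorted by key, Polyglot-style, so the book can later be searched with binary search. A sketch under those assumptions; the names `ZobristTable` and `save_book` are made up, and the uint32_t payload is a placeholder for whatever the book entry really stores:

```cpp
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <random>
#include <unordered_map>
#include <utility>
#include <vector>

// (c) One 64-bit random number per (piece, square) pair, plus a
// side-to-move term. A fixed seed makes the keys reproducible across runs,
// which matters if books written by one build must be readable by another.
struct ZobristTable {
    std::uint64_t piece_sq[12][64];
    std::uint64_t side_to_move;
    ZobristTable() {
        std::mt19937_64 rng(0x5eedULL);          // fixed seed, arbitrary value
        for (auto& row : piece_sq)
            for (auto& v : row)
                v = rng();
        side_to_move = rng();
    }
};

// (b) Dump (key, payload) records sorted by key so the saved book can be
// binary-searched without loading it all back into a hash table.
void save_book(const std::unordered_map<std::uint64_t, std::uint32_t>& book,
               const char* path)
{
    std::vector<std::pair<std::uint64_t, std::uint32_t>> entries(
        book.begin(), book.end());
    std::sort(entries.begin(), entries.end());
    std::ofstream out(path, std::ios::binary);
    for (const auto& e : entries) {
        out.write(reinterpret_cast<const char*>(&e.first), sizeof e.first);
        out.write(reinterpret_cast<const char*>(&e.second), sizeof e.second);
    }
}
```

Writing the two members separately (rather than the whole pair) keeps the record a fixed 12 bytes regardless of struct padding, so the file can be indexed as record i at offset 12*i.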