jshriver wrote: I checked, and right now my raw data streams from FICS over the past 3 years come to 61 GB. So I'm open to suggestions on what people would like to have grabbed from it.
I'd like to see it run through pgn-extract, with very short games removed, and split into a few large chunks. I'd prefer a split by Elo, but a split by opening would be good too. pgn-extract claims to do both.
Hi Wes,
My step-by-step questions:
How would you proceed with such a huge amount of data?
Chunk size: which is better, 250 MB or 500 MB?
Very short games: what minimum number of moves? 4, 7 or 10?
Type of games: standard, rapid and blitz together, or separate?
Split: by Elo or by ECO? Which is easier?
Hope this helps Joshua,
Best,
I don't have any strong feelings on any of these questions.
250 MB chunks might be better.
Minimum 7 moves, plus shorter games that end in mate, though a 2-move minimum gains nearly all the benefit. If there is a way to distinguish between resignation and disconnection, then I would also like to filter out games of up to 12 moves that end by disconnection.
I would like games split by time control.
Splitting by ECO is slightly easier, but pgn-extract supports both.
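For what it's worth, the length filter described above could be sketched in Python roughly like this. It is only a sketch under assumptions: the movetext of one game is already in a string, mate is marked with "#" as in standard SAN (the raw FICS data may not do this), and the resignation/disconnection distinction is not attempted. The names `full_moves`, `ends_in_mate` and `keep_game` are made up for illustration.

```python
import re

# Brace comments are stripped first, because move-time annotations
# inside them can contain digits followed by a dot.
COMMENT = re.compile(r"\{[^}]*\}")
# Move numbers such as "1." or "12..." in the movetext.
MOVE_NUMBER = re.compile(r"\b(\d+)\.")


def full_moves(movetext):
    """Return the highest move number in the movetext (0 if there is none)."""
    text = COMMENT.sub(" ", movetext)
    return max((int(n) for n in MOVE_NUMBER.findall(text)), default=0)


def ends_in_mate(movetext):
    """Assumption: a mating move carries the SAN '#' suffix."""
    return "#" in COMMENT.sub(" ", movetext)


def keep_game(movetext, min_moves=7):
    """Keep games of at least min_moves moves, plus shorter games ending in mate."""
    return full_moves(movetext) >= min_moves or ends_in_mate(movetext)
```

With these definitions, `keep_game("1. f3 e5 2. g4 Qh4# 0-1")` keeps the two-move mate, while `keep_game("1. e4 e5 2. Nf3 Nc6 1-0")` drops the two-move resignation.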
It should be. The logic is pretty simple; it basically:
* Looks to see how many games to put in each chunk.
* Reads the file one line at a time.
* If it comes across a score that is not in a PGN "Result" tag, increases the slice counter.
** If the counter hits the limit, closes the file and opens a new one with the next slice id.
* Repeats until end of file.
Not great, but it seems to work for me; I fed it a 2 GB PGN and it worked fine.
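A rough Python sketch of the steps listed above, for anyone who wants to try the same idea. This is not the actual script, just one way to write it: `split_pgn`, `ends_game` and the `chunk_{:04d}.pgn` file naming are assumptions, and it treats any non-tag line whose last token is a result (`1-0`, `0-1`, `1/2-1/2`, `*`) as the end of a game, so a comment that happens to end a line with one of those tokens would be miscounted.

```python
RESULTS = ("1-0", "0-1", "1/2-1/2", "*")


def ends_game(line):
    """True if the line ends with a result token and is not a PGN tag line."""
    stripped = line.strip()
    if not stripped or stripped.startswith("["):
        return False
    return stripped.split()[-1] in RESULTS


def split_pgn(src_path, games_per_chunk, out_pattern="chunk_{:04d}.pgn"):
    """Copy src_path into numbered chunk files of games_per_chunk games each."""
    chunk_id = 1
    games_in_chunk = 0
    paths = [out_pattern.format(chunk_id)]
    out = open(paths[-1], "w", encoding="utf-8")
    try:
        # One line at a time, so the size of the input does not matter.
        with open(src_path, encoding="utf-8", errors="replace") as src:
            for line in src:
                # Roll over only when the next game actually starts, so no
                # empty chunk is created after the last game.
                if games_in_chunk >= games_per_chunk and line.strip():
                    out.close()
                    chunk_id += 1
                    games_in_chunk = 0
                    paths.append(out_pattern.format(chunk_id))
                    out = open(paths[-1], "w", encoding="utf-8")
                out.write(line)
                if ends_game(line):
                    games_in_chunk += 1
    finally:
        out.close()
    return paths
```

Called as `split_pgn("fics.pgn", 100000)` (file name hypothetical), it returns the list of chunk files it wrote. Splitting by game count rather than by bytes keeps every game whole, at the cost of chunk sizes that vary a little.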
jshriver wrote: It should be. The logic is pretty simple; it basically:
* Looks to see how many games to put in each chunk.
* Reads the file one line at a time.
* If it comes across a score that is not in a PGN "Result" tag, increases the slice counter.
** If the counter hits the limit, closes the file and opens a new one with the next slice id.
* Repeats until end of file.
Not great, but it seems to work for me; I fed it a 2 GB PGN and it worked fine.
Excellent. Let's wait for the outcome. I predict some 24 million games.
Basically I can add time controls and game type now. (I'm just not sure where to put them.)
I added move times, and also added a comment to say why the game ended. This was a request, since it will say "X resigned", "X ran out of time", or whatever.