Tool for cleaning EPD file

MikeGL
Posts: 1010
Joined: Thu Sep 01, 2011 2:49 pm

Post by MikeGL »

Was searching the web for new EPD files to test random UCI engines, and encountered dirty EPD files.
Are there tools to clean these up?

P.S. Strange, but I figured out the solution after posting with CODE tags here in the forum. It looks like a line-ending issue: the file uses the LF-only line endings of the Linux/Unix world rather than the CR/LF pairs Windows expects.
I told my wife that a husband is like a fine wine; he gets better with age. The next day, she locked me in the cellar.
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: Tool for cleaning EPD file

Post by chrisw »

MikeGL wrote: Tue Oct 08, 2019 1:57 pm Was searching the web for new EPD files to test random UCI engines, and encountered dirty EPD files.
Are there tools to clean these up?
Well, I don't know if there are tools that clean up EPDs down to every last possible failure, but I have code that has been broken in on about 2,000,000,000 EPDs, so it quite likely detects whatever can go wrong.

Presumably what you want is something that integrity-checks the FEN and colour information: does the FEN contain bad characters or is it in a bad format, is the position legal and possible, is there at least one legal move playable from it, and is the given bm in the legal move list? Plus the castling and other statuses check out, and the move counts, if there are any, check out. Btw, a bm can come in many weird and wonderful formats, all of which require detection and integrity testing.
If all that is good, it spits out the FEN plus status plus bm plus all the crud after the first ;
and saves it all to clean-filename.epd, and prints or saves a list of bad EPDs.

Is that it?
Full-on legality tests are quite time-consuming, btw, so it takes a while to check everything.
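A minimal sketch of the cheap, structural part of the checks listed above, written in Python (an assumption; the thread does not say what chrisw's code is written in). The names `check_board` and `check_epd` are hypothetical. It only looks at the four position fields (board, side to move, castling, en passant) and does not generate moves, so it cannot tell whether a position is reachable or whether the bm is legal; that slower part would need a real move generator. Chess960-style castling fields are not handled.

```python
import re

# Standard castling field ("-" or a subset of KQkq in that order) and
# en passant field ("-" or a square on rank 3 or 6).
CASTLING_RE = re.compile(r"^(-|K?Q?k?q?)$")
EP_RE = re.compile(r"^(-|[a-h][36])$")


def check_board(board):
    """Return an error string for a malformed piece-placement field, else None."""
    ranks = board.split("/")
    if len(ranks) != 8:
        return f"expected 8 ranks, found {len(ranks)}"
    for rank in ranks:
        squares = 0
        for ch in rank:
            if ch in "12345678":
                squares += int(ch)      # run of empty squares
            elif ch in "pnbrqkPNBRQK":
                squares += 1            # one piece
            else:
                return f"bad character {ch!r}"
        if squares != 8:
            return f"rank {rank!r} covers {squares} squares"
    if board.count("K") != 1 or board.count("k") != 1:
        return "each side needs exactly one king"
    return None


def check_epd(line):
    """Structural check of one EPD record; returns an error string or None."""
    fields = line.strip().split(None, 4)
    if len(fields) < 4:
        return "fewer than 4 position fields"
    board, side, castling, ep = fields[:4]
    error = check_board(board)
    if error:
        return error
    if side not in ("w", "b"):
        return f"bad side to move {side!r}"
    if not CASTLING_RE.match(castling):
        return f"bad castling field {castling!r}"
    if not EP_RE.match(ep):
        return f"bad en passant field {ep!r}"
    return None
```

Calling `check_epd` on each line and collecting the lines that return an error gives a first-pass bad-EPD list before any expensive legality testing is attempted.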

Code: Select all

n1p/4pb2/8/2N1BB2/1rPR1QPP/R5K1 w - - bm Nd5; id "STS(v7.0) Simplification.029"; c0 "Nd5=10, Na4=8, Nd1=8, Rxa6=8";
2r3k1/p1q1bp1p/2p1p1p1/4P3/1p6/1P1QP1P1/PB3P1P/3R2K1 b - - bm Rd8; id "STS(v7.0) Simplification.030"; c0 "Rd8=10, Qa5=3, a5=1, c5=1";
r2q1r1k/ppp3bp/3pb3/6p1/2Pn4/1PN3BP/P2QBPP1/3RR1K1 w - - bm Bg4; id "STS(v7.0) Simplification.086"; c0 "Bg4=10, Bh5=6, Nb5=7, Nd5=6";
2r1r1k1/1b1q1pp1/4p2p/1Nnp3n/1N6/PP2PP1P/1BB2QP1/3R2K1 w - - bm a4; id "STS(v9.0) Advancement of a/b/c pawns.023"; c0 "a4=10, Na7=1";
5r2/5r1k/p2pN1p1/1p1Pn2p/2q1P2b/P4pR1/1PPQ1B1P/1K2R3 w - - bm b3; id "STS(v9.0) Advancement of a/b/c pawns.056"; c0 "b3=10, Nxf8+=2";
6k1/1b2bp1p/p3p3/1p2P1p1/2r5/2N1BP2/PP4PP/3R2K1 b - - bm b4; id "STS(v9.0) Advancement of a/b/c pawns.060"; c0 "b4=10, Bc6=4, Kf8=4";
6k1/5r2/p4pp1/Br1pp3/2b5/6NP/1P3RP1/2R4K w - - bm b4; id "STS(v9.0) Advancement of a/b/c pawns.065"; c0 "b4=10, Bc3=3, Bd8=5, Be1=6";
r4rk1/pb1qbppp/4pn2/1p1p2B1/2pP4/2P1P2Q/PPBN1PPP/R3K2R b KQ - am h6; hmvc 0; fmvc 13; c0 "?"; id "SwissTest2_Nr.19 - Zapletal-Hecko";
rn2r1k1/p1pq2pp/bp2p3/5p2/2PPQ3/B1PB4/P4PPP/R3K2R w KQ f6 am Qxa8; hmvc 0; fmvc 14; c0 "?"; id "SwissTest2_Nr.24 - Portisch-Fischer";
3q2k1/1p2brpp/1nn1b3/r4p2/p4R2/P1NP2P1/1B1NP1BP/2RQ2K1 w - - am Kh1; hmvc 0; fmvc 20; c0 "?"; id "SwissTest2_Nr.34 - Bacrot-Topalov";
1r1q1rk1/pbpnbppp/1p2p3/6N1/3P3P/2B3P1/PPQ1PPB1/R3K2R b KQ - am Nf6; hmvc 0; fmvc 12; c0 "?"; id "SwissTest2_Nr.45 - Dreev-Georgiev";
5k1q/5p2/5p2/4p3/3pB1QP/6P1/4PP2/b5K1 w - - bm Qc8+; id "Crafty Test Pos.10"; c0 "GK/DB Philadelphia 1996, Game 2, move 35W (Qc8+)";
8/4kp2/3q4/4pp2/2Bp3P/2b3P1/Q3PP2/6K1 w - - bm Bxf7; id "Crafty Test Pos.12"; c0 "GK/DB Philadelphia 1996, Game 2, move 49W (Bxf7)";
1Q6/4k3/5q2/1B3p2/3pp2P/6P1/3bPP2/6K1 w - - bm Qc7+; id "Crafty Test Pos.13"; c0 "GK/DB Philadelphia 1996, Game 2, move 56W (Qc7+)";
8/p4bpk/7p/3rq3/3N1P2/PPQR1P2/6KP/4q3 w - - bm fxe5; id "Crafty Test Pos.32"; c0 "DB/GK Philadelphia 1996, Game 5, move 35W (fxe5)";
2r1rnk1/5p1p/b2q2p1/p1p3Q1/4R1N1/5BP1/PP3P1P/2R3K1 b - - am Rxe4; c0 "Aljechin u.B.-Bogoljubow u.B. Budapest 1921"; id "AM_MG 0029";
2r2r2/3qbpkp/p3n1p1/2ppP3/6Q1/1P1B3R/PBP3PP/5R1K w - - bm Rxh7+; c0 "level: med-9"; c1 "Boros - Szabo, Budapest 1937"; id "MG 3003";
2r2r2/4qpkp/3p1np1/pb1Pp2P/1p2P1P1/5PN1/PP1QN3/1KR4R w - - bm Nf5+; c0 "level: med-13"; c1 "Keller-Stehlik Wien 1952"; id "MG 3590";
2r2rk1/1bqnbpp1/1p1ppn1p/pP6/N1P1P3/P2B1N1P/1B2QPP1/R2R2K1 b - - bm Bxe4; c0 "level: med-8"; c1 "Najdorf - Reshevsky"; id "MG 2802";
3r2k1/rp1b2p1/2pR1bP1/p1P2p2/2q5/P3P1P1/5PB1/3Q2KR w - - bm Bd5+; c0 "level: med-9"; c1 "Keworkow-Tarasow UdSSR 1950"; id "MG 3006";
3rr1k1/1b1q1p1p/p2b1npB/2pP4/1p2n3/4N1P1/PPQ1NPBP/R2R2K1 b - - bm Nxf2; c0 "level: med-14"; c1 "Hoogovens-Zviaginsev"; id "MG 3669";
4r1k1/pp3ppp/5n2/7q/3b2b1/2NB4/PPQB1PPP/5R1K b - - bm Bf3; c0 "level: hard-16"; c1 "Machate-Spielmann Magdeburg 1927"; id "MG 4348";
5k2/5np1/1p1pQ3/p5PP/3q1B2/P4PK1/8/8 w - - bm Be3; c0 "level: med-11"; c1 "Shirov A.,2500 - Prie E.,2435, Torcy 1990"; id "MG 3200";
5k2/5p2/8/2K5/5pQ1/r7/6P1/8 w - - bm Qh4; c0 "level: hard-16"; c1 "Akopian V.,2535 - Dukic Z.,2420, Niska Banja 1991"; id "EG 2171";
6k1/5pp1/1p1p4/4pP2/1pP1Pn2/1P4qP/rP3RB1/1Q3K2 b - - bm Qd3+; c0 "level: easy-1"; c1 "Var. Czerniak-Lundin Wien 1951"; id "MG 1030";
8/8/4p2k/3bP3/6R1/5pK1/8/8 w - - bm Kh4; c0 "level: med-10"; c1 "Certic B.,2435 - Brajovic R.,2340, Jugoslavija 1995"; id "EG 1530";
r1b1k2r/pp3pp1/2p4p/8/3q4/3B2P1/PPQ2PP1/2KR3R w kq - bm Rhe1+; c0 "level: easy-1"; c1 "Var. Geist-Almirall 1932 corr"; id "MG 1043";
r1b1r1k1/p4p1p/2p2p2/2qP4/8/5P2/P3B1PP/2RQK2R b K - bm Rxe2+; c0 "level: med-8"; c1 "Teller-Tartakower Hastings 1926"; id "MG 2874";
r1bq1rk1/pp3pp1/2pp3p/n2Bp1P1/4Q3/2P5/PPP2PP1/R1B1K2R w KQ - bm Rxh6; c0 "level: med-4"; c1 "Nyholm-Post Berlin 1927"; id "MG 2299";
r1bq2k1/p5bp/3p1nnr/P1pPp3/2P1Pp2/1NN2R1P/4B1P1/1R1QB1K1 b - - bm Nh8; c0 "level: hard-16"; c1 "Korchnoi vs. Fischer"; id "MG 4426";
r2q1rk1/pp2bppp/4pn2/3nN1B1/3P4/1B5Q/PP3PPP/3R1RK1 w - - bm f4; c0 "level: hard-15"; c1 "P6 - open file and diago

I can code a specific cleaner myself, but it would take me a month or two because of time constraints, and I'm just a n00b programmer.
I am sure there are similar EPD utilities around.

I will upload this 5.43 MB file here once this EPD suite is cleaned up. I can't access the GitHub link for this EPD right now.


MikeGL

Re: Tool for cleaning EPD file

Post by MikeGL »

Thanks chris.
I thought it was dirty because Windows' notepad.exe didn't break the lines, which end in a bare 0x0A (LF) line feed.
I've already fixed it. I opened the file in WordPad and used 'Save As' to save it as
"Text Document -MS DOS format", which writes the file with the correct CR/LF (0x0D 0x0A) pairs. It is now
displayed correctly by Windows and parsed correctly by the Arena GUI. This EPD file has 70,000+ positions and a total size of 5.5 MB,
but there seems to be no easy way to upload it here in the forum without going to a third-party file hosting site.
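The same conversion can be scripted instead of going through WordPad. A minimal sketch in Python, assuming the whole file fits in memory (reasonable for a file of about 5.5 MB); the function names `to_crlf` and `convert_file` are hypothetical:

```python
def to_crlf(data: bytes) -> bytes:
    """Normalise CRLF, bare CR and bare LF line endings to CRLF."""
    # First collapse everything to LF so existing CRLF pairs are not doubled.
    unix = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unix.replace(b"\n", b"\r\n")


def convert_file(src: str, dst: str) -> None:
    """Read src in binary mode and write a CRLF-terminated copy to dst."""
    with open(src, "rb") as infile:
        data = infile.read()
    with open(dst, "wb") as outfile:
        outfile.write(to_crlf(data))
```

Working in binary mode keeps Python from translating newlines on its own, so the output bytes are exactly what `to_crlf` produced regardless of the operating system.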

Positions are from STS, Nolot, Crafty test, yacpd, etc. They just duplicate positions that are already found in separate EPD files on the web.
chrisw

Re: Tool for cleaning EPD file

Post by chrisw »

MikeGL wrote: Tue Oct 08, 2019 3:10 pm Thanks chris.
I thought it was dirty because Windows' notepad.exe didn't break the lines, which end in a bare 0x0A (LF).
I've already fixed it. I opened the file in WordPad and used 'Save As' to save it as "Text Document -MS DOS format", which writes the file with the correct CR/LF pairs. It is now displayed correctly by Windows and parsed correctly by the Arena GUI. It has 70,000+ positions at 5.5 MB, but there seems to be no easy way to upload it here in the forum without going to a third-party file hosting site. Positions are from STS, Nolot, Crafty test, yacpd, etc. They just duplicate positions that are already found in separate EPD files on the web.
How much cleaning you need really depends on what you are trying to do with the epds.
If you're running long (timewise) programs that read them in and process them for whatever purpose, then a broken EPD that is only found after one hour will set you back an hour, or however long it took to find. So I'm a believer in massive integrity checking of big data before using it.
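A rough sketch of that fail-early idea in Python: partition the records into a clean list and a bad list, with line numbers, before any long job starts. The names `split_epd` and `has_position_fields` are hypothetical, and the validity test here is only a placeholder that looks at the field layout; a real checker would plug in much stricter tests.

```python
def split_epd(lines, is_valid):
    """Partition EPD lines into (clean, bad); bad entries keep their line number."""
    clean, bad = [], []
    for number, line in enumerate(lines, start=1):
        record = line.strip()
        if not record:
            continue  # blank lines are skipped, not reported
        if is_valid(record):
            clean.append(record)
        else:
            bad.append((number, record))
    return clean, bad


def has_position_fields(record):
    """Placeholder check: 8 ranks in the board field plus a valid side to move."""
    fields = record.split()
    return (
        len(fields) >= 4
        and fields[0].count("/") == 7
        and fields[1] in ("w", "b")
    )
```

The clean list can then be written to a file such as clean-filename.epd and the bad list printed or saved, so the long run only ever sees records that passed.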

If you want, I could probably generate a standalone massive-checker utility and ask Ed to include it on his page.