Xiangqi PGN parser ("C2=5 H8+7" to "h2e2 h9g7")

maksimKorzh
Posts: 775
Joined: Sat Sep 08, 2018 5:37 pm
Location: Ukraine
Full name: Maksim Korzh

Xiangqi PGN parser ("C2=5 H8+7" to "h2e2 h9g7")

Post by maksimKorzh »

Hey what's up guys, Code Monkey King's here.

I wanted to view master Xiangqi games in XBoard/WinBoard as well as in my Xiangqi engine.
But none of the above options was possible until now, so here's the work I did to achieve the goal:
1. Scraped 43878 games from wxf.ca https://github.com/maksimKorzh/wukong-x ... /main/xqdb
2. Decoded moves from "[DhtmlXQ_movelist]6947725279677062666523241727...[/DhtmlXQ_movelist]" to
- international format: 1. C2=5 H8+7 2. H2+3 R9=8 3. P7+1 P7+1
- traditional format: 1. 炮二平五 马8进7 2. 马二进三 车9平8 3. 兵七进一 卒7进1
using pieces of existing JS code - nothing really special so far

The problem with the above formats (the lesser evil) is that they can't be read by XBoard/WinBoard, so until now
the only way to view the games was on the World Xiangqi Federation website. And here's
where the rock-n-roll starts)

3. I wrote a JS script from scratch to convert international notation moves into UCI format. It successfully
converted and validated (using my Xiangqi engine) 40711 of 42228 games from international to UCI/UCCI/ICCS format.
The remaining 1517 games were only partially parsed and are not included in the final dataset. The issue seems to lie in
malformed moves obtained during the web scraping/decoding phase, so it isn't necessarily a bug in the parser.

4. Finally I've managed to open Xiangqi games in XBoard and in my engine (next I'll write a game viewer derived from my engine's web GUI)
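
The conversion described in step 3 can be sketched roughly as follows. This is not the JS parser from the repository, only a minimal Python illustration of the idea under stated assumptions: all names (`parse_fen`, `file_to_col`, `wxf_to_coord`) are made up for this example, ranks are numbered 0-9 from Red's side as in the thread title, only plain four-character moves are handled (no tandem pieces on one file), and no legality check is done.

```python
FILES = "abcdefghi"
# Standard Xiangqi start position (FEN letters: N = horse, B = elephant).
START_FEN = "rnbakabnr/9/1c5c1/p1p1p1p1p/9/9/P1P1P1P1P/1C5C1/9/RNBAKABNR"
WXF_TO_FEN = {"H": "N", "E": "B"}  # WXF letters that differ from FEN letters


def parse_fen(fen):
    """Return {(col, row): piece} with row 0 = Red's back rank."""
    board = {}
    for i, line in enumerate(fen.split("/")):
        col = 0
        for ch in line:
            if ch.isdigit():
                col += int(ch)
            else:
                board[(col, 9 - i)] = ch
                col += 1
    return board


def file_to_col(number, red):
    # Each side counts files 1-9 starting from its own right-hand side.
    return 9 - number if red else number - 1


def wxf_to_coord(board, move, red):
    """Convert one simple WXF move (e.g. 'C2=5') and play it on the board."""
    piece, src, op, dst = move[0], int(move[1]), move[2], int(move[3])
    letter = WXF_TO_FEN.get(piece, piece)
    letter = letter if red else letter.lower()
    col = file_to_col(src, red)
    sources = [sq for sq, p in board.items() if p == letter and sq[0] == col]
    if len(sources) != 1:  # tandem pieces are out of scope for this sketch
        raise ValueError(f"cannot resolve source square for {move!r}")
    c, r = sources[0]
    forward = 1 if red else -1
    if op == "=":              # sideways: the number is the target file
        target = (file_to_col(dst, red), r)
    elif piece in "RCPK":      # straight movers: the number is a distance
        target = (c, r + forward * (dst if op == "+" else -dst))
    else:                      # H, E, A: the number is the target file
        tc = file_to_col(dst, red)
        rows = {"H": 3 - abs(tc - c), "E": 2, "A": 1}[piece]
        target = (tc, r + forward * (rows if op == "+" else -rows))
    board[target] = board.pop((c, r))
    return f"{FILES[c]}{r}{FILES[target[0]]}{target[1]}"


board = parse_fen(START_FEN)
moves = "C2=5 H8+7 H2+3 R9=8 P7+1 P7+1".split()
coords = [wxf_to_coord(board, m, red=(i % 2 == 0)) for i, m in enumerate(moves)]
print(" ".join(coords))  # h2e2 h9g7 h0g2 i9h9 c3c4 g6g5
```

The source square is found by scanning the mover's file for the named piece, which is why a board has to be kept up to date move by move; the notation alone never states the rank.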

PGN parser: https://github.com/maksimKorzh/wukong-x ... pgn_parser
Xiangqi master games DB: https://github.com/maksimKorzh/wukong-x ... /main/xqdb
Video report on this project:
hgm
Posts: 28454
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Xiangqi PGN parser ("C2=5 H8+7" to "h2e2 h9g7")

Post by hgm »

I think WinBoard should be able to read the DhtmlXQ_movelist format. I never got to finish the parsing of the international / traditional formats. It can be very tricky to do that fully correctly, so that it also understands exceptional positions (like with 5 same-colored Pawns in one file). The current parser gets here when it reads a piece ID plus a single number (n=1) and no following letters:

Code:

	    } else if((**p == '+' || **p == '=') && n == 1 && piece && type[0] == NUMERIC) { // can be traditional Xiangqi notation
		separator = *(*p)++;
		n = 2;
		if((i = Number(p)) != BADNUMBER) coord[n] = i, type[n++] = NUMERIC;
	    } else if(n == 2) { // only one square mentioned, must be to-square
One of the problems is that I cannot treat the '-' case here, as it would already be intercepted earlier: it can also occur as a separator in some algebraic Chess formats (like R1-a8), where the single number is a rank disambiguator.
Fabian Fichter
Posts: 50
Joined: Mon Dec 12, 2016 2:14 pm

Re: Xiangqi PGN parser ("C2=5 H8+7" to "h2e2 h9g7")

Post by Fabian Fichter »

The Python and JS bindings of Fairy-Stockfish also support generating various move notation formats, including the international/WXF Xiangqi notation, so they can also be used as simple parsers by checking whether a given move string is in the generated list. Such a simple parser of course cannot handle input that does not strictly comply with the given notation, e.g., unnecessary disambiguation. However, to my knowledge the code handles all special cases of WXF notation correctly, including tandem pawns (see also the unit tests), so as long as the input is correct, it should in principle work well.
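
The "parse by lookup" idea above can be sketched generically. This is a minimal illustration, not Fairy-Stockfish code: `legal_moves` and `to_notation` are hypothetical stand-ins for whatever move generator and notation generator the bindings provide, and the tiny table below is hand-written sample data for the start position only.

```python
def build_notation_index(legal_moves, to_notation):
    """Map each legal move's notation string back to the move(s) producing it."""
    index = {}
    for move in legal_moves:
        index.setdefault(to_notation(move), []).append(move)
    return index


def parse_by_lookup(text, legal_moves, to_notation):
    """Return the unique legal move whose generated notation equals `text`."""
    matches = build_notation_index(legal_moves, to_notation).get(text, [])
    if len(matches) != 1:
        raise ValueError(f"{text!r} matches {len(matches)} legal moves")
    return matches[0]


# Hypothetical stand-in data: three Red moves from the start position
# and the WXF strings a notation generator would be expected to emit.
SAMPLE_NOTATION = {"h2e2": "C2=5", "h2h4": "C2+2", "h0g2": "H2+3"}
sample_moves = list(SAMPLE_NOTATION)

print(parse_by_lookup("C2=5", sample_moves, SAMPLE_NOTATION.get))  # h2e2
```

Because the comparison is an exact string match, any input that deviates from the generator's canonical spelling is rejected, which is the limitation mentioned above.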

Re: Xiangqi PGN parser ("C2=5 H8+7" to "h2e2 h9g7")

Post by maksimKorzh »

Fabian Fichter wrote: Sun Feb 07, 2021 10:47 am The python and JS bindings of Fairy-Stockfish also support the generation of various move notation formats, including the international/WXF Xiangqi notation, so they can also be used as simple parsers by checking if a given move string is in the generated list. [...]
Thanks Fabian, if only you had told me this a couple of days ago)

Re: Xiangqi PGN parser ("C2=5 H8+7" to "h2e2 h9g7")

Post by maksimKorzh »

hgm wrote: Sat Feb 06, 2021 9:57 pm I think WinBoard should be able to read the DhtmlXQ_movelist format. I never got to finish the parsing of the international / traditional formats. [...]
re: I think WinBoard should be able to read the DhtmlXQ_movelist format
- could you please provide an example input file to read with WinBoard?
- How does the DhtmlXQ encoding work? Who created it?

re: It can be very tricky to do that fully correctly, so that it also understands exceptional positions (like with 5 same-colored Pawns in one file)
- I intentionally didn't handle 5 same-colored pawns because the 40K+ games didn't contain a single instance of that case.
The parser was never the goal in itself; I needed it just to convert my Xiangqi games DB into UCI format. Now that's done and it's good enough.
I'm publishing it just for those enthusiasts who might want to play around with the code. If same-colored pawns become crucial or someone opens
an issue about it on GitHub, then I'll add that feature. With my current implementation it's simply a matter of finding the proper sourceSquare for the pawn; all
the rest would get handled automatically.

re: your code
- I think the problem is that you're trying to embed it into rock-solid WinBoard; I had the much humbler goal of simply converting 40K particular games.
I "gave up forever" on this project several times because I felt totally lost every time I realized that the new "arch" to handle all the cases simply sucks.
But eventually, after rewriting the code from scratch several times, I have what I have. So from now on I have new fantastic avenues for further
development:

1. Build opening books for my bots based on the existing DB (my book is a JS array containing lines to play that are chosen randomly; plain structure, no tree-like data) https://github.com/maksimKorzh/wukong-x ... me/bots.js (btw, there's an "hgm" bot available, hope you don't mind; let me know if you do mind though and I'll remove it)

2. Extracting "mate in N" positions in FEN format to feed my puzzle solver: https://maksimkorzh.github.io/wukong-xi ... olver.html

3. Learning from GM games as a human (implemented today): https://maksimkorzh.github.io/wukong-xi ... iewer.html

I wish I could write rock solid code like you do... I really wish I could...

Re: Xiangqi PGN parser ("C2=5 H8+7" to "h2e2 h9g7")

Post by hgm »

maksimKorzh wrote: Sun Feb 07, 2021 4:48 pm re: I think WinBoard should be able to read the DhtmlXQ_movelist format
- could you please provide an example input file to read with WinBoard?
I don't have such a file at hand; IIRC I mostly used it for copy-pasting games from web pages into WinBoard.
- How does the DhtmlXQ encoding work? Who created it?
This is long ago, so I hope I still remember it correctly. I think I encountered this format on (forum?) web pages that used a particular kind of JavaScript viewer (the Xiangqi equivalent of pgn4web). A move is encoded in coordinate notation, like UCI, but it uses digits for the file as well, 0-8 instead of a-i. There is no separator between the moves (not even a space), so basically the entire game is one big number of hundreds of digits.

This is of course ambiguous with normal PGN parsing, where numbers are move numbers (which WinBoard then ignores, except for move number 1.). So it has to be aware whether it is reading this 'xqUBB format' or something else, which is achieved by recognizing the [DhtmlXq_movelist] tag that precedes the number. I think that what I did was to ask for the page source of the page containing the viewer, and then copy-paste the info between and including the tags into WinBoard.
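
The encoding described above can be decoded with a few lines. This is a minimal sketch under assumptions, not WinBoard's code: the function name is made up, and the rank digit is assumed to count from Black's side of the board (so rank = 9 - y), which is consistent with the first group 6947 being a move from Red's back rank. The sample input is only the visible prefix of the move list quoted in the first post.

```python
import re

FILES = "abcdefghi"
# The tag is matched case-insensitively, since it is spelled with
# both "XQ" and "Xq" in this thread.
TAG = re.compile(r"\[DhtmlXQ_movelist\](\d*)\[/DhtmlXQ_movelist\]", re.IGNORECASE)


def decode_movelist(text):
    """Turn the digit string between the tags into coordinate moves."""
    match = TAG.search(text)
    if match is None:
        raise ValueError("no [DhtmlXQ_movelist] tag found")
    digits = match.group(1)
    if len(digits) % 4:
        raise ValueError("move list length is not a multiple of 4")
    moves = []
    for i in range(0, len(digits), 4):
        x1, y1, x2, y2 = (int(d) for d in digits[i:i + 4])
        if x1 > 8 or x2 > 8:
            raise ValueError(f"file digit out of range in {digits[i:i + 4]!r}")
        # Assumption: y counts from Black's side, so flip it to get ranks 0-9.
        moves.append(f"{FILES[x1]}{9 - y1}{FILES[x2]}{9 - y2}")
    return moves


sample = "[DhtmlXQ_movelist]6947725279677062666523241727[/DhtmlXQ_movelist]"
print(" ".join(decode_movelist(sample)))
# g0e2 h7f7 h0g2 h9g7 g3g4 c6c5 b2c2
```

Requiring the surrounding tag mirrors the point made above: without it, a long run of digits is indistinguishable from a move number.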
re: your code
- I think the problem is that you're trying to embed it into rock-solid WinBoard; I had the much humbler goal of simply converting 40K particular games.
Indeed, the problem is that the WinBoard parser currently decides (except for this xqUBB case) on a per-move basis which format the move is in. So I get into trouble when two formats whose components mean completely different things can look the same. The '-' that indicates 'move backwards' in the international format is used in some other formats as a separator between piece ID and coordinates (TSA format), or as a separator between from- and to-square (ICS format). A look-ahead could probably solve that, but not one based on the next symbol alone (which must be a digit for XQ, but the first character of a square coordinate is also a digit in all Shogi formats). There is no preceding tag that I could use to set a flag for the rest of the game to resolve the ambiguity, as in xqUBB format, because the tags look like normal PGN tags. I could of course make it dependent on the variant being Xiangqi; I don't believe that any of the formats that use '-' as a separator between squares are ever used for Xiangqi.
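
The variant-dependent reading of '-' suggested in the last sentence could look roughly like this. It is only a sketch of the idea in Python, not WinBoard's parser: the function name, the variant strings and the returned tuples are all invented for illustration, and only plain four-character WXF moves are recognized.

```python
import re

# Simple WXF move: piece letter, file 1-9, operator, number 1-9.
WXF_MOVE = re.compile(r"([RHEACPK])([1-9])([+\-=])([1-9])")


def interpret(token, variant):
    """Classify one move token; `variant` decides how '-' is read."""
    wxf = WXF_MOVE.fullmatch(token)
    if variant == "xiangqi" and wxf:
        piece, file_, op, number = wxf.groups()
        return ("wxf", piece, int(file_), op, int(number))
    if "-" in token:
        # Outside Xiangqi, '-' is treated as a plain separator.
        left, right = token.split("-", 1)
        return ("separator", left, right)
    return ("other", token)


print(interpret("C2-1", "xiangqi"))  # WXF: cannon on file 2 retreats 1
print(interpret("C2-1", "chess"))    # same text, '-' read as a separator
print(interpret("R1-a8", "chess"))   # rank disambiguator, then to-square
```

The same token yields two different readings depending only on the variant, so no per-move look-ahead is needed.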