Xiangqi text game parser/reader?

phhnguyen
Full name: Nguyen Hong Pham

Post by phhnguyen »

I have been working with Xiangqi (Chinese chess) databases and need to parse games in PGN. The problem is that many Xiangqi PGNs I found on the Internet don't strictly follow PGN notation: they use Chinese characters and the traditional move notation instead of Latin/SAN, for example:

Code:

[Game "Chinese Chess"]
[Event "2000ƒÍ»´π˙œÛ∆Â∏ˆ»ÀΩı±Í»¸"]
[Round "?"]
[Date "2000.11.??"]
[Site "?"]
[RedTeam "ª˙µÁÖ≠"]
[Red "≥¬∏ªΩ‹"]
[BlackTeam "±±æ©"]
[Black "π®œ˛√Ò"]
[Result "1-0"]
  1. ±¯∆flΩ¯“ª  ◊‰£∑Ω¯£±    2. ¬Ì∞ÀΩ¯∆fl  ¬Ì£∏Ω¯£∑
  3. œ‡∆flΩ¯ŒÂ  ≈⁄£≤∆Ω£∂    4. ≥µæ≈∆Ω∞À  ¬Ì£≤Ω¯£≥
  5. ¬Ì∂˛Ω¯“ª  œÛ£∑Ω¯£µ    6. ≈⁄∂˛∆ΩÀƒ  ≥µ£π∆Ω£∏
  7. ≥µ“ª∆Ω∂˛  ≈⁄£∏Ω¯£¥    8. ±¯“ªΩ¯“ª  ≥µ£±∆Ω£≤
  9.  ø¡˘Ω¯ŒÂ  ≥µ£≤Ω¯£∂   10. ≥µ∂˛Ω¯»˝  ≥µ£∏Ω¯£∂
 11. ≈⁄ÀƒΩ¯“ª  ≥µ£≤Ω¯£±   12. ≥µ∞ÀΩ¯∂˛  ≥µ£∏Ω¯£±
 13. ±¯ŒÂΩ¯“ª  ¬Ì£∑Ω¯£∏   14. ≈⁄Àƒ∆Ω¡˘  ≈⁄£∂∆Ω£∑
 15. ≈⁄¡˘ÕÀ“ª  ≥µ£∏Ω¯£±   16. ≥µ∞ÀΩ¯“ª  ≥µ£∏∆Ω£∂
 17. ≥µ∞À∆ΩŒÂ  ≥µ£∂ÕÀ£¥   18. ¬Ì“ªΩ¯∂˛  ◊‰£∑Ω¯£±
 19. ±¯»˝Ω¯“ª  ◊‰£≥Ω¯£±   20. ±¯ŒÂΩ¯“ª  ≥µ£∂∆Ω£µ
 21. ≥µŒÂΩ¯∂˛  ◊‰£µΩ¯£±   22. ±¯∆flΩ¯“ª  ◊‰£µΩ¯£±
 23. ±¯∆flΩ¯“ª  ¬Ì£≥Ω¯£µ   24. ±¯∆fl∆Ω¡˘  ¬Ì£µΩ¯£∂
 25. ≈⁄¡˘Ω¯»˝  ≈⁄£∑∆Ω£∏   26. ¬Ì∂˛ÕÀ“ª   ø£∂Ω¯£µ
 27. ≈⁄¡˘∆ΩŒÂ  Ω´£µ∆Ω£∂   28. ±¯“ªΩ¯“ª  ◊‰£πΩ¯£±
 29. ≈⁄ŒÂ∆Ω“ª  ≈⁄£∏Ω¯£±   30. ≈⁄“ªΩ¯“ª
1-0

I have trouble reading them and converting them into Latin notation, since I am not good with code pages/Unicode and I don't understand Chinese. Does anyone have, or know of, code or libraries that can read and parse them (onto a chessboard, or into standard PGN)? I prefer C++, but I could probably follow code in any programming language.

Thanks
https://banksiagui.com
The most features chess GUI, based on opensource Banksia - the chess tournament manager
hgm
Full name: H G Muller

Re: Xiangqi text game parser/reader?

Post by hgm »

Unfortunately this is one of the formats WinBoard/XBoard does not yet understand. It does understand Shogi kifu format, in both Shift-JIS and Unicode encoding. My original goal for the move parser was that it would parse any move string independent of context, recognizing both the move and the format. But the traditional Xiangqi notation in Latin script was in some cases ambiguous with other formats it recognized, so I never got around to implementing it.

Of course an encoding in Chinese would be a giveaway for which format it is, as it is for kifu. There are three possible Chinese encodings, though: GB2312 (simplified), Big5 (traditional) and Unicode. Simplified and traditional forms have different Unicode code points wherever the glyphs differ, so there really are four sets of codes to recognize. The number of characters used in move notation is so small, however, that I don't expect any ambiguity here.

So once you have a parser that handles the Latinized version of the traditional move notation, you could have the GetNextCharacter routine of the lexical scanner intercept any character with the upper bit set, match it and its successor byte(s) against a list of all relevant characters in every encoding, and return the corresponding Latin character (e.g. R for each of the encodings of the chariot character).
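To make the byte-matching idea concrete, here is a minimal sketch in Python rather than in the WinBoard/XBoard scanner itself. The Latin letters and the "+", "-", "=" symbols are an assumed WXF-style convention, and the names SYMBOLS, build_byte_table and latinize are made up for this illustration. The table is generated by encoding each character in GB2312, Big5 and UTF-8, so no codes have to be typed in by hand. The front/middle/rear markers are left out.

```python
# Assumed Latin letters and symbols (WXF-style); not taken from WinBoard/XBoard.
SYMBOLS = {
    "车": "R", "車": "R",
    "马": "H", "馬": "H",
    "炮": "C",
    "相": "E", "象": "E",
    "士": "A", "仕": "A",
    "将": "K", "將": "K", "帅": "K", "帥": "K",
    "兵": "P", "卒": "P",
    "进": "+", "進": "+", "退": "-", "平": "=",
}
# Digits occur both as Chinese numerals and as full-width digits (U+FF11..U+FF19).
for value, numeral in enumerate("一二三四五六七八九", start=1):
    SYMBOLS[numeral] = str(value)
    SYMBOLS[chr(0xFF10 + value)] = str(value)

ENCODINGS = ("gb2312", "big5", "utf-8")


def build_byte_table(symbols=SYMBOLS, encodings=ENCODINGS):
    """Map the byte sequence of every symbol in every encoding to its Latin form."""
    table = {}
    for char, latin in symbols.items():
        for encoding in encodings:
            try:
                seq = char.encode(encoding)
            except UnicodeEncodeError:
                continue  # the character does not exist in this encoding
            table.setdefault(seq, latin)  # keep the first entry if two ever clash
    return table


def latinize(raw, table):
    """Pass ASCII bytes through; replace known multi-byte sequences by Latin text."""
    longest = max(len(seq) for seq in table)
    out, i = [], 0
    while i < len(raw):
        if raw[i] < 0x80:  # upper bit clear: plain ASCII
            out.append(chr(raw[i]))
            i += 1
            continue
        for width in range(longest, 1, -1):  # try the longest sequence first
            latin = table.get(raw[i:i + width])
            if latin is not None:
                out.append(latin)
                i += width
                break
        else:
            out.append("?")  # high-bit byte that matches nothing in the table
            i += 1
    return "".join(out)


TABLE = build_byte_table()
print(latinize("车九平八".encode("gb2312"), TABLE))
print(latinize("車九平八".encode("big5"), TABLE))
```

Both calls should print R9=8. Bytes that match nothing (player names in the tags, for instance) come out as "?" one byte at a time, so a real scanner would want to know the encoding in order to skip unknown characters by their proper width.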

That would solve the encoding problem. The traditional format is a bit cumbersome to parse, however, especially the disambiguation rules for the rare positions where 4 or 5 Pawns stand on the same file.
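As a rough illustration of the easy part, the sketch below parses only the plain four-character form of the Latinized notation (piece, file, direction, number), such as C2=5 or H8+7. The letters, the "+", "-", "=" symbols and all names in the code are assumptions made for this example. The disambiguated forms for several like pieces on one file are deliberately not handled, and turning the result into board squares still requires the position.

```python
import re
from typing import NamedTuple


class TraditionalMove(NamedTuple):
    piece: str   # K, A, E, H, R, C or P (assumed letters)
    file: int    # file the piece stands on, counted from the mover's right
    action: str  # "+" forward, "-" backward, "=" sideways
    value: int   # destination file or number of ranks, see value_is_file()


MOVE_RE = re.compile(r"([KAEHRCP])([1-9])([-+=])([1-9])")


def parse_move(text):
    """Split a plain four-character move into its fields."""
    match = MOVE_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a plain four-character move: {text!r}")
    piece, file_no, action, value = match.groups()
    if action == "=" and piece in "HEA":
        # Horses, elephants and advisors always change rank.
        raise ValueError(f"{piece} cannot move sideways: {text!r}")
    return TraditionalMove(piece, int(file_no), action, int(value))


def value_is_file(move):
    """True if the last digit is a destination file, False if it counts ranks."""
    return move.action == "=" or move.piece in "HEA"


def file_to_column(file_no, red_to_move):
    """Column 0..8 counted from Red's left; each side numbers files from its own right."""
    return 9 - file_no if red_to_move else file_no - 1


move = parse_move("C2=5")
print(move, value_is_file(move), file_to_column(move.file, red_to_move=True))
```

The awkward point is that the meaning of the last digit depends on the piece: for pieces moving along a file it is a distance, for horses, elephants and advisors it is the destination file.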

Most likely the game you posted here is encoded in GB2312, but the (2-byte) codes for the characters were interpreted as single-byte codes of another code page (perhaps Latin-1). The problem is that posting it here and copying it from the browser converts it to the Unicode code points for this Latin interpretation of the original bytes, after which recovering the original becomes a bit difficult.

On Linux there is a command, iconv, that transforms files from one encoding into another, but it would be best to run it on the original file. Just convert it from GB2312 or Big5 to Unicode (or whatever encoding your system locale uses) to see whether this makes the familiar characters for the pieces appear. Once the text is in a properly understood encoding, you can copy-paste all the occurring characters into a short file, use iconv to convert it to each encoding you want to recognize, and use od to view their codes.
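If the original file is not at hand, the damage can sometimes be undone from the pasted text. The sketch below does in Python what iconv and od would do on the command line. It assumes the wrong single-byte code page was Mac OS Roman rather than Latin-1, which is only a guess based on symbols in the posted text, such as ≥ and ◊, that Latin-1 does not contain. The function names are made up for this example.

```python
def repair_mojibake(garbled, wrong="mac_roman", original="gb2312"):
    """Undo text whose `original` bytes were decoded with the `wrong` code page.

    Both defaults are assumptions about the posted game, not established facts.
    """
    return garbled.encode(wrong).decode(original)


def byte_codes(char, encodings=("gb2312", "big5", "utf-8")):
    """Hex codes of one character in each encoding that contains it."""
    codes = {}
    for encoding in encodings:
        try:
            codes[encoding] = char.encode(encoding).hex()
        except UnicodeEncodeError:
            pass  # the character is not available in this encoding
    return codes


# Simulate the damage on a known move, then undo it.
move = "车九平八"
garbled = move.encode("gb2312").decode("mac_roman")
print(repair_mojibake(garbled) == move)

# List the codes a scanner would have to recognize for the chariot.
for char in "车車":
    print(byte_codes(char))
```

This only works as long as no bytes were lost on the way. Bytes that the wrong code page maps to whitespace, for example, may have been normalized by the browser, which is one more reason to convert the original file when it is available.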