The Mysterious datacomp.exe

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

LiquidNitrogenOverclocker

The Mysterious datacomp.exe

Post by LiquidNitrogenOverclocker »

Anyone who is familiar with Andrew Kadatch's datacomp.exe program is most likely from the EGTB world. Finding datacomp.exe online is a bit of a journey, as Google turns up the old, dead links first, and the few links that actually work are several pages into the hunt.

After finding it, the hunt for "how to use it" begins. The available "documentation" suggests using something like this from the command prompt:

datacomp.exe e:8192 <filename>

...after which the compression algorithm runs, and you are left with a file called

<filename>.emd

The 8192 is the "blocksize" that datacomp uses to write the data in chunks (in this case, every 8192 bytes). You can use other numbers as well. The higher the number, the (slightly) better the overall compression, but the worse the run-time performance when your engine probes the data in RAM.
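
To make the tradeoff concrete: with a fixed uncompressed blocksize, a probe can locate its block with plain integer arithmetic, but it then has to decompress that whole block to read a single byte. A minimal sketch (my own names, not datacomp's internals):

Code:

/* Sketch: map a byte index in the uncompressed data to a block
   number and an offset within that block. BLOCK_SIZE matches the
   e:<blocksize> argument; the names are illustrative. */
#define BLOCK_SIZE 8192UL

unsigned long block_number(unsigned long byte_index)
{
    return byte_index / BLOCK_SIZE;  /* which block holds the byte */
}

unsigned long block_offset(unsigned long byte_index)
{
    return byte_index % BLOCK_SIZE;  /* position within that block */
}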

In my own "checkers tablebase" experience, I found that 4096 is a great blocksize to use (I would create my own run-length encoding scheme for each tablebase I solved, not to be confused with the datacomp.exe blocksize parameter), since reading 1 byte off the disk, or 4095 bytes, still results in a 4096-byte sequential read by the operating system (this will no doubt vary over time and with the operating system).

The "pros" associated with my own runtime scheme: I do get better compression, but the tokens I have to use are "hand-made", and it takes me a while to create a good set. This translates into less RAM being needed at runtime, which delivers better overall performance.

The "cons" of this approach: I do have to create all the tokens for EACH material distribution I wish to solve (Side A vs. Side B) and this is a tedious process. Also, for larger tablebases, the length of time to compress goes up, and, afterwards, I may find the tokens I am choosing are not that great, so the process is repeated, etc.

Overall, the datacomp.exe compression is not bad compared to my own scheme. The good news is that it works for EVERY material distribution I could come up with, so I have no more code to write.

The bad news is that I can't find any documentation on how to PARSE a compressed .emd file that is not intertwined with Eugene Nalimov's chess code.

Does anyone know where such a generic description exists?
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: The Mysterious datacomp.exe

Post by Dann Corbit »

LiquidNitrogenOverclocker wrote: [...] The bad news is that I can't find any documentation on how to PARSE a compressed .emd file that is not intertwined with Eugene Nalimov's chess code.

Does anyone know where such a generic description exists?
The use of the datacomp code is restricted anyway. You have to get permission to use it.
LiquidNitrogenOverclocker

Re: The Mysterious datacomp.exe

Post by LiquidNitrogenOverclocker »

Dann Corbit wrote:
The use of the datacomp code is restricted anyway. You have to get permission to use it.
Yes, of this I am aware. Andrew was very instrumental in helping Gothic Vortex (see http://www.GothicChess.com/vortex.zip for a program that uses 80-square tablebases compressed by datacomp) read the compressed databases that we generated. You can see the Mate in 268 (and others) at http://www.gothicchess.com/javascript_endings.html if you'd like.

I was just wondering if there was generic documentation on it.

I guess not, based on what you said. Maybe I should just email Andrew, but I hate to bother him (all I have is his work email address).
jshriver
Posts: 1342
Joined: Wed Mar 08, 2006 9:41 pm
Location: Morgantown, WV, USA

Re: The Mysterious datacomp.exe

Post by jshriver »

To access the EGTBs via code, check out this thread. I asked the same question before, and it includes two links that showed me how to do it.

http://www.talkchess.com/forum/viewtopi ... highlight=

-Josh
LiquidNitrogenOverclocker

Re: The Mysterious datacomp.exe

Post by LiquidNitrogenOverclocker »

As previously mentioned, I know how to generate endgames of my own, compress them on my own, etc.

My question was: How do you decode ANY FILE compressed by datacomp.exe, even if it is NOT a chess endgame tablebase?

I have generated very large tablebases for the game of checkers, which I can also compress on my own. But, my compression method requires a time investment for each family of positions I solve. I have to do some "trial and error" to determine which tokens I will implement for maximum compression.

Apparently datacomp.exe will do much the same thing (its output is sometimes 10% to 15% larger, but I don't care about this), except that it uses a generalized algorithm that requires NO TIME from me to investigate, because all of its run-length encoding tokens are handled automatically.

I'd like to be able to figure out what to do to get to the decompressed checkers data.

I could always just make some files and observe the input/output, but this would be more time consuming than doing it individually.
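
For anyone who does take the black-box route, a file with long, known runs makes the compressed output much easier to study in a hex dump. A minimal generator, purely illustrative (the file name and sizes are arbitrary):

Code:

#include <stdio.h>

/* Sketch: write 16 runs of 8192 identical bytes each (0x00..0x0F),
   so per-block headers and boundaries in the compressed output are
   easier to spot when diffing hex dumps. */
int main(void)
{
    FILE *f = fopen("runs.bin", "wb");
    if (!f)
        return 1;
    for (int v = 0; v < 16; v++)
        for (int i = 0; i < 8192; i++)
            fputc(v, f);
    fclose(f);
    return 0;
}

Compressing runs.bin with different e:<blocksize> values and comparing the outputs should reveal where the per-block structure sits.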
jshriver
Posts: 1342
Joined: Wed Mar 08, 2006 9:41 pm
Location: Morgantown, WV, USA

Re: The Mysterious datacomp.exe

Post by jshriver »

http://www.open-aurec.com/wbforum/viewt ... 3d0eb73308

It has probing code "for chess", and while you mentioned looking for generic decompression code that is not chess specific, Dieter Buerssner's code should still help, since it takes the .emd files (datacomp.exe compressed files) and can decompress them.

At least that's my understanding, and I hope it helps.
-Josh
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: The Mysterious datacomp.exe

Post by michiguel »

LiquidNitrogenOverclocker wrote: [...] I'd like to be able to figure out what to do to get to the decompressed checkers data.
What about this? The compressor is tbcp, and it is included in the Gaviota distribution:
http://kirill-kryukov.com/chess/discuss ... c2a3e2f456

The decompression code is open:
http://sites.google.com/site/gaviotache ... e/releases

and updated here:
http://github.com/michiguel/Gaviota-Tablebases

I think you can slightly modify the code for checkers. You may need a function to convert my distance-to-mate format to yours, and you should make sure that, before you compress the files, the white-to-move and black-to-move files are concatenated into one. I think you should check the function get_dtm() and follow what it does. In gtb-dec.c you have the decompression code, which calls different libraries. If you do not like the compression system, you can plug in your own. Gaviota TBs come with four different preinstalled ones.
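
For what it's worth, the concatenation step is just a byte copy. A minimal sketch, assuming the two per-side files already exist (the function name and error handling are mine, not from the Gaviota sources):

Code:

#include <stdio.h>

/* Sketch: append the black-to-move file onto the end of the
   white-to-move file, producing one input for the compressor. */
int append_file(const char *dst_path, const char *src_path)
{
    FILE *dst = fopen(dst_path, "ab");
    FILE *src = fopen(src_path, "rb");
    int c;

    if (!dst || !src) {
        if (dst) fclose(dst);
        if (src) fclose(src);
        return -1;
    }
    while ((c = fgetc(src)) != EOF)
        fputc(c, dst);
    fclose(src);
    fclose(dst);
    return 0;
}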

Miguel
LiquidNitrogenOverclocker

Re: The Mysterious datacomp.exe

Post by LiquidNitrogenOverclocker »

My thinking is this:

A file is nothing more than a sequence of bytes. You seek a certain number of bytes into the file, and the result you want is at that address in the file.

I have an indexing function that converts a position into a unique number. That number is the address where I store the result.

The result is the Distance To Win, written as a 1-byte value from 0 to 255. The value 0 represents an immediate loss for the side to move. A value of 1 indicates that the side to move "mates" in 1. Therefore 2 is a loss in 2 plies, 3 is a win in 3 plies, and so on.

I use 255 for an "unknown" value (only used during database generation), and 254 is a draw. Very, very conveniently, in checkers, the longest win for any combination of material up to and including 7 pieces is 253 plies.

I "just barely" get away with solving everything using just 1 byte per position.

In the 6-piece database of 3 vs. 3, the largest "slice" features the material distribution of 2 kings + 1 checker vs. 2 kings + 1 checker. There are 124,966,800 unique positions, each one occupying 1 byte in the file.

If I need to look up a result, I convert the position into an index, call fseek() with the byte number I am seeking, and read that byte, which tells me whether it is a win in 167, a loss in 42, a draw, or whatever the result is.

Now I run datacomp.exe e:8192 on the file, and it produces another file that is 45,777,731 bytes - much smaller!

But, where is the data for position index 121,097,435? In the old scheme, I just call

Code:

fseek(my_file_ptr, 121097435 - 1, SEEK_SET);
then I read in the byte.
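
Put together, the whole uncompressed probe is only a few lines. A sketch, reusing the 1-based index convention of the fseek() call above and the hypothetical print_result() decoder sketched earlier:

Code:

#include <stdio.h>

/* Sketch: probe the uncompressed 1-byte-per-position file.
   Returns 0 on success and stores the DTW byte in *out. */
int probe(FILE *fp, unsigned long index, unsigned char *out)
{
    int c;

    if (fseek(fp, (long)(index - 1), SEEK_SET) != 0)
        return -1;              /* seek failed */
    if ((c = fgetc(fp)) == EOF)
        return -1;              /* read failed */
    *out = (unsigned char)c;    /* decode per the scheme above */
    return 0;
}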

What do I do now with the compressed file? I need to somehow convert my index to the new index.

When I use my own compression scheme, I divide the entire file into "blocks" of fixed size, usually 4096 bytes. Each block has a binary-searchable "max" index, so there is a way to discover which block must contain the index. Then you have to hunt for the value during the decompression phase, which again is straightforward, since I know how the data was compressed.

This "block data" is written into a very small separate file.

But datacomp.exe does not do this. It compresses the data (very well, I might add), but I am left with no clue how to unravel each index.

I guess what I can do is just look at the code in the links you have provided and use one of the other schemes. For now, I don't know how well they compress, or whether the data would be easily accessible at runtime.

Thanks for all of your "pointers"; I will invest the time and report on how it all turns out.