0x88 engines

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 0x88 engines

Post by bob »

Gerd Isenberg wrote:
Codeman wrote:According to Pete Isensee's paper (http://www.tantalon.com/pete/cppopt/app ... ativeCosts) about optimization tricks, doing various operations with int variables is a lot faster than with bytes.
Homogeneous char arrays in conjunction with zero- or sign-extension (movzx, movsx) are fine with respect to the smaller size and possible L1 savings. A 64-byte versus 256-byte board array is one 64-byte cacheline versus four (if cacheline aligned). Since the board array is heavily used, all four cachelines of an int array are most often kept in L1, so three other lines have to be thrown out, lines which could otherwise stay resident in the cache with the byte array.

So the marginally slower zero- or sign-extension needed to load chars into a native 32-bit register likely pays off in fewer cache misses elsewhere. Whether it does, and by how much, depends on your memory and cache footprint and may vary from program to program.

For the stack, that is locals and parameters (most are passed via registers on x64 anyway), and for very often used scalar globals, it is better to stay with int and the native register width, also because of alignment, store-to-load forwarding issues, and partial register stalls.
I don't do any local char stuff in Crafty. But I do have a few global arrays of type char...


the previously mentioned 64-element board array, plus a few other small arrays. I changed them all to ints and saw this kind of speed difference. The first is the original, the second is with chars replaced by ints:

log.001: time=37.90 mat=-1 n=105649503 fh=91% nps=2.8M
log.002: time=38.09 mat=-1 n=105649503 fh=91% nps=2.8M

Results were similar on several tests. Very small difference, but measurable and repeatable. I will note that this was run on my Core 2 with 4MB of L2. The difference is a little more significant on my office PIV with 512KB of L2.
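
For illustration, here is a rough sketch of the kind of declarations being compared above (illustrative names, not Crafty's actual code), assuming 64-byte cache lines and 4-byte ints:

#include <stdint.h>

/* 64-square mailbox board, two candidate element types */
int8_t board_c[64];   /* 64 bytes  -> one cache line (if 64-byte aligned) */
int    board_i[64];   /* 256 bytes -> four cache lines                    */

/* Reading a byte element compiles to a movzx/movsx load into a full
 * register, at most marginally slower than a plain 32-bit load, while
 * the smaller array leaves more L1 for everything else.                 */
int piece_on_c(int sq) { return board_c[sq]; }
int piece_on_i(int sq) { return board_i[sq]; }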
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: 0x88 engines

Post by diep »

bob wrote:
Gerd Isenberg wrote:
Codeman wrote:According to Pete Isensee's paper (http://www.tantalon.com/pete/cppopt/app ... ativeCosts) about optimization tricks, doing various operations with int variables is a lot faster than with bytes.
Homogeneous char arrays in conjunction with zero- or sign-extension (movzx, movsx) are fine with respect to the smaller size and possible L1 savings. A 64-byte versus 256-byte board array is one 64-byte cacheline versus four (if cacheline aligned). Since the board array is heavily used, all four cachelines of an int array are most often kept in L1, so three other lines have to be thrown out, lines which could otherwise stay resident in the cache with the byte array.

So the marginally slower zero- or sign-extension needed to load chars into a native 32-bit register likely pays off in fewer cache misses elsewhere. Whether it does, and by how much, depends on your memory and cache footprint and may vary from program to program.

For the stack, that is locals and parameters (most are passed via registers on x64 anyway), and for very often used scalar globals, it is better to stay with int and the native register width, also because of alignment, store-to-load forwarding issues, and partial register stalls.
I don't do any local char stuff in Crafty. But I do have a few global arrays of type char...


the previously mentioned 64-element board array, plus a few other small arrays. I changed them all to ints and saw this kind of speed difference. The first is the original, the second is with chars replaced by ints:

log.001: time=37.90 mat=-1 n=105649503 fh=91% nps=2.8M
log.002: time=38.09 mat=-1 n=105649503 fh=91% nps=2.8M

Results were similar on several tests. Very small difference, but measurable and repeatable. I will note that this was run on my Core 2 with 4MB of L2. The difference is a little more significant on my office PIV with 512KB of L2.
Bob,

Valgrind nowadays has something called cachegrind. If you run that on Crafty, what does it show?

That is, statistics like L1 cache misses, L2 misses, L3 misses, and branch mispredict percentage.

That should directly answer whether it makes sense for Crafty to do things like this.

Vincent
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 0x88 engines

Post by bob »

frankp wrote:
Codeman wrote:According to Pete Isensee's paper (http://www.tantalon.com/pete/cppopt/app ... ativeCosts) about optimization tricks, doing various operations with int variables is a lot faster than with bytes.
Does this apply to table look-ups? That is, is it faster to have int array[] than int8 array[]?
Memory speed is memory speed. The advantage of an int8 array is that it is 1/4 the size of an int array, so it uses less L1/L2/L3 cache. Not to mention that cache blocks are 64 bytes in newer hardware, which means you can fetch 64 byte values, or 16 int values, in the same amount of time. And the pre-fetching of the 64 bytes gives you a greater chance of a future cache hit as well.
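
As a quick worked illustration of that arithmetic (assuming 64-byte cache lines and 4-byte ints, which holds on the x86 hardware discussed here):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    enum { CACHE_LINE = 64 };   /* bytes per line on current x86 parts */
    printf("int8 entries per line: %zu\n", CACHE_LINE / sizeof(int8_t)); /* 64 */
    printf("int entries per line:  %zu\n", CACHE_LINE / sizeof(int));    /* 16 */
    /* So a 64-entry int8 table is filled by one cache-line fetch,
       while the same table as int needs four. */
    return 0;
}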
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 0x88 engines

Post by bob »

MattieShoes wrote:I think for us noobs, movegen speed is very unimportant in the scheme of things. If engines are spending 10% of their time in movegen, doubling the speed of movegen would increase overall speed by 5%, which might give you something like 6 Elo points? Undetectable unless you can test like Dr. Hyatt. Slimming down the branching factor via smarter move ordering and better pruning will help far more. And eval of course... That's my personal nightmare. I think it's tougher than the rest of the engine combined.

The messy code one is huge though, at least for me. It makes working on the beast much less fun if you're having trouble understanding what the heck you were doing LAST time...

And some test functions to make sure changes don't break something help a lot too :-)
For a first cut, you need to pick a board representation that you understand, and can write readable code for.

If you are careful, you can separate the board representation from the main engine parts by using macros or inline functions, so that if you choose to change the board layout, it won't wreck all the code you have written.

However, with that said, there are approaches that are so different (say mailbox, whether it be 0x88 or whatever, compared to bitboards) that this decision is important, because it will be a real challenge to change from one to the other since the concepts are entirely different. If your goal is to do a bitboard program, which in my opinion is the way to go, then you should start there. Otherwise you will end up throwing almost everything away and starting from scratch when you decide to change.
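
To make the point about macros and inline functions concrete, here is a minimal sketch of that kind of separation (the macro names are illustrative, not taken from any particular engine): search and eval only ever go through the macros, so switching the underlying layout means touching one header instead of the whole program.

/* board.h - pick the layout at compile time; the rest of the engine
 * only uses SQ(), PIECE_ON(), and ON_BOARD().                        */
#ifdef BOARD_0X88
  /* 0x88 layout: 128 entries, square = 16*rank + file */
  extern signed char board[128];
  #define SQ(rank, file)  (((rank) << 4) | (file))
  #define PIECE_ON(sq)    (board[sq])
  #define ON_BOARD(sq)    (!((sq) & 0x88))
#else
  /* plain 8x8 mailbox: 64 entries, square = 8*rank + file */
  extern signed char board[64];
  #define SQ(rank, file)  (((rank) << 3) | (file))
  #define PIECE_ON(sq)    (board[sq])
  #define ON_BOARD(sq)    ((unsigned)(sq) < 64)
#endif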
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 0x88 engines

Post by bob »

diep wrote:
bob wrote:
Gerd Isenberg wrote:
Codeman wrote:According to Pete Isensee's paper (http://www.tantalon.com/pete/cppopt/app ... ativeCosts) about optimization tricks, doing various operations with int variables is a lot faster than with bytes.
Homogeneous char arrays in conjunction with zero- or sign-extension (movzx, movsx) are fine with respect to the smaller size and possible L1 savings. A 64-byte versus 256-byte board array is one 64-byte cacheline versus four (if cacheline aligned). Since the board array is heavily used, all four cachelines of an int array are most often kept in L1, so three other lines have to be thrown out, lines which could otherwise stay resident in the cache with the byte array.

So the marginally slower zero- or sign-extension needed to load chars into a native 32-bit register likely pays off in fewer cache misses elsewhere. Whether it does, and by how much, depends on your memory and cache footprint and may vary from program to program.

For the stack, that is locals and parameters (most are passed via registers on x64 anyway), and for very often used scalar globals, it is better to stay with int and the native register width, also because of alignment, store-to-load forwarding issues, and partial register stalls.
I don't do any local char stuff in Crafty. But I do have a few global arrays of type char...


the previously mentioned 64-element board array, plus a few other small arrays. I changed them all to ints and saw this kind of speed difference. The first is the original, the second is with chars replaced by ints:

log.001: time=37.90 mat=-1 n=105649503 fh=91% nps=2.8M
log.002: time=38.09 mat=-1 n=105649503 fh=91% nps=2.8M

Results were similar on several tests. Very small difference, but measurable and repeatable. I will note that this was run on my Core 2 with 4MB of L2. The difference is a little more significant on my office PIV with 512KB of L2.
Bob,

Valgrind nowadays has something called cachegrind. If you run that on Crafty, what does it show?

That is, statistics like L1 cache misses, L2 misses, L3 misses, and branch mispredict percentage.

That should directly answer whether it makes sense for Crafty to do things like this.

Vincent
It is even easier than that. I just run a test that takes about 30 minutes, with and without the change, and compare the times. Faster is better since I am not changing anything about the shape of the tree.

I use the valgrind package every now and then, always have. But here it is not necessary since I only care about raw speed.
MattieShoes
Posts: 718
Joined: Fri Mar 20, 2009 8:59 pm

Re: 0x88 engines

Post by MattieShoes »

I understand how bitboards work in a general sense and there's something appealing about a binary computer representing the board as a big bucket of bits rather than an array of more complex data like mailbox systems... But bitboards seem to require a big bucket of tricks to do some things that are straightforward in a mailbox scheme. The tricks are fascinating but it's just not the part I'm interested in thinking hard about right now. :-)

My goal was never to try and make "the next Rybka"... I was aiming for 2000+ and it's already there. I'd like to make it play variants though, so I went with C++ and I've been abstracting stuff out. Search is now almost entirely variant independent, though it is dependent on the structure of a move. I'm considering making changes that will actually make the chess engine a bit weaker but make it easier to implement variants. The book code and hash code are already variant independent.

The sticking point now is eval. I'd like to make either a learned or an automatically tuned eval, not because it's stronger but because there's something sexy about the idea of just writing a board class, giving it some eval variables to tweak, and letting it figure the rest out by examining high-level games or trial and error.

And this is a total non sequitur, but I just finished rereading "The Moon is a Harsh Mistress" and now I keep typing in the pidgin language he used, dropping things like pronouns and "the". Tanstaafl! :-)
Robert Pope
Posts: 558
Joined: Sat Mar 25, 2006 8:27 pm

Re: 0x88 engines

Post by Robert Pope »

MattieShoes wrote:Hmm, a more specific implementation question will probably get more specific answers :-)

Is 100k nps for perft or for actual searching, with eval and whatnot? Because a simplistic 0x88 scheme without piecelists or iterative check detection, etc. should get far more than 100k nps in perft...

CPW has source available, so you could look at one possible implementation of everything basic there.
That's actual search. Perft gets 2.5M-5M nps without speed-up tricks. That's at least twice as slow as Crafty, which I could live with, but in search Crafty is still doing over 1M nps while mine is under 200K nps with a very basic eval.

So most likely I have serious qsearch generation/search issues or eval issues, but at this point I think I'm ready to start over and build things slowly but properly.
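
For anyone unsure what the perft figures being quoted here actually measure, a minimal sketch follows (GenerateLegalMoves, MakeMove, and UnmakeMove are assumed placeholder helpers, not anyone's actual API): perft only counts leaf positions, so perft nps reflects raw move generation and make/unmake speed with no eval, no hashing, and no pruning.

#include <stdint.h>

typedef int Move;                              /* placeholder move type     */
extern int  GenerateLegalMoves(Move *list);    /* assumed helpers, not a    */
extern void MakeMove(Move m);                  /* real engine's interface   */
extern void UnmakeMove(Move m);

uint64_t perft(int depth)
{
    Move list[256];
    if (depth == 0)
        return 1;
    int n = GenerateLegalMoves(list);
    if (depth == 1)
        return (uint64_t)n;                    /* bulk-count the leaves     */
    uint64_t total = 0;
    for (int i = 0; i < n; i++) {
        MakeMove(list[i]);
        total += perft(depth - 1);
        UnmakeMove(list[i]);
    }
    return total;
}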
MattieShoes
Posts: 718
Joined: Fri Mar 20, 2009 8:59 pm

Re: 0x88 engines

Post by MattieShoes »

My perft is not very smart or tricksy; it gets perhaps 6.5 million nps. More is always better, but if your perft is half the speed of Crafty's, then it's not what's holding you back. If doubling overall speed nets 50-80 points, doubling perft speed would get you only a fraction of that.

Going from 3.75 million nps in perft to 200,000 nps with search seems like kind of a steep dropoff for a simplistic eval. Perhaps something there is slowing it down?

Either way, move ordering will help you most, and null move pruning will help too.

If this is 0x88 and you don't use some form of attack tables for check detection, adding them will make a big difference. Iterative check detection can help too.
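
Here is a minimal sketch of the 0x88 attack-table idea mentioned above (names and layout are illustrative, not any particular engine's). Because the 0x88 square difference is unique for each geometric relation, a single table indexed by 0x77 + from - to can say which piece types could possibly attack along that direction, so "is the king in check?" becomes a handful of table lookups plus a ray-emptiness test for sliders.

#include <string.h>

enum { A_KNIGHT = 1, A_BISHOP = 2, A_ROOK = 4, A_QUEEN = 8, A_KING = 16 };

static unsigned char attacks_to[240];      /* indexed by 0x77 + from - to */

static const int knight_d[8] = { 33, 31, 18, 14, -14, -18, -31, -33 };
static const int king_d[8]   = { 1, -1, 16, -16, 17, 15, -15, -17 };
static const int diag_d[4]   = { 17, 15, -15, -17 };
static const int ortho_d[4]  = { 1, -1, 16, -16 };

static void init_attack_table(void)
{
    memset(attacks_to, 0, sizeof attacks_to);
    for (int i = 0; i < 8; i++) {
        attacks_to[0x77 + knight_d[i]] |= A_KNIGHT;
        attacks_to[0x77 + king_d[i]]   |= A_KING;
    }
    for (int i = 0; i < 4; i++)
        for (int step = 1; step <= 7; step++) {
            attacks_to[0x77 + diag_d[i]  * step] |= A_BISHOP | A_QUEEN;
            attacks_to[0x77 + ortho_d[i] * step] |= A_ROOK | A_QUEEN;
        }
}

/* "Could a bishop on 'from' possibly attack 'to'?"  One lookup instead of
 * walking rays; sliders still need the squares in between to be empty.
 * Both squares are assumed to be valid 0x88 squares.                      */
static int bishop_might_attack(int from, int to)
{
    return attacks_to[0x77 + from - to] & A_BISHOP;
}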
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: 0x88 engines

Post by diep »

Robert Pope wrote:
MattieShoes wrote:Hmm, a more specific implementation question will probably get more specific answers :-)

Is 100k nps for perft or for actual searching, with eval and whatnot? Because a simplistic 0x88 scheme without piecelists or iterative check detection, etc. should get far more than 100k nps in perft...

CPW has source available, so you could look at one possible implementation of everything basic there.
That's actual search. Perft gets 2.5M-5M nps without speed-up tricks. That's at least twice as slow as Crafty, which I could live with, but in search Crafty is still doing over 1M nps while mine is under 200K nps with a very basic eval.

So most likely I have serious qsearch generation/search issues or eval issues, but at this point I think I'm ready to start over and build things slowly but properly.
Is this you?

http://www.bangor.ac.uk/trs/staff/roberts_pope.php.en?

Oh, on the nps difference: how do you count nodes?
In Diep, as a basic concept, I count every call to search or qsearch as a node.

What hardware do you develop on and in what programming language?

Vincent
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 0x88 engines

Post by bob »

Robert Pope wrote:
MattieShoes wrote:Hmm, a more specific implementation question will probably get more specific answers :-)

Is 100k nps for perft or for actual searching, with eval and whatnot? Because a simplistic 0x88 scheme without piecelists or iterative check detection, etc. should get far more than 100k nps in perft...

CPW has source available, so you could look at one possible implementation of everything basic there.
That's actual search. Perft gets 2.5M-5M nps without speed-up tricks. That's at least twice as slow as Crafty, which I could live with, but in search Crafty is still doing over 1M nps while mine is under 200K nps with a very basic eval.

So most likely I have serious qsearch generation/search issues or eval issues, but at this point I think I'm ready to start over and build things slowly but properly.
I would not throw everything overboard too quickly. Crafty is only 15 years old and counting. It has been written, rewritten, tweaked, and re-tweaked to get to its current speed level. That's not a one-year project at all. The first step is to come up with a workable design that you are happy with, which starts with the board representation. Then a functional search and evaluation that plays chess. Doing this will lead you to modify the basic engine to do things more efficiently. But you likely won't know what you need to modify until after you have a chance to get something working that plays reasonably...

There are not many programs faster than Crafty. You've picked a tough one to use for a speed target.