0x88 engines

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

MattieShoes
Posts: 718
Joined: Fri Mar 20, 2009 8:59 pm

Re: 0x88 engines

Post by MattieShoes »

I think for us noobs, movegen speed is very unimportant in the scheme of things. If engines are spending 10% of their time in movegen, doubling speed of movegen would increase overall speed by 5%, which might give you something like 6 points? Undetectable unless you can test like Dr. Hyatt. Slimming down the branching factor via smarter move ordering and better pruning will help far more. And eval of course... That's my personal nightmare. I think it's tougher than the rest of the engine combined.

The messy code one is huge though, at least for me. It makes working on the beast much less fun if you're having trouble understanding what the heck you were doing LAST time...

And some test functions to make sure changes don't break something help a lot too :-)
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: 0x88 engines

Post by diep »

Zach Wegner wrote:Still on that anti-bitboard kick, eh Vincent? :)
This has nothing to do with bitboards, i just objectively report what is fastest.

Thanks,
Vincent
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: 0x88 engines

Post by diep »

MattieShoes wrote:0x88 was 16x8 (or 8x16, it matters not) I believe. 10x12 would not allow the 0x88 scheme to work since the "8 bits" aren't necessarily OOB flags.

I'm trying to understand the advantage of 16x16 and I'm not getting it. Can anybody explain why 16x16 is better than 16x8? What additional capabilities does it offer over 16x8?
16x16 is faster for detecting checks and attacks. Doing checks in qsearch is real important and attacks are for other purposes interesting to discover.

It is easy to see that if you want to see whether a piece P at square sq attacks square x that you can already have a quick indication of this with a few + and minus signs:

attack1mask = attackarray[constant + sq- x];

First of all, when this is not set you already know for sure there is no attack, so a single L1 lookup is enough to cut off most attempts.

Now the advantage of 16x16 over others is that it is more accurate.

Secondly, looping through it: today's out-of-order processors are really clever there, and Intel can do the looping without the loop itself costing many cycles (older AMDs had a one-cycle extra penalty for it).

So from the mask you get back, you can loop through the squares really quickly.

For normal operations, such as scanning through it in the evaluation function, you can just do shifting. All those loops are really cheap. You keep the register load low and can push through at a really high IPC on modern processors; that is the huge advantage.

Design it once and never have problems again, whereas with other data structures like bitboards you stay forever busy losing a cycle here or two cycles there: the lower number of 64-bit instructions the processor can issue, the bigger load on its caches, and so on. A never-ending story.

This is simply the fastest of all, with no need to upgrade the data structure in the coming years. It also runs in very tiny caches, which might be interesting with upcoming manycore hardware as well.

The lookup tables needed for a 16x16 board are really tiny and fit easily in L1.

Realize that all these programs in the 90s were under 1000 cycles a node on P5 hardware, which can do at most 2 instructions a cycle. Today's hardware can do up to 4 instructions a cycle (32-bit of course, not 64-bit) in integers, and not a single bitboard engine except Rybka gets under 1000 cycles a node, AFAIK.

With the simple engine I wrote, even Diep-datastructure compatible, in 32 bits I easily got underneath 1000 cycles a node on a K7 at 2.1GHz.

Still, that's slow compared to Schach 3.0, for example, which was something like 600 cycles a node on a Pentium. The strong programmers left the computer chess field. What came back is a bunch of bureaucrats who are still busy learning how to program.

Thanks,
Vincent
User avatar
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: 0x88 engines

Post by Zach Wegner »

diep wrote:
Zach Wegner wrote:Still on that anti-bitboard kick, eh Vincent? :)
This has nothing to do with bitboards, i just objectively report what is fastest.

Thanks,
Vincent
Well, I think you would be right about the move generator, no argument there. Deserialization of bitboards is rather tedious and doesn't interest me, though creating the attack bitboards themselves is fun to think about from a mathematical point of view. But bitboards are just so much massively simpler and easier to conceptualize for me. For mailbox or gnuchess 4.x approaches to be fast at things like eval, you need to keep around tons of little tables. For example, take the evaluation of Fruit. A lot of it is fairly simple, but looking at the pawn eval or the king safety eval in particular shows tons of stuff that's just ugly: all the pawn_file and BitGE stuff, the piece_attacks_king() function, etc. Now compare this to Strelka, where (ahem) the exact same calculations are done with bitboards. Much simpler, and much faster. Even if you look at eval_pattern() in Fruit, where mailbox would supposedly shine, the Strelka version is simpler and faster. Even if bitboards weren't faster at this sort of stuff, it would be a massive advantage just in conceptualizing it.

All I'm saying is that I think the "bitboards are slow and complicated" mode of thinking is getting more and more outdated.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: 0x88 engines

Post by Gerd Isenberg »

MattieShoes wrote:I never even considered shifting by more bits than there are in the value... Clever solution :-)
Should it run faster? I tested it in debug mode (otherwise, the compiler optimized away the entire loop!) and it was within a few percent, but I don't know if the fact it was running in debug mode makes any real difference here.
As you said elsewhere, there are more important things to improve the engine ;-)

One (or in my case, two) ALU instruction fewer is fine, but on x86 cl is the only register allowed for variable general-purpose shifts, which might become an issue, e.g. when testing whether two squares share the same color. Often additional instructions may even improve IPC. Even if the in-register lookup is faster, it is probably not measurable inside a chess program. The behavior of a shift amount greater than or equal to the register width is not specified by the C or C++ language, but on x86 it is defined as modulo the register width.
With today's processors and compilers, changing some code sometimes behaves "chaotically", and it is hard to measure small code snippets independently. Forget loop tests in debug mode; instead try a loop in release mode, pass some random squares, and count how many are dark or light. That way the compiler has no chance to optimize anything away.

Code: Select all

#define RAND_SIZE (1<<14)
#define RAND_MASK (RAND_SIZE - 1)
#define N 1000000000

int randomSquare[RAND_SIZE]; // initialized with random squares
int colorCounter[2];

void testColor()
{
   for (int i = 0; i < N; i++)
      colorCounter[colorOfSquare(randomSquare[i & RAND_MASK])]++;
}
Of course the color of square function should be inlined.

If you intend to program for other platforms than x86 as well, you'd better stay with the portable version.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: 0x88 engines

Post by diep »

Zach Wegner wrote:
diep wrote:
Zach Wegner wrote:Still on that anti-bitboard kick, eh Vincent? :)
This has nothing to do with bitboards, i just objectively report what is fastest.

Thanks,
Vincent
Well, I think you would be right about the move generator, no argument there. Deserialization of bitboards is rather tedious and doesn't interest me, though creating the attack bitboards themselves is fun to think about from a mathematical point of view. But bitboards are just so much massively simpler and easier to conceptualize for me. For mailbox or gnuchess 4.x approaches to be fast at things like eval, you need to keep around tons of little tables. For example, take the evaluation of Fruit. A lot of it is fairly simple, but looking at the pawn eval or the king safety eval in particular shows tons of stuff that's just ugly: all the pawn_file and BitGE stuff, the piece_attacks_king() function, etc. Now compare this to Strelka, where (ahem) the exact same calculations are done with bitboards. Much simpler, and much faster. Even if you look at eval_pattern() in Fruit, where mailbox would supposedly shine, the Strelka version is simpler and faster. Even if bitboards weren't faster at this sort of stuff, it would be a massive advantage just in conceptualizing it.

All I'm saying is that I think the "bitboards are slow and complicated" mode of thinking is getting more and more outdated.
Attack tables are slower in bitboards of course; in Diep I'm doing them incrementally.

You can easily prove this mathematically. It is best for an evaluation function to have all information within a single integer preferably, or two integers.

With bitboards you scatter the information in a big diaspora. Fruit is a bad choice of comparison: it is French-style coding, from a programmer who seems to write a sort of Fortran-flavored C++. It's easy to speed up his code and get 2x the NPS of Fruit in a lossless manner.

In Diep I'm using an incremental attack table. It works as follows:

int attacktable[2][64]; // for each side, for each square

In each integer the sign bit is not used. Bits 15..30 give the index numbers, in bitboard style, one bit for each piece n, into the piecelist array.
for each n, into the piecelist array.

int piecelist[2][18]; // gives square of piece.

So in that manner you can see, if you need it, which squares the attackers stand on.

The least significant 4 bits are the number of attackers of the square.

For titled players, the number of attackers of a square is simply an important consideration when evaluating a position. It is also of use in SEE.

In bits 4..29 there is a 'gnuchess 4.0' type sense of the value of a piece.

That ranges from ctlP being most significant to ctlK being least significant.

Note that the manner in which I do this 'ctlX' stuff is different from gnuchess 4.0.

Bitboards are way too slow to do this non-incrementally. You need some sort of data structure like this. Just look at the 'combine attacks' function in Crafty: it is so, so slow.

If you realize all this attack info gets used throughout Diep's eval, you'll realize the necessity of incremental attacktables for Diep.

The speed of your move generator determines a lot more than just move generation. It also shows how much talent you've got as a programmer.

Fiddling with bitboards already means bitboards are too slow for what you do; besides that, it loses too much time.

Set up a data structure like I've got in Diep and you hardly ever again lose time to the data structure itself. With bitboards you're nonstop busy with it. Higher-level thinking hardly happens.

We can draw a definite conclusion nearly 10 years after Frans Morsch made the historic statement: "Vincent, please don't post what you do inside Diep, let them all toy with bitboards; bitboards is too complicated for them, it will limit engine strengths".

I can conclude he was so, so right. Maybe one person really is fast with bitboards; the rest, despite nearly all of them trying to make a fast beancounter, don't even manage to get under 1000 cycles a node. Really clumsy programmers. In fact, the biggest limit in most codebases is C++. The guys here don't realize how much of a slowdown average C++ code is compared to C code, let alone assembler.

Additionally, if you just create a beancounter, how do you plan to beat Rybka if you have a smaller evaluation function AND get 2 to 4 times lower NPS than Rybka?

Note that's still better than the Go guys. I remember a discussion there a few years ago where they claimed Java to be nearly the same speed as C.

If your evaluation function is rather simple and tiny, you can't claim to be a good programmer if you need more than 1000 cycles a node.

I see cases of bitboard engines where Diep, having the world's largest evaluation function, gets nearly the same NPS, just thanks to their clumsily implemented code and C++ overhead.

Note that I do appreciate people who write their own code a lot more than those who just model theirs after existing engines; we were supposed to see a posting by GCP regarding Thinker here, but so far I haven't seen him post it.

Vincent
User avatar
Greg Strong
Posts: 388
Joined: Sun Dec 21, 2008 6:57 pm
Location: Washington, DC

Re: 0x88 engines

Post by Greg Strong »

diep wrote: In fact, the biggest limit in most codebases is C++. The guys here don't realize how much of a slowdown average C++ code is compared to C code, let alone assembler.
We don't realize it because it's not true.
MattieShoes
Posts: 718
Joined: Fri Mar 20, 2009 8:59 pm

Re: 0x88 engines

Post by MattieShoes »

diep wrote: 16x16 is faster for detecting checks and attacks. Doing checks in qsearch is real important and attacks are for other purposes interesting to discover.

It is easy to see that if you want to see whether a piece P at square sq attacks square x that you can already have a quick indication of this with a few + and minus signs:

attack1mask = attackarray[constant + sq- x];
This works with 16x8 as well, in exactly the same way, so I don't see how it makes 16x16 faster. For 16x8...
array[119+sqOne-sqTwo]
The constant might change in 16x16, but the array would be identical in size as far as I can see...
MattieShoes
Posts: 718
Joined: Fri Mar 20, 2009 8:59 pm

Re: 0x88 engines

Post by MattieShoes »

Gerd Isenberg wrote: Of course the color of square function should be inlined.
It's just one of my macros. I probably rely on macros too much...

Code: Select all

#define ROW(x)				((x) >> 4)
#define COL(x)				((x)&7)
#define LIGHT_SQUARE(x)		((((x)>>4)^(x))&1)
#define TO_BOARD64(x)		(((x)&7) | (((x)>>1)&56))
#define TO_BOARD128(x)		((((x)<<1)&0xF0) | ((x)&7))
#define INVALID_SQUARE(x)	((x)&0x88)
(By the way, index 0 is a8 in my scheme, not a1, so it's backwards compared to cpw)

Nowadays, the compiler handles inlining functions automatically, yes? I haven't explicitly inlined any of my functions on the assumption that very simple 1 line functions would automatically get inlined anyway.
User avatar
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: 0x88 engines

Post by Zach Wegner »

MattieShoes wrote:
diep wrote: 16x16 is faster for detecting checks and attacks. Doing checks in qsearch is real important and attacks are for other purposes interesting to discover.

It is easy to see that if you want to see whether a piece P at square sq attacks square x that you can already have a quick indication of this with a few + and minus signs:

attack1mask = attackarray[constant + sq- x];
This works with 16x8 as well, the exact same way, though, so I don't see how it'd makes 16x16 faster. for 16x8...
array[119+sqOne-sqTwo]
The constant might change in 16x16 but the array would be identical in size as far as I can see...
A simple explanation of the superiority of 16x12 (and larger) versus 0x88:
16x12:

Code: Select all

void gen_slider_moves(int sq, int dir)
{
  int to = sq + dir;
  while (board[to] == EMPTY)
  {
    add_move(sq, to);
    to += dir;
  }
  if (board[to] == ENEMY)
    add_move(sq, to);
}
0x88:

Code: Select all

void gen_slider_moves(int sq, int dir)
{
  int to = sq + dir;
  while (!(to & 0x88) && board[to] == EMPTY)
  {
    add_move(sq, to);
    to += dir;
  }
  if (!(to & 0x88) && board[to] == ENEMY)
    add_move(sq, to);
}
Basically you save 1 branch per move generated.