OliPerft with divide Option as Pre Version for OliThink 5

Gerd Isenberg · Post by **Gerd Isenberg** » Mon Dec 31, 2007 11:49 am

Hi Oliver,

I don't see magic bitboards http://chessprogramming.wikispaces.com/Magic+Bitboards, but linewise 32 bit - kindergarten like - bitboards index generation with all eight bits per line - to index four huge pre-calculated arrays of 2MByte in total by square, occupied-state 256 and {0,256,512,768}.

Code: Select all

u64 ray000[4*64*256];
u64 ray090[4*64*256];
u64 ray045[4*64*256];
u64 ray135[4*64*256];

At the first glance it looks like all attacks (+0), attacks (+256) on empty squares (quite move targets), xray-attacks "behind" first blocker (512) and some measure based on population count (768). Can you please explain what information is exactly stored inside the rayxxx-arrays?

Thanks,
Gerd

OliverBr · Post by **OliverBr** » Mon Dec 31, 2007 10:03 pm

Gerd Isenberg wrote:Hi Oliver,

I don't see magic bitboards http://chessprogramming.wikispaces.com/Magic+Bitboards, but linewise 32 bit - kindergarten like -

You may call it "kingergarten like" but you have to consider that I invented this completely by myself. I called it "Magic Bitboard" later because it seems to have the same basic idea.

bitboards index generation with all eight bits per line -
to index four huge pre-calculated arrays of 2MByte in total by square, occupied-state 256 and {0,256,512,768}.

Absolutely correct. Except for the "huge". 2 MByte doesn't seem huge to me at all.

Code: Select all
u64 ray000[4*64*256];
u64 ray090[4*64*256];
u64 ray045[4*64*256];
u64 ray135[4*64*256];
At the first glance it looks like all attacks (+0), attacks (+256) on empty squares (quite move targets), xray-attacks "behind" first blocker (512) and some measure based on population count (768). Can you please explain what information is exactly stored inside the rayxxx-arrays?

Thanks,
Gerd

Of course. You are right that I put 4 different information in those rayXXX.

+0 yields all attacks as you said, but *includes* the attacks on own pieces, too. This is due to the fact that I did not save informataion about the color of the pieces in those arrays. So I have just to "AND" it with the enemy's color to get the real attacks.

+256 is move targets, as you said, too. Here color information is not needed.

+512 is indeed xray attacked behind first block. Again without any color information. This array is just need to get all pinned pieces.

+768 is normally empty and not needed. I experimented to put there something what is *not* a bitboard (other than the first 3). I am putting there a list of moves including the captures and the number of them. I am not using it for move generation now.

OliverBr · Post by **OliverBr** » Mon Dec 31, 2007 10:05 pm

Edsel Apostol wrote:Hi Oliver,

Is Olithink going to be released as open source?

What protocol do you use?

I am studying OliPerft as I want to adapt the move geneartion for my engine Twisted Logic.

Yes, I am planning to release it open source. I am not sure wht you mean with "protocol"? xboard protocol? for now it understands "xboard", "go" and "force" that enough to play.

The move generator is far from perfect, but if you build something like a factor simpliness x shortness speed, it could be #1 worldwide now ;o)

Gerd Isenberg · Post by **Gerd Isenberg** » Mon Dec 31, 2007 10:48 pm

OliverBr wrote: You may call it "kingergarten like" but you have to consider that I invented this completely by myself. I called it "Magic Bitboard" later because it seems to have the same basic idea.

Almost the same calculation was discussed in winboard forum
http://wbforum.vpittlik.org/viewtopic.php?t=4523
some time ago - nominated as kindergarten bitboards later by me as described in the chess-programming wiki as well: http://chessprogramming.wikispaces.com/ ... +Bitboards

Absolutely correct. Except for the "huge". 2 MByte doesn't seem huge to me at all.

I consider 2MByte huge for that purpose with respect to the 9.5 KByte for Kindergarten or 2.5KByte for the Hyperbola Quintessence http://chessprogramming.wikispaces.com/ ... intessence
I mean L1-cache and 4KByte page ressources will likely start to pollute sometime.

So far i don't see why you need to index by all eight bits of the line-occupancy - you may safe some memory resources by using only the inner six bits - or? Also the second entry seems redundant with respect to a simple intersection of all attacks with empty squares.

If using so much memory, I would prefer to keep four entries (32-byte 1/2 cacheline) per square and occupancy close together in memory - to use only one cacheline instead of four.

I appreciate that you found all by yourself - that's great!
Good luck and fun with your approach - and a happy and successful new year.

Cheers,
Gerd

OliverBr · Post by **OliverBr** » Mon Dec 31, 2007 11:12 pm

Gerd Isenberg wrote:
OliverBr wrote: some time ago - nominated as kindergarten bitboards later by me as described in the chess-programming wiki as well: http://chessprogramming.wikispaces.com/ ... +Bitboards
I see... Now there is that very interesting point that you are using some multiplication to get the key. I am not used to use multiplication
when looking for performance as I started programming with processors where there weren't any multiplicating operations available.

I am still doubting about the speed of those multiplication, are they the same than a old fashioned XOR? Another question: What happens to the performance when going down to a 32bit system?
Does anyone has an idea about this?

So far i don't see why you need to index by all eight bits of the line-occupancy - you may safe some memory resources by using only the inner six bits - or?
It would be nice if you could explain, how the outer two bits can are not necessary. I see -for the moment- that only one bit is redudant. That very one from where the attacks are calculation.
Of course, this serves nothing for the method of my key generation.

Also the second entry seems redundant with respect to a simple intersection of all attacks with empty squares.
Cough. Good point ;o) You see, nothing is perfect.

If using so much memory, I would prefer to keep four entries (32-byte 1/2 cacheline) per square and occupancy close together in memory - to use only one cacheline instead of four.
Next good point. I have to admit I am not confirm enough with cache architecture.

I appreciate that you found all by yourself - that's great!
Good luck and fun with your approach - and a happy and successful new year.
Thank you. The same for you.

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Jan 01, 2008 12:01 am

OliverBr wrote: I see... Now there is that very interesting point that you are using some multiplication to get the key. I am not used to use multiplication
when looking for performance as I started programming with processors where there weren't any multiplicating operations available.

I am still doubting about the speed of those multiplication, are they the same than a old fashioned XOR? Another question: What happens to the performance when going down to a 32bit system?
Does anyone has an idea about this?

Yes, imul has become amazingly fast on AMD K8 and intel core2.
In 64-bit mode 64*64=64bit imul has a latency of four cycles on K8 and 5 cycles on core2. Reciprocal throughput is 2 cycles each. I guess on P4 your approach is faster, but for the masked diagonals you may consider a simple parallel prefix solution so safe the generalized shift and some masks:

Code: Select all

int collapsedFilesIndex(U64 b) {
   b |= b >> 32;
   b |= b >> 16;
   b |= b >>  8;
   return b & 0xFF;
}

In 32-bit mode one may simply use 32-bit imul, which takes 3(1) cycles latency, reciprocal throughput:

Code: Select all

U64 diaOrRankAttacks(U64 occupied, U64 mask, U32 sq) {
   U64 x = occupied & mask;
   U32 i = ( ((U32)x + (U32)(x>>32)) * 0x02020202) >> 26;
   return fillUpAttacks[sq&7][i] & mask;
}

It would be nice if you could explain, how the outer two bits can are not necessary. I see -for the moment- that only one bit is redudant. That very one from where the attacks are calculation.
Of course, this serves nothing for the method of my key generation.

For pure attack-sets the outer squares don't care - they are either attacked or not - independent whether they are occupied. There are no more squares behind them.

cms271828 · Post by **cms271828** » Tue Jan 01, 2008 7:08 am

Hi, nice to see a java engine, this is incredible..

I ran it on eclipse, with output below..

Chess - OliThink 4.1.2 (Java)
d4
1 0 0 24 d7-d5
2 -34 1 116 d7-d5 g1-f3
3 0 3 270 d7-d5 g1-f3 g8-f6
4 -34 12 1836 d7-d5 g1-f3 g8-f6 b1-c3
5 0 17 5200 d7-d5 g1-f3 g8-f6 b1-c3 b8-c6
6 -34 29 23751 d7-d5 g1-f3 b8-c6 c1-f4 c8-f5 b1-c3
7 -3 56 68939 d7-d5 b1-c3 c8-f5 g2-g3 b8-c6 f1-h3 e7-e6 h3xf5 e6xf5
8 -17 143 218020 g8-f6 b1-c3 e7-e6 g1-f3 f8-b4 c1-d2 b8-c6 e2-e3
9 -4 468 737280 g8-f6 b1-c3 d7-d5 c1-f4 c8-f5 h2-h3 h7-h6 g2-g4 g7-g5
1. ... g8-f6
whisper -4(9) 737280 nds 157001 nps 4688 ms 328183 evs

I'm not sure how you reached ply 9, in 4688ms, does this include Transposition table?
Strangely.. your node count is 157001 nps (I think) on my computer (Pentium 4, 3Ghz 1G ram).

But my chess program was doing 1.8M nodes/sec (without TTs), and I just got it to do 1M nodes/sec (with TTs), but I've just put TTs in, and I don't know what I'm doing, so its not working properly.

But I can't get to that sort of depth in 5 seconds, only around 6 - 7 ply from initial position.

Didn't you say you could do 100's of millions of nodes per second?
Its still very good, would like to know how it can search so deep with such small node count, unless I'm reading it wrong.

Thanks

Dann Corbit · Post by **Dann Corbit** » Tue Jan 01, 2008 7:35 am

cms271828 wrote:Hi, nice to see a java engine, this is incredible..

I ran it on eclipse, with output below..
Chess - OliThink 4.1.2 (Java)
d4
1 0 0 24 d7-d5
2 -34 1 116 d7-d5 g1-f3
3 0 3 270 d7-d5 g1-f3 g8-f6
4 -34 12 1836 d7-d5 g1-f3 g8-f6 b1-c3
5 0 17 5200 d7-d5 g1-f3 g8-f6 b1-c3 b8-c6
6 -34 29 23751 d7-d5 g1-f3 b8-c6 c1-f4 c8-f5 b1-c3
7 -3 56 68939 d7-d5 b1-c3 c8-f5 g2-g3 b8-c6 f1-h3 e7-e6 h3xf5 e6xf5
8 -17 143 218020 g8-f6 b1-c3 e7-e6 g1-f3 f8-b4 c1-d2 b8-c6 e2-e3
9 -4 468 737280 g8-f6 b1-c3 d7-d5 c1-f4 c8-f5 h2-h3 h7-h6 g2-g4 g7-g5
1. ... g8-f6
whisper -4(9) 737280 nds 157001 nps 4688 ms 328183 evs
I'm not sure how you reached ply 9, in 4688ms, does this include Transposition table?
Strangely.. your node count is 157001 nps (I think) on my computer (Pentium 4, 3Ghz 1G ram).

But my chess program was doing 1.8M nodes/sec (without TTs), and I just got it to do 1M nodes/sec (with TTs), but I've just put TTs in, and I don't know what I'm doing, so its not working properly.

But I can't get to that sort of depth in 5 seconds, only around 6 - 7 ply from initial position.

Didn't you say you could do 100's of millions of nodes per second?
Its still very good, would like to know how it can search so deep with such small node count, unless I'm reading it wrong.

Thanks

He gets the really high node counts with the perft program.
If you add alpha-beta and null move to your search, you will see big depth increases.

OliverBr · Post by **OliverBr** » Tue Jan 01, 2008 11:37 am

cms271828 wrote: I'm not sure how you reached ply 9, in 4688ms, does this include Transposition table?

Yes, there are hash tables, but I think in that position the null move stuff is doing most of the work. But Null move is dangerous. It's not a loss-free reducing of nodes.
The alpha-beta is loss-free and so the move ordering is so important.

Strangely.. your node count is 157001 nps (I think) on my computer (Pentium 4, 3Ghz 1G ram).

I am just counting *real* nodes here so quiescence nodes are not counted.

Didn't you say you could do 100's of millions of nodes per second?

Yes, but without any evaluation (just perft). With a decent eval the performance goes down to less than 5 millions of nodes.

OliverBr · Post by **OliverBr** » Wed Jan 02, 2008 12:21 am

Thanks to the tips from Gerd Isenberg I succeeded in improving another 10% :

oliperft 7 "r3k2r/p6p/8/8/8/8/P6P/R3K2R w - - 0 1"

Former: Nodes: 553838205 ms: 1992 knps: 278031
Now: Nodes: 553838205 ms: 1840 knps: 300183

oliperft 6 "r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq -"

Former: Nodes: 8229523927 ms: 50456 knps: 163102
Now: Nodes: 8229523927 ms: 40610 knps: 202647

This was on a 64bit system. On 32bit systems there is hardly any improvement.

http://home.arcor.de/dreamlike/chess/index.html

OliPerft with divide Option as Pre Version for OliThink 5

Re: OliPerft with divide Option as Pre Version for OliThink

Re: OliPerft with divide Option as Pre Version for OliThink

Re: OliPerft with divide Option as Pre Version for OliThink

Re: OliPerft with divide Option as Pre Version for OliThink

Re: OliPerft with divide Option as Pre Version for OliThink

Re: OliPerft with divide Option as Pre Version for OliThink

Re: OliPerft with divide Option as Pre Version for OliThink

Re: OliPerft with divide Option as Pre Version for OliThink

Re: OliPerft with divide Option as Pre Version for OliThink

Re: OliPerft with divide Option as Pre Version for OliThink