Daniel Shawul wrote:
diep wrote:
What I do is a tad more work. I have a function that loops in the same manner as the indexation works, so the way it puts pieces on the board is exactly the same as how the indexation works. That's WAY FASTER. Then for each lookup you do, you figure out where it is incrementally. The thing you don't want to do is sophisticated tricks. What you can't avoid is mirroring and reducing for n of the same men, as the savings from those are too huge. Koistinen wanted to skip even the 63 * 62 * ... reduction and just use either 48 for pawns or 64 per piece there,
but that's something you have to test yourself. With a few tables you already get far there, and you just replace the tables.
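To make that concrete, here is a minimal sketch (a toy KPvK-style layout, not the actual Diep index): the positions are enumerated by nested loops in exactly the order the index is defined in, so during generation the index is nothing more than a counter that gets incremented once per position.

Code:
/* Toy sketch of incremental indexing (hypothetical layout): enumerate a
 * KPvK-like slice in exactly the order its index is defined in, so the
 * index is just a counter -- no per-position multiplications needed.   */
#include <stdint.h>
#include <stdio.h>

typedef void (*visit_fn)(uint64_t idx, int wk, int bk, int p);

static uint64_t enumerate_kpk(visit_fn visit)
{
    uint64_t idx = 0;
    for (int wk = 0; wk < 64; wk++) {
        if ((wk & 7) > 3) continue;              /* mirror wK onto files a-d   */
        for (int bk = 0; bk < 64; bk++) {
            if (bk == wk) continue;              /* legality checked elsewhere */
            for (int p = 8; p < 56; p++) {       /* white pawn on ranks 2-7    */
                if (p == wk || p == bk) continue;
                if (visit) visit(idx, wk, bk, p);
                idx++;                           /* the counter IS the index   */
            }
        }
    }
    return idx;                                  /* total number of entries    */
}

int main(void)
{
    printf("KPvK slice size: %llu entries\n",
           (unsigned long long)enumerate_kpk(NULL));
    return 0;
}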
For 6 men with no alike pieces I have 462 x 62 x 61 x 60 x 59 = 5.7G positions (with G = 2^30) and naive is 64^6 = 64G, which is a saving of about 11x.
Please read what I wrote.
I said you can't do without reducing for the kings or for identical pieces.
If we speak about 4 different pieces besides the kings, then the Koistinen proposal was 10 * 64^5.
That's quite a lot of overhead, as you can see:
in the pawnless case the Koistinen proposal is to do it in 10 * 64^5 = 10.7G entries, which is nearly a factor 2 of overhead.
Note I do it with a bit more calculation:
that's 10 * 63 * 62 * 61 * 60 * 64 = 9.1G entries.
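For reference, a quick check of the sizes being compared, printed in decimal billions (the 5.7G / 64G figures quoted above are the same counts in units of 2^30):

Code:
/* Back-of-the-envelope check of the pawnless 6-man index sizes above. */
#include <stdio.h>

int main(void)
{
    double tight     = 462.0 * 62 * 61 * 60 * 59;          /* ~6.2e9  */
    double koistinen = 10.0 * 64 * 64 * 64 * 64 * 64;      /* ~10.7e9 */
    double diep      = 10.0 * 63 * 62 * 61 * 60 * 64;      /* ~9.1e9  */
    double naive     = 64.0 * 64 * 64 * 64 * 64 * 64;      /* ~68.7e9 */

    printf("tight     %5.1fG\n", tight / 1e9);
    printf("koistinen %5.1fG  (%.2fx tight)\n", koistinen / 1e9, koistinen / tight);
    printf("diep      %5.1fG  (%.2fx tight)\n", diep / 1e9, diep / tight);
    printf("naive     %5.1fG  (%.1fx tight)\n", naive / 1e9, naive / tight);
    return 0;
}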
This doesn't eat 8GB of RAM, as you store everything in a bitmap at 1 bit per entry.
For the largest 6-men WITH PAWN you need roughly 64MB of RAM or so for this method to run well.
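A bitmap at 1 bit per entry is just this (a minimal sketch; the bit-addressing convention is illustrative, and for the big tables the bitmap itself lives on disk behind the caching system described below):

Code:
/* 1 bit per position; `index` is assumed to be the packed index. */
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t nbits;
    uint8_t *bits;
} bitmap_t;

static bitmap_t *bitmap_create(uint64_t nbits)
{
    bitmap_t *bm = malloc(sizeof *bm);
    bm->nbits = nbits;
    bm->bits  = calloc((nbits + 7) / 8, 1);     /* 1 bit per EGTB entry */
    return bm;
}

static void bitmap_set(bitmap_t *bm, uint64_t index)
{
    bm->bits[index >> 3] |= (uint8_t)(1u << (index & 7));
}

static int bitmap_get(const bitmap_t *bm, uint64_t index)
{
    return (bm->bits[index >> 3] >> (index & 7)) & 1;
}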
You do need to write a caching system then, of course, which you can also use for the program - you need that caching system anyway for your program's search, so you can reuse most of that code and/or make it generic.
If you have enough RAM you can put some of the bitmaps in a RAMDISK,
which will speed you up a tad, yet it's not really going to be that relevant; it's just a small percentage of total time, as the hard disk reads get prefetched anyway.
Be careful though about which filesystem you do this on.
Now in Nalimov format the biggest 6-men table, with all tricks included, is something like 16.1G entries or so (didn't really check).
The biggest bitmap in this unoptimized method is:
24 * 63 * 62 * 61 * 60 * 64 = 21.9G entries
So your overhead is just 21.9G / 16.1G = 1.36, i.e. about 36% extra.
That's not really much, huh?
So for the toughest-to-generate EGTBs the overhead is limited. It's a tad more for the pawnless ones though.
Realize there is another overhead: rewriting the EGTB.
With the caching system it simply doesn't matter which of the remaining pieces is the other king; of course the easiest is to take the last one.
So if we use for example KRBP vs. kq,
24 * 63 * 62 * 61 * 60 * 64 indexed as PRBqKk,
then for the next pass you have to rewrite that to PRBqkK.
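The rewrite pass itself is just a re-indexing copy. A toy illustration with plain 64^3 indexing for three pieces (x, K, k) and no reductions, reusing the bitmap helpers sketched above; the real pass would of course use the reduced index, but the pattern is the same:

Code:
/* Walk the source bitmap in xKk order, store each set bit at the matching
 * xkK index, so the next pass streams with the other king varying fastest. */
static uint64_t idx_xKk(int x, int K, int k) { return ((uint64_t)x * 64 + K) * 64 + k; }
static uint64_t idx_xkK(int x, int K, int k) { return ((uint64_t)x * 64 + k) * 64 + K; }

static void rewrite_king_order(const bitmap_t *src, bitmap_t *dst)
{
    for (int x = 0; x < 64; x++)
        for (int K = 0; K < 64; K++)
            for (int k = 0; k < 64; k++)
                if (bitmap_get(src, idx_xKk(x, K, k)))
                    bitmap_set(dst, idx_xkK(x, K, k));
}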
Koistinen's proposal, which arranges things in pages to avoid random access, is simply not needed; your caching system will already make things run at linear speed. You get the maximum bandwidth out of a hard drive when the pages you look up randomly are big enough. In fact the drive is implicitly already doing this.
However, I do not know how modern drives behave here. The 64KB pages that I use in Diep's caching system might be too small for terabyte drives. You might want to experiment with what page size you need to get the full bandwidth out of the drive.
Such benchmark software is easy to write. If I dig hard I might still have the test program I used 12 years ago for this.
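A rough version of such a benchmark could look like this (POSIX pread against a file much larger than RAM, otherwise the OS page cache flatters the numbers; this is only a sketch, not the old test program):

Code:
/* Measure random-read bandwidth for page sizes from 4KB to 4MB. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s bigfile\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    off_t fsize = lseek(fd, 0, SEEK_END);

    for (size_t page = 4096; page <= (size_t)4 << 20; page *= 4) {
        long npages = (long)(fsize / (off_t)page);
        if (npages < 2) break;
        char  *buf   = malloc(page);
        int    reads = 256;                /* random reads per page size   */
        double t0    = now_sec();
        for (int i = 0; i < reads; i++) {
            /* rand() is crude, but good enough for a rough bandwidth test */
            off_t off = (off_t)(rand() % npages) * (off_t)page;
            if (pread(fd, buf, page, off) < 0) perror("pread");
        }
        double sec = now_sec() - t0;
        printf("%8zu bytes per read: %7.1f MB/s\n", page, reads * page / sec / 1e6);
        free(buf);
    }
    close(fd);
    return 0;
}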
Be careful not to fall into the pitfall though: drives are up to 2 times faster in bandwidth at the outer edge than when they store data at the inside. This has a simple explanation, but you might want to ask HGM about that; the hardware-side explanation is more his expertise than mine.
At the time I did need to reformat my drive for this, by the way, as NTFS got messed up after a while.
The bandwidth was dominated more by NTFS than by the hardware.
Using an empty drive solved that problem for me though.
I hope you really realize by now that DOING RANDOM LOOKUPS IS NOT A PROBLEM.
If you read 64KB pages from the hard drive, then you get the full bandwidth out of the hard drive.
Did you get that message?
So if you need position 101919191 right now, you fetch from disk the 64KB page that this position is inside, and then you hope you will soon need the other positions in it. In practice this is the case.
Do you realize what I mean by a caching system now? I wrote that 10 times now, and each time you hammer on about 'random lookups'.
There simply are no random lookups to a drive if you fetch huge pages, as you then get 99.99% of the full bandwidth out of it.
There are endless ways to build up the caching system; there are many solutions. Using a FIFO system and a binary lookup tree is fastest. Note that for 6 men you can also make it O(1) with a few megabytes of RAM as a lookup table. That's what I created. The simple math is that 5T positions / 64K = 5G / 64 = a bit more than 78M pages of 64KB.
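A hedged sketch of such an O(1) cache: a flat page-to-slot array plays the role of the lookup table, a FIFO pointer picks the eviction victim, and pread/pwrite move whole 64KB pages. Names and sizes are illustrative, not Diep's actual code:

Code:
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define PAGE_BYTES (64 * 1024)
#define NO_SLOT    UINT32_MAX

typedef struct {
    int       fd;         /* file holding the table/bitmap                */
    uint64_t  npages;     /* total 64KB pages in that file                */
    uint32_t  nslots;     /* pages kept resident in RAM                   */
    uint32_t *slot_of;    /* page -> slot, NO_SLOT if not resident (O(1)) */
    uint64_t *page_in;    /* slot -> page, UINT64_MAX if slot is empty    */
    uint8_t  *dirty;      /* slot -> needs write-back before eviction     */
    uint8_t  *pool;       /* nslots * PAGE_BYTES of page storage          */
    uint32_t  fifo_next;  /* next slot to evict (plain FIFO)              */
} page_cache_t;

static void cache_init(page_cache_t *c, int fd, uint64_t npages, uint32_t nslots)
{
    c->fd = fd; c->npages = npages; c->nslots = nslots; c->fifo_next = 0;
    c->slot_of = malloc(npages * sizeof *c->slot_of);
    c->page_in = malloc((size_t)nslots * sizeof *c->page_in);
    c->dirty   = calloc(nslots, 1);
    c->pool    = malloc((size_t)nslots * PAGE_BYTES);
    for (uint64_t p = 0; p < npages; p++) c->slot_of[p] = NO_SLOT;
    for (uint32_t s = 0; s < nslots; s++) c->page_in[s] = UINT64_MAX;
}

/* Return a pointer to the in-RAM copy of `page`, loading it on a miss. */
static uint8_t *cache_page(page_cache_t *c, uint64_t page, int will_write)
{
    uint32_t slot = c->slot_of[page];
    if (slot == NO_SLOT) {                        /* miss: evict the FIFO slot */
        slot = c->fifo_next;
        c->fifo_next = (c->fifo_next + 1) % c->nslots;
        uint8_t *buf = c->pool + (size_t)slot * PAGE_BYTES;
        uint64_t old = c->page_in[slot];
        if (old != UINT64_MAX) {
            if (c->dirty[slot])
                pwrite(c->fd, buf, PAGE_BYTES, (off_t)(old * PAGE_BYTES));
            c->slot_of[old] = NO_SLOT;
        }
        pread(c->fd, buf, PAGE_BYTES, (off_t)(page * PAGE_BYTES));
        c->page_in[slot] = page;
        c->slot_of[page] = slot;
        c->dirty[slot]   = 0;
    }
    if (will_write)
        c->dirty[slot] = 1;
    return c->pool + (size_t)slot * PAGE_BYTES;
}

/* 1-bit access on top of the cache; `index` is the packed position index. */
static int cached_get(page_cache_t *c, uint64_t index)
{
    uint8_t *p = cache_page(c, index / (PAGE_BYTES * 8), 0);
    uint64_t b = index % (PAGE_BYTES * 8);
    return (p[b >> 3] >> (b & 7)) & 1;
}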
The slowest thing for me was rewriting the kK to Kk and vice versa, as you rewrite every bit there. I simply didn't optimize this well. Maybe you'll find some bitboard-fiddling trick that can do this FAST. I'd be grateful if you shared that one.
Most of that saving comes from the 8-fold symmetry. So you are right that mirroring is a must, but the rest is not. I think I will try to use his method for in-RAM generation, but that would need 8GB of RAM. My goal was to get 6 men on 32-bit computers with less than 4GB of RAM, but if I can get a lot of speed out of the generation it will be worth it.
Note that most profilers do not correctly reflect how much you would lose to RAM once the code gets optimized. Please don't forget this remark some time from now. The GCC/valgrind junk is notorious there; Intel's stuff does a tad better job.
P.S. Doing things backwards is too slow - do things forward, it's easier, faster and less bug-prone.
Even if I do it forward, the RAM access would still be random. If Koistinen's method does the 8-fold symmetry and still "congregates" the positions after we make a move, then forward may be better, but I doubt it. Also, since retro moves are almost the same as forward moves, they would still benefit from whatever tricks you have for forward moves. I really don't see how it could be faster. What I do is actually a combination of both. You can save a lot of effort by retro-moving lost positions to won, as that needs no verification at all. I also do retro moves from won positions to lost, followed by verification, which is time-consuming.
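A skeleton of that combination (all helpers such as for_each_retro_move() and all_moves_lead_to_won() are placeholders for a real generator's move and table routines, so this shows only the control flow):

Code:
#include <stdint.h>

/* Placeholders, assumed to be provided by the generator: */
extern uint64_t table_size(void);
extern void for_each_retro_move(uint64_t idx, void (*fn)(uint64_t parent));
extern int  all_moves_lead_to_won(uint64_t idx);        /* verification step */
extern int  get_won(uint64_t idx),   get_lost(uint64_t idx);
extern void set_won(uint64_t idx),   set_lost(uint64_t idx);
extern int  newly_won(uint64_t idx), newly_lost(uint64_t idx); /* this pass  */

static void promote_parent_to_won(uint64_t parent)
{
    if (!get_won(parent))
        set_won(parent);                 /* no verification needed at all     */
}

static void consider_parent_for_lost(uint64_t parent)
{
    if (!get_won(parent) && !get_lost(parent) && all_moves_lead_to_won(parent))
        set_lost(parent);                /* the verification is the slow part */
}

void retro_pass(void)
{
    uint64_t n = table_size();
    for (uint64_t i = 0; i < n; i++) {
        if (newly_lost(i))               /* parents of a lost position win    */
            for_each_retro_move(i, promote_parent_to_won);
        else if (newly_won(i))           /* parents of a won one may be lost  */
            for_each_retro_move(i, consider_parent_for_lost);
    }
}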