Zobrist keys (again)


hgm
Posts: 28359
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Zobrist keys (again)

Post by hgm »

Yes, I think Vincent misunderstood and assumed we wanted to do this in real time. Although I am not sure I agree with his reasoning that it would not be worth a few extra instructions to reduce the L1 cache load by 6KB.

The nice thing about my key-generation scheme is that it can indeed be combined with any method to generate a good set of maximally independent basis keys. But because you need fewer basis keys, presumably you can make better ones.

I am still a bit suspicious w.r.t. the rotated De Bruijn numbers; I really want to measure the dependence of such a set to see if it would be any good.
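
In case someone wants to do the same measurement: below is a minimal sketch in C of how I would quantify that dependence, by computing the GF(2) rank of the key set with Gaussian elimination. The De Bruijn constant and the rotation generator are just an example set to plug in, not necessarily the one under discussion.

/* Sketch: measure the GF(2) rank of a candidate basis-key set.
   The rotated-De-Bruijn generator is only an example; substitute
   whatever keys you actually want to test. */
#include <stdio.h>
#include <stdint.h>

static uint64_t rotl64(uint64_t x, int r)
{
    return (x << r) | (x >> ((64 - r) & 63));
}

int main(void)
{
    const uint64_t debruijn = 0x022FDD63CC95386DULL; /* a 64-bit De Bruijn sequence */
    uint64_t key[64];
    int i, j, rank = 0;

    for (i = 0; i < 64; i++)
        key[i] = rotl64(debruijn, i);

    /* Gaussian elimination over GF(2): the rank is the number of keys
       that are linearly independent under XOR (64 = no dependence). */
    for (i = 0; i < 64; i++) {
        uint64_t bit = 1ULL << i;
        int pivot = -1;
        for (j = rank; j < 64; j++)
            if (key[j] & bit) { pivot = j; break; }
        if (pivot < 0) continue;
        uint64_t t = key[rank]; key[rank] = key[pivot]; key[pivot] = t;
        for (j = 0; j < 64; j++)
            if (j != rank && (key[j] & bit))
                key[j] ^= key[rank];
        rank++;
    }
    printf("GF(2) rank of the 64 rotations: %d\n", rank);
    return 0;
}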
rjgibert
Posts: 317
Joined: Mon Jun 26, 2006 9:44 am

Re: Zobrist keys (again)

Post by rjgibert »

A long time ago, I thought about using this type of idea as a way of producing random numbers. The idea was to combine it with Gray codes to make it computationally efficient while retaining a long period.

One of the properties I was seeking with the idea was for all the random numbers to be randomly accessible, i.e. you could efficiently generate the nth random number without having to generate any of the intervening numbers. This is seldom a property of PRNGs, but it is one that can be useful for some applications.
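
For illustration only, a minimal sketch of what such random accessibility can look like: a counter-based generator where the nth value is a pure mixing function of n. The mixing steps here are taken from the well-known SplitMix64 finalizer, chosen purely as an example and not necessarily related to the Gray-code scheme I had in mind back then.

/* Sketch of a "randomly accessible" generator: the nth value is a pure
   function of n, so no intervening values need to be generated.
   The xorshift-multiply steps are SplitMix64's finalizer, used here
   only as an example mixer. */
#include <stdint.h>

uint64_t nth_random(uint64_t n)
{
    uint64_t z = n + 0x9E3779B97F4A7C15ULL;   /* offset so n = 0 does not map to 0 */
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}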

One of the observations I made about all this came from considering what happens at the bit-slice level, e.g. what happens to the first bit as you XOR all of these numbers together. What I noticed was that it worked best when exactly 50% of the numbers in the set had a 1 at bit zero and the rest had a 0 there.

The reason is that this is the only way to guarantee that bit zero flips with 50% probability, which is what you want from good random numbers. Any deviation from a 50% density makes the result at that bit more predictable.

It takes only a simple program to seed your set of numbers so that every bit slice has exactly a 50% density of ones. Such a set would probably make very good Zobrist numbers irrespective of whether "your idea" is used or not. Note that the set would otherwise be random.
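
A minimal sketch of such a seeding program (the array size and the use of rand() are just placeholders): for every bit position, set that bit in exactly half of the numbers by shuffling a balanced column of ones and zeros.

/* Sketch: generate N Zobrist keys such that every bit position is 1 in
   exactly half of them. For each bit column, shuffle N/2 ones over the
   N keys. rand() is only a placeholder PRNG; seed it as you see fit. */
#include <stdlib.h>
#include <stdint.h>

#define N 1024   /* must be even; example size only */

void seed_balanced(uint64_t key[N])
{
    int bit, i;
    for (i = 0; i < N; i++) key[i] = 0;

    for (bit = 0; bit < 64; bit++) {
        int ones[N];
        /* first half ones, second half zeros... */
        for (i = 0; i < N; i++) ones[i] = (i < N / 2);
        /* ...then a Fisher-Yates shuffle of the column */
        for (i = N - 1; i > 0; i--) {
            int j = rand() % (i + 1);
            int t = ones[i]; ones[i] = ones[j]; ones[j] = t;
        }
        for (i = 0; i < N; i++)
            if (ones[i]) key[i] |= 1ULL << bit;
    }
}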

As for whether your idea is a good one or not, I'm skeptical, because you are shrinking the table at the expense of performing more XORs. With the increased number of XOR operations, you increase the probability of a poor result; e.g. an XOR that happens to produce zero would increase the probability of a collision.

More generally, I would intuitively expect a smaller table to be worse rather than better. For example, with random number generation in general, the bigger the table the better.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Zobrist keys (again)

Post by diep »

hgm wrote:Yes, I think Vincent misunderstood and assumed we wanted to do this in real time. Although I am not sure I agree with his reasoning that it would not be worth a few extra instructions to reduce the L1 cache load by 6KB.

The nice thing about my key-generation scheme is that it can indeed be combined with any method to generate a good set of maximally independent basis keys. But because you need fewer basis keys, presumably you can make better ones.

I am still a bit suspicious w.r.t. the rotated De Bruijn numbers; I really want to measure the dependence of such a set to see if it would be any good.
Well, there is a simple processor-level reason why your plan is going to fail. A load from L1, or in maybe 0.1% of cases from L2, gets prefetched automatically on Nehalem/Phenom II.

So doing that causes zero slowdown in your program.

Now in the other case, where you replace the lookup with code, you need several instructions, and they are expensive ones. For example, suppose you do a rotate: on Core 2 there are maybe 1 or 2 ports that can run that instruction, so it is expensive there. So you lose for sure an extra 2 cycles or so executing the code.

What is also nasty in the case of Diep is that loading the instructions into the processor from L1 now takes more loads than that single load from the data cache did.

I have experimented extensively with replacing some big tables with extra code, and it was ALWAYS slower than using lookup tables.

Instructions are just like data in the L1 cache: they also need to be loaded from that cache!

More loads from a cache = bad.

The big difference is just that instructions are loaded sequentially, whereas data can be accessed randomly.

The real problem is encoding the pieces with the smallest number of instructions without using XOR, as XOR is reserved for the Zobrist hashing itself.

So all you can do is rotate, shift, move, AND and OR.

Yet for a rook on d4 you definitely want a totally different key than for a knight on d4 in the Zobrist hash.

So somehow within 1 or 2 cycles you want to really modify a lot of bits.

That eats at least 4 to 5 cycles or so, and worse (especially for Diep), it means a lot of loads from the L1i cache.

A single lookup from the L1d cache is therefore much cheaper.
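
To make the comparison concrete, here is a sketch of the two alternatives as I see them; the computed variant is only an illustration of the kind of code that would be needed, not the exact scheme being proposed.

/* The two alternatives under discussion; all names are placeholders. */
#include <stdint.h>

extern uint64_t zobrist[12][64];   /* 6KB table: one load from L1d per update */
extern uint64_t piece_basis[12];   /* tiny 96-byte basis table */

/* Variant 1: plain table lookup -- a single load from the data cache. */
static inline uint64_t key_lookup(int piece, int sq)
{
    return zobrist[piece][sq];
}

/* Variant 2: derive the key from a small basis -- far less data in L1d,
   but an extra rotate, i.e. extra instructions that themselves have to
   be fetched from the L1i cache. */
static inline uint64_t key_computed(int piece, int sq)
{
    uint64_t b = piece_basis[piece];
    return (b << sq) | (b >> ((64 - sq) & 63));
}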

Best regards,
Vincent
hgm
Posts: 28359
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Zobrist keys (again)

Post by hgm »

diep wrote:
Well, there is a simple processor-level reason why your plan is going to fail. A load from L1, or in maybe 0.1% of cases from L2, gets prefetched automatically on Nehalem/Phenom II.

So doing that causes zero slowdown in your program.

Now in the other case, where you replace the lookup with code, you need several instructions, and they are expensive ones. For example, suppose you do a rotate: on Core 2 there are maybe 1 or 2 ports that can run that instruction, so it is expensive there. So you lose for sure an extra 2 cycles or so executing the code.
This is why all the plans presented here do a simple load from L1, from a table, with no extra instructions. We are only discussing how to pre-compute that table...
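
For instance, a minimal sketch of such a pre-computation, done once at startup; the combination rule here (rotating a per-piece basis key by the square number) is just an example, and any rule for deriving the table from basis keys can be plugged in instead. During the search, only the plain table lookup remains.

/* Sketch: fill the full Zobrist table once, at program start, from a
   small set of basis keys. The rotate-by-square rule is an example only. */
#include <stdint.h>

uint64_t zobrist[12][64];

static uint64_t rotl64(uint64_t x, int r)
{
    return (x << r) | (x >> ((64 - r) & 63));
}

void init_zobrist(const uint64_t piece_basis[12])
{
    for (int piece = 0; piece < 12; piece++)
        for (int sq = 0; sq < 64; sq++)
            zobrist[piece][sq] = rotl64(piece_basis[piece], sq);
}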