The original question was a caching question, to not get outside of the L1. Focus upon that. Changing the discussion is not what wise men would do.

Don wrote:
You missed the point. We have already discussed using two small tables and have concluded that adding them together might work just fine. Then it was suggested that we might find a way to completely avoid any table lookups, and we wondered if there was a fast way to do that. Not because we don't think we should, but because we wondered if we could.

diep wrote:
Hi, just try it.

Daniel Shawul wrote:
Hi Vincent,
We can't use a lookup table for sq and piece since we are trying to avoid that. Rotation of a magic number was a substitute for that in this case. This will probably not be a good cryptographic hash, but for Zobrist it might work, since xor changes the hash key based on adjacent bits. The problem is that there are too many squares (1200), so we need many magic numbers. But the rotation reduces the table size by 64 times. This was mentioned somewhere in this thread as a way to reduce the table size, but can it be used to completely avoid the table altogether? I will see what can be used from RanRot for this purpose.
Thanks
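To make the rotation idea concrete, here is a minimal sketch of how I read it (rotl64, magic[] and the array size are placeholder names and numbers of mine, not from any engine): keep one random 64-bit magic per piece type and derive the per-square key by rotating it by the square index, which is where the factor-of-64 table shrink comes from.

Code:
#include <stdint.h>

/* hypothetical: one random 64-bit magic per piece type
   (12 is just the standard chess piece count as an example) */
static uint64_t magic[12];

/* rotate left by r bits, r in 0..63 (r == 0 handled separately
   to avoid the undefined 64-bit shift) */
static inline uint64_t rotl64(uint64_t x, unsigned r)
{
    return r ? (x << r) | (x >> (64 - r)) : x;
}

/* derived key for (piece, sq): the rotation stands in for the
   64 per-square entries, so the table is 64 times smaller */
static inline uint64_t zobrist_key(int piece, int sq)
{
    return rotl64(magic[piece], (unsigned)sq);
}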
I'm rotating the sq random number by a small prime looked up for the piece, and vice versa. Maybe you missed this detail.
And then we add the two results up after the 2 rotations.
So effectively we get the bit result of a multiplication in a field.
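In code that could look roughly like this (a sketch under my own naming; the table contents and prime values are assumptions, only the rotate-by-looked-up-prime-and-add structure is taken from the description above):

Code:
#include <stdint.h>

/* hypothetical small tables: one random number per square and per
   piece, plus a small prime per piece and per square that is used
   as the rotate count for the other table's entry */
static uint64_t rnd_sq[64], rnd_piece[12];
static uint8_t  prime_for_piece[12], prime_for_sq[64];  /* 1..63 */

static inline uint64_t rotl64(uint64_t x, unsigned r)
{
    return r ? (x << r) | (x >> (64 - r)) : x;
}

/* rotate the square random by the piece's prime and the piece
   random by the square's prime, then add the two rotated values */
static inline uint64_t zobrist_key(int piece, int sq)
{
    return rotl64(rnd_sq[sq],       prime_for_piece[piece])
         + rotl64(rnd_piece[piece], prime_for_sq[sq]);
}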
HGM wanted to avoid a 5 MB lookup to generate his Zobrist key, and this definitely achieves that with tiny lookup tables of just a few kilobytes.
HGM wanted to avoid looking up a double array, not a single array.
The double array is [piece][sq];
this, however, uses only single lookups of [piece] and [sq].
We CAN look up piece and sq in a table,
we just can't afford a table sized sq * piece, that's all.
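For concreteness, here is the size difference being argued about, using plain chess numbers (12 piece types, 64 squares) as an example; a big-board variant presumably blows the double array up to the megabytes mentioned earlier, while the two single arrays stay a few hundred bytes:

Code:
/* double array: one key per (piece, square) pair */
uint64_t zobrist_full[12][64];   /* 12 * 64 * 8 bytes = 6 KB for chess;
                                    megabytes for large variant boards */

/* two single arrays: combined on the fly as sketched above */
uint64_t zobrist_piece[12];      /*  96 bytes */
uint64_t zobrist_sq[64];         /* 512 bytes */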
Bottom line is that what I posted here is factors faster than the multiplication you proposed.
As the lookups probably can get prefetched, what I posted takes under 2 clocks and keeps just 3 execution units busy for 1 cycle.
So the throughput latency is 1 cycle, not counting the L1 prefetch.
What you posted is maybe 5 cycles throughput or so?
Using multiplication and smaller tables then isn't very clever: today's CPUs have magnificent caches, so the multiplication will just slow things down too much, as it's 5 times slower or so than using small lookup tables.
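The speed claim is easy to sanity-check with a toy loop. A rough benchmark sketch follows; the harness is mine, and key_mul is only my guess at what a multiply-based key could look like, not necessarily the proposal being compared against, so treat the numbers as indicative at best:

Code:
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static uint64_t rnd_sq[64], rnd_piece[12];
static uint8_t  prime_for_piece[12], prime_for_sq[64];

static inline uint64_t rotl64(uint64_t x, unsigned r)
{
    return r ? (x << r) | (x >> (64 - r)) : x;
}

/* small-table lookup + rotate + add, as in the scheme above */
static inline uint64_t key_lookup(int piece, int sq)
{
    return rotl64(rnd_sq[sq],       prime_for_piece[piece])
         + rotl64(rnd_piece[piece], prime_for_sq[sq]);
}

/* guessed multiply-based alternative: combined index times a constant */
static inline uint64_t key_mul(int piece, int sq)
{
    return (uint64_t)(piece * 64 + sq) * 0x9E3779B97F4A7C15ULL;
}

int main(void)
{
    /* fill the tables with something deterministic; a real engine
       would use a decent PRNG here */
    for (int i = 0; i < 64; i++) {
        rnd_sq[i]       = 0x0123456789ABCDEFULL * (i + 1);
        prime_for_sq[i] = (uint8_t)(3 + 2 * (i % 29));
    }
    for (int i = 0; i < 12; i++) {
        rnd_piece[i]       = 0xFEDCBA9876543210ULL * (i + 1);
        prime_for_piece[i] = (uint8_t)(5 + 2 * (i % 13));
    }

    volatile uint64_t sink = 0;
    clock_t t0 = clock();
    for (int n = 0; n < 100000000; n++)
        sink ^= key_lookup(n % 12, n % 64);
    clock_t t1 = clock();
    for (int n = 0; n < 100000000; n++)
        sink ^= key_mul(n % 12, n % 64);
    clock_t t2 = clock();

    printf("lookup+rotate: %.2fs  multiply: %.2fs\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}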