question about symmertic evaluation

Gerd Isenberg · Post by **Gerd Isenberg** » Thu May 24, 2007 10:33 pm

bob wrote: However, when a program spends > 50% of the total search time inside the evaluation, 2-3 divides don't even show up in the timing and don't affect my NPS at all. Verified by lots of testing of course...

I thought Crafty was in a ~1000 cycles range per node.
Thus 2-3 times 42 cycles don't looks negligible on a K8, where vector path idiv almost disables all parallel execution. So i guess you divide by constants, where the mentioned reciprocal or shift algorithms are applied by the compiler or code-generator.
http://swox.com/~tege/divcnst-pldi94.pdf

Gerd Isenberg · Post by **Gerd Isenberg** » Thu May 24, 2007 10:55 pm

bob wrote: I actually don't know how (nor do I care) the compiler handles a/b for ints. All I care about is symmetrical answers, which >> won't give directly. Whatever it does the cost inside Crafty is nil. And I am not dividing by a power of 2 either. I have this in a couple of key places:

((s) * (62 - Min(TotalWhitePieces + TotalBlackPieces, 42)) / 86)

which is not exactly a power of 2...

You would care if really an idiv instruction would be generated

But i agree that using div-operator "/" as C-instruction to divide by constant is fine and good for symmetrical scaling - since compiler are smart enough.

Here the output how to divide by 86 and the source from AMD64 optimization manual:

Code: Select all

Signed division by constant
===========================

enter divisor&#58; 86

; dividend&#58; memory location or register other than EAX or EDX

MOV EAX, 02FA0BE83h
IMUL dividend
MOV EAX, dividend
SAR EDX, 4
SHR EAX, 31
ADD EDX, EAX

; quotient now in EDX

Code: Select all

/* This program determines the algorithm &#40;a&#41;, multiplier &#40;m&#41;, and
shift factor &#40;s&#41; to be used to accomplish *signed* division by
a constant divisor. Compile with MSVC.
*/
#include <stdio.h>

typedef unsigned __int64 U64;
typedef unsigned long U32;


U32 log2&#40;U32 i&#41;
&#123;
	U32 t = 0;
	i = i >> 1;
	while &#40;i&#41; &#123;
		i = i >> 1;
		t++;
	&#125;
	return&#40;t&#41;;
&#125;


int main&#40;void&#41;
&#123;
	long e;
	U32 d, l, s, m, a;
	U64 m_low, m_high, j, k;
	fprintf&#40;stderr, "\n");
	fprintf&#40;stderr, "Signed division by constant\n");
	fprintf&#40;stderr, "===========================\n\n");
	fprintf&#40;stderr, "enter divisor&#58; ");
	scanf&#40;"%ld", &d&#41;;
	fprintf&#40;stderr, "\n");
	e = d;
	if ( e < 0 ) d = -d;
	if &#40;d == 0&#41; goto printed_code;
	if &#40;e == (-1&#41;) &#123;
		printf&#40;"; dividend&#58; register or memory location\n");
		printf&#40;"\n");
		printf&#40;"NEG dividend\n");
		printf&#40;"\n");
		printf&#40;"; quotient replaced dividend\n");
		goto printed_code;
	&#125;
	if &#40;d == 2&#41; &#123;
		printf&#40;"; dividend expected in EAX\n");
		printf&#40;"\n");
		printf&#40;"CMP EAX, 080000000h\n");
		printf&#40;"SBB EAX, -1\n");
		printf&#40;"SAR EAX, 1\n");
		if &#40;e < 0&#41; printf&#40;"NEG EAX\n");
		printf&#40;"\n");
		printf&#40;"; quotient now in EAX\n");
		goto printed_code;
	&#125;
	if (!&#40;d & &#40;d - 1&#41;)) &#123;
		printf&#40;"; dividend expected in EAX\n");
		printf&#40;"\n");
		printf&#40;"CDQ\n");
		printf&#40;"AND EDX, 0%08lXh\n", &#40;d-1&#41;);
		printf&#40;"ADD EAX, EDX\n");
		if &#40;log2&#40;d&#41;) printf&#40;"SAR EAX, %d\n", log2&#40;d&#41;);
		if &#40;e < 0&#41; printf&#40;"NEG EAX\n");
		printf&#40;"\n");
		printf&#40;"; quotient now in EAX\n");
		goto printed_code;
	&#125;

	/* Determine algorithm &#40;a&#41;, multiplier &#40;m&#41;, and shift factor &#40;s&#41; for 32-bit
	signed integer division. Based on&#58; Granlund, T.; Montgomery, P.L.&#58;
	"Division by Invariant Integers using Multiplication". SIGPLAN Notices,
	Vol. 29, June 1994, page 61.
	*/

	l = log2&#40;d&#41;;
	j = ((&#40;U64&#41;&#40;0x80000000&#41;) % (&#40;U64&#41;&#40;d&#41;));
	k = ((&#40;U64&#41;&#40;1&#41;) << &#40;32 + l&#41;) / (&#40;U64&#41;&#40;0x80000000 - j&#41;);
	m_low = ((&#40;U64&#41;&#40;1&#41;) << &#40;32 + l&#41;) / d;
	m_high = (((&#40;U64&#41;&#40;1&#41;) << &#40;32 + l&#41;) + k&#41; / d;
	while ((&#40;m_low >> 1&#41; < &#40;m_high >> 1&#41;) && &#40;l > 0&#41;) &#123;
		m_low = m_low >> 1;
		m_high = m_high >> 1;
		l = l - 1;
	&#125;
	m = (&#40;U32&#41;&#40;m_high&#41;);
	s = l;
	a = &#40;m_high >> 31&#41; ? 1 &#58; 0;

	if &#40;a&#41; &#123;
		printf&#40;"; dividend&#58; memory location or register other than EAX or EDX\n");
		printf&#40;"\n");
		printf&#40;"MOV EAX, 0%08LXh\n", m&#41;;
		printf&#40;"IMUL dividend\n");
		printf&#40;"MOV EAX, dividend\n");
		printf&#40;"ADD EDX, EAX\n");
		if &#40;s&#41; printf&#40;"SAR EDX, %d\n", s&#41;;
		printf&#40;"SHR EAX, 31\n");
		printf&#40;"ADD EDX, EAX\n");
		if &#40;e < 0&#41; printf&#40;"NEG EDX\n");
		printf&#40;"\n");
		printf&#40;"; quotient now in EDX\n");
	&#125;

	else &#123;
		printf&#40;"; dividend&#58; memory location or register other than EAX or EDX\n");
		printf&#40;"\n");
		printf&#40;"MOV EAX, 0%08LXh\n", m&#41;;
		printf&#40;"IMUL dividend\n");
		printf&#40;"MOV EAX, dividend\n");
		if &#40;s&#41; printf&#40;"SAR EDX, %d\n", s&#41;;
		printf&#40;"SHR EAX, 31\n");
		printf&#40;"ADD EDX, EAX\n");
		if &#40;e < 0&#41; printf&#40;"NEG EDX\n");
		printf&#40;"\n");
		printf&#40;"; quotient now in EDX\n");
	&#125;
	printed_code&#58;
	fprintf&#40;stderr, "\n");
	return 0;
&#125;

Uri Blass · Post by **Uri Blass** » Fri May 25, 2007 12:44 pm

My mistake is that I simply did not know that using / has different meaning than using >>

I thought that it is simply the same and this is the reason that I even did not think about using / for getting symmetric evaluation.

In case of knowing that / make things symmetric I could simply use it without asking questions because I had no reason to believe that / can be significantly slower than >>(I trust the compiler to make good work for every function).

I have some /24 in my code and I never cared about speed of these cases
I used >>3 instead of /8 simply because I thought it is the same and did not care if I write >>3 and not /8

Uri

Gerd Isenberg · Post by **Gerd Isenberg** » Fri May 25, 2007 8:28 pm

Uri Blass wrote:My mistake is that I simply did not know that using / has different meaning than using >>

I thought that it is simply the same and this is the reason that I even did not think about using / for getting symmetric evaluation.

In case of knowing that / make things symmetric I could simply use it without asking questions because I had no reason to believe that / can be significantly slower than >>(I trust the compiler to make good work for every function).

I have some /24 in my code and I never cared about speed of these cases
I used >>3 instead of /8 simply because I thought it is the same and did not care if I write >>3 and not /8

Uri

Even worse - in C arithmetical shift with signed int is not specified (even twos-complement might not be used for signed int types!). It might be compiler and/or target architecture depending.

With todays processors and compilers I guess this expression is almost ever true:

assert (-1 >> 1 == ~1 + 1)

If some rare systems have only logical shift right, so that always zeros are shifted in from left to right, one has to use something like this (adapted from http://swox.com/~tege/divcnst-pldi94.pdf):

Code: Select all

int shiftArithmeticalRight&#40;int x, int s&#41;
&#123;
   unsigned int u = x ^ 0x80000000;
   return &#40;u >> s&#41; - &#40;1 << &#40;s^31&#41;);
&#125;

with K > 0

Code: Select all

-K >> i == -K / 2**i

is only true, if the division goes without remainder, that is the i least significant bits are zero.
Otherwise shift rounds to -oo, while idiv truncates toward zero.

Many programmers are aware of some tricks to avoid very expensive instructions like division or modulo, eg. for making hash indices by "and" for power of two sized tables instead of modulo table size. Same for shift versus division by power of two divisors with unsigned values.

In the meantime recent compiler aka code generators will likely use similar algorithms like I posted from the amd-manual - to replace division by invariant integers with reciprocal imul or sar.

Therefor it is recommend to use "/" by constants rather than own shift- or whatever tricks. But i think it does not hurt to be aware of what the compiler does - and to be aware of div/idiv is really expensive - to better avoid variable divisors in critical code.

Btw. K10 idiv-latency is 23 + number of significant bits in absolute value of the dividend. 32*32=64bit imul still takes 3 cycles.

Gerd

Uri Blass · Post by **Uri Blass** » Fri May 25, 2007 9:42 pm

I can add that I also have a division by a variable in movei

The reason is that I want to allow the user to change it by editing a personality changes file so it changes the variable only in the
start of the program in case that it reads from the changes files that it needs to do it.

I do not like to change the code every time that I want to test changing the dividing factor.

Maybe from speed point of view it is better to divide by a big constant and to multiply by a variable that the user can change.

Uri

Bill Rogers · Post by **Bill Rogers** » Fri May 25, 2007 9:48 pm

Uri
Although I don't know as much about compter chess as you do I use a simple solution for my evaluation routine. I evaulate both sides as a positive number, but if the side is black I just negate the value at the end.
It save much programming space and time to do it that way for me.
Bill

Gerd Isenberg · Post by **Gerd Isenberg** » Fri May 25, 2007 10:00 pm

Uri Blass wrote:I can add that I also have a division by a variable in movei

The reason is that I want to allow the user to change it by editing a personality changes file so it changes the variable only in the
start of the program in case that it reads from the changes files that it needs to do it.

I do not like to change the code every time that I want to test changing the dividing factor.

Maybe from speed point of view it is better to divide by a big constant and to multiply by a variable that the user can change.

Uri

Even 50 cycles doesn't take a second but 25 nanos on a 2GHz machine

Only once at startup doesn't matter at all.
But avoid it n times each node inside your search.

bob · Post by **bob** » Sat May 26, 2007 3:05 am

Gerd Isenberg wrote:
bob wrote: However, when a program spends > 50% of the total search time inside the evaluation, 2-3 divides don't even show up in the timing and don't affect my NPS at all. Verified by lots of testing of course...
I thought Crafty was in a ~1000 cycles range per node.
Thus 2-3 times 42 cycles don't looks negligible on a K8, where vector path idiv almost disables all parallel execution. So i guess you divide by constants, where the mentioned reciprocal or shift algorithms are applied by the compiler or code-generator.
http://swox.com/~tege/divcnst-pldi94.pdf

I don't think so. I am searching about 1.7M nps on my core-2 duo at 2.0ghz. I guess I could use the hardware counters to count the number of instructions, but it has to be way more than 1000 instructions per node. Last time I did use hardware it was around 2700 instructions per node but that was probably 8-9 years ago and the new version has slowed down as we have added some eval stuff...

I looked at an old run where I divided by a variable (when I had variable scaling for endgame terms) and there was no difference in NPS....

I've seen others doing the same thing with the hash signature where it is modulo the hash table size to get the table index, and they apparently didn't feel it made enough difference compared with the AND and power-of-2 size limitation.

Gerd Isenberg · Post by **Gerd Isenberg** » Sat May 26, 2007 9:18 am

bob wrote:I don't think so. I am searching about 1.7M nps on my core-2 duo at 2.0ghz. I guess I could use the hardware counters to count the number of instructions, but it has to be way more than 1000 instructions per node. Last time I did use hardware it was around 2700 instructions per node but that was probably 8-9 years ago and the new version has slowed down as we have added some eval stuff...

I looked at an old run where I divided by a variable (when I had variable scaling for endgame terms) and there was no difference in NPS....

I've seen others doing the same thing with the hash signature where it is modulo the hash table size to get the table index, and they apparently didn't feel it made enough difference compared with the AND and power-of-2 size limitation.

1.7MNode per second and 2GHz on a core-2 duo - ahh yes, two processors. So I was at least factor two wrong with my 1000 cycles per node guess - sorry.

1E9 ns / 1.7E6 = 588 ns per node = 1176 cycles per node on two cores, which is about 2352 cycles per node and processor.

One idiv latency is still 1-2% of that, but probably this latency hides some memory stalls or whatever and/or schedules partly with some other instructions around on a core 2 duo. Also, you may not divide at every node you count. Empirical evidence is always right.

Uri Blass · Post by **Uri Blass** » Sat May 26, 2007 1:38 pm

Bill Rogers wrote:Uri
Although I don't know as much about compter chess as you do I use a simple solution for my evaluation routine. I evaulate both sides as a positive number, but if the side is black I just negate the value at the end.
It save much programming space and time to do it that way for me.
Bill

Does it mean that you can give bonus for a pawn that is not isolated
instead of reducing the score for isolated pawns or maybe isolated pawns for white is part of the score for black?

It is possible to do things different with only positive numbers when you substract only at the end of the evaluation but I am not sure if it is more simple.

Uri

question about symmertic evaluation

Re: question about symmertic evaluation

Re: question about symmertic evaluation

Re: question about symmertic evaluation

Re: question about symmertic evaluation

Re: question about symmertic evaluation

Re: question about symmertic evaluation

Re: question about symmertic evaluation

Re: question about symmertic evaluation

Re: question about symmertic evaluation

Re: question about symmertic evaluation