Carnivor, another unheard of chess engine!

Michael Sherwin · Post by **Michael Sherwin** » Wed Jun 27, 2007 5:28 am

hgm wrote:No hard feelings! In fact our posts crossed.

Perhaps you could try my perft on your machine for fairer comparison.

Hi HG,

I was going to run/compare your perft on my machine, but after looking at your source code, I realize that any comparison would be meaningless. Your code is just a perft example and not equivalent to the perft of a real chess program. The biggest reason why is that your make/unmake are inline inside of the perft function itself and do not keep tract of such things as the 50 move rule. My make/unmake are all set up as though they are for a real chess program. They also keep track of, on the fly, other things like the material for each side and the positional accumulaters for the piece-square tables. And other stuff. It is apples and oranges!

Mike

cooldalek · Post by **cooldalek** » Wed Jun 27, 2007 8:11 am

Terry McCracken wrote:
hgm wrote:How weak is weak?

I would only be interested if it is below 1200 Elo... (and if it is Winboard or UCI!)

Why do you want a really weak engine?

I want weak engines that I can play against (as opposed to a grandmaster engine that makes one blunder a game when set at a lower level)

hgm · Post by **hgm** » Wed Jun 27, 2007 8:59 am

Michael Sherwin wrote: The biggest reason why is that your make/unmake are inline inside of the perft function itself and do not keep tract of such things as the 50 move rule. My make/unmake are all set up as though they are for a real chess program. They also keep track of, on the fly, other things like the material for each side and the positional accumulaters for the piece-square tables. And other stuff. It is apples and oranges!

Well, they are in-lined that way in the search of Joker. Why would I not in-line them? They are not called from anywhere else, and they draw heavily on information that is available for free in the move loop, but would have to be passed on to them otherwise. And you should never keep track of the 50-move rule in an engine anyway...

Material balance belongs in the evaluation, and evaluation can be arbitrarily complicated, and usually takes most time. There is no way guessing which fraction of the evaluation engines do differentially, and which not. So comparing perfts of programs that partly do evaluation is meaningless as well. You should either compare them with full evaluation, to get a measure for the speed of the Chess engine, or switch evaluation completely off to measure the speed of the move-generation / board representation.

There is no limit to what you could include, if you include everything you might as well just look at the nps of your engine, and not have any special perft function at all. What do your numbers include? Hash-key updating? Hash probing, although there is no hashing? Move sorting? Minimax score updating? Rep-draw detection?

If all differential updating you do is material composition and piece-square points, I don't think the amount of time needed for that would be very significant: it seems comparable in work to making the count breakdown by move type, which makes the difference between 77 and 84 sec on the Core 2 Duo. It would not involve any conditional branches.

It would still be interesting, though, to have the numbers on your machine, to get a clock-for-clock comparison of Pentium 4, Athlon and Core 2 Duo, even if they cannot be used to compare move generation schemes.

Michael Sherwin · Post by **Michael Sherwin** » Wed Jun 27, 2007 1:36 pm

Compiled with MSVC 6 default 'release' compiler switches.

Athlon 1500+
5) 0.313
6) 7.672
7) 201.453

Athlon 3400+
5) 0.141
6) 3.672
7) 97.922

P4 2GHZ Laptop
5) 0.380
6) 9.183
7) 154.623 Edit: this time is correct

486 DX4 100MHZ Laptop
5) 39.600

hgm · Post by **hgm** » Wed Jun 27, 2007 3:30 pm

Remarkable...

I suppose this was for the version that specifies captures, castlings, etc., that took 199 sec on my 2.8 GHz P4.

For perft(6) the laptop is nearly as fast as my 1.4 times faster P4 desktop (which is already strange, but can perhaps be explained by a better compiler). But perft(7) then it is actually much faster.

I did not like the large difference I got between the version that was giving the count break-down vs the one that only calculated total count. (199 vs 125 sec, a ridiculous 60% slowdown due to counting, that only occurred on the P4; on the Core 2 Duo the counting only gave a 9% slowdown.) This suggests there is something very sick going on with the translation of a few counter increments.

Could you delete all the conditional counting after the count++; near the recursive perft call,

Code: Select all

      /* recursion or count end leaf */
            if(depth>1) perft(COLOR-color, stack[i], depth-1, d+1);
            else
            {   count++; if(to!=capt) epcnt++; if(victim!=DUMMY)xcnt++;
                if((unsigned int)stack[i]>0xA2000000) cascnt++;
                if(mode==0xA1) promcnt++;
            }

and time how long it takes then? I have the feeling that the poor performance I have on P4 here is due to heavy branch misprediction caused by poor translation (gcc is not using conditional moves here). Your compiler might do it differently, and thus give much more efficient code.

That the speed advantage does not show up with the shallower perfts might be because there is some power saving trick in the laptop, that only switches the CPU to full speed after a few seconds of no idle time.

Anyway, the Athlon 3400+ seems way faster than my 2.8GHz P4. It is 1.57 times faster than your 2GHz P4 laptop, which translates to an effective clock performance rating of 3157. Not so far from 3400...

I guess that compiler influence is too big to compare your timings with my timings. On the P4 things behave extremely erratic. Without the count breakdown I measure 6.7 sec / 124.7 sec (perft(6) / perft(7)) on the 2.8GHz P4, after compilation with 'gcc -O2'. When I compile with 'gcc -O3' (which is supposed to be better...) this goes to 6.8 sec / 182.5 sec. Almost no change on the perft(6), and an enormous slowdown of the perft(7)! When I then compile with 'gcc -O2 -mno-cygwin', which is supposed to give no other change than using native Windows I/O calls in stead of those from the cygwin dll, I get 4.3 sec / 169.5 sec. An enormous speedup of the perft(6) compared to using cygwin.dll, but an enormous slowdown for the perft(7).

Anyway, your timing on the Athlon corresponds to 33M nps.

Michael Sherwin · Post by **Michael Sherwin** » Wed Jun 27, 2007 10:28 pm

Could you delete all the conditional counting after the count++; near the recursive perft call, ... and time how long it takes then?

Athlon 3400+
5) 0.140
6) 3.469
7) 92.219

P4 2GHZ Laptop
5) 0.380
6) 9.173
7) 150.817

before delete:

Code: Select all

$L516:

; 687  :             {   count++; if(to!=capt) epcnt++; if(victim!=DUMMY)xcnt++;

	mov	edx, DWORD PTR _count
	mov	ecx, DWORD PTR _to$[esp+60]
	inc	edx
	mov	DWORD PTR _count, edx
	mov	edx, DWORD PTR _capt$[esp+56]
	cmp	ecx, edx
	je	SHORT $L518
	inc	DWORD PTR _epcnt
$L518:
	mov	edx, DWORD PTR _victim$[esp+60]
	cmp	edx, 159				; 0000009fH
	je	SHORT $L519
	inc	DWORD PTR _xcnt
$L519:

; 688  :                 if((unsigned int)stack[i]>0xA2000000) cascnt++;

	cmp	DWORD PTR [ebp], -1577058304		; a2000000H
	jbe	SHORT $L521
	inc	DWORD PTR _cascnt
$L521:

; 689  :                 if(mode==0xA1) promcnt++;

	cmp	edi, 161				; 000000a1H
	jne	$L522
	inc	DWORD PTR _promcnt
	jmp	$L522

hgm · Post by **hgm** » Wed Jun 27, 2007 11:36 pm

OK, thanks. So on the Athlon (like on the Core 2 Duo) the counting code also doesn't take too long, despite the fact that is doesn't use conditional moves but branches over an increment instruction (92 sec in stead of 97 sec, where the C2D went to 77 from 84). So the Core 2 Duo is only 20% faster than the equally clocked Athlon 64 3400+, in this task, which does not exercise the strong suite of the Athlon at all: its on-die memory controller. This whole perft runs entirely from the cache.

I am still puzzled about the Pentium 4 results. In particular that the nps seems to alter the opposite way between different compiles for perft(6) and perft(7). It must have something to do with cache collisions. The P4 is probably very sensitive to that, as it has only 8KB of L1 cache. So I can imagine that at some depth you get unlucky, and the move stack, system stack and crucial static variables all map to the same cache entries, flushing each other out all the time. As the top of the move stack and the the system stack move through memory at a different rate with ply depth, and the static data of course doen't move at all, there will be some depth were the two stacks meet (modulo cache size). In an unlucky compile / link this might collide with the most frequently used static variables (such as board and piece list) as well, causing heavy cache trashing at that level of the tree.

This might be a reason why Joker performs so poorly in George Lyapko's leagues: he is running on a P4! I will see if I can make a better compile, or merge the system and move stacks somehow. Or at least synchronize their motion through memory, to prevent cache collisions.

Bill Rogers · Post by **Bill Rogers** » Thu Jun 28, 2007 5:41 am

Hi Michael
I would like a copy of carnivor too if you get the time.
You can reach me through my profile.
Thanks
Bill

hgm · Post by **hgm** » Thu Jun 28, 2007 12:54 pm

The strange behaviour on the P4 definitely has to do something with memory layout and resulting cache collissions.

I now made a version of the slow perft that declares all global variables in a single array of 1.6KB, to which I than equivalence all the variables used in the program by define statements. So something like

Code: Select all

int RawData[N];

#define count RawData[0]
#define xcnt  RawData[1]
#define board ((char *) (RawData+2))

in stead of

Code: Select all

int count, xcnt;

char board[128];

so that I have precise control over the order of variables in memory. (I guess a cleaner way to do this would be to define them all as components of asingle large static struct, but in an existing program I would then have to change every reference to these variables, so I chose for the '#define hack'.) By mapping them all into a contiguous array smaller than 2KB, it is guaranteed that they cannot collide in the L1 cache. (The P4 has a 4-way 8KB L1 data cache, so each way is 2KB.)

The code should not change a single instruction by this, the only change is that is uses different addresses for the data.

This brings the time of the perft(7) down from 199 sec to 169 sec (and of perft(6) from 8.8 to 7.7 sec).

hgm · Post by **hgm** » Thu Jun 28, 2007 2:53 pm

Funny thing is that now the compilation with 'gcc -O2 -mno-cygwin' is the fastest, where before it was much slower than 'gcc -O2' (at least, for perft(7)).

It seems that the 'manual' allocation of global memory that solved the cache conflict now removed the depth-dependence of the efficiency, so that I can have best performance at any depth with the same compile.

Code: Select all

$ ./spertft 7
pc = 406040
 - - - - - - - - - - - -
 - - - - - - - - - - - -
 - - R N B Q K B N R - -
 - - P P P P P P P P - -
 - - . . . . . . . . - -
 - - . . . . . . . . - -
 - - . . . . . . . . - -
 - - . . . . . . . . - -
 - - p p p p p p p p - -
 - - r n b q k b n r - -
 - - - - - - - - - - - -
 - - - - - - - - - - - -

Slow Perft by H.G. Muller
Perft mode: No hashing, make/unmake last ply

perft(1)=        20, x=         0, ep=     0, Q=     0, OO=     0 ( 0.000 sec)

perft(2)=       400, x=         0, ep=     0, Q=     0, OO=     0 ( 0.000 sec)

perft(3)=      8902, x=         0, ep=     0, Q=     0, OO=     0 ( 0.000 sec)

perft(4)=    197281, x=         0, ep=     0, Q=     0, OO=     0 ( 0.000 sec)

perft(5)=   4865609, x=         0, ep=     0, Q=     0, OO=     0 ( 0.172 sec)

perft(6)= 119060324, x=         0, ep=     0, Q=     0, OO=     0 ( 4.203 sec)

perft(7)=3195901860, x=         0, ep=     0, Q=     0, OO=     0 (122.235 sec)

The sizable difference between the compile with or without requirement of cygwin support is probably explained by differences in starting adress of the code, which might cause padding in the linking stage (due to .p2align directives) with different numbers of NOP instructions in some time-critical code loops.

The 122 sec is for the version without count specification. When I do count captures etc., it goes to 133 sec. I tried the original way of counting, with if-statements, and a branch-free way using boolean expressions, like:

Code: Select all

xcnt += (victim != DUMMY);

I verified in the compiler-produced assembler that the latter generates branch-free code, using SETNE instructions. But I see no timing difference due to this!

Carnivor, another unheard of chess engine!

Re: Carnivor, another unheard of chess engine!

Re: Carnivor, another unheard of chess engine!

Re: Carnivor, another unheard of chess engine!

Re: Carnivor, another unheard of chess engine!

Re: Carnivor, another unheard of chess engine!

Re: Carnivor, another unheard of chess engine!

Re: Carnivor, another unheard of chess engine!

Re: Carnivor, another unheard of chess engine!

Re: Carnivor, another unheard of chess engine!

Re: Carnivor, another unheard of chess engine!