Page 3 of 4

Re: Back To The Beginning

Posted: Mon Sep 09, 2019 7:09 am
by Joost Buijs
fabianVDW wrote: Sun Sep 08, 2019 8:10 pm
fabianVDW wrote: Sat Sep 07, 2019 6:55 pm
Joost Buijs wrote: Sat Sep 07, 2019 12:57 pm
fabianVDW wrote: Sat Sep 07, 2019 12:21 pm AFAIK, movegeneration speed is a good chunk of the runtime in modern engines. Atleast in FabChess, it is about 25-30% of the time used, being on par with the evaluation function (IIRC).
25% to 30% of the time used by the move generation is quite a lot. Maybe it depends upon staged move generation or not and whether you execute the hash move with or without move generation, when I profile my engine move generation takes considerably less than 10% of the total time used.
I also do staged movegen in FabChess. Maybe I remembered the numbers wrong... I will do another profiling session tomorrow.
Okay, I did some profiling. First of all, Fab runs perft 6 on startpos with about 80MNPS (with bulk counting) on my I5-6400. During search, movegen takes up about 5.5% of the time, but it uses a struct which contains all the pseudo legal moves for every piece. This struct is later reused in eval for mobility etc. Calculation of that struct is about 15% of the time.

I am sure my engine is badly tuned in that regard aswell, as I seem not to be able to recognize bottlenecks well
This is more or less comparable to what I see over here, move-gen takes roughly 4% of the time, I calculate the squares attacked by each piece (for both colors) on the fly in the evaluation function, when in check I have a separate routine to calculate the squares attacked by the opponent because my move-gen needs this.

The most time consuming routines in my engine are: Eval(), TT-probe() and See(). I always wonder why probing the transposition table is rather slow, it is a table with four 16 byte entries per bucket, I never managed to get it any faster. Prefetching doesn't help either, with SMP it even slows the engine down a little.

Re: Back To The Beginning

Posted: Wed Sep 11, 2019 5:39 am
by bob
thought I would reply out of interest. On my 2 year old MacBook, 2.2ghz i7 circa 2017 I see this number for perf 6 from start position:

White(1): perft 6
total moves=119060324 time=3.11
White(1):

Here's the next one:

White(1): perft 7
total moves=3195901860 time=82.65

My perft was never optimized for speed. It is a pure generate/make/unmake implementation with no hashing or anything.

Re: Back To The Beginning

Posted: Wed Sep 11, 2019 6:35 am
by Michael Sherwin
bob wrote: Wed Sep 11, 2019 5:39 am thought I would reply out of interest. On my 2 year old MacBook, 2.2ghz i7 circa 2017 I see this number for perf 6 from start position:

White(1): perft 6
total moves=119060324 time=3.11
White(1):

Here's the next one:

White(1): perft 7
total moves=3195901860 time=82.65

My perft was never optimized for speed. It is a pure generate/make/unmake implementation with no hashing or anything.
Downloaded Crafty 25.3 from github. On my machine overclocked to 4GHz (at present) perft 7 74.53 sec. 42,880,744 nodes per second. So your laptop at 2.2 GHz is no slouch! Mine at 4 GHz bench 7 is 48.29 sec, 69,311,177 nodes per sec. 69.3 / 42.9 = 1.615. Mine is also, "pure generate/make/unmake".

The perft in Crafty 25.3 is much faster than the perft in Crafty 19.x. I remember that my bitboard perft example outperformed C19.x by almost double. Now it takes longer, 117.4 seconds.

Re: Back To The Beginning

Posted: Wed Sep 11, 2019 8:20 am
by Joost Buijs
bob wrote: Wed Sep 11, 2019 5:39 am thought I would reply out of interest. On my 2 year old MacBook, 2.2ghz i7 circa 2017 I see this number for perf 6 from start position:

White(1): perft 6
total moves=119060324 time=3.11
White(1):

Here's the next one:

White(1): perft 7
total moves=3195901860 time=82.65

My perft was never optimized for speed. It is a pure generate/make/unmake implementation with no hashing or anything.
It all comes very close to each other, on my 4 GHz. machine with perft 7 from start position I get:

perft: 7
moves: 3195901860 ok!
time: 55.7651 sec.
mps: 57310090

This is with full up/down-dating, no bulk-counting or hashing. It is not my goal to make perft() run as fast as possible, it is just a means to check whether the move logic is working like it should.

Re: Back To The Beginning

Posted: Wed Sep 11, 2019 10:45 am
by hgm
When I had first written qperft (I was using a Pentium M at the time), I tried to speed it up by using assembler. To not waste time on non-time-ritical code, I started with the assembler output of the C compiler, and 'hand-optimized' the code of the time-critical loops. It turned out I could delete as much as 40% of the instructions there, by keeping data in registers rather than pointlessly storing it and reading it back from memory.

When I later timed the various perfts, the speedup turned out to be exactly.... 0%!

This experience made me give up assembly coding. Although I guess assembler can still have advantages if you need it to access hardware features that the compiler would ignore (like YMM registers).

The speed of move generation in itself is a bit meaningless. You have to consider it in combination with the time spent on move sorting. Move generators that superficially are equally fast can still lead to a significant difference in performance if one of them generates the moves in the desired search order (e.g. MVV/LVA), so that the moves do not have to be stored in a list, and sorted, but can be searched immediately.

Re: Back To The Beginning

Posted: Wed Sep 11, 2019 11:33 am
by zullil
hgm wrote: Wed Sep 11, 2019 10:45 am When I had first written qperft (I was using a Pentium M at the time), I tried to speed it up by using assembler. To not waste time on non-time-ritical code, I started with the assembler output of the C compiler, and 'hand-optimized' the code of the time-critical loops. It turned out I could delete as much as 40% of the instructions there, by keeping data in registers rather than pointlessly storing it and reading it back from memory.

When I later timed the various perfts, the speedup turned out to be exactly.... 0%!

This experience made me give up assembly coding. Although I guess assembler can still have advantages if you need it to access hardware features that the compiler would ignore (like YMM registers).
And yet asmfish was markedly faster than Stockfish with exactly the same serial search.

Re: Back To The Beginning

Posted: Wed Sep 11, 2019 11:50 am
by Michael Sherwin
zullil wrote: Wed Sep 11, 2019 11:33 am
hgm wrote: Wed Sep 11, 2019 10:45 am When I had first written qperft (I was using a Pentium M at the time), I tried to speed it up by using assembler. To not waste time on non-time-ritical code, I started with the assembler output of the C compiler, and 'hand-optimized' the code of the time-critical loops. It turned out I could delete as much as 40% of the instructions there, by keeping data in registers rather than pointlessly storing it and reading it back from memory.

When I later timed the various perfts, the speedup turned out to be exactly.... 0%!

This experience made me give up assembly coding. Although I guess assembler can still have advantages if you need it to access hardware features that the compiler would ignore (like YMM registers).
And yet asmfish was markedly faster than Stockfish with exactly the same serial search.
Exactly. The two examples Carnivor and Godzilla use the exact same method. Carnivor is written in C and Godzilla has the move generator, MakeMove and TakeBack written in assembler.

Carnivor in C bench 6 3.198 seconds
Godzilla in asm bench 6 1.903 seconds

Re: Back To The Beginning

Posted: Wed Sep 11, 2019 1:36 pm
by Joost Buijs
Michael Sherwin wrote: Wed Sep 11, 2019 11:50 am
zullil wrote: Wed Sep 11, 2019 11:33 am
hgm wrote: Wed Sep 11, 2019 10:45 am When I had first written qperft (I was using a Pentium M at the time), I tried to speed it up by using assembler. To not waste time on non-time-ritical code, I started with the assembler output of the C compiler, and 'hand-optimized' the code of the time-critical loops. It turned out I could delete as much as 40% of the instructions there, by keeping data in registers rather than pointlessly storing it and reading it back from memory.

When I later timed the various perfts, the speedup turned out to be exactly.... 0%!

This experience made me give up assembly coding. Although I guess assembler can still have advantages if you need it to access hardware features that the compiler would ignore (like YMM registers).
And yet asmfish was markedly faster than Stockfish with exactly the same serial search.
Exactly. The two examples Carnivor and Godzilla use the exact same method. Carnivor is written in C and Godzilla has the move generator, MakeMove and TakeBack written in assembler.

Carnivor in C bench 6 3.198 seconds
Godzilla in asm bench 6 1.903 seconds
I don't know what Carnivor and Godzilla are, but I'm pretty sure that when you code the C version for speed that the difference with the asm version will be negligible. It's the same with Stockfish, it is coded for readability (sort of) but it is not coded for speed.

Re: Back To The Beginning

Posted: Wed Sep 11, 2019 3:08 pm
by hgm
zullil wrote: Wed Sep 11, 2019 11:33 amAnd yet asmfish was markedly faster than Stockfish with exactly the same serial search.
This could be due to the quality of the original source code. I understood that Stockfish is written in C++.

Re: Back To The Beginning

Posted: Thu Sep 12, 2019 6:24 am
by bob
Joost Buijs wrote: Wed Sep 11, 2019 8:20 am
bob wrote: Wed Sep 11, 2019 5:39 am thought I would reply out of interest. On my 2 year old MacBook, 2.2ghz i7 circa 2017 I see this number for perf 6 from start position:

White(1): perft 6
total moves=119060324 time=3.11
White(1):

Here's the next one:

White(1): perft 7
total moves=3195901860 time=82.65

My perft was never optimized for speed. It is a pure generate/make/unmake implementation with no hashing or anything.
It all comes very close to each other, on my 4 GHz. machine with perft 7 from start position I get:

perft: 7
moves: 3195901860 ok!
time: 55.7651 sec.
mps: 57310090

This is with full up/down-dating, no bulk-counting or hashing. It is not my goal to make perft() run as fast as possible, it is just a means to check whether the move logic is working like it should.
I believe that perft first showed up in Crafty. The oldest version I have is 8.11 which was probably 1994/1995ish. It has "perf" rather than "perft". By version 10 (the next version series I have) it had been changed to perft although perf was still present. "perf" seemed to be a static test, while perft worked as it does today. My intent was not to use this as a "tachometer" to measure speed, but to use it as a debugging tool to detect movgen/make/unmake errors when those were changed for optimization. 8.11 actually produces wrong numbers, but there are a zillion warnings and it was really a 32 bit program that probably has lots of bugs in the 64 bit world.