Minic raw speed

xr_a_y · Post by **xr_a_y** » Thu Jan 02, 2020 11:26 pm

Back on the same subject ...

I'm trying to investigate why Minic is slow. It is still copy/make but now uses magic bitboards instead of HQBB (though this change gave no dramatic speed improvement). I wonder whether the (too) big tables in Minic could be the root cause, through cache misses.
Using perf on the Shirov position to depth 25, with a 256 MB TT and the usual other tables (material, pawn TT, ...), I get this:

         23 176,20 msec task-clock                #    0,996 CPUs utilized
             1 332      context-switches          #    0,057 K/sec
                 2      cpu-migrations            #    0,000 K/sec
            46 433      page-faults               #    0,002 M/sec
   102 002 645 618      cycles                    #    4,401 GHz
   192 062 839 497      instructions              #    1,88  insn per cycle
    23 903 716 812      branches                  # 1031,391 M/sec
       705 826 801      branch-misses             #    2,95% of all branches

      23,259313445 seconds time elapsed

This seems OK to me. Am I right?
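
As a sanity check on the perf output above, the derived ratios can be recomputed from the raw counters. This is only a minimal sketch: the function name `perf_ratios` is made up for illustration, and the inputs are the counter values copied from the first perf run.

```python
def perf_ratios(cycles, instructions, branches, branch_misses):
    """Return (instructions per cycle, branch-miss rate in percent)."""
    ipc = instructions / cycles
    branch_miss_pct = 100.0 * branch_misses / branches
    return ipc, branch_miss_pct


# Raw counters from the first perf run (Shirov position, depth 25).
ipc, branch_miss_pct = perf_ratios(
    cycles=102_002_645_618,
    instructions=192_062_839_497,
    branches=23_903_716_812,
    branch_misses=705_826_801,
)
print(f"IPC = {ipc:.2f}, branch-miss rate = {branch_miss_pct:.2f}%")
```

Both values agree with what perf itself reports, so the counters are at least internally consistent.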

         23 371,43 msec task-clock                #    0,974 CPUs utilized
   102 900 155 607      cycles                    #    4,403 GHz
   192 077 762 835      instructions              #    1,87  insn per cycle
     1 738 195 101      cache-references          #   74,373 M/sec
       589 395 636      cache-misses              #   33,908 % of all cache refs

      23,986881200 seconds time elapsed

      23,332399000 seconds user
       0,035920000 seconds sys

But there are a lot of cache misses.
Igel, for example, is no better (here on the start position to depth 18), with almost 40% misses:

          3 300,62 msec task-clock                #    0,294 CPUs utilized
    14 551 909 458      cycles                    #    4,409 GHz
    24 034 823 891      instructions              #    1,65  insn per cycle
       187 602 212      cache-references          #   56,839 M/sec
        74 768 541      cache-misses              #   39,855 % of all cache refs

      11,216000712 seconds time elapsed

       3,256560000 seconds user
       0,043845000 seconds sys

And Stockfish is no better, so I guess the cache hit rate is not the root cause.
In fact, Minic has fewer cache misses and fewer branch mispredictions than Stockfish.
Perft 6 of the start position takes 4.7 sec, so move gen + copy/make runs at 25 Mnps; probably not the issue.
Pure eval (during Texel tuning) runs at 2.7 Mnps, which is probably a bit slow.
Standard search on the same hardware runs at 1.9 Mnps.

I don't get why SMP Minic is so slow at TCEC: only 70 Mnps on 176 threads, where many others are around 120 Mnps.

Joost Buijs · Post by **Joost Buijs** » Fri Jan 03, 2020 8:17 am

xr_a_y wrote: ↑Thu Jan 02, 2020 11:26 pm Perft 6 of the start position takes 4.7 sec, so move gen + copy/make runs at 25 Mnps; probably not the issue.
Pure eval (during Texel tuning) runs at 2.7 Mnps, which is probably a bit slow.
Standard search on the same hardware runs at 1.9 Mnps.

A perft of 25 Mnps on the start position without bulk counting seems about right; I get about 22 Mnps on a 3.8 GHz Broadwell.

I timed my evaluation function (material plus positional) on the start position and it runs in 758 cycles on average, which is ~5 million evaluations per second at 3.8 GHz. Your 2.7 million at 4.4 GHz therefore seems about twice as slow.
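
The conversion between cycles per evaluation and evaluations per second can be sketched as below. The helper names are hypothetical, and the sketch assumes the CPU runs at the stated clock for the whole measurement.

```python
def evals_per_second(clock_hz, cycles_per_eval):
    """Evaluations per second for a given clock and cost per evaluation."""
    return clock_hz / cycles_per_eval


def cycles_per_eval(clock_hz, evals_per_sec):
    """Inverse conversion: cycles spent per evaluation."""
    return clock_hz / evals_per_sec


# 758 cycles per eval at 3.8 GHz (figures quoted in the post above).
fast = evals_per_second(3.8e9, 758)
# 2.7 million evals per second at 4.4 GHz (Minic figure from the first post).
slow_cycles = cycles_per_eval(4.4e9, 2.7e6)

print(f"{fast / 1e6:.2f} M evals/s at 758 cycles")
print(f"{slow_cycles:.0f} cycles per eval at 2.7 M evals/s")
print(f"cost ratio: {slow_cycles / 758:.2f}x")
```

Comparing cycles per evaluation rather than evaluations per second takes the clock-speed difference between the two machines out of the comparison.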

Depending on the stage of the game, my search does 2.6 to 4.5 Mnps; at 4.4 GHz that would be roughly 3.0 to 5.2 Mnps.

It could very well be that the speed of your evaluation function is the culprit, but don't hold me to this.

elcabesa · Post by **elcabesa** » Fri Jan 03, 2020 9:24 am

xr_a_y wrote: ↑Thu Jan 02, 2020 11:26 pm Back on the same subject ...

I'm trying to investigate why Minic is slow.
....
....

I don't get why SMP Minic is so slow at TCEC: only 70 Mnps on 176 threads, where many others are around 120 Mnps.

I haven't understood whether Minic is slow in general or only on 176 threads. Depending on the answer, you can start solving the problem.

If Minic is slower at both 1 and 176 threads, it could simply be slow. But if it's slower only with a high number of threads, it is probably a problem of false sharing or scaling.

Joost Buijs · Post by **Joost Buijs** » Fri Jan 03, 2020 9:52 am

Minic seems to do 1.9 Mnps at the start position with a single thread on a 4.4 GHz CPU, and that is somewhat slow.
It could be false sharing, but with lazy SMP you hardly share anything (besides the TT).

What I don't get is that TCEC lets engines run on 176 threads on 88-core hardware; hyperthreading may give a bit higher nps, but it certainly won't increase playing strength.

xr_a_y · Post by **xr_a_y** » Fri Jan 03, 2020 10:36 am

To answer a few questions:
- Minic does not use "shared" counters (for nps, for example); each thread has its own, and they are gathered only for display
- TCEC hardware is indeed 88 physical cores, but 176 threads are used. I was sceptical at first, but many engines scale well ...
- On my own hardware (an 8-core i7 9700K, no HT), single-thread performance is 1.9 Mnps and it scales perfectly to 8 cores
- On another machine (a dual Xeon with 20 cores), single-thread performance is 800 knps and it scales badly very soon, with final performance being around 23 Mnps
- The TCEC hardware is closer to the second machine
- My eval currently takes 1223 cycles

elcabesa · Post by **elcabesa** » Fri Jan 03, 2020 10:41 am

xr_a_y wrote: ↑Fri Jan 03, 2020 10:36 am - On another machine (a dual Xeon with 20 cores), single-thread performance is 800 knps and it scales badly very soon, with final performance being around 23 Mnps

What does that mean? 28x with 40 cores?
How does another program, e.g. Stockfish, scale on the dual Xeon with 20 cores?

xr_a_y · Post by **xr_a_y** » Fri Jan 03, 2020 10:47 am

Indeed. I don't get why it is not closer to x40 on 40 cores.
I have not had the opportunity to run another engine on this specific hardware.
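
The scaling figures discussed here can be reproduced with a small sketch. The function name is made up for illustration, and the thread count of 40 is taken from the exchange above.

```python
def smp_scaling(single_thread_nps, multi_thread_nps, threads):
    """Return (speedup, parallel efficiency) for an SMP run."""
    speedup = multi_thread_nps / single_thread_nps
    efficiency = speedup / threads
    return speedup, efficiency


# Dual Xeon figures from the thread: 800 knps on 1 thread, ~23 Mnps on 40.
speedup, efficiency = smp_scaling(800e3, 23e6, 40)
print(f"speedup = {speedup:.2f}x, efficiency = {efficiency:.0%}")
```

Running the same calculation for another engine on the same machine would show whether the shortfall is specific to Minic or a property of the hardware.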

xr_a_y · Post by **xr_a_y** » Tue Mar 31, 2020 11:38 am

So I'm back on this speed subject. Here are the current Minic stats:

See           246094508   4.78911%   2042773   120
Apply         716793548   13.9491%   5692750   125
Move Piece    161574554   3.14431%   5419009    29
Eval          839637208   16.3397%   1290309   650
Generate      309262924   6.01839%   1039769   297
PseudoLegal    68977662   1.34234%    787407    87
IsAttacked    363850958    7.0807%   9717315    37
MoveSorting  1149323634   22.3664%   1039769  1105
Total        5138628306

The columns are: total cycles spent in the function, % of total program time spent in the function, number of calls, and cycles per call.

My main worry here is that the percentages do not sum to 100% (they add up to about 75%), which means something else is consuming a lot of time!
I guess it is simply the core of the pvs and qsearch functions, but I am surprised by how much they cost.
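
The share of time not covered by the timers can be estimated from the table above. This is a rough sketch: it assumes the timers do not nest (if, say, Move Piece is timed inside Apply, the accounted share is overstated), and the dictionary simply restates the cycle counts from the table.

```python
# Total cycles per timed function, copied from the stats table.
timed_cycles = {
    "See": 246_094_508,
    "Apply": 716_793_548,
    "Move Piece": 161_574_554,
    "Eval": 839_637_208,
    "Generate": 309_262_924,
    "PseudoLegal": 68_977_662,
    "IsAttacked": 363_850_958,
    "MoveSorting": 1_149_323_634,
}
total_cycles = 5_138_628_306

accounted = sum(timed_cycles.values())
accounted_pct = 100.0 * accounted / total_cycles
unaccounted_pct = 100.0 - accounted_pct

print(f"accounted: {accounted_pct:.1f}%, unaccounted: {unaccounted_pct:.1f}%")
```

Under that no-nesting assumption, roughly a quarter of the run time falls outside the timed functions.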

mvanthoor · Post by **mvanthoor** » Tue Mar 31, 2020 12:49 pm

xr_a_y wrote: ↑Thu Jan 02, 2020 11:26 pm Back on the same subject ...

In fact, Minic has fewer cache misses and fewer branch mispredictions than Stockfish.
Perft 6 of the start position takes 4.7 sec, so move gen + copy/make runs at 25 Mnps; probably not the issue.

It depends on the speed of your CPU. Perft 6 from the starting position runs in 4.4 sec, 27 million nodes/sec, in my (not yet complete) engine. This is one thread on a vintage 2015 Intel i7-6700K.

So whether your 4.7 sec @ 25M nodes/sec is normal or not depends on the speed of your CPU.
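
The nodes-per-second figures follow directly from the perft 6 leaf count of the starting position (119,060,324) and the elapsed time. A minimal sketch, with a made-up helper name:

```python
# Number of leaf nodes for perft 6 from the standard starting position.
PERFT6_NODES = 119_060_324


def perft_nps(elapsed_seconds, nodes=PERFT6_NODES):
    """Nodes per second for a perft run without bulk counting or hashing."""
    return nodes / elapsed_seconds


nps_rustic = perft_nps(4.4)  # timing quoted for the i7-6700K
nps_minic = perft_nps(4.7)   # timing quoted for Minic in the first post

print(f"4.4 s -> {nps_rustic / 1e6:.1f} Mnps")
print(f"4.7 s -> {nps_minic / 1e6:.1f} Mnps")
```

Because every engine visits the same number of nodes, elapsed time alone is enough for comparing perft speed on identical hardware.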

edit: "- On my own hardware (a 8 cores not HT, i7 9700K), single thread"

Missed this on first reading.
6700K passmark (https://www.cpubenchmark.net/cpu.php?cp ... Hz&id=2565) = 9138
9700K passmark (https://www.cpubenchmark.net/cpu.php?cp ... Hz&id=3335) = 14880

So your CPU is 63% (!) faster than mine, but it's slower at running Perft 6. Since both our engines use magic bitboards and a language compiled to machine code (C++/Rust), those probably make no difference. I'm assuming you're compiling with full optimizations and cpu-architecture = native (or at least skylake).

If you want, I can share a pre-compiled version of Rustic with you, which will run Perft 6 with no hash and no bulk counting to compare on your CPU.

As make_move() is the function that's called the most during perft, I assume your make_move() function is slower than the one in Rustic.

I've found some posts on other forums (https://open-chess.org/viewtopic.php?t=2855) saying that Crafty runs Perft 6 in 3.8 seconds (31.3M nodes/sec) without hash on a 2012-era CPU, so move generation can still be made much faster in both Minic and Rustic.

xr_a_y · Post by **xr_a_y** » Tue Mar 31, 2020 1:35 pm

Thanks for your inputs.

Minic's current perft 6 on the start position, on a single core of an i7 9700K (nothing else running), is:

real	0m4,501s
user	0m4,465s
sys	0m0,036s

My perft function has no cache, and Minic applies pseudo-legal moves and tests them for legality even at depth 1 (no switch to a legal move generator).

If I activate the timers while running perft, I get this:

Apply        14229653290       59.6611%  125049325   113
Move Piece    2769228928       11.6106%  125043819    22
Eval               22458  9.41604e-05%          5  4491
Generate      1647643480       6.90813%    5072213   324
IsAttacked    2564711194       10.7532%  125049325    20
Total        23850787022

with indeed a big contribution from apply (which in Minic is quite ugly ...)

But still, Minic can run perft at 26 Mnps and only searches at 1.7 Mnps, so I don't care too much about move generation.
