Minic raw speed

xr_a_y
Posts: 848
Joined: Sat Nov 25, 2017 1:28 pm
Location: France

Re: Minic raw speed

Post by xr_a_y » Thu Jan 02, 2020 10:26 pm

Back on the same subject ...

I'm trying to investigate why Minic is slow. It is still copy/make but now uses magic bitboards instead of HQBB (though this change gives no dramatic speed improvement). I wonder whether the (too) big tables in Minic could be the root cause, by causing cache misses.
Using perf on the Shirov position up to depth 25, with a 256 MB TT plus the usual other tables (material, pawn TT, ...), I get this:

         23 176,20 msec task-clock                #    0,996 CPUs utilized
             1 332      context-switches          #    0,057 K/sec
                 2      cpu-migrations            #    0,000 K/sec
            46 433      page-faults               #    0,002 M/sec
   102 002 645 618      cycles                    #    4,401 GHz
   192 062 839 497      instructions              #    1,88  insn per cycle
    23 903 716 812      branches                  # 1031,391 M/sec
       705 826 801      branch-misses             #    2,95% of all branches

      23,259313445 seconds time elapsed
Which seems OK to me. Am I right?

         23 371,43 msec task-clock                #    0,974 CPUs utilized
   102 900 155 607      cycles                    #    4,403 GHz
   192 077 762 835      instructions              #    1,87  insn per cycle
     1 738 195 101      cache-references          #   74,373 M/sec
       589 395 636      cache-misses              #   33,908 % of all cache refs

      23,986881200 seconds time elapsed

      23,332399000 seconds user
       0,035920000 seconds sys
But there are a lot of cache misses (about 34%).
Then again, Igel, for example, is no better (here on the start position to depth 18), with almost 40% misses:

          3 300,62 msec task-clock                #    0,294 CPUs utilized
    14 551 909 458      cycles                    #    4,409 GHz
    24 034 823 891      instructions              #    1,65  insn per cycle
       187 602 212      cache-references          #   56,839 M/sec
        74 768 541      cache-misses              #   39,855 % of all cache refs

      11,216000712 seconds time elapsed

       3,256560000 seconds user
       0,043845000 seconds sys
       
And Stockfish is no better, so I guess the cache hit rate is not the root cause.
In fact, Minic has fewer cache misses and fewer branch mispredictions than Stockfish.
Perft 6 from the start position takes 4.7 s, so move generation plus copy/make runs at 25 Mnps; that is probably not the issue.
Pure eval (during Texel tuning) runs at 2.7 Mnps, which is perhaps a bit slow.
Standard search on the same hardware runs at 1.9 Mnps.

I don't get why SMP Minic is so slow at TCEC: only 70 Mnps on 176 threads, where many others are around 120 Mnps.
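
The rates quoted in this post are simple ratios of the raw counters. A minimal sketch of that arithmetic follows; the helper names are hypothetical, and the perft(6) node count of 119,060,324 is the well-known reference value for the start position, not a number taken from the post.

```cpp
#include <cstdint>

// Nodes per second from a node count and an elapsed time in seconds.
constexpr double nodes_per_second(std::uint64_t nodes, double seconds) {
    return static_cast<double>(nodes) / seconds;
}

// Generic counter ratio: instructions/cycles gives IPC,
// cache-misses/cache-references gives the miss rate, and so on.
constexpr double counter_ratio(std::uint64_t numerator, std::uint64_t denominator) {
    return static_cast<double>(numerator) / static_cast<double>(denominator);
}

// Reference leaf-node count of perft(6) from the start position.
constexpr std::uint64_t kPerft6Nodes = 119060324ULL;
```

With the numbers above, `nodes_per_second(kPerft6Nodes, 4.7)` gives about 25.3 Mnps, and `counter_ratio(589395636ULL, 1738195101ULL)` gives the 33.9% cache-miss rate that perf prints.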

Joost Buijs
Posts: 1004
Joined: Thu Jul 16, 2009 8:47 am
Location: Almere, The Netherlands

Re: Minic raw speed

Post by Joost Buijs » Fri Jan 03, 2020 7:17 am

xr_a_y wrote:
Thu Jan 02, 2020 10:26 pm
Perft 6 from the start position takes 4.7 s, so move generation plus copy/make runs at 25 Mnps; that is probably not the issue.
Pure eval (during Texel tuning) runs at 2.7 Mnps, which is perhaps a bit slow.
Standard search on the same hardware runs at 1.9 Mnps.
A perft speed of 25 Mnps on the start position without bulk counting seems about right; I get about 22 Mnps on a 3.8 GHz Broadwell.

I timed my evaluation function (material plus positional) on the start position and it runs in 758 cycles on average, which is ~5 million evaluations per second at 3.8 GHz, so your 2.7 million at 4.4 GHz seems about twice as slow.

Depending on the stage of the game, my search does 2.6 to 4.5 Mnps; at 4.4 GHz that would be about 3.0 to 5.2 Mnps.

It could very well be that the speed of your evaluation function is the culprit, but don't hold me to this.
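
The comparison above converts between cycles per call and calls per second, and rescales a speed from one clock to another. A minimal sketch with hypothetical helper names follows. The rescaling assumes speed is proportional to clock frequency, which ignores memory latency and microarchitecture differences, so treat the results as rough estimates.

```cpp
// Calls per second for a routine costing `cycles` on a core running at `clock_hz`.
constexpr double calls_per_second(double clock_hz, double cycles) {
    return clock_hz / cycles;
}

// Inverse conversion: average cycles per call from a measured call rate.
constexpr double cycles_per_call(double clock_hz, double calls_per_sec) {
    return clock_hz / calls_per_sec;
}

// Rescale a measured speed to another clock, assuming linear scaling with frequency.
constexpr double rescale_to_clock(double speed, double from_ghz, double to_ghz) {
    return speed * to_ghz / from_ghz;
}
```

For example, `calls_per_second(3.8e9, 758.0)` is about 5.0 million evaluations per second, and `cycles_per_call(4.4e9, 2.7e6)` is about 1630 cycles per evaluation.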

elcabesa
Posts: 825
Joined: Sun May 23, 2010 11:32 am

Re: Minic raw speed

Post by elcabesa » Fri Jan 03, 2020 8:24 am

xr_a_y wrote:
Thu Jan 02, 2020 10:26 pm
Back on the same subject ...

I'm trying to investigate why Minic is slow.
....
....

I don't get why SMP Minic is so slow at TCEC: only 70 Mnps on 176 threads, where many others are around 120 Mnps.
I haven't understood whether Minic is slow in general or only slow on 176 threads. Depending on the answer, you can start solving the problem.

If Minic is slower at both 1 and 176 threads, it could simply be slow. But if it is slower only with a high number of threads, it is probably a problem of false sharing or scaling.

Joost Buijs
Posts: 1004
Joined: Thu Jul 16, 2009 8:47 am
Location: Almere, The Netherlands

Re: Minic raw speed

Post by Joost Buijs » Fri Jan 03, 2020 8:52 am

Minic seems to do 1.9 Mnps at the start position with a single thread on a 4.4 GHz CPU, and that is somewhat slow.
It could be false sharing, but with lazy SMP you hardly share anything (besides the TT).

What I don't get is why TCEC lets engines run 176 threads on 88-core hardware; hyperthreading may give a somewhat higher nps, but it certainly won't increase playing strength.

xr_a_y
Posts: 848
Joined: Sat Nov 25, 2017 1:28 pm
Location: France

Re: Minic raw speed

Post by xr_a_y » Fri Jan 03, 2020 9:36 am

To answer a few questions:
- Minic does not use "shared" counters (for nps, for example); each thread has its own, and they are gathered only for display
- The TCEC hardware does indeed have 88 physical cores, but 176 hardware threads are used. I was sceptical at first, but many engines scale well ...
- On my own hardware (an 8-core i7 9700K, no HT), single-thread performance is 1.9 Mnps and it scales perfectly to 8 cores
- On other hardware (a dual Xeon with 20 cores), single-thread performance is 800 knps and it scales badly very soon, with final performance being around 23 Mnps
- The TCEC hardware is closer to the second test
- My eval currently takes 1223 cycles
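
Per-thread counters only avoid false sharing if two threads' counters never land on the same cache line. Below is a minimal sketch of one common layout, assuming 64-byte cache lines; the type and function names are hypothetical and this is not Minic's actual code.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, as on typical current x86-64 CPUs.
constexpr std::size_t kCacheLineSize = 64;

// Hypothetical per-thread statistics block. alignas pads each block to a
// full cache line, so neighbouring threads never write to the same line.
struct alignas(kCacheLineSize) ThreadCounters {
    std::atomic<std::uint64_t> nodes{0};
};
static_assert(sizeof(ThreadCounters) == kCacheLineSize, "one block per cache line");

// Called only by the owning thread. With a single writer, a relaxed
// load followed by a relaxed store avoids a locked read-modify-write.
inline void count_node(ThreadCounters& mine) {
    mine.nodes.store(mine.nodes.load(std::memory_order_relaxed) + 1,
                     std::memory_order_relaxed);
}

// Called only when the nps is displayed: gather all per-thread counters.
inline std::uint64_t total_nodes(const ThreadCounters* counters, std::size_t thread_count) {
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < thread_count; ++i) {
        sum += counters[i].nodes.load(std::memory_order_relaxed);
    }
    return sum;
}
```

Each search thread would call `count_node` on its own block, and the display code would call `total_nodes` over the whole array. The relaxed atomics keep the occasional cross-thread read well defined without adding synchronization to the hot path.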

elcabesa
Posts: 825
Joined: Sun May 23, 2010 11:32 am

Re: Minic raw speed

Post by elcabesa » Fri Jan 03, 2020 9:41 am

xr_a_y wrote:
Fri Jan 03, 2020 9:36 am
- On other hardware (a dual Xeon with 20 cores), single-thread performance is 800 knps and it scales badly very soon, with final performance being around 23 Mnps
What does that mean? 28x with 40 cores?
How does another program, e.g. Stockfish, scale on the dual Xeon with 20 cores?
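
The "28x" figure is the ratio of multi-thread to single-thread nps. A minimal sketch of that calculation follows, with hypothetical helper names and the 40-thread count taken from the question above.

```cpp
// Speedup of a multi-threaded run over the single-threaded run.
constexpr double speedup(double multi_nps, double single_nps) {
    return multi_nps / single_nps;
}

// Parallel efficiency: fraction of ideal linear scaling actually achieved.
constexpr double parallel_efficiency(double multi_nps, double single_nps, int threads) {
    return speedup(multi_nps, single_nps) / threads;
}

// Average speed contributed by each thread in a multi-threaded run.
constexpr double nps_per_thread(double total_nps, int threads) {
    return total_nps / threads;
}
```

With the dual Xeon numbers, `speedup(23e6, 0.8e6)` is 28.75 and `parallel_efficiency(23e6, 0.8e6, 40)` is about 0.72. On the TCEC figures, `nps_per_thread(70e6, 176)` is roughly 0.40 Mnps per thread, against about 0.68 Mnps per thread for an engine reaching 120 Mnps.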

xr_a_y
Posts: 848
Joined: Sat Nov 25, 2017 1:28 pm
Location: France

Re: Minic raw speed

Post by xr_a_y » Fri Jan 03, 2020 9:47 am

Indeed. I don't get why it is not closer to x40 on 40 cores.
I have not had the opportunity to run another engine on this specific hardware.
