I use Fedora.
Maybe this helps: https://community.linuxmint.com/softwar ... s-unstable
Otherwise, try compiling it directly from source. Apparently the source code is part of the Linux kernel source, under tools/perf.
I just tried this myself. I got a bunch of warnings, probably because the version of the kernel source I downloaded does not correspond to the kernel headers currently installed on my machine, but it produced a binary that seems to work.
Code:
#include <array>
#include <thread>

class Search;                    // the engine's per-thread search state, defined elsewhere

constexpr int MAX_THREADS = 8;   // assumed limit; the engine defines its own value

struct Thread {
    std::thread thread;          // the OS thread
    Search* search;              // this thread's search state
};

class ThreadPool
{
    std::array<Thread, MAX_THREADS> threads;
};
Carbec wrote: ↑Tue Mar 28, 2023 10:46 am
Hello,
I now have a dual boot Windows/Linux, so I made a little experiment. The same test: positions from BT2630.epd, maximum depth 10.

threads : Mnps (Linux) : Mnps (Windows)
---------------------------------------
1 : 5.2 : 4.5
2 : 10.0 : 8.6
3 : 14.7 : 13.3
4 : 19.0 : 16.1
5 : 20.9 : 19.1
6 : 22.0 : 21.6
8 : 22.9 : 23.7

Don't ask me why there is a difference!

First, your Windows executable is slower than the Linux one (it is often the case, and I guess because of a different ABI).
The gain in time is a bad metric. The gain in strength is the right metric. Here lazy SMP shines.
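For anyone wanting to reproduce this kind of scaling measurement, here is a minimal, self-contained sketch of the usual method: run the same fixed workload on n threads, count "nodes" in a shared atomic counter, and report aggregate nodes per wall-clock second. The workload below is a dummy stand-in for a fixed-depth search of the BT2630 positions; only the measurement scaffolding is the point.

Code:

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Dummy stand-in for a fixed-depth search; a real run would search the
// BT2630 positions to depth 10 and count actual nodes.
static void searchWorkload(std::atomic<long long>& nodes) {
    volatile unsigned x = 1;
    for (long long i = 0; i < 200'000'000; ++i) {
        x = x * 1664525u + 1013904223u;        // busy work, one "node" each
        if ((i & 4095) == 0)
            nodes.fetch_add(4096, std::memory_order_relaxed);
    }
}

int main() {
    for (int n : {1, 2, 3, 4, 5, 6, 8}) {
        std::atomic<long long> nodes{0};
        auto t0 = std::chrono::steady_clock::now();

        std::vector<std::thread> pool;
        for (int i = 0; i < n; ++i)
            pool.emplace_back(searchWorkload, std::ref(nodes));
        for (auto& t : pool)
            t.join();

        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
        std::printf("%d threads : %.1f Mnps\n", n, nodes.load() / dt.count() / 1e6);
    }
}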
Carbec wrote: ↑Thu Mar 30, 2023 3:04 pm
Hello,
I finally changed my distribution to Ubuntu Cinnamon. I love it :)
I redid my little test: same positions, but 10 s per move; 1 s is too short to get a good idea. These are the values for Linux, I didn't do the tests for Windows. I'm too lazy :)
I could also use perf c2c; from what I understand, there is no false sharing!

threads : Mnps (Linux)
----------------------
1 : 5.1
2 : 10.2
3 : 15.0
4 : 19.5
6 : 22.7
8 : 24.9

The speeds you are reporting now for 1-4 threads indeed suggest that everything is fine.
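For context on what perf c2c looks for: false sharing happens when data written by different threads lands on the same 64-byte cache line (for example, adjacent per-thread node counters), so the line ping-pongs between cores even though no datum is actually shared. The usual fix is to align each per-thread structure to its own cache line. A minimal sketch of the pattern, with illustrative names that are not from Carbec's engine; on the unpadded layout, perf c2c record / perf c2c report would be expected to flag the counters' line:

Code:

#include <atomic>
#include <thread>
#include <vector>

// One counter per thread, each aligned to its own 64-byte cache line
// (64 bytes is the line size on x86), so a thread writing its own counter
// never invalidates the line another thread is writing.
// (The over-aligned element type needs C++17's aligned operator new.)
struct alignas(64) PaddedCounter {
    std::atomic<long long> nodes{0};
};

int main() {
    constexpr int N = 8;
    std::vector<PaddedCounter> counters(N);  // without alignas(64), neighbours
                                             // would share a line: false sharing

    std::vector<std::thread> pool;
    for (int i = 0; i < N; ++i)
        pool.emplace_back([&counters, i] {
            for (int k = 0; k < 50'000'000; ++k)
                counters[i].nodes.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& t : pool)
        t.join();
}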
syzygy wrote: ↑Sun Mar 26, 2023 1:36 am
Or is this wrong if the cores are accessing the same memory channel?
Does someone know how this works? Are memory accesses effectively pipelined by the memory controller, so that access latencies overlap? Or does the memory controller handle simultaneous cache misses by different cores/threads strictly sequentially?

Interesting question, I don't know.
From the abstract of the SALP paper (Kim et al., "A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM", ISCA 2012):

Modern DRAMs have multiple banks to serve multiple memory requests in parallel. However, when two requests go to the same bank, they have to be served serially, exacerbating the high latency of off-chip memory. Adding more banks to the system to mitigate this problem incurs high system cost. Our goal in this work is to achieve the benefits of increasing the number of banks with a low-cost approach. To this end, we propose three new mechanisms, SALP-1, SALP-2, and MASA (Multitude of Activated Subarrays), to reduce the serialization of different requests that go to the same bank. The key observation exploited by our mechanisms is that a modern DRAM bank is implemented as a collection of subarrays that operate largely independently while sharing few global peripheral structures.
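One way to probe syzygy's question empirically is a pointer-chase microbenchmark: within one chain every load depends on the previous one, so a single chain exposes the full DRAM miss latency, while several independent chains give the core and the memory controller misses they can overlap. If k chains take much less than k times as long as one chain, the accesses are effectively pipelined. A rough, self-contained sketch; the buffer size and step count are arbitrary and machine-dependent:

Code:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    // Buffer much larger than the last-level cache, so chased loads miss to DRAM.
    const std::size_t N = std::size_t{1} << 25;       // 32M entries * 8 B = 256 MB
    std::vector<std::size_t> next(N);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::shuffle(next.begin(), next.end(), std::mt19937_64{42});  // random permutation

    for (int chains : {1, 2, 4, 8}) {
        std::vector<std::size_t> p(chains);
        for (int c = 0; c < chains; ++c) p[c] = static_cast<std::size_t>(c);

        auto t0 = std::chrono::steady_clock::now();
        const long steps = 5'000'000;                 // dependent loads per chain
        for (long s = 0; s < steps; ++s)
            for (int c = 0; c < chains; ++c)
                p[c] = next[p[c]];    // each chain is serially dependent, but the
                                      // chains are independent, so misses can overlap
        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;

        std::size_t checksum = std::accumulate(p.begin(), p.end(), std::size_t{0});
        std::printf("%d chains: %.2f s, %.1f ns per step (checksum %zu)\n",
                    chains, dt.count(), dt.count() / steps * 1e9, checksum);
    }
}

On most modern machines one would expect the time per step to stay nearly flat from 1 up to several chains, i.e. the misses overlap rather than queue strictly sequentially; the subarray/bank parallelism described in the abstract above is what provides that concurrency at the DRAM end.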