Need advice on SMP

syzygy
Posts: 5654
Joined: Tue Feb 28, 2012 11:56 pm

Re: Need advice on SMP

Post by syzygy »

smatovic wrote: Sun Apr 02, 2023 10:28 pm
Modern DRAMs have multiple banks to serve multiple memory requests in parallel. However, when two requests go to the same bank, they have to be served serially, exacerbating the high latency of off-chip memory. Adding more banks to the system to mitigate this problem incurs high system cost. Our goal in this work is to achieve the benefits of increasing the number of banks with a low-cost approach. To this end, we propose three new mechanisms, SALP-1, SALP-2, and MASA (Multitude of Activated Subarrays), to reduce the serialization of different requests that go to the same bank. The key observation exploited by our mechanisms is that a modern DRAM bank is implemented as a collection of subarrays that operate largely independently while sharing few global peripheral structures.
I understand that requests to the same bank have to be served serially, but what I wonder about is to what extent the latencies overlap in time.

Suppose two cores access the same bank simultaneously. If one request takes 100ns, does this mean that the other request has to wait 100ns and then takes another 100ns, giving 200ns in total?
syzygy
Posts: 5654
Joined: Tue Feb 28, 2012 11:56 pm

Re: Need advice on SMP

Post by syzygy »

syzygy wrote: Sun Apr 02, 2023 10:57 pm
smatovic wrote: Sun Apr 02, 2023 10:28 pm
Modern DRAMs have multiple banks to serve multiple memory requests in parallel. However, when two requests go to the same bank, they have to be served serially, exacerbating the high latency of off-chip memory. Adding more banks to the system to mitigate this problem incurs high system cost. Our goal in this work is to achieve the benefits of increasing the number of banks with a low-cost approach. To this end, we propose three new mechanisms, SALP-1, SALP-2, and MASA (Multitude of Activated Subarrays), to reduce the serialization of different requests that go to the same bank. The key observation exploited by our mechanisms is that a modern DRAM bank is implemented as a collection of subarrays that operate largely independently while sharing few global peripheral structures.
I understand that requests to the same bank have to be served serially, but what I wonder about is to what extent the latencies overlap in time.
Aha, the text you quote continues:
Our three proposed mechanisms mitigate the negative impact of bank serialization by overlapping different components of the bank access latencies of multiple requests that go to different subarrays within the same bank.
This suggests that there is currently no such overlap of bank access latencies. (But I suspect some of the total latency is caused by other components, such as the memory controller, and perhaps those are overlapped.)
smatovic
Posts: 2991
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Need advice on SMP

Post by smatovic »

syzygy wrote: Sun Apr 02, 2023 11:07 pm Aha, the text you quote continues:
Our three proposed mechanisms mitigate the negative impact of bank serialization by overlapping different components of the bank access latencies of multiple requests that go to different subarrays within the same bank.
This suggests that there is currently no such overlap of bank access latencies. (But I suspect some of the total latency is caused by other components, such as the memory controller, and perhaps those are overlapped.)
I don't know for sure. The paper is from 2012; meanwhile there might be different techniques to solve this (some are mentioned in the paper). Also, 100ns might be the overall CPU-to-RAM latency, but the DRAM itself has different lookup latencies AFAIK.

--
Srdja
syzygy
Posts: 5654
Joined: Tue Feb 28, 2012 11:56 pm

Re: Need advice on SMP

Post by syzygy »

smatovic wrote: Sun Apr 02, 2023 11:13 pm
syzygy wrote: Sun Apr 02, 2023 11:07 pm Aha, the text you quote continues:
Our three proposed mechanisms mitigate the negative impact of bank serialization by overlapping different components of the bank access latencies of multiple requests that go to different subarrays within the same bank.
This suggests that there is currently no such overlap of bank access latencies. (But I suspect some of the total latency is caused by other components, such as the memory controller, and perhaps those are overlapped.)
I don't know for sure. The paper is from 2012; meanwhile there might be different techniques to solve this (some are mentioned in the paper). Also, 100ns might be the overall CPU-to-RAM latency, but the DRAM itself has different lookup latencies AFAIK.
The paper is from 2018 (the original SALP paper is from 2012), and sections 5 ("related work") and 6 ("significant and long-term impact") suggest that this was still an academic subject.

However, SALP is about sub-bank parallelism, and the paper mentions that modern DRAM chips already provide bank-level parallelism. Apparently DDR4 has up to 128 banks:
Also, the number of bank addresses has been increased greatly. There are four bank select bits to select up to 16 banks within each DRAM: two bank address bits (BA0, BA1), and two bank group bits (BG0, BG1). There are additional timing restrictions when accessing banks within the same bank group; it is faster to access a bank in a different bank group.

In addition, there are three chip select signals (C0, C1, C2), allowing up to eight stacked chips to be placed inside a single DRAM package. These effectively act as three more bank select bits, bringing the total to seven (128 possible banks).
https://en.wikipedia.org/wiki/DDR4_SDRAM
And this does not yet take into account different channels (I think).

DDR5 doubles the number of bank groups per DRAM:
- The number of chip ID bits remains at three, allowing up to eight stacked chips.
- A third bank group bit (BG2) was added, allowing up to eight bank groups.
- The maximum number of banks per bank group remains at four.
https://en.wikipedia.org/wiki/DDR5_SDRAM

So DDR4 and DDR5 seem to allow plenty of MLP already. With 4 cores there is probably not much increase in average TT access latency.
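
A common way for an engine to actually exploit that parallelism is to issue the TT load as early as possible, so the DRAM access overlaps with move generation and evaluation instead of stalling the probe. A minimal sketch, assuming the GCC/Clang __builtin_prefetch builtin; tt_table, tt_mask and the Entry layout here are placeholders, not any particular engine's format:

#include <stdint.h>

typedef struct { uint64_t key; uint64_t data; } Entry;  /* placeholder layout */

extern Entry *tt_table;   /* hash table with a power-of-two number of entries */
extern uint64_t tt_mask;  /* number_of_entries - 1 */

/* Issue a read prefetch for the TT entry of the position with this hash key.
   Arguments: rw = 0 (read), locality = 0 (no temporal reuse expected). */
static inline void tt_prefetch(uint64_t key)
{
    __builtin_prefetch(&tt_table[key & tt_mask], 0, 0);
}

/* Typical use: call tt_prefetch(child_key) right after make_move(), then do
   move generation / legality work; by the time tt_probe() actually reads
   tt_table[key & tt_mask], the cache line is (hopefully) already there. */

Whether this buys anything of course depends on how much independent work the engine has between the prefetch and the probe.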
smatovic
Posts: 2991
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Need advice on SMP

Post by smatovic »

syzygy wrote: Mon Apr 03, 2023 12:02 am [...]
So DDR4 and DDR5 seem to allow plenty of MLP already. With 4 cores there is probably not much increase in average TT access latency.
I agree, these papers mention 10-way MLP:

https://people.csail.mit.edu/rinard/paper/pact18.pdf
Processors have grown their capacity to exploit instruction level parallelism (ILP) with wide scalar and vector pipelines, e.g., cores have 4-way superscalar pipelines, and vector units can execute 32 arithmetic operations per cycle. Memory level parallelism (MLP) is also pervasive, with deep buffering between caches and DRAM that allows 10+ in-flight memory requests per core.
https://anmolpanda.github.io/papers/MLP.pdf
3. Memory offers parallelism of up to 10 parallel accesses. MLP reduces when latency inducing factors like TLB misses, cross-QPI accesses and virtualization
--
Srdja
hgm
Posts: 28206
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Need advice on SMP

Post by hgm »

That depends on which CPU you have, right? If all data from memory has to be streamed into the CPU through the same data bus, it can become a serious bottleneck, because the data bus doesn't operate at the CPU's clock frequency but at the unmultiplied one.
smatovic
Posts: 2991
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Need advice on SMP

Post by smatovic »

Yes, sure, it depends on the CPU, the RAM, and the engine's memory footprint. As Joost mentioned, TT lookups during QSearch, for example.

From the GPU point of view I can tell you that you really need to do microbenchmarking for such things: latencies and throughput for caches and RAM (and also for instructions).
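
For what it's worth, here is a minimal sketch of such a latency microbenchmark (assuming a POSIX system with clock_gettime): walk a randomly permuted pointer chain through a buffer much larger than the last-level cache, so every step is a dependent DRAM load that the hardware prefetchers cannot predict.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ENTRIES (64u * 1024u * 1024u)   /* 64M entries * 8 bytes = 512 MB, >> LLC */
#define STEPS   (50u * 1000u * 1000u)

int main(void)
{
    uint64_t *next = malloc(ENTRIES * sizeof *next);
    if (!next) return 1;

    /* Build one big random cycle (Sattolo's algorithm), so following
       next[p] visits every entry in an unpredictable order.
       rand() is crude, but good enough for a sketch. */
    for (uint64_t i = 0; i < ENTRIES; i++) next[i] = i;
    for (uint64_t i = ENTRIES - 1; i > 0; i--) {
        uint64_t j = (uint64_t)rand() % i;          /* 0 <= j < i */
        uint64_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    /* Each load depends on the previous one, so time/step ~ load latency. */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    uint64_t p = 0;
    for (uint64_t s = 0; s < STEPS; s++) p = next[p];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("~%.1f ns per dependent load (checksum %llu)\n",
           ns / STEPS, (unsigned long long)p);
    free(next);
    return 0;
}

Running several such chains interleaved in one loop (or in several threads) then gives a rough picture of how much of that latency the available MLP can hide.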

As mentioned, there must be a core-to-memory-channel ratio sweet spot for each engine on different hardware.

But as syzygy wrote, 2 memory channels with 4 (to 16?) cores "should" be fine.

Let us assume 4M NPS, 1 TT memory fetch per node and 64 bytes per TT entry; that gives 256 MB/s per thread, right? DDR4 offers up to ~25 GB/s per channel and DDR5 about ~50 GB/s per channel, so bandwidth should not be the bottleneck IMO, but you do have to take care of RAM latency.
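(To extend that back-of-envelope figure: even 16 such threads would need only 16 × 256 MB/s ≈ 4 GB/s, a fraction of one DDR4 channel, so the interesting question is indeed how well the ~100ns per random access can be overlapped, not the raw bandwidth.)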

--
Srdja