Memory management and threads

chrisw
Posts: 4317
Joined: Tue Apr 03, 2012 4:28 pm

Memory management and threads

Post by chrisw »

On Memory Management

Well, I never paid much attention to memory efficiency and management, always assuming my effectively one-threaded engine stayed generally within cache limits, hash table aside. I mostly concentrated on making it work rather than work efficiently, with liberal doses of lookup tables and so on.

Now (see the other thread I posted) it's been converted to very lazy SMP, basically calling the 'one-threaded engine' N times, and the nps is not scaling with the number of threads.

Cache sizes:
Laptop: L1 = 384 KB, L2 = 1.5 MB, L3 = 9.0 MB
Test PC: L1 = 384 KB, L2 = 1.5 MB, L3 = 12.0 MB

so I reckon I have about 10 MB or so to play with before there's a problem. I haven't actually worked it out, but I'm pretty sure my profligate attitude to memory use will have blown well past 10 MB with several threads running at once.

Q1. I'm assuming cache is per processor, not per core? In other words, I don't have 72 MB of L3 cache across 6 cores?

Q2. How costly is it in fact to be in L1, L2 or L3, or outside?
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Memory management and threads

Post by hgm »

L3 is usually shared between all cores. If you have multiple CPU chips they of course each have their own L3. L1 and L2 are usually private.

L1 has a latency of a few cycles (3 or 4), but you can have a sustained throughput of two accesses per clock cycle. That is as good as instant. L2 is slower (~12 cycles latency?), and allows only one access per cycle.
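
In case it helps to see those numbers on one's own machine, here is a minimal pointer-chasing sketch (made-up standalone code, not from any engine; the buffer sizes are arbitrary assumptions meant to land roughly in L1, L2, L3 and DRAM). Each step is a dependent load at a random location, so the time per access roughly tracks the latency of whichever cache level the buffer fits in.

Code: Select all

// Pointer-chasing latency sketch -- made-up standalone code, not from any engine.
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    // Buffer sizes chosen to land in L1, L2, L3 and DRAM on a typical desktop.
    for (std::size_t bufBytes : {16u << 10, 256u << 10, 4u << 20, 64u << 20}) {
        const std::size_t n = bufBytes / sizeof(std::size_t);
        std::vector<std::size_t> order(n), next(n);
        std::iota(order.begin(), order.end(), std::size_t{0});
        std::shuffle(order.begin(), order.end(), std::mt19937_64{42});
        for (std::size_t i = 0; i < n; ++i)          // one big random cycle,
            next[order[i]] = order[(i + 1) % n];     // defeats the hardware prefetcher

        const std::size_t steps = 20'000'000;
        std::size_t p = order[0];
        const auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < steps; ++i)
            p = next[p];                             // dependent load chain
        const auto t1 = std::chrono::steady_clock::now();
        const double ns =
            std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
        std::printf("%6zu KB buffer: %5.1f ns per access (%zu)\n",
                    bufBytes >> 10, ns, p);          // print p so the loop isn't optimized away
    }
    return 0;
}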
mar
Posts: 2555
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Memory management and threads

Post by mar »

I assume all your caches (eval/pawn), killers and history buffers are thread-local.
If not, then false sharing may fry your performance.
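
A minimal sketch of what that means in practice (all names invented for illustration, not from any particular engine): the classic trap is packing small per-thread counters or tables next to each other, so that writes from different threads land in the same 64-byte cache line; giving each thread its own cache-line-aligned block avoids it.

Code: Select all

// Hedged sketch, names invented for illustration -- not from any engine.
#include <cstddef>
#include <cstdint>

constexpr std::size_t CACHE_LINE = 64;

// Risky: nodeCount[0..7] share a couple of 64-byte lines, so every increment
// by one thread bounces the line between cores (false sharing).
std::uint64_t nodeCount[8];

// Safer: give each search thread its own cache-line-aligned block, so its
// killers, history and counters never share a line with another thread's.
struct alignas(CACHE_LINE) ThreadData {
    std::uint64_t nodes = 0;
    std::int16_t  history[64][64];    // thread-local history table
    std::uint16_t killers[128][2];    // thread-local killer moves, indexed by ply
};

ThreadData threads[8];                // one block per search thread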
Martin Sedlak
chrisw
Posts: 4317
Joined: Tue Apr 03, 2012 4:28 pm

Re: Memory management and threads

Post by chrisw »

mar wrote: Tue Sep 15, 2020 8:33 pm I assume all your caches (eval/pawn)/killers/history buffers are thread-local.
if not then false sharing may fry your performance
Yes, everything is local to the thread concerned. Main hash not, of course.
I did dream up the idea that it ought not to matter if one grabs large tracts of memory, blowing through the cache size; what matters is how many times one tries to access (effectively at random) some small part of it. Maybe that's just a dream.

One other aspect: when I profiled some time ago, my prefetch instruction (hash related) flagged up a large value (compared to anything else). I parked this, but it seems odd, because prefetch is designed to, well, work in the background for later.
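
For reference, a hedged sketch of the usual TT prefetch pattern (TTEntry, ttTable, ttMask, prefetch_tt and probe_tt are placeholder names, not chrisw's actual code): compute the child's hash key as early as possible, issue the prefetch, and do other work before the real probe so the cache line has time to arrive.

Code: Select all

// Hedged sketch of a typical TT prefetch pattern -- placeholder names only.
#include <cstdint>
#include <vector>
#include <xmmintrin.h>   // _mm_prefetch / _MM_HINT_T0 (x86 compilers)

struct TTEntry {
    std::uint64_t key;
    std::uint32_t move;
    std::int16_t  score;
    std::uint8_t  depth, flags;
};

static std::vector<TTEntry> ttTable(1u << 20);            // 1M entries, ~16 MB
static const std::uint64_t  ttMask = ttTable.size() - 1;  // size is a power of two

// Issue this as soon as the child's hash key is known (right after making the
// move), then do other work -- legality check, move ordering -- before the
// real probe, so the cache line can arrive in the background.
inline void prefetch_tt(std::uint64_t childKey) {
    _mm_prefetch(reinterpret_cast<const char*>(&ttTable[childKey & ttMask]),
                 _MM_HINT_T0);
}

// The later probe then (hopefully) hits a line that is already in cache.
inline const TTEntry& probe_tt(std::uint64_t key) {
    return ttTable[key & ttMask];
}

int main() {
    std::uint64_t childKey = 0x9D39247E33776D41ULL;       // arbitrary example key
    prefetch_tt(childKey);
    // ... generate and order moves here in a real search ...
    const TTEntry& e = probe_tt(childKey);
    return e.depth;                                       // use the entry so it isn't optimized away
}

On the profiler reading: samples tend to get charged to instructions around a stall, so a large number on the prefetch line may simply mean the TT access near it is memory-bound rather than the prefetch itself being expensive, though that is only a guess from here.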
mar
Posts: 2555
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Memory management and threads

Post by mar »

chrisw wrote: Tue Sep 15, 2020 8:58 pm One other aspect, when I profiled some time ago, my prefetch instruction (hash related) flagged up some large value (compared to anything else). I parked this, but it seems odd, because prefetch itself is designed to, well, kind of work in the background for later.
I've never used prefetch so I can't tell.
However, as far as I remember, back in the day my nps scaling was pretty good: sublinear, but more or less as expected.

Could you give us some numbers on your nps scaling (out of curiosity)?
Martin Sedlak
chrisw
Posts: 4317
Joined: Tue Apr 03, 2012 4:28 pm

Re: Memory management and threads

Post by chrisw »

mar wrote: Tue Sep 15, 2020 9:18 pm
chrisw wrote: Tue Sep 15, 2020 8:58 pm One other aspect, when I profiled some time ago, my prefetch instruction (hash related) flagged up some large value (compared to anything else). I parked this, but it seems odd, because prefetch itself is designed to, well, kind of work in the background for later.
I've never used prefetch so I can't tell.
However, as far as I remember back in the days my nps scaling was pretty good, sublinear but more or less as expected.

Could you give us some numbers about your nps scaling (out of curiosity)
See first post in thread immediately below this one
Guenther
Posts: 4606
Joined: Wed Oct 01, 2008 6:33 am
Location: Regensburg, Germany
Full name: Guenther Simon

Re: Memory management and threads

Post by Guenther »

mar wrote: Tue Sep 15, 2020 9:18 pm
chrisw wrote: Tue Sep 15, 2020 8:58 pm One other aspect, when I profiled some time ago, my prefetch instruction (hash related) flagged up some large value (compared to anything else). I parked this, but it seems odd, because prefetch itself is designed to, well, kind of work in the background for later.
I've never used prefetch so I can't tell.
However, as far as I remember back in the days my nps scaling was pretty good, sublinear but more or less as expected.

Could you give us some numbers about your nps scaling (out of curiosity)
For better readability in this thread
chrisw wrote: 5. Preliminary results (benchmark 970 positions, each at depth=13) are slightly weird.

Code: Select all

For one thread only:
2.2 Mnps
total nodes 112 M
time per move 54 ms
'solutions' 389
legal, validated tt-moves proposed 3.19 M
tt-moves searched, with v>alpha 1.84 M
'efficiency' of tt-move proposal 57%

For four threads:
4.1 Mnps
total nodes 262 M
time per move 66 ms
'solutions' 391
legal, validated tt-moves proposed 10.1 M
tt-moves searched, with v>alpha 5.6 M
'efficiency' of tt-move proposal 55%
Obviously the major weirdness is spending a little more time per move to reach depth, contrary to expectations.
https://rwbc-chess.de

trollwatch:
Chessqueen + chessica + AlexChess + Eduard + Sylwy
chrisw
Posts: 4317
Joined: Tue Apr 03, 2012 4:28 pm

Re: Memory management and threads

Post by chrisw »

hgm wrote: Tue Sep 15, 2020 8:30 pm L3 is usually shared between all cores. If you have multiple CPU chips they of course each have their own L3. L1 and L2 are usually private.

L1 has a latency of a few cycles (3 or 4), but you can have a sustained throughput of two accesses per clock cycle. That is so good as instant. L2 is slower (~12 cycles latency?), and allows only one access per cycle.
So what are the general principles of memory management for an SMP chess engine programmer?
How do we manage increased memory access outside of cache as the thread count increases? I might be asking something totally random here, btw.
mar
Posts: 2555
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Memory management and threads

Post by mar »

Thanks.

Values of 5-yr-old cheng (lazy SMP) on my 8-core Ryzen, 4 MB hash, startpos to depth 20:

Code: Select all

1T: time 54363 nodes 131382573 nps 2416764
2T: time 35140 nodes 166246539 nps 4730977
4T: time 37180 nodes 338705372 nps 9109880
8T: time 18374 nodes 292388814 nps 15913182
so nps scaling is as follows:

Code: Select all

1T: 1.0x
2T: 1.96x
4T: 3.77x
8T: 6.58x
Not great, not terrible, but it took little effort back then.

So your <2x speedup on 4 cores seems a bit low indeed (assuming 4 physical cores).
Now that I look at your results, it seems like some test suite of >300 positions; I assume you get the same scaling for a fixed position searched for a longer time? (say startpos + 30 seconds)
Martin Sedlak
chrisw
Posts: 4317
Joined: Tue Apr 03, 2012 4:28 pm

Re: Memory management and threads

Post by chrisw »

mar wrote: Tue Sep 15, 2020 11:07 pm thanks.

values of 5-yr old cheng (lazy smp) on my 8-core Ryzen, 4MB hash, startpos to depth 20:

Code: Select all

1T: time 54363 nodes 131382573 nps 2416764
2T: time 35140 nodes 166246539 nps 4730977
4T: time 37180 nodes 338705372 nps 9109880
8T: time 18374 nodes 292388814 nps 15913182
so nps scaling is as follows:

Code: Select all

1T: 1.0x
2T: 1.96x
4T: 3.77x
8T: 6.58x
not great, not terrible, but with little effort back then

so your <2x speedup on 4 cores seems a bit low indeed (assuming 4 physical cores)
now that I look at your results, it seems like some test suite of >300 positions, I assume you get the same scaling for a fixed position
searched for a longer time? (say startpos + 30 seconds)
4 threads, 6-core machine. The test suite is just short of 1000 positions.

Good idea to test with more time, will take a look at that tomorrow and report back.

Since the above results, it's been through a tuning with 37 M positions rather than the prior 22 M. As an aside, that gained 30 Elo in a quick test, though not enough games were played.

So, with nothing else different other than the eval weights, I now get:
1 thread: 2.3 Mnps
4 threads: 3.9 Mnps

as opposed to the prior 2.2 and 4.1; nps very often comes out different after some eval change.

I still have the idea in the back of my head that if one does heavy, heavy processing before actually expanding the node and looking at the children, to select, sort and throw out potential moves, and that is successful, then the ratio of that prior processing to the processing of the N moves becomes significant. Effective sort/prune (which would be increased by finding more tt-moves with SMP) is going to show up as lower nps. It was proposed earlier that very lazy SMP should scale linearly anyway, but I'm not sure at the moment...