Memory management and threads

chrisw
Posts: 4317
Joined: Tue Apr 03, 2012 4:28 pm

Memory management and threads

Post by chrisw »

On Memory Management

Well, I never paid much attention to memory efficiency and management, always assuming my effectively one-threaded engine stayed generally within cache limits, hash table aside. I mostly concentrated on making it work rather than work efficiently, with liberal doses of lookup tables and so on.

Now (see the other thread I posted) it's been converted to very lazy SMP, basically calling the 'one-threaded engine' N times, and the nps is not scaling with the number of threads.

Cache sizes:
Laptop: L1 = 384 KB, L2 = 1.5 MB, L3 = 9.0 MB
Test PC: L1 = 384 KB, L2 = 1.5 MB, L3 = 12.0 MB

so I reckon I have about 10 MB or so to play with before there's a problem. I haven't actually worked it out, but I'm pretty sure my profligate attitude to memory use will have blown well past 10 MB with several threads running at once.

Q1. I'm assuming cache is per processor, not per core? In other words, I don't have 72 MB of L3 cache across 6 cores?

Q2. How costly is it in fact to be in L1, L2 or L3, or outside?
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Memory management and threads

Post by hgm »

L3 is usually shared between all cores. If you have multiple CPU chips they of course each have their own L3. L1 and L2 are usually private.

L1 has a latency of a few cycles (3 or 4), but you can have a sustained throughput of two accesses per clock cycle. That is as good as instant. L2 is slower (~12 cycles latency?), and allows only one access per cycle.
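
In case it helps to see those numbers on one's own machine, here is a minimal pointer-chasing sketch (made-up standalone code, not from any engine; the buffer sizes are arbitrary assumptions meant to land roughly in L1, L2, L3 and DRAM). Each step is a dependent load at a random location, so the time per access roughly tracks the latency of whichever cache level the buffer fits in.

Code: Select all

// Pointer-chasing latency sketch -- made-up standalone code, not from any engine.
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    // Buffer sizes chosen to land in L1, L2, L3 and DRAM on a typical desktop.
    for (std::size_t bufBytes : {16u << 10, 256u << 10, 4u << 20, 64u << 20}) {
        const std::size_t n = bufBytes / sizeof(std::size_t);
        std::vector<std::size_t> order(n), next(n);
        std::iota(order.begin(), order.end(), std::size_t{0});
        std::shuffle(order.begin(), order.end(), std::mt19937_64{42});
        for (std::size_t i = 0; i < n; ++i)          // one big random cycle,
            next[order[i]] = order[(i + 1) % n];     // defeats the hardware prefetcher

        const std::size_t steps = 20'000'000;
        std::size_t p = order[0];
        const auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < steps; ++i)
            p = next[p];                             // dependent load chain
        const auto t1 = std::chrono::steady_clock::now();
        const double ns =
            std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
        std::printf("%6zu KB buffer: %5.1f ns per access (%zu)\n",
                    bufBytes >> 10, ns, p);          // print p so the loop isn't optimized away
    }
    return 0;
}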
mar
Posts: 2555
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Memory management and threads

Post by mar »

I assume all your caches (eval/pawn), killers and history buffers are thread-local.
If not, then false sharing may fry your performance.
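
A minimal sketch of what that means in practice (all names invented for illustration, not from any particular engine): the classic trap is packing small per-thread counters or tables next to each other, so that writes from different threads land in the same 64-byte cache line; giving each thread its own cache-line-aligned block avoids it.

Code: Select all

// Hedged sketch, names invented for illustration -- not from any engine.
#include <cstddef>
#include <cstdint>

constexpr std::size_t CACHE_LINE = 64;

// Risky: nodeCount[0..7] share a couple of 64-byte lines, so every increment
// by one thread bounces the line between cores (false sharing).
std::uint64_t nodeCount[8];

// Safer: give each search thread its own cache-line-aligned block, so its
// killers, history and counters never share a line with another thread's.
struct alignas(CACHE_LINE) ThreadData {
    std::uint64_t nodes = 0;
    std::int16_t  history[64][64];    // thread-local history table
    std::uint16_t killers[128][2];    // thread-local killer moves, indexed by ply
};

ThreadData threads[8];                // one block per search thread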
Martin Sedlak
chrisw
Posts: 4317
Joined: Tue Apr 03, 2012 4:28 pm

Re: Memory management and threads

Post by chrisw »

mar wrote: Tue Sep 15, 2020 8:33 pm I assume all your caches (eval/pawn)/killers/history buffers are thread-local.
if not then false sharing may fry your performance
Yes, everything is local to the thread concerned. Main hash not, of course.
I did dream up the idea that it ought not to matter if one grabs large tracts of memory, blowing through the cache size; what matters is how many times one tries to access (effectively at random) some small part of it. Maybe that's just a dream.

One other aspect: when I profiled some time ago, my prefetch instruction (hash related) flagged up a large value (compared to anything else). I parked this, but it seems odd, because prefetch is designed to, well, work in the background for later.
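
For reference, a hedged sketch of the usual TT prefetch pattern (TTEntry, ttTable, ttMask, prefetch_tt and probe_tt are placeholder names, not chrisw's actual code): compute the child's hash key as early as possible, issue the prefetch, and do other work before the real probe so the cache line has time to arrive.

Code: Select all

// Hedged sketch of a typical TT prefetch pattern -- placeholder names only.
#include <cstdint>
#include <vector>
#include <xmmintrin.h>   // _mm_prefetch / _MM_HINT_T0 (x86 compilers)

struct TTEntry {
    std::uint64_t key;
    std::uint32_t move;
    std::int16_t  score;
    std::uint8_t  depth, flags;
};

static std::vector<TTEntry> ttTable(1u << 20);            // 1M entries, ~16 MB
static const std::uint64_t  ttMask = ttTable.size() - 1;  // size is a power of two

// Issue this as soon as the child's hash key is known (right after making the
// move), then do other work -- legality check, move ordering -- before the
// real probe, so the cache line can arrive in the background.
inline void prefetch_tt(std::uint64_t childKey) {
    _mm_prefetch(reinterpret_cast<const char*>(&ttTable[childKey & ttMask]),
                 _MM_HINT_T0);
}

// The later probe then (hopefully) hits a line that is already in cache.
inline const TTEntry& probe_tt(std::uint64_t key) {
    return ttTable[key & ttMask];
}

int main() {
    std::uint64_t childKey = 0x9D39247E33776D41ULL;       // arbitrary example key
    prefetch_tt(childKey);
    // ... generate and order moves here in a real search ...
    const TTEntry& e = probe_tt(childKey);
    return e.depth;                                       // use the entry so it isn't optimized away
}

On the profiler reading: samples tend to get charged to instructions around a stall, so a large number on the prefetch line may simply mean the TT access near it is memory-bound rather than the prefetch itself being expensive, though that is only a guess from here.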
mar
Posts: 2555
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Memory management and threads

Post by mar »

chrisw wrote: Tue Sep 15, 2020 8:58 pm One other aspect, when I profiled some time ago, my prefetch instruction (hash related) flagged up some large value (compared to anything else). I parked this, but it seems odd, because prefetch itself is designed to, well, kind of work in the background for later.
I've never used prefetch so I can't tell.
However, as far as I remember, back in the day my nps scaling was pretty good: sublinear, but more or less as expected.

Could you give us some numbers on your nps scaling (out of curiosity)?
Martin Sedlak
chrisw
Posts: 4317
Joined: Tue Apr 03, 2012 4:28 pm

Re: Memory management and threads

Post by chrisw »

mar wrote: Tue Sep 15, 2020 9:18 pm
chrisw wrote: Tue Sep 15, 2020 8:58 pm One other aspect, when I profiled some time ago, my prefetch instruction (hash related) flagged up some large value (compared to anything else). I parked this, but it seems odd, because prefetch itself is designed to, well, kind of work in the background for later.
I've never used prefetch so I can't tell.
However, as far as I remember back in the days my nps scaling was pretty good, sublinear but more or less as expected.

Could you give us some numbers about your nps scaling (out of curiosity)
See first post in thread immediately below this one
Guenther
Posts: 4606
Joined: Wed Oct 01, 2008 6:33 am
Location: Regensburg, Germany
Full name: Guenther Simon

Re: Memory management and threads

Post by Guenther »

mar wrote: Tue Sep 15, 2020 9:18 pm
chrisw wrote: Tue Sep 15, 2020 8:58 pm One other aspect, when I profiled some time ago, my prefetch instruction (hash related) flagged up some large value (compared to anything else). I parked this, but it seems odd, because prefetch itself is designed to, well, kind of work in the background for later.
I've never used prefetch so I can't tell.
However, as far as I remember back in the days my nps scaling was pretty good, sublinear but more or less as expected.

Could you give us some numbers about your nps scaling (out of curiosity)
For better readability in this thread
chrisw wrote: 5. Preliminary results (benchmark 970 positions, each at depth=13) are slightly weird.

Code: Select all

For one thread only:
2.2 Mnps
total nodes 112 M
time per move 54 ms
'solutions' 389
legal, validated tt-moves proposed 3.19 M
tt-moves searched, with v>alpha 1.84 M
'efficiency' of tt-move proposal 57%

For four threads:
4.1 Mnps
total nodes 262 M
time per move 66 ms
'solutions' 391
legal, validated tt-moves proposed 10.1 M
tt-moves searched, with v>alpha 5.6 M
'efficiency' of tt-move proposal 55%
Obviously the major weirdness is spending a little more time per move to reach depth, contrary to expectations.
https://rwbc-chess.de

trollwatch:
Chessqueen + chessica + AlexChess + Eduard + Sylwy
chrisw
Posts: 4317
Joined: Tue Apr 03, 2012 4:28 pm

Re: Memory management and threads

Post by chrisw »

hgm wrote: Tue Sep 15, 2020 8:30 pm L3 is usually shared between all cores. If you have multiple CPU chips they of course each have their own L3. L1 and L2 are usually private.

L1 has a latency of a few cycles (3 or 4), but you can have a sustained throughput of two accesses per clock cycle. That is so good as instant. L2 is slower (~12 cycles latency?), and allows only one access per cycle.
So what are the general principles of memory management for an SMP chess engine programmer?
How do we manage increased memory access outside of cache as the thread count increases? I might be asking something totally random here, btw.
mar
Posts: 2555
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Memory management and threads

Post by mar »

Thanks.

Values of 5-yr-old cheng (lazy SMP) on my 8-core Ryzen, 4 MB hash, startpos to depth 20:

Code: Select all

1T: time 54363 nodes 131382573 nps 2416764
2T: time 35140 nodes 166246539 nps 4730977
4T: time 37180 nodes 338705372 nps 9109880
8T: time 18374 nodes 292388814 nps 15913182
so nps scaling is as follows:

Code: Select all

1T: 1.0x
2T: 1.96x
4T: 3.77x
8T: 6.58x
Not great, not terrible, but it took little effort back then.

So your <2x speedup on 4 cores seems a bit low indeed (assuming 4 physical cores).
Now that I look at your results, it seems like some test suite of >300 positions; I assume you get the same scaling for a fixed position searched for a longer time? (say startpos + 30 seconds)
Martin Sedlak
chrisw
Posts: 4317
Joined: Tue Apr 03, 2012 4:28 pm

Re: Memory management and threads

Post by chrisw »

mar wrote: Tue Sep 15, 2020 11:07 pm thanks.

values of 5-yr old cheng (lazy smp) on my 8-core Ryzen, 4MB hash, startpos to depth 20:

Code: Select all

1T: time 54363 nodes 131382573 nps 2416764
2T: time 35140 nodes 166246539 nps 4730977
4T: time 37180 nodes 338705372 nps 9109880
8T: time 18374 nodes 292388814 nps 15913182
so nps scaling is as follows:

Code: Select all

1T: 1.0x
2T: 1.96x
4T: 3.77x
8T: 6.58x
not great, not terrible, but with little effort back then

so your <2x speedup on 4 cores seems a bit low indeed (assuming 4 physical cores)
now that I look at your results, it seems like some test suite of >300 positions, I assume you get the same scaling for a fixed position
searched for a longer time? (say startpos + 30 seconds)
4 threads, 6-core machine. The test suite is just short of 1000 positions.

Good idea to test with more time, will take a look at that tomorrow and report back.

Since the above results, it's been through a tuning with 37 M positions rather than the prior 22 M. As an aside, that gained 30 Elo in a quick test, though not enough games were played.

So, with nothing else different other than the eval weights, I now get:
1 thread: 2.3 Mnps
4 threads: 3.9 Mnps

as opposed to the prior 2.2 and 4.1; nps very often comes out different after some eval change.

I still have the idea in the back of my head that if one does heavy, heavy processing before actually expanding the node and looking at the children, to select, sort and throw out potential moves, and that is successful, then the ratio of that prior processing to the processing of the N moves becomes significant. Effective sort/prune (which would be increased by finding more tt-moves with SMP) is going to show up as lower nps. It was proposed earlier that very lazy SMP should scale linearly anyway, but I'm not sure at the moment...