On Memory Management
Well, I never paid much attention to memory efficiency and management, always assuming my effectively single-threaded engine stayed generally within cache limits, the hash table aside. I mostly attended to making sure it worked rather than worked efficiently, with liberal doses of lookup tables and so on.
Now (see the other posting thread) it's been converted to very lazy SMP, basically calling the 'one-threaded engine' N times, and the nps is not scaling with the thread count.
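For concreteness, a minimal sketch of what I mean by 'very lazy SMP': N copies of the single-threaded searcher on private copies of the position, sharing only the transposition table. Position and searchRoot here are simplified stand-ins, not the real code.

Code: Select all
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

struct Position {};  // stand-in for the engine's position type

// stand-in for the normal single-threaded root search
void searchRoot(Position pos, int threadId, std::atomic<bool>& stop) {
    while (!stop.load(std::memory_order_relaxed)) {
        // iterative deepening on a private copy of the position;
        // the only shared state is the global transposition table
        break;  // stub so the sketch terminates
    }
}

// "very lazy" SMP: N copies of the one-threaded engine, sharing only the hash
void lazySmp(const Position& root, int nThreads) {
    std::atomic<bool> stop{false};
    std::vector<std::thread> pool;
    for (int t = 0; t < nThreads; ++t)
        pool.emplace_back(searchRoot, root, t, std::ref(stop));
    stop = true;  // time management would set this when the search is done
    for (auto& th : pool) th.join();
}

int main() { lazySmp(Position{}, 4); }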
Cache sizes:
Laptop: L1 = 384 KB, L2 = 1.5 MB, L3 = 9.0 MB
Test PC: L1 = 384 KB, L2 = 1.5 MB, L3 = 12.0 MB
so I reckon I have about 10 MB to play with before there's a problem. I haven't actually worked it out, but I'm pretty sure my profligate attitude to memory use will have blown well past 10 MB with several threads running at once.
Q1. I'm assuming cache is per processor, not per core? In other words, I don't have 72 MB of L3 cache across 6 cores?
Q2. How costly is it in fact to be in L1, or L2, or L3, or outside?
Memory management and threads
-
- Posts: 27808
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Memory management and threads
L3 is usually shared between all cores. If you have multiple CPU chips they of course each have their own L3. L1 and L2 are usually private.
L1 has a latency of a few cycles (3 or 4), but you can have a sustained throughput of two accesses per clock cycle. That is as good as instant. L2 is slower (~12 cycles latency?), and allows only one access per cycle.
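If you want to put numbers on those tiers yourself, the classic trick is a pointer chase: every load depends on the previous one, so the time per step approximates the raw load-to-use latency of whichever level the buffer fits in. A rough sketch; the buffer sizes are only illustrative and should be picked around your own L1/L2/L3 sizes.

Code: Select all
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// average ns per dependent load over a buffer of n pointer-sized slots
static double chase_ns(size_t n) {
    std::vector<size_t> next(n);
    std::iota(next.begin(), next.end(), size_t{0});
    // Sattolo's algorithm builds a single-cycle permutation, so the chase
    // really visits all n slots instead of getting stuck in a tiny loop
    std::mt19937_64 rng{42};
    for (size_t k = n - 1; k > 0; --k) {
        size_t j = std::uniform_int_distribution<size_t>(0, k - 1)(rng);
        std::swap(next[k], next[j]);
    }
    const size_t steps = 20'000'000;
    size_t i = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t s = 0; s < steps; ++s)
        i = next[i];                       // each load depends on the last
    auto t1 = std::chrono::steady_clock::now();
    volatile size_t sink = i; (void)sink;  // keep the chain from being optimized out
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / steps;
}

int main() {
    for (size_t kb : {16, 256, 4096, 65536})  // roughly L1, L2, L3, DRAM
        std::printf("%6zu KB: %4.1f ns/load\n", kb, chase_ns(kb * 1024 / sizeof(size_t)));
}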
-
- Posts: 2559
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
Re: Memory management and threads
I assume all your caches (eval/pawn), killers, and history buffers are thread-local.
If not, then false sharing may fry your performance.
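A minimal sketch of the thread-local pattern (the struct members are just placeholders): give each thread its own cache-line-aligned block, so one thread's hot writes never invalidate a line another thread is using.

Code: Select all
#include <cstdint>
#include <thread>
#include <vector>

// one search thread's private data; alignas(64) puts each instance on
// its own cache line(s), so there is no false sharing between threads
struct alignas(64) ThreadLocal {
    uint64_t nodes = 0;
    int16_t history[64][64] = {};  // placeholder for history/killers/etc.
};

int main() {
    const int nThreads = 4;
    std::vector<ThreadLocal> local(nThreads);
    std::vector<std::thread> pool;
    for (int t = 0; t < nThreads; ++t)
        pool.emplace_back([&local, t] {
            for (int i = 0; i < 10'000'000; ++i)
                ++local[t].nodes;  // hot write stays in this core's private cache
        });
    for (auto& th : pool) th.join();
}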
Martin Sedlak
-
- Posts: 4319
- Joined: Tue Apr 03, 2012 4:28 pm
Re: Memory management and threads
Yes, everything is local to the thread concerned. The main hash is not, of course.
I did dream up the idea that it ought not to matter if one grabs large tracts of memory, blowing through the cache size; what does matter is how many times one tries to access (effectively at random) some small part of it. Maybe that’s just a dream.
One other aspect: when I profiled some time ago, my prefetch instruction (hash related) flagged up some large value (compared to anything else). I parked this, but it seems odd, because prefetch is designed to, well, kind of work in the background for later.
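The pattern, roughly, is to issue the prefetch as soon as the child position's hash key is known, then do unrelated work (legality checks, move ordering) while the line travels up the hierarchy. One possibility for the large profiled value is sampling skid: profilers often attribute stall cycles from neighbouring loads to the prefetch instruction itself. A sketch of the pattern; TTEntry, the table, and the mask here are simplified illustrations, not anyone's actual engine.

Code: Select all
#include <cstdint>
#include <vector>

struct TTEntry { uint64_t key; uint32_t move; int16_t score; uint8_t depth, flags; };

static std::vector<TTEntry> table(1 << 20);          // 1M entries, power of two
static const uint64_t tableMask = table.size() - 1;  // size - 1

// issue this as soon as the child position's hash key is known
inline void prefetchTT(uint64_t key) {
    __builtin_prefetch(&table[key & tableMask]);  // GCC/Clang builtin (_mm_prefetch on MSVC)
}

int main() {
    uint64_t key = 0x9e3779b97f4a7c15ULL;  // some hash key
    prefetchTT(key);
    // ... legality checks / move ordering happen here, hiding the miss ...
    volatile auto& e = table[key & tableMask];  // the actual probe, hopefully now in cache
    (void)e;
}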
-
- Posts: 2559
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
Re: Memory management and threads
chrisw wrote: ↑Tue Sep 15, 2020 8:58 pm
One other aspect: when I profiled some time ago, my prefetch instruction (hash related) flagged up some large value (compared to anything else). I parked this, but it seems odd, because prefetch is designed to, well, kind of work in the background for later.

I've never used prefetch so I can't tell.
However, as far as I remember, back in the day my nps scaling was pretty good: sublinear, but more or less as expected.
Could you give us some numbers on your nps scaling (out of curiosity)?
Martin Sedlak
-
- Posts: 4319
- Joined: Tue Apr 03, 2012 4:28 pm
Re: Memory management and threads
mar wrote: ↑Tue Sep 15, 2020 9:18 pm
I've never used prefetch so I can't tell. However, as far as I remember, back in the day my nps scaling was pretty good: sublinear, but more or less as expected. Could you give us some numbers on your nps scaling (out of curiosity)?

See the first post in the thread immediately below this one.
-
- Posts: 4607
- Joined: Wed Oct 01, 2008 6:33 am
- Location: Regensburg, Germany
- Full name: Guenther Simon
Re: Memory management and threads
For better readability in this thread:

chrisw wrote:
5. Preliminary results (benchmark 970 positions, each at depth=13) are slightly weird. Obviously the major weirdness is in spending a little more time per move to reach depth, contrary to expectations.

Code: Select all
For one thread only:
  2.2 Mnps
  total nodes                         112 M
  time per move                       54 ms
  'solutions'                         389
  legal, validated tt-moves proposed  3.19 M
  tt-moves searched, with v>alpha     1.84 M
  'efficiency' of tt-move proposal    57%

For four threads:
  4.1 Mnps
  total nodes                         262 M
  time per move                       66 ms
  'solutions'                         391
  legal, validated tt-moves proposed  10.1 M
  tt-moves searched, with v>alpha     5.6 M
  'efficiency' of tt-move proposal    55%
https://rwbc-chess.de
trollwatch:
Talkchess nowadays is a joke - it is full of trolls/idiots/people stuck in the pleistocene > 80% of the posts fall into this category...
-
- Posts: 4319
- Joined: Tue Apr 03, 2012 4:28 pm
Re: Memory management and threads
hgm wrote: ↑Tue Sep 15, 2020 8:30 pm
L3 is usually shared between all cores. If you have multiple CPU chips they of course each have their own L3. L1 and L2 are usually private.
L1 has a latency of a few cycles (3 or 4), but you can have a sustained throughput of two accesses per clock cycle. That is as good as instant. L2 is slower (~12 cycles latency?), and allows only one access per cycle.

So what are the general principles of memory management for the SMP chess engine programmer?
How do we manage the increased memory traffic outside of cache as the thread count grows? I might be asking totally at random here, btw.
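One rough back-of-envelope under the cache sizes quoted earlier, assuming (only approximately true on real parts) that the shared L3 splits evenly between busy cores:

Code: Select all
L3 (test PC)                   12 MB
threads                         6
hot private data per thread    12 MB / 6 = 2 MB

So the frequently touched thread-local structures (eval/pawn caches, history, killers) would want to fit in roughly 2 MB each, while the shared main hash is written off as a near-certain miss and covered by prefetch instead.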
-
- Posts: 2559
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
Re: Memory management and threads
thanks.
values of 5-yr old cheng (lazy smp) on my 8-core Ryzen, 4MB hash, startpos to depth 20:

Code: Select all
1T: time 54363 nodes 131382573 nps 2416764
2T: time 35140 nodes 166246539 nps 4730977
4T: time 37180 nodes 338705372 nps 9109880
8T: time 18374 nodes 292388814 nps 15913182

so nps scaling is as follows:

Code: Select all
1T: 1.0x
2T: 1.96x
4T: 3.77x
8T: 6.58x

not great, not terrible, but with little effort back then.
so your <2x speedup on 4 cores seems a bit low indeed (assuming 4 physical cores).
now that I look at your results, it seems like some test suite of >300 positions; I assume you get the same scaling for a fixed position searched for a longer time? (say startpos + 30 seconds)
Martin Sedlak
-
- Posts: 4319
- Joined: Tue Apr 03, 2012 4:28 pm
Re: Memory management and threads
mar wrote: ↑Tue Sep 15, 2020 11:07 pm
so your <2x speedup on 4 cores seems a bit low indeed (assuming 4 physical cores).
now that I look at your results, it seems like some test suite of >300 positions; I assume you get the same scaling for a fixed position searched for a longer time? (say startpos + 30 seconds)

4 threads, on a 6-core machine. The test suite is just short of 1000 positions.
Good idea to test with more time, will take a look at that tomorrow and report back.
Since the above results, it’s been through a tuning with 37 M positions rather than the prior 22 M. As an aside, that gained 30 Elo in a quick test, though not enough games were played.
So, with nothing different other than the eval weights, I now get:
1 thread: 2.3 Mnps
4 threads: 3.9 Mnps
as opposed to the prior 2.2 and 4.1; nps very often comes out different after some eval change.
I still have the idea in the back of my head that if one does heavy, heavy processing before actually expanding the node and looking at the children (to select, sort and throw out potential moves), and that is successful, then the ratio of that prior processing to the processing of the N moves is significant. An effective sort/prune (which would be strengthened by finding more tt moves with SMP) is going to show up as lower nps. It was proposed earlier that very lazy SMP should just scale linearly anyway, but I'm not sure at the moment....
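A toy version of that arithmetic, with numbers invented purely for illustration: say an interior node carrying the heavy select/sort costs 2000 ns, and a cheap leaf/qsearch node costs 200 ns.

Code: Select all
weaker ordering:   1 interior per 10 leaves: (2000 + 10*200) / 11 nodes ≈ 364 ns/node ≈ 2.75 Mnps
stronger ordering: 1 interior per  4 leaves: (2000 +  4*200) /  5 nodes = 560 ns/node ≈ 1.79 Mnps

Same per-node costs in both cases; the better ordering just spends a larger share of its time in the expensive pre-processing, so the node counter ticks more slowly even though the tree to a given depth is smaller.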