re-inventing the SMP wheel

Discussion of chess software programming and technical issues.


hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: re-inventing the SMP wheel

Post by hgm »

bob wrote:First, I thought we were discussing this in the context of a chess engine, not in terms of general-purpose applications?
I am not sure what you want to say with this. We are discussing the best way to do paging. For Chess programs you don't want to page at all. So in the context of a Chess program it is immaterial how you page.
Secondly, operating systems definitely use a fault-in-on-demand strategy on program startup and while it is running. I do not want to page my program, and therefore I limit the size of everything to be certain it will fit in physical memory, if nothing else needs to run along with me. But I know that on a hash probe, I do not want to read 1/4 megabyte of data when I am only going to use 16 bytes. On the large magic number tables, I do not want to read in 1/4 megabyte to get to a 64 bit or 32 bit value. The list goes on. Because if I do have to page, I do not want to have to page out 256K chunks when only 1, 2 or 16 bytes have been changed.
There seems to be no basis for this. If part of your hash table gets paged out, you want it paged back in as quickly as possible. You can't say that you only need 16 bytes, as the hash table is constantly bombarded with probes. The magic number tables even more so. Any part of it that is missing will bring you to a grinding halt very soon. It should always be loaded. The more you bring in in advance the better. The larger the pages, the more likely that they will be 'least recently used', and not swapped out. And if it has to be swapped out to shut down operations of your chess program, you want to put the OS under maximum pressure to revoke this disastrous decision. If swapping 256KB is hardly any more expensive than swapping 4KB, it is just as well that 256KB is swapped in at once, rather than having to wait for 64 page faults (which will certainly come quite soon, but not soon enough to catch this rotation of the disk, forcing you to wait for the next).
Large page sizes have their place, but then so do small ones. The O/S does tricky things with buffering file I/O using page-sized buffers. 256K is too big unless your program spends all its time reading/writing.

In general, if we have to use a "one-size-fits-all" approach, smaller is better, because _most_ running applications are very small.
Well, as I reported before, applications smaller than 2MB (memory footprint) seem virtually non-existent. So "very small" means something different than I think, or this is just not true at all.
How would you do that? One large function? Poor software engineering. Lots of small functions? No way to put 'em into one contiguous block then, since we don't have any semantics to cause that to happen.
Linkers usually pack code coming from the same source file in one contiguous block. The order in which you specify the arguments (the object files) could be a clue.
Again, the general feeling among computer architecture and O/S designers is that if the choice is large or small, small is better. Were it dynamically variable, then yes, you could tune page size to a specific application's behavior and help things.
Well, large or small are relative notions. I maintain that there also is something like 'too small': the optimum will not lie at an asymptotically small page size (say 4 bytes). So the question is what is optimum. I agree that for general applications 4MB is definitely too large. But 4KB seems way too small.
But right now it is more about memory wastage, rather than paging costs since most systems are configured to minimize actual paging I/O. And that's the bad mark on the 4M page size. It greatly reduces TLB misses on very large programs. But if you use it on small programs memory wastage is very high.
Agreed, 4MB is (still) too big.
I will guarantee you that you will see paging in any program. It is unavoidable. We just work to minimize it. But we do have to start a game from scratch, and we have to page there. But as I mentioned, I worry far less about that than I do just the idea of wasting memory in the _general_ case. For Crafty, I certainly prefer 4M pages. I have reported on that here in the past and pointed out that performance was _higher_ because of the limited TLB we have. And TLB misses are _bad_. But I would not want to run with 4mb pages exclusively, because of the 100+ system daemons/tasks that run and I don't want every one of them to have a 4mb page for code, a 4mb page for read-only data, and a 4mb page for read/write data, when 128K will do for the whole thing. That's where large page sizes break down and hurt performance, if they _must_ be used everywhere.
Yes, 4MB is too big. But with 256KB and 4 different flavors of access rights, your 100+ daemons will still only occupy 100MB, or 2.5% of a 4GB memory (10% if you have 1GB, which seems the bare minimum now). That sounds acceptable, and there is no reason to shy away from it if the alternative has its own, possibly worse problems (like inefficient use of I/O bandwidth during swapping). As I said, even with 4KB pages only 3 of the 32 Windows Vista daemons were smaller than 2MB. 256B is simply asymptotically small, nowadays.
none of that matters much in the grand scheme of an operating system. The issue is how much of main memory is actually _used_ vs how much is wasted because of the large page sizes? A large number of small jobs, such as those spawned by unix or windows, increases waste significantly. And since a single page must have a single access control type, the typical program will require at least two different access controls, sometimes 3. Bringing in such large chunks also invalidates a lot of cache, and requires overhead to force dirty lines back to memory before the I/O is started. Not much is free... So yes, for a chess engine, which is a large application, large page sizes are good. I've already measured that a year or two back. But if you force it on everyone by just supporting 256K (or whatever) page size, it is not so good.
This assumes that there exist programs smaller than a few times 256KB, and I just don't see those in my task manager...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: re-inventing the SMP wheel

Post by bob »

hgm wrote:
bob wrote:First, I thought we were discussing this in the context of a chess engine, not in terms of general-purpose applications?
I am not sure what you want to say with this. We are discussing the best way to do paging. For Chess programs you don't want to page at all. So in the context of a Chess program it is immaterial how you page.

Not necessarily. You do have to read the thing in the first time. You might get dislodged from memory on rare occasions. But if you talk in the context of "not paging" then what sense does it make to even discuss page size? It is irrelevant whether pages are 256 bytes or 256K bytes in that context... And yes, we've had pages that small. The VAX ultimately settled on 512 bytes per page, even when 128M-512M memory sizes were very common.

Secondly, operating systems definitely use a fault-in-on-demand strategy on program startup and while it is running. I do not want to page my program, and therefore I limit the size of everything to be certain it will fit in physical memory, if nothing else needs to run along with me. But I know that on a hash probe, I do not want to read 1/4 megabyte of data when I am only going to use 16 bytes. On the large magic number tables, I do not want to read in 1/4 megabyte to get to a 64 bit or 32 bit value. The list goes on. Because if I do have to page, I do not want to have to page out 256K chunks when only 1, 2 or 16 bytes have been changed.
There seems to be no basis for this. If part of your hash table gets paged out, you want it paged back in as quickly as possible. You can't say that you only need 16 bytes, as the hash table is constantly bombarded with probes. The magic number tables even more so. Any part of it that is missing will bring you to a grinding halt very soon. It should always be loaded. The more you bring in in advance the better. The larger the pages, the more likely that they will be 'least recently used', and not swapped out. And if it has to be swapped out to shut down operations of your chess program, you want to put the OS under maximum pressure to revoke this disastrous decision. If swapping 256KB is hardly any more expensive than swapping 4KB, it is just as well that 256KB is swapped in at once, rather than having to wait for 64 page faults (which will certainly come quite soon, but not soon enough to catch this rotation of the disk, forcing you to wait for the next).
It would seem that our differences in background are making this more complicated than it needs to be. I've done operating systems development work for _many_ years. The solutions I try to find have to fit the general case, not just a specific case.

As I mentioned above, if you want to discuss this in the context of no paging at all, then page size has absolutely no significance, and the discussion becomes completely moot.
Large page sizes have their place, but then so do small ones. The O/S does tricky things with buffering file I/O using page-sized buffers. 256K is too big unless your program spends all its time reading/writing.

In general, if we have to use a "one-size-fits-all" approach, smaller is better, because _most_ running applications are very small.
Well, as I reported before, applications smaller than 2MB (memory footprint) seem virtually non-existent. So "very small" means something different than I think, or this is just not true at all.
I don't know what kind of system you are used to running on. But for windows and unix, there are _dozens_ of system tasks, each of which is far smaller than 2mb. My current SuSE 10.2 system has 108 processes running right now on my laptop. My office machine has even more running. Some of 'em have been completely paged out, some are completely memory resident, many are partially in, partially out. That is typical for windows or unix today.
How would you do that? One large function? Poor software engineering. Lots of small functions? No way to put 'em into one contiguous block then, since we don't have any semantics to cause that to happen.
Linkers usually pack code coming from the same source file in one contiguous block. The order in which you specify the arguments (the object files) could be a clue.
I don't put 'em all in one large source file. For the obvious reason that I use Make and don't want to recompile the world when I change one module.
Again, the general feeling among computer architecture and O/S designers is that if the choice is large or small, small is better. Were it dynamically variable, then yes, you could tune page size to a specific application's behavior and help things.
Well, large or small are relative notions. I maintain that there also is something like 'too small': the optimum will not lie at an asymptotically small page size (say 4 bytes). So the question is what is optimum. I agree that for general applications 4MB is definitely too large. But 4KB seems way too small.
From a long-term operating system background, the _only_ reason I would consider 4K to be too small is the TLB issue. But opterons have 1024 TLB entries (the last one I used anyway, an 875) which means 4mb programs don't have TLB misses, but anything larger will, introducing page table overhead.

That's the only reason the 4mb page size was introduced in the PC, because the TLB is so small, and without it, you end up with extra memory references to do a page translation. On an opteron, for example, a TLB miss produces 4 _extra_ memory accesses, which essentially slows memory down by a factor of 5. 4MB pages eliminate that unless you go past 4 gigs of ram, where it again begins to be a problem (although it only costs 2 extra accesses with that page size rather than 4, which is still a huge win.)

But right now it is more about memory wastage, rather than paging costs since most systems are configured to minimize actual paging I/O. And that's the bad mark on the 4M page size. It greatly reduces TLB misses on very large programs. But if you use it on small programs memory wastage is very high.
Agreed, 4MB is (still) too big.
I will guarantee you that you will see paging in any program. It is unavoidable. We just work to minimize it. But we do have to start a game from scratch, and we have to page there. But as I mentioned, I worry far less about that than I do just the idea of wasting memory in the _general_ case. For Crafty, I certainly prefer 4M pages. I have reported on that here in the past and pointed out that performance was _higher_ because of the limited TLB we have. And TLB misses are _bad_. But I would not want to run with 4mb pages exclusively, because of the 100+ system daemons/tasks that run and I don't want every one of them to have a 4mb page for code, a 4mb page for read-only data, and a 4mb page for read/write data, when 128K will do for the whole thing. That's where large page sizes break down and hurt performance, if they _must_ be used everywhere.
Yes, 4MB is too big. But with 256KB and 4 different flavors of access rights, your 100+ daemons will still only occupy 100MB, or 2.5% of a 4GB memory (10% if you have 1GB, which seems the bare minimum now). That sounds acceptable, and there is no reason to shy away from it if the alternative has its own, possibly worse problems (like inefficient use of I/O bandwidth during swapping). As I said, even with 4KB pages only 3 of the 32 Windows Vista daemons were smaller than 2MB. 256B is simply asymptotically small, nowadays.
none of that matters much in the grand scheme of an operating system. The issue is how much of main memory is actually _used_ vs how much is wasted because of the large page sizes? A large number of small jobs, such as those spawned by unix or windows, increases waste significantly. And since a single page must have a single access control type, the typical program will require at least two different access controls, sometimes 3. Bringing in such large chunks also invalidates a lot of cache, and requires overhead to force dirty lines back to memory before the I/O is started. Not much is free... So yes, for a chess engine, which is a large application, large page sizes are good. I've already measured that a year or two back. But if you force it on everyone by just supporting 256K (or whatever) page size, it is not so good.
This assumes that there exist programs smaller than a few times 256KB, and I just don't see those in my task manager...
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: re-inventing the SMP wheel

Post by hgm »

Well, it isn't that bad, because usually the TLB misses are satisfied from L2. Usually the data misses L2 before you start missing the TLB, as manufacturers are careful to provide at least enough TLB entries to address all of L2. (I guess an increase of L2 size would hardly show up in the benchmarks if they didn't.) Like your opteron example, 4MB worth of TLB with an L2 of 2MB or 4MB. (Even if the actual L2 is smaller, the number of TLB entries is often inherited from bigger brothers with the same architecture.)

You can force TLB misses without L2 data misses (to test how long they take to process) by stepping through an 8MB array in steps of 4K+64. Each access then lands on a new page (so you touch about 2048 pages, overflowing the TLB), but only one in 64 accesses can map to the same L2 set (and on average only one in 4K does, as the L2 cache way is 256KB), so typically there are no collisions that the number of L2 ways cannot handle. You would thus measure the impact of an L2 data hit with a TLB miss.
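For illustration, a minimal C sketch of such a measurement loop (the rdtsc timing, the warm-up pass, and the exact constants are assumptions added here, not hgm's actual code):

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define SIZE   (8u * 1024 * 1024)   /* 8MB array                                     */
#define STRIDE (4096 + 64)          /* a new 4KB page (and a new L2 set) each access */

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void) {
    volatile unsigned char *buf = malloc(SIZE);
    uint64_t t0, t1;
    size_t i;
    long n = 0;

    /* warm-up pass: the ~2000 touched lines (~128KB) now sit in L2,
       but ~2000 distinct pages do not fit in a 1024-entry TLB        */
    for (i = 0; i < SIZE; i += STRIDE) buf[i] = 1;

    t0 = rdtsc();
    for (i = 0; i < SIZE; i += STRIDE) { buf[i]++; n++; }   /* L2 hits, TLB misses */
    t1 = rdtsc();

    printf("%.1f cycles per access over %ld accesses\n", (double)(t1 - t0) / n, n);
    free((void *)buf);
    return 0;
}
```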

You might be counting yourself rich on the 4MB pages: on most CPUs the normal TLB entries cannot hold 4MB pages, and there are special entries for those. Of which there are then only 8 or so, even if you have 1024 entries for 4KB pages. So you can still only address 32MB without TLB misses, which is still not good enough to cover the hash table.

The 1024 TLB entries in the Opteron are TLB-L2 entries, IIRC. So you would still have some sort of TLB miss if you address 4MB through them, as they would have to be brought from TLB-L2 to TLB-L1 (which only has ~8 entries).
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: re-inventing the SMP wheel

Post by bob »

hgm wrote:Well, it isn't that bad, because usually the TLB misses are satisfied from L2. Usually the data misses L2 before you start missing the TLB, as manufacturers are careful to provide at least enough TLB entries to address all of L2. (I guess an increase of L2 size would hardly show up in the benchmarks if they didn't.) Like your opteron example, 4MB worth of TLB with an L2 of 2MB or 4MB. (Even if the actual L2 is smaller, the number of TLB entries is often inherited from bigger brothers with the same architecture.)
That isn't even close for my PIV xeon, which has 62 TLB entries total. Each cache line can have data for a different page, which means that with 512K of L2 on my PIV xeon, I have 8192 lines. Not enough. On my core-2 with 2mb of L2, I have 32K lines. The TLB can be crushed easily on any machine I have seen.

On the opteron, with the 4-level paging table approach, the tables end up being larger than L2 easily, so there are going to be misses. Just keeping the last 2 levels of the paging tables in L2 requires 2^18 * 4 bytes, or 1mb.

I've measured the performance of large programs on intel and opteron, and once you start to really do random memory accesses (which kills the TLB) memory performance goes right into the tank.

You can force TLB misses without L2 data misses (to test how long they take to process) by stepping through an 8MB array in steps of 4K+64. Each access then lands on a new page (so you touch about 2048 pages, overflowing the TLB), but only one in 64 accesses can map to the same L2 set (and on average only one in 4K does, as the L2 cache way is 256KB), so typically there are no collisions that the number of L2 ways cannot handle. You would thus measure the impact of an L2 data hit with a TLB miss.

You might be counting yourself rich on the 4MB pages: on most CPUs the normal TLB entries cannot hold 4MB pages, and there are special entries for those. Of which there are then only 8 or so, even if you have 1024 entries for 4KB pages. So you can still only address 32MB without TLB misses, which is still not good enough to cover the hash table.

The 1024 TLB entries in the Opteron are TLB-L2 entries, IIRC. So you would still have some sort of TLB miss if you address 4MB through them, as they would have to be brought from TLB-L2 to TLB-L1 (which only has ~8 entries).
But, back to the main point. Page size is important if you have to page; it isn't if you don't, except for TLB considerations. I'm not one that buys into the "OK, we have huge memory, so we can afford to waste major chunks..." Shoot, supercomputers don't even do paging. Just the page translations are too much overhead to accept.
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: re-inventing the SMP wheel

Post by hgm »

Well, the main point is actually how best to do SMP search.. :?

So far the idea of having the owner of split-nodes hand out a limited number of moves to would-be slaves via the hash bucket seems a nice way to do it. Only the polling to know if the current search should be aborted might become a pain, if you have to poll too many nodes.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: re-inventing the SMP wheel

Post by bob »

hgm wrote:Well, the main point is actually how best to do SMP search.. :?

So far the idea of having the owner of split-nodes hand out a limited number of moves to would-be slaves via the hash bucket seems a nice way to do it. Only the polling to know if the current search should be aborted might become a pain, if you have to poll too many nodes.
Simple is not always best. This has been tried and found wanting, over and over. Some form of dynamic splitting is the only way to get decent performance unless the target is to use no more than two processors. And then a good dynamic splitting algorithm will still out-perform the ABDADA (or whatever it is called) type of simple parallel search.

And then there is the issue of taking great care about what is shared/updated, as that can murder cache...

I think it is more practical to just start at state-of-the-art rather than having to do it twice...
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: re-inventing the SMP wheel

Post by hgm »

The problem is that the description of the state-of-the-art methods is either unduly long, or rather vague. The questions I am left with are mostly about implementation details; the various strategies are clear enough.

E.g., if it says that a thread, finding a cut-move in a split point, has to abort the searches of the other threads working on that node, it doesn't necessarily tell you how you could best do that. I don't think that SMP hardware allows you to cause an interrupt of a specific CPU (or of a thread with a given process ID), even if you had taken the trouble of making a globally accessible list of which thread is currently working on which node. So such messages must be passed through shared memory, and polled by the receivers to notice their arrival.

The problem is that in most schemes a thread can be working on several split-points at once ('helpful master'). So either it polls all those nodes, or it just polls a single 'abort' flag. Polling all the nodes it is working on makes the polling expensive, despite the fact that it is done by reading a shared cache line that only causes a miss if an actual abort occurs (because the thread requesting the abort writes it), because multiple polls have to be done at every node, possibly thousands of times. You could of course save on this by polling only once every so many nodes, at the cost of doing some unnecessary work before you notice the cutoff.

With a single abort flag dedicated to the thread, the polling work is reduced. But the thread requesting the abort should now know who exactly is working on a given split-node, in order to abort the right threads. To keep track of this in a list shared by all threads working on the split-node requires a lot of message passing between these threads, as this information often changes, even in the absence of any aborts.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: re-inventing the SMP wheel

Post by bob »

hgm wrote:The problem is that the description of the state-of-the-art methods is either unduly long, or rather vague. The questions I am left with are mostly about implementation details; the various strategies are clear enough.

E.g., if it says that a thread, finding a cut-move in a split point, has to abort the searches of the other threads working on that node, it doesn't necessarily tell you how you could best do that. I don't think that SMP hardware allows you to cause an interrupt of a specific CPU (or of a thread with a given process ID),

You would probably never want to try to interrupt a specific CPU, but certainly (in Unix at least) it is trivial to interrupt (signal) any thread you would like to interrupt. But I would not resort to that even, since we already have a way to abort the entire search or do a time check. I explained my approach to this in my dissertation pretty carefully. I have a flag for each thread that they check frequently, and this flag says "stop now and back out..." The trick is to know who to set that for, because it is not only what is going on at this particular split point, but also any split points deeper into the tree, since they too are no longer useful.

even if you had taken the trouble of making a globally accessible list of which thread is currently working on which node. So such messages must be passed through shared memory, and polled by the receivers to notice their arrival.
I have a thing called a "split block" that represents the local tree state a particular thread is working on. Any split block represents a point in the tree where two or more processors are working together. So I have those split blocks linked "horizontally" (sibling or brothers/sisters links). That gives me the list of primary threads that have to be stopped when an unexpected fail high happens. But threads are also linked "vertically" through this split block as well, such that if this sub-tree gets split farther down into the tree, N new split blocks are created to do that parallel search, and each of those new split blocks points backward to this split block (to identify the parent) and this split block points forward to each of those split blocks to identify its children. I can therefore just "walk this tree" setting stop flags at any active split block that is linked either horizontally to this split block, or linked vertically below it (and this is done recursively to get all the processes below me to stop as well).
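A rough C sketch of this split-block linking (the field names, the first-child/next-sibling layout, and the absence of locking are simplifications invented here; this is not Crafty's actual code):

```c
typedef struct split_block {
    volatile int        abort_search;   /* "stop now and back out"                        */
    struct split_block *parent;         /* vertical link up: the block we were split from */
    struct split_block *first_child;    /* vertical link down: first split below us       */
    struct split_block *sibling;        /* horizontal link: next brother/sister block     */
    /* ...plus the local tree state: position, move list, bounds, ply, ...                */
} SPLIT_BLOCK;

/* set the abort flag on a block and, recursively, on everything linked
   vertically below it */
static void StopBelow(SPLIT_BLOCK *sb) {
    SPLIT_BLOCK *c;
    if (!sb) return;
    sb->abort_search = 1;
    for (c = sb->first_child; c; c = c->sibling)
        StopBelow(c);
}

/* unexpected fail high at a split point: stop every block attached to it
   (the horizontal sibling chain of its children) and everything below them */
static void StopSplitPoint(SPLIT_BLOCK *split_point) {
    SPLIT_BLOCK *s;
    for (s = split_point->first_child; s; s = s->sibling)
        StopBelow(s);
}
```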

The problem is that in most schemes a thread can be working on several split-points at once ('helpful master'). So either it polls all those nodes, or it just polls a single 'abort' flag. Polling all the nodes it is working on makes the polling expensive, despite the fact that it is done by reading a shared cache line that only causes a miss if an actual abort occurs (because the thread requesting the abort writes it), because multiple polls have to be done at every node, possibly thousands of times. You could of course save on this by polling only once every so many nodes, at the cost of doing some unnecessary work before you notice the cutoff.

Don't poll anything. At the top of search, I simply test tree->abort_search and when it is set, I get out right now. It can trivially be set by walking the links given above, which fortunately happens infrequently if split points are chosen reasonably well..
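Something like the following check at the top of the recursive search is then all the "polling" that remains (the signature and the return value are placeholders, not Crafty's):

```c
int Search(SPLIT_BLOCK *tree, int alpha, int beta, int depth, int ply) {
    /* set asynchronously by whoever walked the split-block links above */
    if (tree->abort_search)
        return 0;                /* the value is discarded when we unwind anyway */

    /* ... normal alpha-beta work, passing the same tree pointer down ... */

    return alpha;
}
```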




With a single abort flag dedicated to the thread, the polling work is reduced. But the thread requesting the abort should now know who exactly is working on a given split-node, in order to abort the right threads. To keep track of this in a list shared by all threads working on the split-node requires a lot of message passing between these threads, as this information often changes, even in the absence of any aborts.
You don't pass messages of any kind (at least I don't). I am currently using split block N. And I find someone that can help me. I then copy split block N to two new blocks, P and Q. The current thread switches to use P and the new helper picks up Q. N is linked to both P and Q vertically, as well as to its siblings horizontally.
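Continuing the same sketch, the split step might look roughly like this (the allocator, the thread structure, and the wholesale state copy are placeholders, and locking is again omitted):

```c
typedef struct {
    SPLIT_BLOCK *tree;                  /* the split block this thread searches with */
} THREAD;

SPLIT_BLOCK *AllocateSplitBlock(void);  /* hypothetical allocator from a block pool  */

/* the thread 'me', working on split block n, has found an idle helper */
static void Split(SPLIT_BLOCK *n, THREAD *me, THREAD *helper) {
    SPLIT_BLOCK *p = AllocateSplitBlock();
    SPLIT_BLOCK *q = AllocateSplitBlock();

    *p = *n;                            /* both start from N's local tree state        */
    *q = *n;
    p->abort_search = q->abort_search = 0;
    p->first_child  = q->first_child  = NULL;

    p->parent  = q->parent = n;         /* vertical links up to N                      */
    p->sibling = q;                     /* horizontal sibling chain among N's children */
    q->sibling = NULL;
    n->first_child = p;                 /* vertical link down from N                   */

    me->tree     = p;                   /* the current thread switches to P            */
    helper->tree = q;                   /* the new helper picks up Q                   */
}
```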
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: re-inventing the SMP wheel

Post by Zach Wegner »

bob wrote:I have a thing called a "split block" that represents the local tree state a particular thread is working on. Any split block represents a point in the tree where two or more processors are working together. So I have those split blocks linked "horizontally" (sibling or brothers/sisters links). That gives me the list of primary threads that have to be stopped when an unexpected fail high happens. But threads are also linked "vertically" through this split block as well, such that if this sub-tree gets split farther down into the tree, N new split blocks are created to do that parallel search, and each of those new split blocks points backward to this split block (to identify the parent) and this split block points forward to each of those split blocks to identify its children. I can therefore just "walk this tree" setting stop flags at any active split block that is linked either horizontally to this split block, or linked vertically below it (and this is done recursively to get all the processes below me to stop as well).
Let me tell you the way I do it, which is a little different. I don't have any connections between split points themselves, but rather each process has a stack of pointers to each split point it is part of, in order of depth. When an idle processor joins a split point deep in the tree below another split point, it simply joins all of the split points the "master" is part of. For a stop, you do not need to walk the "split point tree" to find all of the processors below the split point; just look at the split point's list of children and pass a message to each one. This to me is a much simpler approach.
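For comparison, a rough C sketch of the per-thread stack Zach describes (the names, the fixed limits, the per-thread stop mailbox, and the omission of locking are simplifications made here, not his actual code):

```c
#define MAX_THREADS     16              /* illustrative limits */
#define MAX_SPLIT_DEPTH 64

volatile int stop_flag[MAX_THREADS];    /* one "please stop" mailbox per thread */

typedef struct split_point {
    int n_children;
    int children[MAX_THREADS];          /* ids of the threads attached to this node */
    /* ...shared move list, bounds, node state...                                   */
} SPLIT_POINT;

typedef struct {
    SPLIT_POINT *sp[MAX_SPLIT_DEPTH];   /* split points this thread is part of, by depth */
    int          n_sp;
} THREAD_STATE;

/* an idle thread joining split point sp also joins every split point the
   master is already part of (assuming the master's stack does not yet hold sp) */
static void Join(THREAD_STATE *idle, const THREAD_STATE *master,
                 SPLIT_POINT *sp, int idle_id) {
    int i;
    for (i = 0; i < master->n_sp; i++) {
        idle->sp[i] = master->sp[i];
        idle->sp[i]->children[idle->sp[i]->n_children++] = idle_id;
    }
    idle->sp[master->n_sp] = sp;
    sp->children[sp->n_children++] = idle_id;
    idle->n_sp = master->n_sp + 1;
}

/* a cutoff at sp: no tree walk needed, just message every attached thread */
static void StopChildren(SPLIT_POINT *sp) {
    int i;
    for (i = 0; i < sp->n_children; i++)
        stop_flag[sp->children[i]] = 1;
}
```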

Zach
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: re-inventing the SMP wheel

Post by bob »

Zach Wegner wrote:
bob wrote:I have a thing called a "split block" that represents the local tree state a particular thread is working on. Any split block represents a point in the tree where two or more processors are working together. So I have those split blocks linked "horizontally" (sibling or brothers/sisters links). That gives me the list of primary threads that have to be stopped when an unexpected fail high happens. But threads are also linked "vertically" through this split block as well, such that if this sub-tree gets split farther down into the tree, N new split blocks are created to do that parallel search, and each of those new split blocks points backward to this split block (to identify the parent) and this split block points forward to each of those split blocks to identify its children. I can therefore just "walk this tree" setting stop flags at any active split block that is linked either horizontally to this split block, or linked vertically below it (and this is done recursively to get all the processes below me to stop as well).
Let me tell you the way I do it, which is a little different. I don't have any connections between split points themselves, but rather each process has a stack of pointers to each split point it is part of, in order of depth. When an idle processor joins a split point deep in the tree below another split point, it simply joins all of the split points the "master" is part of. For a stop, you do not need to walk the "split point tree" to find all of the processors below the split point; just look at the split point's list of children and pass a message to each one. This to me is a much simpler approach.

Zach
I don't like the stack idea although I had originally considered it. The problem is this. A single thread splits at A, then B, then C, then D, then E. And while it is working on E, the other thread working with it at B completes and there is nothing else left to do. I simply clean that up immediately, whereas in a stack approach you either leave it sitting around, which causes wildly excessive split block usage on large systems, or you violate the stack principle and squeeze that entry out instead.

It is just an issue of data structures to me, and with my approach, it is easy to take a snapshot to see who is doing what while debugging, since any processor/thread can see every split block and dump the output as it exists in that instant... From experience (my first parallel search was done on a Univac 1100/42 in 1978) I wanted to make debugging as easy as possible, which means when something hangs, "somebody" can display the state of the tree, who is working where and for whom, etc...