re-inventing the SMP wheel

Discussion of chess software programming and technical issues.

Moderators: hgm, Dann Corbit, Harvey Williamson

Forum rules
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.
User avatar
Zach Wegner
Posts: 1922
Joined: Wed Mar 08, 2006 11:51 pm
Location: Earth
Contact:

Re: re-inventing the SMP wheel

Post by Zach Wegner » Sat Aug 25, 2007 6:48 pm

hgm wrote:A much more interesting question would be what is the best place for split points. Should they be concentrated at remaining depth =4, or 8, or ~16 ply?
I'm glad this kind of discussion is coming up, because I'm at the point where this is pretty much all I have left to worry about (besides a few bugs I'm working out). I believe that splitpoints should have as big subtrees as possible, in order to minimize the overhead of splitting. Of course, the higher the depth of the node, the longer it is going to take to search the first N moves. You must strike a balance between probability of being a cut-node and available work to do. This means that when the number of processors gets larger, you must start to move downward, splitting at lower depths. I have in mind a strategy of processor "partitions" where each partition has a master processor (that is itself a child of the real master processor) and all the other processors in its partition must split only on nodes below its split point. So the question becomes, is it better to have one split point with depth=16 and 8 processors, or one split point at depth 16 with 2 processors and two more at depth=10 with 4 processors each, or a split point at depth=16 with four processors and four split points at depth=10 with two processors each? The partitions can even work recursively, I.e. with 8 processors you have the partition tree:

Code: Select all

cpu0
|          \
cpu0        cpu4
|    \      |    \
cpu0  cpu2  cpu4  cpu6
|  \  |  \  |  \  |  \
c0 c1 c2 c3 c4 c5 c6 c7
Probably this idea would not work below say, 16 processors, but above that, it might be useful. Is there a formula for the optimum branching factor of this partition tree for N processors? For low N, bf=N seems to make sense. But what happens at N=1024? I would guess somewhere near bf=4.

Zach

bob
Posts: 20923
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: re-inventing the SMP wheel

Post by bob » Sat Aug 25, 2007 9:07 pm

Zach Wegner wrote:
hgm wrote:A much more interesting question would be what is the best place for split points. Should they be concentrated at remaining depth =4, or 8, or ~16 ply?
I'm glad this kind of discussion is coming up, because I'm at the point where this is pretty much all I have left to worry about (besides a few bugs I'm working out). I believe that splitpoints should have as big subtrees as possible, in order to minimize the overhead of splitting. Of course, the higher the depth of the node, the longer it is going to take to search the first N moves. You must strike a balance between probability of being a cut-node and available work to do. This means that when the number of processors gets larger, you must start to move downward, splitting at lower depths. I have in mind a strategy of processor "partitions" where each partition has a master processor (that is itself a child of the real master processor) and all the other processors in its partition must split only on nodes below its split point. So the question becomes, is it better to have one split point with depth=16 and 8 processors, or one split point at depth 16 with 2 processors and two more at depth=10 with 4 processors each, or a split point at depth=16 with four processors and four split points at depth=10 with two processors each? The partitions can even work recursively, I.e. with 8 processors you have the partition tree:

Code: Select all

cpu0
|          \
cpu0        cpu4
|    \      |    \
cpu0  cpu2  cpu4  cpu6
|  \  |  \  |  \  |  \
c0 c1 c2 c3 c4 c5 c6 c7
Probably this idea would not work below say, 16 processors, but above that, it might be useful. Is there a formula for the optimum branching factor of this partition tree for N processors? For low N, bf=N seems to make sense. But what happens at N=1024? I would guess somewhere near bf=4.

Zach
First, this is a non-trivial topic. But for pure SMP systems (lets ignore NUMA even though a quad opteron is really NUMA) the best idea is to split anywhere. But once you decide to go a real splitting approach, you still have to consider overhead, as splitting is not free. In crafty, I can limit the max depth at which splitting occurs to limit splitting overhead. And this depth varies depending on the hardware speed, and is something I tune for each different system I use.

also I use the idea of "groups". For NUMA a "group" is carefully defined as adjacent nodes, for non-numa, it is not. I generally set this to "4" for 1-16 processors and that works pretty well. It simply limits the number of processors that work together at any particular split point, otherwise you run into an overhead problem where you have more processors than moves to search, but you do the split first, and then some have nothing to do making their split work totally wasted. The "group" idea has worked pretty well for me from 1-32 processors. I have not run much beyond that although a few years ago, Eugene ran on a 64 cpu box he had up at Microsoft...

User avatar
hgm
Posts: 26128
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: re-inventing the SMP wheel

Post by hgm » Sat Aug 25, 2007 10:30 pm

bob wrote:Nobody does that. They run simulations with different page sizes to see what is optimal across a wide range of applications.
Even if they treat this as an experimental science, surely they must have learned some empirical laws from these simulations.

User avatar
hgm
Posts: 26128
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: re-inventing the SMP wheel

Post by hgm » Sat Aug 25, 2007 10:55 pm

So far I could not think of anything better than having each CPU continuously poll all split points from which it grabbed a move to search. A fail high from a move searched by another processor in any of those nodes could make it mandatory to abort. While aborting the thread would clean up all nodes it owns, which then could propagate the abort request to all processors having grabbed nodes from those. But it will often happen that a processor A is in a branch where nodes are owned as (root) A B A C A. A cutoff at the first A would then not propagate to A if it was not explicitly polled. So you cannot limit yourself to only polling the last split node you grabbed a move from.

Now polling the hash table (e.g. the alpha at the node, maintained in the Score field) seems a bad idea if it is also used for the shared (partial) move list. Many CPUs would do it all the time, and the move list would often be written to by a variety of CPUs, as they grab moves, or complete their search. Such write would then invalidate the entry in all other CPUs, that would immediately start to reload it on a poll. Even if there was no need to update alpha, because the searched moves failed low (the normal outcome).

So it seems very good to have the signal that requests the cutoff in a different cache line from the shared move list. Than other CPUs could freely grab moves and search them to confirm that they fail low, without invalidating the cutoff-abort signal from the cache of all these continuously polling CPUs, keeping the polling cheap. This argues for creating a separate array for such signals; one per cache line, where a new signal flag is allocated (and its number posted in the hash) for every split node created.

Rob

Re: re-inventing the SMP wheel

Post by Rob » Sun Aug 26, 2007 9:11 am

hgm wrote: Well, if you could show me a formula that calculates optimum page size from other system parameters, and it would evaluate to 4KB, I would be easily convinced.
http://gwyn.tux.org/~balsa/linux/cyrix/wip.html gives a formula:

page_size = sqrt (cache_line_size * memory)

64 byte cache line size and 1GB memory gives 256KB page size.
I don't understand what cache line size has to do with it.

bob
Posts: 20923
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: re-inventing the SMP wheel

Post by bob » Mon Aug 27, 2007 1:21 am

Rob wrote:
hgm wrote: Well, if you could show me a formula that calculates optimum page size from other system parameters, and it would evaluate to 4KB, I would be easily convinced.
http://gwyn.tux.org/~balsa/linux/cyrix/wip.html gives a formula:

page_size = sqrt (cache_line_size * memory)

64 byte cache line size and 1GB memory gives 256KB page size.
I don't understand what cache line size has to do with it.
I know of several that agree with me that such a page size is _not_ good for operating systems that do demand paging. I don't want to have to read in 256K chunks because major parts of that will be unused and it is expensive. 8K-16K numbers have shown reasonable performance, but the larger one goes, the more unnecessary I/O one does. This includes the initial page-in operations as well. If you buy the old 90-10 rule, you waste about 90% of everything you bring in. 90% of 256K is a lot, multiplied over large numbers of pages. Large pages make sense for certain cases, but in general, they represent a lot of overhead...

User avatar
hgm
Posts: 26128
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: re-inventing the SMP wheel

Post by hgm » Mon Aug 27, 2007 6:51 am

What kind of logic is that? If 90% of everything you bring in is nonsense, it doesn't matter at all which page size you use. 256KB pages could only be bad if they would bring in a larger fraction of nonsense than 4KB pages. In particular that applies to the initial page-in, where you would just need a smaller number of pages if the were bigger.

This is where hard-disk architecture comes in. With today's write densities reading or writing chunks of 256KB hardly takes more time than reading or writing 4KB. It is all seek time and rotational delay. Combining pages that happen to be swapped out for larger chunks to write is no solution at all, as you would not need them to be swapped back in in the same combination. And bringing neighbor pages you scattered over the disk will be very inefficient usage of I/O, and thus excruciatingly slow.

If you want your system to work best under conditions of heavy swapping (which I still consider madness on a PC), swapping out pages in groups that are also likely to be needed at the same time (i.e. contiguous) would be the best solution. And that is equivalent to using larger pages. The only limit to that is that there should fit enough pages in physical memory to remain flexible enough to contain the typical number of tasks, without wasting too large a fraction of it.

bob
Posts: 20923
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: re-inventing the SMP wheel

Post by bob » Tue Aug 28, 2007 3:00 am

hgm wrote:What kind of logic is that? If 90% of everything you bring in is nonsense, it doesn't matter at all which page size you use. 256KB pages could only be bad if they would bring in a larger fraction of nonsense than 4KB pages. In particular that applies to the initial page-in, where you would just need a smaller number of pages if the were bigger.
OK. easy enough. Lets take a 4K page system, and a 256K page system. Let's reference the _first_ byte of the 256K page. The probability we need that byte is therefore 1.0 since we referenced it and need to fault it in. The probability we need the next byte is something less than 1.0, because we won't need it _every_ time. And the farther out into that 256K you go, the less likely it is that you need that data. It is a problem associated with futile pre-fetching, where you pre-fetch something and then discover you don't need it.

With the 4K page size, we bring in 4K and that's it, unless we need the next 4K. With the 256K page size, we brought in all 256K which could be 63 4K chunks too much.

The math is simple. The theory is sound. The concept is well understood. Same problem with longer cache line sizes, as you have the same futile pre-fetching problem, yet you have to wait for all the I/O to be done, or all bytes in the cache line to be filled, before you can go on to something else.

I'm not sure how/why that is hard to follow. The article quoting a silly formula for page size is simply wrong. Optimal page size is a function of memory size for sure, but it is _also_ a function of I/O throughput since you have to read those pages in. That's the entire logic behind paging in the first place, to avoid reading in things you never need again. Rather than reading in the entire program on start-up, we let unused parts never get read in, and for parts that were read in, as they become unused (Crafty has a huge pot of initialization code that is executed once on startup and never again) they are phased out to free up memory for the stuff that is actually being used (the working set of the program).

This is where hard-disk architecture comes in. With today's write densities reading or writing chunks of 256KB hardly takes more time than reading or writing 4KB. It is all seek time and rotational delay. Combining pages that happen to be swapped out for larger chunks to write is no solution at all, as you would not need them to be swapped back in in the same combination. And bringing neighbor pages you scattered over the disk will be very inefficient usage of I/O, and thus excruciatingly slow.
The alternative is doing a _ton_ of I/O that is totally unnecessary. I'd rather swap out 2 4K pages out of 64 that were actually modified, rather than all 64 of them. Etc...

I/O reading/writing is a long way from being free. Rotational speeds for most disks are not that fast, so latency is a big part, yes. But transfer rates are not exactly blazingly fast either. Since operating systems do _not_ give contiguous disk space toa program where it sits on disk, you are not going to be able to read in big chunks with zero seeks...


If you want your system to work best under conditions of heavy swapping (which I still consider madness on a PC), swapping out pages in groups that are also likely to be needed at the same time (i.e. contiguous) would be the best solution. And that is equivalent to using larger pages. The only limit to that is that there should fit enough pages in physical memory to remain flexible enough to contain the typical number of tasks, without wasting too large a fraction of it.
My machines do not "swap heavily". Because of the 4K page sizes, I am able to keep the 10% of the program that is actually needed in memory, and the 90% that is not needed sticks out on disk. And as that 10% changes during the life of the program's execution, the unused stuff migrates out, and the useful stuff migrates in. Most programs are not so bloated that they execute everything in their code base. At least not anything I use...

User avatar
hgm
Posts: 26128
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: re-inventing the SMP wheel

Post by hgm » Tue Aug 28, 2007 7:25 am

bob wrote:The alternative is doing a _ton_ of I/O that is totally unnecessary. I'd rather swap out 2 4K pages out of 64 that were actually modified, rather than all 64 of them. Etc...
If writing/reading the 64 pages takes as long, of 1% longer than reading/writing the single page, I would prefer writing the 64 pages.

It is clear that demand paging does not work acceptably if the working set is scattered out in infinitesimal chunks over the entire address space (e.g. hash tables in a Chess engine). Neither with 4KB paging, nor with 256KB paging.

So programmers will always be able to write programs that will make performance drop to 0.0000000001 if they are not the only program running on the machine, or if the machine has too little memory. The burden to design their software such that it does not run like crap under adverse conditions is on them. In particular to design it with some locality in mind. Now that typical software is so bloated that you need GB memories to run it, working sets will also be larger. How many different stages does a program go through, and how many different working sets will it have during its execution? Millions? Or dozens? 256KB pages would offer enough flexibility to handle decently designed programs, which would put their initilization code in a contiguous block.

Your Crafty example is totally misleading. How much of the memory footprint do you save because the initialization code can be swapped out? I would be surprised if it is more than 1%. You don't have heavy swapping because you set your hash-table size to a value where you don't have it! Not because there is demand paging on your code segment.

If I would design an OS with 256KB pages, of course it would assign contiguous 256KB blocks in the swap file to each page. Not doing so would be asking for trouble. Disks typically run at 7200 rpm, i.e. 8.33 ms for a rotation. So average rotational delay would be 4.16 ms. Transfer rates ar 50-100 MB/s. So rotational delay alone on a random access corresponds to 200-400KB of data, a full rotation to 400-800KB. Wasting that I/O bandwidth to read a 4KB page would be madness, it would slow down the amount of data you could page in hundredfold! Even if there was only a 1% probability that a neighboring page would be needed in the near future, it would pay to speculatively read it as the read head is positioned to read it along with the page we wanted.

bob
Posts: 20923
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: re-inventing the SMP wheel

Post by bob » Tue Aug 28, 2007 3:33 pm

hgm wrote:
bob wrote:The alternative is doing a _ton_ of I/O that is totally unnecessary. I'd rather swap out 2 4K pages out of 64 that were actually modified, rather than all 64 of them. Etc...
If writing/reading the 64 pages takes as long, of 1% longer than reading/writing the single page, I would prefer writing the 64 pages.

It is clear that demand paging does not work acceptably if the working set is scattered out in infinitesimal chunks over the entire address space (e.g. hash tables in a Chess engine). Neither with 4KB paging, nor with 256KB paging.

First, I though we were discussing this in the context of a chess engine, not in terms of general-purpose applications?

Secondly, operating systems definitely use a fault-in-on-demand strategy on program startup and while it is running. I do not want to page my program, and therefore I limit the size of everything to be certain it will fit in physical memory. If nothing else needs to run along with me. But I know that on a hash probe, I do not want to read 1/4 megabyte of data when I am only going to use 16 bytes. On the large magic number tables, I do not want to read in 1/4 megabyte to get to a 64 bit or 32 bit value. The list goes on. Because if I do have to page, I do not want to have to page out 256K chunks when only 1, 2 or 16 bytes have been changed.

Large page sizes have their place, but then so do small ones. The O/S is tricky with buffering file I/O using page-sized buffers. 256K is too big unless your program spends all its time reading/writing.

In general, if we have to use a "one-size-fits-all" approach, smaller is better, because _most_ running applications are very small.

So programmers will always be able to write programs that will make performance drop to 0.0000000001 if they are not the only program running on the machine, or if the machine has too little memory. The burden to design their software such that it does not run like crap under adverse conditions is on them. In particular to design it with some locality in mind. Now that typical software is so bloated that you need GB memories to run it, working sets will also be larger. How many different stages does a program go through, and how many different working sets will it have during its execution? Millions? Or dozens? 256KB pages would offer enough flexibility to handle decently designed programs, which would put their initilization code in a contiguous block.
How would you do that? One large function? Poor software engineering. Lots of small functions? No way to put 'em into one contiguous block then, since we don't have any semantics to cause that to happen.

Again, the general feeling among computer architecure and O/S designers is that if the choice is large or small, small is better. Were it dynamically variable, then yes, you could tune page size to a specific application's behavior and help things.

But right now it is more about memory wastage, rather than paging costs since most systems are configured to minimize actual paging I/O. And that's the bad mark on the 4M page size. It greatly reduces TLB misses on very large programs. But if you use it on small programs memory wastage is very high.

/quote]

[/quote]

Your Crafty example is totally misleading. How much of the memory footprint do you save because the initialization code can be swapped out? I would be surprised if it is more than 1%. You don't have heavy swapping because you set your hash-table size to a value where you don't have it! Not because there is demand paging on your code segment.[/quote]

I will guarantee you that you will see paging in any program. It is unavoidable. We just work to minimize it. But we do have to start a game from scratch, and we have to page there. But as I mentioned, I worry far less about that than I do just the idea of wasting memory in the _general_ case. For Crafty, I certainly prefer 4M pages. I have reported on that here in the past and pointed out that performance was _higher_ because of the limited TLB we have. And TLB misses are _bad_. But I would not want to run with 4mb pages exclusively, because of the 100+ system daemons/tasks that run and I don't want every one of them to have a 4mb page for code, a 4mb page for read-only data, and a 4mb page for read/write data, when 128K will do for the whole thing. That's where large page sizes break down and hurt performance, if they _must_ be used everywhere.


If I would design an OS with 256KB pages, of course it would assign contiguous 256KB blocks in the swap file to each page. Not doing so would be asking for trouble. Disks typically run at 7200 rpm, i.e. 8.33 ms for a rotation. So average rotational delay would be 4.16 ms. Transfer rates ar 50-100 MB/s. So rotational delay alone on a random access corresponds to 200-400KB of data, a full rotation to 400-800KB. Wasting that I/O bandwidth to read a 4KB page would be madness, it would slow down the amount of data you could page in hundredfold! Even if there was only a 1% probability that a neighboring page would be needed in the near future, it would pay to speculatively read it as the read head is positioned to read it along with the page we wanted.
none of that matters much in the grand scheme of an operating system. The issue is how much of main memory is actually _used_ vs how much is wasted because of the large page sizes? A large number of small jobs, such as those spawned by unix or windows, increases waste significantly. And since a single page must have a single access control type, the typical program will require at least two different access controls, sometimes 3. Bringing in such large chunks also invalidates a lot of cache, and requires overhead to force dirty lines back to memory before the I/O is started. Not much is free... So yes, for a chess engine, which is a large application, large page sizes are good. I've already measured that a year or two back. But if you force it on everyone by just supporting 256K (or whatever) page size, it is not so good.

Post Reply