re-inventing the SMP wheel

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: re-inventing the SMP wheel

Post by bob »

hgm wrote:Well, the eval I have in mind doesn't really use much memory at all. 32KB is huge. Adding a few Piece-Square Tables for each piece type (16 x 64 bytes = 1KB) would not kill me. But playing a match without them will just tell you next to nothing.

About paging:

When you allocate a small object, you are a mere user, and it will be allocated within your user address space with whatever granularity you like. When your total demand on memory exceeds a 256KB boundary through the allocation, the OS will just allocate a new 256KB chunk to you, which will be able to satisfy your demand for small objects for a long time.

The idea that allocating in 256KB chunks would 'waste' memory is seriously flawed thinking. Trying to prevent this 'wastage' by packing the used fraction densely and collecting all the unused memory into one contiguous block would still leave the memory unused, and thus just as wasted. That you _could_ run other programs in that space is worth zilch if you don't have other tasks that _want_ to run. You are just trying to 'save' memory in order to leave it idle...
Sorry, but that's wrong. The _reason_ paging was developed in the first place was to avoid those large blocks, the resulting fragmentation, and the garbage collection necessary to maintain free space in a large block. This idea takes us back 40 years, to before the virtual-memory machines of the 1960s were developed.

There are places where big blocks are nice, and with today's hardware one can choose 4K or 4M (sometimes 2M, depending on the machine) page sizes. Big pages make the TLB far more effective, and they also reduce the overhead a program incurs as it grows big, since it doesn't have to grab a bunch of small pages.
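As a rough, hedged illustration of how a program can ask for those big pages on Linux (this assumes a kernel with MAP_HUGETLB support and huge pages reserved by the administrator, e.g. via /proc/sys/vm/nr_hugepages; it falls back to ordinary 4K pages otherwise):

Code: Select all

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define LEN (8UL * 1024 * 1024)   /* 8 MB working area */

int main(void)
{
    /* Ask for the mapping to be backed by huge pages (2 MB or 4 MB,
     * depending on the CPU and kernel configuration). */
    void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        /* No huge pages available: fall back to ordinary 4 KB pages. */
        p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        puts("using ordinary 4 KB pages");
    } else {
        puts("mapping backed by huge pages");
    }
    memset(p, 0, LEN);            /* touch the memory so pages are faulted in */
    munmap(p, LEN);
    return 0;
}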

But _most_ programs are not that big, and forcing 256K granularity is simply a mistake that has been corrected for many years.

If I look in my task manager, it turns out that almost any task uses about 2MB of memory, and there are only 5 (of 32) that use less than 1MB (the smallest using 168KB). It doesn't seem a significant fraction of memory would be wasted by rounding everything up to 256KB multiples.
There are lots of other issues. I want data memory that is read/write accessible, so you have to give me 256K when I need 4K. I want read-only memory for constants: you have to give me 256K, but I only want 4K. I want execute-only memory for the code. Same story. And when I do a page replacement, I am forced to write out the entire chunk and read in the entire chunk, since that is the only way I can reuse that memory for someone else, even though there is only one page I actually need or want.

It would _not_ be efficient. More accurately, it would be impossibly slow.

So we don't have 4KB pages for nothing? Are you claiming that the optimal page size is independent of total memory size and hard-disk I/O characteristics? That a page size that is optimal for a 4MB memory is also optimal for a 4GB memory? And that you would want to swap in equal-sized chunks to a 40MB disk as to a 320GB disk? This seems rather unlikely, and I don't believe it for a minute!
Optimal page size has _nothing_ to do with disk-drive size or speed. Nothing at all. Anything you can do with large page sizes, I can do with small page sizes. Optimal page size has to do with the way protection is done and the way most programs grow, both large and small ones. Most programs have at least two types of pages, execute and read/write. I don't want to allocate and release 512K just to display today's date...

There are other significant issues with shared memory as well (shared libraries, shared executable pages, etc.).

Believe me, this has not been developed in an ad hoc way over the past 30 years; it has been studied, simulated and measured to death to get us to where we are today.
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: re-inventing the SMP wheel

Post by hgm »

bob wrote:Believe me, this has not been developed in an ad hoc way over the past 30 years; it has been studied, simulated and measured to death to get us to where we are today.
This does not answer the most important question, though: is the optimum page size dependent on total memory size or not? If not, how can an optimal size, determined 30 years ago, still be valid today, when memories have expanded a thousand times? I predict that 4KB pages will soon be a relic (like real mode and the floppy disk), and will be replaced by 4MB ones. Even hardware support for them will disappear.

The efficiency of disk I/O is determined by the transfer rate vs. rotational speed, and swapping out memory in very small chunks scattered over the disk is what is impossibly slow. If you swap in larger blocks, almost nothing is to be gained by scattering those blocks over memory in even smaller blocks. If your effective page size is small compared to the average footprint of user programs or their components, you are simply in the continuum limit, and nothing is gained by making it still smaller. We have long since passed the point where this is the case for 4KB pages.

Note that the hardware support for 4KB pages still allows the OS to declare almost arbitrary parts of a 256KB block as read-only or execute-only. There is no reason whatsoever to make the 256KB block homogeneous. Just place the stack at the end and the rest at the beginning. If there is unused stuff in between, that is great, for it allows the stack and user data (e.g. through sbrk) to grow without reshuffling the memory.
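A small user-space sketch of that point, assuming a POSIX system where mmap()/mprotect() operate with 4KB granularity (the sizes and layout here are only illustrative):

Code: Select all

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define BLOCK_SIZE (256 * 1024)
#define PAGE_SIZE  4096

int main(void)
{
    /* One contiguous 256 KB block, handed out as a single chunk. */
    unsigned char *block = mmap(NULL, BLOCK_SIZE, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (block == MAP_FAILED) { perror("mmap"); return 1; }

    /* Give different 4 KB pages inside the block different rights:
     * the first 16 KB becomes read-only, the next 16 KB read+execute,
     * and the remainder stays read/write. */
    if (mprotect(block, 4 * PAGE_SIZE, PROT_READ) != 0 ||
        mprotect(block + 4 * PAGE_SIZE, 4 * PAGE_SIZE,
                 PROT_READ | PROT_EXEC) != 0) {
        perror("mprotect");
        return 1;
    }

    block[32 * PAGE_SIZE] = 42;  /* still writable: outside the protected pages */
    printf("per-page rights set inside one %d KB block\n", BLOCK_SIZE / 1024);
    munmap(block, BLOCK_SIZE);
    return 0;
}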
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: re-inventing the SMP wheel

Post by bob »

hgm wrote:
bob wrote:Believe me, this has not been developed in an ad hoc way over the past 30 years; it has been studied, simulated and measured to death to get us to where we are today.
This does not answer the most important question, though: is the optimum page size dependent on total memory size or not? If not, how can an optimal size, determined 30 years ago, still be valid today, when memories have expanded a thousand times? I predict that 4KB pages will soon be a relic (like real mode and the floppy disk), and will be replaced by 4MB ones. Even hardware support for them will disappear.

The answer is maybe. With 32-bit addressing, which has been the norm since the 1960s, you can find page sizes from 512 bytes (not kilobytes) on the VAX to 4K on the Data Generals. The rest fall somewhere in between.

For huge-memory machines and programs, the solution is already available in the large-page model supported by x86 and x86-64. However, the large-page stuff is really more about addressing TLB misses than memory efficiency. The number of pages allocated is irrelevant as far as the O/S is concerned.

The efficiency of disk I/O is determined by the transfer rate vs. rotational speed, and swapping out memory in very small chunks scattered over the disk is what is impossibly slow. If you swap in larger blocks, almost nothing is to be gained by scattering those blocks over memory in even smaller blocks. If your effective page size is small compared to the average footprint of user programs or their components, you are simply in the continuum limit, and nothing is gained by making it still smaller. We have long since passed the point where this is the case for 4KB pages.

Two points.

(1) Even with 2K or 4K pages, the O/S does _not_ swap out individual pages, for the very reason you mention. As pages are removed from resident sets, they collect in the free list, which is broken down into "clean" and "dirty" pages. Clean pages have not been modified; dirty pages have, and must therefore be written out. They are written out in batches, once enough have collected to make the I/O worthwhile (a toy sketch of this clean/dirty bookkeeping follows point (2) below).

(2) Swapping in large blocks is bad, because it is a speculative operation. I know the user needs a specific byte, and I can assume there is a reasonable probability he will use bytes close by, but not bytes 256K away. That's why cache lines are no longer than they are today: the probability of using the last byte of a long line is a small fraction of the probability of using the next sequential byte. So no, I don't want to read in huge chunks just because it is more efficient I/O-wise. The risk is that I end up doing I/O that is wasted, and that holds up other, more critical pages that can't be read in until that transfer finishes, even though we are not going to need all the data. We "tolerate" the wasted I/O on large-memory systems running large-memory applications, to get around the TLB-miss penalty that is substantial for large programs using small page sizes. But "tolerate" is the operative word. We just can't sensibly read in or write out less than a page.
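For what it's worth, here is a toy model of the clean/dirty bookkeeping in point (1); it is definitely not real kernel code, just a sketch of the idea that evicted pages are queued and dirty ones are flushed in batches rather than one at a time:

Code: Select all

#include <stdbool.h>
#include <stddef.h>

#define WRITE_BATCH 32            /* flush dirty pages this many at a time */

struct page {
    struct page *next;
    bool         dirty;           /* modified since it was brought in?     */
    /* frame number, backing-store slot, owning process, ...               */
};

static struct page *clean_list;   /* immediately reusable                  */
static struct page *dirty_list;   /* must be written out before reuse      */
static size_t       dirty_count;

/* Stand-in for queueing one large, mostly contiguous write to the swap
 * device; afterwards the pages are clean and can be reused. */
static void write_batch_to_disk(struct page *batch)
{
    for (struct page *p = batch; p != NULL; p = p->next)
        p->dirty = false;
}

/* Called when a page is evicted from some process's resident set. */
void reclaim_page(struct page *pg)
{
    if (!pg->dirty) {                      /* clean: nothing to flush */
        pg->next   = clean_list;
        clean_list = pg;
        return;
    }
    pg->next   = dirty_list;               /* dirty: queue for batched I/O */
    dirty_list = pg;
    if (++dirty_count >= WRITE_BATCH) {
        write_batch_to_disk(dirty_list);
        /* the freshly cleaned pages join the clean list */
        struct page *p = dirty_list;
        while (p->next != NULL)
            p = p->next;
        p->next     = clean_list;
        clean_list  = dirty_list;
        dirty_list  = NULL;
        dirty_count = 0;
    }
}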



Note that the hardware support for 4KB pages still allows the OS to declare almost arbitrary parts of a 256KB block as read-only or execute-only. There is no reason whatsoever to make the 256KB block homogeneous. Just place the stack at the end and the rest at the beginning. If there is unused stuff in between, that is great, for it allows the stack and user data (e.g. through sbrk) to grow without reshuffling the memory.
You can't mix. How on earth is it going to traverse the page tables when each entry could point to a large page or a small page? You get either/or, but not both.
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: re-inventing the SMP wheel

Post by hgm »

bob wrote:You can't mix. How on earth is it going to traverse the page tables when each entry could point to a large page or a small page? You get either/or, but not both.
I am not talking about mixing sizes, just about setting different access rights on different 4K pages within a 256KB allocated block. The page table is what it is, defined by hardware, so if I want my OS to emulate 256KB pages, I will have to write 64 actual entries in the page table to map one. But that does not have to stop me from emulating them with a finer granularity for the access rights.
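To put a number on that bookkeeping, a hypothetical sketch of what mapping one such emulated 256KB page means: filling 64 consecutive 4KB entries, each of which can still carry its own access bits. The flat PTE array and flag values below are illustrative only, loosely modelled on the classic x86 layout:

Code: Select all

#include <stdint.h>

#define PTE_PRESENT   0x1u
#define PTE_WRITABLE  0x2u
#define PTE_USER      0x4u

#define PAGE_SIZE        4096u
#define PAGES_PER_BLOCK    64u         /* 256 KB / 4 KB */

/* Map one 256 KB block by filling 64 4 KB entries.  'page_table' is a
 * flat array standing in for the real two-level structure. */
void map_256k_block(uint32_t *page_table,
                    uint32_t  first_vpage,   /* first virtual page number */
                    uint32_t  first_pframe,  /* first physical frame      */
                    int       writable)
{
    for (uint32_t i = 0; i < PAGES_PER_BLOCK; i++) {
        uint32_t flags = PTE_PRESENT | PTE_USER;
        if (writable)                   /* could just as well differ per page */
            flags |= PTE_WRITABLE;
        page_table[first_vpage + i] = (first_pframe + i) * PAGE_SIZE | flags;
    }
}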

4KB and 4MB pages can be mixed because the 4KB page table is two-level and the 4MB pages simply skip one level. But from what I have heard, 4MB pages do not work properly at all on current CPUs performance-wise, and it is way faster to use 4KB pages and accept the TLB misses, even when scanning sequentially through a large memory range.
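For reference, here is a software model of that walk on 32-bit x86 with PSE enabled; the real walk is of course done by the MMU in hardware, and this toy version assumes the tables live in the program's own address space:

Code: Select all

#include <stdint.h>

#define P_PRESENT 0x001u
#define P_PS      0x080u     /* page-size bit: set in a PDE that maps 4 MB */

/* Translate a 32-bit virtual address given a page directory of 1024
 * entries.  A PDE with the PS bit set maps a 4 MB page in one step;
 * otherwise it points to a page table of 1024 4 KB entries. */
uint32_t translate(const uint32_t *page_dir, uint32_t vaddr)
{
    uint32_t pde = page_dir[vaddr >> 22];             /* top 10 bits      */
    if (!(pde & P_PRESENT))
        return 0;                                     /* would page-fault */

    if (pde & P_PS)                                   /* 4 MB page: done  */
        return (pde & 0xFFC00000u) | (vaddr & 0x003FFFFFu);

    /* 4 KB page: one more level */
    const uint32_t *page_table =
        (const uint32_t *)(uintptr_t)(pde & 0xFFFFF000u);
    uint32_t pte = page_table[(vaddr >> 12) & 0x3FFu];
    if (!(pte & P_PRESENT))
        return 0;
    return (pte & 0xFFFFF000u) | (vaddr & 0x00000FFFu);
}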

32-bit addressing might have been standard for a long time, but physical memories were nowhere near 4GB. Furthermore, systems with big memories in those days were mainframes, with perhaps hundreds of users. The number of processes you expect to run is also an important factor in determining the optimum page size. Our present PCs, which are single-user machines, have to satisfy totally different demands. For almost anything you do on a 1GB PC nowadays, swapping is totally unnecessary, so it doesn't make sense to sacrifice anything to make it faster. In systems with 100 interactive users and 8MB of memory it was totally unavoidable; since it was _the_ performance-limiting factor, it made sense to design everything so that it could be done as fast and as efficiently as possible.

So times have changed, and I would strongly mistrust anything that was optimal then, knowing that it was optimized for demands no longer in force, with parameters wildly different from what is usual now. It is like trying to predict optimal rodent size from digging up T.Rex bones...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: re-inventing the SMP wheel

Post by bob »

hgm wrote:
bob wrote:You can't mix. How on earth is it going to traverse the page tables when each entry could point to a large page or a small page? You get either/or, but not both.
I am not talking about mixing sizes, just about setting different access rights on different 4K pages within a 256KB allocated block. The page table is what it is, defined by hardware, so if I want my OS to emulate 256KB pages, I will have to write 64 actual entries in the page table to map one. But that does not have to stop me from emulating them with a finer granularity for the access rights.
Then you are not using 256K pages; you are using 4K pages, and the O/S has to keep up with them as 4K pages, not as 256K. Because when we share things between different processes (two different users executing the same executable, such as the shell, or VI, or whatever), the code parts are shared but the data parts are not. If you mix those within one of your 256K blocks, then you are using 256K blocks in name only, with no benefit of any kind over what we currently do. And if you split the blocks up by access rights, I completely fail to see any advantage that would have over current operating systems, which can also give you 256K bytes if you want and mix the access rights any way necessary.

If you really allocate/free in 256K chunks, then it has to be done in 256K chunks, else it is just obfuscation by renaming, without changing a thing.


4KB and 4MB pages can be mixed because the 4KB page table is two-level and the 4MB pages simply skip one level. But from what I have heard, 4MB pages do not work properly at all on current CPUs performance-wise, and it is way faster to use 4KB pages and accept the TLB misses, even when scanning sequentially through a large memory range.
No idea where you have heard that. I ran an experimental version of a modified linux kernel a while back to see what effect it had on an opteron-based system and it worked well and memory was faster, since suddenly the TLB is big enough to cache all virtual-to-real memory address translations for 4 gigs of physical RAM...
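For scale (just arithmetic, not a claim about any particular CPU's TLB capacity): covering 4GB with 4KB pages takes 4GB / 4KB = 1,048,576 translations, while covering it with 2MB pages takes only 2,048 and with 4MB pages only 1,024, which is why large pages take so much pressure off the TLB.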


32-bit addressing might have been standard for a long time, but physical memories were nowhere near 4GB. Furthermore, systems with big memories in those days were mainframes, with perhaps hundreds of users. The number of processes you expect to run is also an important factor in determining the optimum page size. Our present PCs, which are single-user machines, have to satisfy totally different demands. For almost anything you do on a 1GB PC nowadays, swapping is totally unnecessary, so it doesn't make sense to sacrifice anything to make it faster. In systems with 100 interactive users and 8MB of memory it was totally unavoidable; since it was _the_ performance-limiting factor, it made sense to design everything so that it could be done as fast and as efficiently as possible.

So times have changed, and I would strongly mistrust anything that was optimal then, knowing that it was optimized for demands no longer in force, with parameters wildly different from what is usual now. It is like trying to predict optimal rodent size from digging up T.Rex bones...
On the other hand, since things have _not_ changed since back then, I might spend some time trying to figure out why that is. :) There are lots of reasons that small pages fit most applications best.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: re-inventing the SMP wheel

Post by Michael Sherwin »

Been giving SMP some thought.

It makes sense to use the extra processors for tree splitting when there is nothing else for them to do.

I would think that it would be more efficient and friendlier to the cache if, say, the evaluation routines and move generators were constructed to be done in chunks by multiple processors. Then if a processor becomes free and there is an evaluation or move generation to help with, it could be done very efficiently, as huge amounts of data need not be copied.

On quad or better machines a couple of processors can be reserved to act only as helpers in this respect as there will always be an eval or move generation to help with. $.02
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: re-inventing the SMP wheel

Post by bob »

Michael Sherwin wrote:Been giving SMP some thought.

It makes sense to use the extra processors for tree splitting when there is nothing else for them to do.

I would think that it would be more efficient and friendlier to the cache if, say, the evaluation routines and move generators were constructed to be done in chunks by multiple processors. Then if a processor becomes free and there is an evaluation or move generation to help with, it could be done very efficiently, as huge amounts of data need not be copied.

On quad or better machines a couple of processors can be reserved to act only as helpers in this respect as there will always be an eval or move generation to help with. $.02
the "holy grail" of parallel programming is load balancing. That is, you need to divide the task up into N equal chunks of work, assuming N processors. With the evaluation, that is a nearly impossible task. When different pieces are traded away, those parts of the eval become unused, and the processors doing those things will be idle. Ditto for move generation, and so forth. Even worse, that kind of parallelism is so fine-grained that the overhead of starting the parallel evaluation becomes far too high.

That's why we all split the tree instead: the work granularity is coarser, you can do better load balancing by letting A help B when A becomes idle, and you don't have the ultra-high overhead associated with fine-grained parallelism.
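A bare-bones sketch of that coarse-grained work sharing, with names and structure invented purely for illustration (not taken from Crafty or any other engine): a split point holds the moves still to be searched at one node, and any idle thread claims the next whole subtree. A real YBW-style implementation would add bound sharing, aborts on fail high, and per-thread board copies.

Code: Select all

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NMOVES   20

struct split_point {
    pthread_mutex_t lock;
    int moves[NMOVES];      /* moves left to search at this node */
    int next;               /* index of the next unsearched move */
    int best;               /* best score found so far           */
};

static int search_move(int move) { return move % 7; }   /* stand-in search */

static void *helper(void *arg)
{
    struct split_point *sp = arg;
    for (;;) {
        pthread_mutex_lock(&sp->lock);
        if (sp->next >= NMOVES) {            /* nothing left here: go idle */
            pthread_mutex_unlock(&sp->lock);
            return NULL;
        }
        int move = sp->moves[sp->next++];    /* claim one whole subtree */
        pthread_mutex_unlock(&sp->lock);

        int score = search_move(move);       /* coarse-grained work unit */

        pthread_mutex_lock(&sp->lock);
        if (score > sp->best)
            sp->best = score;
        pthread_mutex_unlock(&sp->lock);
    }
}

int main(void)
{
    struct split_point sp = { .lock = PTHREAD_MUTEX_INITIALIZER,
                              .next = 0, .best = -1000 };
    for (int i = 0; i < NMOVES; i++) sp.moves[i] = i;

    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) pthread_create(&tid[i], NULL, helper, &sp);
    for (int i = 0; i < NTHREADS; i++) pthread_join(tid[i], NULL);

    printf("best score %d\n", sp.best);
    return 0;
}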
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: re-inventing the SMP wheel

Post by hgm »

bob wrote:Then you are not using 256K pages. You are using 4K pages. The O/S has to keep up with them as 4k pages, not 256K as well. Because when we share stuff between different processes (two different users executing the same executable such as the shell, or VI or whatever) the code parts are shared.
This is mainframe thinking. There won't be multiple users of VI on a PC, at least not enough to bother about, and the benefit of sharing code on a PC is next to nothing. You could very well do without that feature. If you want it implemented, then logically the shared segments should be independent and reside in another 256KB block. But that doesn't have to stop you from loading non-sharable code in the data segment, and allowing partial write protection (or execute permission) of the data segment. Of course the latter would in most cases (small programs) be far more efficient than sharing, as the space would be allocated either way and would just remain unused in the 256K block, while sharing would force you to use an extra block that otherwise would not have been needed. But this is what you should expect when one insists on using obsolete features. (For huge codes it could of course still be useful, so you would only do it there.)
No idea where you have heard that. I ran an experimental version of a modified linux kernel a while back to see what effect it had on an opteron-based system and it worked well and memory was faster, since suddenly the TLB is big enough to cache all virtual-to-real memory address translations for 4 gigs of physical RAM...
I read it somewhere on the net in a discussion about memory bandwidth, but it might just be P-IV crappiness.

On the other hand, since things have _not_ changed since back then, I might spend some time trying to figure out why that is. :) There are lots of reasons that small pages fit most applications best.
Well, if you could show me a formula that calculates the optimum page size from other system parameters, and it evaluated to 4KB, I would be easily convinced. On the other hand, conservatism is a very bad guide in areas dominated by Moore's law, and I wouldn't believe anything that was based solely on conservatism.
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: re-inventing the SMP wheel

Post by hgm »

Michael Sherwin wrote:Been giving SMP some thought.

It makes sense to use the extra processors for tree splitting when there is nothing else for them to do.
What I had in mind was a scheme where all 'active' nodes are indicated in the hash table, together with the moves out of them currently being searched. For non-parallel nodes this can be only one move, which could be stored in the hash-move field, but for split points you would have to hide a list of moves somewhere in the table (e.g. by requisitioning the always-replace entry in the same cache line for this). It does not have to be the full move list; I think there is little harm in limiting the parallelism in split points to, say, 6 moves per node maximum. (Even if you have many more CPUs than 6, there will usually be many possible split points amongst the active nodes.)

All information needed to recursively walk this tree of active nodes is then present in the hash table (no move generation would be necessary). Idle processors would just repeatedly make such a tree walk (using the lowest level for which they have generated moves as a root, in order not to have to discard those when the walk returns to a lower level), until they stumble on a split point with some moves currently not being searched. (During most of the search the tree of active nodes will contain many such 'loose ends'.)

Upon finding a loose end, they would mark the move as being searched, make it, and do move generation for ('become owner of') the daughter node to start their own sub-tree. As their sub-tree is added to the tree of active nodes, other (idle) processors could enter it looking for nodes with loose ends, which will sooner or later appear there once the owner obtains a score for the first (or first N) move(s). The owner will then open the node to parallel search by posting some of the yet-unsearched moves from its move list in the hash, for idle processors to find.

Once searched, moves would be removed from the lists posted in the hash table. The owner of the node might then post new unsearched moves in their place to attract helpers. If there is no unsearched move in the same node, the processor reverts to the idle state (scanning the hash for work). The owner of a node would be confined in its scan to the sub-tree sprouting from its node, so that it is still around to integrate the scores that helpers return into the final result, and to return that result.

I still have to think how to efficiently handle fail highs.
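To make the bookkeeping concrete, here is one possible layout for such a hash entry; the field names and sizes are invented for illustration only:

Code: Select all

#include <stdint.h>

#define MAX_SPLIT_MOVES 6        /* cap on parallelism per split point */

/* A hash entry for a node in the tree of active nodes described above:
 * it records who owns the node and which moves out of it are currently
 * being searched, so an idle processor walking the tree can claim one. */
struct active_hash_entry {
    uint64_t key;                          /* position signature             */
    int16_t  score;
    uint8_t  depth;
    uint8_t  flags;                        /* bound type, ACTIVE bit, ...    */
    uint8_t  owner;                        /* processor that owns this node  */
    uint8_t  n_busy;                       /* moves currently being searched */
    uint16_t busy_moves[MAX_SPLIT_MOVES];  /* the exposed moves; a claimed
                                              move is marked so nobody else
                                              starts on the same subtree     */
};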

P.S. Parallelism of the kind you suggest is more suitable for hyper-threading. It seems the Nehalem processor will add hyper-threading to the Core 2 Duo. Since each CPU can support 2 hyper-threads, you could e.g. do black and white eval separately, and subtract afterwards.
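A toy sketch of that last idea, with stub evaluation functions just to show the thread structure (a real engine would of course evaluate a shared position):

Code: Select all

#include <pthread.h>
#include <stdio.h>

struct eval_job { int colour; int score; };

static int eval_side(int colour) { return colour == 0 ? 137 : 112; } /* stub */

static void *eval_thread(void *arg)
{
    struct eval_job *job = arg;
    job->score = eval_side(job->colour);   /* one hyper-thread per colour */
    return NULL;
}

int main(void)
{
    struct eval_job white = { 0, 0 }, black = { 1, 0 };
    pthread_t tw, tb;
    pthread_create(&tw, NULL, eval_thread, &white);
    pthread_create(&tb, NULL, eval_thread, &black);
    pthread_join(tw, NULL);
    pthread_join(tb, NULL);
    printf("eval = %d\n", white.score - black.score);  /* subtract afterwards */
    return 0;
}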
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: re-inventing the SMP wheel

Post by bob »

hgm wrote:
bob wrote:Then you are not using 256K pages. You are using 4K pages. The O/S has to keep up with them as 4k pages, not 256K as well. Because when we share stuff between different processes (two different users executing the same executable such as the shell, or VI or whatever) the code parts are shared.
This is mainframe thinking. There won't be multiple users of VI on a PC, at least not enough to bother about, and the benefit of sharing code on a PC is next to nothing. You could very well do without that feature.
Please... So you never run two C programs simultaneously? Even if you don't, the O/S certainly does. Ever heard of shared libraries? Any idea how many different shared libs there are? What about the application programs those system tasks create on demand? This is a non-trivial memory saving. You are falling into the "we have nearly infinite memory" trap. We really don't...

If you want it implemented, then logically the shared segments should be independent and reside in another 256KB block. But that doesn't have to stop you from loading non-sharable code in the data segment, and allowing partial write protection (or execute permission) of the data segment. Of course the latter would in most cases (small programs) be far more efficient than sharing, as the space would be allocated either way and would just remain unused in the 256K block, while sharing would force you to use an extra block that otherwise would not have been needed. But this is what you should expect when one insists on using obsolete features. (For huge codes it could of course still be useful, so you would only do it there.)
No idea where you have heard that. I ran an experimental version of a modified linux kernel a while back to see what effect it had on an opteron-based system and it worked well and memory was faster, since suddenly the TLB is big enough to cache all virtual-to-real memory address translations for 4 gigs of physical RAM...
I read it somewhere on the net in a discussion about memory bandwidth, but it might just be P-IV crappiness.

On the other hand, since things have _not_ changed since back then, I might spend some time trying to figure out why that is. :) There are lots of reasons that small pages fit most applications best.
Well, if you could show me a formula that calculates the optimum page size from other system parameters, and it evaluated to 4KB, I would be easily convinced. On the other hand, conservatism is a very bad guide in areas dominated by Moore's law, and I wouldn't believe anything that was based solely on conservatism.
Nobody does that. They run simulations with different page sizes to see what is optimal across a wide range of applications.

But again, your "big page" idea is basically flawed, because if you mix things within one of those "big pages", the O/S can no longer treat the page as a single entity; it has to treat it as a collection of different pages with different access permissions. And that means you are not using big pages at all, you are just allocating a large chunk of small pages at one time and then treating them separately. On a page fault it would be impossibly expensive to bring in one of those 256K chunks, since large pieces of them would never be used enough to justify bringing the entire thing in.

In essence, either we use 4K pages or 256K pages, but any combination is really just 4K pages.