Question to hardware/OS experts, on file/HD access


hgm
Posts: 28123
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Question to hardware/OS experts, on file/HD access

Post by hgm »

In designing a tablebase generator for 6- and 7-men end-games, there is the following I want the computer to do, but am not sure that any operating system would allow me to do it efficiently (or how):

During building, my tablebase is stored as contiguous 'chunks' of 15MB on the hard-disk (4K or 32K of them). Each chunk consists of a leading 'section' S0 of 2MB, followed by a number of variable-size sections S1, S2, ...Sn, ranging in size from almost nothing to ~1MB.

Each cycle, I want to read S0 and the two last sections Sn-1 and Sn. After some processing, I then want to overwrite S0 and Sn, where Sn will be replaced by a (usually larger) new Sn plus a new section Sn+1 that is appended to the list (slowly filling up the 15MB reserved area). I suppose I will need separate seeks for reading/writing the S0 section and the Sn, Sn+1 combination (perhaps unless n=1). I want to limit the number of seeks, which is the reason I want to store the Sk sections contiguously, so that Sn and Sn+1 can be written using only a single seek. A seek might take 7ms, and writing 1-2MB at 100MB/sec takes 10-20ms, so doing extra seeks would quickly make the seek time dominate the problem.

Is it possible to do this without resorting to using the bare machine? I suppose I could start by allocating one large 480GB file on an empty disk, to make it likely that it is allocated contiguously, and use Unix-type seek system calls to position the read/write pointer at the segments I want, and start reading/writing. But will the OS know where to find the corresponding disk sectors / clusters? With a FAT file system, I don't really see how it could do that without buffering the entire FAT in memory, and even then it would have to follow all the FAT pointers from the beginning of the file to the end.
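In outline, one cycle of the access pattern I have in mind would look something like this (a minimal POSIX sketch only; CHUNK_SIZE, S0_SIZE and the section-offset bookkeeping are illustrative assumptions, and the question is whether the OS can serve these calls without extra head movement):

/* Minimal sketch of one cycle on chunk c; sizes and offsets are illustrative. */
#include <unistd.h>

#define CHUNK_SIZE (15L*1024*1024)   /* 15MB reserved per chunk */
#define S0_SIZE    ( 2L*1024*1024)   /* leading section S0      */

void cycle(int fd, long c,
           off_t off_sn1, size_t len_read,   /* start of Sn-1, length of Sn-1 + Sn     */
           off_t off_sn,  size_t len_write,  /* start of Sn,   length of new Sn + Sn+1 */
           char *s0, char *tail)
{
    off_t base = (off_t)c * CHUNK_SIZE;

    lseek(fd, base, SEEK_SET);             /* seek 1: read S0               */
    read(fd, s0, S0_SIZE);

    lseek(fd, base + off_sn1, SEEK_SET);   /* seek 2: read Sn-1 and Sn      */
    read(fd, tail, len_read);

    /* ... processing: 'tail' now holds the new Sn followed by Sn+1 ... */

    lseek(fd, base, SEEK_SET);             /* seek 3: overwrite S0          */
    write(fd, s0, S0_SIZE);

    lseek(fd, base + off_sn, SEEK_SET);    /* seek 4: write new Sn and Sn+1 */
    write(fd, tail, len_write);
}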

I could of course make every 15MB chunk a separate file. Then at least the start sector of each chunk would be pointed to from the directory. But the later sections could still only be reached by following pointers in the FAT. With clusters of 4K (is this a usual size for FAT32 systems on modern disks?) the file spans ~4K clusters, needing 16K of pointer info. In itself this is not so much data. But would the OS be able to cache this info for all segments? (That is about 480MB of FAT data for the entire 480GB tablebase!) If not, it would take an extra seek to read the FAT info as soon as I open the file for the chunk, or seek to its beginning. (And even then I should hope the OS would be smart enough to prefetch the FAT data for the entire file, so that it would not require a new seek when I want to read the Sn section after reading the S0 section...)

I am not sure how an NTFS volume is organized, but I guess the problem is similar. Even if it stores the file as a tree, rather than a linked list, you would need a similar amount of info in 'indirection blocks' as in the FAT table. What I remember from Unix was that it used several levels of indirection for large files, starting at the i-nodes, and Linux probably still does this. Is Linux smart enough to keep all indirection blocks of an open file cached at all times? Can a single process keep 32K files open at once?

If no OS is smart enough, would it be feasible to by-pass the OS file-manager code, and directly allocate the disk at a lower level? E.g. use the 'raw' image in the Linux /dev directory of a disk (or partition) that contains no data other than the tablebase? And then keep track of which logical device blocks would contain the data I need, and read those? (This is not such a big job, as I would allocate the logical blocks in sequential order, so that I can trivially step through them without consulting any tables.) Would the OS implementing the raw disk I/O be smart enough to combine a number of read requests for contiguous sectors into a single read of the entire disk track, without issuing a head seek for every new block? Does perhaps the disk controller in the HD already do this?
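Roughly like this, I imagine (a sketch only; the device name, the chunk numbering and the use of O_DIRECT are assumptions, and O_DIRECT needs sector-aligned buffers and offsets):

/* Rough sketch of the raw-device idea: a dedicated partition, addressed by
 * byte offsets computed from my own sequential block layout. */
#define _GNU_SOURCE            /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>

#define CHUNK_SIZE (15L*1024*1024)

int main(void)
{
    int fd = open("/dev/sdb1", O_RDWR | O_DIRECT);   /* partition holding only the TB */
    void *buf;
    posix_memalign(&buf, 4096, 2*1024*1024);          /* sector-aligned 2MB buffer     */

    /* Chunks are laid out sequentially, so chunk c starts at byte c*CHUNK_SIZE;
     * no allocation table has to be consulted to find it. */
    long c = 123;
    pread(fd, buf, 2*1024*1024, (off_t)c * CHUNK_SIZE);

    free(buf);
    close(fd);
    return 0;
}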

Any ideas / views on this problem are welcome!
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Question to hardware/OS experts, on file/HD access

Post by bob »

hgm wrote:In designing a tablebase generator for 6- and 7-men end-games, there is the following I want the computer to do, but am not sure that any operating system would allow me to do it efficiently (or how):

During building, my tablebase is stored as contiguous 'chunks' of 15MB on the hard-disk (4K or 32K of them). Each chunk consists of a leading 'section' S0 of 2MB, followed by a number of variable-size sections S1, S2, ...Sn, ranging in size from almost nothing to ~1MB.

Each cycle, I want to read S0 and the two last sections Sn-1 and Sn. After some processing, I then want to overwrite S0 and Sn, where Sn will be replaced by a (usually larger) new Sn plus a new section Sn+1 that is appended to the list (slowly filling up the 15MB reserved area). I suppose I will need separate seeks for reading/writing the S0 section and the Sn, Sn+1 combination (perhaps unless n=1). I want to limit the number of seeks, which is the reason I want to store the Sk sections contiguously, so that Sn and Sn+1 can be written using only a single seek. A seek might take 7ms, and writing 1-2MB at 100MB/sec takes 10-20ms, so doing extra seeks would quickly make the seek time dominate the problem.

Is it possible to do this without resorting to using the bare machine? I suppose I could start by allocating one large 480GB file on an empty disk, to make it likely that it is allocated contiguously, and use Unix-type seek system calls to position the read/write pointer at the segments I want, and start reading/writing. But will the OS know where to find the corresponding disk sectors / clusters? With a FAT file system, I don't really see how it could do that without buffering the entire FAT in memory, and even then it would have to follow all the FAT pointers from the beginning of the file to the end.

I could of course make every 15MB chunk a separate file. Then at least the start sector of each chunk would be pointed to from the directory. But the later sections could still only be reached by following pointers in the FAT. With clusters of 4K (is this a usual size for FAT32 systems on modern disks?) the file spans ~4K clusters, needing 16K of pointer info. In itself this is not so much data. But would the OS be able to cache this info for all segments? (That is about 480MB of FAT data for the entire 480GB tablebase!) If not, it would take an extra seek to read the FAT info as soon as I open the file for the chunk, or seek to its beginning. (And even then I should hope the OS would be smart enough to prefetch the FAT data for the entire file, so that it would not require a new seek when I want to read the Sn section after reading the S0 section...)

I am not sure how an NTFS volume is organized, but I guess the problem is similar. Even if it stores the file as a tree, rather than a linked list, you would need a similar amount of info in 'indirection blocks' as in the FAT table. What I remember from Unix was that it used several levels of indirection for large files, starting at the i-nodes, and Linux probably still does this. Is Linux smart enough to keep all indirection blocks of an open file cached at all times? Can a single process keep 32K files open at once?

If no OS is smart enough, would it be feasible to by-pass the OS file-manager code, and directly allocate the disk at a lower level? E.g. use the 'raw' image in the Linux /dev directory of a disk (or partition) that contains no data other than the tablebase? And then keep track of which logical device blocks would contain the data I need, and read those? (This is not such a big job, as I would allocate the logical blocks in sequential order, so that I can trivially step through them without consulting any tables.) Would the OS implementing the raw disk I/O be smart enough to combine a number of read requests for contiguous sectors into a single read of the entire disk track, without issuing a head seek for every new block? Does perhaps the disk controller in the HD already do this?

Any ideas / views on this problem are welcome!
If you want to "roll your own", format a new filesystem, then create one big file and write each and every byte in order, so that every block is physically allocated, and with any luck the data will be organized pretty efficiently. Now, inside your program, use any chunks you want in any way you want to use them, and you should be able to do what you want, and efficiently, without going low-level and trying to bypass the O/S...

In windows, you might even run a "defrag" on that file system once you create the big file, before you start using it, in case things get scattered around somehow.
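Something as simple as the following would do that pre-allocation pass (just a sketch; the file name is made up and the 480GB figure is the one from your post):

/* Sketch: force physical allocation of the whole file by writing it once,
 * sequentially, in large blocks. File name and total size are illustrative. */
#include <stdio.h>

int main(void)
{
    static char zeros[1 << 20];                     /* 1MB of zero bytes         */
    long long total = 480LL * 1024 * 1024 * 1024;   /* the 480GB mentioned above */
    FILE *f = fopen("tablebase.bin", "wb");
    if (!f) return 1;
    for (long long done = 0; done < total; done += sizeof zeros)
        if (fwrite(zeros, 1, sizeof zeros, f) != sizeof zeros) break;
    fclose(f);
    return 0;
}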
hgm
Posts: 28123
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Question to hardware/OS experts, on file/HD access

Post by hgm »

bob wrote:If you want to "roll your own", format a new filesystem, then create one big file and write each and every byte in order, so that every block is physically allocated, and with any luck the data will be organized pretty efficiently. Now, inside your program, use any chunks you want in any way you want to use them, and you should be able to do what you want, and efficiently, without going low-level and trying to bypass the O/S...

In windows, you might even run a "defrag" on that file system once you create the big file, before you start using it, in case things get scattered around somehow.
I guess that when I would do this in a freshly formatted partition, I would probably automatically get a contiguous file.

Having the actual data of the file contiguous is only part of the problem, though. What I am afraid of is that the OS starts consulting the FAT (or indirect blocks) located elsewhere on the disk all the time. E.g. when I do a software seek to a certain offset in the file, it would locate the first cluster of the file in the FAT, position the head over the FAT, read some clusters there, plough through countless links until it arrives at the requested offset, and only then position the head over the data.
And then, when I start reading, it might discover that it does not have all the successor links for the clusters I am reading, and hence do a head seek back to the FAT, read another sector of links to know which block comes next, and then seek back to the data. Such extra seeks could easily multiply the total time by a significant factor, as even with only the unavoidable seeks to position the head over the data, 30% of the total time is in the seeks.

It would not be so bad if the FAT (or indirect blocks) were stored on the same track as the sectors they refer to, so that no head seeks are required to get the FAT data for the sectors we are reading. But from what I remember from using the Norton disk editor, the FAT is all at the beginning of the partition, on track 1.

Is there an OS that interleaves the lower-level access information with the data itself? If a file system were organized like a tree, each physical 512-byte disk sector could hold 128 4-byte pointers to sectors holding data. If, on allocating a big file, each new indirect block that was needed were allocated from the same contiguous stream of disk sectors grabbed from the free list, I guess it would allow for very efficient reading. But FAT16 or FAT32 volumes are not organized this way, AFAIK. Perhaps I should try NTFS volumes.

Does anyone know how this works on Linux? Are files organized there as a tree, and does the OS automatically slip in extra levels of indirection near the root when I exceed the maximum file size for the current number of levels? And how is the free list organized: just like a very big file, or as a linked list? (Random access in the free list is never needed, so I could imagine a simpler system is used there, such as a linked list containing lists of free blocks.)
Dann Corbit
Posts: 12662
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Question to hardware/OS experts, on file/HD access

Post by Dann Corbit »

Memory map the file.
To handle your chunks do e.g. "MapViewOfFile()" for Windows or "mmap()" for Unix.

The nice thing about memory mapping is that the memory in the hunk of file in RAM is just a big array. You manipulate it like any other array.

If you want to zero-init the file, it's just a flag.
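On the Unix side that could look roughly like this (a sketch only, mapping one 15MB chunk at a time with made-up names and sizes; on Windows, CreateFileMapping()/MapViewOfFile() play the same role):

/* Sketch: map one 15MB chunk of the big file and treat it as an array.
 * CHUNK_SIZE and the chunk numbering are illustrative assumptions. */
#include <sys/mman.h>
#include <unistd.h>

#define CHUNK_SIZE (15L * 1024 * 1024)   /* a multiple of the 4K page size */

char *map_chunk(int fd, long c)
{
    char *p = mmap(NULL, CHUNK_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, (off_t)c * CHUNK_SIZE);
    return p == MAP_FAILED ? NULL : p;
}

void unmap_chunk(char *p)
{
    munmap(p, CHUNK_SIZE);   /* dirty pages are written back by the OS */
}

After map_chunk() the sections S0...Sn are just sub-ranges of the returned array, and the page cache decides when the modified pages go back to disk.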
trojanfoe

Re: Question to hardware/OS experts, on file/HD access

Post by trojanfoe »

I don't understand why you are focusing on the storage. If you are working on 15MB chunks, then why not read the whole chunk into memory, perhaps keeping each section in its own contiguous area of memory linked to its siblings by pointers, perform your calculation, and then write the final chunk back to disk, overwriting the old chunk?

Are there a lot of inter-chunk references that would make this impractical? Even then, computer memory is generally large enough to accommodate many chunks at once. Perhaps consider a 'buffer cache' approach where chunks are swapped in and out of memory depending on when they were last referenced. This would abstract away the whole problem and you could then concentrate on the processing problem and ignore the storage problem.
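As a purely illustrative sketch of such a buffer cache (stubbed-out disk I/O, made-up sizes, least-recently-used replacement):

/* Fixed pool of in-memory chunk slots, replaced least-recently-used first.
 * The disk read/write helpers are stubs standing in for the real I/O. */
#include <stdlib.h>

#define CHUNK_SIZE (15L * 1024 * 1024)
#define NSLOTS 64

typedef struct { int chunk_id; long last_use; char *data; } Slot;
static Slot slots[NSLOTS];
static long tick;

static void read_chunk_from_disk(int id, char *buf)  { (void)id; (void)buf; }  /* stub */
static void write_chunk_to_disk(int id, char *buf)   { (void)id; (void)buf; }  /* stub */

char *get_chunk(int id)
{
    int victim = 0;
    for (int i = 0; i < NSLOTS; i++) {
        if (slots[i].data && slots[i].chunk_id == id) {   /* cache hit */
            slots[i].last_use = ++tick;
            return slots[i].data;
        }
        if (!slots[i].data || slots[i].last_use < slots[victim].last_use)
            victim = i;                                   /* empty or oldest slot */
    }
    if (!slots[victim].data)
        slots[victim].data = malloc(CHUNK_SIZE);
    else
        write_chunk_to_disk(slots[victim].chunk_id, slots[victim].data);
    read_chunk_from_disk(id, slots[victim].data);
    slots[victim].chunk_id = id;
    slots[victim].last_use = ++tick;
    return slots[victim].data;
}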

If the final data on disk needs to be re-organised to make it easy to reference by the target program, then you could do that as a final step once processing is complete; while processing, keep the data chunks as separate files to make them easy to process.
Harald Johnsen

Re: Question to hardware/OS experts, on file/HD access

Post by Harald Johnsen »

Dann Corbit wrote:Memory map the file.
To handle your chunks do e.g. "MapViewOfFile()" for Windows or "mmap()" for Unix.

The nice thing about memory mapping is that the memory in the hunk of file in RAM is just a big array. You manipulate it like any other array.

If you want to zero-init the file, it's just a flag.
Memory mapping is simple and efficient. Simple because you access your in-memory tables as usual (no file I/O API), efficient because it's the standard technique used for virtual memory + the swap file.

And there is no such thing as a 'FAT' with NTFS; file allocation is done as under extfs, i.e. if you allocate one chunk of 500MB and 500MB of contiguous space exists, you'll have only one entry in the allocation table (and thus no overhead to find out where on the disk a cluster is located).

IIRC the CreateFile API of NT has some parameter to use for the initial file size, and then it will try to find contiguous space on disk.
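In any case, the usual Win32 idiom for pre-extending a file to its full size is something like this (a sketch with an illustrative name and size; whether NTFS then picks one contiguous run is up to its allocator):

/* Sketch: create the file and extend it to its final size in one go, so
 * NTFS can try to allocate the space as a single extent. Error handling omitted. */
#include <windows.h>

int main(void)
{
    HANDLE h = CreateFileA("tablebase.bin", GENERIC_READ | GENERIC_WRITE,
                           0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    LARGE_INTEGER size;
    size.QuadPart = 480LL * 1024 * 1024 * 1024;   /* 480GB */

    SetFilePointerEx(h, size, NULL, FILE_BEGIN);  /* move the file pointer to the end */
    SetEndOfFile(h);                              /* ...and set EOF there             */

    CloseHandle(h);
    return 0;
}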

HJ.
hgm
Posts: 28123
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Question to hardware/OS experts, on file/HD access

Post by hgm »

trojanfoe wrote:I don't understand why you are focusing on the storage. If you are working on 15MB chunks, then why not read the whole chunk into memory, perhaps keeping each section in its own contiguous area of memory linked to its siblings by pointers, perform your calculation, and then write the final chunk back to disk, overwriting the old chunk?

Are there a lot of inter-chunk references that would make this impractical? Even then, computer memory is generally large enough to accommodate many chunks at once. Perhaps consider a 'buffer cache' approach where chunks are swapped in and out of memory depending on when they were last referenced. This would abstract away the whole problem and you could then concentrate on the processing problem and ignore the storage problem.

If the final data on disk needs to be re-organised to make it easy to reference by the target program, then you could do that as a final step once processing is complete; while processing, keep the data chunks as separate files to make them easy to process.
Perhaps I should give some more background info to answer this.

There are indeed many inter-chunk references, and I am already doing what you suggest for part of them. The entire data set is a three-dimensional array of chunks, TB[piece1][piece2][piece3], 64x64x64, indexed by the locations of the white pieces. The inter-chunk references are the moves of the pieces, but only one piece moves at a time, so they stay in the same 'column' of chunks.

So what I do is bring the columns into memory one by one, i.e. read 64 chunks, and then process all the references between those chunks. After that the 64 chunks are written back to disk, and the next column in the same direction is loaded and processed. This is repeated 4096 times. Then the whole thing is repeated for all columns in one of the other directions (moving another piece), etc. After three pieces have been treated this way, we have advanced the tablebase by one (white) move.
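In pseudo-code the pass structure is something like this (just a sketch; the helper routines are empty stubs standing in for the real generator code):

/* Illustrative sketch of the pass structure described above. */
static void load_column(int dir, int col)      { (void)dir; (void)col; }  /* read 64 chunks    */
static void propagate_moves(int dir, int col)  { (void)dir; (void)col; }  /* in-column refs    */
static void store_column(int dir, int col)     { (void)dir; (void)col; }  /* write chunks back */

void advance_one_white_move(void)
{
    for (int dir = 0; dir < 3; dir++)            /* one pass per moving piece      */
        for (int col = 0; col < 64 * 64; col++)  /* 4096 columns of 64 chunks each */
        {
            load_column(dir, col);
            propagate_moves(dir, col);
            store_column(dir, col);
        }
}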

I don't want to read the entire 15MB all the time when I need only ~3MB of it, as it would increase the transfer time from 30ms to 150ms (at 100MB/sec). That is 120ms extra. If I can save that time at the cost of just a few extra seek operations of 7ms each (say 2 of them), I speed up the process by more than a factor of 3. Plus I require a lot less DRAM storage. By only having up to 4MB per chunk in DRAM, I can afford to do 1.5 pieces at once, so that I can do a complete move in two passes. (The half piece is a King, for which the inter-chunk references only go to nearby chunks, because a King does not step far. I can split up and order the King moves such that I only need to have 5 chunks present at once in the King dimension to catch all the references.) But doing 1.5 pieces at a time means I need to have 64x5 = 320 chunks in memory simultaneously. At up to 4MB per chunk that makes 1.25GB. With 15MB per chunk it would be 4.8GB. And I would like to be able to do this on a 32-bit PC with 2GB of DRAM.
hgm
Posts: 28123
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Question to hardware/OS experts, on file/HD access

Post by hgm »

Harald Johnsen wrote:And there is no such thing as a 'FAT' with NTFS; file allocation is done as under extfs, i.e. if you allocate one chunk of 500MB and 500MB of contiguous space exists, you'll have only one entry in the allocation table (and thus no overhead to find out where on the disk a cluster is located).
This sounds very good. So each file carries an allocation table that contains the size and first cluster of every contiguous fragment? Then if I allocate the file on a virgin disk, the allocation table will be next to nothing, and will certainly be cached by the OS all the time.

This is exactly what I need. I will have to think about the memory mapping; it sounds convenient.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Question to hardware/OS experts, on file/HD access

Post by bob »

hgm wrote:
bob wrote:If you want to "roll your own", format a new filesystem, then create one big file and write each and every byte in order, so that every block is physically allocated, and with any luck the data will be organized pretty efficiently. Now, inside your program, use any chunks you want in any way you want to use them, and you should be able to do what you want, and efficiently, without going low-level and trying to bypass the O/S...

In windows, you might even run a "defrag" on that file system once you create the big file, before you start using it, in case things get scattered around somehow.
I guess that when I would do this in a freshly formatted partition, I would probably automatically get a contiguous file.
I'm not sure. Which is why I suggested the defrag afterwards, which would guarantee it. I'm not sure what NTFS does as far as where each new cluster goes.


Having the actual data of the file contiguous is only part of the problem, though. What I am afraid of is that the OS starts consulting the FAT (or indirect blocks) located elsewhere on the disk all the time. E.g. when I do a software seek to a certain offset in the file, it would locate the first cluster of the file in the FAT, position the head over the FAT, read some clusters there, plough through countless links until it arrives at the requested offset, and only then position the head over the data.
And then, when I start reading, it might discover that it does not have all the successor links for the clusters I am reading, and hence do a head seek back to the FAT, read another sector of links to know which block comes next, and then seek back to the data. Such extra seeks could easily multiply the total time by a significant factor, as even with only the unavoidable seeks to position the head over the data, 30% of the total time is in the seeks.

It would not be so bad if the FAT (or indirect blocks) were stored on the same track as the sectors they refer to, so that no head seeks are required to get the FAT data for the sectors we are reading. But from what I remember from using the Norton disk editor, the FAT is all at the beginning of the partition, on track 1.

Is there an OS that interleaves the lower-level access information with the data itself? If a file system were organized like a tree, each physical 512-byte disk sector could hold 128 4-byte pointers to sectors holding data. If, on allocating a big file, each new indirect block that was needed were allocated from the same contiguous stream of disk sectors grabbed from the free list, I guess it would allow for very efficient reading. But FAT16 or FAT32 volumes are not organized this way, AFAIK. Perhaps I should try NTFS volumes.

Does anyone know how this works on Linux? Are files organized there as a tree, and does the OS automatically slip in extra levels of indirection near the root when I exceed the maximum file size for the current number of levels? And how is the free list organized: just like a very big file, or as a linked list? (Random access in the free list is never needed, so I could imagine a simpler system is used there, such as a linked list containing lists of free blocks.)
Linux uses the classic Unix approach. A file has an i-node associated with it. In that i-node there are N pointers to the first N blocks of data for the file (I believe N is 12, but don't hold me to that as it has changed several times). Once you go through those 12 blocks, you have a single-indirect pointer that points to one block (usually 4K bytes unless you adjust the block size when you create the filesystem) containing 1K of pointers to 1K further data blocks. This is memory resident, so there is no overhead once it has been read in. Once you are past 12 + 1024 blocks, you have a double-indirect pointer that points to a 4K block containing 1024 pointers to 1024 additional index blocks, each of which contains 1024 pointers to blocks of real data. This gives you a total of (12 + 1024 + 1024*1024) * 4K of data. If needed, there is a triple-indirect pointer that adds one more level to the tree.

There are no links among blocks, just the index blocks that point to them.
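As a quick sanity check of those numbers (assuming 4K blocks and 4-byte block pointers, i.e. 1024 pointers per index block):

/* Capacity reachable through the direct, single- and double-indirect
 * pointers of a classic Unix i-node, with assumed 4K blocks. */
#include <stdio.h>

int main(void)
{
    long long blk = 4096, ptrs = 1024;
    long long direct = 12;                 /* direct block pointers in the i-node */
    long long single = ptrs;               /* one indirect block                  */
    long long dbl    = ptrs * ptrs;        /* double indirect                     */
    long long triple = ptrs * ptrs * ptrs; /* triple indirect                     */

    printf("direct+single+double: %lld bytes (~%.1f GB)\n",
           (direct + single + dbl) * blk, (direct + single + dbl) * blk / 1e9);
    printf("with triple indirect: %lld bytes (~%.1f TB)\n",
           (direct + single + dbl + triple) * blk,
           (direct + single + dbl + triple) * blk / 1e12);
    return 0;
}

So a 480GB file on such a filesystem already needs the triple-indirect level.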