Question to hardware/OS experts, on file/HD access

Discussion of chess software programming and technical issues.


bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Question to hardware/OS experts, on file/HD access

Post by bob »

Dann Corbit wrote:Memory map the file.
To handle your chunks do e.g. "MapViewOfFile()" for Windows or "mmap()" for Unix.

The nice thing about memory mapping is that the mapped chunk of the file appears in RAM as just a big array. You manipulate it like any other array.

If you want to zero-init the file, it's just a flag.
However, you _still_ have the filesystem overhead, which is what I believe he is trying to get around. Your "interface" is simplified by using mmap() or whatever, but the file overhead is still present and accounted for.
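
For reference, a minimal sketch of the memory-mapping approach on the POSIX side (mmap(); on Windows the equivalent calls are CreateFileMapping() plus MapViewOfFile()). The file name here is a placeholder and error handling is kept to the bare minimum:

Code: Select all

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "EGT.tst";            /* placeholder file name */
    int fd = open(path, O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Map the whole file; the kernel pages data in and out on demand. */
    unsigned char *map = mmap(NULL, (size_t)st.st_size,
                              PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    /* The mapping behaves like an ordinary array. */
    if (st.st_size > 0)
        map[0] ^= 1;                         /* touch the first byte */

    munmap(map, (size_t)st.st_size);
    close(fd);
    return 0;
}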
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Question to hardware/OS experts, on file/HD access

Post by sje »

bob wrote:Linux uses the classic Unix approach. A file has an i-node associated with it. In that i-node there are N pointers to the first N blocks of data for the file (I believe N is 12, but don't hold me to that, as it has changed several times). Once you go through those 12 blocks, you have an indirect-1 pointer that points to one data block (usually 4K bytes, unless you adjust the block size when you create the filesystem). That block holds 1K of pointers to 1K of disk blocks. It is memory resident, so there is no overhead once it has been read in. Once you are past 1024 + 12 blocks, you have an indirect-2 pointer that points to a 4K-byte block containing 1024 pointers to 1024 additional indirect blocks, each of which contains 1024 pointers to blocks of real data. This gives you a total of 12*4K + 1024*4K + 1024*1024*4K of data. If needed, there is an indirect-3 pointer that forms a 3-level tree.
The above, or something very close to the above, is a good description of the original Unix filesystem i-node/block allocation.

But Linux uses various kinds of filesystems. I'm not sure that the current most popular flavor (ext3), or the upcoming flavor (ext4), handles i-node/blocks in the same way as did Bell Labs Unix of old.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Question to hardware/OS experts, on file/HD access

Post by bob »

sje wrote:
bob wrote:Linux uses the classic Unix approach. A file has an i-node associated with it. In that i-node there are N pointers to the first N blocks of data for the file (I believe N is 12, but don't hold me to that, as it has changed several times). Once you go through those 12 blocks, you have an indirect-1 pointer that points to one data block (usually 4K bytes, unless you adjust the block size when you create the filesystem). That block holds 1K of pointers to 1K of disk blocks. It is memory resident, so there is no overhead once it has been read in. Once you are past 1024 + 12 blocks, you have an indirect-2 pointer that points to a 4K-byte block containing 1024 pointers to 1024 additional indirect blocks, each of which contains 1024 pointers to blocks of real data. This gives you a total of 12*4K + 1024*4K + 1024*1024*4K of data. If needed, there is an indirect-3 pointer that forms a 3-level tree.
The above, or something very close to the above, is a good description of the original Unix filesystem i-node/block allocation.

But Linux uses various kinds of filesystems. I'm not sure that the current most popular flavor (ext3), or the upcoming flavor (ext4), handles i-node/blocks in the same way as did Bell Labs Unix of old.
That hasn't changed. All that changed in the ext* filesystems was how files are physically located on disk (ext2 was very good here), so that they can grow and still stay nearly contiguous. ext3 added the "journal", an attempt to eliminate fsck after a crash, because the journal can be used to complete transactions that were interrupted by some kind of failure.

But the base file structure has not changed at all, and really shouldn't, as it works extremely well.
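
To make the arithmetic in that scheme concrete, here is a small sketch assuming 4 KB blocks, 4-byte block pointers (so 1024 pointers per indirect block) and 12 direct pointers; it only computes which indirection level a given file block needs and what the direct plus single plus double indirect levels can address, with no actual disk I/O:

Code: Select all

#include <stdio.h>

#define DIRECT  12       /* direct block pointers in the i-node        */
#define PER_BLK 1024     /* 4-byte pointers in one 4 KB indirect block */

/* Indirection level needed to reach logical block 'blk' of a file:
   0 = direct pointer, 1 = single, 2 = double, 3 = triple indirect. */
static int indirection_level(unsigned long long blk)
{
    if (blk < DIRECT) return 0;
    blk -= DIRECT;
    if (blk < PER_BLK) return 1;
    blk -= PER_BLK;
    if (blk < (unsigned long long)PER_BLK * PER_BLK) return 2;
    return 3;
}

int main(void)
{
    /* Data reachable through the direct, single and double indirect
       pointers: (12 + 1024 + 1024*1024) blocks of 4 KB each,
       i.e. roughly 4 GB + 4 MB + 48 KB.                              */
    unsigned long long blocks = 12ULL + 1024ULL + 1024ULL * 1024ULL;
    printf("direct + indirect-1 + indirect-2: %llu KB\n", blocks * 4);
    printf("logical block 100000 needs indirection level %d\n",
           indirection_level(100000));
    return 0;
}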
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Question to hardware/OS experts, on file/HD access

Post by sje »

bob wrote:All that changed in the ext* filesystems was how files are physically located on disk (ext2 was very good here), so that they can grow and still stay nearly contiguous. ext3 added the "journal", an attempt to eliminate fsck after a crash, because the journal can be used to complete transactions that were interrupted by some kind of failure.

But the base file structure has not changed at all, and really shouldn't, as it works extremely well.
Agreed that it works quite well and that the design involved good foresight.

However, there are capacity differences in ext3/ext4:

See: http://en.wikipedia.org/wiki/Comparison ... ems#Limits
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Question to hardware/OS experts, on file/HD access

Post by bob »

sje wrote:
bob wrote:All that changed in the ext* filesystems was how files are physically located on disk (ext2 was very good here), so that they can grow and still stay nearly contiguous. ext3 added the "journal", an attempt to eliminate fsck after a crash, because the journal can be used to complete transactions that were interrupted by some kind of failure.

But the base file structure has not changed at all, and really shouldn't, as it works extremely well.
Agreed that it works quite well and that the design involved good foresight.

However, there are capacity differences in ext3/ext4:

See: http://en.wikipedia.org/wiki/Comparison ... ems#Limits
That is only an issue of "how big are the block pointers and block sizes?" I use ext3 on my Linux boxes, and on occasion I would like a block size larger than 4096. Huge, terabyte-scale files would be far more efficient if I could use 64K or bigger block sizes, because with 4K blocks there are so many blocks to keep track of.
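
As a rough illustration of how those limits scale with block size under the classic pointer scheme (assuming 4-byte block pointers and ignoring per-filesystem limits such as 32-bit block numbers, so the numbers are purely theoretical):

Code: Select all

#include <stdio.h>

/* Theoretical maximum file size for the classic direct/indirect scheme:
   12 direct blocks plus single, double and triple indirect trees,
   with block_size/4 pointers per indirect block.                      */
static double max_file_bytes(double block_size)
{
    double ptrs = block_size / 4.0;
    return (12.0 + ptrs + ptrs * ptrs + ptrs * ptrs * ptrs) * block_size;
}

int main(void)
{
    double sizes[2] = { 4096.0, 65536.0 };
    int i;
    for (i = 0; i < 2; i++)
        printf("block size %6.0f B -> max file ~ %.1f TB\n",
               sizes[i], max_file_bytes(sizes[i]) / 1e12);
    return 0;
}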
hgm
Posts: 28395
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Question to hardware/OS experts, on file/HD access

Post by hgm »

OK, thanks everyone for the advice. It gave me some pointers for looking up information, and I really learned a lot about filesystems. The most important thing I learned was that NTFS and Linux are quite different, and both are very different from (and superior to) FAT.

It seems that for what I want, NTFS is by far the best. The Unix / Linux scheme is good, but it still needs one pointer per allocated cluster. For a 480 GB file, that is a lot of pointers, and even a lot of 'indirection blocks' to hold those pointers. They might or might not be cleverly interleaved with the contiguous stretches of data blocks of the file, and they might be cached. But why take the risk?

In NTFS the allocation table does not need an entry for every data block, but just one for every contiguous stretch of data blocks. For a mostly contiguous file, this gives an enormous reduction in the amount of bookkeeping information. It might not be possible to describe a stretch of 480 GB in a single entry, due to limitations of the length field. But with 4KB clusters there would be 120M clusters in the file, so if 32-bit ints are used, it would be no problem. So far I could not find out exactly what the format of the MFT is, but even if it used a 16-bit length field, a single entry in the allocation table would still be able to describe data runs of up to 64K clusters, far more than an entire indirection block can cover in the Unix system.

NTFS should have no problem at all doing exactly what I want. That is good news, as it would allow 6-men tablebase building using only one minute of disk I/O time per cycle (i.e. to calculate DTC=N+1 from DTC=N). 8-)
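
As a toy sketch of the 'data run' idea, just to show why a mostly contiguous file needs so few entries (the struct is a simplification for illustration, not the actual on-disk MFT format):

Code: Select all

#include <stdio.h>

/* Simplified extent entry: one contiguous run of clusters.
   (Real NTFS data runs use a compact variable-length encoding.) */
struct run {
    unsigned long long start_cluster;  /* first cluster of the run */
    unsigned long long length;         /* number of clusters       */
};

int main(void)
{
    unsigned long long file_bytes   = 480ULL * 1024 * 1024 * 1024;  /* 480 GB */
    unsigned long long cluster_size = 4096;                         /* 4 KB   */
    unsigned long long clusters     = file_bytes / cluster_size;    /* ~120M  */

    /* A perfectly contiguous 480 GB file is described by a single run,
       whereas a per-cluster pointer scheme needs one entry per cluster. */
    struct run whole_file;
    whole_file.start_cluster = 1000;   /* arbitrary example start cluster */
    whole_file.length        = clusters;

    printf("clusters: %llu, runs needed: 1 (start %llu, length %llu)\n",
           clusters, whole_file.start_cluster, whole_file.length);
    return 0;
}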
hgm
Posts: 28395
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Question to hardware/OS experts, on file/HD access

Post by hgm »

OK, second round!

I made a small test routine to simulate the basic action my tablebase generator will be involved in: gathering a 5-men slice of the TB by reading 64 selected 4-men chunks, processing the moves that fall within this slice, and writing the updated chunks back to the 64 places in the TB reserved for them.

The code below leaves out the processing, and only does the reading/writing for the 2MB bitmaps that indicate won (wtm) positions. (In real life a similar amount of info would have to be read in for the most recent generation of lost (btm) positions, which makes up another part of the 15MB of disk space reserved for each chunk.)

When I run the code on a 960MB file I created before (copying some other huge file first, to flush possibly cached parts of this file), the reading part takes ~3 sec. But the writing takes 23 sec!

Now 3 sec is already not very fast; it amounts to 50 ms per 2MB chunk for the seek + transfer (e.g. 10 ms seek time and 50 MB/sec transfer rate). But this is on a laptop, so I guess on a big machine it can be a lot faster. So I can live with that.

But the write is disastrously slow! Does anyone have an idea why? Am I doing something sub-optimal? In itself, shouldn't reading and writing be equally fast? I can imagine that the OS wants to verify what it has written, but that would merely double the time. Where is the remaining factor of 4 coming from? Is there something eating up most of the time that I could switch off?

Code: Select all

#include <stdio.h>
#include <time.h>

#define MB (1024*1024)

/* 128 MB staging area for the 64 chunks of 2 MB each
   (an assumption; presumably declared elsewhere in the full program). */
char buffer[1][64*2*MB];

void ReadBitmaps()
{
        int i; clock_t t;
        FILE *f;

        f = fopen("D:\\EGT.tst", "rb");
        t = clock();
        for(i=0; i<64; i++)
        {   fseek(f, i*15*MB, SEEK_SET);
            fread(&buffer[0][i*2*MB], 1, 2*MB, f);
            printf("chunk %2d done\n", i);
        }
        fclose(f);

        t = clock() - t;
        printf("reading slice: %5.3f sec\n", (1./CLOCKS_PER_SEC)*t);

        f = fopen("D:\\EGT.tst", "wb");   /* note: "wb" truncates the file before writing */
        t = clock();
        for(i=0; i<64; i++)
        {   fseek(f, i*15*MB, SEEK_SET);
            fwrite(&buffer[0][i*2*MB], 1, 2*MB, f);
            printf("chunk %2d done\n", i);
        }
        fclose(f);

        t = clock() - t;
        printf("writing slice: %5.3f sec\n", (1./CLOCKS_PER_SEC)*t);
}
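
One possible factor worth ruling out (just a guess): the second fopen() uses "wb", which truncates the file to zero length, so each fseek()+fwrite() then has to re-extend the file rather than overwrite blocks that already exist. A variant that updates the file in place, for comparison, could look like this:

Code: Select all

/* Variant of the write loop that updates the existing file in place;
   "r+b" opens the file for reading and writing without truncating it.
   Whether this changes the timing is left as an experiment.          */
void WriteBitmapsInPlace()
{
        int i; clock_t t;
        FILE *f;

        f = fopen("D:\\EGT.tst", "r+b");
        t = clock();
        for(i=0; i<64; i++)
        {   fseek(f, i*15*MB, SEEK_SET);
            fwrite(&buffer[0][i*2*MB], 1, 2*MB, f);
        }
        fclose(f);

        t = clock() - t;
        printf("writing slice (in place): %5.3f sec\n", (1./CLOCKS_PER_SEC)*t);
}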
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Question to hardware/OS experts, on file/HD access

Post by sje »

A slow write speed can be due to journaling being active on the filesystem. However, a factor of eight slowdown is hard to explain.
hgm
Posts: 28395
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Question to hardware/OS experts, on file/HD access

Post by hgm »

I know very little about HD technology (I am currently trying to read up a little on it), and I don't really know which tasks are typically done by the OS, which by the controller chips on the motherboard, and which by the drive electronics. If drives are advertised as having 32MB of cache, I would be inclined to think that low-level tasks, like verifying the written sectors, are done by the drive electronics. 32MB seems more than enough to hold an entire track that is being written, so that every sector of it can be verified on the next rotation.