Question about files


hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Question about files

Post by hgm »

If two independent processes, which opened the same file for writing independently, both write...

will data be lost, or does the OS guarantee that data will always be appended at the end?
stevenaaus
Posts: 608
Joined: Wed Oct 13, 2010 9:44 am
Location: Australia

Re: Question about files

Post by stevenaaus »

hgm wrote:If two independent processes, which opened the same file for writing independently, both write...

will data be lost, or does the OS guarantee that data will always be appended at the end?
In my experience, all OSes attempting this will lose data. OTTOMH, on unix systems, the last process to perform a close() will overwrite the other's file. There's no substitute for testing though...
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Question about files

Post by Daniel Shawul »

I think it will be lost. I had this problem when running multiple games on a cluster where all games write to the same PGN.
Sometimes I get a couple of lost games out of 10000 games. Then I wrote a script so that each process writes to a different PGN file, and all files are merged later for bayeselo processing. Maybe some OS can do automatic file locking to avoid this inconvenience.
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: Question about files

Post by Evert »

hgm wrote:If two independent processes, which opened the same file for writing independently, both write...

will data be lost, or does the OS guarantee that data will always be appended at the end?
I suspect it depends on the OS.
I've done it by accident and neatly got the output from both instances interleaved in the output file, but I didn't bother to check whether any data was lost in the process since the result was useless anyway. It possibly makes a difference how and when data is flushed too.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Question about files

Post by Sven »

hgm wrote:If two independent processes, which opened the same file for writing independently, both write...

will data be lost, or does the OS guarantee that data will always be appended at the end?
The OS will usually NOT guarantee to always append the data at the end. You are free to TRY TO achieve that on your own, though, by always opening the file for appending, e.g. in C:

FILE * fp = fopen(path, "a+");
If you don't, and instead open the file in normal writing mode ("w") then usually the last writer wins by overwriting the file with his view of the contents to be written.

The catch with the "a+" (append) approach, however, is that you can still get an unexpected mixture of both outputs. Whether some part of the data can be lost is, to my knowledge, undefined as well. Therefore it is usually advisable to use file locking in important cases, unless performance is very critical and you have frequent write access to such files.
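
As a rough sketch of the append idea at the system-call level, assuming a POSIX system and a local file system: with O_APPEND the kernel moves the offset to the end of the file as part of each write(), so handing it one complete record per write() call avoids the stdio buffer splitting a record over several writes. The function name append_record and the record format are made up for illustration, and network file systems such as NFS may not honour this.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Append one complete record with a single write() on an O_APPEND descriptor.
 * Returns 0 on success, -1 on error or short write.
 * append_record is a hypothetical helper, not part of any library. */
int append_record(const char *path, const char *record)
{
    assert(path != NULL && record != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    size_t len = strlen(record);
    /* With O_APPEND the offset is set to end-of-file as part of this call. */
    ssize_t n = write(fd, record, len);
    int rc = (n == (ssize_t)len) ? 0 : -1;

    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

This only addresses where each record lands; it does not help when a process must read the file, decide something, and then write, which still calls for a lock.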

There are also OS dependencies. For instance I think that Windows is more restrictive in even allowing two processes to open the same file for writing.

Sven
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Question about files

Post by hgm »

OK, this is what I was afraid of. I guess I could do a seek to the end of the file just before writing, but there still would be no guarantee that two processes would not do the seek at the same time, and then still write in the same place.

What is the standard way to make accessing a file an atomic operation?

The reason I want this is for a tournament manager: I would like there to be a tournament file that contains the results of all games that have been played so far, as a +, - or = character. A process A playing games would then 'grab' the next game that has to be played, by appending its own ID character (say an A) to the file, so that other game processes B, C, ... can see that the game is being played, and they should skip it. In due time A will finish the game, overwrite the A by the game result, and seek to the end of the file to grab a new game.

In a preceding section of that file there should be all info needed to figure out which games have to be played in which order. (I.e. a list of participants, tournament type, nr of games per pairing, number of cycles.) So you could start new game processes at any time.
ldesnogu

Re: Question about files

Post by ldesnogu »

Maybe file locking can help.
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Question about files

Post by hgm »

Indeed, flock(2) seems to be what I need on Linux. Thanks!

I guess I should start implementing such a lock on the -saveGameFile of XBoard anyway, so that it will be sure that PGN games of one writer will not be interleaved with those of another, if they use the same file.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Question about files

Post by bob »

hgm wrote:If two independent processes, which opened the same file for writing independently, both write...

will data be lost, or does the OS guarantee that data will always be appended at the end?
It will be lost. The processes each have a file descriptor for that file, with their own private file pointer that points to the next byte to write. The output will be badly scrambled. I had to solve this on my cluster with all of my referees wanting to write to some shared files. I went to a different game result output file for each referee and combine them when all games are played. You can fix this, but you have to use a locking scheme and random I/O...

You can try flock() but that is probably not exactly what you want, either. Best bet is to write to different files and combine the results when everything is done. There is still the issue of two separate file pointers that will tell a process where the next write goes. If you use fseek() after a flock() you might get something workable, as you first lock, then position to the end of the file, which can't change while you have it locked. Then you write, fflush() and then release the lock...
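
The lock, seek, write, flush, unlock sequence described above could look roughly like this in C. It is a minimal sketch, assuming a system that provides flock() (Linux, the BSDs) and that every writer goes through the same routine, since flock() locks are advisory. The name locked_append is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/file.h>

/* Append text to a shared file while holding an exclusive advisory lock.
 * Returns 0 on success, -1 on failure.
 * locked_append is a hypothetical helper for illustration. */
int locked_append(const char *path, const char *text)
{
    assert(path != NULL && text != NULL);

    FILE *fp = fopen(path, "a");
    if (fp == NULL)
        return -1;

    int fd = fileno(fp);
    if (flock(fd, LOCK_EX) != 0) {      /* blocks while another process holds the lock */
        fclose(fp);
        return -1;
    }

    int rc = 0;
    /* Cooperating writers cannot move the end of file while we hold the lock. */
    if (fseek(fp, 0L, SEEK_END) != 0)
        rc = -1;
    if (rc == 0 && fputs(text, fp) == EOF)
        rc = -1;
    if (fflush(fp) != 0)                /* hand buffered data to the kernel before unlocking */
        rc = -1;

    flock(fd, LOCK_UN);
    if (fclose(fp) != 0)
        rc = -1;
    return rc;
}
```

The fflush() before the unlock matters: without it the data may still sit in the stdio buffer and only reach the file after another process has taken the lock.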

In my cluster testing, I pass the referee a "save file name". If I run 256 games at a time, I will have 256 different files open at the same time, one in each instance of the referee. After all of the games are done (30,000 games total) I combine the individual files (I generally have one instance of a referee play 16 games, 8 with black, 8 with white, 8 different positions). This avoids the multiple-writer data corruption completely and simply.
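
The combining step needs nothing more than concatenation once all referees have finished. A minimal sketch, with a hypothetical helper name merge_files and no claim that this is how the cluster scripts actually do it:

```c
#include <assert.h>
#include <stdio.h>

/* Concatenate the input files into out_path, in the order given.
 * Missing inputs are skipped. Returns the number of files merged,
 * or -1 on an I/O error. merge_files is a hypothetical helper. */
int merge_files(const char *out_path, const char *const in_paths[], int count)
{
    assert(out_path != NULL && in_paths != NULL);

    FILE *out = fopen(out_path, "w");
    if (out == NULL)
        return -1;

    char buf[4096];
    int merged = 0;
    for (int i = 0; i < count; i++) {
        FILE *in = fopen(in_paths[i], "r");
        if (in == NULL)
            continue;                   /* a referee that produced no file is skipped */
        size_t n;
        while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
            if (fwrite(buf, 1, n, out) != n) {
                fclose(in);
                fclose(out);
                return -1;
            }
        }
        fclose(in);
        merged++;
    }
    return (fclose(out) == 0) ? merged : -1;
}
```

Because only one process writes the merged file, and only after the games are over, no locking is involved at all.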

You could do this on a non-cluster by writing a master program that creates a different thread for each game, and have the "manager" for that game write to just one file that is unique for that game.
Last edited by bob on Thu Apr 21, 2011 6:54 pm, edited 1 time in total.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Question about files

Post by Sven »

hgm wrote:OK, this is what I was afraid of. I guess I could do a seek to the end of the file just before writing, but there still would be no guarantee that two processes would not do the seek at the same time, and then still write in the same place.

What is the standard way to make accessing a file an atomic operation?

The reason I want this is for a tournament manager: I would like there to be a tournament file that contains the results of all games that have been played so far, as a +, - or = character. A process A playing games would then 'grab' the next game that has to be played, by appending its own ID character (say an A) to the file, so that other game processes B, C, ... can see that the game is being played, and they should skip it. In due time A will finish the game, overwrite the A by the game result, and seek to the end of the file to grab a new game.

In a preceding section of that file there should be all info needed to figure out which games have to be played in which order. (I.e. a list of participants, tournament type, nr of games per pairing, number of cycles.) So you could start new game processes at any time.
You could also provide the complete set of games at once in your tournament file, where initially all games are in state "open", and then update that file by changing "open" into "now playing in A" and later into the result. This way you need only one file while in the other case you would also need a file where the games are found that are not played yet.

Still you need something like flock().
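
A sketch of how claiming a game under flock() might look, under a deliberately simplified and hypothetical layout: the file holds exactly one byte per game, '.' meaning open, a letter meaning "being played by that process", and +, - or = meaning finished. A real tournament file would have the participant section in front, so the slot offsets would need a base added. The names claim_next_game, record_result and SLOT_OPEN are invented for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

#define SLOT_OPEN '.'   /* hypothetical marker for a game nobody has claimed yet */

/* Claim the first open game by overwriting its slot with id.
 * Returns the game index, -1 if no open game is left, -2 on error. */
long claim_next_game(const char *path, char id)
{
    assert(path != NULL);

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -2;
    if (flock(fd, LOCK_EX) != 0) {      /* scan and claim must be one locked step */
        close(fd);
        return -2;
    }

    long claimed = -1;
    char c;
    ssize_t n;
    for (off_t i = 0; (n = pread(fd, &c, 1, i)) == 1; i++) {
        if (c == SLOT_OPEN) {
            claimed = (pwrite(fd, &id, 1, i) == 1) ? (long)i : -2;
            break;
        }
    }
    if (n < 0)
        claimed = -2;                   /* read error while scanning */

    flock(fd, LOCK_UN);
    close(fd);
    return claimed;
}

/* Replace the ID character of a claimed game with its result (+, - or =).
 * Returns 0 on success, -1 on error. */
int record_result(const char *path, long game, char result)
{
    assert(path != NULL && game >= 0);

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX) != 0) {
        close(fd);
        return -1;
    }
    int rc = (pwrite(fd, &result, 1, (off_t)game) == 1) ? 0 : -1;
    flock(fd, LOCK_UN);
    close(fd);
    return rc;
}
```

The important part is that the scan for an open slot and the write that claims it happen while the same lock is held; otherwise two processes could both see the same '.' and both start that game.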

Sven