Writing to a Text File (Thread Safe)

mar · Post by **mar** » Tue Aug 13, 2013 8:22 pm

syzygy wrote:We want to know whether fprintf() is atomic, so we should test fprintf() and not something lower level such as write(). This is the main point of the discussion: (non-)atomicity of lower level functions says nothing about (non-)atomicity of higher level functions.

If fprintf is not thread-safe on Windows (buffering/...), then it can't be atomic either. I suggest trying setbuf(f, 0), which may (or may not, of course) force it to behave like a low-level wrapper.
I wouldn't rely on that, though, because, as you said, Microsoft doesn't care about standards, which is why a lot of programmers love it.
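For anyone who wants to try the unbuffered experiment suggested above, here is a minimal sketch. The helper name `open_unbuffered_log` is hypothetical; `setvbuf` with `_IONBF` is the standard C call that `setbuf(f, 0)` abbreviates. Turning buffering off only removes the user-space buffer. It does not by itself make fprintf() atomic or thread-safe.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: open a log file for appending with stdio
 * buffering switched off, so every stdio call is passed to the OS
 * immediately. Returns NULL if the file cannot be opened or the
 * buffering mode cannot be changed. */
FILE *open_unbuffered_log(const char *path)
{
    assert(path != NULL);

    FILE *f = fopen(path, "a");
    if (f == NULL)
        return NULL;

    /* setvbuf() has to be called before any other operation on the
     * stream. _IONBF asks for no buffering at all; setbuf(f, NULL)
     * is shorthand for the same request. */
    if (setvbuf(f, NULL, _IONBF, 0) != 0) {
        fclose(f);
        return NULL;
    }
    return f;
}
```

A caller would open the log once with `open_unbuffered_log("bla.txt")` and then hand the returned stream to every thread under test.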

bob · Post by **bob** » Tue Aug 13, 2013 9:07 pm

syzygy wrote:
Sven Schüle wrote:
syzygy wrote:
Sven Schüle wrote:Insert a random sleep of a few milliseconds into the worker loop and see what happens
Nothing happened.
I tested with 4 threads and num=100 on Windows 7, after inserting a random sleep of 1..10 msec into the loop body. No single line of the resulting log file looks suspicious but obviously many lines are overwritten, e.g. the line containing the number 100 appears more than 4 times while others appear less than 4 times, etc.
Windows 7 might not be POSIX compliant...

How did you verify that everything worked correctly in your case, other than with the "sort -n bla.txt | md5sum" command?
Do a few runs and check that the md5sums stay constant. On a POSIX-compliant platform.

mar wrote:It is possible that stdio is thread-safe on Linux but not on Windows (the standard doesn't mandate it).
As I understand it, POSIX does mandate it (and not just thread-safety of fprintf() but also atomicity). Of course Windows does not care about POSIX.

Plus you don't know what fancy stuff stdio functions do behind the curtains,
so if you want to actually measure anything, you should be using low level functions for writing stuff instead of testing thread safety of CRT.
We want to know whether fprintf() is atomic, so we should test fprintf() and not something lower level such as write(). This is the main point of the discussion: (non-)atomicity of lower level functions says nothing about (non-)atomicity of higher level functions.

???

fprintf/printf use write() to do the I/O. There's no other option.

I've not seen any interlaced I/O, that's not what I stated. I stated that the file position pointer seems to generate a race condition in threads, and that is outside the scope of the "atomic with regard to writing a complete line without interleaving other output..."

That is certainly a problem on the flavors of Linux I run, as I had to debug it about a week or so prior to 23.6 being released.

The problem I had was that there were obvious overwrites on a line here and there, and an occasional "gap" where no data was written to a few specific bytes leaving zeros/garbage there. Inspecting the logs would expose that. Most everything I had done regarding fprintf() was inside a lock if multiple threads could do output (something VERY rare in Crafty except for when debugging the SMP code). But printing the null-window fail high could happen from any thread and it caused problems until it was protected via a lock, then the problem disappeared permanently.
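The lock-protected printing described above can be sketched roughly as follows. This is not Crafty's code: `log_printf` and `log_lock` are hypothetical names, and a POSIX mutex stands in for Crafty's Lock()/Unlock() macros.

```c
#include <assert.h>
#include <pthread.h>
#include <stdarg.h>
#include <stdio.h>

/* One process-wide lock for all log output (hypothetical name). */
static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;

/* Format and write one message while holding log_lock, so that no
 * other thread using log_printf() can write in the middle of it.
 * Returns whatever vfprintf() returns: the number of characters
 * written, or a negative value on error. */
int log_printf(FILE *f, const char *fmt, ...)
{
    assert(f != NULL && fmt != NULL);

    va_list ap;
    va_start(ap, fmt);

    pthread_mutex_lock(&log_lock);
    int n = vfprintf(f, fmt, ap);
    fflush(f);                      /* push the text out before unlocking */
    pthread_mutex_unlock(&log_lock);

    va_end(ap);
    return n;
}
```

The protection only holds if every thread writes through the same function. A single stray fprintf() elsewhere bypasses the lock.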

syzygy · Post by **syzygy** » Tue Aug 13, 2013 11:01 pm

bob wrote:???

What is so difficult to understand:

syzygy wrote:We want to know whether fprintf() is atomic, so we should test fprintf() and not something lower level such as write(). This is the main point of the discussion: (non-)atomicity of lower level functions says nothing about (non-)atomicity of higher level functions.

This is all legal English and technically correct. If too difficult, bad luck.

fprintf/printf use write() to do the I/O. There's no other option.

So what. What I wrote is correct, but maybe too difficult.

syzygy · Post by **syzygy** » Tue Aug 13, 2013 11:48 pm

bob wrote:That is certainly a problem on the flavors of Linux I run, as I had to debug it about a week or so prior to 23.6 being released.

The problem I had was that there were obvious overwrites on a line here and there, and an occasional "gap" where no data was written to a few specific bytes leaving zeros/garbage there. Inspecting the logs would expose that. Most everything I had done regarding fprintf() was inside a lock if multiple threads could do output (something VERY rare in Crafty except for when debugging the SMP code). But printing the null-window fail high could happen from any thread and it caused problems until it was protected via a lock, then the problem disappeared permanently.

You mean this code?

Code:

    Lock(lock_io);
    Print(2, "         %2i   %s     %2s   ", iteration_depth,
        Display2Times(end_time - start_time), fh_indicator);
    if (display_options & 64)
      Print(2, "%d. ", move_number);
    if ((display_options & 64) && !wtm)
      Print(2, "... ");
    Print(2, "%s! ", OutputMove(tree, tree->curmv[1], 1, wtm));
    Print(2, "(%c%s)                  \n", (wtm) ? '>' : '<',
        DisplayEvaluationKibitz(value, wtm));
    Unlock(lock_io);

You're doing multiple Print()s here resulting in a single line of output, so yes, you have to lock. I said so from the start.

There is already a Lock(lock_root) around this piece of code, but I suppose other threads may issue Print()s from other places. Those other Print()s can get between the individual Print()s in the code above, resulting in corrupted lines. If you had looked carefully, you could have pieced the pieces back together.

bob wrote:So yes, I have unintentionally tried this, and I have seen the resulting corrupted log file with garbage in the middle, and lines partially overwriting previous (shorter) lines...

Of course, you're using multiple fprintf()s to write a single line....

bob wrote:The problem seems to not be the actual characters being written, it seems that fprintf() updates the file position pointer outside the lock. If thread A writes 120 bytes, then thread b writes 80 bytes, every now and then the file pointer comes out wrong.

Nothing to do with "file position pointers" getting confused. Your Print()s simply got interleaved with other Print()s.

If you had anything other than interchanged outputs from individual Print()s, it was not on a POSIX-compliant platform.
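On a POSIX platform, one way to keep several fprintf() calls together as a single line is to hold the stream's own lock around them with flockfile()/funlockfile(). The sketch below is a simplified illustration and not the Crafty code quoted above. The function name and its parameters are hypothetical, and error handling is left out.

```c
#define _POSIX_C_SOURCE 200809L   /* expose flockfile()/funlockfile() */
#include <assert.h>
#include <stdio.h>

/* Emit one logical line that is built from several fprintf() calls.
 * flockfile() takes the lock that the stdio functions themselves use
 * on this stream, so output from other threads writing to f cannot
 * land between the pieces. Returns the total number of characters
 * written (fprintf() errors are not checked in this sketch). */
int print_fail_high(FILE *f, int depth, int move_number, const char *move)
{
    assert(f != NULL && move != NULL);

    int n = 0;
    flockfile(f);
    n += fprintf(f, "%2d   ", depth);
    n += fprintf(f, "%d. ", move_number);
    n += fprintf(f, "%s!\n", move);
    funlockfile(f);
    return n;
}
```

An alternative with the same effect is to build the whole line in a local buffer with snprintf() and hand it to the stream in one call.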

sje · Post by **sje** » Wed Aug 14, 2013 12:12 am

You are all wrong.

The point is not about the atomicity/re-entrancy/thread-safe nature of any particular library output function. At least some of you are making unjustified assumptions about function attributes which cannot be easily proven or disproven. And from an engineering standpoint, that is a serious error which may lead to unpredictable, hard-to-reproduce program failures.

Further, making assumptions of which library routines call which other routines and under what conditions is a poor practice as it depends so much upon the particulars of a given library on a given O/S at a given time.

Also, even if you're absolutely sure, you can still be wrong. Recently I was backporting Symbolic to a 32-bit PowerPC environment and one of the threads which had never had a problem was now acting in a highly peculiar manner. I had followed the library documentation very closely and had gotten the identical code to work flawlessly on all the other platforms. It took me three whole hours to find the fault: the Mac OS X 10.4.9 library routine poll() was incorrectly implemented and would spuriously indicate pending I/O activity when there was none. But only under subtle, non-obvious circumstances. A switch to the select() routine fixed the problem. Decades of real-world experience have taught me to not rely upon any particular library routine conforming to specification. Without that doubt, I might still be looking for that poll() bug.

The real point is about writing working code which is easy to understand. (See my first post in this thread.) And that means minimizing assumptions about hidden details. Six pages of theological argumentation does nothing to change that simple fact.

bob · Post by **bob** » Wed Aug 14, 2013 2:48 am

syzygy wrote:
bob wrote:???
What is so difficult to understand:
syzygy wrote:We want to know whether fprintf() is atomic, so we should test fprintf() and not something lower level such as write(). This is the main point of the discussion: (non-)atomicity of lower level functions says nothing about (non-)atomicity of higher level functions.
This is all legal English and technically correct. If too difficult, bad luck.

fprintf/printf use write() to do the I/O. There's no other option.
So what. What I wrote is correct, but maybe too difficult.

As I said, there is a flaw. Not easy to expose, but there IS a flaw. I've already explained why. It was not my imagination...

If you think what you are doing is fine, go for it. You WILL get to debug it again, one day...

bob · Post by **bob** » Wed Aug 14, 2013 2:50 am

syzygy wrote:
bob wrote:That is certainly a problem on the flavors of Linux I run, as I had to debug it about a week or so prior to 23.6 being released.

The problem I had was that there were obvious overwrites on a line here and there, and an occasional "gap" where no data was written to a few specific bytes leaving zeros/garbage there. Inspecting the logs would expose that. Most everything I had done regarding fprintf() was inside a lock if multiple threads could do output (something VERY rare in Crafty except for when debugging the SMP code). But printing the null-window fail high could happen from any thread and it caused problems until it was protected via a lock, then the problem disappeared permanently.
You mean this code?
Code:
    Lock(lock_io);
    Print(2, "         %2i   %s     %2s   ", iteration_depth,
        Display2Times(end_time - start_time), fh_indicator);
    if (display_options & 64)
      Print(2, "%d. ", move_number);
    if ((display_options & 64) && !wtm)
      Print(2, "... ");
    Print(2, "%s! ", OutputMove(tree, tree->curmv[1], 1, wtm));
    Print(2, "(%c%s)                  \n", (wtm) ? '>' : '<',
        DisplayEvaluationKibitz(value, wtm));
    Unlock(lock_io);
You're doing multiple Print()s here resulting in a single line of output, so yes, you have to lock. I said so from the start.

There is already a Lock(lock_root) around this piece of code, but I suppose other threads may issue Print()s from other places. Those other Print()s can get between the individual Print()s in the code above, resulting in corrupted lines. If you had looked carefully, you could have pieced the pieces back together.

bob wrote:So yes, I have unintentionally tried this, and I have seen the resulting corrupted log file with garbage in the middle, and lines partially overwriting previous (shorter) lines...
Of course, you're using multiple fprintf()s to write a single line....

bob wrote:The problem seems to not be the actual characters being written, it seems that fprintf() updates the file position pointer outside the lock. If thread A writes 120 bytes, then thread b writes 80 bytes, every now and then the file pointer comes out wrong.
Nothing to do with "file position pointers" getting confused. Your Print()s simply got interleaved with other Print()s.

If you had anything other than interchanged outputs from individual Print()s, it was not on a POSIX-compliant platform.

No, not that code. The code is in SearchFH(), which is a single print that displays the fail-high move. I had it in Search() where I called SearchFH() and it corrupted my log file. Not in every game, but several times in 300 games. Not acceptable. I moved it to the SearchFH() code which is protected by a lock since I am updating the root move list and such and can't afford to do that in two threads simultaneously, which can happen when you split at the root as I do.

bob · Post by **bob** » Wed Aug 14, 2013 2:52 am

sje wrote:You are all wrong.

The point is not about the atomicity/re-entrancy/thread-safe nature of any particular library output function. At least some of you are making unjustified assumptions about function attributes which cannot be easily proven or disproven. And from an engineering standpoint, that is a serious error which may lead to unpredictable, hard-to-reproduce program failures.

Further, making assumptions of which library routines call which other routines and under what conditions is a poor practice as it depends so much upon the particulars of a given library on a given O/S at a given time.

Also, even if you're absolutely sure, you can still be wrong. Recently I was backporting Symbolic to a 32-bit PowerPC environment and one of the threads which had never had a problem was now acting in a highly peculiar manner. I had followed the library documentation very closely and had gotten the identical code to work flawlessly on all the other platforms. It took me three whole hours to find the fault: the Mac OS X 10.4.9 library routine poll() was incorrectly implemented and would spuriously indicate pending I/O activity when there was none. But only under subtle, non-obvious circumstances. A switch to the select() routine fixed the problem. Decades of real-world experience have taught me to not rely upon any particular library routine conforming to specification. Without that doubt, I might still be looking for that poll() bug.

The real point is about writing working code which is easy to understand. (See my first post in this thread.) And that means minimizing assumptions about hidden details. Six pages of theological argumentation does nothing to change that simple fact.

Sounds like you are making assumptions about and depending on select().

In any case, I have never used anything but select(), which has worked on every platform I have ever tried, unix-wise.

sje · Post by **sje** » Wed Aug 14, 2013 4:28 am

bob wrote:Sounds like you are making assumptions about and depending on select().

Select() is a very old routine going back to pre-release Unix and if it didn't work, then the kernel wouldn't likely work either. But I originally wanted to avoid select() because it did more than what I needed by handling multiple descriptors simultaneously.
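For reference, checking a single descriptor for pending input with select() can look like the sketch below. The helper name `input_pending` is hypothetical and this is not Symbolic's code. A zero timeout turns the call into a non-blocking check.

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>   /* also declares select() on some older systems */

/* Hypothetical helper: return 1 if fd has data ready to read, 0 if
 * it does not, and -1 if select() itself fails. */
int input_pending(int fd)
{
    /* fd_set can only represent descriptors below FD_SETSIZE. */
    assert(fd >= 0 && fd < FD_SETSIZE);

    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);

    struct timeval tv = {0, 0};   /* zero timeout: do not block */

    /* The first argument is the highest descriptor number plus one. */
    int r = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (r < 0)
        return -1;
    return FD_ISSET(fd, &readfds) ? 1 : 0;
}
```

select() accepts whole sets of descriptors, which is the extra generality mentioned above. Here the set simply contains one entry.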

I don't know how the bad poll() code got into the Mac OS/X PowerPC C/C++ library. Perhaps because this kind of work is often done by summer interns or by persons with only a few weeks left until a new job or retirement.

jundery · Post by **jundery** » Wed Aug 14, 2013 6:52 am

sje wrote: The real point is about writing working code which is easy to understand. (See my first post in this thread.) And that means minimizing assumptions about hidden details. Six pages of theological argumentation does nothing to change that simple fact.

If you want minimal assumptions, then without locking never assume any I/O ordering beyond the output of a single writer being sequential. The reason for this is simple: every OS wants to be correct yet efficient, so it will process its kernel buffers sequentially but not enforce that user-space buffers are contiguously mapped to kernel buffers. This is done so that, if only a single user-space writer exists, the kernel doesn't waste resources on locking. If multiple writers exist, it is the responsibility of the user-space code to enforce contiguous writes via its own locking. The other side effect is that it is often most efficient for the OS to service a single writer at a time, which can make it look like no locking is required when it really is.

tl;dr: If you have multiple writers, use a lock to ensure I/O is contiguous in portable code.
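A small test harness along these lines can check that advice. It is only a sketch: all names (`run_writers`, `writer`, `io_lock`, `MAX_WRITERS`) are hypothetical, and POSIX threads are assumed. Each writer builds a line from two fprintf() calls under a mutex, and the harness then reads the file back and counts the lines that are intact and in order for their writer.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

#define MAX_WRITERS 16

static pthread_mutex_t io_lock = PTHREAD_MUTEX_INITIALIZER;

struct writer_arg { FILE *f; int id; int count; };

/* Each line is deliberately built from two fprintf() calls; the lock
 * keeps the two halves of a line together. */
static void *writer(void *p)
{
    struct writer_arg *a = p;
    for (int i = 0; i < a->count; i++) {
        pthread_mutex_lock(&io_lock);
        fprintf(a->f, "writer %d ", a->id);
        fprintf(a->f, "line %d\n", i);
        pthread_mutex_unlock(&io_lock);
    }
    return NULL;
}

/* Run nthreads writers on f (which must be open for reading and
 * writing), then count the lines that are well formed and sequential
 * for their writer. Returns -1 if a thread could not be started. */
int run_writers(FILE *f, int nthreads, int count)
{
    assert(f != NULL && nthreads > 0 && nthreads <= MAX_WRITERS && count > 0);

    pthread_t tid[MAX_WRITERS];
    struct writer_arg args[MAX_WRITERS];
    int next[MAX_WRITERS] = {0};    /* next expected line per writer */
    int started = 0;

    for (int t = 0; t < nthreads; t++) {
        args[t] = (struct writer_arg){f, t, count};
        if (pthread_create(&tid[t], NULL, writer, &args[t]) != 0)
            break;
        started++;
    }
    for (int t = 0; t < started; t++)
        pthread_join(tid[t], NULL);
    if (started != nthreads)
        return -1;

    rewind(f);                      /* switch the stream from writing to reading */
    char line[128];
    int good = 0;
    while (fgets(line, sizeof line, f) != NULL) {
        int id, n;
        char nl;
        if (sscanf(line, "writer %d line %d%c", &id, &n, &nl) == 3 &&
            nl == '\n' && id >= 0 && id < nthreads && n == next[id]) {
            next[id]++;
            good++;
        }
    }
    return good;
}
```

With the lock in place, `run_writers(f, 4, 100)` on a stream from `tmpfile()` should report all 400 lines. Removing the two mutex calls gives the unprotected experiment: on a POSIX platform each individual fprintf() still locks the stream, but the two halves of a line may then be separated by another writer's output.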
