Writing bugs

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: Writing bugs

Post by chrisw »

mar wrote: Fri Jan 11, 2019 5:58 am
chrisw wrote: Thu Jan 10, 2019 9:11 pm Each parallel batch containing identical games is (was) the problem

Code: Select all

int random() {return 42;}
Wait - could this possibly be related to the fact that you used Python, where threads are completely useless due to the GIL and you have to use multiple processes? :roll:

Then your "bug" is directly related to the language "features" you used, not at all to the pseudo-code which tells absolutely nothing.
Yes, it's Python parallel/delayed code which then calls up (8) C threads; each thread plays out 1024 games (not chess). At the beginning of each game the piece-square table is varied a bit: each entry has a small-ish randint based on rand() added to it. The speed-up is near-linear, and multiple parallel threads for game batches work fine. (Selecting "multiprocessing" slows it down by a factor of two, btw.)
You're reading the scrappy pseudo-code too literally! Sorry, my fault.
Each 1024-game batch comes back with identical game data (moves, result, everything) as every other game batch, although the games within each batch are mostly different. Whereas, if I compute the 8*1024 batches in serial, the games are all mostly different.
I have more or less fixed it, but in quite a scrappy way, and I've another quick-fix idea to try, but I was mostly wondering if anyone else ever faced a similar problem and how they got round it.
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: Writing bugs

Post by chrisw »

mar wrote: Fri Jan 11, 2019 6:59 am Ok, seriously: you forgot to reseed, no big deal.
This is suspicious as well:

Code: Select all

engine_weights += initial_weights + random(small_variation)
as it would skyrocket the weights quickly. I believe the first += should actually be =
srand(seed) doesn't help
grahamj
Posts: 43
Joined: Thu Oct 11, 2018 2:26 pm
Full name: Graham Jones

Re: Writing bugs

Post by grahamj »

chrisw wrote: Fri Jan 11, 2019 10:49 am Yes, it's Python parallel/delayed code which then calls up (8) C-threads, each thread plays out 1024 games (not chess). At the beginning of each game the piece square table is varied a bit, each entry has a small-ish randint based on rand() added to it.. The speed up is near linear, multiple parallel threads for game batches work fine. (Selecting "multiprocessing" slows it down by a factor of two, btw).
You're reading the scrappy pseudo code too literally! Sorry, my fault.
Each 1024 game batch comes back with identical game data, moves, result, everything as every other game batch, although the games within the batches are mostly different. Whereas, if I compute 8*1024 batches in serial, the games are all mostly different.
I have more or less fixed it, but in quite a scrappy way, and I've another quick fix idea to try, but was mostly wondering if anyone else ever faced a similar problem and how they got round it.
You can generate a lot of random numbers first (and write them to file(s) if need be). If you need a huge number of random numbers per game, at least make 8*1024 high-quality seeds. Using a fast PRNG that is well seeded per game should be good enough for your purpose.

You said srand(seed) doesn't work. Perhaps it is not thread-safe. It's not hard to roll your own, e.g. https://en.wikipedia.org/wiki/Xorshift.
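For reference, the smallest generator on that page, xorshift32, fits in a few lines (a sketch; the 13/17/5 shift constants are the ones given in the article, and the state must be seeded nonzero):

```c
#include <stdint.h>

/* xorshift32: period 2^32 - 1; the state must never be zero */
uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}
```

Seeding each thread's state differently, e.g. from a master PRNG before the threads start, gives every thread its own independent stream.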
Graham Jones, www.indriid.com
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: Writing bugs

Post by chrisw »

grahamj wrote: Fri Jan 11, 2019 12:39 pm
You can make a lot of random numbers first (and write to file(s) if need be). If you need a huge number of random numbers per game, at least make 8*1024 high quality seeds. Using a fast PRNG which is well-seeded per game should be good enough for your purpose.

You said srand(seed) doesn't work. Perhaps it is not threadsafe. It's not hard to roll your own, eg https://en.wikipedia.org/wiki/Xorshift.
Thanks for the Xorshift reference, will use that; I was trying to avoid building big tables.
I'm still intrigued as to why rand() always produces the same stream of random numbers in parallel versions of the same thread. Even trying srand(seed) with a different seed in each thread (I used thread_id as the seed) didn't work: it resets the random number stream, but each thread then continues to use an identical random stream.
My tacky solution was to mangle the output of rand() with the thread_id, but I think giving each thread its own xorshift state, e.g. state[8] for 8 threads, and initialising the 8 states randomly beforehand, should work. Will try it ...
The other thought was to call rand() thread_id times before using it, so each thread uses out-of-sequence values from the random stream, but that seems a bit tacky too.
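The state[8] idea might be sketched like this (my code, not chrisw's: it assumes 8 threads and a xorshift64 step, and the seeding constant is just one way to make the initial states distinct and nonzero):

```c
#include <stdint.h>

#define NUM_THREADS 8

/* one xorshift64 state per thread, filled in before the threads start */
static uint64_t state[NUM_THREADS];

/* distinct nonzero seeds: base xor-ed with multiples of an odd 64-bit constant */
void seed_states(uint64_t base)
{
    for (int t = 0; t < NUM_THREADS; t++)
        state[t] = base ^ (0x9E3779B97F4A7C15ull * (uint64_t)(t + 1));
}

/* each thread calls this with its own id; no two threads share a state slot */
uint64_t thread_rand(int thread_id)
{
    uint64_t x = state[thread_id];
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return state[thread_id] = x;
}
```

Because each thread only ever touches its own slot, there is no shared hidden state for the threads to collide on.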
User avatar
Rebel
Posts: 6991
Joined: Thu Aug 18, 2011 12:04 pm

Re: Writing bugs

Post by Rebel »

chrisw wrote: Fri Jan 11, 2019 3:19 pm
Thanks for the Xorshift reference, will use that, I was trying to avoid building big tables.
I'm still intrigued as to why rand() will always produce the same stream of random numbers in parallel versions of the same thread. Even trying srand(seed) with different seeds in each thread (I used thread_id as the seed) didn't work, it resets the random number stream, but each thread then continues to use identical random streams.
My tacky solution was to mangle the output of rand() with the thread_id, but I think giving each thread its own xorshift state, eg state[8] for 8 threads, and initialising the 8 states randomly beforehand should work. Will try it ...
The other thought, was to call rand() thread_id times before using it, so each thread uses out of sequence values from the random stream, but that seems a bit tacky too.
Not sure if it is helpful, but you can start from a number that is always different, and that is the clock.

Code: Select all

        r=clock();
        clock_t a; a=clock(); randomize();
        r=random(9999);
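One caveat with the clock in a threaded setting: threads launched together can all read the same tick, so a hedged variant of this idea (my sketch, not Rebel's code) also folds the thread id into the seed:

```c
#include <time.h>

/* distinct seeds even when every thread reads the same clock value;
 * the odd multiplier just spreads consecutive thread ids far apart */
unsigned make_seed(int thread_id)
{
    return (unsigned)time(NULL) ^ ((unsigned)thread_id * 2654435761u);
}
```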
90% of coding is debugging, the other 10% is writing bugs.
Ras
Posts: 2487
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: Writing bugs

Post by Ras »

chrisw wrote: Fri Jan 11, 2019 3:19 pmI'm still intrigued as to why rand() will always produce the same stream of random numbers in parallel versions of the same thread.
rand() has a global variable behind it, the state of the random generator. If you do stuff on a global variable from one thread, another thread will not see the update, at least not immediately, because it's in the CPU cache and not written back to main memory. So what happens is that the next thread doesn't notice that the rand state is different now, so it fetches the same state as the thread before, and consequently gets the same random number.

What you want in this case is to have a thread use rand() while also pushing out the modified randomness state in a way that other threads will see. This write-back costs time and therefore happens only when you tell the CPU to do it, i.e. when you use thread synchronisation primitives.
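A sketch of that suggestion, serialising rand() behind a pthread mutex (my code; whether this actually cures the symptom is questioned later in the thread):

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t rand_lock = PTHREAD_MUTEX_INITIALIZER;

/* serialise access so only one thread at a time touches rand()'s shared state */
int locked_rand(void)
{
    pthread_mutex_lock(&rand_lock);
    int r = rand();
    pthread_mutex_unlock(&rand_lock);
    return r;
}
```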
Rasmus Althoff
https://www.ct800.net
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: Writing bugs

Post by chrisw »

Rebel wrote: Fri Jan 11, 2019 5:03 pm
Not sure if it is helpful but you can start from a number that is always different and that is the clock.

Code: Select all

        r=clock();
        clock_t a; a=clock(); randomize();
        r=random(9999);
doesn't make a difference in this case. Although, actually, I just found that if I use the Python randint() function it's fine, but if I use rand() within the C code it isn't fine; see results below ....

Code: Select all

import random
import time
from joblib import Parallel, delayed
import BB  # the C extension module providing rand64()

def print_something_random(i, j):
    if j == 0:
        x = random.randint(0, 100)
        y = random.randint(0, 100)
        z = random.randint(0, 100)
    else:
        x = BB.rand64(100)
        y = BB.rand64(100)
        z = BB.rand64(100)
    print(i, x, y, z)

workers = 8
print('serial ....')
for i in range(workers):
    print_something_random(i, 0)
time.sleep(1)

print('parallel python ....')
Parallel(n_jobs=workers)(delayed(print_something_random)(thread_id, 0) for thread_id in range(workers))
time.sleep(1)

print('parallel C Python API ....')
Parallel(n_jobs=workers)(delayed(print_something_random)(thread_id, 1) for thread_id in range(workers))

Code: Select all

PyObject* rand64_impl(PyObject *dummy, PyObject *o)
{
	int x = PyLong_AsLong(o);   /* upper bound passed in from Python */
	int y = rand();             /* uses the C library's hidden global state */
	int z = y % x;
	return PyLong_FromLong(z);
}

Code: Select all

serial ....
0 99 59 70
1 4 49 75
2 11 71 58
3 74 89 77
4 76 37 86
5 72 6 45
6 86 28 44
7 63 15 87
parallel python ....
0 55 2 74
1 34 5 39
2 80 43 42
3 7 31 51
5 60 51 71
6 66 87 37
4 75 13 12
7 7 37 77
parallel C Python API....
0 41 67 34
1 41 67 34
2 41 67 34
3 41 67 34
4 41 67 34
5 41 67 34
6 41 67 34
7 41 67 34
User avatar
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Writing bugs

Post by hgm »

Ras wrote: Fri Jan 11, 2019 6:45 pmIf you do stuff on a global variable from one thread, another thread will not see the update, at least not immediately, because it's in the CPU cache and not written back to main memory.
No, that cannot happen. Caching is fully transparent and coherent. From the very beginning x86 CPUs have been doing "cache snooping" through the MESI protocol, i.e. if one CPU/core dirties its private cache by writing to it, all other CPUs/cores will clear the corresponding entry if it happens to be in their cache. And if CPU A requests a memory location that is in the cache of CPU B and 'dirty', CPU B will provide it instead of the memory.

I would have expected the state of the PRNG to reside in global memory, shared by all threads. Apparently it does not, and each thread has its own. That sounds like a bug in the PRNG.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Writing bugs

Post by Joost Buijs »

That is why it is usually better to write everything yourself instead of relying on buggy library implementations.
Ras
Posts: 2487
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: Writing bugs

Post by Ras »

hgm wrote: Fri Jan 11, 2019 7:14 pmNo, that cannot happen. Caching is fully transparent and coherent.
OK, that's a point, but the load-modify-store cycle is interruptible, and memory ordering is not guaranteed across cores because the execution order can vary. I'm not sure whether this alone could already explain the observations.
That sounds like a bug in the PRNG.
No, rand() is specified as not reentrant and not thread-safe, so what actually happens with multithreaded use will always depend on the libc in question. It's even legal to have the state per thread rather than global.
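This is why POSIX provides rand_r(), which takes the generator state as an explicit argument, so each thread can simply own its state outright (sketch):

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>

/* reentrant: the caller owns the state, nothing hidden inside the C library */
int bounded_rand(unsigned *state, int bound)
{
    return rand_r(state) % bound;
}
```

Each thread keeps its own `unsigned` seed variable (initialised differently per thread) and passes a pointer to it on every call, so no synchronisation is needed at all.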
Rasmus Althoff
https://www.ct800.net