pthread weirdness

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: pthread weirdness

Post by bob »

Note that my original testing was done when I was locking the hash table for each lookup/store, prior to adding the lockless hashing from Cray Blitz. So there I was locking a _bunch_ of times per second, and the mutex was horrible. If another thread already holds the lock, you block in the current thread, and blocking is bad when the other thread will only hold the lock for 4-6 instructions at most...
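
For readers who haven't seen it, the lockless hashing idea works by storing the key XORed with the data, so a torn update from a simultaneous store is detected at probe time instead of being prevented with a lock. A minimal sketch (types and field names here are illustrative, not the actual Crafty source):

typedef struct { volatile unsigned long long word1, word2; } HashEntry;

void hash_store(HashEntry *e, unsigned long long key, unsigned long long data) {
  e->word1 = data;
  e->word2 = key ^ data;   /* signature: only valid if both words match up */
}

int hash_probe(HashEntry *e, unsigned long long key, unsigned long long *data) {
  unsigned long long w1 = e->word1, w2 = e->word2;
  if ((w1 ^ w2) != key)
    return 0;              /* empty, wrong position, or torn entry: a miss */
  *data = w1;
  return 1;
}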
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: pthread weirdness

Post by diep »

Tord Romstad wrote:
bob wrote:I never used mutex locks, as the overhead is unbearable and it murders efficiency if things are locked/unlocked frequently.
I have seen you make this claim before, but I have never been able to find anything faster than pthread_mutex_t locks for Glaurung. I made another experiment with my program last night, comparing the performance using pthread_mutex_t, OSSpinLock and Crafty's x86 assembly language locks. The tests were done on an iMac Core Duo 2 GHz running Mac OS X 10.4.9. The compiler was gcc 4.0.1.

I tried several positions, and the result was the same for all of them: pthread_mutex_t was fastest, and OSSpinLock and x86 assembly locks were about 2% slower. Am I doing something wrong, or is OS X somehow very different from Linux with respect to the performance of mutex locks?

Source code available on request, if somebody wants to repeat my experiment.

Tord
Hi Tord,

I tend to agree with Bob on this one. Note that there is a difference depending on the type of CPU you use and on which of the different assembly locks Crafty has been using. The current one might be P4-optimized.

Doing dumb tests with nonstop locking and unlocking is not a good idea. It is better simply to run Glaurung and see how it works out.

On NUMA hardware there may well be a performance issue when using Bob's spinlock.

The OS implementation usually does this (IRIX, for example): if the lock has not been acquired after X spin loops, the process gets put to sleep.

The disadvantage of this is that waking that process up costs another 10-20 milliseconds, yet for your application that might still be better if the locking is badly programmed.
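
A sketch of that spin-then-sleep idea (SPIN_LIMIT is an arbitrary illustrative constant that real implementations tune per system, and sched_yield() stands in for a true kernel sleep):

#include <sched.h>

#define SPIN_LIMIT 1000

static void spin_then_yield_lock(volatile int *lock) {
  for (;;) {
    for (int i = 0; i < SPIN_LIMIT; i++)
      if (__sync_lock_test_and_set(lock, 1) == 0)
        return;            /* acquired the lock while spinning */
    sched_yield();         /* give up the time slice instead of burning CPU */
  }
}

static void spin_then_yield_unlock(volatile int *lock) {
  __sync_lock_release(lock);
}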

In short, the best approach to locking is completely system- and CPU-dependent, not to mention application-dependent.

Note that AMD is totally superior to Intel in locking speed.

Vincent
Tord Romstad
Posts: 1808
Joined: Wed Mar 08, 2006 9:19 pm
Location: Oslo, Norway

Re: pthread weirdness

Post by Tord Romstad »

diep wrote:Doing dumb tests with nonstop locking and unlocking is not a good idea. It is better simply to run Glaurung and see how it works out.
I'm sorry for not expressing myself more clearly: Comparing the speeds of Glaurung with different types of locks was of course exactly what I did. Plain old mutex locks turned out to be marginally faster than the alternatives. Perhaps the results would have been different with a different CPU or OS, as you point out.

Tord
Guetti

Re: pthread weirdness

Post by Guetti »

Tord Romstad wrote:
diep wrote:Doing dumb tests with nonstop locking and unlocking is not a good idea. It is better simply to run Glaurung and see how it works out.
I'm sorry for not expressing myself more clearly: Comparing the speeds of Glaurung with different types of locks was of course exactly what I did. Plain old mutex locks turned out to be marginally faster than the alternatives. Perhaps the results would have been different with a different CPU or OS, as you point out.

Tord
Older Crafty versions (<= 19.17) had a switch -DMUTEX to use mutex locks instead of spinlocks. I think compiling with and without this flag and comparing bench times should indicate how large the difference is for Crafty, if you are interested in numbers...
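
The switch presumably just selects the lock flavor at compile time, something along these lines (a guess at the shape, not Crafty's actual source):

#include <pthread.h>

#ifdef MUTEX
typedef pthread_mutex_t lock_t;
#define LockInit(l)  pthread_mutex_init(&(l), NULL)
#define Lock(l)      pthread_mutex_lock(&(l))
#define Unlock(l)    pthread_mutex_unlock(&(l))
#else
typedef volatile int lock_t;
#define LockInit(l)  ((l) = 0)
#define Lock(l)      while (__sync_lock_test_and_set(&(l), 1))  /* spin */
#define Unlock(l)    __sync_lock_release(&(l))
#endif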
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: pthread weirdness

Post by bob »

Tord Romstad wrote:
diep wrote:Doing dumb tests with nonstop locking and unlocking is not a good idea. It is better simply to run Glaurung and see how it works out.
I'm sorry for not expressing myself more clearly: Comparing the speeds of Glaurung with different types of locks was of course exactly what I did. Plain old mutex locks turned out to be marginally faster than the alternatives. Perhaps the results would have been different with a different CPU or OS, as you point out.

Tord
If mutex locks are faster than my spinlocks, something is wrong, as _nothing_ is faster than the inline asm shadow locks I use. You only have to look at all the stuff done in the posix threads library mutex code to see why this is true.
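
For the record, an exchange-based spinlock boils down to very little code. A sketch using GCC inline asm (illustrative, not the exact Crafty source):

static inline void spin_lock(volatile int *lock) {
  for (;;) {
    int t = 1;
    __asm__ __volatile__("xchgl %0, %1"
                         : "+r"(t), "+m"(*lock)
                         :
                         : "memory");   /* atomic swap: xchg implies LOCK */
    if (t == 0)
      return;                           /* the lock was free; we own it now */
    while (*lock)
      ;                                 /* spin on the cached value first */
  }
}

static inline void spin_unlock(volatile int *lock) {
  __asm__ __volatile__("" ::: "memory");  /* compiler barrier */
  *lock = 0;
}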
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: pthread weirdness

Post by bob »

Pradu wrote:
CRoberson wrote:Pradu,

It has to do with threads being asynchronous. In normal programming, the value in a memory location that a pointer points to is predictable because execution is synchronized (the calling function waits on the called function to return). With threads, the calling function continues and may iterate (in James' case) more than once before the thread can get started.
I guess so, but when I was learning about pthreads and win32 threads, all the example code looked just the same. For example, these websites:
http://www.llnl.gov/computing/tutorials ... #Compiling
or
http://www.mhpcc.edu/training/workshop2 ... /MAIN.html
http://math.arizona.edu/~swig/documentation/pthreads/
http://www.yolinux.com/TUTORIALS/LinuxT ... reads.html

Also, I think the thread-creation function must return before another thread will be created in this loop. Here's an API reference:
http://cs.pub.ro/~apc/2003/resources/pt ... ers-14.htm

"The pthread_create() function creates a thread with the specified attributes and runs the C function start_routine in the thread with the single pointer argument specified. The new thread may, but will not always, begin running before pthread_create() returns. If pthread_create() completes successfully, the Pthread handle is stored in the contents of the location referred to by thread."

pthread_create() returns 0 if the thread starts normally and an error code if not. So I really don't see how the index can change before the thread is created. I'm not saying that either Prof. Hyatt or the websites are wrong or anything, as I'm rather new to the field of threads and wouldn't know anyway. I just hoped that Prof. Hyatt could post some example code to show the right way of starting many threads with pthreads or win32 threads.
Sorry, I missed your request for a "show me how to do it right"... here are a couple of ways to avoid this:
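
For reference, the broken pattern passes every thread the address of the same loop variable (tid[] and entry_point are illustrative names):

for (i = 1; i <= nthreads; i++)
  pthread_create(&tid[i], NULL, entry_point, &i);   /* every thread gets &i */

pthread_create() returning only means the thread was created; the new thread may not read *arg until after the loop has moved on, so several threads can see the same, or a stale, value.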

approach 1:

for (i = 1; i <= nthreads; i++)
  id = pthread_create(&tid[i], NULL, entry_point, (void *) i);

Then, in your entry point, the value passed in can be cast back to an int, and it will have a value of 1 in thread 1, 2 in thread 2, etc... Since we are not passing the address of a variable that can change, this will work just fine and is the way I used to do it when using posix threads.
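
In sketch form the entry point might look like this (the intptr_t round-trip is illustrative, to keep the integer/pointer conversion well-defined):

#include <stdint.h>

void *entry_point(void *arg) {
  int my_id = (int) (intptr_t) arg;   /* recover the integer passed by value */
  /* ... thread work using my_id ... */
  return NULL;
}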

approach 2:

int ids[n] = {1, 2, 3, 4, 5, ...};

for (i = 1; i <= nthreads; i++)
  id = pthread_create(&tid[i], NULL, entry_point, &ids[i-1]);

Here we pass the address of ids[i-1], which holds an integer value that does not change, making passing the address perfectly safe.

Hope that helps...
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: pthread weirdness

Post by sje »

bob wrote:Sorry, I missed your request for a "show me how to do it right"... here are a couple of ways to avoid this:

approach 1:

for (i = 1; i <= nthreads; i++)
  id = pthread_create(&tid[i], NULL, entry_point, (void *) i);

Then, in your entry point, the value passed in can be cast back to an int, and it will have a value of 1 in thread 1, 2 in thread 2, etc... Since we are not passing the address of a variable that can change, this will work just fine and is the way I used to do it when using posix threads.

approach 2:

int ids[n] = {1, 2, 3, 4, 5, ...};

for (i = 1; i <= nthreads; i++)
  id = pthread_create(&tid[i], NULL, entry_point, &ids[i-1]);

Here we pass the address of ids[i-1], which holds an integer value that does not change, making passing the address perfectly safe.

Hope that helps...

In approach 1: On some architectures, an odd or otherwise misaligned pointer value can induce a bus error. This may be rare nowadays, though, and might only happen during debugging.

In approach 2: The array ids[] had better not have automatic (stack) allocation, as the caller may exit prior to one or more threads having run long enough to dereference its input pointer. An explicit "static" storage declaration makes it safe.
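
Putting the two posts together, a complete sketch of approach 2 with the "static" fix (names illustrative; compile with -pthread):

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static int ids[NTHREADS] = {1, 2, 3, 4};   /* static: outlives the creator */

static void *entry_point(void *arg) {
  int my_id = *(int *) arg;                /* safe: ids[] never changes */
  printf("thread %d running\n", my_id);
  return NULL;
}

int main(void) {
  pthread_t tid[NTHREADS];
  for (int i = 0; i < NTHREADS; i++)
    pthread_create(&tid[i], NULL, entry_point, &ids[i]);
  for (int i = 0; i < NTHREADS; i++)
    pthread_join(tid[i], NULL);
  return 0;
}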