multithreading questions

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Pradu
Posts: 287
Joined: Sat Mar 11, 2006 3:19 am
Location: Atlanta, GA

Re: multithreading questions

Post by Pradu »

Gerd Isenberg wrote:But isn't it possible that one thread never gets the resource because some other spinning thread always takes the chance to write a "one" between the dirty read of a "zero" and the lock xchg?
Yes, it is possible, but that is essentially the same situation as never getting the lock because other threads always win the locked exchange before you. Maybe there's some way to design a spin-lock that ensures fairness among threads. Perhaps a wait-queue? Implementing that might be too expensive, though.
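One standard way to get the fairness discussed above is a ticket lock: each thread takes a ticket and spins until the "now serving" counter reaches it, so the lock is granted in strict FIFO order and no spinner can be starved. A minimal sketch with C11 atomics (all names are illustrative, not from any engine mentioned in this thread):

```c
#include <stdatomic.h>

/* Ticket lock: FIFO-fair spin lock. */
typedef struct {
    atomic_uint next_ticket;  /* next ticket to hand out */
    atomic_uint now_serving;  /* ticket currently allowed in */
} ticket_lock;

void ticket_lock_init(ticket_lock *l) {
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}

void ticket_lock_acquire(ticket_lock *l) {
    unsigned my = atomic_fetch_add(&l->next_ticket, 1);  /* take a ticket */
    while (atomic_load(&l->now_serving) != my)
        ;  /* spin until it is our turn */
}

void ticket_lock_release(ticket_lock *l) {
    atomic_fetch_add(&l->now_serving, 1);  /* admit the next waiter */
}
```

The price of fairness is that all waiters hammer the same cache line, which is why engines often stay with the simpler unfair spin-lock when contention is low.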
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: multithreading questions

Post by Gerd Isenberg »

Pradu wrote:
Gerd Isenberg wrote:But isn't it possible that one thread never gets the resource because some other spinning thread always takes the chance to write a "one" between the dirty read of a "zero" and the lock xchg?
Yes, it is possible, but that is essentially the same situation as never getting the lock because other threads always win the locked exchange before you. Maybe there's some way to design a spin-lock that ensures fairness among threads. Perhaps a wait-queue? Implementing that might be too expensive, though.
Of course you may introduce deadlocks. But with lock xchg alone, if you receive a zero, you already own the shared resource exclusively - while after "dirty"-reading a zero, you may have bad luck (again) and get a one from the xchg.
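The scheme being discussed here is the test-and-test-and-set spinlock: spin on a plain ("dirty") read until the lock looks free, then attempt the atomic exchange, which on x86 compiles to lock xchg. A minimal sketch in C11 (names illustrative):

```c
#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock;

void spin_acquire(spinlock *l) {
    for (;;) {
        /* cheap read-only spin: no bus locking while the lock is held */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;
        /* lock xchg: getting back 0 means we wrote the 1 and own the lock */
        if (atomic_exchange(&l->locked, 1) == 0)
            return;
        /* another thread slipped in between our read and the xchg: retry */
    }
}

void spin_release(spinlock *l) {
    atomic_store(&l->locked, 0);
}
```

The retry branch is exactly the "bad luck" case above: the dirty read saw a zero, but the exchange still returned a one.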
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: multithreading questions

Post by hgm »

Which sections of an SMP chess engine are critical? Are there any in which the threads really spend a significant amount of time, so that waiting for access can become a serious bottleneck?
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: multithreading questions

Post by Gerd Isenberg »

hgm wrote:Which sections of an SMP chess engine are critical? Are there any in which the threads really spend a significant amount of time, so that waiting for access can become a serious bottleneck?
I think others may give more profound advice - especially on SMP versus NUMA issues. I still use only a shared hashtable on a dual core.

I think all places where multiple threads/processes access a shared, volatile resource are critical. The longer the individual average access time, and the more threads running simultaneously on different processors, the higher the probability that a critical section becomes a bottleneck.

That is why most use lockless hashing, either with the xor trick or with legal-hashmove error detection, considering the latency of huge tables and their TLB issues. I actually use SSE 128-bit move instructions (atomic on K10?) to access the slots of my table - and hide the latency of the first read behind some pure xmm-register fill stuff.

It might be worthwhile to use thread-local tables for other stuff, like a secondary smaller transposition table (also used in qsearch), pawn-hash table, eval table, material tables etc. - simply to avoid bottlenecks from entering critical sections, despite the drawback of getting fewer hits overall.
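The xor trick mentioned above (from Hyatt's lockless hashing paper) stores the key xored with the data word, so a torn write by another thread makes the verification fail and the entry is simply treated as a miss. A minimal sketch, with an illustrative two-word slot layout:

```c
#include <stdint.h>

typedef struct { uint64_t key_xor_data; uint64_t data; } hash_slot;

void hash_store(hash_slot *slot, uint64_t key, uint64_t data) {
    slot->key_xor_data = key ^ data;  /* key is recoverable only with matching data */
    slot->data = data;
}

/* Returns 1 and fills *data on a verified hit, 0 otherwise. */
int hash_probe(const hash_slot *slot, uint64_t key, uint64_t *data) {
    uint64_t d = slot->data;
    if ((slot->key_xor_data ^ d) != key)
        return 0;  /* wrong position, or a torn/partial store by another thread */
    *data = d;
    return 1;
}
```

No locks are needed: an interleaved store from another thread can only corrupt the entry into something that fails the xor check.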
Pradu
Posts: 287
Joined: Sat Mar 11, 2006 3:19 am
Location: Atlanta, GA

Re: multithreading questions

Post by Pradu »

Gerd wrote:I actually use SSE 128-bit move instructions (atomic on K10?) to access the slots of my table - and hide the latency of the first read behind some pure xmm-register fill stuff.
That's a pretty neat idea. I suppose many of us use 128-bit hash entries (I do at least). Is this how one does it?

Code: Select all

typedef struct _hashEntryUsuable { .... } hashEntryUsuable; /*16-bytes*/

#ifndef __128_BIT_TYPE_DEFINED__
	#define __128_BIT_TYPE_DEFINED__
	#if defined(_MSC_VER)
		#include <xmmintrin.h>
		typedef __m128 T128;
	#else //GCC
		typedef float T128 __attribute__ ((__vector_size__ (16)));
	#endif //defined(_MSC_VER)
#endif //__128_BIT_TYPE_DEFINED__

typedef union {hashEntryUsuable entry; T128 datamove;} hashEntry;
Then you use an assignment operator "=" on the T128 datatype to store a hashEntry atomically?
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: multithreading questions

Post by Gerd Isenberg »

Pradu wrote:
Gerd wrote:I actually use SSE 128-bit move instructions (atomic on K10?) to access the slots of my table - and hide the latency of the first read behind some pure xmm-register fill stuff.
That's a pretty neat idea. I suppose many of us use 128-bit hash entries (I do at least). Is this how one does it?

Code: Select all

typedef struct _hashEntryUsuable { .... } hashEntryUsuable; /*16-bytes*/

#ifndef __128_BIT_TYPE_DEFINED__
	#define __128_BIT_TYPE_DEFINED__
	#if defined(_MSC_VER)
		#include <xmmintrin.h>
		typedef __m128 T128;
	#else //GCC
		typedef float T128 __attribute__ ((__vector_size__ (16)));
	#endif //defined(_MSC_VER)
#endif //__128_BIT_TYPE_DEFINED__

typedef union {hashEntryUsuable entry; T128 datamove;} hashEntry;
Then you use an assignment operator "=" on the T128 datatype to store a hashEntry atomically?

Yes, I have a wrapper class for 128-bit, aka bitboard[2]: one base class for the 16-byte aligned memory layout with a union based on __m128i, and two derived classes (XMM, GPR) with all kinds of operators overloaded. I can use type-templated functions with either XMM or 128-bit GPR using the usual C expressions, which is quite convenient compared to using the intrinsics all the time. A bit strange, but it fits my main purpose: add and sub are byte-(rank-)wise, shifts are qword-wise logical. I have some other functions around for the unpack stuff, extracting bytes/words etc.
For hash-table access I explicitly use XMM. The emitted opcode for a simple assignment to an XMM register variable is movaps xmmi, [address] if xmmi is not needed at once, otherwise movdqa. I have not tried non-temporal writes yet, but I think bypassing L1 does not make much sense if the data is already in L1.
VC2005 produces really good assembly and nice scheduling - in the mentioned routine with the Kogge-Stone fill stuff, 15 of the 16 available xmm registers are used. The small drawback with VC2005 is that there are only six xmm scratch registers, xmm0...xmm5 - all the others are callee-saved and must be preserved - even if you don't use doubles/floats at all in a recursive search.
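For readers who want the explicit-intrinsics version of the 16-byte slot access discussed above, here is a minimal sketch using SSE2 (the struct layout is illustrative). Note that even an aligned 128-bit SSE load or store is not architecturally guaranteed to be atomic on x86, which is why the lockless xor verification remains useful as a safety net:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

typedef struct { uint64_t key; uint64_t data; } hash_entry;  /* 16 bytes */

/* Copy one 16-byte entry with a single XMM load/store pair.
   Both pointers must be 16-byte aligned (movdqa faults otherwise). */
void entry_copy(hash_entry *dst, const hash_entry *src) {
    __m128i v = _mm_load_si128((const __m128i *)src);  /* aligned 128-bit load */
    _mm_store_si128((__m128i *)dst, v);                /* aligned 128-bit store */
}
```

With MSVC or GCC at -O2 this compiles to the movdqa pair Gerd describes.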
Martin Fierz

Re: multithreading questions

Post by Martin Fierz »

aloha!

Thanks for all your answers. I looked up Bob's lockless hashing paper, and also the joint paper with Anthony Cozzie where the effect of hash errors on the search is described. From this I gather that lockless hashing is cool, but not strictly necessary, as the hashtable can be used without locking at all. However, it would appear to me that lockless hashing costs nearly nothing and would therefore be the method of choice.

Of course, I have some more questions :-)

1) My checkers program uses 64-bit hash records. If I understand Bob's paper right, then on a 64-bit system an aligned hashtable read or write would be "atomic", i.e. on such systems I wouldn't have to care about hash locks at all?

2) There is another paper on Bob's site on DTS, while Tord seems to be using YBW in Viper. Is there any reason for choosing one or the other, in your experience? In particular, which is easier to implement?

3) In checkers, we have huge endgame databases which are very important for playing strength (very different from chess). The endgame databases are too big to fit in the computer's RAM, so during the search we load and unload blocks of the database into a database cache. I am using 1K-sized blocks. Obviously, I need to keep track of this cache somehow: every block has a unique number assigned to it, and I have an array which tells me which slot of the database cache is holding which block. Whenever a block is loaded, I need to update that. I also have a doubly-linked list to implement an LRU (least recently used) caching scheme: when a block has to be loaded, I can look up which of my N blocks has not been used for the longest time, and throw that one out. To keep track of this list, I also need to update it every time an in-memory block is used. So I have:

dblookup with block in memory:
- find out in which block the current position is
- find out where the block is in memory
- look up the current position in the block
- move this block to the front of the LRU-list

dblookup with block on disk:
- find out in which block the current position is
- find out that the block is not in memory
- find out which block has not been used for the longest time
- replace that block with the one I want to use
- update the index array which tells me which block is where
- move the block to the front of the LRU-list

Compared to hashing, this seems like a much bigger problem to make thread-safe... My experience with locks (EnterCriticalSection, LeaveCriticalSection) has been very bad, i.e. they slow down the search a lot. Is there anything else I can do about this database lookup? In particular, how are you handling your database lookups in chess?
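For reference, the two lookup paths above can be sketched as follows: an index array mapping block number to cache slot, plus a doubly-linked LRU list over the slots. All sizes and names are illustrative, the actual disk read is left as a stub, and this sketch is single-threaded (the locking question remains open):

```c
#define NUM_BLOCKS 1024   /* total DB blocks (illustrative) */
#define NUM_SLOTS  8      /* resident cache slots (illustrative) */
#define BLOCK_SIZE 1024   /* the 1K blocks mentioned above */

typedef struct {
    int block;            /* which DB block this slot holds, -1 if empty */
    int prev, next;       /* LRU list links (slot indices, -1 = none) */
    unsigned char data[BLOCK_SIZE];
} cache_slot;

typedef struct {
    int slot_of_block[NUM_BLOCKS];  /* -1 if block not resident */
    cache_slot slots[NUM_SLOTS];
    int head, tail;                 /* head = most recently used */
} block_cache;

void cache_init(block_cache *c) {
    for (int i = 0; i < NUM_BLOCKS; i++) c->slot_of_block[i] = -1;
    for (int i = 0; i < NUM_SLOTS; i++) {
        c->slots[i].block = -1;
        c->slots[i].prev = i - 1;
        c->slots[i].next = (i + 1 < NUM_SLOTS) ? i + 1 : -1;
    }
    c->head = 0;
    c->tail = NUM_SLOTS - 1;
}

static void unlink_slot(block_cache *c, int s) {
    int p = c->slots[s].prev, n = c->slots[s].next;
    if (p >= 0) c->slots[p].next = n; else c->head = n;
    if (n >= 0) c->slots[n].prev = p; else c->tail = p;
}

static void push_front(block_cache *c, int s) {
    c->slots[s].prev = -1;
    c->slots[s].next = c->head;
    if (c->head >= 0) c->slots[c->head].prev = s;
    c->head = s;
    if (c->tail < 0) c->tail = s;
}

/* Returns the slot holding `block`, loading it over the LRU victim
   if necessary. */
int cache_lookup(block_cache *c, int block) {
    int s = c->slot_of_block[block];
    if (s < 0) {                              /* miss: evict LRU slot */
        s = c->tail;
        if (c->slots[s].block >= 0)
            c->slot_of_block[c->slots[s].block] = -1;
        c->slots[s].block = block;
        c->slot_of_block[block] = s;
        /* ...read the 1K block from disk into c->slots[s].data here... */
    }
    unlink_slot(c, s);
    push_front(c, s);                         /* mark most recently used */
    return s;
}
```

Everything here is O(1) per lookup, so the critical section itself is short; the expensive part is only the disk read on a miss.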

cheers
martin
Harald
Posts: 318
Joined: Thu Mar 09, 2006 1:07 am

Re: multithreading questions

Post by Harald »

Martin Fierz wrote:3) In checkers, we have huge endgame databases which are very important for playing strength (very different from chess). The endgame databases are too big to fit in the computer's RAM, so during the search we load and unload blocks of the database into a database cache. I am using 1K-sized blocks. Obviously, I need to keep track of this cache somehow: every block has a unique number assigned to it, and I have an array which tells me which slot of the database cache is holding which block. Whenever a block is loaded, I need to update that. I also have a doubly-linked list to implement an LRU (least recently used) caching scheme: when a block has to be loaded, I can look up which of my N blocks has not been used for the longest time, and throw that one out. To keep track of this list, I also need to update it every time an in-memory block is used. So I have:

dblookup with block in memory:
- find out in which block the current position is
- find out where the block is in memory
- look up the current position in the block
- move this block to the front of the LRU-list

dblookup with block on disk
- find out in which block the current position is
- find out that the block is not in memory
- find out which block has not been used for the longest time
- replace that block with the one I want to use
- update the index array which tells me which block is where
- move the block to the front of the LRU-list
My answer is not related to threading and locking, but I want you to
compare your solution to the one I used when I had a similar problem.

- Each data block can be identified with an ID. Its ID can be transformed
to index numbers of one (or more) arrays.

- The indexed array(s) hold a pointer to each data block in memory
(or NULL), its size, and a pointer to its position in the list.
The operations of the data loader always start with these array(s).

- An LRU list (doubly linked) contains the data IDs. I insert/move entries of
used data to the front and kick out the LRU data at the back if needed.

- If the data blocks do not all have the same size, I do not use a memory
array of slots but use malloc/free or new/delete instead. I just track
the sum of all used bytes and compare it to a fixed maximum. If the
memory is 'full', I throw away LRU data until the free space is big enough
for the new data.

- The operations are nearly the same as the ones you described.

Harald
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: multithreading questions

Post by hgm »

Well, busy waiting is not an option if you might be waiting for disk access. This seems a good application for the 'virtual thread' scheme I described above, with many more threads than CPUs, as each search thread is likely to spend the largest fraction of its time waiting for disk. The OS would have to support asynchronous I/O, though, so that the CPU could place the read request, and then switch to another thread.
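The asynchronous read hgm describes can be sketched with POSIX AIO (assumed available; link with -lrt on older Linux systems). The function and file names are illustrative. In a real engine the marked spot is where the searcher would switch to another virtual thread instead of polling:

```c
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Place an asynchronous read request, then collect the result.
   Returns the number of bytes read, or -1 on error. */
ssize_t async_read_block(const char *path, off_t offset,
                         void *buf, size_t len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_offset = offset;
    cb.aio_buf    = buf;
    cb.aio_nbytes = len;

    if (aio_read(&cb) != 0) { close(fd); return -1; }

    /* ...here a search thread would yield to another virtual thread
       instead of spinning... */
    while (aio_error(&cb) == EINPROGRESS)
        ;

    ssize_t got = aio_return(&cb);
    close(fd);
    return got;
}
```

The key point is that aio_read returns immediately after queueing the request, so the CPU is free to do other work until the disk catches up.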
Tord Romstad
Posts: 1808
Joined: Wed Mar 08, 2006 9:19 pm
Location: Oslo, Norway

Re: multithreading questions

Post by Tord Romstad »

Martin Fierz wrote:2) there is another paper on bob's site on DTS, while tord seems to be using YBW in viper. is there any reason for choosing one or the other in your experience? in particular, which is easiest to implement?
DTS is the better algorithm, but the difference seems to be very small on computers with fewer than 8 CPUs. YBW is a lot easier to implement; so much easier that even Bob himself has chosen to use YBW rather than DTS in Crafty.

Tord