Laziest Lazy SMP

ianm · Post by **ianm** » Sat Aug 15, 2020 4:02 am

I thought I’d have a go at adding Lazy SMP to Plankton. I did the easiest, nastiest, quickest implementation I could. Rightly or wrongly, but it seems to work pretty well. He’s what I did:

Made all the variables thread local (except the hash table)

Created a “global” variable for things like the board, the move history list, etc

Created a “helper_thread()” function which is essentially the same as the search function with iterative deepening logic. Before ID, it initialises the thread local variables from the global variables, then does a search to a depth based on a function of the thread number which is passed as a parameter – so different threads search to slightly different depths.

Modified the main search thread to trigger the “helper” threads. The main thread sets a global flag so that the helper threads know if they are not the main thread. Only the main thread updates global variables, checks for exit time and reports progress. The only synced object is the number of nodes searched at the end of an ID loop.

Hash table updates were modified to use the xor trick by Bob and Tim.

I looked at the cost of using TLS and it seems pretty minimal if you have a statically built project, ie, no dynamically linked functions. Having looked at the assembly code I see nothing more than a standard dereference of a pointer var.

Results against TSCP:

1 thread on i7-6700 3.40GHz (4 cores)

Code: Select all

-----------------Plankton64_03-----------------
Plankton64_03 - TSCP181c : 84.0/100 75-7-18 (=11=111=101=11=1110=1011111111==0=11111110=11=11=1111=1111=1111111111110111110111==1111=1=1111111111)  84%  +288
-----------------TSCP181c-----------------
TSCP181c - Plankton64_03 : 16.0/100 7-75-18 (=00=000=010=00=0001=0100000000==1=00000001=00=00=0000=0000=0000000000001000001000==0000=0=0000000000)  16%  -288

4 threads on i7-6700 3.40GHz (4 cores)

Code: Select all

-----------------Plankton64mt_03-----------------
Plankton64mt_03 - TSCP181c : 89.5/100 81-2-17 (11=1=11==111111=1=111=111111=110111111=11=111=11111111=1111111111111111110=1=11111=11=11111=11111111)  90%  +382
-----------------TSCP181c-----------------
TSCP181c - Plankton64mt_03 : 10.5/100 2-81-17 (00=0=00==000000=0=000=000000=001000000=00=000=00000000=0000000000000000001=0=00000=00=00000=00000000)  11%  -363

I don’t know how significant this is but there is obviously a difference and it was fun to do!. TC was 3s per move.

Cheers, Ian

JVMerlino · Post by **JVMerlino** » Sat Aug 15, 2020 7:02 pm

ianm wrote: ↑Sat Aug 15, 2020 4:02 am Results against TSCP:

1 thread on i7-6700 3.40GHz (4 cores)

Code: Select all

-----------------Plankton64_03-----------------
Plankton64_03 - TSCP181c : 84.0/100 75-7-18 (=11=111=101=11=1110=1011111111==0=11111110=11=11=1111=1111=1111111111110111110111==1111=1=1111111111)  84%  +288
-----------------TSCP181c-----------------
TSCP181c - Plankton64_03 : 16.0/100 7-75-18 (=00=000=010=00=0001=0100000000==1=00000001=00=00=0000=0000=0000000000001000001000==0000=0=0000000000)  16%  -288

4 threads on i7-6700 3.40GHz (4 cores)

Code: Select all

-----------------Plankton64mt_03-----------------
Plankton64mt_03 - TSCP181c : 89.5/100 81-2-17 (11=1=11==111111=1=111=111111=110111111=11=111=11111111=1111111111111111110=1=11111=11=11111=11111111)  90%  +382
-----------------TSCP181c-----------------
TSCP181c - Plankton64mt_03 : 10.5/100 2-81-17 (00=0=00==000000=0=000=000000=001000000=00=000=00000000=0000000000000000001=0=00000=00=00000=00000000)  11%  -363

I don’t know how significant this is but there is obviously a difference and it was fun to do!. TC was 3s per move.

Cheers, Ian

Nice job. Although, well, the error bar is massive because it's only 100 games. But, forgetting that (although you shouldn't), going from 84 to 89.5 winning percentage is a jump of 84 elo.

I also made a VERY lazy Lazy SMP implementation based on an idea from HG Muller. I simply created multiple engine processes (not threads), and created a shared hash table for them. Process 0 is the master, and is the one that "acts normally" and eventually determines the best move, PV, etc. All of the other processes are just put in analysis mode and fill up the hash table, with half of them searching one or two plies deeper, depending on how "far away" they are from Process 0.

According to CCRL, it is 88 elo stronger at 4 CPU than at 1 CPU. I'm fine with that.

jm

silentshark · Post by **silentshark** » Sat Aug 15, 2020 7:13 pm

JVMerlino wrote: ↑Sat Aug 15, 2020 7:02 pm
ianm wrote: ↑Sat Aug 15, 2020 4:02 am Results against TSCP:

1 thread on i7-6700 3.40GHz (4 cores)
Code: Select all
-----------------Plankton64_03-----------------
Plankton64_03 - TSCP181c : 84.0/100 75-7-18 (=11=111=101=11=1110=1011111111==0=11111110=11=11=1111=1111=1111111111110111110111==1111=1=1111111111)  84%  +288
-----------------TSCP181c-----------------
TSCP181c - Plankton64_03 : 16.0/100 7-75-18 (=00=000=010=00=0001=0100000000==1=00000001=00=00=0000=0000=0000000000001000001000==0000=0=0000000000)  16%  -288
4 threads on i7-6700 3.40GHz (4 cores)
Code: Select all
-----------------Plankton64mt_03-----------------
Plankton64mt_03 - TSCP181c : 89.5/100 81-2-17 (11=1=11==111111=1=111=111111=110111111=11=111=11111111=1111111111111111110=1=11111=11=11111=11111111)  90%  +382
-----------------TSCP181c-----------------
TSCP181c - Plankton64mt_03 : 10.5/100 2-81-17 (00=0=00==000000=0=000=000000=001000000=00=000=00000000=0000000000000000001=0=00000=00=00000=00000000)  11%  -363
I don’t know how significant this is but there is obviously a difference and it was fun to do!. TC was 3s per move.

Cheers, Ian
Nice job. Although, well, the error bar is massive because it's only 100 games. But, forgetting that (although you shouldn't), going from 84 to 89.5 winning percentage is a jump of 84 elo.

I also made a VERY lazy Lazy SMP implementation based on an idea from HG Muller. I simply created multiple engine processes (not threads), and created a shared hash table for them. Process 0 is the master, and is the one that "acts normally" and eventually determines the best move, PV, etc. All of the other processes are just put in analysis mode and fill up the hash table, with half of them searching one or two plies deeper, depending on how "far away" they are from Process 0.

According to CCRL, it is 88 elo stronger at 4 CPU than at 1 CPU. I'm fine with that.

jm

With the nasty legacy code I have (lots of global variables etc.), the multiple process approach might be the easiest thing for me to use.

So, here's my lazy question, what's the easiest way to create multiple processes and share memory between them, in the world of Windows?

JVMerlino · Post by **JVMerlino** » Sat Aug 15, 2020 7:28 pm

silentshark wrote: ↑Sat Aug 15, 2020 7:13 pm With the nasty legacy code I have (lots of global variables etc.), the multiple process approach might be the easiest thing for me to use.
So, here's my lazy question, what's the easiest way to create multiple processes and share memory between them, in the world of Windows?

And that's exactly the reason I did it - tons of global variables due to a first-time engine writer.

It's easy to create background processes in Windows. Here's my code:

Code: Select all

void StartProcesses(int nCPUs)
{
	if ((nCPUs > 1) && !bSlave)
	{
		int	n;
		char	commandline[256];

		// create slave processes
		for (n = 0; n < nCPUs - 1; n++)
		{
			ZeroMemory(&si, sizeof(STARTUPINFO));
			si.cb = sizeof(si);
			ZeroMemory(&pi[n], sizeof(pi[n]));

			sprintf(commandline, "\"%s\" slave sharedmem=%s sharedhash=%s numslave=%d", szProgName, szSharedMemName, szSharedHashName, n);
			printf("Launching slave process %d with '%s'\n", n, commandline);

			// Start the child process. 
			if( !CreateProcess( NULL,   // No module name (use command line)
				commandline,
				NULL,           // Process handle not inheritable
				NULL,           // Thread handle not inheritable
				FALSE,          // Set handle inheritance to FALSE
				CREATE_NO_WINDOW,              // No creation flags, no console window
				NULL,           // Use parent's environment block
				NULL,           // Use parent's starting directory 
				&si,              // Pointer to STARTUPINFO structure
				&pi[n] )        // Pointer to PROCESS_INFORMATION structure
			) 
			{
				printf( "CreateProcess failed (%d).\n", GetLastError() );
			}
		}
	}
}

Creating the shared memory almost as easy. There are plenty of code examples you can find, but here's mine. You probably shouldn't use it because I suspect it's crap. But, as I said, at least it worked.

Code: Select all

#if USE_SMP
    typedef  union  hilo {
        unsigned long a[2];
        unsigned long long b;
    } hilo;

    if (!bSlave)	// master process
    {
		// create shared memory for slave processes (even if they're aren't any, because of the "cores" command)
		time_t	t;
		hilo h={0};
                h.b = dwHashSize * sizeof (HASH_ENTRY) + dwPawnHashSize * sizeof (PAWN_HASH_ENTRY) + dwEvalHashSize * sizeof (EVAL_HASH_ENTRY);

		sprintf(szSharedHashName, "MSH-%lld", (long long) time(&t));
		sprintf(szSharedMemName, "MSM-%lld", (long long) time(&t));
		hSharedHash = CreateFileMapping(
							INVALID_HANDLE_VALUE,    // use paging file
							NULL,                    // default security
							PAGE_READWRITE,          // read/write access
							h.a[1],   // maximum object size (high-order DWORD)
							h.a[0],  // maximum object size (low-order DWORD)
							szSharedHashName);       // name of mapping object

		if (hSharedHash != NULL)
		{
			HashTable = (HASH_ENTRY *)MapViewOfFile(hSharedHash,   // handle to map object
													FILE_MAP_ALL_ACCESS, // read/write permission
													0,
													0,
													0);

			PawnHashTable = (PAWN_HASH_ENTRY *) ((char *) HashTable + dwHashSize * sizeof (HASH_ENTRY));
			EvalHashTable = (EVAL_HASH_ENTRY *) ((char *) PawnHashTable +  dwPawnHashSize * sizeof (PAWN_HASH_ENTRY));

			if (HashTable)
				ClearHash();
		}

		hSharedMem = CreateFileMapping(
							INVALID_HANDLE_VALUE,    // use paging file
							NULL,                    // default security
							PAGE_READWRITE,          // read/write access
							0,                       // maximum object size (high-order DWORD)
							sizeof(SHARED_MEM),     // maximum object size (low-order DWORD)
							szSharedMemName);       // name of mapping object

		if (hSharedMem != NULL)
		{
			smSharedMem = (PSHARED_MEM) MapViewOfFile(hSharedMem,   // handle to map object
							FILE_MAP_ALL_ACCESS, // read/write permission
							0,
							0,
							sizeof(SHARED_MEM));
		}
    }
    else if (bSlave)	// all other processes
    {
        hSharedHash = OpenFileMapping(
                          FILE_MAP_ALL_ACCESS,   // read/write access
                          FALSE,                 // do not inherit the name
                          szSharedHashName);     // name of mapping object

        if (hSharedHash != NULL)
        {
            HashTable = (HASH_ENTRY *)MapViewOfFile(hSharedHash, // handle to map object
                                                    FILE_MAP_ALL_ACCESS,  // read/write permission
                                                    0,
                                                    0,
                                                    dwHashSize * sizeof (HASH_ENTRY) + dwPawnHashSize * sizeof (PAWN_HASH_ENTRY) + dwEvalHashSize * sizeof (EVAL_HASH_ENTRY));

            PawnHashTable = (PAWN_HASH_ENTRY *) ((char *) HashTable + dwHashSize * sizeof (HASH_ENTRY));
            EvalHashTable = (EVAL_HASH_ENTRY *) ((char *) PawnHashTable  +  dwPawnHashSize * sizeof (PAWN_HASH_ENTRY));
        }

        hSharedMem = OpenFileMapping(
                         FILE_MAP_ALL_ACCESS,   // read/write access
                         FALSE,                 // do not inherit the name
                         szSharedMemName);     // name of mapping object

        if (hSharedMem != NULL)
        {
            smSharedMem = (PSHARED_MEM)MapViewOfFile(hSharedMem, // handle to map object
                          FILE_MAP_ALL_ACCESS,  // read/write permission
                          0,
                          0,
                          sizeof(SHARED_MEM));

			ZeroMemory(&smSharedMem->sdSlaveData[nSlaveNum], sizeof(SLAVE_DATA));
        }
    }

Joost Buijs · Post by **Joost Buijs** » Sat Aug 15, 2020 7:59 pm

silentshark wrote: ↑Sat Aug 15, 2020 7:13 pm With the nasty legacy code I have (lots of global variables etc.), the multiple process approach might be the easiest thing for me to use.

So, here's my lazy question, what's the easiest way to create multiple processes and share memory between them, in the world of Windows?

IMHO it is easier to use threads than processes. Just put all the globals into a struct and create an array of structs indexed by thread number, you only have to make sure that the different structs don't fall into the same cache-line by aligning them at 64 byte to avoid false sharing. Adding a dummy array of sufficient size works too.

In my implementation al the search procedures are struct/class member functions.

What works best for me is an aspiration search for each thread, all threads search the PV-move in parallel, the master (thread 0) always searches each move at each depth, the slave threads search the remaining moves (non PV moves) only if there is no other thread searching this move at the same or higher depth yet. If a slave thread skips all moves it iterates to the next depth. At the end I scan all the results from the different threads to get the a move with the best combination of score and depth.

It behaves differently than my old YBWC search, sometimes time to depth is lower, sometimes higher, but I have the impression that it works a lot better in the endgame when the number of available moves is low. It scales very well on my 32 core 3970x too.

No4b · Post by **No4b** » Sun Aug 16, 2020 1:17 am

ianm wrote: ↑Sat Aug 15, 2020 4:02 am I thought I’d have a go at adding Lazy SMP to Plankton. I did the easiest, nastiest, quickest implementation I could. Rightly or wrongly, but it seems to work pretty well. He’s what I did:

Made all the variables thread local (except the hash table)

Created a “global” variable for things like the board, the move history list, etc

Created a “helper_thread()” function which is essentially the same as the search function with iterative deepening logic. Before ID, it initialises the thread local variables from the global variables, then does a search to a depth based on a function of the thread number which is passed as a parameter – so different threads search to slightly different depths.

Modified the main search thread to trigger the “helper” threads. The main thread sets a global flag so that the helper threads know if they are not the main thread. Only the main thread updates global variables, checks for exit time and reports progress. The only synced object is the number of nodes searched at the end of an ID loop.

Hash table updates were modified to use the xor trick by Bob and Tim.
I looked at the cost of using TLS and it seems pretty minimal if you have a statically built project, ie, no dynamically linked functions. Having looked at the assembly code I see nothing more than a standard dereference of a pointer var.

Results against TSCP:

1 thread on i7-6700 3.40GHz (4 cores)
Code: Select all
-----------------Plankton64_03-----------------
Plankton64_03 - TSCP181c : 84.0/100 75-7-18 (=11=111=101=11=1110=1011111111==0=11111110=11=11=1111=1111=1111111111110111110111==1111=1=1111111111)  84%  +288
-----------------TSCP181c-----------------
TSCP181c - Plankton64_03 : 16.0/100 7-75-18 (=00=000=010=00=0001=0100000000==1=00000001=00=00=0000=0000=0000000000001000001000==0000=0=0000000000)  16%  -288
4 threads on i7-6700 3.40GHz (4 cores)
Code: Select all
-----------------Plankton64mt_03-----------------
Plankton64mt_03 - TSCP181c : 89.5/100 81-2-17 (11=1=11==111111=1=111=111111=110111111=11=111=11111111=1111111111111111110=1=11111=11=11111=11111111)  90%  +382
-----------------TSCP181c-----------------
TSCP181c - Plankton64mt_03 : 10.5/100 2-81-17 (00=0=00==000000=0=000=000000=001000000=00=000=00000000=0000000000000000001=0=00000=00=00000=00000000)  11%  -363
I don’t know how significant this is but there is obviously a difference and it was fun to do!. TC was 3s per move.

Cheers, Ian

I think you should maybe try and play Plankton with different (stronger) opponent.
Differences could be even more considerable, because Plankton seems to steamroll TSCP even without Lazy SMP.

Great job anyway!

Dann Corbit · Post by **Dann Corbit** » Sun Aug 16, 2020 2:39 am

silentshark wrote: ↑Sat Aug 15, 2020 7:13 pm So, here's my lazy question, what's the easiest way to create multiple processes and share memory between them, in the world of Windows?

Use CreateProcess.
You can almost simulate a fork with it.

JVMerlino · Post by **JVMerlino** » Sun Aug 16, 2020 2:51 am

Dann Corbit wrote: ↑Sun Aug 16, 2020 2:39 am
silentshark wrote: ↑Sat Aug 15, 2020 7:13 pm So, here's my lazy question, what's the easiest way to create multiple processes and share memory between them, in the world of Windows?
Use CreateProcess.
You can almost simulate a fork with it.

Yep, that's how I did it in my code sample above.

ianm · Post by **ianm** » Sun Aug 16, 2020 5:52 am

No4b wrote: ↑Sun Aug 16, 2020 1:17 am
I think you should maybe try and play Plankton with different (stronger) opponent.
Differences could be even more considerable, because Plankton seems to steamroll TSCP even without Lazy SMP.

Great job anyway!

Yeah, I tried some of the default engines that came with Arena and Plankton got eaten alive. I've been testing against TSCP from the beginning with my goal being if I write a program that can occasionally win against TSCP then job done. As I added features Plankton started winning more frequently. At its core it's a very simple low memory program because I wanted to be able to port it to microcontrollers with only about 8k RAM. I just got hooked and wanted to understand how the big guns searched so deeply so quickly - still don't really understand how they manage to do that and find winning lines of play!

ianm · Post by **ianm** » Sun Aug 16, 2020 8:25 am

A followup question, does it help to increase the number of threads to the number of hyper-threaded threads available on the CPU (eg, say 8 threads on a 4 core CPU) or just use the same number of threads as there are physical cores? Yeah, I should experiment but just wondering what others experiences are.

Laziest Lazy SMP

Laziest Lazy SMP

Re: Laziest Lazy SMP

Re: Laziest Lazy SMP

Re: Laziest Lazy SMP

Re: Laziest Lazy SMP

Re: Laziest Lazy SMP

Re: Laziest Lazy SMP

Re: Laziest Lazy SMP

Re: Laziest Lazy SMP

Re: Laziest Lazy SMP