NUMA & TT

flok
Posts: 611
Joined: Tue Jul 03, 2018 10:19 am
Full name: Folkert van Heusden

NUMA & TT

Post by flok »

Hi,

I read somewhere in this forum that on a NUMA system, you need to divide the TT into as many parts as there are threads and then memset each part with the appropriate CPU affinity.
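As a rough sketch of that idea (not taken from any engine): assuming Linux with glibc and g++, where `sched_setaffinity` is available, a thread can pin itself to each CPU in turn and clear one slice of the table. Under the kernel's default first-touch policy, the pages of a freshly allocated slice should then end up on the node of the CPU that wrote them. `pin_to_cpu` and `first_touch_clear` are made-up names for illustration.

```cpp
#include <sched.h>
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: pin the calling thread to a single CPU.
// Returns false if the CPU number is out of range or the kernel refuses.
static bool pin_to_cpu(int cpu)
{
	if (cpu < 0 || cpu >= CPU_SETSIZE)
		return false;

	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);

	// pid 0 means "the calling thread"
	return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Hypothetical helper: clear `bytes` bytes at `tt` in cpus.size() slices,
// writing slice i while pinned to cpus[i]. Returns how many of the pin
// requests succeeded; the memory is cleared either way.
static std::size_t first_touch_clear(std::uint8_t *tt, std::size_t bytes,
                                     const std::vector<int> &cpus)
{
	if (cpus.empty())
		return 0;

	// remember the original affinity so it can be restored afterwards
	cpu_set_t saved;
	const bool have_saved = sched_getaffinity(0, sizeof(saved), &saved) == 0;

	const std::size_t slice = (bytes + cpus.size() - 1) / cpus.size();
	std::size_t pinned = 0;

	for (std::size_t i = 0; i < cpus.size(); i++) {
		if (pin_to_cpu(cpus[i]))
			pinned++;

		const std::size_t begin = std::min(i * slice, bytes);
		const std::size_t end = std::min(begin + slice, bytes);
		std::memset(tt + begin, 0, end - begin);
	}

	if (have_saved)
		sched_setaffinity(0, sizeof(saved), &saved);

	return pinned;
}
```

On the machine above one might call `first_touch_clear(tt, size, {0, 16})` to put one half of the table on each node. A real engine would more likely let every search thread clear its own slice and would round the slices to page boundaries, since placement happens per page and only for pages that have not been touched before.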

Now my threadripper is a numa system:

Code: Select all

folkert@oensoens:~$ lscpu | grep NUMA
NUMA node(s):                    2
NUMA node0 CPU(s):               0-15
NUMA node1 CPU(s):               16-31
so I decided to check how much difference this per-thread-affinity allocation makes.

Both on cpu 0:

Code: Select all

folkert@oensoens:~$ ./a.out 0 0
cpu: 0
cpu: 0
9402
malloc + memset & check on different numa nodes:

Code: Select all

folkert@oensoens:~$ ./a.out 0 16
cpu: 0
cpu: 16
9390
That is −0.13%: hardly worth it, I would think?

Maybe my test-code is broken?

Code: Select all

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sched.h>
#include <time.h>

#define N (1024ll * 1024ll)

#define DT 5000000000ll

void select_core(pthread_t h, int core)
{
	cpu_set_t cpuset;
	CPU_ZERO(&cpuset);
	CPU_SET(core, &cpuset);
	if (pthread_setaffinity_np(h, sizeof(cpu_set_t), &cpuset))
		printf("pthread_setaffinity_np failed\n");

	sched_yield();	// pthread_yield() is deprecated

	printf("cpu: %d\n", sched_getcpu());
}

uint64_t get_ns()
{
	struct timespec tp { 0 };

	if (clock_gettime(CLOCK_MONOTONIC, &tp) == -1) {
		perror("clock_gettime");
		return 0;
	}

	return tp.tv_sec * 1000ll * 1000ll * 1000ll + tp.tv_nsec;
}

int main(int argc, char *argv[])
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <alloc-cpu> <run-cpu>\n", argv[0]);
		return 1;
	}

	int core1 = atoi(argv[1]);
	int core2 = atoi(argv[2]);

	select_core(pthread_self(), core1);

	uint8_t *p = (uint8_t *)malloc(N);
	memset(p, 0x01, N);

	select_core(pthread_self(), core2);

	uint64_t n = 0;
	uint64_t dummy = 0;
	uint64_t start_ts = get_ns();

	do {
		n++;

		for(int i=1; i<N; i++) {
			if (p[i])
				p[i - 1] = dummy++;
		}
	}
	while(get_ns() - start_ts <= DT);

	printf("%llu\n", (unsigned long long)n);

	free(p);

	return 0;
}
Note: the odd complexity in the for-loop is to prevent caching. And yes, it makes quite a difference :-)
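One possible reason the two runs look the same: a 1 MiB buffer is small enough to fit in the L3 cache of a typical Threadripper, and a sequential walk is friendly to the hardware prefetcher, whereas TT probes are effectively random. A sketch of a probe loop that is closer to TT behaviour, under that assumption, could look like this. `xorshift64` stands in for the hash-derived index and `probe_random` is a made-up name.

```cpp
#include <cstddef>
#include <cstdint>

// xorshift64: a small PRNG standing in for hash-key-derived TT indices.
// The state must never be zero.
static inline std::uint64_t xorshift64(std::uint64_t &s)
{
	s ^= s << 13;
	s ^= s >> 7;
	s ^= s << 17;
	return s;
}

// Hypothetical helper: read `probes` pseudo-random bytes from buf[0..bytes)
// and return their sum. `bytes` must be greater than zero.
static std::uint64_t probe_random(const std::uint8_t *buf, std::size_t bytes,
                                  std::size_t probes, std::uint64_t seed)
{
	std::uint64_t state = seed ? seed : 1;
	std::uint64_t sum = 0;

	for (std::size_t i = 0; i < probes; i++) {
		const std::uint8_t v = buf[xorshift64(state) % bytes];
		sum += v;

		// feed the loaded value back into the state, so the next index
		// cannot be computed before this load has completed
		state += v;
		if (state == 0)
			state = 1;
	}

	return sum;
}
```

Timing a fixed number of probes over a buffer much larger than the caches (hundreds of MiB, say), once from the node that holds the memory and once from the other node, should show the latency difference more clearly than the sequential loop.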
flok
Posts: 611
Joined: Tue Jul 03, 2018 10:19 am
Full name: Folkert van Heusden

Re: NUMA & TT

Post by flok »

This may work better:

Code: Select all

#include <numa.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N (1024ll * 1024ll)

#define DT 5000000000ll

uint64_t get_ns()
{
	struct timespec tp { 0 };

	if (clock_gettime(CLOCK_MONOTONIC, &tp) == -1) {
		perror("clock_gettime");
		return 0;
	}

	return tp.tv_sec * 1000ll * 1000ll * 1000ll + tp.tv_nsec;
}

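int main(int argc, char *argv[])
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <alloc-node> <run-node>\n", argv[0]);
		return 1;
	}

	if (numa_available() == -1) {
		fprintf(stderr, "NUMA is not available on this system\n");
		return 1;
	}

	int node1 = atoi(argv[1]);
	int node2 = atoi(argv[2]);

	// allocate the buffer on node1; the memset faults the pages in there
	uint8_t *p = (uint8_t *)numa_alloc_onnode(N, node1);
	if (!p) {
		perror("numa_alloc_onnode");
		return 1;
	}
	memset(p, 0x01, N);

	// numa_set_preferred() only influences where future allocations land;
	// to access the buffer from node2 the thread itself has to run there
	if (numa_run_on_node(node2) == -1)
		perror("numa_run_on_node");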

	uint64_t n = 0;
	uint64_t dummy = 0;
	uint64_t start_ts = get_ns();

	do {
		n++;

		for(int i=1; i<N; i++) {
			if (p[i])
				p[i - 1] = dummy++;
		}
	}
	while(get_ns() - start_ts <= DT);

	printf("%llu\n", (unsigned long long)n);

	numa_free(p, N);

	return 0;
}
Compile using:

Code: Select all

g++ -Ofast numalatency.cpp -lnuma
Joost Buijs
Posts: 1663
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: NUMA & TT

Post by Joost Buijs »

By default Threadripper emulates a UMA architecture; if you want to use NUMA, you have to enable it explicitly in the BIOS. You can set the number of NUMA nodes per socket, e.g. 0, 1, 2, 4, etc.

Of course you can make your program NUMA aware, but I don't think this is useful for the TT because you have to address every location in the TT from each CCX.

At least I can't get it to work: NUMA is always a few percent slower than without.
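The argument about addressing every location from each CCX can be made a bit more concrete with a back-of-the-envelope estimate. It assumes that the TT pages are spread evenly over $k$ nodes and that probes are uniformly random (as with a good hash key); $t_{\text{local}}$ and $t_{\text{remote}}$ are hypothetical latencies for a local and a remote access.

```latex
% Probability that a random probe hits memory on the probing thread's own node:
P(\text{local}) = \frac{1}{k}

% Expected latency of one probe:
E[t] = \frac{1}{k}\, t_{\text{local}} + \left(1 - \frac{1}{k}\right) t_{\text{remote}}

% For k = 2:
E[t] = \tfrac{1}{2}\, t_{\text{local}} + \tfrac{1}{2}\, t_{\text{remote}}
```

If the whole table sits on one node instead and the threads are spread evenly over the $k$ nodes, a fraction $1/k$ of the threads always probes locally and the rest always remotely, which gives the same average. Under these assumptions, where the TT pages live does not change the mean probe latency.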
flok
Posts: 611
Joined: Tue Jul 03, 2018 10:19 am
Full name: Folkert van Heusden

Re: NUMA & TT

Post by flok »

Joost Buijs wrote: Sun Oct 24, 2021 1:57 pm By default Threadripper emulates a UMA architecture, when you want to use NUMA you have to enable this in the BIOS explicitly. You can set the number of NUMA nodes per socket e.g. 0, 1, 2, 4 etc.

Of course you can make your program NUMA aware, but I don't think this is useful for the TT because you have to address every location in the TT from each CCX.

At least I don't get it working, NUMA is always a few percent slower than without.
Do you know if the BIOS is intelligent about it? E.g. if you only have RAM in one memory channel, will it hide the option?
Joost Buijs
Posts: 1663
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: NUMA & TT

Post by Joost Buijs »

flok wrote: Sun Oct 24, 2021 2:31 pm
Do you know if the bios is intelligent about it? E.g. if you have only ram in 1 memory channel, that it won't show the item?
I don't know because I never tried with RAM in only 1 memory channel.

In my BIOS the NUMA options are under: Advanced\AMD CBS\DF Common Options\Memory Addressing
flok
Posts: 611
Joined: Tue Jul 03, 2018 10:19 am
Full name: Folkert van Heusden

Re: NUMA & TT

Post by flok »

Joost Buijs wrote: Sun Oct 24, 2021 3:00 pm
I don't know because I never tried with RAM in only 1 memory channel.

In my BIOS the NUMA options are under: Advanced\AMD CBS\DF Common Options\Memory Adressing
Does your system give 1 with the setting you use?

Code: Select all

#include <numa.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
        printf("%d\n", numa_available());
        return 0;
}

Code: Select all

g++ test.cpp -lnuma && ./a.out
flok
Posts: 611
Joined: Tue Jul 03, 2018 10:19 am
Full name: Folkert van Heusden

Re: NUMA & TT

Post by flok »

flok wrote: Sun Oct 24, 2021 3:17 pm
Does your system give 1 with the setting you use?

Code: Select all

#include <numa.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
        printf("%d\n", numa_available());
        return 0;
}

Code: Select all

g++ test.cpp -lnuma && ./a.out
Or maybe:

Code: Select all

numactl -H
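Note that `numa_available()` only reports whether the NUMA API is usable, not how many nodes there are. On Linux, a way to get the node count without libnuma or numactl is to read the kernel's node list from sysfs. The sketch below assumes the file `/sys/devices/system/node/online` is present and contains a list such as `0-1`; `count_ids` and `online_numa_nodes` are made-up names.

```cpp
#include <cstdlib>
#include <fstream>
#include <string>

// Count the ids in a kernel list string such as "0", "0-1" or "0-3,8-11".
// Parsing stops at the first thing that does not look like a range.
static int count_ids(const std::string &list)
{
	int count = 0;
	const char *p = list.c_str();

	while (*p) {
		char *end = nullptr;
		const long first = std::strtol(p, &end, 10);
		if (end == p)
			break;	// no digits where a number was expected
		long last = first;
		p = end;

		if (*p == '-') {
			last = std::strtol(p + 1, &end, 10);
			if (end == p + 1)
				break;	// dangling '-'
			p = end;
		}

		if (last >= first)
			count += static_cast<int>(last - first + 1);

		if (*p != ',')
			break;
		p++;
	}

	return count;
}

// Returns the number of online NUMA nodes, or -1 if sysfs has no node list
// (for example on a kernel built without NUMA support).
static int online_numa_nodes()
{
	std::ifstream f("/sys/devices/system/node/online");
	std::string line;

	if (!f || !std::getline(f, line))
		return -1;

	return count_ids(line);
}
```

On the Threadripper from the first post, with two nodes reported by lscpu, `online_numa_nodes()` would be expected to return 2.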
Joost Buijs
Posts: 1663
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: NUMA & TT

Post by Joost Buijs »

I'm on Windows and cannot compare it one to one.

With default BIOS settings it appears to be 1 NUMA node with all memory available to it, basically the same as UMA.
My Threadripper has 4 CCX, I can adjust the number of NUMA nodes in the BIOS from 0 to 1, 2 and 4. For a single socket board 0 and 1 seem to be the same.

BTW: This forum drives me crazy. Each time I try to post a message I appear to be logged off; when I log on again my message is gone and I can only get it back by hitting the back button in my browser a few times. I wonder when this will be fixed; as it is right now, this forum is pretty unusable. I also get Cloudflare 520 error messages, whatever that means.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: NUMA & TT

Post by diep »

Heh Folkert - there is a big difference between a single chip that is NUMA internally and 2-node EPYC machines.

Which of the 2 do you have?

For the first, follow Joost. For the second it matters quite a lot - but it all depends on how your algorithms work versus the price of a lookup versus how you do your lookups versus how many nps your engine gets.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: NUMA & TT

Post by diep »

Joost Buijs wrote: Sun Oct 24, 2021 4:58 pm I'm on Windows and cannot compare it one to one.

With default BIOS settings it appears to be 1 NUMA node with all memory available to it, basically the same as UMA.
My Threadripper has 4 CCX, I can adjust the number of NUMA nodes in the BIOS from 0 to 1, 2 and 4. For a single socket board 0 and 1 seem to be the same.

BTW: This forum makes me crazy, each time when I try to post a message I appear to be logged off, when I logon again my message is gone and can only get it back by hitting the back button in my browser a few times. I wonder when this will be fixed, as it is right now this forum is pretty unusable. I also get CloudFlare 520 error messages, whatever it means.