NUMA & TT

flok · Post by **flok** » Sun Oct 24, 2021 11:08 am

Hi,

I read somewhere on this forum that on a NUMA system you need to divide the TT into as many parts as there are threads, and then memset each part from a thread with the appropriate CPU affinity.
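A minimal sketch of that per-thread initialisation, assuming Linux with glibc and g++, and assuming the kernel places a page on the node of the thread that first writes it (the usual "first touch" default). The names `pin_to_cpu` and `alloc_tt_first_touch` are hypothetical helpers, not taken from any engine:

```cpp
#include <pthread.h>
#include <sched.h>

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <thread>
#include <vector>

// Pin the calling thread to one logical CPU; returns false on failure.
static bool pin_to_cpu(int cpu)
{
	cpu_set_t set;
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

// Hypothetical helper: allocate `bytes` and let one thread per entry in
// `cpus` clear its own slice while pinned to that CPU. Under a first-touch
// policy the pages of slice i should then end up near cpus[i].
// Slices are not page-aligned here; a real version would probably round them.
uint8_t *alloc_tt_first_touch(size_t bytes, const std::vector<int> &cpus)
{
	assert(!cpus.empty());

	uint8_t *tt = static_cast<uint8_t *>(malloc(bytes));
	if (!tt)
		return nullptr;

	size_t n = cpus.size();
	size_t slice = bytes / n;
	std::vector<std::thread> workers;

	for (size_t i = 0; i < n; i++) {
		size_t begin = i * slice;
		// the last thread also takes the remainder
		size_t len = (i == n - 1) ? bytes - begin : slice;
		int cpu = cpus[i];

		workers.emplace_back([tt, begin, len, cpu] {
			pin_to_cpu(cpu); // best effort; a failure is ignored here
			memset(tt + begin, 0, len); // first write decides page placement
		});
	}

	for (auto &w : workers)
		w.join();

	return tt;
}
```

It would be called with the core numbers the search threads are pinned to, e.g. `alloc_tt_first_touch(tt_bytes, {0, 16})` for one thread per node on the machine below, and the result released with `free()`.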

Now, my Threadripper is a NUMA system:

Code: Select all

folkert@oensoens:~$ lscpu | grep NUMA
NUMA node(s):                    2
NUMA node0 CPU(s):               0-15
NUMA node1 CPU(s):               16-31

So I decided to check how much difference this per-thread-affinity allocation makes.

Both on cpu 0:

Code: Select all

folkert@oensoens:~$ ./a.out 0 0
cpu: 0
cpu: 0
9402

malloc + memset on one NUMA node and the check on the other:

Code: Select all

folkert@oensoens:~$ ./a.out 0 16
cpu: 0
cpu: 16
9390

That is -0.13%: hardly worth it, I would think?

Maybe my test-code is broken?

Code: Select all

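#include <inttypes.h>
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

// size of the test buffer in bytes
#define N (1024ll * 1024ll)

// duration of the measurement in nanoseconds (5 s)
#define DT 5000000000ll

// pin thread h to a single core and print the core we ended up on
void select_core(pthread_t h, int core)
{
	cpu_set_t cpuset;
	CPU_ZERO(&cpuset);
	CPU_SET(core, &cpuset);
	if (pthread_setaffinity_np(h, sizeof(cpu_set_t), &cpuset))
		printf("pthread_setaffinity_np failed\n");

	// pthread_yield() is deprecated in recent glibc; sched_yield() replaces it
	sched_yield();

	printf("cpu: %d\n", sched_getcpu());
}

// monotonic clock in nanoseconds, 0 on error
uint64_t get_ns()
{
	struct timespec tp { 0 };

	if (clock_gettime(CLOCK_MONOTONIC, &tp) == -1) {
		perror("clock_gettime");
		return 0;
	}

	return tp.tv_sec * 1000ll * 1000ll * 1000ll + tp.tv_nsec;
}

int main(int argc, char *argv[])
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <alloc-core> <check-core>\n", argv[0]);
		return 1;
	}

	int core1 = atoi(argv[1]);
	int core2 = atoi(argv[2]);

	// allocate and touch the buffer while running on core1
	select_core(pthread_self(), core1);

	uint8_t *p = (uint8_t *)malloc(N);
	if (!p) {
		perror("malloc");
		return 1;
	}
	memset(p, 0x01, N);

	// do the measurement while running on core2
	select_core(pthread_self(), core2);

	uint64_t n = 0;
	uint64_t dummy = 0;
	uint64_t start_ts = get_ns();

	// count how many full passes over the buffer fit in DT nanoseconds
	do {
		n++;

		for(int i=1; i<N; i++) {
			if (p[i])
				p[i - 1] = dummy++;
		}
	}
	while(get_ns() - start_ts <= DT);

	// n is a uint64_t, so print it with PRIu64 rather than %ld
	printf("%" PRIu64 "\n", n);

	free(p);

	return 0;
}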

Note: the odd complexity in the for-loop is to prevent caching. And yes, it makes quite a difference.
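One possible reason for the tiny difference, offered as a guess rather than a diagnosis: a 1 MiB buffer may simply stay in the CPU caches, and a sequential scan is easy for the hardware prefetcher. A common alternative is a dependent random pointer chase over a buffer much larger than the caches. The sketch below only builds and walks such a chain; `make_chain` and `chase` are hypothetical names, and the buffer size to use is an assumption that depends on the machine:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Build a random single-cycle permutation (Sattolo's algorithm):
// next[i] is the index to visit after i, and following the chain from
// any start visits every element exactly once before coming back.
std::vector<uint32_t> make_chain(size_t count, uint64_t seed)
{
	assert(count > 0);

	std::vector<uint32_t> next(count);
	std::iota(next.begin(), next.end(), 0u);

	std::mt19937_64 rng(seed);
	for (size_t i = count - 1; i > 0; i--) {
		// pick j strictly below i, which is what keeps it one single cycle
		std::uniform_int_distribution<size_t> pick(0, i - 1);
		std::swap(next[i], next[pick(rng)]);
	}

	return next;
}

// Follow the chain for `steps` hops, starting at index 0. Every load
// depends on the previous one, so the accesses cannot be overlapped.
uint32_t chase(const std::vector<uint32_t> &next, size_t steps)
{
	uint32_t pos = 0;
	for (size_t s = 0; s < steps; s++)
		pos = next[pos];
	return pos;
}
```

To use it in the test above, the chain would be built while pinned to the first core and `chase()` timed with `get_ns()` while pinned to the second; printing the value `chase()` returns keeps the compiler from dropping the loop.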

flok · Post by **flok** » Sun Oct 24, 2021 1:07 pm

This may work better:

Code: Select all

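#include <inttypes.h>
#include <numa.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

// size of the test buffer in bytes
#define N (1024ll * 1024ll)

// duration of the measurement in nanoseconds (5 s)
#define DT 5000000000ll

// monotonic clock in nanoseconds, 0 on error
uint64_t get_ns()
{
	struct timespec tp { 0 };

	if (clock_gettime(CLOCK_MONOTONIC, &tp) == -1) {
		perror("clock_gettime");
		return 0;
	}

	return tp.tv_sec * 1000ll * 1000ll * 1000ll + tp.tv_nsec;
}

int main(int argc, char *argv[])
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <alloc-node> <run-node>\n", argv[0]);
		return 1;
	}

	// all other libnuma calls are undefined if this returns -1
	if (numa_available() == -1) {
		fprintf(stderr, "NUMA is not available\n");
		return 1;
	}

	int node1 = atoi(argv[1]);
	int node2 = atoi(argv[2]);

	// allocate on node1 and touch every page
	uint8_t *p = (uint8_t *)numa_alloc_onnode(N, node1);
	if (!p) {
		fprintf(stderr, "numa_alloc_onnode failed\n");
		return 1;
	}
	memset(p, 0x01, N);

	// numa_set_preferred() only steers future allocations; to access the
	// buffer from node2 the thread itself has to run there
	if (numa_run_on_node(node2) == -1) {
		perror("numa_run_on_node");
		return 1;
	}

	uint64_t n = 0;
	uint64_t dummy = 0;
	uint64_t start_ts = get_ns();

	// count how many full passes over the buffer fit in DT nanoseconds
	do {
		n++;

		for(int i=1; i<N; i++) {
			if (p[i])
				p[i - 1] = dummy++;
		}
	}
	while(get_ns() - start_ts <= DT);

	// n is a uint64_t, so print it with PRIu64 rather than %ld
	printf("%" PRIu64 "\n", n);

	numa_free(p, N);

	return 0;
}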

Compile using:

Code: Select all

g++ -Ofast numalatency.cpp -lnuma

Joost Buijs · Post by **Joost Buijs** » Sun Oct 24, 2021 1:57 pm

By default Threadripper emulates a UMA architecture; if you want to use NUMA, you have to enable it explicitly in the BIOS. You can set the number of NUMA nodes per socket, e.g. 0, 1, 2, 4, etc.

Of course you can make your program NUMA-aware, but I don't think this is useful for the TT, because you have to address every location in the TT from each CCX.

At least I can't get it to work: NUMA is always a few percent slower than without.
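The point about every CCX addressing the whole TT can be illustrated with a small simulation. This is only a sketch under assumptions: the table is split into equal slices, the index is `hash % entries`, and `std::mt19937_64` stands in for a real position hash. `local_fraction` is a hypothetical name:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>

// Fraction of probes that land in the slice "owned" by my_slice when a
// table of `entries` slots is split into `slices` equal parts and the
// slot is chosen as hash % entries.
double local_fraction(size_t entries, size_t slices, size_t my_slice,
                      size_t probes, uint64_t seed)
{
	assert(slices > 0 && entries >= slices && my_slice < slices);
	assert(probes > 0);

	std::mt19937_64 rng(seed); // stand-in for position hashes
	size_t per_slice = entries / slices;
	size_t local = 0;

	for (size_t i = 0; i < probes; i++) {
		size_t index = rng() % entries;
		size_t slice = index / per_slice;
		if (slice >= slices)
			slice = slices - 1; // remainder belongs to the last slice
		if (slice == my_slice)
			local++;
	}

	return double(local) / double(probes);
}
```

With uniformly distributed hashes the result is close to 1 / slices, so with two nodes roughly half of all probes are remote no matter which node each slice was placed on.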

flok · Post by **flok** » Sun Oct 24, 2021 2:31 pm

Do you know whether the BIOS is intelligent about it? E.g. if you only have RAM in one memory channel, will it hide the option?

Joost Buijs · Post by **Joost Buijs** » Sun Oct 24, 2021 3:00 pm

I don't know because I never tried with RAM in only 1 memory channel.

In my BIOS the NUMA options are under: Advanced\AMD CBS\DF Common Options\Memory Addressing

flok · Post by **flok** » Sun Oct 24, 2021 3:17 pm

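Does your system report NUMA as available with the setting you use? numa_available() returns -1 when it is not.

Code: Select all

#include <numa.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
	// -1 means the NUMA API is not available on this system
	printf("%d\n", numa_available());
	return 0;
}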

Code: Select all

g++ test.cpp -lnuma && ./a.out

flok · Post by **flok** » Sun Oct 24, 2021 4:21 pm

Or maybe:

Code: Select all

numactl -H
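If neither libnuma nor numactl is installed, the node layout can also be read from sysfs on Linux. A minimal sketch, assuming the nodes are numbered contiguously from 0 under `/sys/devices/system/node`; `read_numa_nodes` is a hypothetical name:

```cpp
#include <cassert>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Return (node number, cpulist string) for node0, node1, ... and stop at
// the first node directory that cannot be read. An empty result means no
// NUMA information was found under `base`.
std::vector<std::pair<int, std::string>>
read_numa_nodes(const std::string &base = "/sys/devices/system/node")
{
	std::vector<std::pair<int, std::string>> nodes;

	for (int node = 0;; node++) {
		std::ifstream f(base + "/node" + std::to_string(node) + "/cpulist");
		std::string cpus;

		if (!f || !std::getline(f, cpus))
			break;

		nodes.emplace_back(node, cpus);
	}

	return nodes;
}
```

On the machine from the first post this would be expected to give two entries, matching the CPU ranges that lscpu reports.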

Joost Buijs · Post by **Joost Buijs** » Sun Oct 24, 2021 4:58 pm

I'm on Windows and cannot compare it one to one.

With default BIOS settings it appears to be 1 NUMA node with all memory available to it, basically the same as UMA.
My Threadripper has 4 CCX, I can adjust the number of NUMA nodes in the BIOS from 0 to 1, 2 and 4. For a single socket board 0 and 1 seem to be the same.

BTW: This forum drives me crazy. Each time I try to post a message I appear to be logged off; when I log on again my message is gone, and I can only get it back by hitting the back button in my browser a few times. I wonder when this will be fixed; as it is right now, this forum is pretty unusable. I also get CloudFlare 520 error messages, whatever that means.

diep · Post by **diep** » Tue Oct 26, 2021 6:26 pm

Heh Folkert, there is a big difference between a single chip that is NUMA internally and a two-node EPYC system.

Which of the two do you have?

For the first, follow Joost. For the second it matters quite a lot, but it all depends on how your algorithms work, the price of a lookup, how you do your lookups, and how many nps your engine gets.

diep · Post by **diep** » Tue Oct 26, 2021 6:31 pm

Joost Buijs wrote: ↑Sun Oct 24, 2021 4:58 pm I'm on Windows and cannot compare it one to one.

With default BIOS settings it appears to be 1 NUMA node with all memory available to it, basically the same as UMA.
My Threadripper has 4 CCX, I can adjust the number of NUMA nodes in the BIOS from 0 to 1, 2 and 4. For a single socket board 0 and 1 seem to be the same.

BTW: This forum makes me crazy, each time when I try to post a message I appear to be logged off, when I logon again my message is gone and can only get it back by hitting the back button in my browser a few times. I wonder when this will be fixed, as it is right now this forum is pretty unusable. I also get CloudFlare 520 error messages, whatever it means.
