1. General Observation:
The performance difference between binding (using taskset to pin instances to NUMA nodes) and no binding (letting the system schedule the instances) has changed dramatically over time: initially it was around 10%, but in your results it is now as high as 200%.
A gap this large points to an issue with the NUMA-aware configuration or implementation in Stockfish, particularly in how instances behave when they are not explicitly bound.
2. Performance Breakdown:
Looking at the performance data from the two systems you mentioned:
On the first system (presumably with NUMA support):
As the split increases (i.e., as you run more instances), throughput without binding is consistently lower than with binding, and the gap widens sharply at higher splits (e.g., the 32-way split).
In the 16- and 32-way splits, the no-bind runs show significantly lower performance than the bound runs, which indicates that the system is not scheduling the processes optimally without explicit NUMA binding.
On the second system (with 4 NUMA sockets):
The performance gap between binding and no binding is more pronounced. For instance, in the 8-thread split, there’s a dramatic drop in performance when the processes are not pinned to specific NUMA domains. This suggests that the system is heavily relying on NUMA locality for good performance and the scheduler might not be handling NUMA nodes efficiently.
NUMA Binding and the Performance Drop:
The significant widening of this gap in recent versions, compared to older ones, suggests a possible regression or misconfiguration in how NUMA placement is handled by Stockfish. Let's dive into a few possible reasons:
Potential Causes for Performance Drop:
1. NUMA Memory Access Latency:
When you explicitly pin Stockfish instances to NUMA domains using taskset, memory access latency may be a key factor. If Stockfish's memory access patterns do not match the optimal locality for NUMA nodes, you might be seeing significant performance degradation.
Issue: It's possible that recent updates to Stockfish introduced memory access patterns or threading models that are less NUMA-friendly, causing remote NUMA node memory accesses, which are slower than local accesses.
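One way to quantify the local-vs-remote cost on your machines is to pin a single instance to one node's CPUs while forcing its memory either local or remote and comparing the bench nps. A rough sketch, assuming two nodes numbered 0 and 1 and a ./stockfish binary in the current directory:
    # CPUs and memory both on node 0: all accesses local
    numactl --cpunodebind=0 --membind=0 ./stockfish bench
    # CPUs on node 0 but memory forced onto node 1: all accesses remote
    numactl --cpunodebind=0 --membind=1 ./stockfish bench
The nps difference between the two runs gives an upper bound on what bad memory placement alone can cost on that hardware.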
2. Improper Thread-to-NUMA-Node Mapping:
Thread Pinning Configuration: In your setup, CPU cores are assigned manually with taskset. If the mapping between threads and NUMA domains is not done properly, there can be significant overhead from threads running on NUMA nodes far from the memory they access.
Issue: The way you're distributing threads across cores might not be optimal, especially if the CPU list for taskset does not match the actual NUMA domain layout. If you are using a system with multiple NUMA nodes, you should ensure that the threads are allocated to NUMA nodes in a way that minimizes cross-node traffic.
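Before trusting any taskset list, it is worth dumping the actual node-to-CPU mapping and building the lists from that. A sketch, where the 0-15 and 16-31 ranges are only placeholders for a 2-node box and ./stockfish is assumed:
    # Show which CPU ids belong to which NUMA node
    numactl --hardware
    lscpu | grep -i numa
    # One instance per node, using the CPU lists reported above
    taskset -c 0-15  ./stockfish bench 64 16 &
    taskset -c 16-31 ./stockfish bench 64 16 &
    wait
On some machines a node's CPU ids are not contiguous (e.g. 0-15 plus 64-79 with SMT), which is exactly the case where a hand-written range silently ends up spanning nodes.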
3. Increased Threading Overhead or Lock Contention:
Stockfish’s recent changes might involve additional synchronization mechanisms or more complex thread scheduling that could introduce overhead in NUMA systems. Thread contention, particularly when multiple threads attempt to access shared data (such as hash tables), could lead to synchronization overhead.
Issue: NUMA architectures are sensitive to lock contention when threads on different NUMA nodes try to synchronize or share data. If the latest version introduced new locks or less efficient memory access patterns, this could cause a significant performance hit.
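If you suspect cross-node sharing of hot cache lines (hash table entries, shared counters), perf's cache-to-cache analysis can show which addresses bounce between nodes. A sketch, assuming a kernel and perf build with c2c support and a bench already running:
    # Sample cache-line contention system-wide for 30 seconds
    perf c2c record -a -- sleep 30
    # Look for cache lines with remote HITM hits in the report
    perf c2c report --stdio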
4. NUMA-Aware Memory Allocation and Management:
If Stockfish is not correctly allocating memory based on the NUMA node that each thread is running on, it could result in memory being allocated on remote NUMA nodes, which can be much slower.
Issue: Stockfish’s NUMA optimizations might not be fully implemented or could have regressed, leading to poor memory allocation patterns, especially when running multiple instances across different NUMA domains.
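Whether this is actually happening is easy to check on a running instance, since per-node memory usage and the placement of each mapping are visible from user space. Here <pid> is a placeholder for the Stockfish process id:
    # Per-node memory breakdown for one process
    numastat -p <pid>
    # Per-mapping detail: the N0=/N1= counts show which node backs each region
    grep -E 'N[0-9]+=' /proc/<pid>/numa_maps | head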
5. Operating System Scheduler Behavior:
The operating system’s NUMA scheduler might be suboptimal in managing multiple instances of Stockfish. In systems with multiple NUMA nodes, the OS scheduler may not always place processes in the most efficient NUMA node, particularly when multiple threads are used.
Issue: If Stockfish’s NUMA pinning is not aligned with the OS's own NUMA scheduler, you could be seeing performance degradation, especially if the OS is not handling memory locality properly.
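In the unbound case you can watch what the scheduler actually does: which CPU each search thread currently sits on, and whether kernel automatic NUMA balancing is active underneath it (again, <pid> is a placeholder):
    # PSR column: the CPU each thread is currently running on
    ps -T -o tid,psr,pcpu,comm -p <pid>
    # 1 means automatic NUMA balancing is enabled
    cat /proc/sys/kernel/numa_balancing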
Steps to Diagnose and Fix:
1. Ensure Proper Thread-to-NUMA Node Mapping:
Validate the NUMA layout: Ensure that your taskset configuration actually matches the NUMA domain layout of your system. You can use tools like lscpu or numactl --hardware to verify the NUMA nodes and their CPU assignments.
Consider Auto-Pinning: Instead of manually setting taskset, you could try using automated NUMA-aware thread pinning tools or configuration options that come with Stockfish, if available.
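Recent Stockfish development builds do their own NUMA-aware binding; whether your build exposes it can be checked from the UCI option list (the NumaPolicy name below is what current dev builds use, so treat it as an assumption about your version):
    # List UCI options and look for a NUMA-related one
    printf 'uci\nquit\n' | ./stockfish | grep -i numa
If a NumaPolicy-style option is present, setting it (for example, setoption name NumaPolicy value auto) lets the engine place its threads itself instead of relying on hand-written taskset lists.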
2. Profiling and Memory Access Analysis:
Profiling: Use tools like perf to profile the memory access patterns of Stockfish. This will help identify if there’s a lot of remote memory access or if there are unnecessary cache misses due to poor NUMA locality.
Memory Affinity: Check if Stockfish is properly utilizing memory on the NUMA node closest to each thread. If not, investigate whether recent changes in the code introduced inefficient memory access patterns.
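The generic node-* events give a quick first read on remote traffic without a full profile. A sketch, assuming these events are supported on your CPUs and reusing the placeholder 0-15 CPU list from above:
    # Local vs. remote node accesses during one bound bench run
    perf stat -e node-loads,node-load-misses,node-stores,node-store-misses \
        taskset -c 0-15 ./stockfish bench 64 16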
3. Investigate Stockfish's NUMA-related Changes:
Compare Performance: If the problem only appeared in newer commits, try to revert Stockfish to an earlier version (one that worked well for NUMA) and compare the performance.
Commit History: Look through the recent Stockfish commits for any changes related to threading, NUMA, or memory allocation. You might find that the NUMA implementation was altered in a way that reduced efficiency.
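If an older commit is known to be fine, git bisect with a bound bench as the test makes this comparison systematic. A rough sketch, where <good-sha> and <bad-sha> are placeholders and the build flags are just one common choice:
    cd Stockfish/src
    git bisect start <bad-sha> <good-sha>
    # At each bisect step: rebuild, run a bound bench, note the nps, then mark
    make -j profile-build ARCH=x86-64-avx2 > /dev/null
    taskset -c 0-15 ./stockfish bench 64 16 2>&1 | grep 'Nodes/second'
    # git bisect good   (or: git bisect bad), then repeat until the culprit commit is found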
4. Optimize the NUMA Policy:
NUMA Balancing: On Linux, use numactl --hardware and numastat to see the node layout and per-node memory usage, and check whether automatic NUMA balancing is enabled via /proc/sys/kernel/numa_balancing. If necessary, configure the instances to prefer local memory access, or adjust the allocation policy (local allocation vs. interleaving), as sketched below.
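Concretely, the knobs worth experimenting with are the kernel's automatic balancing and an explicit allocation policy per instance. A sketch, assuming node 0 and root access for the sysctl:
    # Turn kernel automatic NUMA balancing off (0) or on (1)
    echo 0 | sudo tee /proc/sys/kernel/numa_balancing
    # Strictly local allocation for an instance pinned to node 0
    numactl --cpunodebind=0 --localalloc ./stockfish bench 64 16
    # Interleave memory across nodes if one instance spans several nodes
    numactl --interleave=all ./stockfish bench 64 32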
5. Test with Different Configurations:
Experiment with Different Thread Counts and Splits: Try adjusting the number of threads and splits in different configurations (e.g., testing 1, 2, 4, 8, 16, 32 splits) and observe how they affect performance on both bound and non-bound setups.
Disable NUMA and Compare: As a baseline, try disabling NUMA altogether and compare performance with the NUMA-aware setup. If there’s a large difference, it confirms that NUMA optimizations in Stockfish need to be improved or corrected.
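A small sweep script keeps the comparison honest because every configuration is run the same way. A rough sketch for a 32-core, 2-node machine, where CPUs 0-31 are assumed to map contiguously onto the nodes and the 2048 MB hash is an arbitrary choice:
    #!/bin/bash
    # Compare unbound vs. bound nps for several splits (instances x threads)
    BIN=./stockfish
    for threads in 32 16 8 4; do
        instances=$((32 / threads))
        echo "== $instances x $threads threads, unbound =="
        for i in $(seq 0 $((instances - 1))); do
            $BIN bench 2048 $threads 2>&1 | grep 'Nodes/second' &
        done
        wait
        echo "== $instances x $threads threads, bound =="
        for i in $(seq 0 $((instances - 1))); do
            first=$((i * threads)); last=$((first + threads - 1))
            taskset -c ${first}-${last} $BIN bench 2048 $threads 2>&1 | grep 'Nodes/second' &
        done
        wait
    done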
Conclusion:
The substantial gap that opens up between bound and unbound runs suggests a regression or inefficiency in how Stockfish handles NUMA nodes, memory allocation, or thread scheduling. Investigating recent code changes, profiling memory access patterns, and validating thread pinning will help identify the root cause. Adjusting the thread-to-core mapping or improving memory locality should then resolve the performance degradation.
NUMA Stockfish Speed Bug And How To Apply The Fix According To ChatGPT
Damir Desevac (original poster)
Jouni Uski
Re: NUMA Stockfish Speed Bug And How To Apply The Fix According To ChatGPT
I think there is no NUMA problem anymore. This was the latest issue: https://github.com/official-stockfish/S ... ssues/5635 . The TCEC system is slow because of an outdated Xeon processor.
Jouni
Conor Anstey
Re: NUMA Stockfish Speed Bug And How To Apply The Fix According To ChatGPT
there ain't no way you post a whole wall of AI slop to try and give advice to actual devs without even understanding the problem