Today I've added multi-threading to my Perft routine. (In Rust the compiler and borrow checker will hit you on the head for a while, but once you figure out the correct way of doing "Fearless concurrency in Rust", it becomes ridiculously easy and logical.) I just created a function that splits the initial move list of the position into one chunk per thread and then launches perft for each chunk. It doesn't use a transposition table yet, but I think the results are still interesting, because the scaling varies measurably depending on the position you're testing and the number of threads you use.
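To make the root-splitting idea concrete, here is a minimal sketch. It is not Rustic's actual code: `toy_moves` is a hypothetical stand-in for a real move generator (it just builds a small synthetic tree), and `perft_parallel` and its ceiling-division chunking are assumptions chosen for illustration.

```rust
use std::thread;

/// Hypothetical stand-in for a move generator: each "position" is a number
/// and has 2 to 4 children. A real engine would generate chess moves here.
fn toy_moves(state: u64) -> Vec<u64> {
    let n = 2 + state % 3;
    (0..n)
        .map(|i| state.wrapping_mul(5).wrapping_add(i + 1))
        .collect()
}

/// Plain single-threaded perft: counts the leaves at the given depth.
fn perft(state: u64, depth: u32) -> u64 {
    if depth == 0 {
        return 1;
    }
    toy_moves(state)
        .into_iter()
        .map(|m| perft(m, depth - 1))
        .sum()
}

/// Splits the root move list into at most `threads` chunks and runs
/// perft on each chunk in its own thread, then adds up the results.
fn perft_parallel(root: u64, depth: u32, threads: usize) -> u64 {
    if depth == 0 {
        return 1;
    }
    let moves = toy_moves(root);
    if moves.is_empty() {
        return 0;
    }
    // Ceiling division, so every root move ends up in exactly one chunk.
    let threads = threads.max(1);
    let chunk_size = (moves.len() + threads - 1) / threads;

    // Spawn all workers first; each one owns a copy of its chunk.
    let handles: Vec<thread::JoinHandle<u64>> = moves
        .chunks(chunk_size)
        .map(|chunk| {
            let chunk = chunk.to_vec();
            thread::spawn(move || {
                chunk.into_iter().map(|m| perft(m, depth - 1)).sum::<u64>()
            })
        })
        .collect();

    // Join the workers and sum their leaf counts.
    handles
        .into_iter()
        .map(|h| h.join().expect("perft worker panicked"))
        .sum()
}
```

Calling `perft_parallel(1, 6, 4)` should return the same count as `perft(1, 6)`. One property of this particular chunking: 20 root moves on 6 threads gives a chunk size of 4 and therefore only 5 chunks, so one thread would have nothing to do. Whether Rustic's splitting behaves like that is not something I can tell from the post.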
The test was run on an Intel Core i7-6700K: 4 physical cores plus 4 additional logical cores through Hyper-Threading (8 hardware threads in total). No bulk counting, hashing, or other tricks.
Before the implementation of multi-threading, Rustic's perft routine ran at roughly 40 to 41 million leaves per second (Mlps). In that case the perft function ran within the main thread. After the implementation, Perft always uses its own thread(s), even when running on only one thread. That cost about 2% of performance. (The Kiwipete position would normally complete perft 6 in 198.x to 200.x seconds when running single-threaded on the main thread; now it takes roughly 204 seconds when running single-threaded in its own thread.)
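As a cross-check of those timings, the Mlps figure is just leaves divided by seconds. The sketch below assumes the leaf count published on CPW for Kiwipete at depth 6 (8,031,647,685); that number and the helper name `mlps` are not from the post itself.

```rust
/// Leaf count for Kiwipete (CPW perft position 2) at depth 6,
/// taken from the published perft results, not from the post.
const KIWIPETE_PERFT_6: u64 = 8_031_647_685;

/// Converts a leaf count and a run time in seconds into
/// million leaves per second (Mlps).
fn mlps(leaves: u64, seconds: f64) -> f64 {
    leaves as f64 / seconds / 1_000_000.0
}
```

With these inputs, `mlps(KIWIPETE_PERFT_6, 204.0)` comes out at about 39.4 and `mlps(KIWIPETE_PERFT_6, 199.0)` at about 40.4, which lines up with the speeds quoted above.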
Startpos Perft depth 7:
Physical cores:
1T: 38.72 Mlps 1.00x
2T: 73.64 Mlps 1.90x
3T: 99.36 Mlps 2.57x
4T: 118.33 Mlps 3.06x
Logical cores:
5T: 132.11 Mlps 3.41x
6T: 129.04 Mlps 3.33x (probably poor move distribution on this position with this thread count)
7T: 145.94 Mlps 3.77x
8T: 148.11 Mlps 3.83x
The start position doesn't scale that well, but I've noticed more often than not that it isn't the best position to test with because it's so simple. Kiwipete (CPW perft results position 2) is often much more representative.
Kiwipete, perft depth 6:
Physical cores:
1T: 39.26 Mlps 1.00x
2T: 78.16 Mlps 1.99x
3T: 113.20 Mlps 2.88x
4T: 132.41 Mlps 3.37x
Logical cores:
5T: 135.73 Mlps 3.46x
6T: 147.23 Mlps 3.75x
7T: 159.49 Mlps 4.06x
8T: 166.18 Mlps 4.23x
So, the speed keeps increasing, even when the extra logical cores are used. With more than 8 threads, performance tanks. That's no surprise, because at that point there are more threads than the CPU can run at once. Scaling is clearly better on the Kiwipete position than on the starting position: 4.23x versus 3.83x at 8 threads. (The starting position has a drop at 6 threads; I assume the moves are not distributed well among the threads, so some threads return early and then sit idle.)
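To put a number on "scaling", the speedup column can be extended with a per-thread efficiency. This is only a small sketch of the arithmetic; the function name `scaling` is made up for the example, and the inputs are the Mlps values from the listings above.

```rust
/// Returns (speedup, efficiency) for a run on `threads` threads,
/// relative to the single-threaded baseline. Efficiency is the
/// speedup divided by the thread count (1.0 would be perfect scaling).
fn scaling(base_mlps: f64, mlps: f64, threads: u32) -> (f64, f64) {
    let speedup = mlps / base_mlps;
    (speedup, speedup / f64::from(threads))
}
```

For Kiwipete, `scaling(39.26, 132.41, 4)` gives an efficiency of about 0.84 on the four physical cores, and `scaling(39.26, 166.18, 8)` gives about 0.53 with all eight threads. The start position at 8 threads, `scaling(38.72, 148.11, 8)`, lands at about 0.48.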
Maybe the scaling will change after I implement a hash table. I suspect it will, but I don't know by how much.