A comparison of some Perft programs

hgm · Post by **hgm** » Fri Jan 16, 2026 9:44 am

The latest (mailbox) engine I built used incrementally updated mobility. I wonder if this system could also be used in perft. After all, the final ply of a perft basically just returns the mobility. With the catch that it must be the legal mobility, rather than the pseudo-legal one. So you would have to deal with discovery and blocking of pins as well as of slider moves.

The most expensive part of incremental (pseudo-legal) mobility is adjusting the mobility for the blocking of opponent slider moves by non-captures in the forelast ply. The discovery is less of a problem, as it is caused by evacuating the from-square. And the piece on that square usually has many moves, that can share the result of that calculation. The to-squares typically are scattered much more, so they have to be treated individually. Pre-calculation of, say, how many opponent slider moves go over each square is not competitive if the result of the caculation is not used multiple times, especially if for many squares it would not be used at all.

As an individual test for slider blocking one can test for alignment with each of the opponent sliders, which is a comparatively cheap table access based on the difference of square numbers, and which rules out most sliders. That an 'expensive' ray scan is then necessary for sliders that pass the test does not hurt very much. (And even then such a ray scan is of course extremely cheap compared to full move generation of all opponent's pieces.) I still don't like it that the alignment test has to be done for all opponent sliders individually, though.

Yesterday the following idea occurred to me. You could make a board-size table that holds a 'signature' for every square. This signature is then a set of ranks, files, diagonals and anti-diagonals (8 + 8 + 13 + 13 = 42 in total), where the bits are set for those rays the square is on. Alignment of two squares can then be tested by ANDing their signatures. If a set bit remains, there would be alignment, and the bit would tell you on which ray.

A pre-calculation could then OR the signature for all enemy sliders, after ANDing those with the directions they move in. (E.g. for a Rook we would clear the diagonal bits.) We could then AND the signature at the to-square of each move with this 'slider summary' as a primary filter. Moves for which this result is zero will immediately be eliminated as blocking any slider moves. If any bits are set these identify the ray on which blocking might occur, and as a consequence to some extent the slider thay could be blocked (orthogonal or diagonal). So you would typically only have to test 2 or 3 individually. It would also be possible to prepare a small table that for each ray that contains sliders specifies the sliders (moving along that ray) they contain. So you can simply look up which sliders the to-square is aligned with.

Sopel · Post by **Sopel** » Sun Feb 01, 2026 9:27 pm

phhnguyen wrote: ↑Mon Jan 12, 2026 11:30 am
Sopel wrote: ↑Sun Jan 11, 2026 9:42 pm FWIW on 7800x3d on windows I get ~300Mnps for SF 17.1 and 1650Mnps for gigantua on perft 7
Your information is ambiguous to me. SF doesn't print out both elapsed and speed when computing Perft. How can you get the speed of 300 Mbps? If it is the speed of normal search, we cannot use it to compare with Gigantua since (AFAK) it uses leave nodes from bulk counting, but not nodes by making/undoing to calculate Perft speed.

The best if you could do, please measure the time for calculating Perft 8 for both programs. SF doesn't print out elapsed, but you can easily add a few commands to SF code (e.g., such as: TimePoint startPoint = now();... elapsed = now() - startPoint;), recompile and run. You may find a 3rd party program to run it for timing, too. For Gigantua, you can download its 64-bit exe file from its GitHub and run it as below. It will calculate elapsed time itself:
Code: Select all
Gigantua.exe "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1" 8
FYI: I have just re-run some Perft programs from my PC after resetting. There is no chance for Gigantua on my old PC: it is still lagging behind SF. Hope it performs better on your PC.

time can be measured by outside forces, and I can do division

abulmo2 · Post by **abulmo2** » Tue Feb 03, 2026 2:19 am

phhnguyen wrote: ↑Mon Jan 12, 2026 11:03 am
petero2 wrote: ↑Sun Jan 11, 2026 8:51 pm
phhnguyen wrote: ↑Sat Jan 10, 2026 11:25 pm 7) MPerft by Richard Delorme
https://github.com/abulmo/MPerft

The project was updated 6 years ago. I downloaded and compiled it via the make command.

It is the fastest and significantly faster than other programs in this comparison, 17 times faster than Stockfish. On the screen, it prints clearly that it uses hashing and multithreading.
Are you sure it is using multithreading? On my computer it says it is using hashing and bulk counting, but it doesn't say anything about multiple threads, and only uses one thread. Also looking at the source code I see nothing about threads in it.
You are right! It is my mistake! MPerft uses bulk counting and hashing only!

I fixed this in version 3.0. Now mperft uses also multithreading. On my old ryzen 9 5950:

Code: Select all

$ mperft-3.0-x86-64-v3 -d 8 -b -h 256 -t 32 -q
perft  8 :     84998978956 leaves in      0.553 s 153729146951 leaves/s

abulmo2 · Post by **abulmo2** » Sat Feb 07, 2026 11:49 am

phhnguyen wrote: ↑Mon Jan 12, 2026 10:59 am
hgm wrote: ↑Sun Jan 11, 2026 11:46 am The key to doing fast perfts seems to be hashing.
Agreed. Hashing is one of the largest gain methods for Perft.

However, from my experience as well as reading somewhere, hashing may give a factor of about 2 times speed. My computer has 8 cores, and multithreading could bring me a factor of 8 times faster

With the latest MPerft that implement multithreading, I got this on an AMD Ryzen 9 5950x.

bulk counting: x10
multithreading with 16 threads: x14
transposition table of 256Mbytes (16M entries): x17

ankan · Post by **ankan** » Fri May 15, 2026 7:14 am

Hello perft fans,

I’m posting on this forum after almost a decade. I haven’t had much time to spend on chess lately, but recent advances in AI coding agents have completely changed the game for hobby projects like this.

I pointed the agents at my old codebase and asked them to:
- analyze the implementation,
- experiment with optimization ideas,
- benchmark and profile different approaches,
- incorporate some ideas I had never gotten around to trying myself,
- and iterate repeatedly on improvements.
After dozens of iterations, the result is a >2× speedup over the old code — code that I was actually quite proud of back then.
Funny enough, I haven’t even gone through all the details of the updated implementation yet, so I’m a little conflicted about whether this is still “fun” anymore

Here are the updated versions of my CPU and GPU perft programs:
GPU: https://github.com/ankan-ban/perft_gpu_2026
CPU: https://github.com/ankan-ban/perft_cpu_2026

Features:
- TT OFF and TT ON modes
- Single-threaded and multi-threaded CPU versions
- Bulk counting at leaf nodes across all variants
The CPU TT path probably still has a lot of room for improvement — I haven’t tuned it yet.

Some interesting results:

Without TT (Kiwipete position — single-threaded: depth 5, multi-threaded/GPU: depth 7)

Code: Select all

* Apple M4 CPU
      ~2.9 billion NPS single-threaded
      ~17.3 billion NPS with 10 threads
* Intel Core Ultra 7 270K CPU
      ~2.6 billion NPS single-threaded
      ~50.2 billion NPS with 24 threads
* RTX 4090 GPU
      ~1,308 billion NPS
* RTX Pro 6000 Blackwell
      ~2,110 billion NPS

With TT (start position), RTX Pro 6000 Blackwell (96 GB VRAM + 96 GB system memory)

Code: Select all

* Perft 10 in ~2 seconds
* Perft 13 in ~1 hour 34 minutes

syzygy · Post by **syzygy** » Sat May 16, 2026 12:53 am

ankan wrote: ↑Fri May 15, 2026 7:14 am I’m posting on this forum after almost a decade. I haven’t had much time to spend on chess lately, but recent advances in AI coding agents have completely changed the game for hobby projects like this.

I pointed the agents at my old codebase and asked them to:
- analyze the implementation,
- experiment with optimization ideas,
- benchmark and profile different approaches,
- incorporate some ideas I had never gotten around to trying myself,
- and iterate repeatedly on improvements.
After dozens of iterations, the result is a >2× speedup over the old code — code that I was actually quite proud of back then.
Funny enough, I haven’t even gone through all the details of the updated implementation yet, so I’m a little conflicted about whether this is still “fun” anymore

It's crazy....

smatovic · Post by **smatovic** » Sat May 16, 2026 1:16 am

syzygy wrote: ↑Sat May 16, 2026 12:53 am
ankan wrote: ↑Fri May 15, 2026 7:14 am I’m posting on this forum after almost a decade. I haven’t had much time to spend on chess lately, but recent advances in AI coding agents have completely changed the game for hobby projects like this.

I pointed the agents at my old codebase and asked them to:
- analyze the implementation,
- experiment with optimization ideas,
- benchmark and profile different approaches,
- incorporate some ideas I had never gotten around to trying myself,
- and iterate repeatedly on improvements.
After dozens of iterations, the result is a >2× speedup over the old code — code that I was actually quite proud of back then.
Funny enough, I haven’t even gone through all the details of the updated implementation yet, so I’m a little conflicted about whether this is still “fun” anymore
It's crazy....

Yes

I do consider Ankan as top GPU chess programmer, if the AI agents are able to speed up his codebase 2 fold, then we must have passed some kind of inflection point with generative AI...

--
Srdja

Sapling · Post by **Sapling** » Sun May 24, 2026 10:30 pm

ankan wrote: ↑Fri May 15, 2026 7:14 am Hello perft fans,

I’m posting on this forum after almost a decade. I haven’t had much time to spend on chess lately, but recent advances in AI coding agents have completely changed the game for hobby projects like this.

I pointed the agents at my old codebase and asked them to:
- analyze the implementation,
- experiment with optimization ideas,
- benchmark and profile different approaches,
- incorporate some ideas I had never gotten around to trying myself,
- and iterate repeatedly on improvements.
After dozens of iterations, the result is a >2× speedup over the old code — code that I was actually quite proud of back then.
Funny enough, I haven’t even gone through all the details of the updated implementation yet, so I’m a little conflicted about whether this is still “fun” anymore

Here are the updated versions of my CPU and GPU perft programs:
GPU: https://github.com/ankan-ban/perft_gpu_2026
CPU: https://github.com/ankan-ban/perft_cpu_2026

Features:
- TT OFF and TT ON modes
- Single-threaded and multi-threaded CPU versions
- Bulk counting at leaf nodes across all variants
The CPU TT path probably still has a lot of room for improvement — I haven’t tuned it yet.

Some interesting results:

Without TT (Kiwipete position — single-threaded: depth 5, multi-threaded/GPU: depth 7)
Code: Select all
* Apple M4 CPU
      ~2.9 billion NPS single-threaded
      ~17.3 billion NPS with 10 threads
* Intel Core Ultra 7 270K CPU
      ~2.6 billion NPS single-threaded
      ~50.2 billion NPS with 24 threads
* RTX 4090 GPU
      ~1,308 billion NPS
* RTX Pro 6000 Blackwell
      ~2,110 billion NPS
With TT (start position), RTX Pro 6000 Blackwell (96 GB VRAM + 96 GB system memory)
Code: Select all
* Perft 10 in ~2 seconds
* Perft 13 in ~1 hour 34 minutes

What an epic return to the chess scene, absolutely amazing.

There is currently an individual in TGCT discord trying to push to greater PERFT depths, they recently came up with a compression algorithm for storing positions at an average of 4bits/position. If you're interested we'd love for your input!

I just benchmarked your CPU move generator and can confirm it's unlike anything i've seen https://grandchesstree.com/leaderboard
360B+ nps on my 7950x.

NaltaP312 · Post by **NaltaP312** » Mon May 25, 2026 6:47 pm

You can try Xiphos chess engine
https://github.com/milostatarevic/xiphos
On Mac M5 : perft(7)=3195901860, time: 3964ms, nps: 806028211 (no hashing, bulk counting)

A comparison of some Perft programs

Re: A comparison of some Perft programs

Re: A comparison of some Perft programs

Re: A comparison of some Perft programs

Re: A comparison of some Perft programs

Re: A comparison of some Perft programs

Re: A comparison of some Perft programs

Re: A comparison of some Perft programs

Re: A comparison of some Perft programs

Re: A comparison of some Perft programs