I assume you are using bulk counting to reach those speeds. If so, be prepared for a _massive_ drop in speed once you start implementing the actual chess engine, because:
- Bulk counting is a perft trick, and useless for playing chess.
- Your engine is going to have things to keep track of incrementally while making moves, such as Zobrist hashing. This reduces perft speed. After you implement piece-square tables (PSTs), you'll find that applying the PSTs over and over again in the evaluation function is very expensive... so you'll start to keep the PST value incrementally as well. And a second set too, when you add tapered evaluation. And the evaluation phase / material score...
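To make the "keep it incrementally" point concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the `Position` class, the dict-based board, and the random PST values are assumptions, not how any particular engine does it. It also ignores side to move, castling rights, en passant, and the usual white-minus-black sign convention. The idea is only that every make/unmake touches the hash and the PST score for the squares that changed, instead of walking the whole board in the evaluation.

```python
import random

# Hypothetical toy setup: 64 squares, pieces named by a single letter.
PIECES = "PNBRQKpnbrqk"
rng = random.Random(2024)

# One random 64-bit key per (piece, square) pair.
ZOBRIST = {(p, sq): rng.getrandbits(64) for p in PIECES for sq in range(64)}
# Made-up piece-square values; a real engine would use tuned tables.
PST = {(p, sq): rng.randint(-50, 50) for p in PIECES for sq in range(64)}


def full_recompute(board):
    """Slow path: walk every piece to rebuild the hash and the PST score."""
    key, score = 0, 0
    for sq, piece in board.items():
        key ^= ZOBRIST[(piece, sq)]
        score += PST[(piece, sq)]
    return key, score


class Position:
    def __init__(self, board):
        self.board = dict(board)  # square index -> piece letter
        self.key, self.pst = full_recompute(self.board)

    def _remove(self, sq):
        piece = self.board.pop(sq)
        self.key ^= ZOBRIST[(piece, sq)]  # XOR the piece out of the hash
        self.pst -= PST[(piece, sq)]
        return piece

    def _put(self, piece, sq):
        self.board[sq] = piece
        self.key ^= ZOBRIST[(piece, sq)]  # XOR the piece into the hash
        self.pst += PST[(piece, sq)]

    def make_move(self, frm, to):
        """Move a piece; returns the captured piece (or None) for unmake."""
        captured = self._remove(to) if to in self.board else None
        self._put(self._remove(frm), to)
        return captured

    def unmake_move(self, frm, to, captured):
        self._put(self._remove(to), frm)
        if captured is not None:
            self._put(captured, to)
```

Each of those extra XORs and additions is cheap, but they run on every single make and unmake, which is exactly where perft spends its time. Tapered evaluation would add a second PST accumulator next to `pst`, plus a phase counter.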
I did the same as you, focusing on the speed of the move generator (without bulk counting, because that isn't used for chess), and each time I added a new feature with incrementally kept values, the leaves/second dropped a few percent. And it still hurts.
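For anyone unsure what the difference is, here is a rough sketch of perft with and without bulk counting. `ToyGame` is a hypothetical stand-in for a real position (players just claim free cells), used only so there is a tree to walk; the two perft functions are the part that matters. Bulk counting only works if the generator returns strictly legal moves, since at depth 1 it counts them without ever making them.

```python
class ToyGame:
    """Hypothetical stand-in for a chess position: claim one free cell per move."""

    def __init__(self, cells=6):
        self.free = list(range(cells))
        self.taken = []
        self.makes = 0  # how many times make_move was called

    def legal_moves(self):
        return list(self.free)  # copy, so callers can iterate while we mutate

    def make_move(self, m):
        self.free.remove(m)
        self.taken.append(m)
        self.makes += 1

    def unmake_move(self):
        self.free.append(self.taken.pop())
        self.free.sort()  # restore the original move ordering


def perft(pos, depth):
    """Plain perft: every leaf is reached by actually making the move."""
    if depth == 0:
        return 1
    nodes = 0
    for m in pos.legal_moves():
        pos.make_move(m)
        nodes += perft(pos, depth - 1)
        pos.unmake_move()
    return nodes


def perft_bulk(pos, depth):
    """Bulk counting: at depth 1, count the legal moves instead of making them."""
    if depth == 0:
        return 1
    moves = pos.legal_moves()
    if depth == 1:
        return len(moves)
    nodes = 0
    for m in moves:
        pos.make_move(m)
        nodes += perft_bulk(pos, depth - 1)
        pos.unmake_move()
    return nodes
```

Both functions return the same node count, but the bulk version skips make/unmake on the whole last ply, which is where most of the nodes are. A search has to make every move it examines, so that saving never carries over to actual play.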
One optimization I can still make is looking into PEXT, but I don't know how much that will gain.