adding TT reduces NPS by allot

RavRav · Post by **RavRav** » Fri Mar 29, 2024 2:59 pm

Hello,

I recently started making a chess engine. Currently I have a basic AB engine, with piece-square tables for evaluation but I have not jet added a q-search.The only move-ordering that I have is the order in which I decided to write my movegeneration(which is somewhat ordered).

When I added a TT to my engine, I observed a significant decrease in NPS, ranging from 60-80%, depending on the FEN and table size.The engine did improve overall however is such a substantial decrease in NPS normal? (I am not sure if I count NPS correctly, when a move is made on the board I count that new board-state as a node)

I have tested running my engine with the TT implimented, but making sure that all the lookups had the wrong key. This resulted in a similar reduction in NPS. I also tried making the TT very small(16 kB) the NPS reduction mostly dissapeared. Thus I assumed that the decrease in NPS was mostly related to fetching memory.

I tried mitigating this by prefetching the ttEntry, using _mm_prefetch, but this didn't help at all. The time between having the zobrist and reading the ttEntry is quite small, so I dont imagine that prefetching can even compensate for much. I also use magic bitboards for movegeneration, those take up quite some space, but then their seems to be no problem.

Am I correct to assume that the delay is likely due to the fetching of memory? Is it normal for fetching the TTentry to take this much time?

My TT implimentation is as follows. I have a big array of entries. These entries are 16 bytes in size.

Code: Select all

struct TTentry {
  uint64_t key=0;
  int eval=0;
  
  /* 	Exact,         this is the solution, regardless of alpha and beta. 
  	LowerBound,    My eval will be greater or equal to this. 
  	Upperbound,    My eval will be smaller or equal to this.*/
  enum NodeType :uint8_t {NONE, EXACT, UPPER_BOUND, LOWER_BOUND}nodeType:2 = NONE;//0-3
  uint8_t inCheck : 2 = 0;//0-3//unused
  uint8_t padding_1 : 4 = 0;//unused

  uint8_t depth : 6;//0-63
  uint8_t padding_2 : 2 = 0;//unused

  CompactMove bestMove{};
};

Then in the search recursion function, I first lookup a TT entry to see if their exists a usefull one. At the end I try to store the result. My current replacement strategy is to only replace nodes at less or equal depth.

Code: Select all

  Move bestMove = Move::NULL_MOVE();
  const int initialAlpha = alpha;

  TTentry& ttEntry = transpositionTable[board.zobristKey.key % transpositionTableSize];
  if (ttEntry.key == board.zobristKey.key && ttEntry.nodeType != TTentry::NONE) {
    if (ttEntry.depth >= depth) {
      if (ttEntry.nodeType == TTentry::LOWER_BOUND) {
        alpha = std::max(alpha,ttEntry.eval);
      }if (ttEntry.nodeType == TTentry::UPPER_BOUND) {
        beta = std::min(beta, ttEntry.eval);
      }
      if (ttEntry.nodeType == TTentry::EXACT || alpha>=beta)return ttEntry.eval;
    }
    bestMove = ttEntry.bestMove.toBoardMove(board);
  }

..... search and evaluation .....

  if (depth >= ttEntry.depth) {
    ttEntry = TTentry{
      .key = board.zobristKey.key,
      .eval = alpha,
      .nodeType = alpha >= beta ? TTentry::LOWER_BOUND : alpha > initialAlpha ? TTentry::EXACT : TTentry::UPPER_BOUND,
      .depth = uint8_t(depth),
      .bestMove = CompactMove(bestMove)
    };
  }

hgm · Post by **hgm** » Fri Mar 29, 2024 3:11 pm

This is normal, especially if your move-generation and evaluation code is fast. TT probing is a quite slow operation. In micro-Max using TT made the speed drop from 6Mnps to 1Mnps.

Conventional solution to this problem is not probe the TT in the QS/leaf nodes.

As you noticed, a table small enough to fit in cache is substantially faster. So you could use a separate table for the d=0 nodes. Even a very small table could significantly speed up QS, as the QS trees do not branch out very much, and the same final position can often be reached by doing the same captures in another order.

Dann Corbit · Post by **Dann Corbit** » Fri Mar 29, 2024 4:48 pm

Use a power of two hash table size and form the address with &. Division is a lot slower.
Make sure your hash table is aligned on 8 byre boundaries. Try for a compact hash element size that is a multiple of 8 bytes in size.

RavRav · Post by **RavRav** » Fri Mar 29, 2024 7:26 pm

Dann Corbit wrote: ↑Fri Mar 29, 2024 4:48 pm This is normal, especially if your move-generation and evaluation code is fast. TT probing is a quite slow operation. In micro-Max using TT made the speed drop from 6Mnps to 1Mnps.

Conventional solution to this problem is not probe the TT in the QS/leaf nodes.

As you noticed, a table small enough to fit in cache is substantially faster. So you could use a separate table for the d=0 nodes. Even a very small table could significantly speed up QS, as the QS trees do not branch out very much, and the same final position can often be reached by doing the same captures in another order.

Thank you for the reply. I am happy to hear that it is not me messing up. Using a seperate TT for the QS is something that I will try out. When running the search with a TT with only 1K elements, it suprised me that their was still a decent node reduction.

Dann Corbit wrote: ↑Fri Mar 29, 2024 4:48 pm Use a power of two hash table size and form the address with &. Division is a lot slower.
Make sure your hash table is aligned on 8 byre boundaries. Try for a compact hash element size that is a multiple of 8 bytes in size.

Thank you for the advice, I already used a power of two for my hash table size. I assume that the compiler will optimize the % operator into a & operation(as I declared the size as a constexpr). I had not looked into the alignement.

Ciekce · Post by **Ciekce** » Fri Mar 29, 2024 7:34 pm

hgm wrote: ↑Fri Mar 29, 2024 3:11 pm This is normal, especially if your move-generation and evaluation code is fast. TT probing is a quite slow operation. In micro-Max using TT made the speed drop from 6Mnps to 1Mnps.

Conventional solution to this problem is not probe the TT in the QS/leaf nodes.

As you noticed, a table small enough to fit in cache is substantially faster. So you could use a separate table for the d=0 nodes. Even a very small table could significantly speed up QS, as the QS trees do not branch out very much, and the same final position can often be reached by doing the same captures in another order.

Probing the TT in qsearch is usually a decent gainer in modern engines, it's not 2005. If speed drops by that much it's presumably an implementation issue.

Ciekce · Post by **Ciekce** » Fri Mar 29, 2024 7:36 pm

Dann Corbit wrote: ↑Fri Mar 29, 2024 4:48 pm Use a power of two hash table size and form the address with &. Division is a lot slower.

You can get non-power of 2 indexing equally cheaply using the fixed-point multiply trick, which you can see here:
https://github.com/Ciekce/Stormphrax/bl ... able.h#L97
https://github.com/official-stockfish/S ... isc.h#L186

chrisw · Post by **chrisw** » Fri Mar 29, 2024 7:53 pm

you can use prefetch on the hash address as soon as you know it

hgm · Post by **hgm** » Fri Mar 29, 2024 10:23 pm

RavRav wrote: ↑Fri Mar 29, 2024 7:26 pmThank you for the advice, I already used a power of two for my hash table size. I assume that the compiler will optimize the % operator into a & operation(as I declared the size as a constexpr). I had not looked into the alignement.

That would be true only if it is known at compile time that the hash size will be a power of two (e.g. because it is defined as a constant). But in an engine complying to one of the standard communication protocols the hash size is variable. And it is questionable whether the compiler would be very smart about this. E.g. when you left-shift 1 as many times as you can without exceeding the requested size, and use that as the actual size, you would have a power of two. But it is questionable if the compiler would be able to deduce that.

syzygy · Post by **syzygy** » Fri Apr 05, 2024 10:11 pm

Ciekce wrote: ↑Fri Mar 29, 2024 7:34 pm
hgm wrote: ↑Fri Mar 29, 2024 3:11 pm This is normal, especially if your move-generation and evaluation code is fast. TT probing is a quite slow operation. In micro-Max using TT made the speed drop from 6Mnps to 1Mnps.

Conventional solution to this problem is not probe the TT in the QS/leaf nodes.

As you noticed, a table small enough to fit in cache is substantially faster. So you could use a separate table for the d=0 nodes. Even a very small table could significantly speed up QS, as the QS trees do not branch out very much, and the same final position can often be reached by doing the same captures in another order.
Probing the TT in qsearch is usually a decent gainer in modern engines, it's not 2005. If speed drops by that much it's presumably an implementation issue.

So do you think someone's very first chess engine written in 2024 will be a "modern engine"?

Dann Corbit · Post by **Dann Corbit** » Wed Apr 10, 2024 3:51 am

syzygy wrote: ↑Fri Apr 05, 2024 10:11 pm
Ciekce wrote: ↑Fri Mar 29, 2024 7:34 pm
hgm wrote: ↑Fri Mar 29, 2024 3:11 pm This is normal, especially if your move-generation and evaluation code is fast. TT probing is a quite slow operation. In micro-Max using TT made the speed drop from 6Mnps to 1Mnps.

Conventional solution to this problem is not probe the TT in the QS/leaf nodes.

As you noticed, a table small enough to fit in cache is substantially faster. So you could use a separate table for the d=0 nodes. Even a very small table could significantly speed up QS, as the QS trees do not branch out very much, and the same final position can often be reached by doing the same captures in another order.
Probing the TT in qsearch is usually a decent gainer in modern engines, it's not 2005. If speed drops by that much it's presumably an implementation issue.
So do you think someone's very first chess engine written in 2024 will be a "modern engine"?

I hope not.

All the fun is bound to discovery and learning. If we start off at 3300 Elo, due to incorporation of every good idea ever invented implemented in a non-optimal way, then all that is left is turning some handles until the score settles down.

Now, there is nothing wrong with understanding how a good chess engine works and saving time from that. But I think that if we just want to top the Elo of Stockfish and Lc0, we will find that a tiny team of one or two will never compete with a massive army and we will tire out. But if we are fueled by discovery and invention, then we will have fun anyway, without being number one on the charts.

adding TT reduces NPS by allot

adding TT reduces NPS by allot

Re: adding TT reduces NPS by allot

Re: adding TT reduces NPS by allot

Re: adding TT reduces NPS by allot

Re: adding TT reduces NPS by allot

Re: adding TT reduces NPS by allot

Re: adding TT reduces NPS by allot

Re: adding TT reduces NPS by allot

Re: adding TT reduces NPS by allot

Re: adding TT reduces NPS by allot