adding TT reduces NPS by a lot

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

syzygy
Posts: 5576
Joined: Tue Feb 28, 2012 11:56 pm

Re: adding TT reduces NPS by a lot

Post by syzygy »

Dann Corbit wrote: Wed Apr 10, 2024 3:51 am
syzygy wrote: Fri Apr 05, 2024 10:11 pm
Ciekce wrote: Fri Mar 29, 2024 7:34 pm
hgm wrote: Fri Mar 29, 2024 3:11 pm This is normal, especially if your move-generation and evaluation code is fast. TT probing is quite a slow operation. In micro-Max, using a TT made the speed drop from 6 Mnps to 1 Mnps.

The conventional solution to this problem is to not probe the TT in the QS/leaf nodes.

As you noticed, a table small enough to fit in cache is substantially faster. So you could use a separate table for the d=0 nodes. Even a very small table could significantly speed up QS, as the QS trees do not branch out very much, and the same final position can often be reached by doing the same captures in another order.
Probing the TT in qsearch is usually a decent gainer in modern engines; it's not 2005. If speed drops by that much, it's presumably an implementation issue.
So do you think someone's very first chess engine written in 2024 will be a "modern engine"?
I hope not.

All the fun is in discovery and learning. If we start off at 3300 Elo, by incorporating every good idea ever invented (implemented in a non-optimal way), then all that is left is turning some handles until the score settles down.

Now, there is nothing wrong with understanding how a good chess engine works and saving time that way. But I think that if we just want to top the Elo of Stockfish and Lc0, we will find that a tiny team of one or two can never compete with a massive army, and we will tire out. But if we are fueled by discovery and invention, then we will have fun anyway, without being number one on the charts.
Agreed.
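hgm's suggestion above of a separate small table for the d=0 nodes could be sketched roughly like this; all names, sizes, and fields are illustrative assumptions, not code from micro-Max or any real engine:

```cpp
#include <cstdint>
#include <cstddef>

// A tiny direct-mapped table used only in quiescence search, sized so it
// can stay resident in cache. Everything here is an illustrative sketch.
struct QSEntry {
    uint64_t key;    // full zobrist key (0 doubles as "empty" in this sketch)
    int16_t  score;
    uint8_t  flag;   // bound type: exact / lower / upper
};

struct QSTable {
    static constexpr size_t SIZE = 1 << 14;  // 16K entries * 16 B = 256 KiB
    QSEntry entries[SIZE] = {};

    QSEntry* probe(uint64_t key) {
        QSEntry& e = entries[key & (SIZE - 1)];
        return e.key == key ? &e : nullptr;  // direct-mapped: hit or miss
    }
    void store(uint64_t key, int16_t score, uint8_t flag) {
        entries[key & (SIZE - 1)] = {key, score, flag};  // always-replace
    }
};
```

Because QS trees re-reach the same positions via transposed capture orders, even a table this small catches many hits, and at roughly 256 KiB it can live in a typical L2 cache.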
connor_mcmonigle
Posts: 543
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: adding TT reduces NPS by a lot

Post by connor_mcmonigle »

Sure. However, telling someone not to perform the TT probe in quiescent search to avoid the slowdown is just bad advice generally and, when stated from a position of authority, especially unhelpful. It's pretty established at this point that the cost of performing a TT probe in the quiescent search is worth it, though I'd encourage any new authors to test this themselves and re-test as the cost of their evaluation function increases relative to the cost of a TT probe.

The solution adopted in most recent engines to ameliorate the cost of the TT probe is to prefetch relevant TT entries ahead of the probe. Notably, this prefetch needs to be performed well ahead of the following TT probe (typically by computing the child's zobrist key before making the move in the parent node). Additionally, ensure your entries are reasonably sized (16 bytes or less) and, if you're using a bucketing scheme, ensure that your buckets fit cleanly in a cache line.
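A minimal sketch of the scheme described above, with a hypothetical 16-byte entry layout and 4-entry buckets aligned to 64-byte cache lines (field names and widths are assumptions, not any particular engine's format):

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Hypothetical 16-byte TT entry; real engines pack these fields differently.
struct TTEntry {
    uint64_t key;    // zobrist key (0 doubles as "empty" in this sketch)
    int16_t  score;
    int16_t  eval;
    uint16_t move;
    uint8_t  depth;
    uint8_t  flags;  // bound type + age
};
static_assert(sizeof(TTEntry) == 16, "keep entries at 16 bytes");

// Four entries fill one 64-byte cache line exactly, so a probe touches
// a single line.
struct alignas(64) TTBucket { TTEntry entries[4]; };
static_assert(sizeof(TTBucket) == 64, "bucket should fit one cache line");

struct TT {
    std::vector<TTBucket> buckets;
    explicit TT(size_t n) : buckets(n) {}

    // Call as soon as the child's zobrist key is known, ideally before
    // make_move, so the line has arrived by the time probe() runs.
    void prefetch(uint64_t key) const {
#if defined(__GNUC__) || defined(__clang__)
        __builtin_prefetch(&buckets[key % buckets.size()]);
#endif
    }

    TTEntry* probe(uint64_t key) {
        TTBucket& b = buckets[key % buckets.size()];
        for (TTEntry& e : b.entries)
            if (e.key == key) return &e;
        return nullptr;
    }
};
```

The prefetch is purely a hint, so one that later turns out to be unneeded costs little by itself; whether the extra memory traffic matters is exactly the bandwidth question debated in the posts that follow.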
User avatar
hgm
Posts: 27894
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: adding TT reduces NPS by a lot

Post by hgm »

I guess that is why no one told such a thing. :wink:

Prefetching is not much help if memory bandwidth is the limiting factor. It can even make things worse, if you are prefetching entries that in the end turn out not to be needed. The problem is that there is very little you need to do unconditionally between deciding a move should be searched and needing the probe result.
Viz
Posts: 95
Joined: Tue Apr 09, 2024 6:24 am
Full name: Michael Chaly

Re: adding TT reduces NPS by a lot

Post by Viz »

"This is normal, especially if your move-generation and evaluation code is fast. TT probing is quite a slow operation. In micro-Max, using a TT made the speed drop from 6 Mnps to 1 Mnps.

The conventional solution to this problem is to not probe the TT in the QS/leaf nodes."
a) This is not normal;
b) No one does this type of conventional solution at all.
User avatar
hgm
Posts: 27894
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: adding TT reduces NPS by a lot

Post by hgm »

Crafty did this. I would not call Bob Hyatt 'no one'. He was always loudly present.

DRAM access is very slow compared to CPU cycle time, equivalent to hundreds of instructions. Adding a slow operation to otherwise much faster code normally slows it down a lot. But perhaps you consider it normal that engines are very slow to begin with. I suppose 'normal' is a somewhat subjective qualification. I was not the one to introduce it into the discussion, but it seemed clear that the OP was using it in the sense of 'no indication that I did something wrong that I should worry about'.
Viz
Posts: 95
Joined: Tue Apr 09, 2024 6:24 am
Full name: Michael Chaly

Re: adding TT reduces NPS by a lot

Post by Viz »

hgm wrote: Mon Apr 15, 2024 1:53 pm Crafty did this. I would not call Bob Hyatt 'no one'. He was always loudly present.

DRAM access is very slow compared to CPU cycle time, equivalent to hundreds of instructions. Adding a slow operation to otherwise much faster code normally slows it down a lot. But perhaps you consider it normal that engines are very slow to begin with. I suppose 'normal' is a somewhat subjective qualification. I was not the one to introduce it into the discussion, but it seemed clear that the OP was using it in the sense of 'no indication that I did something wrong that I should worry about'.
No one does this nowadays; is that better?
Not probing at leaf nodes and in qsearch is of course a solution.
Like the solution to not missing zugzwangs in null-move pruning being to disable it completely. Is it a solution? Yes. A good one? No.
And since there are literally dozens of examples showing that it's not only doable but actually beneficial, advising against using the TT as the "conventional" solution is just bizarre.
User avatar
hgm
Posts: 27894
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: adding TT reduces NPS by a lot

Post by hgm »

Dozens of examples are not so impressive if you realize there are many hundreds of engines.

And no matter how much it might annoy you, the only way to get a node rate higher than the memory bandwidth allows is to not probe DRAM in every node.

The conventional solution for not missing zugzwangs is a (heavily reduced) verification search.
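That verification-search idea can be sketched as follows; Position, qsearch, and search_moves are hypothetical stubs standing in for a real engine's machinery, and R is an assumed null-move reduction:

```cpp
// Minimal stand-ins so the sketch is self-contained; a real engine
// supplies these.
struct Position {
    bool in_check() const { return false; }
    void make_null_move() {}
    void unmake_null_move() {}
};
int qsearch(Position&, int alpha, int) { return alpha; }           // stub
int search_moves(Position&, int, int alpha, int) { return alpha; } // stub

constexpr int R = 3;  // assumed null-move depth reduction

// Null-move pruning guarded by a heavily reduced verification search:
// instead of disabling null move entirely to dodge zugzwang, a reduced
// real search re-checks the fail-high before the cutoff is trusted.
int search(Position& pos, int depth, int alpha, int beta) {
    if (depth <= 0) return qsearch(pos, alpha, beta);
    if (depth > R && !pos.in_check()) {
        pos.make_null_move();
        int null_score = -search(pos, depth - 1 - R, -beta, -beta + 1);
        pos.unmake_null_move();
        if (null_score >= beta) {
            // Verification: in zugzwang the real (reduced) search fails
            // low here, and the normal move loop runs after all.
            int v = search(pos, depth - 1 - R, beta - 1, beta);
            if (v >= beta) return v;
        }
    }
    return search_moves(pos, depth, alpha, beta);  // normal move loop
}
```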
connor_mcmonigle
Posts: 543
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: adding TT reduces NPS by a lot

Post by connor_mcmonigle »

hgm wrote: Mon Apr 15, 2024 2:22 pm Dozens of examples are not so impressive if you realize there are many hundreds of engines.

And no matter how much it might annoy you, the only way to get a node rate higher than the memory bandwidth allows is to not probe DRAM in every node.

The conventional solution for not missing zugzwangs is a (heavily reduced) verification search.
Bandwidth is never going to be the limiting factor on modern hardware, assuming a sane TT entry size (ballpark 10-100 GiB/s). The real issue is the latency of fetching main memory into the cache stalling your search for hundreds of cycles: there's more than enough bandwidth headroom, so prefetching eagerly is always worth it.
User avatar
hgm
Posts: 27894
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: adding TT reduces NPS by a lot

Post by hgm »

But most people of course don't have 'modern hardware'. I guess this is sort of a blind spot for engines that have no real user base: they are only written for the select group of testers, who usually have the best of the best. On my PC (Intel Sandy Bridge) you would be bandwidth limited, especially when running multi-threaded. The bus clock is 100 MHz (multiplier 32), so just reading the 8 words of data in a cache line already takes 256 CPU clocks, during which no other memory access can be served.
connor_mcmonigle
Posts: 543
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: adding TT reduces NPS by a lot

Post by connor_mcmonigle »

hgm wrote: Mon Apr 15, 2024 6:01 pm But most people of course don't have 'modern hardware'. I guess this is sort of a blind spot for engines that have no real user base: they are only written for the select group of testers, who usually have the best of the best. On my PC (Intel Sandy Bridge) you would be bandwidth limited, especially when running multi-threaded. The bus clock is 100 MHz (multiplier 32), so just reading the 8 words of data in a cache line already takes 256 CPU clocks, during which no other memory access can be served.
I don't think your analysis is correct. A Sandy Bridge part should be capable of ~10 GiB/s, and the memory controller should be capable of servicing multiple simultaneous requests. The key is hiding that latency by asynchronously prefetching the relevant memory.
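As a back-of-the-envelope check of that figure (round, illustrative numbers rather than measurements): if every probe misses cache and pulls one 64-byte line, about 10 GiB/s of DRAM bandwidth sustains on the order of 1.7e8 lines per second, far above typical single-thread NPS, which is why latency rather than bandwidth is usually the bottleneck once prefetching hides it.

```cpp
#include <cstdint>

// One TT probe that misses cache costs one 64-byte line of DRAM traffic.
constexpr uint64_t bandwidth_bytes = 10ull << 30;  // ~10 GiB/s (assumed)
constexpr uint64_t line_bytes      = 64;
constexpr uint64_t lines_per_sec   = bandwidth_bytes / line_bytes;
// lines_per_sec is 167'772'160: roughly 168M probed lines/s of headroom.
```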