adding TT reduces NPS by allot

Viz
Posts: 184
Joined: Tue Apr 09, 2024 6:24 am
Full name: Michael Chaly

Re: adding TT reduces NPS by allot

Post by Viz »

hgm wrote: Mon Apr 15, 2024 6:01 pm But most people of course don't have 'modern hardware'. I guess this is sort of a blind spot for engines that have no real user base. They are only written for the select group of testers, who usually have the best of the best. On my PC (Intel Sandy Bridge) you would be bandwidth limited. Especially when running multi-threaded. The bus clock is 100MHz (multiplier 32), so for fetching a cache line reading 8 words of data alone already takes 256 CPU clocks, during which no other memory access can be served.
Probing the TT in qsearch was shown to be beneficial in Stockfish not long after Sandy Bridge came to market.
Not to mention that the percentage of people who still own this type of hardware nowadays is extremely low. So "most people don't have modern hardware" may be true, but most people do have newer hardware than Sandy Bridge.
Ciekce
Posts: 148
Joined: Sun Oct 30, 2022 5:26 pm
Full name: Conor Anstey

Re: adding TT reduces NPS by allot

Post by Ciekce »

hgm wrote: Mon Apr 15, 2024 1:53 pm Crafty did this. I would not call Bob Hyatt 'no one'. He was always loudly present.
Mhm. Crafty, the very strong modern engine with current and even decade-old innovations in search.
hgm wrote: Mon Apr 15, 2024 2:22 pm Dozens of examples is not so impressive if you realize there are many hundreds of engines.
hgm wrote: Mon Apr 15, 2024 6:01 pm But most people of course don't have 'modern hardware'. I guess this is sort of a blind spot for engines that have no real user base. They are only written for the select group of testers, who usually have the best of the best. On my PC (Intel Sandy Bridge) you would be bandwidth limited. Especially when running multi-threaded. The bus clock is 100MHz (multiplier 32), so for fetching a cache line reading 8 words of data alone already takes 256 CPU clocks, during which no other memory access can be served.
You can complain from theory until your keyboard breaks, but that changes absolutely nothing about the fact that probing the TT in qsearch has been shown to work in dozens of the not-even-nearly-hundreds of currently relevant engines. Look at this titanic, incredible slowdown that it causes in Stormphrax:

    stormphrax-4.1.12-native.exe    |stormphrax-4.1.12-native_no_qstt.exe|
        mu              sigma       |        mu              sigma       |   Sp(1)/Sp(2)      3*sigma
------------------------------------+------------------------------------+------------------------------------
       2689468.000             0.000|       2703485.000             0.000|      -0.518 %  +/-  0.000 %
       2691716.000          3179.152|       2711280.500         11024.502|      -0.721 %  +/-  0.859 %
       2700262.000         14971.835|       2713998.000          9106.275|      -0.506 %  +/-  1.271 %
       2701965.000         12690.078|       2718143.000         11135.840|      -0.595 %  +/-  1.166 %
       2700955.000         11219.583|       2717552.600          9733.861|      -0.610 %  +/-  1.015 %
       2704337.667         13013.756|       2721806.333         13578.068|      -0.641 %  +/-  0.936 %
       2707283.571         14208.449|       2722640.714         12590.073|      -0.564 %  +/-  1.053 %
       2708761.125         13802.367|       2722085.250         11761.547|      -0.489 %  +/-  1.163 %
       2709544.444         13123.051|       2722897.889         11268.792|      -0.490 %  +/-  1.088 %
       2711238.300         13482.247|       2722609.100         10663.496|      -0.417 %  +/-  1.236 %
terrible. unworkable technique.
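
For anyone following along, here is a minimal C++ sketch of what "probing the TT in qsearch" usually amounts to: look up the position key at the top of qsearch and return immediately if the stored bound already settles the current window. All names here (`TranspositionTable`, `qsearchTTCutoff`, `Bound`) are hypothetical and not taken from Stormphrax or any other engine in this thread. The single-entry, always-replace table is an assumption made to keep the sketch short.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical types for illustration only.
enum class Bound : std::uint8_t { None, Exact, Lower, Upper };

struct TTEntry {
    std::uint64_t key = 0;
    std::int16_t score = 0;
    Bound bound = Bound::None;  // None marks an empty slot
};

class TranspositionTable {
public:
    // entries must be greater than zero.
    explicit TranspositionTable(std::size_t entries) : table_(entries) {}

    // Always-replace store: one entry per slot.
    void store(std::uint64_t key, int score, Bound bound) {
        table_[key % table_.size()] =
            TTEntry{key, static_cast<std::int16_t>(score), bound};
    }

    // Returns the entry if the slot holds this exact key, else nullptr.
    const TTEntry* probe(std::uint64_t key) const {
        const TTEntry& e = table_[key % table_.size()];
        return (e.bound != Bound::None && e.key == key) ? &e : nullptr;
    }

private:
    std::vector<TTEntry> table_;
};

// Called at the top of qsearch. Returns a score when the stored bound is
// enough to answer the (alpha, beta) window without searching any captures.
std::optional<int> qsearchTTCutoff(const TranspositionTable& tt,
                                   std::uint64_t key, int alpha, int beta) {
    const TTEntry* e = tt.probe(key);
    if (e == nullptr) return std::nullopt;
    if (e->bound == Bound::Exact ||
        (e->bound == Bound::Lower && e->score >= beta) ||
        (e->bound == Bound::Upper && e->score <= alpha)) {
        return e->score;
    }
    return std::nullopt;
}
```

The memory access inside `probe` is the cost this thread is arguing about. The early return in `qsearchTTCutoff` is the benefit that has to outweigh it.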
Ciekce
Posts: 148
Joined: Sun Oct 30, 2022 5:26 pm
Full name: Conor Anstey

Re: adding TT reduces NPS by allot

Post by Ciekce »

for completeness, here are (somewhat noisier) measurements with a 1 GB TT, which absolutely cannot fit in cache:

    stormphrax-4.1.12-native.exe    |stormphrax-4.1.12-native_no_qstt.exe|
        mu              sigma       |        mu              sigma       |   Sp(1)/Sp(2)      3*sigma
------------------------------------+------------------------------------+------------------------------------
       2573580.000             0.000|       2604465.000             0.000|      -1.186 %  +/-  0.000 %
       2595079.500         30404.884|       2605286.000          1161.069|      -0.392 %  +/-  3.368 %
       2597486.333         21899.932|       2609292.667          6988.145|      -0.452 %  +/-  2.402 %
       2590299.750         22941.795|       2600139.000         19175.885|      -0.378 %  +/-  2.012 %
       2594565.400         22039.131|       2602102.200         17177.215|      -0.289 %  +/-  1.841 %
       2566082.167         72500.663|       2606323.500         18519.222|      -1.534 %  +/-  9.291 %
       2571510.571         67724.157|       2606066.429         16919.336|      -1.317 %  +/-  8.654 %
       2572360.000         62746.400|       2603765.875         16962.000|      -1.198 %  +/-  8.076 %
       2573061.333         58731.581|       2604679.000         16101.241|      -1.206 %  +/-  7.555 %
       2569221.800         56688.202|       2602002.600         17380.321|      -1.253 %  +/-  7.137 %
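
A note on reading the two tables: in the first row of each (a single run, so sigma is 0), the Sp(1)/Sp(2) column matches mu(1)/mu(2) - 1 expressed in percent. Below is a minimal C++ sketch of that arithmetic and of the comparison against the 3*sigma margin. The function names are made up for illustration. How the benchmarking tool accumulates the later rows is not stated in the thread, so this only reproduces the single-run rows.

```cpp
#include <cmath>

// Relative NPS difference of build 1 versus build 2, in percent.
// Negative means build 1 (with the qsearch TT probe) is slower.
double relativeSpeedPercent(double nps1, double nps2) {
    return (nps1 / nps2 - 1.0) * 100.0;
}

// True when the measured difference is smaller in magnitude than the
// reported error margin, i.e. it cannot be told apart from noise.
bool withinNoise(double diffPercent, double marginPercent) {
    return std::fabs(diffPercent) < marginPercent;
}
```

With the first rows above, `relativeSpeedPercent(2689468.0, 2703485.0)` gives about -0.518 and `relativeSpeedPercent(2573580.0, 2604465.0)` about -1.186. For the last rows, `withinNoise(-0.417, 1.236)` and `withinNoise(-1.253, 7.137)` both hold, so the measured slowdown stays inside the reported margin in both setups. The single-run rows report a margin of 0 only because there is nothing to compute a deviation from yet.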
hgm
Posts: 28010
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: adding TT reduces NPS by allot

Post by hgm »

I never claimed otherwise. I only pointed out that this is a long-known effect, what has been used to counteract it, and that for some engines this even proved beneficial.

Without knowing what hardware the OP has, or how fast his code is, there can be no prediction what would work best for him. Like with every feature of an engine it would have to be tested.

For my engines probing in QS was always better as well. But it could be different for engines that heavily use SMP, as this requires more memory bandwidth.
smatovic
Posts: 2873
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: adding TT reduces NPS by allot

Post by smatovic »

It depends on the engine and the hardware. Modern hardware has ~10-way memory-level parallelism per core, so the issue should be latency rather than bandwidth. As mentioned, prefetch might help:

Re: Need advice on SMP
viewtopic.php?p=945827#p945827
viewtopic.php?p=945835#p945835
viewtopic.php?p=945860#p945860
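
A minimal C++ sketch of the prefetch idea, not taken from the linked posts or from any particular engine: issue a prefetch for the TT slot as soon as the child position's key is known, then do the actual probe later, so the DRAM latency overlaps with other work. The class and function names are hypothetical. `__builtin_prefetch` is the GCC/Clang builtin, and on other compilers this sketch falls back to doing nothing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Prefetch hint: GCC/Clang builtin, otherwise a no-op (assumption: other
// compilers would need their own intrinsic here).
#if defined(__GNUC__) || defined(__clang__)
inline void prefetchLine(const void* p) { __builtin_prefetch(p); }
#else
inline void prefetchLine(const void*) {}
#endif

struct Slot {
    std::uint64_t key = 0;
    int score = 0;
    bool used = false;
};

class PrefetchingTable {
public:
    // The entry count must be a power of two so indexing is a single AND.
    explicit PrefetchingTable(std::size_t entriesPow2)
        : slots_(entriesPow2), mask_(entriesPow2 - 1) {
        assert(entriesPow2 != 0 && (entriesPow2 & mask_) == 0);
    }

    std::size_t indexOf(std::uint64_t key) const {
        return static_cast<std::size_t>(key) & mask_;
    }

    // Only a hint: starts loading the slot's cache line, reads nothing.
    void prefetch(std::uint64_t key) const {
        prefetchLine(&slots_[indexOf(key)]);
    }

    void store(std::uint64_t key, int score) {
        slots_[indexOf(key)] = Slot{key, score, true};
    }

    // Writes the stored score and returns true on a key match.
    bool probe(std::uint64_t key, int& score) const {
        const Slot& s = slots_[indexOf(key)];
        if (!s.used || s.key != key) return false;
        score = s.score;
        return true;
    }

private:
    std::vector<Slot> slots_;
    std::size_t mask_;
};
```

In a search loop the usual pattern would be to call `prefetch(childKey)` right after computing the key of the position a move leads to, then make the move and do whatever other bookkeeping is needed, and only then call `probe(childKey, score)` at the top of the child node. Whether this gains anything has to be measured per engine and per machine.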

--
Srdja
syzygy
Posts: 5646
Joined: Tue Feb 28, 2012 11:56 pm

Re: adding TT reduces NPS by allot

Post by syzygy »

connor_mcmonigle wrote: Mon Apr 15, 2024 12:57 am Sure. However, telling someone not to perform the TT probe in quiescent search to avoid the slowdown is just bad advice generally and, when stated from a position of authority, especially unhelpful. It's pretty established at this point that the cost of performing a TT probe in the quiescent search is worth it, though I'd encourage any new authors to test this themselves and re-test as the cost of their evaluation function increases relative to the cost of a TT probe.
Whether it is worth it will depend on the engine. For most engines (that were not derived from SF), it is most likely not worth it to probe in the qsearch. For an engine written from scratch, it will probably take 10+ years of active development to make probing in the qsearch worth it. In the meantime, just don't probe in the qsearch.
syzygy
Posts: 5646
Joined: Tue Feb 28, 2012 11:56 pm

Re: adding TT reduces NPS by allot

Post by syzygy »

Viz wrote: Mon Apr 15, 2024 2:02 pm
hgm wrote: Mon Apr 15, 2024 1:53 pm Crafty did this. I would not call Bob Hyatt 'no one'. He was always loudly present.

DRAM access is very slow compared to CPU cycle time, equivalent to hundreds of instructions. Adding a slow operation to an otherwise much faster code normally slows it down a lot. But perhaps you consider it normal that engines are very slow to begin with. I suppose 'normal' is a somewhat subjective qualification. I was not the one to introduce it in the discussion, but it seemed clear that the OP was using it in the sense of 'no indication that I did something wrong that I should worry about'.
No one nowadays does this, this is better?
What you are effectively saying is that everybody nowadays starts by forking SF. And this is simply not the case.

A programmer's first engine written from scratch in 2024 will be quite similar to a programmer's first engine written from scratch in 1994.
syzygy
Posts: 5646
Joined: Tue Feb 28, 2012 11:56 pm

Re: adding TT reduces NPS by allot

Post by syzygy »

Ciekce wrote: Tue Apr 16, 2024 8:53 am
hgm wrote: Mon Apr 15, 2024 1:53 pm Crafty did this. I would not call Bob Hyatt 'no one'. He was always loudly present.
Mhm. Crafty, the very strong modern engine with current and even decade-old innovations in search.
For the third time in this thread: the TS is writing an engine from scratch. He does not start by forking SF. He does not even start from Crafty or gnuchess.
towforce
Posts: 11817
Joined: Thu Mar 09, 2006 12:57 am
Location: Birmingham UK

Re: adding TT reduces NPS by allot

Post by towforce »

syzygy wrote: Thu Apr 18, 2024 9:31 pm A programmer's first engine written from scratch in 2024 will be quite similar to a programmer's first engine written from scratch in 1994.

If I were writing a chess engine today, I would use one of the open source chess DLLs to take care of things like move generation. I'm guessing you wouldn't count that as "writing from scratch", but I would call it, "good development practice". :)
The simple reveals itself after the complex has been exhausted.
syzygy
Posts: 5646
Joined: Tue Feb 28, 2012 11:56 pm

Re: adding TT reduces NPS by allot

Post by syzygy »

towforce wrote: Thu Apr 18, 2024 10:47 pm
syzygy wrote: Thu Apr 18, 2024 9:31 pm A programmer's first engine written from scratch in 2024 will be quite similar to a programmer's first engine written from scratch in 1994.
If I were writing a chess engine today, I would use one of the open source chess DLLs to take care of things like move generation. I'm guessing you wouldn't count that as "writing from scratch", but I would call it, "good development practice". :)
Good development practice is not writing a chess engine at all because there are plenty of engines already and you can just use one of those.

In principle I agree with you that not re-inventing the wheel is a good thing, but someone who starts writing a regular chess program will in most cases not do that for the end result (yet another chess engine) but for the (learning) experience. Why would one want to reduce that experience by copying someone else's code?

(Of course in reality there is nobody that could write a half-decent engine entirely from scratch without at least looking at some high-level code. So there will inevitably be some kind of borrowing, and that is fine. And if you want to borrow more than I would like to myself, then that is perfectly fine too (as long as copyrights are respected etc.). And once you have something that plays somewhat decent legal chess without crashing and want to add TBs, then by all means just copy the code from somewhere else.)