Asynchronous tablebase lookups

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Sesse
Posts: 300
Joined: Mon Apr 30, 2018 11:51 pm

Asynchronous tablebase lookups

Post by Sesse »

Since tablebases became commonplace, the storage landscape has shifted. Modern SSDs can reach staggering speeds; a single WD SN850 (a high-end consumer/prosumer NVMe disk) can reach one million IOPS!

However, at QD1 (serial access), this drops to roughly 1/50th of that (about 21k IOPS). And a chess engine typically drives tablebase probes very much synchronously; you'll get _some_ parallelism, but in general, this isn't an efficient way to unlock the full potential of such a disk.

With modern I/O interfaces like io_uring for Linux and (hopefully) the upcoming DirectStorage for Windows, we finally have low-overhead ways of driving these asynchronously. What would a chess engine using them look like? Is it realistic to keep searching other moves, potentially firing off tablebase lookups for those as well? Obviously, if the first move yielded a win, you're very likely to get a cutoff, but if it yielded a draw, you'd likely need to search the remaining moves anyway, so you'd have a head start.

Is it feasible? And can it be done without upending the entire way an engine works?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Asynchronous tablebase lookups

Post by bob »

This is not a new idea. The same thing occurs in a cluster chess engine where you distribute the transposition table across all nodes: latency is high compared to local memory. There are several papers around on cluster chess engines; you might give 'em a whirl to see what has been done in the past. The biggest problem is to search beyond a probe, then get a hit, and throw away everything you have done. If you probe a TB, you KNOW you will get a hit (assuming you only probe positions with piece counts you have downloaded and saved). So if you KNOW it will hit, then searching beyond that point is a waste. With a transposition table, you do not know whether you will hit or not, so the speculative execution makes more sense.

For an EGTB probe, the common idea would be called "lifting the read request." In simple terms: look back through your code's execution paths to see where the information needed to do the probe first becomes available, and "lift" the probe from where it is (the point where the result is needed) to the earliest point where it can be issued. Then, when you reach the point where you need the result, collect the async probe, and you save some time.

It can complicate your code some, as there might be multiple pathways back to that "lift point", so you end up issuing several probes: hopefully only one per path, but also one in EVERY path.
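To make the "lift" concrete, here is a toy sketch of the pattern: the probe is started as soon as the position key is known, and only collected (blocking if necessary) at the point where the value is actually needed. The asynchronous I/O is simulated here by a worker thread and a fake in-memory lookup; tb_probe_begin/tb_probe_collect and toy_tb_lookup are made-up names for illustration, not real Syzygy/Gaviota API.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Fake "tablebase": derives a WDL value (-1/0/+1) from a position key.
 * In a real engine this lookup would be an asynchronous disk read. */
static int toy_tb_lookup(uint64_t key) { return (int)(key % 3) - 1; }

struct probe {
    pthread_t thread;
    uint64_t key;
    int result;
};

static void *probe_worker(void *arg) {
    struct probe *p = arg;
    p->result = toy_tb_lookup(p->key);   /* stands in for the actual I/O */
    return NULL;
}

/* "Lift": fire off the probe as soon as the position key is known... */
static void tb_probe_begin(struct probe *p, uint64_t key) {
    p->key = key;
    pthread_create(&p->thread, NULL, probe_worker, p);
}

/* ...and only block at the point where the value is actually needed. */
static int tb_probe_collect(struct probe *p) {
    pthread_join(p->thread, NULL);
    return p->result;
}
```

Between begin and collect, the engine is free to do other work (move generation, evaluation of other moves); that overlap is where the lift pays off.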

Hope that makes sense.
Sesse
Posts: 300
Joined: Mon Apr 30, 2018 11:51 pm

Re: Asynchronous tablebase lookups

Post by Sesse »

I'm not actually sure if you want to search beyond a probe; if it's a 7-man situation, you probably just want to go on to some other part of the tree. The probe is going to come back eventually… but it might not yield a cutoff.
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Asynchronous tablebase lookups

Post by syzygy »

Especially with lazy SMP approaches, it may make sense to have more search threads than cores.

Another idea is to somehow make a TB probe fail when the right memory pages haven't yet been paged in and to then let them get paged in in the background. The TB probe may then succeed at the next iteration. This is tricky to implement (it might be possible on Linux).
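On Linux, one building block for this is mincore(2): it tells you whether a page of a mapping is resident in RAM, so a probe could bail out early instead of faulting, and madvise(MADV_WILLNEED) could then be issued to page the range in in the background. A minimal sketch of the residency check (page_is_resident is a made-up helper name):

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if the page containing addr is resident in RAM, 0 if touching
 * it would trigger a (possibly disk-backed) page fault, per mincore(2). */
static int page_is_resident(void *addr) {
    long pagesize = sysconf(_SC_PAGESIZE);
    void *page = (void *)((uintptr_t)addr & ~((uintptr_t)pagesize - 1));
    unsigned char vec;
    if (mincore(page, (size_t)pagesize, &vec) != 0)
        return 0;
    return vec & 1;
}
```

The catch syzygy mentions below applies here too: an extra syscall per probe adds overhead precisely in the common case where the page is already present.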
Sesse
Posts: 300
Joined: Mon Apr 30, 2018 11:51 pm

Re: Asynchronous tablebase lookups

Post by Sesse »

TBH I think you have to just abandon mmap to get this kind of performance going. You don't want to be roundtripping through the kernel on every I/O.
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Asynchronous tablebase lookups

Post by syzygy »

Sesse wrote: Sun Dec 20, 2020 12:42 pm TBH I think you have to just abandon mmap to get this kind of performance going. You don't want to be roundtripping through the kernel on every I/O.
Not sure about that. The page faults are already happening now anyway. But it may be tricky to roll back the TB probe efficiently.

There are also ways to check whether some memory page has been mapped into physical memory before doing the access, but that might incur too much overhead for the "normal" case where the page is present.
Sesse
Posts: 300
Joined: Mon Apr 30, 2018 11:51 pm

Re: Asynchronous tablebase lookups

Post by Sesse »

syzygy wrote: Sun Dec 20, 2020 2:14 pm Not sure about that. The page faults are already happening now anyway. But it may be tricky to roll back the TB probe efficiently.
But there's no good reason why they'd need to happen? io_uring will give you I/O without ever transitioning into kernel space.

If you can do 1M TB lookups per second, it's not even given that you'd need a cache of any kind. You can just do the I/O every time, similar to how you decompress every time in the Syzygy tablebases because it's so cheap. But you'd need some way of doing useful work while waiting for the I/O to come back, since the latency is still there.

As an aside, asynchronous I/O would also be a huge win for tablebases on traditional HDDs. I used it in a project recently solely for that; you get the benefit of the elevator algorithm.
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Asynchronous tablebase lookups

Post by syzygy »

Sesse wrote: Sun Dec 20, 2020 2:24 pm
syzygy wrote: Sun Dec 20, 2020 2:14 pm Not sure about that. The page faults are already happening now anyway. But it may be tricky to roll back the TB probe efficiently.
But there's no good reason why they'd need to happen? io_uring will give you I/O without ever transitioning into kernel space.
I guess it's time for me to look into io_uring ;-)
If you can do 1M TB lookups per second, it's not even given that you'd need a cache of any kind. You can just do the I/O every time, similar to how you decompress every time in the Syzygy tablebases because it's so cheap. But you'd need some way of doing useful work while waiting for the I/O to come back, since the latency is still there.
So one approach would be to let the TB probe fail if the data is not immediately available. This is what the Gaviota probing code can do as well ("soft probes"). But ideally the engine probing code wouldn't have to manage its own caching structures (which then couldn't be used by other engine instances).
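For file-based (non-mmap) probing on Linux, there is a ready-made primitive for exactly this kind of soft probe: preadv2(2) with the RWF_NOWAIT flag returns the data if it is already in the kernel page cache and fails with EAGAIN instead of blocking if serving the read would require disk I/O. A sketch (soft_pread is a made-up wrapper name; requires kernel ≥ 4.14):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

/* Soft read: returns the byte count if the range is already cached,
 * -1 with errno == EAGAIN if it would have to touch the disk. */
static ssize_t soft_pread(int fd, void *buf, size_t len, off_t off) {
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    return preadv2(fd, &iov, 1, off, RWF_NOWAIT);
}
```

On EAGAIN the engine could return "unknown" from the probe, queue a normal read (or readahead) in the background, and retry at the next iteration, and since this goes through the regular page cache, the cached data is shared with other engine instances, with no engine-managed cache needed.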
Sesse
Posts: 300
Joined: Mon Apr 30, 2018 11:51 pm

Re: Asynchronous tablebase lookups

Post by Sesse »

syzygy wrote: Sun Dec 20, 2020 4:13 pm I guess it's time for me to look into io_uring ;-)
It's surprisingly comfortable! (But do use liburing, not the syscalls directly.)
So one approach would be to let the TB probe fail if the data is not immediately available. This is what the Gaviota probing code can do as well ("soft probes"). But ideally the engine probing code wouldn't have to manage its own caching structures (which then couldn't be used by other engine instances).
Yeah, that's probably fine in some cases. Another one would be really aggressive prefetching, e.g. if one capture dives into 7-man, generate all the other captures immediately and prefetch them.
Sesse
Posts: 300
Joined: Mon Apr 30, 2018 11:51 pm

Re: Asynchronous tablebase lookups

Post by Sesse »

An effect I hadn't thought of: It seems that synchronous tablebase lookups also make the “stop” command a bit slower, since there's no way to abort an mmap-induced I/O operation. Of course, with an SSD we're only talking milliseconds, but on rotating media, with many threads hammering the disk, you could get quite unlucky with your latencies.

(I haven't verified this rigorously. There could be other reasons for “stop” in Stockfish being slow that I haven't fully accounted for.)