What would you do differently

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: What would you do differently

Post by bob »

lucasart wrote:
Bloodbane wrote:I've got a few macros of that kind there, but I don't like how there is bit-shifting going on under the hood all the time. The question is whether removing that compensates for the increased memory usage. I haven't tested the idea yet so I don't know how it will go, but I don't think it's doomed to fail. I imagine you've tested the idea at some point?
I solve this problem as follows:
1. The move type holds the strict minimum of info, which is fsq, tsq, and the promotion piece (no flags).
2. I do the compression/decompression all at once. When a function is given a compressed move and needs its components, I simply unpack it into 3 ints and work with those throughout the function. And vice versa: the function calculates 3 ints and packs them at the end to return a move.
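
Something like this, to make it concrete (a minimal sketch with invented names, not DiscoCheck's actual code):

Code: Select all

#include <stdint.h>

/* Hedged sketch of the pack/unpack scheme described above:
   6 bits for fsq, 6 for tsq, the remaining bits for the
   promotion piece.  All names are invented. */
typedef uint16_t move_t;

static move_t move_pack(int fsq, int tsq, int prom)
{
    return (move_t)(fsq | (tsq << 6) | (prom << 12));
}

static void move_unpack(move_t m, int *fsq, int *tsq, int *prom)
{
    *fsq  = m & 63;         /* bits 0-5   */
    *tsq  = (m >> 6) & 63;  /* bits 6-11  */
    *prom = m >> 12;        /* bits 12-15 */
}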

Despite the fact that I am writing C code (no C++), I never use macros or even inline. I use extern functions implemented in source files (never in header files, to keep those clean). I let gcc's link-time optimization do the job of deciding what to inline at the code-generation step, which happens at link time (i.e. across all compilation units, which are bytecode at this stage). The advantages of functions over macros are obvious: debugging, type safety, and avoiding subtle syntactic traps when you forget to parenthesize everything in your macros or forget the famous "do {...} while (0)" hack...
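
To illustrate the traps (a hedged sketch, invented macro names):

Code: Select all

/* Trap 1: missing parentheses interact badly with precedence. */
#define DOUBLE_BAD(x)  x + x          /* 2 * DOUBLE_BAD(3) -> 2*3 + 3 = 9  */
#define DOUBLE(x)      ((x) + (x))    /* 2 * DOUBLE(3)     -> 12, as meant */

/* Trap 2: a multi-statement macro needs do { ... } while (0)
   to behave as a single statement, e.g. after an unbraced if. */
#define CLEAR2(a, b)   do { (a) = 0; (b) = 0; } while (0)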

I can't believe the C guy is telling the C++ guy not to use macros. Shouldn't it be the other way around? ;)
The good thing about macros is that the compiler gets no "option". A compiler can choose to inline a function or not as it sees fit, and there are a bunch of issues involved in making the "to inline or not to inline" decision. With a macro, there are none.

LTO is not fully implemented for all object formats yet. But if you do as I do and have a simple "crafty.c" that #include's every source file, you get exactly the same effect, because the compiler sees everything at once. And from lots of testing, this actually produces faster code for me than depending on LTO to take care of it.
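
For illustration, the whole trick is one file (the list of includes here is abridged and illustrative):

Code: Select all

/* crafty.c -- the single-translation-unit trick described above.
   The compiler sees the whole program at once, so it can inline
   across "modules" without needing LTO support in the object
   format.  File list abridged and illustrative. */
#include "search.c"
#include "evaluate.c"
#include "next.c"
#include "hash.c"
/* ...and every other .c file in the project... */

The obvious caveat is that all symbols then share one translation unit, so static functions and careful naming matter.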

To each his own, of course, but I have to deal with non-ELF, non-Mach-O machines all the time, where LTO is not always implemented.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: lazy eval

Post by bob »

jdart wrote:If you only used the static eval for generating a cutoff in the q-search, then a correct implementation of it would be non-problematic, I think. But eval is also commonly used for pruning decisions at higher depths, as well as for setting various thresholds (e.g. I vary null-move depth based on eval). Once you have an inexact score from the eval, it is very hard to make these other places work properly. (You could possibly do a more exact eval only when you need it, but that is a more complex implementation.)

--Jon
Do you not pass alpha and beta in to your eval? From inside the search, you can pass in any arbitrary bound you want to improve the accuracy, while of course driving up the computational cost at the same time.
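
In sketch form (all names invented, just an illustration of the idea):

Code: Select all

typedef struct Position Position;      /* opaque for this sketch */
int MaterialAndPst(const Position *);  /* cheap, incrementally updated */
int SlowTerms(const Position *);       /* king safety, mobility, ... */
enum { LAZY_MARGIN = 300 };            /* illustrative margin */

/* Lazy eval with bounds passed in from the search: if the cheap
   part of the score is already hopelessly outside [alpha, beta],
   skip the expensive terms. */
int Evaluate(const Position *pos, int alpha, int beta)
{
    int score = MaterialAndPst(pos);

    if (score + LAZY_MARGIN < alpha || score - LAZY_MARGIN > beta)
        return score;       /* the bound is all the search needs */

    return score + SlowTerms(pos);
}

The tighter the bounds you pass in, the more often the early-out fires and the less exact the returned score is; that trade-off is exactly what is being discussed here.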
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: What would you do differently

Post by bob »

lucasart wrote:
bob wrote:
lucasart wrote:
jdart wrote:
if you were restarting from scratch, what would you do differently?
Test, test and test some more. It is easy to introduce bugs and bugs can have a significant effect on playing strength. So verifying correctness is very important. I have unit tests now for some important functions (disabled in production builds) as well as a lot of asserts, support for perft, etc.

If you are careful there is no reason to avoid incremental movegen. But you have to verify it is working as you expect.

I agree with Pawel about lazy eval. I used to have it but it is quite problematic.

--Jon
Unit tests: completely agree. I am progressing very slowly and validating everything one step at a time. I have unit tests for perft, bench, and SEE. The perft and bench tests validate correctness (that plus git bisect has worked wonders in the past). They can also be used as a speed measurement to validate non-functional patches that are supposed to speed up the engine.

Asserts: I have them almost everywhere possible. Every time a function is entered, its parameters are checked. The idea is not just to spot bugs, but to spot them early in the call tree so the consequences can easily be linked to the cause...
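
For example (a hedged sketch, invented names):

Code: Select all

#include <assert.h>

/* Illustrative entry-point checks (Position, board, EMPTY are
   invented names): validate every parameter on entry so a bug
   is caught near its cause, not ten calls later. */
void make_move(Position *pos, int fsq, int tsq)
{
    assert(pos != NULL);
    assert(fsq >= 0 && fsq < 64 && tsq >= 0 && tsq < 64);
    assert(pos->board[fsq] != EMPTY);  /* must move an actual piece */
    /* ... actual make-move code ... */
}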

Phased movegen: I am ready to believe that there is a speed-up, but I am pretty sure it can only be very small.
1. Most nodes are QS nodes, where a phased movegen would generate all pseudo-legal captures anyway. No speed-up there (quite the contrary, due to the legal-move checking that has to be performed).
2. I think the real gain of phasing comes when you have a TT move and can skip movegen completely. This can also be done with a legal movegen, so I may do it later, if and when I care that much about raw speed.
3. In practice, movegen is only a small part of my profile in DiscoCheck, so doing pseudo-legal and phased movegen is way too much complexity and risk of bugs for a tiny gain, IMO.

Lazy eval is one of those stupid ideas that looks like a good idea at first. Been there, done that, got the T-shirt... No lazy eval for me, thanks. Again, making design decisions based on a dubious compromise between speed and accuracy is already the first sign that it's plainly wrong. I do not compromise accuracy and correctness; speed is important, but it comes after those.
A note on phased move generation. Here are the places where you save, ranked from highest to lowest.

1. No generation until the hash move is searched. Many times you bypass move generation completely. But this really doesn't require a "phased" generation, just avoiding any sort of generation until after searching the hash move; then you need to be able to exclude searching that move again after the move generation is done.

2. Generating captures only saves a LOT of time. In the q-search, generating just the captures (bitboard engines only, of course) avoids a lot of unnecessary effort, both in ordering the captures and in simply generating them.

3. I do killers next, which are guaranteed to be non-captures. I do this before generating non-captures for the same reason as the hash move above.

4. Next I generate the non-captures and add them to the captures that didn't rank high enough to be searched during the non-losing-captures-only part of the search. Again you need to eliminate that pesky hash move if it was not a capture, and you need to eliminate the killers that you have already searched.

I think the biggest gain is generating captures only, since that eliminates looping down rays for sliding pieces and looping over empty destinations for all pieces.
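
In skeleton form, the staging might look like this (a hedged sketch with invented names, not Crafty's actual NextMove; killers and capture ordering omitted for brevity):

Code: Select all

typedef struct Position Position;              /* opaque here */
int GenerateCaptures(const Position *, int *); /* assumed helpers */
int GenerateQuiets(const Position *, int *);

/* Staged move picker: nothing is generated until the hash move
   has been tried, and quiets are not generated until the
   captures run out. */
typedef struct {
    int phase;
    int hash_move;               /* 0 if none */
    int list[256], count, idx;   /* current batch of moves */
} Picker;

enum { HASH_MOVE, GEN_CAPTURES, CAPTURES, GEN_QUIETS, QUIETS };

int NextMove(Picker *p, const Position *pos)   /* 0 = no more moves */
{
    for (;;) {
        switch (p->phase) {
        case HASH_MOVE:
            p->phase = GEN_CAPTURES;
            if (p->hash_move)
                return p->hash_move;    /* no generation needed yet */
            break;
        case GEN_CAPTURES:
            p->count = GenerateCaptures(pos, p->list);
            p->idx   = 0;
            p->phase = CAPTURES;
            break;
        case GEN_QUIETS:
            p->count = GenerateQuiets(pos, p->list);
            p->idx   = 0;
            p->phase = QUIETS;
            break;
        case CAPTURES:
        case QUIETS:
            while (p->idx < p->count) {
                int m = p->list[p->idx++];
                if (m != p->hash_move)  /* don't search it twice */
                    return m;
            }
            if (p->phase == QUIETS)
                return 0;
            p->phase = GEN_QUIETS;
            break;
        }
    }
}

Killers (step 3 above) would slot in as an extra phase between CAPTURES and GEN_QUIETS.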

Lazy eval works when done right. It can be problematic otherwise. But that is true of most any serious optimization one might do in an engine. Anything can be done poorly.
Thank you. That confirms my intuition. I already do 2 in QS. I will probably also do 1, but only if the gain is really measurable. My move ordering is perhaps different from yours, though. What worked best in DiscoCheck is good captures by descending SEE, then quiet moves (in search), then bad captures by descending SEE. For QS I order captures by MVV/LVA, which is slightly better than SEE.
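
(As an aside, MVV/LVA ordering reduces to a one-line score; a hedged sketch with invented names:)

Code: Select all

/* MVV/LVA capture scoring (piece types 0..5 = P,N,B,R,Q,K).
   The victim term dominates the attacker term, so more valuable
   victims always sort first, and among equal victims the
   cheapest attacker wins.  Higher score = search earlier. */
int mvv_lva(int victim, int attacker)
{
    return 8 * victim - attacker;
}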
If you count the number of fail highs on the first move searched, most engines average over 90%. That means it is very likely the hash move will fail high, at least when you are at a CUT node. I don't remember what this saved, but I could probably measure it by just forcing a move generation at the top of NextMove() rather than waiting until after the hash move has been searched. I'd expect at least a 5% saving, and it is probably more. I'll try to run this in a bit to see...

Tried it for one test position, speed dropped from 6.9M nodes per second to 6.4M, using one CPU for repeatability...
jdart
Posts: 4367
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: lazy eval

Post by jdart »

Yes, when I had lazy eval, I passed in the bounds.

But if you call it multiple times with different bounds, you can't cache and re-use the result, and this negates some of the benefit of lazy eval. Now I don't use it, but the result of eval is cached (and stored in the hash table).
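
Something like this, roughly (a hedged sketch, invented names; not Arasan's actual layout):

Code: Select all

#include <stdint.h>

typedef struct Position Position;   /* opaque for this sketch */
int FullEval(const Position *);     /* exact, bounds-free eval */
uint64_t PosKey(const Position *);

enum { EVAL_NONE = 32767 };         /* sentinel: no cached eval */

/* Illustrative TT entry carrying a cached static eval alongside
   the search score. */
typedef struct {
    uint64_t key;
    int16_t  score;        /* search score */
    int16_t  static_eval;  /* exact eval, reusable at any bounds */
    uint8_t  depth, bound;
    uint16_t move;
} TTEntry;

int CachedEval(const Position *pos, const TTEntry *e)
{
    if (e->key == PosKey(pos) && e->static_eval != EVAL_NONE)
        return e->static_eval;  /* hit: no recomputation */
    return FullEval(pos);       /* miss: compute (stored on TT store) */
}

Because the cached value is the exact eval rather than a lazy bound, it stays valid no matter what bounds the current node is searched with.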

--Jon
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: What would you do differently

Post by Rein Halbersma »

mar wrote:
Rein Halbersma wrote:Ah, but what every chess programmer calls a hash table is not the same thing as std::unordered_map or boost::map. The latter do dynamic allocation of new nodes and expand automatically. What chess programs use is more appropriately known as a bounded cache (N-way associative for N-element bucket replacement)!
Really? :lol: C'mon Rein, you completely twisted what I wrote! We both know what we were talking about; it was not about the transposition table.
No twisting intended: I simply replied to your HashMap example, not to your general remark. I didn't know about your clever implementation of a dynamic hash map, very nice! I thought all chess engines used a fixed-size TT with a replacement scheme. It seems to me that a dynamic hash table can eat up all RAM when searching for a few hours (correspondence chess). How do you guard against this in your engine?
Anyway, I already polluted Lucas' thread more than I intended to... so I'll stop here :)


Actually my main remarks were on topic: data layout and search function complexity. What are your ideas about that?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: lazy eval

Post by bob »

jdart wrote:Yes, when I had lazy eval, I passed in the bounds.

But if you call it multiple times with different bounds, you can't cache and re-use the result, and this negates some of the benefit of lazy eval. Now I don't use it, but the result of eval is cached (and stored in the hash table).

--Jon
I cache pawn scoring. That is really all. I don't cache the eval because I don't hash in the q-search. I used to, many years ago, but testing showed (for my code anyway) that removing q-search hashing sped up the search enough to offset the loss from not hashing q-search results or evaluation results. Since it was a wash, I went with not hashing, because it greatly reduces the pressure on the hash table.
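
In outline, that cache can be as simple as this (a hedged sketch, invented names; not Crafty's actual structure):

Code: Select all

#include <stddef.h>
#include <stdint.h>

typedef struct Position Position;    /* opaque for this sketch */
uint64_t PawnKey(const Position *);  /* Zobrist key over pawns only */

/* Pawn-score cache entry.  The pawn structure changes rarely,
   so hit rates are typically very high. */
typedef struct {
    uint64_t key;
    int16_t  mg_score, eg_score;
} PawnEntry;

void EvaluatePawns(const Position *, PawnEntry *);  /* fills scores on a miss */

int PawnScore(const Position *pos, PawnEntry *table, size_t size)
{
    PawnEntry *e = &table[PawnKey(pos) & (size - 1)]; /* size a power of two */
    if (e->key != PawnKey(pos)) {
        e->key = PawnKey(pos);
        EvaluatePawns(pos, e);
    }
    return e->mg_score;   /* or interpolate mg/eg by game phase */
}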
mar
Posts: 2559
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: What would you do differently

Post by mar »

Rein Halbersma wrote:No twisting intended: I simply replied to your HashMap example, not to your general remark. I didn't know about your clever implementation of a dynamic hash map, very nice! I thought all chess engines used a fixed-size TT with a replacement scheme. It seems to me that a dynamic hash table can eat up all RAM when searching for a few hours (correspondence chess). How do you guard against this in your engine?
Just to clean up the confusion: I was talking about general purpose hash map (associative array) all the time.
Of course I'm using a preallocated fixed-size TT; anything else would be a performance killer and a waste of precious RAM :)
Actually my main remarks were on topic: data layout and search function complexity. What are your ideas about that?
I know, I was talking about my replies :) My engine isn't particularly fast, I never bothered with micro-optimizations.
My take on cache-friendliness (in general) is to pack data together (prefer vector over list etc. if you want dynamic containers), i.e. avoid dynamic allocations for small elements and (multiple) indirection via pointers. But I guess that's obvious.
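
To make both points concrete, something like this (a hedged sketch with invented names): one contiguous allocation up front, power-of-two sizing, no per-entry allocation or pointer chasing.

Code: Select all

#include <stdint.h>
#include <stdlib.h>

/* Illustrative fixed-size transposition table.  One flat,
   preallocated array; indexing is a mask of the key; replacement
   overwrites in place.  No malloc per node, no linked lists. */
typedef struct {
    uint64_t key;
    int16_t  score;
    uint8_t  depth, bound;
    uint16_t move;
} TTEntry;

typedef struct {
    TTEntry *slots;
    size_t   mask;    /* slot count - 1, count a power of two */
} TT;

int tt_init(TT *tt, size_t bytes)   /* returns 1 on success */
{
    size_t n = 1;
    while (n * 2 * sizeof(TTEntry) <= bytes)
        n *= 2;
    tt->slots = calloc(n, sizeof(TTEntry));
    tt->mask  = n - 1;
    return tt->slots != NULL;
}

TTEntry *tt_slot(TT *tt, uint64_t key)
{
    return &tt->slots[key & tt->mask];  /* always-replace, 1 entry/bucket */
}

Nothing grows during search, so memory use stays bounded however long the search runs, which also answers the correspondence-chess concern above.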
mvk
Posts: 589
Joined: Tue Jun 04, 2013 10:15 pm

Re: lazy eval

Post by mvk »

jdart wrote:Yes, when I had lazy eval, I passed in the bounds.

But if you call it multiple times with different bounds, you can't cache and re-use the result. And this negates some of the benefit of lazy eval. Now I don't use it but the result of eval is cached (and stored in the hashtable).

--Jon
I see lazy eval as a pitfall. If you find it helps a lot, you are searching the wrong part of the tree and should work on that first. And then it keeps working against you when tuning the evaluation weights, and the tree shaping in general because of the lost fidelity: every change in the right direction now first must overcome a barrier raised by this premature optimisation. Development-wise, it therefore tends to keep you locked in a local optimum of program strength. I still have it in the code, but with huge bounds so it is effectively off. This might be different if you're within 30 elo of the top and must scrape the barrel.
[Account deleted]
BeyondCritics
Posts: 396
Joined: Sat May 05, 2012 2:48 pm
Full name: Oliver Roese

Re: What would you do differently

Post by BeyondCritics »

I have downloaded your test code and tried it, and I can confirm your results. Actually, you were being modest: with other parameters I got a speed ratio as high as 5 to 1 (!).

It is true that the STL library code gives strong exception safety, but that alone cannot explain that whopping difference. Besides that, your code is tight, but there is nothing exceptional about it.

You are using somewhat more memory. I corrected for that with

Code: Select all

m1.max_load_factor(1);
m1.reserve(4096);
but it is then still twice as slow.

You are using a fast modulo implementation, simply cutting off the higher bits. I "corrected" for that, replacing

Code: Select all

size_t index = h & ((size_t)sz-1);
with

Code: Select all

size_t index = h % ((size_t)sz);
but it is still roughly two times slower.

When I trace into the STL library code I encounter lots of ugly type casts and indirections. Maybe they are just paying their heavy abstraction penalty for that.

Very instructive.
emadsen
Posts: 434
Joined: Thu Apr 26, 2012 1:51 am
Location: Oak Park, IL, USA
Full name: Erik Madsen

Re: What would you do differently

Post by emadsen »

If you were restarting from scratch, what would you do differently?
Hey Lucas. I just started rewriting MadChess using procedural code. To date, I've written it using object-oriented code, which I know performs worse. But I wanted to establish a benchmark using a familiar coding style, which I have: ~2250 ELO.

I'm still coding using C# but eliminating the OO abstractions and focusing on performance instead of code readability. I'm still using mailbox board representation because, quite frankly, I don't understand bitboards and all their clever bit-shifting. And I don't want to copy-paste code I don't understand.

Maybe one of these days I'll graduate to pure C and bitboards, and actually understand the code :). For now, I'll just make MadChess procedural.
My C# chess engine: https://www.madchess.net