Incremental or non-incremental PST evaluation calcs


Post by hgm »

Sven Schüle wrote:Incrementally updating also means subtracting old + adding new scores, while "from scratch" only involves additions.
I would think that makes for pretty simple arithmetic: an incremental update takes only three operations (from, to and capture). In any meaningful position there would be at least 6 pieces, which would already require double the number of additions when doing it from scratch, and in a more typical case (where the game is decided!) 25-30 pieces. Then take into account that the incremental code is completely branchless, while the from-scratch code has to loop over piece lists of undetermined length, needs to retrieve the piece location before it can index the PST with it, and has to test the pieces for not being captured and branch based on that... Based on that, it seems to me there is no room for the slightest amount of doubt that incremental is at least one order of magnitude faster.

Btw, I always use copy-make for this, as copy is exactly as expensive as update. The data has to be loaded into CPU registers to do the calculation anyway, and it does not matter whether you store it to the same memory location or a different one. But with copy-make the unmake is absolutely free. One or two extra ints do not cause significant memory pressure.
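
For concreteness, a minimal sketch of such a copy-make incremental update, in C (the State layout, the signed pst[][] convention and the move arguments are assumptions for illustration, not code from any particular engine):

Code:

    typedef struct {
        int pstScore;              /* running PST total, White minus Black */
        /* ... whatever else is kept incrementally ... */
    } State;

    extern int pst[12][64];        /* assumed: signed entries, positive for White pieces,
                                      negative for Black, so one running total suffices */

    /* copy-make: write the updated state into the next ply's slot; UnMake is then free */
    void MakeMovePst(const State *old, State *nxt, int piece, int from, int to, int victim)
    {
        nxt->pstScore = old->pstScore
                      + pst[piece][to] - pst[piece][from];   /* mover: from -> to */
        if (victim >= 0)                                     /* assumed: victim < 0 = no capture */
            nxt->pstScore -= pst[victim][to];                /* remove the captured piece */
    }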

I agree it is the QS branching ratio that counts. But it seems to me that needing the PST term in lazy eval only drives up the effective branching ratio of QS, as all moves that are pruned based on lazy eval still count as nodes.
Caching effects could favor the "from scratch" implementation even further, depending on actual data structures.
I don't see that at all. I would expect the opposite. Why would making tons of unnecessary data accesses be good for caching? You want to freeze the PST in cache? Even if that would help, there would be far more efficient methods to do this. You could for instance start every node by doing exactly one dummy read operation on each cache line of the PST. That would at least eliminate unnecessary multiple accesses to the same cache line.
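
That dummy-read idea could be as simple as the loop below (the 64-byte cache line size and the 16-bit PST layout are assumptions; whether it actually helps would have to be measured):

Code:

    extern short pst[12][64];                 /* assumed layout: 12*64*2 bytes = 24 cache lines */

    static inline void TouchPst(void)
    {
        const volatile char *p = (const volatile char *)pst;
        for (unsigned i = 0; i < sizeof pst; i += 64)
            (void)p[i];                       /* one read per 64-byte line, nothing more */
    }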

Post by Don »

RoadWarrior wrote:
bob wrote:Incremental can begin to not pay off if you go deep. If the tips are too far from the root, then the incremental cost can exceed the total eval cost if computed at the endpoints.
Is this correct? Given, say, a minimal branching factor of 3, the ratio between the total number of nodes and the number of leaf nodes barely alters at all between ply 4 and ply 15.

Once I've expanded my evaluation function to something decent (and added the search), I think it would be worth running a self-play tournament between the incremental and non-incremental versions to see if there's any significant difference. Given your experience, I suspect you're right. But I'd like to prove it for myself. :)
I think every program does piece-square table evaluation, but most modern programs do not weight it heavily and only use it for the most general stuff. For example, you don't want knights on the edge of the board, and that can easily be expressed with piece-square tables. It's best to avoid anything that is not really general in nature.

In the old days we used piece-square tables as a major component of the evaluation function, but they were pre-processed. Before the search began you could "figure out" where to place your pieces based on the root position. For example, it's easy to see where a pawn would be passed, and you can give it big weights on those squares. The downside, of course, is that a pawn might become passed on a less advanced square if something in the position changes, but at least you get some of it right. In RexChess I think we calculated mobility for each square based on the current pawn structure, and you can do things like give a penalty if a pawn appears on a less advanced square than an existing pawn.
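
As a rough illustration of that kind of root-time pre-processing (the board encoding, the simplified passed-pawn test and the bonus values below are all made up; the real RexChess code certainly looked different):

Code:

    int pawnPst[64];   /* White's pawn PST, rebuilt at the root before every search */

    /* board[]: assumed 0 = empty, 1 = white pawn, 2 = black pawn, ... */
    void PreprocessPawnPst(const int board[64])
    {
        for (int sq = 0; sq < 64; sq++) {
            int file = sq & 7, rank = sq >> 3;    /* rank 0 = White's first rank */
            int passed = 1;
            /* a white pawn on sq would be passed if no black pawn is in front of it
               on the same or an adjacent file (simplified root-position test) */
            for (int f = file - 1; f <= file + 1; f++) {
                if (f < 0 || f > 7) continue;
                for (int r = rank + 1; r < 7; r++)
                    if (board[8*r + f] == 2) passed = 0;
            }
            pawnPst[sq] = 4 * rank;               /* generic "push pawns" term */
            if (passed)
                pawnPst[sq] += 10 * rank;         /* big weight on would-be passed squares */
        }
        /* Black gets the mirrored table; omitted here. */
    }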

All of that worked ok when programs rarely did more than 5 or 6 ply searches.

I remember trying to teach the program to mate with BN vs K and I got it working perfectly only to find that it didn't work at deeper levels! That was a convincing demonstration to me of the problems with pre-processed piece square tables.

Post by Don »

RoadWarrior wrote:
bob wrote:Incremental can begin to not pay off if you go deep. If the tips are too far from the root, then the incremental cost can exceed the total eval cost if computed at the endpoints.
Is this correct? Given, say, a minimal branching factor of 3, the ratio between the total number of nodes and the number of leaf nodes barely alters at all between ply 4 and ply 15.

Once I've expanded my evaluation function to something decent (and added the search), I think it would be worth running a self-play tournament between the incremental and non-incremental versions to see if there's any significant difference. Given your experience, I suspect you're right. But I'd like to prove it for myself. :)
Beware. As soon as you implement alpha/beta pruning, your nodes per second will take a big hit. Part of making the program stronger will be to recover some of that with specialized move generators and such. The reason is that the ratio of work spent generating moves versus making them will change: you will spend a higher percentage of your effort generating moves.

Post by Sven »

hgm wrote:
Sven Schüle wrote:Incrementally updating also means subtracting old + adding new scores, while "from scratch" only involves additions.
I would think that makes pretty simple arithmetic: incremental update takes only three operations (from, to and capture).
Depends on what you consider as an "operation", I guess in the given case you mean the sum of memory read + addition/subtraction + memory write, correct? Then note, however, that the "from scratch" method needs only one such "operation" per piece since it does not deal with captures nor with from-to, just with one piece on one square at a time.
hgm wrote:In any meaningful position there would be at least 6 pieces, which would already require double the amount of additions when doing it from scratch, and in a more typical case (where the game is decided!) 25-30 pieces.
I don't get that last point (25-30 pieces = "game decided" ?!) but I think it isn't important. Hard to say what a "typical" position is but let's say it has around 16-20 pieces on average.
hgm wrote:Then take into account that the incremental code is completely branchless, while the from-scratch code has to loop over piece lists of undetermined length, needs to retreive the piece location before it can index the PST with it, has to test the pieces for not being captured and branch based on that...
I disagree since that will highly depend on the board representation. For instance, a bitboard engine with redundant 64-byte board does one loop per color over the whole "occupied" bitboard, and no "testing for being captured". And even a mailbox program can have one piece list per color with fixed length (requiring the "being already captured" test) or one piece list per color with dynamic length where only non-captured pieces are listed.
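
A sketch of the bitboard variant described here (the per-color occupancy bitboards, the redundant 64-entry board and the signed pst[][] are assumed data structures; __builtin_ctzll just stands in for whatever bit-extraction the engine uses):

Code:

    #include <stdint.h>

    extern uint64_t occupied[2];        /* assumed: one occupancy bitboard per color */
    extern unsigned char board64[64];   /* assumed: redundant board, piece code 0..11 per square */
    extern int pst[12][64];             /* signed: White positive, Black negative */

    int PstFromScratch(void)
    {
        int score = 0;
        for (int color = 0; color < 2; color++) {
            uint64_t bb = occupied[color];
            while (bb) {
                int sq = __builtin_ctzll(bb);     /* lowest occupied square */
                score += pst[board64[sq]][sq];    /* one add per piece, no 'captured?' test */
                bb &= bb - 1;                     /* clear that bit */
            }
        }
        return score;
    }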
hgm wrote:Based on that it seems to me that there is no room for the slightest amount of doubt that incremental is at least one order of magnitude faster.
Well, from my side of the door that room is quite large :-)
hgm wrote:Btw, I always use copy-make for this, as copy is exactly as expensive as update. The data has to be loaded into CPU registers to do the calculation anyway, and it does not matter whether you store it to the same memory location or a different one. But with copy-make the unmake is absolutely free. One or two extra ints does not cause significant memory pressure.
100% agreed!
hgm wrote:
Caching effects could favor the "from scratch" implementation even further, depending on actual data structures.
I don't see that at all. I would expect the opposite. Why would making tons of unnecessary data accesses be good for caching? You want to freeze the PST in cache? Even if that would help, there would be far more efficient methods to do this. You could for instance start every node by doing exactly one dummy read operation on each cache line of the PST. That would at least eliminate unnecessary multiple accesses to the same cache line.
I don't follow that "tons of unnecessary data accesses". When calculating the PST scores from scratch you access exactly the same data as in the incremental case: the board itself, and the PST table(s). But while in the incremental case most of the data you access has already gone from the cache when doing the next update with the next move, in the "from scratch" case you have all accesses close together in a loop, and can organize it in a cache-friendly way - how exactly you do that will depend on your board repr. I did not intend to say that I want to "freeze" the PST in cache but it should be possible to do something similar only for the short period of doing that "from scratch" PST score calculation loop (that approach might actually almost never access the PST at any other point in time).

So again, the winner is "it depends" ;-) Maybe also the "tweaking abilities" of the programmer will have a great influence.

Sven

Post by bob »

hgm wrote:
bob wrote:With todays programs, the trees are deep and narrow. Which is exactly the circumstance that doesn't favor incremental calculations. The old-style wide/bushy trees were quite good for incremental update, but I don't think it is nearly as clear an advantage today...

Worth actually quantifying at some point...
They are not that narrow: a branching ratio of 2 only doubles the incremental work per end leaf from 1 to 2 updates, as all earlier updates are shared between 2, 4, 8... nodes. And that is only infinitesimally smaller than double for a tree that is not infinitely deep, so depth has certainly nothing to do with it.

Only when the EBF is smaller than 1.125 would you get a substantial difference in the workload of incremental updating in an 8-ply tree,
Right. But I am talking about 24-30 ply searches in typical CCT-like tournaments...

I am not claiming incremental is bad, or that it is not good. Just that the gains I saw back in the 70's and 80's have not been seen in the cases where I tried some incremental updates in recent years...

Post by Sven »

bob wrote:
hgm wrote:
bob wrote:With todays programs, the trees are deep and narrow. Which is exactly the circumstance that doesn't favor incremental calculations. The old-style wide/bushy trees were quite good for incremental update, but I don't think it is nearly as clear an advantage today...

Worth actually quantifying at some point...
They are not that narrow: a branching ratio of 2 only doubles the incremental work per end leaf from 1 to 2 updates, as all earlier updates are shared between 2, 4, 8... nodes. And that is only infinitesimally smaller than double for a tree that is not infinitely deep, so depth has certainly nothing to do with it.

Only when the EBF is smaller than 1.125 would you get a substantial difference in the workload of incremental updating in an 8-ply tree,
Right. But I am talking about 24-30 ply searches in typical CCT-like tournaments...
HGM's point was this:

- You are at a node X where you call eval() (which is somewhere within QS most of the time). The effort for the incremental PST update done during the previous move (the move made from the node 1 ply above X) counts 100% towards that eval call.

- Assuming EBF=2 (actually it is smaller in QS, which makes HGM's point somewhat weaker), the PST update during the next-to-previous move (2 plies above X) counts only 50% towards eval() at X, since on average that update is shared with a sibling of X.

- The PST update done 3 plies above X counts 25%, and so on.

In total this is ~200% for EBF=2 (300% for EBF=1.5, 500% for EBF=1.25 - is that even possible in the region of only QS and "low depth" nodes?), and it is independent of the actual path length from root to X, due to the "sharing".
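
Written as a formula (same numbers as above), the incremental updates charged to a single eval, for an effective branching factor b, form a geometric series:

    \sum_{k=0}^{\infty} b^{-k} \;=\; \frac{1}{1 - 1/b} \;=\; \frac{b}{b-1}

which gives 2 (200%) for b = 2, 3 (300%) for b = 1.5, and 5 (500%) for b = 1.25, independent of the depth of X.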

Sven

Post by wgarvin »

Sven Schüle wrote:
hgm wrote:
Caching effects could favor the "from scratch" implementation even further, depending on actual data structures.
I don't see that at all. I would expect the opposite. Why would making tons of unnecessary data accesses be good for caching? You want to freeze the PST in cache? Even if that would help, there would be far more efficient methods to do this. You could for instance start every node by doing exactly one dummy read operation on each cache line of the PST. That would at least eliminate unnecessary multiple accesses to the same cache line.
I don't follow that "tons of unnecessary data accesses". When calculating the PST scores from scratch you access exactly the same data as in the incremental case: the board itself, and the PST table(s). But while in the incremental case most of the data you access has already gone from the cache when doing the next update with the next move, in the "from scratch" case you have all accesses close together in a loop, and can organize it in a cache-friendly way - how exactly you do that will depend on your board repr. I did not intend to say that I want to "freeze" the PST in cache but it should be possible to do something similar only for the short period of doing that "from scratch" PST score calculation loop (that approach might actually almost never access the PST at any other point in time).

So again, the winner is "it depends" ;-) Maybe also the "tweaking abilities" of the programmer will have a great influence.

Sven
All of that data will stay in the L1 cache 100% of the time while your engine is searching, unless there is something seriously wrong with the design of your program.

My guess is that incremental would be slightly cheaper overall, but maybe the only way to be sure for a given engine is to test both methods in it.

Post by hgm »

Sven Schüle wrote:Depends on what you consider as an "operation", I guess in the given case you mean the sum of memory read + addition/subtraction + memory write, correct? Then note, however, that the "from scratch" method needs only one such "operation" per piece since it does not deal with captures nor with from-to, just with one piece on one square at a time.
Indeed, I took that into account. I need 3 operations in total, and you need 1 per piece. With 30 pieces that is an order of magnitude. But I believe my operations are significantly faster, because you have loop overhead, branch overhead, extra table accesses to get the indexes into the PST...
I don't get that last point (25-30 pieces = "game decided" ?!) but I think it isn't important. Hard to say what a "typical" position is but let's say it has around 16-20 pieces on average.
Seems a bit low to me, but even if true, it is a lot larger than 3.
I disagree since that will highly depend on the board representation. For instance, a bitboard engine with redundant 64-byte board does one loop per color over the whole "occupied" bitboard, and no "testing for being captured". And even a mailbox program can have one piece list per color with fixed length (requiring the "being already captured" test) or one piece list per color with dynamic length where only non-captured pieces are listed.
Indeed. And compacting the piece list on every MakeMove (in QS...) takes an order of magnitude more time than an incremental PST update. And extracting bits from the bitboard (negate, AND, complement, AND, multiply, table index) to get the piece number already took longer than the PST update. (And most of it is on the critical path of a loop with many iterations!)
Well, from my side of the door that room is quite large :-)
Well, my statement stands. I challenge you to come up with actual code, for any board representation you think optimal, that would do the from-scratch summation in fewer than 32 clocks for a board with 16 pieces. (Knowing that the incremental way of doing it would take three memory loads from the PST, indexed by the square numbers which are already in registers, as you have just been using them to make the move on the board, three add/subtract ALU operations, and one store of the result, i.e. 7 instructions, or 2.33 CPU clocks at 3 instructions per clock.) This actually tempts me to up the ante to two orders of magnitude, but for now I will resist that.
I don't follow that "tons of unnecessary data accesses". When calculating the PST scores from scratch you access exactly the same data as in the incremental case: the board itself, and the PST table(s). But while in the incremental case most of the data you access has already gone from the cache when doing the next update with the next move, in the "from scratch" case you have all accesses close together in a loop, and can organize it in a cache-friendly way - how exactly you do that will depend on your board repr. I did not intend to say that I want to "freeze" the PST in cache but it should be possible to do something similar only for the short period of doing that "from scratch" PST score calculation loop (that approach might actually almost never access the PST at any other point in time).
For the purpose of keeping the PST cached, most of those accesses would be redundant, as you would be making multiple accesses to the same cache line in the loop, where only one would have sufficed to refresh the line's 'least-recently used' status. And I think the whole line of argument is suspect, as when you keep something in cache, you keep something else out!

You also seem to overlook an important fact in your answer to Bob:

When you want to focus on QS, to argue for a very small branching factor, you are talking about nodes where you would do the from-scratch PST evaluation not only in the leaves, but in every node. So where the effective number of incremental updates would explode as the sum of a geometric series 1/(1-1/EBF) when EBF->1, the number of from-scratch evaluations would do exactly the same. In every QS node where I would have done an incremental update in the MakeMove leading to it, you would have to do a from-scratch evaluation. So in this case they would still have to be compared 1 on 1.

Post by kbhearn »

You may well be looping over each piece list in your eval function for one reason or another anyways - for instance to generate mobility scores for each piece. Adding in the PST value at the same time as you select each piece to check mobility removes most of the overhead you want to claim is extra to calculating from scratch.
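
Roughly what that folding-in looks like, assuming a fixed-length piece list per color and some Mobility() term the eval loop computes anyway (all names here are hypothetical):

Code:

    extern int pieceList[2][16];        /* assumed: square of each piece, -1 if captured */
    extern unsigned char board64[64];   /* piece code per square */
    extern int pst[12][64];
    extern int Mobility(int piece, int sq);   /* hypothetical, already part of eval */

    int EvalPieces(int color)
    {
        int score = 0;
        for (int i = 0; i < 16; i++) {
            int sq = pieceList[color][i];
            if (sq < 0) continue;                /* already captured */
            int piece = board64[sq];
            score += Mobility(piece, sq);        /* the work the loop exists for anyway */
            score += pst[piece][sq];             /* PST added at marginal extra cost */
        }
        return score;
    }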

You could also argue that the pawn PST is irrelevant to this comparison since you'd want to do more advanced calculations on the pawn structure anyways, and quite possibly hash the result. Whatever value the PST is adding there could either be rolled into the hashed pawn calculation or substituted with alternative setwise bonuses/penalties.
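
A sketch of rolling the pawn PST into a hashed pawn evaluation (the entry layout, table size and the two helper functions are assumptions):

Code:

    #include <stdint.h>

    typedef struct {
        uint64_t key;
        int      structureScore;    /* doubled / isolated / passed terms, etc. */
        int      pawnPstScore;      /* pawn PST summed from scratch, cached with the entry */
    } PawnHashEntry;

    #define PAWN_HASH_SIZE (1 << 16)
    extern PawnHashEntry pawnHash[PAWN_HASH_SIZE];

    extern int EvaluatePawnStructure(void);   /* hypothetical */
    extern int SumPawnPst(void);              /* hypothetical from-scratch pawn PST sum */

    int PawnEval(uint64_t pawnKey)
    {
        PawnHashEntry *e = &pawnHash[pawnKey & (PAWN_HASH_SIZE - 1)];
        if (e->key != pawnKey) {              /* miss: recompute both terms once */
            e->key            = pawnKey;
            e->structureScore = EvaluatePawnStructure();
            e->pawnPstScore   = SumPawnPst();
        }
        return e->structureScore + e->pawnPstScore;   /* hit: the pawn PST comes for free */
    }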

Post by hgm »

You would still have to do it once for every piece. I don't think you would beat the 2.33 clocks that way. And even if it is only 1% of the total evaluation, rather than 0.1%, that is still an order of magnitude. That it is only a very small part of the evaluation in any case is not under discussion. And it is still no argument for intentionally throwing away that 1%.