Sven Schüle wrote:Depends on what you consider as an "operation", I guess in the given case you mean the sum of memory read + addition/subtraction + memory write, correct? Then note, however, that the "from scratch" method needs only one such "operation" per piece since it does not deal with captures nor with from-to, just with one piece on one square at a time.
Indeed, I took that into account. I need 3 operations in total, and you need 1 per piece. With 30 pieces that is an order of magnitude. But I believe my operations are significantly faster, because you have loop overhead, branch overhead, and extra table accesses to get the indexes into the PST...
I don't get that last point (25-30 pieces = "game decided" ?!) but I think it isn't important. Hard to say what a "typical" position is but let's say it has around 16-20 pieces on average.
Seems a bit low to me, but even if true, it is a lot larger than 3.
I disagree since that will highly depend on the board representation. For instance, a bitboard engine with redundant 64-byte board does one loop per color over the whole "occupied" bitboard, and no "testing for being captured". And even a mailbox program can have one piece list per color with fixed length (requiring the "being already captured" test) or one piece list per color with dynamic length where only non-captured pieces are listed.
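For concreteness, a from-scratch PST sum over a dynamic-length piece list (the last variant described above) might look like this. All names here (pieceType, pieceSq, nPieces, PST) are illustrative, not taken from any actual engine:

```c
/* Hypothetical layout: a per-color piece list of dynamic length,
 * where only non-captured pieces appear, so no "already captured"
 * test is needed inside the loop. */
int pst_from_scratch(const int pieceType[], const int pieceSq[],
                     int nPieces, const int PST[8][64])
{
    int score = 0;
    for (int i = 0; i < nPieces; i++)            /* one load + add per piece */
        score += PST[pieceType[i]][pieceSq[i]];
    return score;
}
```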
Indeed. And compacting the piece list on every MakeMove (in QS...) takes an order of magnitude more time than an incremental PST update. And extracting bits from the bitboard with negate, and, complement, and, multiply, table index, to get the piece number was already longer than the PST update. (And most of it is on the critical path of a loop with many iterations!)
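That negate / and / multiply / table-index sequence is the classic de Bruijn bitscan. A sketch of the extraction loop (the de Bruijn constant is the well-known 64-bit one; the table is built at startup here rather than hardcoded):

```c
#include <stdint.h>

#define DEBRUIJN64 0x03f79d71b4cb0a89ULL  /* a 64-bit de Bruijn sequence */
static int index64[64];

/* fill the lookup table once at startup */
void init_bitscan(void)
{
    for (int i = 0; i < 64; i++)
        index64[((1ULL << i) * DEBRUIJN64) >> 58] = i;
}

/* extract the square numbers of all set bits in bb; note that the
 * negate/and/multiply/shift/lookup chain is a serial dependency on
 * the critical path of every iteration */
int squares_of(uint64_t bb, int sq[64])
{
    int n = 0;
    while (bb) {
        uint64_t lsb = bb & (0 - bb);                 /* negate + and: isolate LSB */
        sq[n++] = index64[(lsb * DEBRUIJN64) >> 58];  /* multiply + table index    */
        bb &= bb - 1;                                 /* clear the handled bit     */
    }
    return n;
}
```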
Well, from my side of the door that room is quite large.
Well, my statement stands. I challenge you to come up with actual code for any board representation you think optimal which would do the from-scratch summation faster than 32 clocks for a board with 16 pieces. (Knowing that the incremental way of doing it would take three memory loads from the PST, indexed by the square numbers which are already in registers, as you have just been using them to make the move on the board, three add / subtract ALU operations, and 1 store of the result, i.e. 7 instructions, or 2.33 CPU clocks at 3 instructions per clock.) This actually tempts me to up the ante to two orders of magnitude, but for now I will resist that.
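The incremental update counted above can be sketched like this, for the capture case with three PST terms. Names are illustrative only; the square numbers are assumed to be in registers already because the caller just used them to make the move:

```c
/* Hypothetical incremental PST update for a capture move: three loads
 * from the PST, three add/subtract ALU operations; the caller stores
 * the resulting score. */
static inline int pst_update_capture(int score, const int PST[8][64],
                                     int mover, int from, int to,
                                     int victim, int captSq)
{
    score -= PST[mover][from];    /* moved piece leaves its old square  */
    score += PST[mover][to];      /* ... and arrives on the new one     */
    score -= PST[victim][captSq]; /* captured piece disappears          */
    return score;
}
```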
I don't follow that "tons of unnecessary data accesses". When calculating the PST scores from scratch you access exactly the same data as in the incremental case: the board itself, and the PST table(s). But while in the incremental case most of the data you access has already gone from the cache when doing the next update with the next move, in the "from scratch" case you have all accesses close together in a loop, and can organize it in a cache-friendly way; how exactly you do that will depend on your board representation. I did not intend to say that I want to "freeze" the PST in cache, but it should be possible to do something similar only for the short period of doing that "from scratch" PST score calculation loop (that approach might actually almost never access the PST at any other point in time).
For the purpose of keeping the PST in memory most of these accesses would be redundant, as you would be making multiple accesses to the same cache line in the loop, where only one would have sufficed to make the entry 'least-recently used'. And I think the whole line of arguing is suspect, as when you keep something in cache, you keep something else out!
You also seem to overlook an important fact in your answer to Bob:
When you want to focus on QS, to argue for a very small branching factor, you are talking about nodes where you would do the from-scratch PST evaluation not only in the leaves, but in every node. So where the effective number of incremental updates would explode as the sum of a geometric series 1/(1-1/EBF) when EBF->1, the number of from-scratch evaluations would do exactly the same. In every QS node where I would have done an incremental update in the MakeMove leading to it, you would have to do a from-scratch evaluation. So in this case they would still have to be compared 1 on 1.