hgm wrote: ↑Sat Apr 27, 2024 11:48 pm
Pali wrote: ↑Sat Apr 27, 2024 9:55 pm
You are... just wrong? Your assumption that I clear the transposition table between moves and your assumption that an entry is 16 bytes (it is 12 bytes) are both wrong, and despite underestimating my TT capacity, you still severely undershot the actual value.
That makes it worse, right? You did give the size in MB, not in entries, and I gave you the benefit of the doubt.
You were the one who asked for hash size, and hash size is measured in MB. I would have elaborated further if you had asked for the details of my TT implementation.
You didn't give me the benefit of the doubt. You picked a particularly large entry size and then claimed a 4% filling rate, which was massively lower than the actual values.
Here are the actual values calculated by the engine:
In STC games, the first move fills up 13.4% of the transposition table on average, with a maximum of 42.5% and a minimum of 4.5%. This variation obviously comes from time management.
That number is not really helpful without knowing how much time it took on average for this first move. (Many engines are programmed to take extra time on the first move out of book.) But 13.4% is still quite a small filling fraction.
"In STC games", equivalent to wtime 8000 winc 80. The latter parts of this reply explain why this statistic is useless and can't be measured accurately.
The parenthetical does not apply to me; the TM algorithm is strictly position/move agnostic.
Your sense of "small", or anyone else's for that matter, is irrelevant to strength.
In an STC game, the current generation of entries (entry.age == current_age) can take up to 40% of the 8 MB transposition table.
In an LTC game, the current generation of entries can take up to 27% of the 64 MB transposition table.
Again, 'up to' isn't really very elucidating. It makes a huge difference whether that happens every two moves or every 100 moves, and whether the other moves then fill 30% or 10%. The important parameter here is the fraction of the nodes reported by the search that creates a new entry in the table (rather than hitting upon an entry that was already there). And in iterative deepening, most of the nodes visited in earlier iterations (which with an EBF of 1.5 would be 2/3 of the total nodes) will be included in the tree of the final iteration.
It is useful: it means that on average at least one move wrote to 40% of the transposition table. You are not reading: I am measuring the newly created entries; that is what "the current generation of entries" means. Also, can you please stop with the maths that use made-up numbers? What an EBF of 1.5 does in a hypothetical situation is of zero interest to me.
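For clarity, the statistic is taken by counting entries whose age matches the current search generation. A minimal sketch of the idea, with illustrative names rather than my engine's actual code:

[code]
#include <cstddef>
#include <cstdint>
#include <vector>

struct TTEntry {
    uint64_t key;    // Zobrist key (or a slice of it)
    int16_t  score;
    uint8_t  depth;
    uint8_t  age;    // generation the entry was last written in
    // move, bound type, etc. omitted
};

// Fraction of the table occupied by entries written during the current
// generation, i.e. newly created (or overwritten) since the last root move.
double currentGenerationFill(const std::vector<TTEntry>& table, uint8_t currentAge) {
    std::size_t written = 0;
    for (const TTEntry& e : table)
        if (e.key != 0 && e.age == currentAge)
            ++written;
    return double(written) / double(table.size());
}
[/code]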
High filling fractions reached on moves that took far above average time are not really meaningful; the game will more likely be decided in the other moves. It is the average filling of the TT that matters.
They do matter; you are contradicting the point of time management, which is that engines ideally spend more time on moves that are decisive. Average filling does not matter: all TT entries have potential use as long as the halfmove clock is not reset. Again, I repeat: the TT is not cleared between moves.
In an LTC game, it takes around 8-9 moves to fill up 50% of the transposition table, and 14-15 moves to fill up 75%.
In an STC game, it takes around 5-6 moves to fill up 50% of the transposition table, and 9-10 moves to fill up 75%.
Both get to >90% and stay around there after move 30.
Of course the table will fill up. But useless (unreachable) entries will also stay around, and in 8-9 moves (16-18 ply) a lot will happen. Most of that 50% could very well be useless.
Entries are only guaranteed to become useless when a capture is made, or, for entries whose position has a pawn on a given square, when that pawn is moved. The transposition table is kept between plies for a reason: it is very reusable.
As much as I'd like to provide the number of "useful" entries, it is not possible to measure, as that would require a method that can prove whether a position is reachable from the current root.
Doesn't the search itself probe exactly that? I don't know how your aging works, but it is in principle possible to measure how many entries of the previous search (or any search before it) are probed by the current one, how many there were initially, and how many were overwritten before they were probed. That should give you a pretty good idea of the typical number of useful entries.
No, the search doesn't probe that. You can update the age of an entry when it gets probed again, but that is not an accurate measure. If a position is reachable, its entry is potentially useful, and a position stays reachable as long as the halfmove clock is not reset.
On top of all this, the 16-byte → 12-byte entry change is the second most recent merge, with the most recent being an NNUE update. This is being VERY charitable towards your position, as the assumptions you made for a practically smaller transposition table don't even hold for the one that is 33% larger.
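For reference, a 12-byte entry can be packed roughly like this; this is one possible layout with illustrative field widths, not necessarily my engine's exact one:

[code]
#include <cstdint>

#pragma pack(push, 1)  // ensure no padding between fields
struct TTEntry12 {
    uint32_t key32;    // upper 32 bits of the Zobrist key   (4 bytes)
    uint16_t move;     // packed best move                   (2 bytes)
    int16_t  score;    // search score                       (2 bytes)
    int16_t  eval;     // static evaluation                  (2 bytes)
    uint8_t  depth;    // search depth                       (1 byte)
    uint8_t  ageBound; // age and bound type packed together (1 byte)
};
#pragma pack(pop)

static_assert(sizeof(TTEntry12) == 12, "entry should be 12 bytes");
[/code]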
Even if the numbers are 33% larger, they are not indicative of any significant hash pressure. Even if the engine did use the same time for the first move as it on average does for the others, the typical fraction of the table written by a search would be 1.333 * 13.4% = 17.9%. Hash pressure is when that is several times larger than 100%, like it might be when someone wants to analyze a position for one hour.
Yes, they are in fact indicative of even less "hash pressure" than your assumptions suggested, and yet your formula still resulted in significantly lower numbers.
If someone is analyzing a position for one hour, they should perhaps consider using a larger Hash size? As I replied to Uri Blass, 10.6 MB gains over 8 MB under LTC conditions; it really is a small price to pay for someone who's willing to spend an hour of compute.
- How can you ever distinguish the performance of replacement schemes when testing under conditions where replacement virtually does not take place?
It does take place; this is how I was able to test and merge replacement scheme patches.
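To make concrete what kind of change gets tested there: a replacement decision along these lines, given as a generic depth-and-age-preferred sketch rather than the exact scheme that was merged:

[code]
#include <cstdint>

struct TTEntry {
    uint8_t depth;
    uint8_t age;
    // key, move, score, bound omitted for brevity
};

// Generic replacement decision: always overwrite entries left over from an
// older generation; within the current generation, prefer deeper searches.
bool shouldReplace(const TTEntry& existing, uint8_t newDepth, uint8_t currentAge) {
    if (existing.age != currentAge)     // stale entry from a previous search
        return true;
    return newDepth >= existing.depth;  // depth-preferred within this generation
}
[/code]

A candidate scheme is then compared against the current one the same way as any other patch: by playing games under SPRT.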
I kindly ask you not to make up numbers in future conversation, as I had to waste 2 hours of my time computing the actual values.
Waste???? You mean you were ignorant of what your engine does under the hood, and consider it a waste of time to learn that?
I am not ignorant of what my engine does under the hood; I programmed the entire thing. The waste here was that you didn't care to provide actual calculations, and I had to go through the effort of disproving made-up numbers that do not correspond to anything in terms of Elo.
I also kindly ask you not to change topics. This is not about whether my engine works under "severe hash pressure"; it's about the testing methods one should use. Many engines use TT entries directly in LMR, singular extensions, null-move pruning, internal iterative reductions, move ordering, and history updates. Depth/node comparisons, TT cut-off rates, and TT hit rates automatically break because of these direct interactions. They are not useful metrics, and you yourself do not use them any longer. So yes, I do think it is strange that you are suggesting others use a method you yourself stopped using precisely because it did not work.
The topic is related. The reason that what I suggested to the OP (and which would most likely work perfectly for him) does not work for your engine is that you do something that strikes me as very strange: in case of hash misses you appear to give up on the branch and hardly search it, as the reduction of the node count by a factor of 4 when running without a TT shows, rather than reconstruct the info that you did not get for free from the TT (which would have cost extra nodes) and then proceed as usual. Such an extreme difference in the depth at which you search side branches cannot be good, or it would basically not matter at all how you prune and reduce. So I wonder how you get away with this. By making the search so sensitive to whether you have a hash move or not, you seem to have tuned it for a very specific miss rate. That opens the possibility that you use the replacement scheme to tune the fraction of misses to some optimal value, rather than just minimizing it.
What you call very weird gains 10 Elo for my engine and is common practice. I suggest you try it in your own engine as well.
"That cannot be good" is an awful argument when put against at least 20 engines having SPRT'd this.
Please provide actual tests that support your points rather than trying to debate your way out of things that have been proven over and over again by different people all with statistically sound methods.
Chess performance is very measurable. If you think IIR "cannot be good", you are welcome to try and simplify it out of Stockfish, Berserk, Ethereal, RubiChess, Caissa, Obsidian, Seer, Alexandria... do I need to go on?
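For anyone following along, IIR in its common form is just a small depth reduction at nodes where the TT produced no move to try first; a generic sketch of the idea, not any particular engine's exact condition or thresholds:

[code]
// Internal Iterative Reductions (generic form): if the TT probe produced no
// hash move for this node, search it one ply shallower instead of spending
// extra effort to build good move ordering from scratch.
int iirAdjustedDepth(int depth, bool hasTTMove, bool isPvNode) {
    if (depth >= 4 && !hasTTMove && !isPvNode)  // thresholds vary by engine
        return depth - 1;
    return depth;
}
[/code]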
No one at any point in this thread has suggested an algorithm to use, as far as I am aware; I certainly haven't. The point of promoting good testing methods (i.e. SPRT) is that people can actually try something on their own and innovate, rather than get lost in metrics that do not correlate with strength.
Wasn't that your question then? Why I didn't just give him the algorithm that I found to be universally best in terms of optimizing hash hits, rather than telling him the method I used for finding that algorithm?
No, the question is why you gave him a metric that serves no purpose for developing an engine.
I will not be replying in this thread any longer unless:
- I am provided with evidence that IIR in fact loses Elo.
- You show that your replacement scheme testing method has a strong correlation with SPRT testing.
If either of these is provided, I'll be very willing to:
- Discuss why IIR is useful under my testing conditions and causes Elo losses under yours.
- Experiment with your method on my own engine and share my own findings.
Until then, I am not willing to continue this discussion, as I see no way of interpreting your arguments as good faith from this point on, considering I've done nothing but provide actual numbers while you've done nothing but provide verbal explanations that contradict the findings of many.
edit: please excuse the formatting - I will be fixing it
I gave up on fixing this; I have no idea how this website works.
[moderation] I 'wasted' 10 min to shape it up a bit.