Per-square MVV/LVA: it's nice but it doesn't work

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: MVV/LVA - SEE - test - final results

Post by bob »

hgm wrote:
bob wrote:I will stick with what certainly works...
Certainty comes at a price, and what "certainly works" might in practice be highly inferior to something that likely works, because the combined error bar of statistical and systematic errors can still be larger even when the error bar on the systematic error is zero.

This particular case is a very good example: you sticking to things that certainly work, lagging several years behind people with approximately 100 times smaller computational resources. And without those people, you would not even have known what to test, with your certainly working method.

I have absolutely no idea what you are talking about now. Since when have I _ever_ said that I didn't depend on ideas from others as well as on ideas of my own? I assume this pertains to the SEE + MVV/LVA discussion in some way? I don't see what algorithmic ideas have to do with testing methodologies, however. I've tried positions, and have now discovered that conclusions drawn from such testing don't match up well with testing in real games. I have (had) several pieces of code in Crafty that were better in some specific circumstance, whether tactical, positional, endgame, or tree size control. After testing with the current approach, I have ripped those ideas out, because while they were good in tactical suites, or in my "random position suite", or in an endgame suite, they were clearly hurting performance in real games: as I have removed things, our rating has steadily climbed over the past 6 months. Which convinces me my old "position" tests were not quite as good as I had thought.

But it is of course all their fault that you were wrong all that time: they were, after all, just superstitious morons who did not have the slightest idea what they were doing, and happened to pick a quite complicated and non-obvious move ordering by sheer luck. Now that you have proven to the world with certainty that it actually works, they will probably repent and abolish their misguided ways, giving you due credit for your brilliant discovery. :lol:
I proved that it works, to the tune of 7 Elo +/- 4. But keep your self-centered attitude intact; it doesn't matter to me. It is even more likely that the next "great idea" I test will go the other way. One classic was using history counters as in Fruit. This testing proved clearly that the history counter as implemented in Fruit is also NFG: it does absolutely nothing to make the program stronger, and can actually weaken it in some circumstances. But one thing is absolutely certain: when I test something that may help or hurt, I will know with certainty whether it is good or bad and by how much. I won't be guessing. Someone tried a different move ordering and it worked. There are lots of other ideas that have been tried that don't work, but nobody knows (yet).

I prefer to "shine a light" on these things and find out, now that I have a mechanism to do so, rather than to make posts such as yours that offer absolutely nothing of any use other than to try to insult. You should laugh out loud all you can. Most of it will ultimately be you laughing at yourself, however...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: MVV/LVA - SEE - test - final results

Post by bob »

mcostalba wrote:
hgm wrote:
bob wrote:Games expose you to _all_ types of positions, which any test set is almost guaranteed to not do...
They do if you randomly select them from games that you played before, e.g. for another purpose.

This is why I wrote "typical game positions". That is not the same as: "tactical positions".

That you can think of many other poor methods besides the one you used does not automatically mean that what I proposed above is also poor, or that the people who have used it in the past to post their conclusions here were "merely superstitious" or must have been "too clever".

When it is possible, I agree real games are the best thing. Unfortunately, not everybody has a cluster to test on, so trying to be _clever_ is a necessity, not a choice.

I have tried to test the ordering algorithm proposed here by H.G.Muller by measuring tree size at a given depth (BTW, thanks very much for the pseudo-code).

Ordering, IMHO, is suitable for tree-size testing; I would say it is the poster child for tree-size testing methods. Of course you cannot rely on tree-size methods for futility pruning or evaluation changes, because there the quality of the moves the engine chooses also counts, not only their number. But just changing the search order should not change the results... at least I hope :-)

I have tried different variations of the proposed algorithm on a set of quiet positions, i.e. positions where there is no winning move but which are quite typical and common in the middlegame.

In the end I found a variation that reduced the size of the tree by about 3-6%, depending on the depth. When I double-checked with real games, to my disappointment, I found the change in ordering failed to give any advantage at all.

I am still investigating....
Not that surprising. I have a bit more data on this subject now as well. I have several tests I can run. Most frequently, I run a 1-hour test (32,000 games in 1 hour) at a time control of 10 secs on clock + 10ms inc. That is what I used to measure the 7 +/- 4 Elo for the MVV/LVA + SEE ordering. However, I had some idle time on the cluster and re-ran at 1 min + 1 sec, which takes about 10 hours per run, and the two runs (one with new ordering, one with old SEE) were dead even. The new ordering was +1 Elo better, but with a +/- 4 error bar there is no confidence that it is better. And in thinking about it, it makes sense. I ran the test because I had questions about using very fast games to test a change that is a potential minor speedup. In fast games, minor speedups produce exaggerated performance increases. As the time stretches out, that difference begins to narrow. So the idea is still reasonable, but the actual Elo gain is probably zero.

Tracy could tell you horror stories about all the various ideas we have tried, and about the assumed-to-be-working ideas that we removed, because testing showed them all to be bad.

While I realize that this cluster approach is not for everybody, because of cost and availability, it is an incredible tool regardless of what 'others' might think.
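For anyone who wants to reproduce the kind of fixed-depth tree-size comparison Marco describes, here is a minimal sketch of such a harness. The engine hooks (setCaptureOrdering, searchToDepth) are hypothetical stand-ins, not any particular engine's API, and as the posts above show, a tree-size win at fixed depth does not guarantee an Elo win in real games.

Code: Select all

#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical hooks into an engine -- replace with your own engine's calls.
enum class Ordering { PureSEE, MvvLvaPlusSEE };
void setCaptureOrdering(Ordering) { /* placeholder */ }
std::uint64_t searchToDepth(const std::string& /*fen*/, int /*depth*/) { return 0; }

int main() {
    // A set of quiet middlegame positions (FEN strings would go here).
    std::vector<std::string> positions = { /* "r1bq1rk1/..." */ };
    const int depth = 12;

    std::uint64_t nodesOld = 0, nodesNew = 0;
    for (const std::string& fen : positions) {
        setCaptureOrdering(Ordering::PureSEE);        // baseline ordering
        nodesOld += searchToDepth(fen, depth);
        setCaptureOrdering(Ordering::MvvLvaPlusSEE);  // candidate ordering
        nodesNew += searchToDepth(fen, depth);
    }

    std::printf("baseline: %llu nodes, candidate: %llu nodes\n",
                (unsigned long long)nodesOld, (unsigned long long)nodesNew);
    if (nodesOld > 0)
        std::printf("tree size change: %+.2f%%\n",
                    100.0 * ((double)nodesNew - (double)nodesOld) / (double)nodesOld);
    return 0;
}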
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: MVV/LVA - SEE - test - final results

Post by mcostalba »

bob wrote:Most frequently, I run a 1 hour test (32,000 games in 1 hour) at a time control of 10 secs on clock + 10ms inc.
Please, what is the average depth per move in a midgame position that you reach with this time control?

Currently I do most of my tests with 1-minute games. It would be nice to reduce that, but because I have no experience with the reliability of results at ultra-fast time controls, I am a little bit scared to do so.

The absolute time control is not worth much by itself because it depends on the hardware, but if you tell me the average depth per move, I can tweak my time control to reach that on my poor notebook.

Thanks
Marco
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: MVV/LVA - SEE - test - final results

Post by bob »

For these fast games, it looks like 9-10 plies early, more later. For 30s + 100ms games, which take about 3-4 hours to run, I see 10-12 plies...

If we are doing "search improvements", I usually run a 12-hour test using 1m + 1s type time controls to run the depth up to the 14-ply range or so. For eval changes, we use the 1-hour test for "screening" and then run a 12-hour test for validation.

One major caution: I have found several cases where I tuned the eval and it did better at very fast games, but much worse at longer time controls. So while fast games are fine for initial screening, longer games are essential for accurate measurement, to make sure you haven't made things worse...
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: MVV/LVA - SEE - test - final results

Post by jwes »

bob wrote:For these fast games, it looks like 9-10 plies early, more later. For 30s + 100ms games, which take about 3-4 hours to run, I see 10-12 plies...

If we are doing "search improvements", I usually run a 12-hour test using 1m + 1s type time controls to run the depth up to the 14-ply range or so. For eval changes, we use the 1-hour test for "screening" and then run a 12-hour test for validation.

One major caution: I have found several cases where I tuned the eval and it did better at very fast games, but much worse at longer time controls. So while fast games are fine for initial screening, longer games are essential for accurate measurement, to make sure you haven't made things worse...
How would you deal with the opposite case: worse at very fast games but much better at longer time controls?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: MVV/LVA - SEE - test - final results

Post by bob »

bob wrote:For these fast games, it looks like 9-10 plies early, more later. For 30s + 100ms games, which take about 3-4 hours to run, I see 10-12 plies...

If we are doing "search improvements", I usually run a 12-hour test using 1m + 1s type time controls to run the depth up to the 14-ply range or so. For eval changes, we use the 1-hour test for "screening" and then run a 12-hour test for validation.

One major caution: I have found several cases where I tuned the eval and it did better at very fast games, but much worse at longer time controls. So while fast games are fine for initial screening, longer games are essential for accurate measurement, to make sure you haven't made things worse...
I noticed a small error above. My fast test is 10s + .1s, not 10s + 10ms. I had forgotten that the xboard protocol specifies times in units of .01 seconds.

So fast is 10s + .1s, next up is 30s + .5s, and the 12-hour test (32,000 games for all tests) is 1m + 1s.

I had the increment off in my previous post; I was thinking it was in ms units, when it is actually in units of 10ms...
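Since the mix-up above is purely a units issue (the protocol, as Bob notes, expresses times in hundredths of a second), here is a trivial sketch of the conversion; the helper name is invented for illustration.

Code: Select all

#include <cstdio>

// Per the post above, the protocol specifies times in units of 0.01 s
// (centiseconds), so a stored increment of 10 units is 0.1 s, not 10 ms.
double unitsToSeconds(long centiseconds) { return centiseconds / 100.0; }

int main() {
    long incrementUnits = 10;   // increment as stored in protocol units
    std::printf("increment = %.2f seconds\n", unitsToSeconds(incrementUnits));
    return 0;
}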
krazyken

Re: MVV/LVA - SEE - test - final results

Post by krazyken »

jwes wrote:
bob wrote:For these fast games, it looks like 9-10 plies early, more later. For 30s + 100ms games, which take about 3-4 hours to run, I see 10-12 plies...

If we are doing "search improvements", I usually run a 12-hour test using 1m + 1s type time controls to run the depth up to the 14-ply range or so. For eval changes, we use the 1-hour test for "screening" and then run a 12-hour test for validation.

One major caution: I have found several cases where I tuned the eval and it did better at very fast games, but much worse at longer time controls. So while fast games are fine for initial screening, longer games are essential for accurate measurement, to make sure you haven't made things worse...
How would you deal with the opposite case: worse at very fast games but much better at longer time controls?
Makes me wonder if it is worth using separate eval terms based on time control. It could be made a user-controlled switch to be put in the rc file.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: MVV/LVA - SEE - test - final results

Post by bob »

jwes wrote:
bob wrote:For these fast games, it looks like 9-10 plies early, more later. For 30s + 100ms games, which take about 3-4 hours to run, I see 10-12 plies...

If we are doing "search improvements", I usually run a 12-hour test using 1m + 1s type time controls to run the depth up to the 14-ply range or so. For eval changes, we use the 1-hour test for "screening" and then run a 12-hour test for validation.

One major caution: I have found several cases where I tuned the eval and it did better at very fast games, but much worse at longer time controls. So while fast games are fine for initial screening, longer games are essential for accurate measurement, to make sure you haven't made things worse...
How would you deal with the opposite case: worse at very fast games but much better at longer time controls?
This has happened, although I am sure we miss one here and there. If something is intuitively better but fails, I usually try both long and short games before I give up on it. Fortunately, most of the tests I have done carry over from long to short games and vice versa just fine. But some do not. In particular, I have found things that help in fast games and hurt in longer games. It is probably more common to find things that help in fast games and make no difference in longer games...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: MVV/LVA - SEE - test - final results

Post by bob »

krazyken wrote:
jwes wrote:
bob wrote:For these fast games, it looks like 9-10 plies early, more later. For 30s + 100ms games, which take about 3-4 hours to run, I see 10-12 plies...

If we are doing "search improvements", I usually run a 12-hour test using 1m + 1s type time controls to run the depth up to the 14-ply range or so. For eval changes, we use the 1-hour test for "screening" and then run a 12-hour test for validation.

One major caution: I have found several cases where I tuned the eval and it did better at very fast games, but much worse at longer time controls. So while fast games are fine for initial screening, longer games are essential for accurate measurement, to make sure you haven't made things worse...
How would you deal with the opposite case: worse at very fast games but much better at longer time controls?
Makes me wonder if it is worth using separate eval terms based on time control. It could be made a user-controlled switch to be put in the rc file.
I have known for years that you can tune specifically for a fast or slow game, or for a specific opponent, or whatever. I have not tried to do this intentionally, and indeed would consider most Crafty versions to be better as time gets longer, rather than in faster games...
Tord Romstad
Posts: 1808
Joined: Wed Mar 08, 2006 9:19 pm
Location: Oslo, Norway

Re: MVV/LVA - SEE - test - final results

Post by Tord Romstad »

michiguel wrote:
bob wrote:Here are the final results. Again, 22.6R01 is normal, 22.6R02 is the SEE + MVV/LVA capture ordering code.

Code: Select all

Name               Elo    +    - games score oppo. draws
Crafty-22.6R02-101  2600    5    4 31128   49%  2604   20% 
Crafty-22.6R02-102  2600    4    5 31128   49%  2604   21% 
Crafty-22.6R01-101  2593    4    4 31128   48%  2604   21% 
Crafty-22.6R01-102  2592    5    4 31128   48%  2604   21% 
The new version pretty reliably checks in as better, although I am still not certain exactly why this would be. The new version hits 2600 twice, while the older version is at 2592-2593 twice. So 7-8 Elo, again remembering the error bar; but with a total of 62,000 games (which I could combine into two lines above if anyone wants), the error bar would be even smaller while the Elo numbers would not change.

One of those things that makes you go hmmmm.....
I am not surprised at all!
I must admit that I was surprised when I first discovered that MVV/LVA among moves with non-negative SEE performed better than pure SEE. Like Bob, I found that the former scheme scored something like 5-10 Elo points better, although I needed a lot more time to test it (I'm so envious that he can get 32,000 games in one hour, while I can get about 100 games per week). I was stupid. In hindsight, it is quite obvious why MVV/LVA for non-losing captures is better than SEE: as you point out elsewhere in the thread (and as HGM has explained several times in the past), after you capture the most valuable enemy piece and the opponent recaptures, the capture with the highest SEE value will usually still be available, and because the remaining material is smaller, the subtree size is also smaller. Common sense should be enough to tell us that SEE on all captures is an inferior move ordering scheme.

At cut nodes, good move ordering is not just about picking a move which gives a beta cutoff, but also about picking a move which gives a beta cutoff with the smallest possible number of nodes.

Tord
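To make the scheme Tord describes concrete, here is a minimal sketch of capture ordering in which non-losing captures (SEE >= 0) are sorted by MVV/LVA and losing captures are deferred and sorted by SEE. The Move structure and piece values are invented for illustration; this is not Crafty's or Glaurung's actual code.

Code: Select all

#include <algorithm>
#include <cstdio>
#include <vector>

// Illustrative move record: piece values only, no board representation.
struct Move {
    int victim;    // value of captured piece (P=1, N/B=3, R=5, Q=9)
    int attacker;  // value of capturing piece
    int see;       // static exchange evaluation of the capture
};

// Ordering key: captures with SEE >= 0 get a large MVV/LVA key, so they are
// searched first (most valuable victim first, least valuable attacker as
// tie-break); losing captures come last, least-losing first.
int captureKey(const Move& m) {
    if (m.see >= 0)
        return 1000 + 10 * m.victim - m.attacker;
    return m.see;
}

void orderCaptures(std::vector<Move>& captures) {
    std::sort(captures.begin(), captures.end(),
              [](const Move& a, const Move& b) { return captureKey(a) > captureKey(b); });
}

int main() {
    std::vector<Move> caps = {
        {1, 9, -8},   // QxP where the queen is lost to a recapture: losing capture
        {3, 3,  0},   // NxB, recaptured: equal exchange
        {9, 1,  9},   // PxQ: wins the most valuable victim
    };
    orderCaptures(caps);   // expected order: PxQ, NxB, QxP
    for (const Move& m : caps)
        std::printf("victim=%d attacker=%d see=%+d\n", m.victim, m.attacker, m.see);
    return 0;
}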