Per-square MVV/LVA: it's nice but it doesn't work

hgm
Posts: 28359
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: MVV/LVA - SEE - test - final results

Post by hgm »

Tord Romstad wrote: after you capture the most valuable enemy piece and the opponent recaptures, the capture with the highest SEE value will usually still be available, and because the remaining material is smaller, the subtree size is also smaller.
Apart from the size of the subtree being smaller, the QS score of the capture with the lower SEE score is often better. This is because the SEE score of the capture of a higher victim cannot be lower than that of a lower victim without the former exchange involving a recapture. For example, suppose you can capture a Rook with SEE = 2 (say BxR, PxB), and you can also capture a hanging Knight with SEE = 3. If you start by taking the Knight, the opponent will rescue the Rook, and you end at +3. If you capture the Rook first, after the recapture you are at +2, but still have the move to reap the remaining +3, for a total of +5. And if he saves the Knight instead of recapturing the Bishop, you withdraw the Bishop and have grabbed a free Rook, also for +5.

You only have to be careful when the capture of the higher victim requires an odd number of plies to reap its SEE score, e.g. QxR, RxR, KxR for SEE = +1 (assuming Q=9 and R=5), so that the opponent ends up having the move. In that case an also-available PxB (SEE = +3) should have priority, or the Bishop will get away. So a PxB (+3) is better ordered before a QxR (+1), despite both having SEE >= 0, and R > B.

This is why I prefer the Joker ordering over the Glaurung ordering. The PxB (+3) in that case will get sort key +3, as it is an LxH capture, which is ordered by victim (like equal captures, btw). The QxR (+1) is an HxL capture, though, and will thus get its SEE score of +1 as sort key, sorting it behind the PxB. This would even happen if the B was defended (so that it had SEE = +2).

Note that this also reduces the overhead of applying SEE: if good and equal captures are going to be ordered by MVV, as Glaurung does, there is no need to calculate the SEE for LxH and equal x equal captures, as their SEE cannot possibly be negative. So you know they are going to be ordered by MVV in advance. Likewise, in Joker only the SEE for HxL captures has to be calculated. The difference is that in Joker the HxL captures get ordered later, based on their SEE, even if they are good, while in Glaurung they would be ordered by victim if they were good. This makes no difference for capturing undefended pieces, as there the SEE and victim value are the same. But it does matter for exchanges of the type R vs 2 minors or Q vs 2R, where H x defended L can still be good. Joker delays those compared to Glaurung, because they make you lose a tempo.
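
To make that ordering rule concrete, here is a minimal sketch of such a sort key in C. It is not code from Joker or Glaurung; the Move fields, the piece values and the see() stub are assumptions for illustration only.

Code:

/* Joker-style capture ordering key (illustrative sketch).
 * Piece values in pawn units: P=1, N=B=3, R=5, Q=9. */

typedef struct {
    int victim_value;
    int attacker_value;
} Move;

/* Stub: a real engine would run its static exchange evaluator here. */
static int see(const Move *m)
{
    (void)m;
    return 0;
}

/* LxH and equal captures: their SEE can never be negative, so skip the SEE
 * call and sort them by victim value (plain MVV).  HxL captures: sort by
 * their SEE, so that e.g. QxR with SEE = +1 ends up behind PxB with key +3,
 * even though R > B. */
static int capture_sort_key(const Move *m)
{
    if (m->attacker_value <= m->victim_value)
        return m->victim_value;   /* no SEE calculation needed */
    return see(m);                /* only HxL captures pay for SEE */
}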
Dave Gomboc

Re: MVV/LVA - SEE - test - final results

Post by Dave Gomboc »

Dann Corbit wrote: Reminds me of a least squares fit I did with evaluation parameters and a test set of 12000 tactical positions. After the fit, the engine would solve test positions like a world champion. Unfortunately, it got pounded by the original engine in actual games.
Least-squares fitting for tuning eval? Awww, Dann, I'm disappointed in you. :-P

Well, because you tuned on just tactical positions, you surely expected the resulting eval to get hammered in complete games, so I suppose it didn't matter.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: MVV/LVA - SEE - test - final results

Post by bob »

Dave Gomboc wrote:
Dann Corbit wrote: Reminds me of a least squares fit I did with evaluation parameters and a test set of 12000 tactical positions. After the fit, the engine would solve test positions like a world champion. Unfortunately, it got pounded by the original engine in actual games.
Least-squares fitting for tuning eval? Awww, Dann, I'm disappointed in you. :-P

Well, because you tuned on just tactical positions, you surely expected the resulting eval to get hammered in complete games, so I suppose it didn't matter.
Long time, no see. And I am not talking about Static Exchange Evaluation.

:)
Dave Gomboc

Re: MVV/LVA - SEE - test - final results

Post by Dave Gomboc »

bob wrote: Long time, no see. And I am not talking about Static Exchange Evaluation.

:)
Thanks, Bob. :-) Indeed, I've not been around for a long while. I relocated to southern California, got married, etc. It's nice to see that you're still active in the community!

This thread was a bit disappointing to me. I acknowledge that it's great to be able to quantify the effect of a change by playing a bevy of games quickly, when you can. Can we perhaps also agree that matches aren't an efficient use of limited computational resources to determine whether a change is beneficial?

Years ago, Christophe Theron once posted about how a major difference between professionals and amateurs was their approach to testing their changes. I, too, believed this to be true back at that time, when there were >150-Elo differences between most commercial and most amateur programs. One of the reasons my M.Sc. thesis discussed what it did was that I thought there were more efficient ways for developers to assess the strength of their software's evaluation function (including tree search), but all I saw were amateurs using tiny test suites or playing matches ad infinitum.

As it turned out, I ran out of time before I could address tree-search aspects of evaluation. (Hey, I still want to earn a doctorate, so perhaps I can work that in there somehow. ;-) Nonetheless, I do want to seriously suggest that analyzing a chess engine's behaviour on perhaps a million positions could be a more efficient way to determine the worth of changes than playing long matches.

I think another poster in this thread attempted to make a similar point, but he seemed to actually be more interested in addressing you in a snide manner than in communicating clearly. Consequently his post was ineffective at communicating the idea. Sadly, that's an all too predictable behaviour at CCC :-( -- one other reason I have ignored it for so many years.

Dave
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: MVV/LVA - SEE - test - final results

Post by michiguel »

Dave Gomboc wrote:
bob wrote: Long time, no see. And I am not talking about Static Exchange Evaluation.

:)
Thanks, Bob. :-) Indeed, I've not been around for a long while. I relocated to southern California, got married, etc. It's nice to see that you're still active in the community!

This thread was a bit disappointing to me. I acknowledge that it's great to be able to quantify the effect of a change by playing a bevy of games quickly, when you can. Can we perhaps also agree that matches aren't an efficient use of limited computational resources to determine whether a change is beneficial?

Years ago, Christophe Theron once posted about how a major difference between professionals and amateurs was their approach to testing their changes. I, too, believed this to be true back at that time, when there were >150-Elo differences between most commercial and most amateur programs. One of the reasons my M.Sc. thesis discussed what it did was that I thought there were more efficient ways for developers to assess the strength of their software's evaluation function (including tree search), but all I saw were amateurs using tiny test suites or playing matches ad infinitum.
A point I want to make that is not totally related to your argument:

I remember CT's statement, and I also remember someone related to a commercial product (can't remember the name right now) describing how impossible it was for an amateur to come close to a pro. Well, how wrong that was!

I remember CT advising Uri Blass when he declared he wanted to start writing an engine. ~6 years later, Movei was at the same level as Tiger in some rankings. Amateurs have made tremendous progress and many have surpassed commercial products (becoming commercial themselves...). Some of the amateur engines are even GPLed.

As it turned out, I ran out of time before I could address tree-search aspects of evaluation. (Hey, I still want to earn a doctorate, so perhaps I can work that in there somehow. ;-) Nonetheless, I do want to seriously suggest that analyzing a chess engine's behaviour on perhaps a million positions could be a more efficient way to determine the worth of changes than playing long matches.
You have a million positions in around 10,000 games. The only way that using positions would be faster is without search. How can you know your changes are certifiably good without matching the engine against another?

I see that there may be better ways for tuning than "trial-matches-error-back to the beginning", but sooner or later you need the engine to play games to know how good the changes were.

Miguel

I think another poster in this thread attempted to make a similar point, but he seemed to actually be more interested in addressing you in a snide manner than in communicating clearly. Consequently his post was ineffective at communicating the idea. Sadly, that's an all too predictable behaviour at CCC :-( -- one other reason I have ignored it for so many years.

Dave
Dave Gomboc

Re: MVV/LVA - SEE - test - final results

Post by Dave Gomboc »

michiguel wrote:I remember CT's statement, and I also remember someone related to a commercial product (can't remember the name right now) describing how impossible it was for an amateur to come close to a pro. Well, how wrong that was!
I don't specifically recall that post or its context. If that poster was saying that amateurs weren't going to make headway without making a significant change in how they approached the task, perhaps that poster was right, and those changes happened. If that poster was simply saying it was impossible, well, that makes no sense -- nobody starts off as a professional!

michiguel wrote: I remember CT advising Uri Blass when he declared he wanted to start writing an engine. ~6 years later, Movei was at the same level as Tiger in some rankings. Amateurs have made tremendous progress and many have surpassed commercial products (becoming commercial themselves...). Some of the amateur engines are even GPLed.
Uri: congratulations! You've been slogging away at it for a while. I remember when you were asking for people's perft counts while you were debugging your move generator.

michiguel wrote: You have a million positions in around 10,000 games. The only way that using positions would be faster is without search. How can you know your changes are certifiably good without matching the engine against another?
No, it's not the only way. Give yourself a large pool of positions, and search them extremely deeply using a strong program every time you upgrade your hardware (once every 18-24 months?). Actually, you can also conduct these analyses piecemeal over time if you have to. But, you do need to log all relevant information about these searches (what is relevant depends upon what you intend to work on, so log a lot!).

This "strong program" can be yours if your program is already reasonable. However, use a strong open-source program if your program is still weak -- or you will need to redo your "extremely deep" searches whenever you make significant progress on your engine.

Anyway, so now you have this pool of deeply-searched positions. You don't need to use every position every time: you might use 5000 for one task, or 50000 for another, depending on what it takes for you to reach a desired level of confidence that a change is beneficial. By not always using the same 5000 or 50000 positions, you avoid overfitting your software to them.

Baseline performance is now how your program's search at tournament time control compares to the extremely deep searches. Whenever you make a change and run an experiment, you can see under what conditions your engine sees more or less than it used to, because you've got those extremely deep searches to compare against.
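
One way to picture that comparison is a small harness that replays a random sample of the pool at tournament time control and measures agreement with the stored deep results. This is just a sketch of the idea, not anything from the thread; RefPosition, SearchResult and search_at_tournament_tc() are assumed names.

Code:

#include <string.h>

typedef struct {
    char fen[128];
    char deep_best_move[8];   /* best move from the very deep reference search */
    int  deep_score;          /* its score, in centipawns */
} RefPosition;

typedef struct {
    char best_move[8];
    int  score;
} SearchResult;

/* Assumed to run your engine on one position at tournament time control. */
SearchResult search_at_tournament_tc(const char *fen);

/* Fraction of sampled positions where the tournament-time-control search
 * agrees with the deep reference search.  Drawing a different random sample
 * from the pool each run helps avoid overfitting the engine to it. */
double agreement_rate(const RefPosition *pool, const int *sample, int sample_size)
{
    int agree = 0;
    for (int i = 0; i < sample_size; i++) {
        const RefPosition *p = &pool[sample[i]];
        SearchResult r = search_at_tournament_tc(p->fen);
        if (strcmp(r.best_move, p->deep_best_move) == 0)
            agree++;
    }
    return sample_size ? (double)agree / sample_size : 0.0;
}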

michiguel wrote:I see that there may be better ways for tuning than "trial-matches-error-back to the beginning", but sooner or later you need the engine to play games to know how good the changes were.
I think that there are so many people having fun playing matches between computer programs that you probably never need to run one yourself unless you suspect a configuration problem on their end.

Dave
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: MVV/LVA - SEE - test - final results

Post by bob »

Dave Gomboc wrote:
bob wrote: Long time, no see. And I am not talking about Static Exchange Evaluation.

:)
Thanks, Bob. :-) Indeed, I've not been around for a long while. I relocated to southern California, got married, etc. It's nice to see that you're still active in the community!

This thread was a bit disappointing to me. I acknowledge that it's great to be able to quantify the effect of a change by playing a bevy of games quickly, when you can. Can we perhaps also agree that matches aren't an efficient use of limited computational resources to determine whether a change is beneficial?

Years ago, Christophe Theron once posted about how a major difference between professionals and amateurs was their approach to testing their changes. I, too, believed this to be true back at that time, when there were >150-Elo differences between most commercial and most amateur programs. One of the reasons my M.Sc. thesis discussed what it did was that I thought there were more efficient ways for developers to assess the strength of their software's evaluation function (including tree search), but all I saw were amateurs using tiny test suites or playing matches ad infinitum.

As it turned out, I ran out of time before I could address tree-search aspects of evaluation. (Hey, I still want to earn a doctorate, so perhaps I can work that in there somehow. ;-) Nonetheless, I do want to seriously suggest that analyzing a chess engine's behaviour on perhaps a million positions could be a more efficient way to determine the worth of changes than playing long matches.

I think another poster in this thread attempted to make a similar point, but he seemed to actually be more interested in addressing you in a snide manner than in communicating clearly. Consequently his post was ineffective at communicating the idea. Sadly, that's an all too predictable behaviour at CCC :-( -- one other reason I have ignored it for so many years.

Dave
You missed a critical post by CT a while back. After looking at what I was doing at the time, he discovered that his testing was just as badly flawed as my original testing, where I used 40 starting positions and played each one many times, since there is so much randomness in the results. He had measured a significant improvement in his last version, but CCRL etc. were not seeing the same results. After that discussion on testing, he tried something more in line with my approach and discovered his original data was bad...

I have tried all sorts of "positional tests". But the bottom line is that actual games provide more information. One can change something that improves a positional measurement but hurts actual game play. We have had plenty of those kinds of ideas in Crafty that we thought were good (we have a small group working on it constantly), but testing showed that most were bad, a few were neutral, and a few were good.
Dave Gomboc

Re: MVV/LVA - SEE - test - final results

Post by Dave Gomboc »

bob wrote: You missed a critical post by CT a while back. After looking at what I was doing at the time, he discovered that his testing was just as badly flawed as...
I went looking for that thread using the search function here, but I didn't come across it. Can anyone link to it?
bob wrote:I have tried all sorts of "positional tests". But the bottom line is that actual games provide more information. One can change something that improves a positional measurement, but hurts in actual game play.
I don't think I referred to any "positional tests". What is your opinion of the specific suggestions I made? (I freely acknowledge in advance that what I suggested isn't particularly appropriate when addressing time management issues.)

Dave