xboard compliance and new feature

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

hgm
Posts: 27793
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: xboard compliance and new feature

Post by hgm »

bob wrote:(1) in complex programs, your suggestion does not work. Eval terms influence the NPS differently in different parts of the game. A static node counter will simply not be as accurate as I want in such cases.
This is why I said you could also measure the average time per move over many games.
(2) I can't see any reason to limit opponents to a specific tree space size. That also changes the way _they_ play. Some programs search 2-3x faster in endgames than in the opening. This kind of nonsense completely invalidates games where a program is artificially limited to some "average" speed that can be 2x-3x off in the right kinds of positions.
I am not interested in how my opponents play. Just that I have many, and that they play differently.
I'm not trying to somehow handicap my opponent. I'm trying to measure the best that I can do in a given amount of time against the best that they can do in a given amount of time, and then decide if my new "best" is better or worse than the previous best version.
This is nonsense. By limiting their time you handicap them anyway.
If you want to test like that, that is certainly your choice. I'm more interested in realistic comparisons where both programs use the time as they see fit so that I don't draw an invalid conclusion because I have unintentionally finagled the timing.
I am more interested in efficient testing, and testing that is not affected by machine loading.
NPS is not static. Testing in this way assumes that it is.
It doesn't assume anything. It is just a different way of testing.

So far everything you mention applies a hundred times worse to testing at fixed depth. So if these are truly drawbacks, playing by nodes should be a _huge_ improvement over playing by depth. Which is all I have been saying.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL


Post by bob »

hgm wrote:
bob wrote:(1) in complex programs, your suggestion does not work. Eval terms influence the NPS differently in different parts of the game. A static node counter will simply not be as accurate as I want in such cases.
This is why I said you could also measure the average time per move over many games.
Let's take Ferret. It searches at 2M nodes per second or 6M nodes per second, depending on whether it is in the opening or the endgame. I want to play a one-second-per-move game. How many nodes are you going to tell it to search before moving? Average = 4M. So in the opening you will tell it to search a tree _twice_ as big as normal, or give it a 2:1 time advantage. In the endgame it will search a tree 50% smaller than normal, giving it a time disadvantage. That clearly distorts things, and has me playing against a _different_ opponent than Ferret normally is. Crafty's NPS changes. Its parallel-search efficiency changes. Playing by time accommodates all of that, and that is the reason it is the only way I test.
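The arithmetic behind this example can be sketched in a few lines. This is only an illustration of the numbers quoted above (2M/6M NPS, 4M average); the function name and constants are hypothetical, not anyone's actual tester code:

```python
# Hypothetical numbers from the Ferret example above: 2M NPS in the
# opening, 6M NPS in the endgame, and a node budget derived from the
# 4M-NPS average so the engine "should" use 1 second per move.

TARGET_SECONDS = 1.0
AVG_NPS = 4_000_000
NODE_BUDGET = int(AVG_NPS * TARGET_SECONDS)  # 4M nodes per move

def effective_seconds(actual_nps: int) -> float:
    """Wall-clock time actually consumed when searching the fixed node
    budget at the engine's real NPS for the current game phase."""
    return NODE_BUDGET / actual_nps

print(effective_seconds(2_000_000))  # opening: 2.0 s -> a 2:1 time advantage
print(effective_seconds(6_000_000))  # endgame: ~0.67 s -> a time handicap
```

The point of the sketch: a single node budget maps to very different amounts of effective thinking time depending on the phase-dependent NPS, which is exactly the distortion described above.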

If you don't mind skewing results, then fixed depth or fixed nodes can both work. Not very well, but they work.
(2) I can't see any reason to limit opponents to a specific tree space size. That also changes the way _they_ play. Some programs search 2-3x faster in endgames than in the opening. This kind of nonsense completely invalidates games where a program is artificially limited to some "average" speed that can be 2x-3x off in the right kinds of positions.
I am not interested in how my opponents play. Just that I have many, and that they play differently.
I'm not trying to somehow handicap my opponent. I'm trying to measure the best that I can do in a given amount of time against the best that they can do in a given amount of time, and then decide if my new "best" is better or worse than the previous best version.
This is nonsense. By limiting their time you handicap them anyway.
That is nonsense. I do _not_ handicap them in any way. All my test games are equal time for both sides. No idea what you are talking about.
If you want to test like that, that is certainly your choice. I'm more interested in realistic comparisons where both programs use the time as they see fit so that I don't draw an invalid conclusion because I have unintentionally finagled the timing.
I am more interested in efficient testing, and testing that is not affected by machine loading.
Machine loading is irrelevant to me. All my machines have nothing on them but the chess engines when I test.
NPS is not static. Testing in this way assumes that it is.
It doesn't assume anything. It is just a different way of testing.

So far everything you mention applies a hundred times worse to testing at fixed depth. So if these are truly drawbacks, playing by nodes should be a _huge_ improvement over playing by depth. Which is all I have been saying.
And there I agree. But I don't consider nodes _or_ depth to be realistically representative of what would happen with the same two programs in a real timed match.

Post by hgm »

So node-count-controlled Ferret is a different engine from time-controlled Ferret. So what? Why would it be important for me to play against time-controlled Ferret?

Limiting time is a handicap, as an engine would have played better when it could have searched longer than the time limit (on move, session or game) you impose. Ferret at 1 min/move is also not the same Ferret as Ferret at 1 min/game.

Machine loading is irrelevant to you... Well, good for you, but guess what? You are not the only person in the world, and we don't design XBoard to suit your needs and your needs only. In fact, you are a self-confessed non-XBoard user, having written your own GUI-less referee program and using it for your cluster testing. So your opinion basically carries zero weight.

Post by bob »

hgm wrote:So node-count-controlled Ferret is a different engine from time-controlled Ferret. So what? Why would it be important for me to play against time-controlled Ferret?

Limiting time is a handicap, as an engine would have played better when it could have searched longer than the time limit (on move, session or game) you impose. Ferret at 1 min/move is also not the same Ferret as Ferret at 1 min/game.

Machine loading is irrelevant to you... Well, good for you, but guess what? You are not the only person in the world, and we don't design XBoard to suit your needs and your needs only. In fact, you are a self-confessed non-XBoard user, having written your own GUI-less referee program and using it for your cluster testing. So your opinion basically carries zero weight.
I couldn't care less what you do to xboard. My comments were directed toward the idea of using fixed depth, or fixed nodes. Nothing more, nothing less. I plan on trying to educate new authors about accurate ways of measuring engine strength changes. And that was all I was doing.

Post by hgm »

That would have been a noble cause, if your 'education' were based on actual knowledge that you had and others lacked.

But in fact, what you are presenting here as if it were gospel is pure speculation on your part, based on nothing but prejudice.

In reality you haven't the slightest idea whether the Ferret that is playing by node count is actually stronger or weaker than the Ferret playing by time. Or whether evaluating the Elo gain of a change in engine X would produce a different value in a gauntlet against a number of engines playing by time than in a gauntlet against the same bunch playing by nodes. You have not tested any of those things.

Show us facts, and we might believe you. (If such facts pass the usual scientific scrutiny, that is...) But engine authors are in fact very ill advised to listen to your gut reaction "this is not the way I have been doing it for the past 49 years, so it must be wrong". The latter just wrecks a fruitful discussion by interjecting disinformation.

Post by bob »

hgm wrote:That would have been a noble cause, if your 'education' were based on actual knowledge that you had and others lacked.

But in fact, what you are presenting here as if it were gospel is pure speculation on your part, based on nothing but prejudice.

And that statement is based on ignorance, so I guess at some level we must be "equal" here. Searching to a fixed number of nodes distorts the results. I have already explained why. Programs are not uniform in their NPS, so limiting them to a fixed number of nodes helps their search depth in some positions, because when they are searching slower than normal the extra nodes are like extra time. The same problem appears when the program goes fast; now it is penalized.

The first point here is that any changes that influence program speed will give invalid results, because fixed-node searches prevent NPS from affecting the game at all. And many changes are all about speed.

The second point is that when playing an opponent whose search speed varies significantly over the different phases of the game, the opponent will seem to play stronger in positions where it searches slower, and vice versa, because of the time bias that fixed-node searching introduces.

As one example, suppose you are playing against a program with the 3x speed variance Ferret has (or used to have when it was active). A change to your program that causes it to trade material a little more aggressively will appear to be a good change, because your winning rate goes up somewhat. Not because of your change, but because you are forcing Ferret to play in the part of the game where it is normally a lot faster, so that the fixed node budget now time-handicaps it.

I don't see why that is so hard to understand. And it is _not_ based on "prejudice". It is based on understanding how the tree search progresses and how a fixed number of nodes can affect the searches of two different engines in two different ways, or perhaps not affect one of them at all, which is even worse.
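The mechanical difference between the two stop conditions being argued about can be shown with a toy search driver. This is a minimal sketch for illustration only, not any engine's actual code; the function and its parameters are hypothetical:

```python
import time

def search(node_limit=None, time_limit=None):
    """Toy search loop illustrating the two termination rules under debate.

    node_limit -> deterministic and load-independent: the same tree is
                  searched on any machine, but the wall-clock cost per
                  move tracks the engine's NPS in the current game phase.
    time_limit -> fixed wall-clock cost per move: the node count (and
                  hence the depth reached) varies with NPS and with
                  machine load.
    """
    nodes = 0
    start = time.perf_counter()
    while True:
        nodes += 1  # stand-in for visiting one node of the search tree
        if node_limit is not None and nodes >= node_limit:
            return nodes
        if time_limit is not None and time.perf_counter() - start >= time_limit:
            return nodes

print(search(node_limit=100_000))  # always 100000, on any machine
```

A node-limited run always returns the same node count regardless of hardware or load, while a time-limited run returns whatever the machine managed in the interval; that is the reproducibility-versus-realism trade-off both sides are arguing over.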

In reality you haven't the slightest idea whether the Ferret that is playing by node count is actually stronger or weaker than the Ferret playing by time. Or whether evaluating the Elo gain of a change in engine X would produce a different value in a gauntlet against a number of engines playing by time than in a gauntlet against the same bunch playing by nodes. You have not tested any of those things.
Must you continue to post such absolute rubbish? Do you _really_ think that if I slow a program down by a factor of 2 that it will _not_ play weaker? If so, you need to find another hobby. You are so far out in left field here, you are completely out of the stadium and out in the parking lot.

For the record, I have run tests like this. And I noticed that the results were not matching the fixed-time-limit games very accurately. And I studied the problem to figure out why. Have _you_ ever done that? I have. And this is not guesswork.

Show us facts, and we might believe you. (If such facts pass the usual scientific scrutiny, that is...) But engine authors are in fact very ill advised to listen to your gut reaction "this is not the way I have been doing it for the past 49 years, so it must be wrong". The latter just wrecks a fruitful discussion by interjecting disinformation.
Notice I did not say "this is the way I have been doing it for the last 49 years." I simply said that this is not the best way to test, and gave a concrete (and easily understandable) reason for anyone interested enough to think about it.

I do not believe it is reasonable to test in an environment where I can make changes to my program that let it exploit a weakness in the test methodology. Full-width searches are amazing in the way they can steer the game toward favorable situations when you are looking at wins vs losses.

If you play against an opponent that speeds up in the endgame, then playing with fixed-node searches will make any change that reaches an endgame position more quickly look good. Not because the program is playing better, but because it is suddenly, virtually, searching "much faster", having steered the game toward positions where fixed nodes effectively slow the opponent down significantly.

By the same token, if you play against an opponent that slows down in the endgame (Crafty is an example here) then any changes that avoid trading will look good because you now force Crafty to search smaller trees than it normally would, giving your program a time advantage.

Neither of those cases has anything at all to do with making your program play better, just with making it steer toward positions where fixed-node searches are more favorable to you than to your opponent.

Let's take the first case. You tune your eval against programs that speed up in the endgame and think that, because of your unwitting speed advantage, you are getting better. And then in real games you steer toward the endgame, where suddenly you discover that your opponents know more about pawns and such than you do, and without that time handicap you lose more than you did before.

Again, you need to test like you intend to play. Since we don't have fixed depth or fixed node count tournaments, it is simply more efficient to test and tune like you are going to really play.

No prejudice. No "dis-info". Just logical reasoning.