Asymetric Node/move in winboard for UCI engines

hgm · Post by **hgm** » Wed Sep 30, 2009 7:59 pm

Actually, you can use it for serious testing, and that is exactly what François will be using it for. Of course you cannot use it to measure the relative strength of completely different engines A and B, for the reasons you give. But when you develop an engine, that is immaterial. What you want to test is how versions A and A' of your engine perform relative to each other when you play them against B (and C and D). And when A and A' count nodes in the same way and have the same nps, the results of such gauntlets are directly comparable.

And even if A and A' have different nps, because (say) you added a very expensive evaluation term, you can either test them at the same node budget and record the time they take, so that you can correct the results with an empirical rating-vs-time formula, or you can restrict the node budget so that the engines A and A' effectively use the same time.

The latter strategy can even be used when you play A against B. It does not matter much how exactly the engines report their nodes. You just adapt their node budgets until they use equal time, which can easily be done with a benchmark of their real nps. Note that when you specify a node-based TC in WinBoard, the TC dialog lets you specify the nps conversion factors for each engine independently.
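As a rough sketch of the calibration described above: benchmark each engine's real nps, then scale its per-move node budget so that both spend about the same wall-clock time. The function name, the nps figures and the target time are illustrative assumptions, not WinBoard settings or measured data.

```python
def node_budget(nps: float, seconds_per_move: float) -> int:
    """Nodes an engine may search per move to use roughly seconds_per_move."""
    return round(nps * seconds_per_move)

# Hypothetical benchmark results (real nodes per second); not measured data.
benchmark_nps = {"A": 1_200_000, "B": 800_000}
target_seconds = 0.5  # assumed time per move

# Each engine gets its own budget, so equal time maps to unequal node counts.
budgets = {name: node_budget(nps, target_seconds)
           for name, nps in benchmark_nps.items()}

for name, nodes in budgets.items():
    print(f"engine {name}: {nodes} nodes/move")
```

The point of the per-engine budget is that the node counts themselves need not be comparable between engines; only the time they stand for has to match.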

bob · Post by **bob** » Wed Sep 30, 2009 10:23 pm

hgm wrote:Actually, you can use it for serious testing, and that is exactly what François will be using it for. Of course you cannot use it to measure the relative strength of completely different engines A and B, for the reasons you give. But when you develop an engine, that is immaterial. What you want to test is how versions A and A' of your engine perform relative to each other when you play them against B (and C and D). And when A and A' count nodes in the same way and have the same nps, the results of such gauntlets are directly comparable.

And even if A and A' have different nps, because (say) you added a very expensive evaluation term, you can either test them at the same node budget and record the time they take, so that you can correct the results with an empirical rating-vs-time formula, or you can restrict the node budget so that the engines A and A' effectively use the same time.

The latter strategy can even be used when you play A against B. It does not matter much how exactly the engines report their nodes. You just adapt their node budgets until they use equal time, which can easily be done with a benchmark of their real nps. Note that when you specify a node-based TC in WinBoard, the TC dialog lets you specify the nps conversion factors for each engine independently.

It is not so immaterial, as I have explained previously. And, for the Nth time, here is why this is not the best form of testing.

Some programs vary in speed significantly from opening to middlegame to endgame. I have seen a factor of 2-3X quite often, with some being even larger. Ferret was 4x-5x faster in endgames, as an example.

If you do fixed-node searches, you have to somehow come up with a value that represents a "fair" approximation to equal time for the two opponents. Doable.

But then comes the rest of the story. Suppose you play against an opponent who speeds up by 5x, while you show little or no speed improvement. In an endgame, he will be 5x faster than you. In the opening, you will be pretty equal. Irrelevant, you say?

Hardly. What happens if the changes you make, although small, improve your king safety enough that you start surviving the middlegame more frequently? Say you were losing 75% of middlegame positions and almost all endgame positions because of the speed advantage your opponent has. Now you have learned to avoid the busted king positions, and 1/2 of the games you play reach the endgame, where before only, say, 25% did. So you lost 25% of total games (100% of endgames, because you get badly out-searched) and you were losing 3 of 4 of the other 75%. Now you no longer lose 3 of 4 in the middlegame; you are losing only 1/2 as many. But those now become endgame positions where you lose 'em all. And you now conclude you are doing _worse_ and throw the change away.

This can happen in other ways as well, depending on who speeds up where in the game. You might win more endgames because you are faster. But if all your evaluation change does is push the game into a phase where you are either faster or slower, the results will have little to do with the actual changes you made, and the conclusions you reach are wrong.

If the programs do not change their speed significantly, I would agree this will work. But even Crafty varies by a factor of 3x or a little more as there is some slow code I execute in the opening that I do not execute in the MG/EG parts of the game because it has castled. If you play real games using a time limit, and you tune using something else, you really are inviting trouble. And are going to make mistakes.

Trying for repeatability is futile also, since we now know that varying the search space by only one node per move leads to different games anyway. Repeatability implies accuracy, but it is a mirage.
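One way to picture the speed-variation concern is the sketch below: a node budget calibrated to equal time in the opening stops corresponding to equal time once one engine's nps changes with the game phase. The 5x endgame speed-up is the figure used in the post; the absolute nps values and the budget are made up for illustration.

```python
# nps per game phase. Only the 5x ratio comes from the post; the absolute
# numbers are hypothetical.
nps = {
    "you":      {"opening": 1_000_000, "endgame": 1_000_000},
    "opponent": {"opening": 1_000_000, "endgame": 5_000_000},
}
nodes_per_move = 1_000_000  # same fixed budget for both, calibrated in the opening

def seconds_used(engine: str, phase: str) -> float:
    """Wall-clock time the fixed node budget corresponds to in this phase."""
    return nodes_per_move / nps[engine][phase]

def time_ratio(phase: str) -> float:
    """How much more clock time 'you' effectively get than 'opponent'."""
    return seconds_used("you", phase) / seconds_used("opponent", phase)

for phase in ("opening", "endgame"):
    print(f"{phase}: you/opponent time ratio = {time_ratio(phase):.1f}")
```

With these assumed figures the two engines are matched in the opening, but in the endgame the same budget stands for five times as much clock time for one side as for the other, which is not what a timed game would give them.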

michiguel · Post by **michiguel** » Wed Sep 30, 2009 10:32 pm

Daniel Mehrmann wrote:As expected, you are trying to run a race between a mouse, an elephant and a tiger, to give an example.

Your idea doesn't work at all, because every programmer handles search differently. It starts already with "how to enter a node", or better, "how to count nodes".

There is no model for all possibilities. Furthermore, results might be handled differently, and of course a search that isn't finished, where we might not have gone through all stages, gives unusable results.

You can't define it inside a protocol either, because you'll never know every possible idea and implementation of each programmer.

However, it's funny, yes! But you can't use it for serious testing. It's more of a running gag for the users out there.

Best,
Daniel

ps: Each reliable engine-developer will tell you the same.

I cannot disagree more! In fact, I think it would be fantastic to have as many engines as possible that support this feature. This is a tool, so it depends on how you use it. For instance, with this feature you can run matches with your debug version as if it were the release version (using a factor to compensate for the difference in speed). Moreover, the game will become deterministic (if I understand correctly), so it will be much easier to reproduce the situation where a rare bug happened. That is what I call serious testing.

You may be thinking about testing the relative strengths of different engines, but that is not what this feature is about.

Miguel

bob · Post by **bob** » Wed Sep 30, 2009 10:36 pm

michiguel wrote:
Daniel Mehrmann wrote:As expected, you are trying to run a race between a mouse, an elephant and a tiger, to give an example.

Your idea doesn't work at all, because every programmer handles search differently. It starts already with "how to enter a node", or better, "how to count nodes".

There is no model for all possibilities. Furthermore, results might be handled differently, and of course a search that isn't finished, where we might not have gone through all stages, gives unusable results.

You can't define it inside a protocol either, because you'll never know every possible idea and implementation of each programmer.

However, it's funny, yes! But you can't use it for serious testing. It's more of a running gag for the users out there.

Best,
Daniel

ps: Each reliable engine-developer will tell you the same.
I cannot disagree more! In fact, I think it would be fantastic to have as many engines as possible that support this feature. This is a tool, so it depends on how you use it. For instance, with this feature you can run matches with your debug version as if it were the release version (using a factor to compensate for the difference in speed). Moreover, the game will become deterministic (if I understand correctly), so it will be much easier to reproduce the situation where a rare bug happened. That is what I call serious testing. You may be thinking about testing the relative strengths of different engines, but that is not what this feature is about.

Miguel

It is fine for testing to expose and fix errors. It is not so fine to measure whether a change is better or worse as I pointed out in another post in this thread.

hgm · Post by **hgm** » Wed Sep 30, 2009 10:48 pm

bob wrote:Some programs vary in speed significantly from opening to middlegame to endgame. I have seen a factor of 2-3X quite often, with some being even larger. Ferret was 4x-5x faster in endgames, as an example.

Then just don't pick engines like that as opponents, and make sure your own way of counting nodes gives a reasonable impression of the time used in your own engine.

This is just a tool, and in the hands of people that know how to wield it it can be a powerful tool. But even the most powerful tool, placed in the hands of the stupid or clumsy, will not make them anything but stupid and clumsy. In fact even more so. That it can be used to cause disasters on improper use can never imply that a tool is bad.

Daniel Mehrmann · Post by **Daniel Mehrmann** » Wed Sep 30, 2009 11:02 pm

Well, I think Bob has already written a lot of arguments against it. So there is no need to write more. It doesn't look like you'll accept any other point of view.

Just one point I haven't found from Bob's side so far: if you're testing search changes, your idea is totally useless as well.

However, I think the engine authors will read this and think the matter over again. Homer will never support it anyway.

Best,
Daniel

michiguel · Post by **michiguel** » Wed Sep 30, 2009 11:19 pm

bob wrote:
hgm wrote:Actually, you can use it for serious testing, and that is exactly what François will be using it for. Of course you cannot use it to measure the relative strength of completely different engines A and B, for the reasons you give. But when you develop an engine, that is immaterial. What you want to test is how versions A and A' of your engine perform relative to each other when you play them against B (and C and D). And when A and A' count nodes in the same way and have the same nps, the results of such gauntlets are directly comparable.

And even if A and A' have different nps, because (say) you added a very expensive evaluation term, you can either test them at the same node budget and record the time they take, so that you can correct the results with an empirical rating-vs-time formula, or you can restrict the node budget so that the engines A and A' effectively use the same time.

The latter strategy can even be used when you play A against B. It does not matter much how exactly the engines report their nodes. You just adapt their node budgets until they use equal time, which can easily be done with a benchmark of their real nps. Note that when you specify a node-based TC in WinBoard, the TC dialog lets you specify the nps conversion factors for each engine independently.
It is not so immaterial, as I have explained previously. And, for the Nth time, here is why this is not the best form of testing.

Some programs vary in speed significantly from opening to middlegame to endgame. I have seen a factor of 2-3X quite often, with some being even larger. Ferret was 4x-5x faster in endgames, as an example.

If you do fixed-node searches, you have to somehow come up with a value that represents a "fair" approximation to equal time for the two opponents. Doable.

But then comes the rest of the story. Suppose you play against an opponent who speeds up by 5x, while you show little or no speed improvement. In an endgame, he will be 5x faster than you. In the opening, you will be pretty equal. Irrelevant, you say?

Hardly. What happens if the changes you make, although small, improve your king safety enough that you start surviving the middlegame more frequently? Say you were losing 75% of middlegame positions and almost all endgame positions because of the speed advantage your opponent has. Now you have learned to avoid the busted king positions, and 1/2 of the games you play reach the endgame, where before only, say, 25% did. So you lost 25% of total games (100% of endgames, because you get badly out-searched) and you were losing 3 of 4 of the other 75%. Now you no longer lose 3 of 4 in the middlegame; you are losing only 1/2 as many. But those now become endgame positions where you lose 'em all. And you now conclude you are doing _worse_ and throw the change away.

I see your point, but your numbers are misleading. If the games you used to lose in the middlegame become endgames, you will now lose them because your endgame is weak. OK, but the games you won in the middlegame, you still win. The only thing you do in your example is pick a different poison. A change that could potentially be beneficial is now neutral. This already happens in many situations.

The traditional way of testing suffers from the same problem if you pick engines with the same behavior (i.e. engines that play endgames well vs. middlegames, or vice versa). Picking only a few sparring partners increases the chances of suffering from this hypothetical problem.
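The arithmetic behind this disagreement can be spelled out. The sketch below uses the figures from the quoted example (25% then 50% of games reaching the endgame, all endgames lost, 3 of 4 middlegame decisions lost) and ignores draws. The two "readings" are my interpretation of the two posts, not something either author states in these terms.

```python
from fractions import Fraction as F

endgame_before, endgame_after = F(1, 4), F(1, 2)  # share of games reaching the endgame
mg_loss_rate = F(3, 4)                            # losses among middlegame decisions

# Before the change: every endgame is lost, plus 3 of 4 of the rest.
loss_before = endgame_before + (1 - endgame_before) * mg_loss_rate

# Reading 1 (assumed): the extra endgames are drawn proportionally from all
# middlegame outcomes, so the middlegame loss rate stays at 3/4.
loss_proportional = endgame_after + (1 - endgame_after) * mg_loss_rate

# Reading 2 (assumed): only formerly lost middlegames turn into endgames,
# so every middlegame win is kept and everything else is still lost.
mg_wins = (1 - endgame_before) * (1 - mg_loss_rate)
loss_only_losses_move = 1 - mg_wins

print("before:   ", float(loss_before))
print("reading 1:", float(loss_proportional))
print("reading 2:", float(loss_only_losses_move))
```

Under the first reading the measured result gets worse; under the second the total loss fraction is unchanged, which is the sense in which the change turns out neutral rather than harmful.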

Miguel

This can happen in other ways as well, depending on who speeds up where in the game. You might win more endgames because you are faster. But if all your evaluation change does is push the game into a phase where you are either faster or slower, the results will have little to do with the actual changes you made, and the conclusions you reach are wrong.

If the programs do not change their speed significantly, I would agree this will work. But even Crafty varies by a factor of 3x or a little more as there is some slow code I execute in the opening that I do not execute in the MG/EG parts of the game because it has castled. If you play real games using a time limit, and you tune using something else, you really are inviting trouble. And are going to make mistakes.

Trying for repeatability is futile also, since we now know that varying the search space by only one node per move leads to different games anyway. Repeatability implies accuracy, but it is a mirage.

michiguel · Post by **michiguel** » Wed Sep 30, 2009 11:24 pm

Daniel Mehrmann wrote:Well, I think Bob has already written a lot of arguments against it. So there is no need to write more. It doesn't look like you'll accept any other point of view.

And how about the pros?

Miguel

Just one point I haven't found from Bob's side so far: if you're testing search changes, your idea is totally useless as well.

However, I think the engine authors will read this and think the matter over again. Homer will never support it anyway.

Best,
Daniel

michiguel · Post by **michiguel** » Wed Sep 30, 2009 11:25 pm

bob wrote:
michiguel wrote:
Daniel Mehrmann wrote:As expected, you are trying to run a race between a mouse, an elephant and a tiger, to give an example.

Your idea doesn't work at all, because every programmer handles search differently. It starts already with "how to enter a node", or better, "how to count nodes".

There is no model for all possibilities. Furthermore, results might be handled differently, and of course a search that isn't finished, where we might not have gone through all stages, gives unusable results.

You can't define it inside a protocol either, because you'll never know every possible idea and implementation of each programmer.

However, it's funny, yes! But you can't use it for serious testing. It's more of a running gag for the users out there.

Best,
Daniel

ps: Each reliable engine-developer will tell you the same.
I cannot disagree more! In fact, I think it would be fantastic to have as many engines as possible that support this feature. This is a tool, so it depends on how you use it. For instance, with this feature you can run matches with your debug version as if it were the release version (using a factor to compensate for the difference in speed). Moreover, the game will become deterministic (if I understand correctly), so it will be much easier to reproduce the situation where a rare bug happened. That is what I call serious testing. You may be thinking about testing the relative strengths of different engines, but that is not what this feature is about.

Miguel
It is fine for testing to expose and fix errors. It is not so fine to measure whether a change is better or worse as I pointed out in another post in this thread.

Fixing errors is not a small thing...

Miguel

hgm · Post by **hgm** » Thu Oct 01, 2009 5:17 am

Daniel Mehrmann wrote:Well, I think Bob has already written a lot of arguments against it. So there is no need to write more. It doesn't look like you'll accept any other point of view.

I don't have to accept any point of view. I only supply opportunities. It is up to the users to turn them to their advantage. Those imaginative enough to profit from these new opportunities will do so. Those who are not will simply not use it. There is no reason to withhold it from people who can profitably use it just because there are some others who cannot, no matter how elaborate their argument for why _they_ cannot.

And of course the most important thing: this feature is useful to _me_, as I took care to do node counting in such a way that it gives a good representation of the time used in any stage of the game. So Bob's cons simply do not apply there. I tend to focus on what works, not on how I could mess it up. You could just as well argue that it would have been better not to implement setboard, because there is no guarantee that a position you feed to the engine is completely balanced.

That you don't want to implement it in Homer is of no concern to me. It just means I cannot and will not use Homer. There are plenty of other engines that are fit for my purposes...
