Crafty and software development

Discussion of chess software programming and technical issues.

Moderator: Ras

jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Crafty and software development

Post by jwes »

There has been a long thread about testing methods, and part of that was several people writing that fixed-node testing would allow separating the idea from the implementation, and Bob saying that it was not necessary for him, as he always implements his ideas efficiently, and it would only require him to test twice. It occurred to me that the release of Crafty 23.1 is an opportunity to test Bob's statements by doing a diff on his code and inspecting the changes to see if they are highly optimized.
frankp
Posts: 233
Joined: Sun Mar 12, 2006 3:11 pm

Re: Crafty and software development

Post by frankp »

What hypothesis is being tested here?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty and software development

Post by bob »

jwes wrote:There has been a long thread about testing methods, and part of that was several people writing that fixed-node testing would allow separating the idea from the implementation, and Bob saying that it was not necessary for him, as he always implements his ideas efficiently, and it would only require him to test twice. It occurred to me that the release of Crafty 23.1 is an opportunity to test Bob's statements by doing a diff on his code and inspecting the changes to see if they are highly optimized.
How could you do that? You only have the "final product", so did we write sloppy code, test, and then rewrite if it worked? Or did we write it optimally first? I don't see any possible way you could determine that without having all of the intermediate versions we tested, of which there were _hundreds_ between 23.0 and 23.1...

Feel free to inspect what you want of course, but you only see what was released, which doesn't seem capable of answering your question.
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: Crafty and software development

Post by jwes »

bob wrote:
jwes wrote:There has been a long thread about testing methods, and part of that was several people writing that fixed-node testing would allow separating the idea from the implementation, and Bob saying that it was not necessary for him, as he always implements his ideas efficiently, and it would only require him to test twice. It occurred to me that the release of Crafty 23.1 is an opportunity to test Bob's statements by doing a diff on his code and inspecting the changes to see if they are highly optimized.
How could you do that? You only have the "final product", so did we write sloppy code, test, and then rewrite if it worked? Or did we write it optimally first? I don't see any possible way you could determine that without having all of the intermediate versions we tested, of which there were _hundreds_ between 23.0 and 23.1...

Feel free to inspect what you want of course, but you only see what was released, which doesn't seem capable of answering your question.
I believe that you develop the way you said you did. The question I want to answer is whether the final code actually is optimal. Assuming that the code that made it into Crafty is at least as good as the code that did not, then if there are cases of non-optimal code in Crafty, it is reasonable to assume that some of the excluded code was non-optimal as well, and that you might be well served by a testing method that separates the idea from the efficiency of the implementation.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty and software development

Post by bob »

jwes wrote:
bob wrote:
jwes wrote:There has been a long thread about testing methods, and part of that was several people writing that fixed-node testing would allow separating the idea from the implementation, and Bob saying that it was not necessary for him, as he always implements his ideas efficiently, and it would only require him to test twice. It occurred to me that the release of Crafty 23.1 is an opportunity to test Bob's statements by doing a diff on his code and inspecting the changes to see if they are highly optimized.
How could you do that? You only have the "final product", so did we write sloppy code, test, and then rewrite if it worked? Or did we write it optimally first? I don't see any possible way you could determine that without having all of the intermediate versions we tested, of which there were _hundreds_ between 23.0 and 23.1...

Feel free to inspect what you want of course, but you only see what was released, which doesn't seem capable of answering your question.
I believe that you develop the way you said you did. The question I want to answer is whether the final code actually is optimal. Assuming that the code that made it into Crafty is at least as good as the code that did not, then if there are cases of non-optimal code in Crafty, it is reasonable to assume that some of the excluded code was non-optimal as well, and that you might be well served by a testing method that separates the idea from the efficiency of the implementation.
There are _thousands_ of ideas that are good if computational costs can be ignored. However, computational cost cannot be overlooked. If there is no way to implement an idea efficiently enough that its gain is not offset by its cost, then there is no reason to even deal with the implementation.

As an example, who wouldn't like to do mobility where you evaluate each square with respect to (a) usefulness, (b) whether you can actually move there safely based on attackers and defenders, and (c) effect on the mobility of other pieces? (I would like to have mobility to squares that also reduce my opponent's mobility, because the piece sitting on that square attacks key squares my opponent will no longer have mobility to, etc.) This is too expensive to contemplate unless one can first figure out a way of computing that type of mobility in reasonable time. If you take the time to implement it, and then totally ignore the computational cost, this idea will be a winner. But when you play real games, it will be a horrible loser, due to cost.
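
To make the "safely based on attackers and defenders" part concrete, here is a minimal sketch of counting only safe destination squares. It approximates safety with a crude attacker/defender count rather than a full swap-off, and every name in it is made up for illustration:

Code:

#include <stdio.h>

/* Count only the destination squares a piece can occupy without
   simply being lost. The attacker/defender counts would come from
   the engine's attack tables; they are hard-coded here purely for
   illustration, and the comparison is a crude stand-in for a real
   swap-off (SEE) evaluation. */
static int safe_mobility(const int *targets, int ntargets,
                         const int *attackers, const int *defenders)
{
    int count = 0;
    for (int i = 0; i < ntargets; i++) {
        int sq = targets[i];
        if (attackers[sq] <= defenders[sq]) /* crude safety test */
            count++;
    }
    return count;
}

int main(void)
{
    int targets[3] = {18, 19, 20};          /* hypothetical move targets */
    int attackers[64] = {0}, defenders[64] = {0};
    attackers[18] = 2; defenders[18] = 1;   /* square 18 is unsafe */
    printf("safe mobility: %d\n",
           safe_mobility(targets, 3, attackers, defenders));
    return 0;
}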

This discussion of separating idea from computational cost is completely foreign to any sort of programming where speed is paramount, as it is in chess. Nothing says you need to develop an "optimal" implementation. But you _must_ develop an efficient one so that the cost doesn't exceed the gain.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Crafty and software development

Post by Don »

bob wrote:
There are _thousands_ of ideas that are good if computational costs can be ignored. However, computational cost cannot be overlooked. If there is no way to implement an idea efficiently enough that its gain is not offset by its cost, then there is no reason to even deal with the implementation.
Of course, you have to run time control games to determine that. I think you keep beating this dead horse because nobody disagrees with you on this but that's the only thing you want to argue.

It would be just as silly if I kept coming back at you and saying, Bob, 2+2=4 and it just doesn't matter what 3+3 equals because 2+2=4.

I think WE KNOW that computational costs matter and we all agree that no matter how good the idea is, if it's too expensive it will not produce a stronger program. Can we get that out of the way first of all?

The part of this discussion that is so offensive is that you use that argument as some kind of proof that this is the only "right" way to test. It's almost like you believe that just because 2+2=4, 3+3 cannot be 6.

The kind of testing methodology that I advocate is a lot more flexible. I'll walk you through the process and compare your testing methodology with mine, and you can correct me where I am wrong.

So let's pretend you and I are doing the same exact thing: we have the same exact idea and the same notions about how to implement it. And now we are both going to do what we do:

Let's use mobility as our example. First of all, we get this idea that we want to improve it. Our idea, let's say, is to do a swap-off evaluation on every square that a piece can legally move to. (Let's pretend that we believe we have the perfect implementation of this, one that we think should be really fast.)

So now the question becomes, what score do you give each square? Do you give losing squares zero mobility, or some fraction? If it's a fraction, which fraction? Is it based on the specific square? Do you get a different score based on whether the piece can be captured at all (even if it's not losing)? I could go on with a million details, and there must be literally a zillion proposals one could come up with.
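
Just to show how many knobs there are, here is one hypothetical parameterization of those questions, with fractional credit for unsafe squares and a per-square weight. The numbers are placeholders, not tuned values:

Code:

#include <stdio.h>

/* Illustrative per-square weight: central squares count a bit more.
   These numbers are placeholders, not tuned values. */
static int square_weight(int sq)
{
    int file = sq & 7, rank = sq >> 3;
    int cf = file < 4 ? file : 7 - file;   /* 0..3, larger near center */
    int cr = rank < 4 ? rank : 7 - rank;
    return 1 + cf + cr;
}

#define SAFE_CREDIT   4   /* full credit for a safe square */
#define UNSAFE_CREDIT 1   /* "a little bit of credit instead of none" */

static int mobility_credit(int sq, int is_safe)
{
    return (is_safe ? SAFE_CREDIT : UNSAFE_CREDIT) * square_weight(sq);
}

int main(void)
{
    /* e4 is square 28 on a 0..63 board with a1 = 0 */
    printf("safe e4: %d, unsafe e4: %d\n",
           mobility_credit(28, 1), mobility_credit(28, 0));
    return 0;
}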

So we both choose some very specific idea that we think is most likely to succeed and we will implement it in the most efficient way that we can.

So far there is probably nothing substantially different between us. But here is where it gets a lot more interesting.

We both test. You set up a time control game and I set up a fixed depth game.

When the test is complete, you know for sure whether the exact idea you implemented, with the exact implementation you are sure is the ultimate one, is going to improve the program. That's the most important thing to know, of course. But what will I know?

I will know 3 things:

1. Does this idea win at fixed depth?
2. How much of a speed penalty do I pay for it?
3. Is it worth it?

Now you continue to argue that I don't really know if it works, but all I can say is that you're quite wrong about this point. I concede that your test is more reliable about this, but only slightly. So I know 3 things and you know only 1 thing at this point. And I'm going to make the point shortly that this gives me a huge advantage over you.
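
To show how (2) and (3) combine into an answer, here's a back-of-the-envelope sketch. The Elo-per-doubling number and both inputs are assumed, illustrative values, not measurements:

Code:

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* All three inputs are assumed, illustrative numbers. */
    double fixed_depth_gain = 10.0; /* Elo gained in the fixed-depth test */
    double slowdown         = 0.10; /* 10% NPS penalty, measured separately */
    double elo_per_doubling = 70.0; /* rough ballpark, engine-dependent */

    /* A 10% slowdown costs a fraction of a doubling of speed. */
    double speed_cost = log2(1.0 + slowdown) * elo_per_doubling;
    printf("estimated net gain: %.1f Elo\n", fixed_depth_gain - speed_cost);
    return 0;
}

With these numbers the idea barely breaks even, which is exactly the kind of answer questions 2 and 3 are there to give.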

Most of the time this first test will not work. So each of us has to make the following decisions:

1. When to give up on the idea completely.
2. If not, what to try next?

This is the point where we go our separate ways. I have more information than you and I can make a better decision about whether to give up or try something else. All you can do is take another random shot in the dark. "That didn't work - let's try this." But my next decision is going to be much more informed than yours.

For instance, I may find that the implementation is surprisingly fast, but that I'm not getting any ELO out of this; it's actually scoring much worse at the same depth. What could you conclude? You don't have a clue; you would just repeat the same old mantra to yourself, that it doesn't matter how good this idea is if the implementation is too slow. Of course you have no idea whether it's too slow and you don't care; you will just pat yourself on the back because you did one test to prove the idea doesn't work, and that's all that matters.

At this point I would be encouraged to know that the implementation is quite fast and I would be much more motivated to experiment with the same basic idea. I might discover that I should give unsafe squares a little bit of credit instead of none, or that I'm putting too much weight on the good squares. The point is that I already know just from the first test that I will probably be able to make this idea work.

Or I might conclude just the opposite. I might be appalled at the speed and discover that the idea is worth 25 ELO but that it's just way too slow. And I'm a bit different from you in one regard: I am not quite as confident that everything I do is perfect, so I would probably do something you consider foolish, and try to find a faster implementation. You would consider it foolish because you always do it right the first time. But my real point is that I have more information to guide me than you do.

There is one thing that you COULD do, and that is to examine your nodes per second. It's crude, but better than nothing at all. But you have never once claimed to be doing that, so I have to take you at your word, that you think the only thing you have to know is whether it wins or loses at time control games.

You claim that is the best and fastest way and that anyone who does it differently is wrong, but I think what you are not taking into consideration is that this is only true for a single pass. If you only want to know whether one implementation of a single specific idea is workable, a time control test is all you need, but this breaks down when you really need to flesh out an idea, actually understand what is going on, and do it the smart way.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty and software development

Post by bob »

Don wrote:
bob wrote:
There are _thousands_ of ideas that are good if computational costs can be ignored. However, computational cost cannot be overlooked. If there is no way to implement an idea efficiently enough that its gain is not offset by its cost, then there is no reason to even deal with the implementation.
Of course, you have to run time control games to determine that. I think you keep beating this dead horse because nobody disagrees with you on this but that's the only thing you want to argue.
If you will simply read this thread _first_ you will see that is the point that was being discussed. The _only_ point that was being discussed. So what thing should I argue? That I think my taxes are too high? That my daily schedule is too busy? What? Or should I (we) simply stick to the topic being discussed, which is what I did.
There has been a long thread about testing methods, and part of that was several people writing that fixed-node testing would allow separating the idea from the implementation, and Bob saying that it was not necessary for him, as he always implements his ideas efficiently, and it would only require him to test twice. It occurred to me that the release of Crafty 23.1 is an opportunity to test Bob's statements by doing a diff on his code and inspecting the changes to see if they are highly optimized.
So exactly what did I discuss/argue about that was _not_ in the post above???


It would be just as silly if I kept coming back at you and saying, Bob, 2+2=4 and it just doesn't matter what 3+3 equals because 2+2=4.
Again, "read the first post".

I think WE KNOW that computational costs matter and we all agree that no matter how good the idea is, if it's too expensive it will not produce a stronger program. Can we get that out of the way first of all?
And one more time, read the first post.

I even quoted it above so you don't have to jump back to the top of the thread.


The part of this discussion that is so offensive is that you use that argument as some kind of proof that this is the only "right" way to test. It's almost like you believe that just because 2+2=4, 3+3 cannot be 6.
What is offensive is for you to jump in here, apparently without reading anything prior, and then tell me I am arguing a point everyone agrees with. If you would just read first, _before_ posting, we would not even be having this side-discussion.

The kind of testing methodology that I advocate is a lot more flexible. I'll walk you through the process and compare your testing methodology with mine, and you can correct me where I am wrong.

So let's pretend you and I are doing the same exact thing: we have the same exact idea and the same notions about how to implement it. And now we are both going to do what we do:

Let's use mobility as our example. First of all, we get this idea that we want to improve it. Our idea, let's say, is to do a swap-off evaluation on every square that a piece can legally move to. (Let's pretend that we believe we have the perfect implementation of this, one that we think should be really fast.)

So now the question becomes, what score do you give each square? Do you give losing squares zero mobility, or some fraction? If it's a fraction, which fraction? Is it based on the specific square? Do you get a different score based on whether the piece can be captured at all (even if it's not losing)? I could go on with a million details, and there must be literally a zillion proposals one could come up with.

So we both choose some very specific idea that we think is most likely to succeed and we will implement it in the most efficient way that we can.

So far there is probably nothing substantially different between us. But here is where it gets a lot more interesting.

We both test. You set up a time control game and I set up a fixed depth game.

When the test is complete, you know for sure whether the exact idea you implemented, with the exact implementation you are sure is the ultimate one, is going to improve the program. That's the most important thing to know, of course. But what will I know?

I will know 3 things:

1. Does this idea win at fixed depth?
2. How much of a speed penalty do I pay for it?
3. Is it worth it?
I just happen to know the answers to questions 2 and 3 also. As far as number one, I don't care. I can come up with a ton of ideas that work at fixed depth and fail at timed games. But what's the point?

When I finish, I know what the speed penalty is, because I measure that independently of games. I know whether it is worth it or not from my testing. I fail to see why (1) is important at all. Just ramp up your extensions in the new version and it will win more games at fixed depth. Nearly every time.
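
Here's a toy illustration of why that works: at the same nominal depth, the version that extends more searches a bigger real tree, so it looks stronger at fixed depth while paying a time cost that the test never sees. The branching factor and extension scheme are made up purely for the demonstration:

Code:

#include <stdio.h>

/* Toy tree with branching factor 3. "ext_budget" allows extending the
   first move at a node by one ply, a crude stand-in for something like
   a check extension. Same nominal depth, bigger real tree. */
static long nodes(int depth, int ext_budget)
{
    if (depth <= 0)
        return 1;
    long n = 1;
    for (int move = 0; move < 3; move++) {
        int ext = (move == 0 && ext_budget > 0) ? 1 : 0;
        n += nodes(depth - 1 + ext, ext_budget - ext);
    }
    return n;
}

int main(void)
{
    printf("depth 6, no extensions:   %ld nodes\n", nodes(6, 0));
    printf("depth 6, with extensions: %ld nodes\n", nodes(6, 2));
    return 0;
}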

Now you continue to argue that I don't really know if it works, but all I can say is that you're quite wrong about this point. I concede that your test is more reliable about this, but only slightly. So I know 3 things and you know only 1 thing at this point. And I'm going to make the point shortly that this gives me a huge advantage over you.
Again, see above. You only know one more thing than I do. You know how it works in a fixed-depth match. Hooray. And you still get to test in time control games to see if it works there. A second test. I have already finished and moved on to the next idea by this point.


Most of the time this first test will not work. So each of us has to make the following decisions:

1. When to give up on the idea completely.
2. If not, what to try next?

This is the point where we go our separate ways. I have more information than you and I can make a better decision about whether to give up or try something else. All you can do is take another random shot in the dark. "That didn't work - let's try this." But my next decision is going to be much more informed than yours.
We don't do "random shots in the dark" here. Don't know where you get that idea and really don't care, but it is wrong. We don't just say "that didn't work, let's try this." We say "that didn't work, why?" and we continue from there. On occasion we have tried an idea many times before finding something that works.

For instance, I may find that the implementation is surprisingly fast, but that I'm not getting any ELO out of this; it's actually scoring much worse at the same depth. What could you conclude? You don't have a clue; you would just repeat the same old mantra to yourself, that it doesn't matter how good this idea is if the implementation is too slow. Of course you have no idea whether it's too slow and you don't care; you will just pat yourself on the back because you did one test to prove the idea doesn't work, and that's all that matters.
I'm simply going to exercise my option to stop at this point. Believe what you want. Test how you want. If you are happy with fixed depth testing, continue doing it. I experimented with it, and with fixed nodes, for 6 months on our cluster before we got to the point we are at today. I'm not going back; that kind of testing has too many issues, whereas timed tests have absolutely none, since timed games are how I am going to have to play real games. If you can't debug a program, or test an idea, using a clock, and then draw reasonable conclusions about what the results mean, that's perhaps something you need to work on. I'm not stumbling around in a fog, and I have made significant progress in a little over a year using this "flawed, random shot-in-the-dark" approach.

At this point I would be encouraged to know that the implementation is quite fast and I would be much more motivated to experiment with the same basic idea. I might discover that I should give unsafe squares a little bit of credit instead of none, or that I'm putting too much weight on the good squares. The point is that I already know just from the first test that I will probably be able to make this idea work.
Your "implementation is quite fast" is a joke, of course. You can run 10 test positions and know whether you slowed the thing down significantly, or any at all. You don't need thousands of games to discover that. That's the first test I run for _every_ change we try, to determine what happened to NPS. Many of our changes have no effect, because we spend some time trying to implement them efficiently (the mobility code we use has very minimal cost because of the cache stuff we use to avoid duplicate calculations when pieces don't move and files/ranks/files don't change. And we knew that 5 minutes after the change was done. And we didn't need thousands of games, regardless of how long they took. And I don't really care about the cases where it gets much slower and seems to play better, because that is caught in our normal timed-game testing anyway.

Or I might conclude just the opposite. I might be appalled at the speed and discover that the idea is worth 25 ELO but that it's just way too slow. And I'm a bit different from you in one regard: I am not quite as confident that everything I do is perfect, so I would probably do something you consider foolish, and try to find a faster implementation. You would consider it foolish because you always do it right the first time. But my real point is that I have more information to guide me than you do.
You "think" you have more information. But most of it is worthless. An idea can not _possibly_ be worth 25 elo _and_ be way slow. Because Elo is a measure of performance in real games. And I see that effect in my testing. I start off knowing what the speed penalty was. I finish the test knowing what the elo effect was. We are all quite capable of looking at those two numbers to determine where to go next. Not sure why you can't.. but we certainly can... and do.


There is one thing that you COULD do, and that is to examine your nodes per second. It's crude, but better than nothing at all. But you have never once claimed to be doing that, so I have to take you at your word, that you think the only thing you have to know is whether it wins or loses at time control games.
Did you ever read where I said that we developed efficient (not necessarily optimal, but certainly efficient) implementations to test with? Exactly _how_ do you think we know what is efficient? I would think _anyone_ that has read my posts over the years knows that I very definitely know the effect on NPS of every last change we make. We are talking about testing games, not testing performance. So just because I didn't say "we do that" doesn't mean we don't do it. We _always_ have. I've never made a change without testing the NPS first, unless I knew the change had nothing to do with raw search speed.

When you read what I write, it appears to me that you want to insert silly assertions at any point where I didn't directly address an issue. This is a classic. Exactly how do you think Crafty got to be as fast as it is today if I were not aware of how every change affects speed? I have been doing that since 1968, and I make the basic assumption that any rational chess programmer will first test NPS before doing anything else. Why you would assume anything else from me is mind-boggling, actually.

You claim that is the best and fastest way and that anyone who does it differently is wrong, but I think what you are not taking into consideration is that this is only true for a single pass. If you only want to know whether one implementation of a single specific idea is workable, a time control test is all you need, but this breaks down when you really need to flesh out an idea, actually understand what is going on, and do it the smart way.
I don't consider running a ton of tests "the smart way to do this," actually. We try to do that _before_ dumping things on the cluster.
mjlef
Posts: 1494
Joined: Thu Mar 30, 2006 2:08 pm

Re: Crafty and software development

Post by mjlef »

Please continue these long discussions. I figure every moment Don or Bob spends typing responses is time they do not spend working on their programs, and so it gives me a chance to catch up! :-)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty and software development

Post by bob »

mjlef wrote:Please continue these long discussions. I figure every moment Don or Bob spends typing responses is time they do not spend working on their programs, and so it gives me a chance to catch up! :-)
I only post while testing. I avoid getting too much going on at once, such as testing version X+1 and working on a new idea for X+2. Do I change X or X+1 to create X+2? Easier to wait to see what testing shows and avoid choosing the wrong one to modify. That leaves enough time to address these arguments. :)

While we were having this argument today, I found another +10 elo. :) That's what the testing showed.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Crafty and software development

Post by Don »

mjlef wrote:Please continue these long discussions. I figure every moment Don or Bob spends typing responses is time they do not spend working on their programs, and so it gives me a chance to catch up! :-)
I don't have Bob's testing resources, so I spend more of my time waiting on tests than doing actual software development. So I'm basically just slowing Bob down :-)