Testing Best Practices

snowman

Testing Best Practices

Post by snowman »

First, I want to say that this forum is great and full of computer chess enthusiasts. My first post was about PV retrieval, and the discussion went way beyond my initial expectations. People here are very kind and helpful to beginners like me.

Second, in this post I want to ask about your testing best practices, such as:
1) How do you test the correctness of a new version?
2) How do you evaluate an improvement? How do you set up a system to quickly quantify it (in Elo)? I think most of us don't have access to big clusters like Bob does, so I'm mainly interested in testing on normal PCs.
3) How do you test the different parts of the engine: move generation, move ordering, search, evaluation, speed, etc.?
4) Any other good practices.

I hope this thread will be helpful to everyone who wants to improve their engine quickly and systematically.

Many thanks!
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Testing Best Practices

Post by bob »

snowman wrote: First, I want to say that this forum is great and full of computer chess enthusiasts. My first post was about PV retrieval, and the discussion went way beyond my initial expectations. People here are very kind and helpful to beginners like me.

Second, in this post I want to ask about your testing best practices, such as:
1) How do you test the correctness of a new version?
2) How do you evaluate an improvement? How do you set up a system to quickly quantify it (in Elo)? I think most of us don't have access to big clusters like Bob does, so I'm mainly interested in testing on normal PCs.
3) How do you test the different parts of the engine: move generation, move ordering, search, evaluation, speed, etc.?
4) Any other good practices.

I hope this thread will be helpful to everyone who wants to improve their engine quickly and systematically.

Many thanks!
For (1), it depends on what you mean by "correctness". For example, suppose you take your program and rewrite a part of it to make it faster. It is supposed to do exactly the same things, just faster. That means that when you search a tree to a fixed depth, the previous version and the new version should search exactly the same number of nodes. If not, fix it. NEVER say "it's just one node, no big deal." It is a big deal, and it is a bug that will sit there and cause problems over time. If you make a search or eval change, then this test won't work, because an eval change (not just a performance change, but a change that alters the score) will alter the shape/size of the tree. Testing in real games is about the only solution for that.
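To make that concrete, here is a minimal sketch of such a fixed-depth regression check. The search_nodes() entry point and the reference counts are placeholders rather than any real engine's API; the point is only that the old binary and the new binary must report identical node counts for the same positions at the same depth:

/* Fixed-depth node-count regression check (sketch).
 * search_nodes() is a placeholder -- adapt it to your engine's own API. */
#include <stddef.h>
#include <stdio.h>

/* Placeholder: search this FEN to 'depth' plies and return the node count. */
extern unsigned long long search_nodes(const char *fen, int depth);

static const char *test_fens[] = {
    "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1",
    "r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq - 0 1",
};

int main(void) {
    /* Node counts produced by the known-good version at the same depth;
     * fill these in before refactoring. */
    unsigned long long expected[2] = { 0ULL, 0ULL };
    int depth = 8, failures = 0;
    size_t i;

    for (i = 0; i < sizeof(test_fens) / sizeof(test_fens[0]); i++) {
        unsigned long long got = search_nodes(test_fens[i], depth);
        if (got != expected[i]) {
            printf("MISMATCH on position %zu: expected %llu, got %llu\n",
                   i, expected[i], got);
            failures++;
        }
    }
    printf(failures ? "node-count regression FAILED\n" : "node counts identical\n");
    return failures;
}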

To measure small Elo gains, you need large numbers of games; there is no shortcut. You can play very fast games, but you do need a lot of them. To measure a 4-5 Elo change, you need 30K games or so. If you want to measure 1 Elo, you are in the 200K range...
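As a rough, back-of-envelope illustration of why the numbers scale this way, the error bar shrinks with the square root of the number of games. The sketch below assumes two roughly equal opponents and a typical draw rate, so the per-game standard deviation of the score is taken to be about 0.4:

/* Back-of-envelope Elo error bar as a function of the number of games.
 * Assumes roughly equal opponents and a per-game standard deviation of
 * the score of about 0.4 (draws pull it below the 0.5 of pure win/loss). */
#include <math.h>
#include <stdio.h>

static double elo_error_95(long games, double sigma_per_game) {
    double sigma_score = sigma_per_game / sqrt((double)games);
    /* Near a 50% score, d(Elo)/d(score) = 400 / (ln(10) * p * (1 - p))
     * at p = 0.5, i.e. about 695 Elo per unit of score. */
    double elo_per_score = 400.0 / (log(10.0) * 0.25);
    return 2.0 * elo_per_score * sigma_score;   /* roughly 95% confidence */
}

int main(void) {
    long n[] = { 1000, 10000, 30000, 200000 };
    int i;
    for (i = 0; i < 4; i++)
        printf("%7ld games -> +/- %.1f Elo\n", n[i], elo_error_95(n[i], 0.4));
    return 0;
}

Under those assumptions this prints roughly +/- 18 Elo at 1,000 games, +/- 3 at 30K, and a bit over +/- 1 at 200K, consistent with the figures above.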
snowman

Re: Testing Best Practices

Post by snowman »

Hi Bob, can you talk more about how you select opponents for your engine and how the different parameters are set? For example, how many players do you include in the tournament? Which rules? How fast a time control per game is enough to get a reasonably accurate Elo estimate? Is there any tool to automate all of this, or do we need to implement it ourselves?
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Testing Best Practices

Post by michiguel »

snowman wrote: First, I want to say that this forum is great and full of computer chess enthusiasts. My first post was about PV retrieval, and the discussion went way beyond my initial expectations. People here are very kind and helpful to beginners like me.

Second, in this post I want to ask about your testing best practices, such as:
1) How do you test the correctness of a new version?
2) How do you evaluate an improvement? How do you set up a system to quickly quantify it (in Elo)? I think most of us don't have access to big clusters like Bob does, so I'm mainly interested in testing on normal PCs.
3) How do you test the different parts of the engine: move generation, move ordering, search, evaluation, speed, etc.?
4) Any other good practices.

I hope this thread will be helpful to everyone who wants to improve their engine quickly and systematically.

Many thanks!
1) Use assert(). I cannot stress enough how important this is, particularly for a beginner. Once you have decades of experience like Bob, you can relax this a little bit... NOT! Not even then.

2) Write code that will do this: flip the position left to right, and mirror the position black to white. As long as there are no castling rights, all four combinations should give you the **exact** same evaluation score.
Once you have written this code, place a call to it just before you call search() and make sure that all four evals are the same. If not, exit() with lots of information printed (the board, etc.).
Wrap the whole thing in an assert() or #if !defined(NDEBUG), etc. (A sketch of such a check appears at the end of this post.)

3) Run matches with both of the options above activated. Catching bugs early is the best thing that can happen to you.

4) Find a way to dump the whole tree if needed. Keep a stack with all the moves and dump the whole variation every time you call makemove().

5) Analyze EPD test suites. Analyze a whole set and measure the total number of nodes. If move ordering gets better, those numbers go down.

6) Check move generation with perft() and, most importantly, perftdiv() on known positions. You will save a huge amount of time!

7) Every time you introduce a new feature, keep it simple. You will have to rewrite it later anyway.

8) Use as few globals as possible. You will thank me for this advice when you decide to make your engine SMP (but do not go crazy about it; see point #7).

9) Follow the progress of your engine with fast games, or games in which the engine does not go too deep. Why? You can more easily understand why the engine makes mistakes at shallow depths. At the beginning there will be plenty of them, and you do not want the search to hide them.

10) Make it your goal to write a bug-free engine and do not worry about Elo until the engine is at 2200. Make sure that each feature you introduce is done properly. You will have to retune the whole thing all over again anyway.

11) If you play matches, limit the thinking time by nodes, not time; i.e., limit your engine to, say, 10000 nodes per move. At the beginning you need to make things reproducible. Suppose you find there was a mistake in one of your matches: if everything is done by "nodes" or "depth", you may be able to come back and reproduce the problem.

12) Quit this hobby right now before it is too late :-)

Miguel
PS: It is extremely important to write the right tools. It is the only way to make steady progress. Be ready, this is a marathon.
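To make point 2 above concrete, here is a minimal sketch of such a symmetry check. Everything declared extern is a placeholder invented for illustration (the board type, the flip/mirror helpers, evaluate), so wire it up to your own engine:

/* Sketch of the evaluation symmetry check from point 2.  All extern
 * declarations are invented placeholders for your own engine's code. */
#include <stdio.h>
#include <stdlib.h>

typedef struct Board Board;                    /* your engine's position type     */
extern int    evaluate(const Board *b);        /* static eval, side-to-move view  */
extern Board *flip_horizontal(const Board *b); /* a-file <-> h-file               */
extern Board *mirror_colors(const Board *b);   /* white <-> black, ranks flipped  */
extern int    has_castling_rights(const Board *b);
extern void   print_board(const Board *b);

/* Call this just before search() in debug builds. */
void verify_eval_symmetry(const Board *b) {
#if !defined(NDEBUG)
    /* Castling rights break the left-right symmetry, so skip those positions. */
    if (has_castling_rights(b))
        return;

    Board *flipped  = flip_horizontal(b);
    Board *mirrored = mirror_colors(b);
    Board *both     = mirror_colors(flipped);

    int e0 = evaluate(b);
    int e1 = evaluate(flipped);
    int e2 = evaluate(mirrored);
    int e3 = evaluate(both);

    if (e0 != e1 || e0 != e2 || e0 != e3) {
        fprintf(stderr, "eval asymmetry: %d %d %d %d\n", e0, e1, e2, e3);
        print_board(b);        /* dump enough information to reproduce it */
        exit(1);
    }
#else
    (void)b;
#endif
}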
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Testing Best Practices

Post by bob »

snowman wrote: Hi Bob, can you talk more about how you select opponents for your engine and how the different parameters are set? For example, how many players do you include in the tournament? Which rules? How fast a time control per game is enough to get a reasonably accurate Elo estimate? Is there any tool to automate all of this, or do we need to implement it ourselves?
I ended up using 5 engines. When we started, all of them were better than Crafty; however, over the past 2 years we have passed all but one. You have to take what you can get here. I can't use Windows programs since our cluster runs Linux, and I need source code because I have to compile against the cluster libraries; we don't install every possible lib there. Once you have the opponents, you need positions: I have a set (openings.epd) on my FTP machine that anyone can use. I think there are 4,000 total positions there. I use the first 3,000 and play both sides against each opponent, which gives 6,000 games. With 5 opponents, that is 30K games, which gives +/- 3 error bars for Elo comparisons.

The time control is pretty much up to you: slow enough that time losses are either very infrequent or never happen, and fast enough to get the games done so that you don't spend all your time waiting on results. As for rules, I don't know what you mean. I wrote my own referee for the cluster; it relays moves back and forth, maintains the clock info just like winboard/xboard does, and declares games won, lost, or drawn based on the moves played.

I have a mechanism that will play a series of 30K games, varying one or two parameters by any amount I choose. These "settings" are incorporated into the PGN program names, so the BayesElo output makes it easy to see which setting(s) were better.
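As a small illustration of that naming trick (the parameter names below are invented for the example): if the match driver formats the settings under test into the player name it writes to the PGN header, BayesElo will rate each configuration as a separate "player":

/* Encode the parameter settings under test into the PGN player name (sketch).
 * The parameters here are invented for illustration. */
#include <stdio.h>

int main(void) {
    int null_reduction = 3;   /* hypothetical parameter being varied */
    int lmr_min_depth  = 4;   /* another hypothetical parameter      */
    char name[128];

    snprintf(name, sizeof(name), "MyEngine-nullR%d-lmrD%d",
             null_reduction, lmr_min_depth);

    /* Written into each game's PGN header by the match driver. */
    printf("[White \"%s\"]\n", name);
    return 0;
}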
snowman

Re: Testing Best Practices

Post by snowman »

michiguel wrote:
snowman wrote: First, I want to say that this forum is great and full of computer chess enthusiasts. My first post was about PV retrieval, and the discussion went way beyond my initial expectations. People here are very kind and helpful to beginners like me.

Second, in this post I want to ask about your testing best practices, such as:
1) How do you test the correctness of a new version?
2) How do you evaluate an improvement? How do you set up a system to quickly quantify it (in Elo)? I think most of us don't have access to big clusters like Bob does, so I'm mainly interested in testing on normal PCs.
3) How do you test the different parts of the engine: move generation, move ordering, search, evaluation, speed, etc.?
4) Any other good practices.

I hope this thread will be helpful to everyone who wants to improve their engine quickly and systematically.

Many thanks!
1) Use assert(). I cannot stress enough how important this is, particularly for a beginner. Once you have decades of experience like Bob, you can relax this a little bit... NOT! Not even then.

2) Write code that will do this: flip the position left to right, and mirror the position black to white. As long as there are no castling rights, all four combinations should give you the **exact** same evaluation score.
Once you have written this code, place a call to it just before you call search() and make sure that all four evals are the same. If not, exit() with lots of information printed (the board, etc.).
Wrap the whole thing in an assert() or #if !defined(NDEBUG), etc.

3) Run matches with both of the options above activated. Catching bugs early is the best thing that can happen to you.

4) Find a way to dump the whole tree if needed. Keep a stack with all the moves and dump the whole variation every time you call makemove().

5) Analyze EPD test suites. Analyze a whole set and measure the total number of nodes. If move ordering gets better, those numbers go down.

6) Check move generation with perft() and, most importantly, perftdiv() on known positions. You will save a huge amount of time!

7) Every time you introduce a new feature, keep it simple. You will have to rewrite it later anyway.

8) Use as few globals as possible. You will thank me for this advice when you decide to make your engine SMP (but do not go crazy about it; see point #7).

9) Follow the progress of your engine with fast games, or games in which the engine does not go too deep. Why? You can more easily understand why the engine makes mistakes at shallow depths. At the beginning there will be plenty of them, and you do not want the search to hide them.

10) Make it your goal to write a bug-free engine and do not worry about Elo until the engine is at 2200. Make sure that each feature you introduce is done properly. You will have to retune the whole thing all over again anyway.

11) If you play matches, limit the thinking time by nodes, not time; i.e., limit your engine to, say, 10000 nodes per move. At the beginning you need to make things reproducible. Suppose you find there was a mistake in one of your matches: if everything is done by "nodes" or "depth", you may be able to come back and reproduce the problem.

12) Quit this hobby right now before it is too late :-)

Miguel
PS: It is extremely important to write the right tools. It is the only way to make steady progress. Be ready, this is a marathon.
Hi Miguel, what a great list! I will take all of your advice except the last point. I wish I had asked earlier; it's too late for that by now. :D
Thanks a lot!
snowman

Re: Testing Best Practices

Post by snowman »

bob wrote:
snowman wrote: Hi Bob, can you talk more about how you select opponents for your engine and how the different parameters are set? For example, how many players do you include in the tournament? Which rules? How fast a time control per game is enough to get a reasonably accurate Elo estimate? Is there any tool to automate all of this, or do we need to implement it ourselves?
I ended up using 5 engines. When we started, all of them were better than Crafty; however, over the past 2 years we have passed all but one. You have to take what you can get here. I can't use Windows programs since our cluster runs Linux, and I need source code because I have to compile against the cluster libraries; we don't install every possible lib there. Once you have the opponents, you need positions: I have a set (openings.epd) on my FTP machine that anyone can use. I think there are 4,000 total positions there. I use the first 3,000 and play both sides against each opponent, which gives 6,000 games. With 5 opponents, that is 30K games, which gives +/- 3 error bars for Elo comparisons.

The time control is pretty much up to you: slow enough that time losses are either very infrequent or never happen, and fast enough to get the games done so that you don't spend all your time waiting on results. As for rules, I don't know what you mean. I wrote my own referee for the cluster; it relays moves back and forth, maintains the clock info just like winboard/xboard does, and declares games won, lost, or drawn based on the moves played.

I have a mechanism that will play a series of 30K games, varying one or two parameters by any amount I choose. These "settings" are incorporated into the PGN program names, so the BayesElo output makes it easy to see which setting(s) were better.
Many thanks, Bob! I have another naive question: when people talk about NPS, what exactly is considered a node? Do they count the number of evaluation function calls, or something else?
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Testing Best Practices

Post by Michael Sherwin »

snowman wrote:
bob wrote:
snowman wrote: Hi Bob, can you talk more about how you select opponents for your engine and how the different parameters are set? For example, how many players do you include in the tournament? Which rules? How fast a time control per game is enough to get a reasonably accurate Elo estimate? Is there any tool to automate all of this, or do we need to implement it ourselves?
I ended up using 5 engines. When we started, all of them were better than Crafty; however, over the past 2 years we have passed all but one. You have to take what you can get here. I can't use Windows programs since our cluster runs Linux, and I need source code because I have to compile against the cluster libraries; we don't install every possible lib there. Once you have the opponents, you need positions: I have a set (openings.epd) on my FTP machine that anyone can use. I think there are 4,000 total positions there. I use the first 3,000 and play both sides against each opponent, which gives 6,000 games. With 5 opponents, that is 30K games, which gives +/- 3 error bars for Elo comparisons.

The time control is pretty much up to you: slow enough that time losses are either very infrequent or never happen, and fast enough to get the games done so that you don't spend all your time waiting on results. As for rules, I don't know what you mean. I wrote my own referee for the cluster; it relays moves back and forth, maintains the clock info just like winboard/xboard does, and declares games won, lost, or drawn based on the moves played.

I have a mechanism that will play a series of 30K games, varying one or two parameters by any amount I choose. These "settings" are incorporated into the PGN program names, so the BayesElo output makes it easy to see which setting(s) were better.
Many thanks, Bob! I have another naive question: when people talk about NPS, what exactly is considered a node? Do they count the number of evaluation function calls, or something else?
You must be trying to put Bob in his grave! :lol: He has answered that question sooooo many times. And just very recently too. Yes, people are nice here, but a new person should be nice and do some reading.

Basically, Bob defines a node as any position that is reached while searching. I count them the same way as Bob to be consistent with most other engines; however, my technical (to me, correct) definition of a node is different. To me, a node is any position reached in the actual alpha-beta tree, not any side position that is reached for other reasons.
snowman

Re: Testing Best Practices

Post by snowman »

Michael Sherwin wrote:
snowman wrote:
bob wrote:
snowman wrote: Hi Bob, can you talk more about how you select opponents for your engine and how the different parameters are set? For example, how many players do you include in the tournament? Which rules? How fast a time control per game is enough to get a reasonably accurate Elo estimate? Is there any tool to automate all of this, or do we need to implement it ourselves?
I ended up using 5 engines. When we started, all of them were better than Crafty; however, over the past 2 years we have passed all but one. You have to take what you can get here. I can't use Windows programs since our cluster runs Linux, and I need source code because I have to compile against the cluster libraries; we don't install every possible lib there. Once you have the opponents, you need positions: I have a set (openings.epd) on my FTP machine that anyone can use. I think there are 4,000 total positions there. I use the first 3,000 and play both sides against each opponent, which gives 6,000 games. With 5 opponents, that is 30K games, which gives +/- 3 error bars for Elo comparisons.

The time control is pretty much up to you: slow enough that time losses are either very infrequent or never happen, and fast enough to get the games done so that you don't spend all your time waiting on results. As for rules, I don't know what you mean. I wrote my own referee for the cluster; it relays moves back and forth, maintains the clock info just like winboard/xboard does, and declares games won, lost, or drawn based on the moves played.

I have a mechanism that will play a series of 30K games, varying one or two parameters by any amount I choose. These "settings" are incorporated into the PGN program names, so the BayesElo output makes it easy to see which setting(s) were better.
Many thanks, Bob! I have another naive question: when people talk about NPS, what exactly is considered a node? Do they count the number of evaluation function calls, or something else?
You must be trying to put Bob in his grave! :lol: He has answered that question sooooo many times. And just very recently too. Yes, people are nice here, but a new person should be nice and do some reading.

Basically, Bob defines a node as any position that is reached while searching. I count them the same way as Bob to be consistent with most other engines; however, my technical (to me, correct) definition of a node is different. To me, a node is any position reached in the actual alpha-beta tree, not any side position that is reached for other reasons.
Hi Michael,
Thanks for your answer. I didn't mean to ask without searching first, but the search tool often returns too many results. I will try to ask better questions. :)

@Bob: sorry, please ignore my last question.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Testing Best Practices

Post by bob »

snowman wrote:
bob wrote:
snowman wrote: Hi Bob, can you talk more about how you select opponents for your engine and how the different parameters are set? For example, how many players do you include in the tournament? Which rules? How fast a time control per game is enough to get a reasonably accurate Elo estimate? Is there any tool to automate all of this, or do we need to implement it ourselves?
I ended up using 5 engines. When we started, all of them were better than Crafty; however, over the past 2 years we have passed all but one. You have to take what you can get here. I can't use Windows programs since our cluster runs Linux, and I need source code because I have to compile against the cluster libraries; we don't install every possible lib there. Once you have the opponents, you need positions: I have a set (openings.epd) on my FTP machine that anyone can use. I think there are 4,000 total positions there. I use the first 3,000 and play both sides against each opponent, which gives 6,000 games. With 5 opponents, that is 30K games, which gives +/- 3 error bars for Elo comparisons.

The time control is pretty much up to you: slow enough that time losses are either very infrequent or never happen, and fast enough to get the games done so that you don't spend all your time waiting on results. As for rules, I don't know what you mean. I wrote my own referee for the cluster; it relays moves back and forth, maintains the clock info just like winboard/xboard does, and declares games won, lost, or drawn based on the moves played.

I have a mechanism that will play a series of 30K games, varying one or two parameters by any amount I choose. These "settings" are incorporated into the PGN program names, so the BayesElo output makes it easy to see which setting(s) were better.
Many thanks, Bob! I have another naive question: when people talk about NPS, what exactly is considered a node? Do they count the number of evaluation function calls, or something else?
I've always tried to use the classic AI definition: a node is a position. If you update the board to produce a new position and then later have to retract that move to get back, that is a node. If you count evals, you miss the interior nodes that you must pass through to get to the q-search, so your NPS will be lower than a traditional count. Some count nodes when they call MakeMove(). Some count a node each time they enter Search() or Quiesce(). It doesn't make a lot of difference. If you get to the point where you make a move, do a lot of analysis, and then unmake the move without recursively calling search() or quiesce(), then incrementing the counter in make() is probably the best idea...
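For example, one common arrangement, sketched below with placeholder names, is to bump a single counter inside the make-move routine, so that interior, leaf, and quiescence nodes are all counted the same way:

/* Counting a node wherever a move is actually made on the board (sketch).
 * The Position type and do_make_move() are placeholders for your own code. */
#include <stdint.h>

typedef struct Position Position;   /* your engine's position type */
typedef uint32_t Move;

extern void do_make_move(Position *pos, Move m);   /* your real make-move */

static uint64_t node_count = 0;

void MakeMove(Position *pos, Move m) {
    node_count++;              /* every position actually reached is a node */
    do_make_move(pos, m);
}

/* Nodes per second is then just the counter divided by elapsed search time. */
double nps(uint64_t nodes, double elapsed_seconds) {
    return elapsed_seconds > 0.0 ? (double)nodes / elapsed_seconds : 0.0;
}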