OliverBr wrote: I don't see any risk of "inbreeding" here, because crafty is genetically very different to olithink.
In my opinion, the risk of testing against just crafty is almost as large as the risk of self-testing. You are tuning your engine to beat crafty (from the start positions you are using), so you risk creating an engine that is particularly good at beating crafty but perhaps worse against other engines. If crafty is bad against pawn storms, for example, you may end up building an engine that plays all sorts of overly risky pawn storms that other engines would have punished (just an example, I have no idea how crafty handles pawn storms). If you are going to play 1000 games, I would recommend 200 games against each of 5 engines instead. I think that gives more reliable results, even though it reduces the number of starting positions you can use per opponent.
Regarding "how many" games are needed to detect an improvement, I leave that to the experts. I will say that I think it depends on what you are using the testing for. If it is for judging a feature or deciding which version to use in ChessWars, perfect testing is not critical (1000 games sounds great to me). When sending an engine to a testing group that is going to spend a lot of time on it, I would be more conservative in my criteria, as a sign of respect for their time investment.
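For anyone who wants a rough feel for the "how many games" question, here is a minimal sketch (not an expert answer) that turns a win/draw/loss count into an Elo estimate with an approximate 95% interval. It assumes the usual logistic Elo formula and a normal approximation with independent games; the function names `elo_from_score` and `match_summary` are made up for this illustration.

```python
import math


def elo_from_score(score):
    """Convert a score fraction (strictly between 0 and 1) to an Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def match_summary(wins, draws, losses, z=1.96):
    """Return (elo, low, high) for a match result.

    Assumption: games are independent and the normal approximation holds,
    which is only a rough model of real engine matches.
    """
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n

    # Per-game variance of the result, where a game scores 1, 0.5 or 0.
    var = (
        wins * (1.0 - score) ** 2
        + draws * (0.5 - score) ** 2
        + losses * (0.0 - score) ** 2
    ) / n
    stderr = math.sqrt(var / n)

    # Clamp so the logarithm stays defined for very lopsided results.
    eps = 1e-9
    low_score = min(max(score - z * stderr, eps), 1.0 - eps)
    high_score = min(max(score + z * stderr, eps), 1.0 - eps)
    clamped = min(max(score, eps), 1.0 - eps)

    return elo_from_score(clamped), elo_from_score(low_score), elo_from_score(high_score)


if __name__ == "__main__":
    # Hypothetical result: an even 1000-game match with 20% draws.
    elo, low, high = match_summary(400, 200, 400)
    print(f"Elo {elo:+.1f}  (95% interval {low:+.1f} .. {high:+.1f})")
```

With those hypothetical numbers the interval comes out at roughly plus or minus 19 Elo, so under these assumptions 1000 games can only separate versions that differ by more than about that much. The same function can be applied to results pooled over several opponents, though pooling hides any opponent-specific differences.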
-Sam