more on engine testing

Discussion of chess software programming and technical issues.

Moderator: Ras

hgm
Posts: 28354
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: more on engine testing

Post by hgm »

bob wrote:
hgm wrote:And this is the point that you don't seem to get. It doesn't matter how random the influences are. Random is good. The numbers quoted by BayesElo (which, in your case, you could also calculate by hand in 15 seconds) assume totally independent random results, and are an upper limit to the difference that you could typically have between repetitions of the same experiment.
Apparently not. I just gave two runs that violated this completely. So it is _far_ from absolute. And again, I will say that this apparently doesn't apply to computer games, otherwise I am getting _way_ too many impossible results.
Well, what can I say. If you get impossible results, they must be wrong results. And getting many wrong results is a good operational definition for incompetence...

If you think that all of mathematics can be dismissed because you lack the imagination to pinpoint your source of errors, and fix it, think again. Mathematics is not subject to experimental testing or verification. No matter how often and in how many different ways you calculate for us that 25 x 25 equals 626, it would only serve to show that you don't know how to multiply. The correct answer would still remain 625.
bob wrote:
hgm wrote: Your results violate that upper bound, and hence the assumptions cannot be satisfied. If you cannot pinpoint the source of the dependence between your games, and eliminate it, your large-number-of-games testing will be completely useless. If I had designed a laser ruler to measure distances with sub-nanometer precision, and the typical difference between readings on the same steel needle were more than a millimeter, I would try to repair my laser ruler, not complain about the variability of the length of needles...
Right. And I have explained exactly why there is no dependency of one game on another.
So your explanation is wrong. Mathematical facts cannot be denied. Apparently the environment is not as controlled as you think.
The game results are correlated. (Mathematical fact.) Explain to me how they can become correlated without a causal effect acting between them, or show us what the causal effect is.
Each game is run individually, in a very controlled environment. The same copy of an engine is used each and every time. Whatever they do internally I don't care about, since that is a variable everyone has to deal with. I'm tired of this nonsense about games somehow being dependent, when it just can't happen. There are other explanations, if you would just look. One is the inherent randomness and streakiness of chess games.
What incredible bullshit. A freshman's course in statistics might cure you from such delusions. There is no way you can violate a Gaussian distribution with the width calculated for independent results. Do the freshman's math, or ask a competent statistician if you can't...
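For concreteness, the freshman's math can even be done numerically: simulate a large number of truly independent 800-game runs between equal opponents and look at the spread you get. A minimal Python sketch; the 40% draw ratio is just an illustrative assumption:

    import random

    def run_score(games=800, draw_ratio=0.4):
        # Score of one simulated run: win=1, draw=0.5, loss=0, equal opponents.
        score = 0.0
        for _ in range(games):
            r = random.random()
            if r < draw_ratio:
                score += 0.5
            elif r < draw_ratio + (1.0 - draw_ratio) / 2.0:
                score += 1.0
        return score

    runs = [run_score() for _ in range(2000)]
    mean = sum(runs) / len(runs)
    sd = (sum((x - mean) ** 2 for x in runs) / len(runs)) ** 0.5
    print(mean, sd)   # roughly 400 and 11 points; two independent repetitions typically differ by ~15 points

If repeated real runs spread far wider than that, the independence assumption is broken.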
One is the timing issue that can never be eliminated in normal chess play. That does not produce direct dependencies. If running consecutive games somehow biases that computer to favor one opponent consistently, there is nothing that can be done, because there is no logical way to measure and fix such a thing. It is more pure randomness thrown in...
If the computer favors one opponent consistently over several games, there will be a causal effect that causes it. And of course that can be fixed, as there is no logical need for such a causal link to exist. And even if you don't know the underlying physical mechanism of such causation, you can eliminate it if you know its behavior from observation. E.g., if games played close together in time tend to give correlated results, you can interleave the two runs that you want to compare, alternating games from one run and the other, so that both are affected the same way by the mysterious effect.
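A minimal sketch of that interleaving trick (the names here are placeholders, not a real tool):

    def interleave(run_a_games, run_b_games):
        # Alternate games from the two runs being compared, so that any slow
        # drift in the test environment affects both runs equally.
        schedule = []
        for game_a, game_b in zip(run_a_games, run_b_games):
            schedule.append(("run A", game_a))
            schedule.append(("run B", game_b))
        return schedule

    # e.g. interleave(range(400), range(400)) plays A-game 0, B-game 0, A-game 1, B-game 1, ...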

There are in fact zillions of such tricks. Consult an experimental physicist; they can tell you everything about the proper ways to do data collection in noisy environments.
Carey
Posts: 313
Joined: Wed Mar 08, 2006 8:18 pm

Re: more on engine testing

Post by Carey »

Uri Blass wrote:"I've always heard that in computer vs. computer, the one with the better search is likely to win, regardless of the evaluator."

This claim is completely wrong.
Change your evaluation to material-only evaluation and you are going to lose every game.
I didn't mean it to be that extreme. Perhaps I should have qualified it slightly, but I just assumed people would use some common sense.

I was talking about a relatively simple evaluator (with the basics) compared to a much more complex 'everything I can think of' evaluator.

As long as you had the basics in there, it was the quality of the search that was usually the deciding factor. The program that searched the deepest and saw the key move first had the advantage.


However, since you bring up the subject, it is provable that a material-only evaluator can be just as good as one such as Crafty's. You just need to search deeper, perhaps to the end of the game.... :D
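Such a material-only evaluator really is just a handful of lines; a minimal sketch (conventional centipawn values, purely illustrative):

    # Material-only evaluation: no positional knowledge at all.
    PIECE_VALUE = {"P": 100, "N": 320, "B": 330, "R": 500, "Q": 900, "K": 0}

    def evaluate(pieces):
        # pieces: iterable of (piece_letter, is_white) for every piece on the board.
        score = 0
        for piece, is_white in pieces:
            score += PIECE_VALUE[piece] if is_white else -PIECE_VALUE[piece]
        return score   # positive means White is ahead on material; that's all it knows

    # evaluate([("K", True), ("Q", True), ("K", False), ("R", False)]) -> 400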

If you'd like to run some tests, with crippled evaluators vs. deeper search, go ahead, but I'm pretty sure the tests have already been done, which is why for the past 30 years people have been saying that in computer vs. computer games, the deeper search usually wins.


"I'm assuming that Crafty's search is probably better than most of the programs you are testing."

No reason to think that it is better.
Crafty is losing in the tests against Fruit and Glaurung.
I didn't say it was always better. Just simply 'probably better than most'.

Don't put words into my mouth that I didn't actually say.

The only correct claim is

"Perhaps you could get statistically significant results if you experimented with the search itself instead?"

Maybe, but if that is the case, the only reason is that for Crafty improvements in the search can be more significant than improvements in the evaluation, not that evaluation is unimportant.

Uri
I never said the evaluation was unimportant. I said that for computer vs. computer tests the search was far more important than the small (and not so small) changes he was making to the evaluator; the search was the dominant factor, and it was smearing out the effect of the evaluator changes.

Go read Hyatt's post... He said he even ran a test where he took out a large chunk of Crafty's eval and he still had trouble telling which version of Crafty was stronger.

It's hard to argue with results like that.... If you can lobotomise a good part of your eval and your tests can't tell the difference, then for computer vs. computer games it can't be of much importance.
Carey
Posts: 313
Joined: Wed Mar 08, 2006 8:18 pm

Re: more on engine testing

Post by Carey »

Bill Rogers wrote:Carey
A chess program without an evaluation routine cannot really play chess in the normal sense.
I never said without an evaluator.

I meant a relatively simple one compared to something big and complex like Crafty uses.

Since he took out an important chunk of Crafty's eval and still couldn't determine which version was stronger, it was the search itself that was the dominant factor. Hence the classic wisdom that the deeper-searching program is likely to be the one to win in computer vs. computer play.

It could only make random moves no matter how fast it could search. A search routine all by itself only generates moves and does not know which move is better or not. In fact it does not even know when it makes a capture or what the piece is worth.
Actually, the search embodies the tactical nature of chess.

Chess is pure tactics and that is precisely what the search does.

If you could, for example, search 10 plies deeper in the middle game than your opponent, and you did only material and maybe a few simple rules or piece-square tables, I'd put the odds on you rather than on the smarter but shallower-searching program.

Extend that even deeper, to 20 plies, and I don't think too many people would bet on the shallower-searching program.

This is just in response to you saying 'no matter how fast it could search'. I'm not suggesting it to be even remotely practical.
Even the simplest evaluation subroutine which understands good captures from bad ones won't be able to play a decent game of chess. It might beat some beginning chess players, but not anyone with any kind of real chess-playing skills, computer or not.
Bill
That's provably false.... Search 500 plies... all the way to the end of the game.... Will it play well? Obviously yes.

But again, I'm not suggesting it's practical etc. I'm just taking it to extreme to prove the point.


I didn't mean a super-simplistic evaluator, just a basic one.



I do, however, still think that a 'TECH'-style program is a good idea: a very simplistic evaluator that depends on the search, combined with advanced search techniques.

Many people use Muller's micro-max for that purpose, actually. But it's a little too simplistic, too small, and not quite user-friendly.
krazyken

Re: more on engine testing

Post by krazyken »

bob wrote:
krazyken wrote:
bob wrote:
krazyken wrote:If you do have the PGN from the last game run, it might be useful to see how many duplicate games are in there. Perhaps the sample size is significantly smaller than 25000.
I think they were removed when I tried to re-run the test. Then the A/C went south. I will queue up a 25,000 game run and post the PGN as soon as this gets fixed. No idea about duplicates, although there must be quite a few, since there are 40 starting positions, played from both the black and white sides, giving 80 different starting points for each pair of opponents.

BTW wouldn't "duplicates" be a good thing here? More consistency? If every position always produced 2 wins and 2 losses, or 2 wins and 2 draws, then the variability would be gone and the issue would not be problematic.
Well, if you have a large number of duplicates, your error margin will be larger. To illustrate, take an extreme case: if your first sample had 25,000 unique games and the second sample had only 800 unique games, how useful would comparing the two samples be? The error margin on the second sample would go from +/-4 to +/-18. One of the assumptions of good statistics is a good random sampling technique. Here the population is all possible games starting from the given positions with the given algorithms. You are collecting a random sample from those games by running matches. If there are duplicates in the sample, it is not representative of the population.
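For reference, error margins like those come straight out of the independence assumption; a small sketch of the arithmetic (the 40% draw ratio is an illustrative assumption for near-equal opponents, so the exact figures shift with the draw rate):

    import math

    def elo_error_95(games, draw_ratio=0.4):
        # Per-game standard deviation of the score (win=1, draw=0.5, loss=0).
        sd_game = math.sqrt((1.0 - draw_ratio) * 0.25)
        sd_match = sd_game / math.sqrt(games)            # std. dev. of the average score
        elo_per_point = 400.0 / (math.log(10) * 0.25)    # slope of Elo vs. score near 50%
        return 1.96 * elo_per_point * sd_match           # 95% half-width in Elo

    print(elo_error_95(25000))   # roughly +/-3.3 Elo
    print(elo_error_95(800))     # roughly +/-18.7 Elo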
Maybe you are thinking about this in the wrong way. If there are a lot of duplicates (and we should know in a couple of days, as I can now use 28 nodes on the cluster and have already started a 25K game run), wouldn't the base number of non-duplicates be the actual expected value? And the more duplicates you get, the more convinced you should be that those duplicated games actually do represent what happens most of the time? It is not like there are billions of possible games and we just somehow randomly choose 25,000 that are mostly duplicates, when in the total population there are actually very few duplicates.

Programs are completely deterministic, except for time usage, which can vary by a second or so per move depending on how time is measured within each engine. And that second per move turns into a couple of million extra (or fewer) positions per move (at 2M nodes per second or so), which can certainly introduce some randomness at points in the tree. Most moves won't change, to be sure, but once any move changes, we are into a different game whose result can be influenced by other timing changes later on. I might try to experiment with more positions, but my initial interest was in finding representative positions, playing 2 games per opponent per position, and using that as a measuring tool for progress. It isn't enough games. So then you can try either more opponents, or more positions, or both, but either of those greatly adds to the computational requirements. More positions leads to the potential for other kinds of errors if there are several duplicated "themes", so that becomes yet another issue. Then the opponents become an issue, because some might be similar to each other, and that introduces yet another bias...
The primary goal of Elo ratings is to accurately predict game results between contemporary competitors. So you may be right. It would be helpful to know the assumptions used for calculating Elo in BayesElo; they may not apply to your testing conditions. Some brief info is here. It has a built-in assumption that White has an advantage. It sounds to me like you are trying a method to minimize White's advantage, so I guess another thing to keep track of is the % of wins as White and the % of wins as Black.
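A quick sketch of that bookkeeping, tallying wins as White and as Black from the Result tags in a PGN file (the file name and engine name are placeholders, and the parsing is deliberately crude):

    import re

    def side_stats(pgn_path, engine="Crafty"):
        text = open(pgn_path).read()
        white_wins = white_games = black_wins = black_games = 0
        # Crude split on game headers; assumes standard [Event ...] / [White ...] / [Result ...] tags.
        for game in text.split("[Event ")[1:]:
            white = re.search(r'\[White "([^"]*)"\]', game)
            result = re.search(r'\[Result "([^"]*)"\]', game)
            if not white or not result:
                continue
            if engine in white.group(1):
                white_games += 1
                white_wins += result.group(1) == "1-0"
            else:
                black_games += 1
                black_wins += result.group(1) == "0-1"
        return (100.0 * white_wins / max(white_games, 1),
                100.0 * black_wins / max(black_games, 1))

    # e.g. side_stats("match.pgn") -> (win % as White, win % as Black)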
Uri Blass
Posts: 10803
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: more on engine testing

Post by Uri Blass »

I am going to respond only to this part:

"And you wouldn't check this by looking at the PGN and log files? I did to make certain that for several 800 game matches, every last PGN matched Crafty's log file expectations exactly."

This is not relevant, because the problem was not in the result of the 800 games that you posted but in the result of the 25,000 games.

It is possible to have many sets of 800 games with no problem while having a set of 25,000 games that does have a problem.

I totally agree with H.G.Muller in this discussion.

Uri
Uri Blass
Posts: 10803
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: more on engine testing

Post by Uri Blass »

Carey wrote:
Uri Blass wrote:"I've always heard that in computer vs. computer, the one with the better search is likely to win, regardless of the evaluator."

This claim is completely wrong.
Change your evaluation to material-only evaluation and you are going to lose every game.
I didn't mean it to be that extreme. Perhaps I should have qualified it slightly, but I just assumed people would use some common sense.

I was talking about a relatively simple evaluator (with the basics) compared to a much more complex 'everything I can think of' evaluator.

As long as you had the basics in there, it was the quality of the search that was usually the deciding factor. The program that searched the deepest and saw the key move first had the advantage.


However, since you bring up the subject, it is provable that a material-only evaluator can be just as good as one such as Crafty's. You just need to search deeper, perhaps to the end of the game.... :D

If you'd like to run some tests, with crippled evaluators vs. deeper search, go ahead, but I'm pretty sure the tests have already been done, which is why for the past 30 years people have been saying that in computer vs. computer games, the deeper search usually wins.


"I'm assuming that Crafty's search is probably better than most of the programs you are testing."

No reason to think that it is better.
Crafty is losing in the tests against Fruit and Glaurung.
I didn't say it was always better. Just simply 'probably better than most'.

Don't put words into my mouth that I didn't actually say.

The only correct claim is

"Perhaps you could get statistically significant results if you experimented with the search itself instead?"

Maybe, but if that is the case, the only reason is that for Crafty improvements in the search can be more significant than improvements in the evaluation, not that evaluation is unimportant.

Uri
I never said the evaluation was unimportant. I said that for computer vs. computer tests the search was far more important than the small (and not so small) changes he was making to the evaluator; the search was the dominant factor, and it was smearing out the effect of the evaluator changes.

Go read Hyatt's post... He said he even ran a test where he took out a large chunk of Crafty's eval and he still had trouble telling which version of Crafty was stronger.

It's hard to argue with results like that.... If you can lobotomise a good part of your eval and your tests can't tell the difference, then for computer vs. computer games it can't be of much importance.
1) I have already run fixed-depth tests in the past with a crippled evaluator (only a piece-square table) against a deeper search.

I found that at small depths the crippled evaluator can win if it searches one ply deeper, but at bigger depths it needs 2 or 3 extra plies to win, and in practice the better evaluator gains more from slower time controls.

2) The fact that Hyatt found that removing evaluation knowledge did not reduce Crafty's strength only proves that he removed knowledge that is not productive for Crafty; it does not prove that knowledge in the evaluation cannot be productive.

Uri
Carey
Posts: 313
Joined: Wed Mar 08, 2006 8:18 pm

Re: more on engine testing

Post by Carey »

Uri Blass wrote:
Carey wrote:
Uri Blass wrote:"I've always heard that in computer vs. computer, the one with the better search is likely to win, regardless of the evaluator."

This claim is completely wrong.
Change your evaluation to material-only evaluation and you are going to lose every game.
I didn't mean it to be that extreme. Perhaps I should have qualified it slightly, but I just assumed people would use some common sense.

I was talking about a relatively simple evaluator (with the basics) compared to a much more complex 'everything I can think of' evaluator.

As long as you had the basics in there, it was the quality of the search that was usually the deciding factor. The program that searched the deepest and saw the key move first had the advantage.


However, since you bring up the subject, it is provable that a material-only evaluator can be just as good as one such as Crafty's. You just need to search deeper, perhaps to the end of the game.... :D

If you'd like to run some tests, with crippled evaluators vs. deeper search, go ahead, but I'm pretty sure the tests have already been done, which is why for the past 30 years people have been saying that in computer vs. computer games, the deeper search usually wins.


"I'm assuming that Crafty's search is probably better than most of the programs you are testing."

No reason to think that it is better.
Crafty is losing in the tests against Fruit and Glaurung.
I didn't say it was always better. Just simply 'probably better than most'.

Don't put words into my mouth that I didn't actually say.

The only correct claim is

"Perhaps you could get statistically significant results if you experimented with the search itself instead?"

Maybe, but if that is the case, the only reason is that for Crafty improvements in the search can be more significant than improvements in the evaluation, not that evaluation is unimportant.

Uri
I never said the evaluation was unimportant. I said that for computer vs. computer tests the search was far more important than the small (and not so small) changes he was making to the evaluator; the search was the dominant factor, and it was smearing out the effect of the evaluator changes.

Go read Hyatt's post... He said he even ran a test where he took out a large chunk of Crafty's eval and he still had trouble telling which version of Crafty was stronger.

It's hard to argue with results like that.... If you can lobotomise a good part of your eval and your tests can't tell the difference, then for computer vs. computer games it can't be of much importance.
1) I have already run fixed-depth tests in the past with a crippled evaluator (only a piece-square table) against a deeper search.

I found that at small depths the crippled evaluator can win if it searches one ply deeper, but at bigger depths it needs 2 or 3 extra plies to win, and in practice the better evaluator gains more from slower time controls.
Going all the way down to just a piece-square table is a bit extreme. But your results are valid... the one with the better (or, in your case, just deeper) search can win.

I already admitted I should have qualified my original statement about the quality of the eval.

I didn't mean to go to that extreme, only that knowledge and untuned values for it don't have as much effect as the quality of the search does.

The better search (not necessarily an extra ply, though that's part of it; a better qsearch could do it too) is the dominant part in computer vs. computer games.

You just tested a simplistic eval against a smarter one, but used the same search and varied the depth. That's not quite what I said. Depth is a part of it. A major part of it. But the search heuristics, the q-search etc. etc. play a major part in it too. That's why I said "...the one with the better search is likely to win, regardless of the evaluator."



2) The fact that Hyatt found that removing evaluation knowledge did not reduce Crafty's strength only proves that he removed knowledge that is not productive for Crafty; it does not prove that knowledge in the evaluation cannot be productive.

Uri
Here's the one line where Bob mentions it:
In fact, I removed major parts of Crafty's Evaluation and still had a hard time measuring "good or bad" when I added individual _major_ components back in... using big runs...
Somehow I suspect he knows what he's doing. After 40 *years* of experience you tend to learn a few things about testing. If not, you end up with a 1500-rated program even when running on a cluster...

More likely, the massive computer vs. computer tests he was doing were showing that the faster eval and the search itself were compensating for the missing knowledge.

(And as I pointed out in the original message, I thought it unlikely that similar results would be obtained in human vs. human or human vs. computer tests; the results he's getting are due to computer vs. computer testing.)

Not too many other people have done the kind of *massive* computer vs. computer testing that he's been doing.



And as for whether that proves that knowledge in the evaluator can be productive or not...

Don't start saying I said something I didn't.

I never said a good evaluator wasn't a benefit. I said the one with the better search was likely to win. That's not the same thing. If the searches are similar, then the better evaluator will play a more significant role.

There have certainly been classic studies on the importance of knowledge, against both human and computer players. But they didn't involve the massive computer vs. computer testing that Bob is doing.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Uri Blass wrote:I am going to respond only to this part:

"And you wouldn't check this by looking at the PGN and log files? I did to make certain that for several 800 game matches, every last PGN matched Crafty's log file expectations exactly."

This is not relevant, because the problem was not in the result of the 800 games that you posted but in the result of the 25,000 games.

It is possible to have many sets of 800 games with no problem while having a set of 25,000 games that does have a problem.

I totally agree with H.G.Muller in this discussion.

Uri
Then feel free to do so. Anything is possible, and it is impossible to prove that nothing happened, so the argument becomes circular and infinite. However, I do have some old test results for a large number of games. In these games I have Crafty write out a message just before it terminates (when told to "quit") that says either "I am winning, eval=xxx", "I am losing, eval=xxx", or "game is a draw, eval=xxx". I can then correlate each of those with a PGN result tag, since I carefully name the PGN files to correspond to specific log files, to help me find out why I lost specific games. And I can verify that for each game, the result recorded in the PGN matches the result Crafty was expecting. If this fails, I produce an error indication in the Elo output file, because the current testing has to do with removing much of the draw-detection code and I want to know if I manage to break something. None of the recent testing has produced any sort of error condition from that test, meaning the PGN is "sane" when compared to Crafty's search expectations...

I just disabled repetition checking and made a single 160-game run and got 32 errors where Crafty thought it was winning or losing but the game was actually drawn by repetition, the 50-move rule, or insufficient material (all of that is done in one place in Crafty). So the error test works, and it is not getting triggered in normal matches, with maybe one to as many as 10 exceptions in the 25,000-game matches where the opponent loses on time. That happens very infrequently, and I _always_ look at these to make sure it isn't Crafty that is unexpectedly losing on time. It happens some at 1+1, rarely at 5+5 or longer.
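A rough sketch of that kind of cross-check, comparing each PGN's Result tag to what the engine's log expected at the end of the game (the file layout and log wording here are assumptions for illustration, not Crafty's actual formats):

    import re

    def expected_from_log(log_path):
        text = open(log_path).read()
        if "I am winning" in text:
            return "win"
        if "I am losing" in text:
            return "loss"
        if "game is a draw" in text:
            return "draw"
        return None   # no verdict found in the log

    def check_game(pgn_path, log_path, engine_is_white):
        result = re.search(r'\[Result "([^"]*)"\]', open(pgn_path).read()).group(1)
        actual = {"1-0": "win" if engine_is_white else "loss",
                  "0-1": "loss" if engine_is_white else "win",
                  "1/2-1/2": "draw"}.get(result)
        expected = expected_from_log(log_path)
        if expected and actual and expected != actual:
            print("MISMATCH:", pgn_path, "log says", expected, "but PGN says", actual)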

So I suppose you can believe what you want. I believe the data, because this is not the first case of such variable results I have posted, and I doubt it will be the last. The data is real. The test setup is exactly as I have explained. The results were produced exactly as I said. I will supply the next two runs of 25,000 games when they are done, whether they show this kind of variance or are exact duplicates...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

hgm wrote:
bob wrote:
hgm wrote:And this is the point that you don't seem to get. It doesn't matter how random the influences are. Random is good. The numbers quoted by BayesElo (which, in your case, you could also calculate by hand in 15 seconds) assume totally independent random results, and are an upper limit to the difference that you could typically have between repetitions of the same experiment.
Apparently not. I just gave two runs that violated this completely. So it is _far_ from absolute. And again, I will say that this apparently doesn't apply to computer games, otherwise I am getting _way_ too many impossible results.
Well, what can I say. If you get impossible results, they must be wrong results. And getting many wrong results is a good operational definition for incompetence...
Aha, opinions rather than facts now. What is "impossible"? _Any_ result is "possible". And people do win the lottery every week. Whether this is that kind of unusual event or not, we will see as more tests roll out.


If you think that all of mathematics can be dismissed because you lack the imagination to pinpoint your source of errors, and fix it, think again. Mathematics is not subject to experimental testing or verification. No matter how often and in how many different ways you calculate for us that 25 x 25 equals 626, it would only serve to show that you don't know how to multiply. The correct answer would still remain 625.
And I have not made that kind of mistake. Again, _any_ result is possible; you are claiming otherwise, which shows a complete misunderstanding of this particular type of statistical result. There are no "correct" or "incorrect" results in this kind of testing, since any result, from all wins to all losses to any combination in between, is possible.

That's the flaw in this kind of thinking: "the results must match the theory because the results must match the theory." I gave you the data, as it happened, and I will provide more over time. I still have not seen you give one possible way in which I can run each game independently and configure things so that the games are somehow dependent. I could show you a single position where the four games (same two opponents) produce everything from +4 to -4, but there seems to be no point; after all, how could the two programs vary that much in a single position, right?

One day, if you ever get a chance to run some big tests, you might end up re-thinking your rather rigid ideas about how things ought to be, as opposed to how they really are...

bob wrote:
hgm wrote: Your results violate that upper bound, and hence the assumptions cannot be satisfied. If you cannot pinpoint the source of the dependence between your games, and eliminate it, your large-number-of-games testing will be completely useless. If I had designed a laser ruler to measure distances with sub-nanometer precision, and the typical difference between readings on the same steel needle were more than a millimeter, I would try to repair my laser ruler, not complain about the variability of the length of needles...
Right. And I have explained exactly why there is no dependency of one game on another.
So your explanation is wrong. Mathematical facts cannot be denied. Apparently the environment is not as controlled as you think.
The game results are correlated. (Mathematical fact.) Explain to me how they can become correlated without a causal effect acting between them, or show us what the causal effect is.
That is exactly my point. When there is absolutely no difference between runs a, b, c, d, etc. except for the timing randomness I have already identified, then that is the cause of the randomness. I have already verified that and explained how: searching to fixed node counts produces the same results each and every time, down to the exact number of wins, losses, and draws. Now the question would be, how could this test somehow bias the time allocation to favor one program more than another? I have played matches at 1+1, 2+2, 3+3, 4+4, and 5+5 to verify that the results are not that different. So how do I bias time to make the results favor one over the other in a correlated way, when they share the _same_ processor, same memory, etc.?

Time is an issue. Time is the only issue. And _everybody_ has exactly the _same_ timing issues I have, only more so, as most machines are not nearly as "bare-bones" as our cluster nodes with respect to extra system processes running and stealing cycles.

So, now that I have once again identified the _only_ thing that introduces any variability into the search, how could we _possibly_ manipulate the timing to produce dependent results? You suggest an idea, and I will develop a test to see whether it is possible or not on our cluster. But first I need an idea of _how_...

Each game is run individually, in a very controlled environment. The same copy of an engine is used each and every time. Whatever they do internally I don't care about, since that is a variable everyone has to deal with. I'm tired of this nonsense about games somehow being dependent, when it just can't happen. There are other explanations, if you would just look. One is the inherent randomness and streakiness of chess games.
What incredible bullshit. A freshman's course in statistics might cure you from such delusions. There is no way you can violate a Gaussian distribution with the width calculated for independent results. Do the freshman's math, or ask a competent statistician if you can't...
So in a set of 400 games from the same position, where the final result is dead even, I can't see 25 consecutive wins or losses? Or I can, but rather infrequently? You seem to say "impossible," which is wrong, because it happened.
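For scale, that question can be put to a quick simulation: how often does a streak of 25 identical results appear in 400 independent games between dead-even opponents? (An illustrative sketch that ignores draws for simplicity.)

    import random

    def has_streak(n_games=400, streak=25):
        run_length, last = 0, None
        for _ in range(n_games):
            result = random.random() < 0.5           # win or loss, equal opponents
            run_length = run_length + 1 if result == last else 1
            last = result
            if run_length >= streak:
                return True
        return False

    trials = 100000
    hits = sum(has_streak() for _ in range(trials))
    print(hits / trials)   # on the order of 1e-5: a once-in-tens-of-thousands-of-matches event under independence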

One is the timing issue that can never be eliminated in normal chess play. That does not produce direct dependencies. If running consecutive games somehow biases that computer to favor one opponent consistently, there is nothing that can be done, because there is no logical way to measure and fix such a thing. It is more pure randomness thrown in...
If the computer favors one opponent consistently over several games, there will be a causal effect that causes it. And of course that can be fixed, as there is no logical need for such a causal link to exist. And even if you don't know the underlying physical mechanism of such causation, you can eliminate it if you know its behavior from observation. E.g., if games played close together in time tend to give correlated results, you can interleave the two runs that you want to compare, alternating games from one run and the other, so that both are affected the same way by the mysterious effect.
Rambling leads nowhere. How can a processor favor one opponent consistently over several games? Please explain that in the context of the current Linux kernel and the straightforward O(1) scheduler it uses. It is easy enough to measure this if you want to believe it is a problem. But here's the flaw: no pondering, no SMP. So in a single game, only one process is ready to run at any instant in time. How can you favor A, when about half the time A is not even wanting to execute? That would be a tough idea to sell to _anybody_ with any OS experience of any kind.

How else could it favor one side besides scheduling? Giving one process more cache? Hardware won't do that preferentially under any circumstance. TLB entries? Ditto. There is a minor issue of page placement in memory, where bad placements can cause some cache aliasing that can alter performance by a relatively small amount, and it is usually completely random. But in the last test I ran, that wasn't possible. Each node was warm-started before each game, so that everything was as close as possible to the same starting condition (probably never exact, since the boot process starts several temporary processes that suffer from timing randomness, which affects the free-memory list order in unpredictable ways). But biasing a single opponent repeatedly would not be possible as it currently works.

So, what is left? No I/O is going on, so no way to fudge there. Neither process is using enough memory to cause problems. They are set as equally as possible with respect to hash sizes; I'm using between 96M (Crafty) and 128M (for those that support that size) on nodes that have 4 GB on olympus, or 12 GB on ferrum (the results being discussed were all from olympus).

I doubt you could put together a more consistent test platform if you had unlimited funds.

There are in fact zillions of such tricks. Consult an experimental physicist; they can tell you everything about the proper ways to do data collection in noisy environments.
Totally different animal there.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Bill Rogers wrote:Carey
A chess program without an evaluation routine cannot really play chess in the normal sense. It could only make random moves no matter how fast it could search. A search routine all by itself only generates moves and does not know which move is better or not. In fact it does not even know when it makes a capture or what the piece is worth.
Even the simplest evaluation subroutine which understands good captures from bad ones won't be able to play a decent game of chess. It might beat some beginning chess players, but not anyone with any kind of real chess-playing skills, computer or not.
Bill
This is not quite true. See Don Beal's paper about a simple random evaluation. It played quite reasonable chess. And due to the way the search works, it managed to turn that random evaluation into a strong sense of mobility. I can explain if you want. But it actually does work...
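The mechanism is easy to demonstrate in isolation: when the search maximizes over random leaf values, a position with more legal replies gets a systematically higher backed-up score, which is exactly a mobility bonus. A toy sketch of just that statistical effect (not a chess search):

    import random

    def backed_up_score(n_moves, samples=100000):
        # Average of the maximum over n_moves independent uniform(0,1) "evaluations".
        return sum(max(random.random() for _ in range(n_moves))
                   for _ in range(samples)) / samples

    for moves in (2, 10, 30):
        print(moves, round(backed_up_score(moves), 3))
    # prints roughly 0.667, 0.909, 0.968 -- more mobility, higher expected score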