lkaufman wrote: Some comments:
1. Komodo should outsearch Houdini but not Stockfish, because Komodo is between them in how much we reduce in general (assuming Houdini is the same as Ivanhoe in this respect).
2. Using a score of one pawn as a threshold seems far too high. In some programs such a score means over a 90% chance to win.
3. Using any score to cancel games is a potential source of bias, because it depends on which program reports the score. A given score has different meanings in different programs: scores run nearly double in Stockfish compared to Houdini, and as in most things, Komodo is in the middle.
4. It is far better to test with "test suites" or opening books designed for such tests, rather than generic ones. Then you needn't worry about the score out of book; that has already been done for you.
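On point 3, the relationship between a reported score and the expected game result is often modeled as a logistic curve whose steepness differs from engine to engine. The sketch below is purely illustrative - the logistic form, the `win_prob` helper, and both scale constants are assumptions, not any engine's actual calibration - but it shows how the same +1.00 can mean a ~90% expected score in one program and only ~75% in another whose scores run twice as large:

```python
import math

def win_prob(eval_pawns, scale):
    """Hypothetical logistic mapping from an engine's evaluation
    (in pawns) to an expected score; 'scale' is engine-specific."""
    return 1 / (1 + math.exp(-eval_pawns / scale))

# Engine A: calibrated (by assumption) so +1.00 means a 90% expected score
scale_a = 1 / math.log(9)          # ~0.455
# Engine B: its scores "run nearly double", i.e. twice the scale
scale_b = 2 * scale_a

print(round(win_prob(1.0, scale_a), 2))  # 0.9
print(round(win_prob(1.0, scale_b), 2))  # 0.75
```

This is why a fixed +1.00 adjudication or position-selection threshold means different things depending on which engine's evaluation is used.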
Yes, I am sure you are correct about the test suites - but life is too short to spend much of it doing something you dislike, and I dislike the test suites. I am afraid the engines will have to suffer through a generic book with me.
I will go along with a lot - but worrying about +1.00 meaning two different things to two different engines is one thing I am not going to get all tangled up in and let affect the way I test. To me that is overkill. Actually, with a 12-move limit you could award an engine a win every time it comes out of the opening showing +1.00, and I doubt it would change the Elo difference a full point in the end.
But there is another issue that is aggravating me more than anything else right now. My machine is fast enough that it benchmarks 40/40 to run at 40/21, so it is not my system that is the problem. I am fast getting to the point where I don't know how much longer I can post results here. I click on "submit" and sometimes have to sit watching the little circle in the top left turn counter-clockwise for about two minutes before it decides to change course and head the other way. This forum needs work - mostly in the other section, posting results, where you suffer with that. It started not long ago and keeps getting worse. Maybe someone is late paying the bill.
I think that it is better simply not to allow start positions where the evaluation by some accepted strong program is more than +1 in the first place.
Note that what surprises me is that we have almost no games between 1 CPU and 4 CPU.
It leads me to the conclusion that if I decide to be a tester, I am going to test only 1 CPU against 4 CPU, because if we have two groups of computers with almost no games between them, it may mean a distortion of the rating list: the difference between 4 CPU and 1 CPU in the list may be wrong.
Note that there are games of Komodo 4 against 4 CPU, but I do not see games of Houdini 1.5a 64-bit against 4 CPU, and for some reason people test Houdini 32-bit only against 32-bit programs. That may distort the rating of Houdini 32-bit, because it does not get stronger opponents like Stockfish 2.2.2 64-bit 4 CPU.
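The concern about two pools with almost no games between them can be made concrete: under the standard Elo model, results constrain only the rating *differences* between players who actually meet, so shifting an entire disconnected pool up or down fits the data exactly as well. The snippet below is a minimal sketch with made-up players and results (`A1`, `B1`, etc. are hypothetical, and draws are scored with a simplified fractional likelihood):

```python
import math

def expected(ra, rb):
    # Standard Elo expected score for a player rated ra against rb
    return 1 / (1 + 10 ** ((rb - ra) / 400))

# Hypothetical results: games only *within* pool A and pool B,
# none across pools (1 = first player wins, 0.5 = draw, 0 = loss)
games = [("A1", "A2", 1.0), ("A2", "A1", 0.5),
         ("B1", "B2", 0.5), ("B2", "B1", 0.0)]

def log_lik(ratings):
    # Simplified log-likelihood; draws count as half a win, half a loss
    ll = 0.0
    for p, q, res in games:
        e = expected(ratings[p], ratings[q])
        ll += res * math.log(e) + (1 - res) * math.log(1 - e)
    return ll

r1 = {"A1": 2900, "A2": 2850, "B1": 2900, "B2": 2850}
r2 = dict(r1, B1=3000, B2=2950)   # pool B shifted up by 100

# Both assignments fit the data equally well:
print(abs(log_lik(r1) - log_lik(r2)) < 1e-9)  # True
```

With no cross-pool games, `r1` and `r2` are indistinguishable, which is why the gap between the 1 CPU and 4 CPU groups in such a rating list may be wrong.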
Uri Blass wrote: I think that it is better simply not to allow start positions where the evaluation by some accepted strong program is more than +1 in the first place.
The problem with that is bias - you should not select openings based on the opinion of one program.
However, it might make sense to vote among a number of programs (that are not derived from each other) for such a purpose.
Uri Blass also wrote: If I decide to be a tester I am going to test only 1 CPU against 4 CPU, because if we have two groups of computers with almost no games between them, the difference between 4 CPU and 1 CPU in the rating list may be wrong.
I dislike testing between opponents of widely disparate strengths. It's a big waste of time testing against a program that is 500 Elo weaker, for example - the time would be better spent playing more games between opponents that are closer together.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Don wrote: I dislike testing between opponents of widely disparate strengths. It's a big waste of time testing against a program that is 500 Elo weaker, for example.
I understand, but the difference between Houdini 1.5 32-bit and Stockfish 2.2.2 64-bit 4 CPU is clearly less than 500 Elo - even less than 100 Elo.
Uri Blass wrote: I understand, but the difference between Houdini 1.5 32-bit and Stockfish 2.2.2 64-bit 4 CPU is clearly less than 500 Elo - even less than 100 Elo.
You are citing an example where it is not, so I don't understand your point. Of course there are examples where it's not.
I'm talking about rating lists in general - playing every program against every program. If we are trying to rate Carlsen, should he have to play the same number of games against you and me as against the top players?
Don
geots wrote: But there is another issue that is aggravating me more than anything else right now. [...] I click on "submit" and have to sit here watching the little circle in the top left for about two minutes. This forum needs work.
Best,
george
Hi George,
I doubt the problem is your machine; more likely it is your internet connection or service provider. You may want to post something in Help and Suggestions where Sam is likely to see it - maybe he can offer some suggestions.
Don wrote: You are citing an example where it is not, so I don't understand your point. Of course there are examples where it's not. I'm talking about rating lists in general - playing every program against every program. If we are trying to rate Carlsen, should he have to play the same number of games against you and me as against the top players?
Don
I did not suggest playing games between players with a 500 Elo difference, so I do not understand your point, or why you mentioned that you dislike testing between opponents of widely disparate strengths.
Uri Blass wrote:
I did not suggest playing games between players with a 500 Elo difference, so I do not understand your point, or why you mentioned that you dislike testing between opponents of widely disparate strengths.
You did suggest that - not the 500 Elo value, but the concept that you should play far up or down to get accurate ratings. You said that if you were a tester you would play 1-core programs against 4-core programs to get ratings. It would be silly to do that for variety - Critter 1 core, Critter 2 cores, Critter 3 cores, Critter 4 cores is not variety. So you must believe that it's good to have programs playing much weaker or stronger opponents. If you didn't mean that, what did you mean?
You and Don are talking about different things: you favor more pairings between dissimilar engines (32-bit vs 64-bit, 1 core vs 4 cores), while Don favors close matches. You can do both by pairing strong 32-bit or 1-core engines against weaker 64-bit or 4-core engines. Then you will both be happy!
rvida wrote: Why not implement Chess960 support then? It would surely help to prove or disprove your hypothesis. Btw, looking at the CCRL 40/4 FRC list, I might start spreading a hypothesis too... Also note the 100 Elo gap between #2 and #3 (and between #4 and #5). It would be nice if more strong engines supported FRC.
A late reaction, but I've just run a test match showing that Critter 1.6a does indeed appear to be slightly stronger than Houdini 2.0 in FRC, playing without an opening book from the initial 960 positions with reversed colors.
After 1920 games at 2'+2", single thread, the match result was 1010-910 for Critter (41% draws): a 52.6% score, or a performance of +18 Elo +/- 9 Elo. Congrats, Richard!
I'm now running a similar match against a pre-beta Houdini 3 DEV; the results are quite different.
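The +18 Elo performance quoted above can be reproduced from the raw score under the standard logistic Elo model. This is a generic back-of-the-envelope sketch, not tied to any particular rating tool:

```python
import math

def elo_diff(score):
    """Elo difference implied by an overall match score in (0, 1),
    using the standard logistic Elo model."""
    return -400 * math.log10(1 / score - 1)

score = 1010 / 1920                 # Critter's 52.6%
print(round(elo_diff(score)))       # 18
```

The +/- 9 Elo error bar additionally depends on the number of games and the draw rate, which narrow the confidence interval as the sample grows.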