Hi there,
without ponder, after around 1,000 games, PHQ-2 is around 20 Elo stronger than default.
This is again a settings test against the same opponents that SF 2.1.1 JA default and PHQ had. The test started around 20 hours ago.
Best
Frank
Much more interesting is the playing style. I hope that PHQ-2 can keep the aggressive style of PHQ-1 while scoring more points.
SWCR results so far:
76.42% for PHQ-1
75.54% for default
77.59% for PHQ-2 at the moment, not enough games!
1% = around 10 Elo. Please look at the results in %. They are more interesting than the Elo calculation from the Shredder Classic 4 GUI, because all three versions played / are playing against the same opponents.
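The "1% = around 10 Elo" rule of thumb can be checked with the usual logistic Elo formula. The following is only a minimal sketch: the function name `elo_from_score` is made up for illustration, and the percentages are the ones quoted above (the PHQ-2 value is still provisional).

```python
import math


def elo_from_score(score: float) -> float:
    """Elo difference implied by a score fraction (0 < score < 1), logistic model."""
    return -400.0 * math.log10(1.0 / score - 1.0)


# Score fractions quoted in the post; PHQ-2 is based on too few games so far.
results = {"default": 0.7554, "PHQ-1": 0.7642, "PHQ-2": 0.7759}

baseline = elo_from_score(results["default"])
for name, score in results.items():
    diff = elo_from_score(score) - baseline
    print(f"{name:8s} {score:.2%}  {diff:+.1f} Elo vs. default")

# Local slope: what one percentage point is worth near a 76% score.
elo_per_percent = elo_from_score(0.77) - elo_from_score(0.76)
print(f"1% near a 76% score is about {elo_per_percent:.1f} Elo")
```

Note that the slope depends on the score level: near 50% one percentage point is worth closer to 7 Elo, so the rule of thumb only holds for scores in this range.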
SWCR: Stockfish 2.1.1 JA x64 PHQ-2 is still running ...
Moderator: Ras
-
Frank Quisinsky
- Posts: 7281
- Joined: Wed Nov 18, 2009 7:16 pm
- Location: Gutweiler, Germany
- Full name: Frank Quisinsky
-
mcostalba
- Posts: 2684
- Joined: Sat Jun 14, 2008 9:17 pm
Re: SWCR: Stockfish 2.1.1 JA x64 PHQ-2 is still running ...
Hi Frank,
Frank Quisinsky wrote:
> Hi there,
> without ponder, after around 1,000 games, PHQ-2 is around 20 Elo stronger than default.
> This is again a settings test against the same opponents that SF 2.1.1 JA default and PHQ had. The test started around 20 hours ago.
> Best
> Frank
> Much more interesting is the playing style. I hope that PHQ-2 can keep the aggressive style of PHQ-1 while scoring more points.
> SWCR results so far:
> 76.42% for PHQ-1
> 75.54% for default
> 77.59% for PHQ-2 at the moment, not enough games!
> 1% = around 10 Elo. Please look at the results in %. They are more interesting than the Elo calculation from the Shredder Classic 4 GUI, because all three versions played / are playing against the same opponents.
thanks a lot for testing this! I will follow your testing closely...
Marco
-
BubbaTough
- Posts: 1154
- Joined: Fri Jun 23, 2006 5:18 am
Re: SWCR: Stockfish 2.1.1 JA x64 PHQ-2 is still running ...
One should keep in mind that in tests where a large majority of the opponents are significantly weaker, settings that encourage risk and imbalance can result in higher Elo even if the positions reached are objectively worse. So I would pay particular attention to results against strong rival programs before adopting new, more aggressive settings based on tests with this distribution of opposition. I certainly don't want to imply the tests are useless; they seem very useful and generous. It is just that some extra verification that the settings do not weaken results against the top 5 rivals or so may be in order as a last sanity check.
-Sam
-
mcostalba
- Posts: 2684
- Joined: Sat Jun 14, 2008 9:17 pm
Re: SWCR: Stockfish 2.1.1 JA x64 PHQ-2 is still running ...
Hi Sam,
BubbaTough wrote:
> One should keep in mind that in tests where a large majority of the opponents are significantly weaker, settings that encourage risk and imbalance can result in higher Elo even if the positions reached are objectively worse. So I would pay particular attention to results against strong rival programs before adopting new, more aggressive settings based on tests with this distribution of opposition. I certainly don't want to imply the tests are useless; they seem very useful and generous. It is just that some extra verification that the settings do not weaken results against the top 5 rivals or so may be in order as a last sanity check.
> -Sam
yes, I fully agree with you. Personally, I consider direct results against strong engines more important than Elo. We are not interested in gaining Elo through some kind of "contempt factor" trick; nevertheless, PHQ-1 showed good results even in one-to-one matches (with the notable exception of Rybka).
So if I had to write a personal priority list of what I would like to see in an improvement, it could be something like:
- Improved results in direct matches against strong engines
- Improved playing style (this is something I trust Frank on, because I am a weak player and cannot judge it myself)
- Improved overall Elo
Marco
-
Frank Quisinsky
- Posts: 7281
- Joined: Wed Nov 18, 2009 7:16 pm
- Location: Gutweiler, Germany
- Full name: Frank Quisinsky
Re: SWCR: Stockfish 2.1.1 JA x64 PHQ-2 is still running ...
Hi Marco,
if I now get a weaker result with PHQ-2, I think an "attacker, or rather kamikaze" setting must give me a better result than PHQ-1.
In other words ...
If 2.1.1 default = 100% result = 100
2.1.1 PHQ = 125% result = 125
2.1.1 PHQ2 = 115% result = 115
Which result will 150% give me?
More aggressiveness than PHQ-1 makes no sense to me. Logically, after the ponder = off games, PHQ-2 must go 10 Elo higher. If not, it also seems clear ... more aggressiveness will again give me a higher rating against the TOP-20, with more short wins but also more short losses.
If 2.1.1 PHQ-2 is weaker ...
then, unfortunately, I have to test PHQ-3.
PHQ-3 could be
1.
Mobility Middlegame = 160
Mobility Endgame = 140
Aggressiveness = 160
Cowardice = 60
The super aggressive style ...
or ...
only a bit more than PHQ-1
2.
Mobility Middlegame = 150
Mobility Endgame = 125
Aggressiveness = 150
Cowardice = 75
If I test PHQ-3, I think variant 1 makes more sense.
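For anyone who wants to try the two PHQ-3 candidates outside a GUI, the values can be turned into UCI `setoption` commands. This is only a sketch: the option names used below ("Mobility (Middle Game)", "Mobility (Endgame)", "Aggressiveness", "Cowardice") are an assumption about how Stockfish 2.1.1 spells them, so compare them with the engine's reply to the `uci` command before use. The helper name `uci_setoption_lines` is made up for illustration.

```python
from typing import Dict, List

# Candidate PHQ-3 values as listed in the post.
# ASSUMPTION: the UCI option names; verify them against the engine's "uci" output.
PHQ3_VARIANTS: Dict[str, Dict[str, int]] = {
    "variant 1": {
        "Mobility (Middle Game)": 160,
        "Mobility (Endgame)": 140,
        "Aggressiveness": 160,
        "Cowardice": 60,
    },
    "variant 2": {
        "Mobility (Middle Game)": 150,
        "Mobility (Endgame)": 125,
        "Aggressiveness": 150,
        "Cowardice": 75,
    },
}


def uci_setoption_lines(options: Dict[str, int]) -> List[str]:
    """Build one standard UCI 'setoption' command per option."""
    return [f"setoption name {name} value {value}" for name, value in options.items()]


for label, options in PHQ3_VARIANTS.items():
    print(f"# {label}")
    for line in uci_setoption_lines(options):
        print(line)
```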
After everything I have tested ...
The combination of high Aggressiveness with less Cowardice seems to be good, but only in combination with a higher Mobility Middlegame did SF produce a very aggressive style with perhaps 5-10 Elo points more.
Testing only against the TOP-10 makes no sense to me. The reality is that you get faster and better results with more opponents. How strong the opponents are is not important.
This could be tested very easily with Shredder 12 in my open SWCR database. Shredder played over 6,000 games, 40 games against each of the other 150 engines in the SWCR.
Test it ...
Delete the first 40 opponents and calculate again. Delete the last 40 places and calculate the SWCR again. The result is the same within +-3 Elo.
Best
Frank
Higher and more aggressive settings than PHQ-1 make no sense to me and aren't logical, but ... in the case that PHQ-2 has a weaker result than PHQ-1, I should test a third setting. An illogical setting with more aggressiveness!
-
mcostalba
- Posts: 2684
- Joined: Sat Jun 14, 2008 9:17 pm
Re: SWCR: Stockfish 2.1.1 JA x64 PHQ-2 is still running ...
Hi Frank,
Frank Quisinsky wrote:
> Test it ...
> Delete the first 40 opponents and calculate again. Delete the last 40 places and calculate the SWCR again. The result is the same within +-3 Elo.
this is a good point indeed!
Does the above still stand if you consider only the first 10 opponents, or even the first 5?
Marco
P.S.: I have added the PHQ-1 settings to my test queue. I was thinking PHQ-2 was better than PHQ-1, but at this point things are starting to get really interesting.
-
Frank Quisinsky
- Posts: 7281
- Joined: Wed Nov 18, 2009 7:16 pm
- Location: Gutweiler, Germany
- Full name: Frank Quisinsky
Re: SWCR: Stockfish 2.1.1 JA x64 PHQ-2 is still running ...
Hi Marco,
I'm not at home, so I can't try this out.
But if someone else is interested in doing that: download the SWCR database. The archive contains both of "my" files that you need for the Bayesian calculation. Copy bayesian.exe into the same directory. Now you can delete engines or create new engines from the database.
Example:
Shredder vs. the TOP 20
Shredder vs. places 20-40
Shredder vs. places 100-120 and so on.
Then calculate again with the new Shredder engines you created.
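The "create new engines" step above can be sketched as a renaming pass over the game list: every game of the chosen engine gets a new player name that depends on which part of the rating list its opponent comes from, and the rating tool then treats each alias as a separate engine. Everything in this sketch is hypothetical (the function name, the tuple layout, the sample games and places); it does not read the real SWCR files and it does not calculate any ratings itself.

```python
from typing import Dict, List, Tuple

Game = Tuple[str, str, str]  # (white, black, result), e.g. ("A", "B", "1-0")


def split_by_opponent_group(
    games: List[Game],
    engine: str,
    place: Dict[str, int],
    groups: Dict[str, Tuple[int, int]],
) -> List[Game]:
    """Rename `engine` in each of its games according to the opponent's list place.

    place:  engine name -> place in the rating list (1 = top).
    groups: label -> (first_place, last_place), inclusive; the first match wins.
    Games against opponents outside every group keep the original name.
    """
    renamed = []
    for white, black, result in games:
        if engine not in (white, black):
            renamed.append((white, black, result))
            continue
        opponent = black if white == engine else white
        label = next(
            (g for g, (lo, hi) in groups.items() if lo <= place.get(opponent, 0) <= hi),
            None,
        )
        alias = f"{engine} [{label}]" if label else engine
        renamed.append(
            (alias if white == engine else white, alias if black == engine else black, result)
        )
    return renamed


# Hypothetical sample data, only to show the shape of the input and output.
games = [
    ("Shredder 12", "EngineA", "1-0"),
    ("EngineB", "Shredder 12", "1/2-1/2"),
    ("EngineA", "EngineB", "0-1"),
]
place = {"EngineA": 5, "EngineB": 110}
groups = {"vs. TOP 20": (1, 20), "vs. places 100-120": (100, 120)}

for game in split_by_opponent_group(games, "Shredder 12", place, groups):
    print(game)
```

The renamed game list would then be written back to PGN and fed to the rating tool, so that each alias gets its own rating that can be compared with the others.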
It is a lot of fun to run your own experiments with big databases. The SWCR database has "only" 115,000 games; CEGT and CCRL are bigger. But in SWCR every pairing is played as a 40-game match, again and again. That is very important for better analysis.
From all the analyses I have made, only 4-5 engines really do better against either weaker or stronger groups of engines. Crafty is one of these engines: against stronger opponents Crafty loses a bit more than against weaker ones.
If Crafty now plays 40 games each against 20 new engines, and most of those 20 new engines are stronger ... unfortunately, its Elo follows.
Another good example is BugChess2 1.9 x64, new in SWCR. It played a round robin of 1,200 games, with a result of +40 Elo over its predecessor. At the moment BugChess2 1.9 x64 has to play against PHQ-1, PHQ-2, Komodo, Fire, Critter and so on. All are clearly stronger ... rating so far: +39 over its predecessor.
In other words:
The ratings we produce with CCRL, CEGT, IPON or SWCR are very good. Normally 800 games are enough. In 1 of 54 cases an engine will become around 15-20 Elo better or weaker with more games. Normally the rating after 800 games is within +-5. Most people look at the error bar, but the error bar gives completely wrong information.
Example:
SF PHQ - SF PHQ-2, 4,000 games, and the result is that
PHQ-2 is 20 Elo stronger.
Bayesian or ELOstat gives you an error bar of +-10. That is wrong because both engines have only 1 opponent.
It is not important that you have 4,000 games.
The correct error bar is +-56 after 4,000 games, not +-10, because there is only 1 opponent.
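For reference, a figure of about +-10 is what a purely statistical interval gives for 4,000 games between two engines roughly 20 Elo apart. The sketch below assumes independent games, a normal approximation and a 95% interval. It is not the actual algorithm of any rating tool, the function names are made up, and the draw ratios are assumptions. It only shows what such an error bar measures (sampling noise in this one pairing), not how well the result carries over to other opponents, which is the point made above.

```python
import math


def score_from_elo(diff: float) -> float:
    """Expected score for a given Elo difference, logistic model."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


def elo_from_score(score: float) -> float:
    """Inverse of score_from_elo."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_error_bar(score: float, games: int, draw_ratio: float, z: float = 1.96) -> float:
    """Half-width in Elo of a z-sigma interval, assuming independent games."""
    wins = score - draw_ratio / 2.0
    # Per-game variance of a result in {1, 0.5, 0}: E[x^2] - E[x]^2.
    variance = wins + 0.25 * draw_ratio - score ** 2
    margin = z * math.sqrt(variance / games)
    return (elo_from_score(score + margin) - elo_from_score(score - margin)) / 2.0


score = score_from_elo(20.0)  # PHQ-2 assumed 20 Elo stronger
for draw_ratio in (0.0, 0.4):  # assumed draw ratios, not measured values
    bar = elo_error_bar(score, 4000, draw_ratio)
    print(f"draw ratio {draw_ratio:.0%}: +-{bar:.1f} Elo")
```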
Important for a good Elo:
1. Many games, but also important ...
2. Many opponents
And finding many opponents is a bit of a problem today.
The difference from place 1, Houdini, to place 30 is 450 Elo.
By my analysis you need 24-26 opponents for a good rating.
...
what I do when the day is long
Best
Frank
-
Frank Quisinsky
- Posts: 7281
- Joined: Wed Nov 18, 2009 7:16 pm
- Location: Gutweiler, Germany
- Full name: Frank Quisinsky
Re: SWCR: Stockfish 2.1.1 JA x64 PHQ-2, final ...
Hi there,
briefly, because it is late:
I wrote a text about the settings on my News page, entry 227. My second tester also played 240 games and the rating dropped a bit.
Let us look at the percentages from the 1,200 games against identical opponents:
Stockfish 2.1.1 JA x64 default = 75.54%
Stockfish 2.1.1 JA x64 PHQ-1 = 76.42%
Stockfish 2.1.1 JA x64 PHQ-2 = 76.04%
PHQ-1 won more quick games up to move 56. PHQ-2 played more draws but lost only three games up to move 56 (PHQ-1 lost 6 games). All in all, I think PHQ-1 is more interesting.
At the moment a third PHQ test is still running, with clearly more aggressive parameters than I gave PHQ-2.
Best
Frank