lkaufman wrote: After 500 direct bullet (30" + 0.3") games against H3 I'm showing +48 elo, close enough to the claimed 50, but normally elo gains diminish with increased time limit and also against unrelated engines, so I'll "predict" that the real gain (say on the CEGT 5' +3" list, which is similar to IPON) will be around 30 elo. We'll see.
In my tests at 10"+0.1" and 120"+1.2" against 9 opponents Houdini 4 without table bases is about 45 Elo better than Houdini 3.
The Syzygy 6-men add another 5 to 10 Elo in my tests at 60"+1" time control, which explains the official number of "50 Elo" for the release.
How much this will produce in rating lists is always the big surprise. Since the time management of Houdini 4 has changed as well, I'm not even trying to predict these numbers with a precision better than 20 Elo...
Cheers,
Robert
One more question: When you say that the Syzygy 6 men tb adds five to ten elo, that appears to mean compared to no TB. But shouldn't the proper comparison be with the best TB supported by Houdini 3? Or are you saying that Syzygy is that much better than any other supported TB?
By the way I'm now running a recent Komodo version (which was already tested vs H3) against H4 to measure the improvement against an unrelated opponent at 1' +.5". So far it is showing roughly midway between the 27 LS figure so far and your 45 figure. I'll report fully at the end, when I should have 4000 games.
lkaufman wrote: Thanks. Since your methodology seems to be very similar to the LS list, and since LS uses a time limit in between the two you use, can you offer any theory other than sample error for the discrepancy between your +45 figure and LS +27 figure (as last reported, subject to change of course)?
It is very easy to explain. RH has +45 (without TBs) with 1 SD of approximately 7 Elo, SP has +27 with 1 SD of approximately 10 Elo. So it is quite probable that the "real" improvement is 37 Elo, which is roughly in the middle.
In addition to that, SP has a higher average rating of opponents, which compresses the ratings more.
The numbers you mention for 1 SD look about right for 2 SD to me, assuming something close to half draws. For 2800 games with half draws and a 27 Elo gap I get a margin of error of 9.35, which might be about ten with somewhat less than half draws. These are 2 SD values, not 1 SD. I think you made some simple error. The combined margin of error is much less than the sum of the two errors; for 10 and 7 it is about 12.1, way below 17 or 18.
I think your 37 estimate might be about right, but it is not reasonable to consider 27 and 45 elo with the given numbers of games to be just sample error. There should be some other factor.
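The margin-of-error arithmetic above can be checked with a short sketch. It assumes the standard logistic Elo curve and independent games, and the function names are made up for illustration; it is not necessarily the exact method used in the post, so small differences from the quoted 9.35 and 12.1 are expected.

```python
import math

def expected_score(elo_diff):
    """Expected score under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def elo_standard_error(games, draw_ratio, elo_diff):
    """1 SD of a measured Elo difference, assuming independent games."""
    s = expected_score(elo_diff)
    win = s - draw_ratio / 2.0
    # Per-game variance of the score (win = 1, draw = 0.5, loss = 0).
    variance = win + draw_ratio / 4.0 - s * s
    se_score = math.sqrt(variance / games)
    # Slope of the Elo curve at score s converts score error to Elo error.
    slope = 400.0 / (math.log(10) * s * (1.0 - s))
    return se_score * slope

def combine(*errors):
    """Independent errors add in quadrature, not linearly."""
    return math.sqrt(sum(e * e for e in errors))

two_sd = 2.0 * elo_standard_error(2800, 0.5, 27.0)
combined = combine(10.0, 7.0)
print(f"2 SD for 2800 games, half draws, 27 Elo gap: {two_sd:.2f} Elo")
print(f"Combined margin for 10 and 7: {combined:.2f} Elo")
```

The quadrature sum is the point of the argument: two margins of 10 and 7 combine to roughly 12, not 17.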
lkaufman wrote: The numbers you mention for 1 SD look about right for 2 SD to me, assuming something close to half draws. For 2800 games with half draws and a 27 Elo gap I get a margin of error of 9.35, which might be about ten with somewhat less than half draws. These are 2 SD values, not 1 SD. I think you made some simple error. The combined margin of error is much less than the sum of the two errors; for 10 and 7 it is about 12.1, way below 17 or 18.
I think your 37 estimate might be about right, but it is not reasonable to consider 27 and 45 elo with the given numbers of games to be just sample error. There should be some other factor.
Larry
For 2800 games, assuming draw and win rates slightly better than H3 (which has 44% and 62% respectively), i.e. 41% and 66%, 1 SD between 2 opponents would be 0.66%. Since there are many opponents here, the SD is larger and you have to multiply it by at least sqrt(2), which gives 6.5 Elo. On the RH side 1 SD is about 5.8 Elo. Combined, this gives 8.7 Elo for 1 SD. 2 SD is then around 17 Elo, which is already the difference between the two results.
In addition to that, the LS list uses stronger opponents (by 60-80 Elo on average), which translates into 5-10 Elo of rating compression.
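These figures can be reproduced with a few lines. The sketch reads "41% and 66%" as draw rate and overall score, and converts score to Elo using the slope of the logistic curve at a 50% score; both readings are assumptions that happen to match the quoted 0.66%, 6.5 and 8.7. Using the steeper slope at a 66% score would give a slightly larger LS figure, roughly 7.2 Elo.

```python
import math

games = 2800
draw_rate = 0.41   # assumed reading of "41%"
score = 0.66       # assumed reading of "66%"

# Per-game score variance with win = 1, draw = 0.5, loss = 0.
win_rate = score - draw_rate / 2.0
per_game_var = win_rate + draw_rate / 4.0 - score ** 2
sd_score = math.sqrt(per_game_var / games)      # about 0.0066, i.e. 0.66%

# Assumption: Elo per unit of score taken at the 50% point of the curve.
elo_per_score = 400.0 / (math.log(10) * 0.25)   # about 695

# sqrt(2) is the post's allowance for the opponents' own rating uncertainty.
sd_ls = sd_score * math.sqrt(2.0) * elo_per_score
sd_rh = 5.8                                     # value quoted in the post

sd_diff = math.hypot(sd_ls, sd_rh)              # quadrature sum
gap = 45 - 27
z = gap / sd_diff

print(f"1 SD of score: {sd_score:.2%}")
print(f"LS 1 SD: {sd_ls:.1f} Elo, combined 1 SD: {sd_diff:.1f} Elo")
print(f"Gap of {gap} Elo is about {z:.1f} SD")
```

Under these assumptions the 18 Elo gap sits at roughly 2 SD, which is why it is described as just at the edge of what sampling error alone could explain.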
So the 18 Elo gap is just one Elo more than the margin of error, so certainly it is possible. But I don't understand your second point. Longer time limits mean rating compression, but the strength of the opposition should not affect properly calculated ratings in general, unless one uses the broken "elostat", which wrongly averages ratings. Why do you claim that stronger opponents make for rating compression in general (not specifically for Houdini)?
lkaufman wrote: So the 18 Elo gap is just one Elo more than the margin of error, so certainly it is possible. But I don't understand your second point. Longer time limits mean rating compression, but the strength of the opposition should not affect properly calculated ratings in general, unless one uses the broken "elostat", which wrongly averages ratings. Why do you claim that stronger opponents make for rating compression in general (not specifically for Houdini)?
I am not claiming in general, I am assuming (not claiming again) for Houdini 4. Even though Robert stated somewhere that H dev has 0 contempt, from TCEC games in many positions I realized that there was still some contempt present and I believe the same for H4 (I don't have it myself yet to test).
Milos wrote:I am not claiming in general, I am assuming (not claiming again) for Houdini 4. Even though Robert stated somewhere that H dev has 0 contempt, from TCEC games in many positions I realized that there was still some contempt present and I believe the same for H4 (I don't have it myself yet to test).
I've just got confirmation that the default H4 contempt is 1, so my guess of 5-10 Elo less against opponents that are 60-80 Elo stronger is quite correct.
There you go Larry, you now have the complete explanation.
P.S. I'm sure that if Stefan ran H4 with contempt 0 on his list (top 10 opponents) he would get a better result than with the default one.
lkaufman wrote: One more question: When you say that the Syzygy 6 men tb adds five to ten elo, that appears to mean compared to no TB. But shouldn't the proper comparison be with the best TB supported by Houdini 3? Or are you saying that Syzygy is that much better than any other supported TB?
I've never been able to demonstrate any strength improvement with Nalimov EGTB at fast time controls; the overhead of Nalimov appears to offset any gain (note that the balance could be different at long TC).
On the other hand, with the Syzygy system the improvement is clear even at 1'+1".
Robert
Last edited by Houdini on Wed Nov 27, 2013 2:27 am, edited 1 time in total.