Komodo 9.1 Results

TShackel · Post by **TShackel** » Wed Jul 08, 2015 6:40 pm

Hi,

Komodo 9.0 had great results against the April 12th development version of Stockfish, and against Houdini, Gull and other engines. But then I upgraded to Komodo 9.1, and at the same time upgraded to the June 17th development version of Stockfish, and Stockfish seems to be fighting back in my tournaments now. Komodo 9.0 was the clear leader against the earlier Stockfish version, but now the June 17th version of Stockfish is leading Komodo 9.1. So I'm a little concerned for Komodo's sake, and wonder if Stockfish has recently been improving substantially faster than Komodo.

These sudden changes in results are part of the reason I posted the thread "Is komodo trying to become like stockfish" and showed concern about Komodo 9.1 vs. Komodo 9.0. Perhaps all this is just a matter of Stockfish also improving at a good rate.

If anyone has any ideas as to what's causing this phenomenon where Komodo 9.0 had great results, and now Komodo 9.1 is struggling, it would be great to hear what they thought it was caused by.

Sincerely,

Tim.

zullil · Post by **zullil** » Wed Jul 08, 2015 6:50 pm

TShackel wrote: If anyone has any ideas as to what's causing this phenomenon where Komodo 9.0 had great results, and now Komodo 9.1 is struggling, it would be great to hear what they thought it was caused by.

Sincerely,

Tim.

Random variation? That is, perhaps your sample sizes are too low to measure anything but "noise"? Before looking further, can you rule this out?

leslies · Post by **leslies** » Wed Jul 08, 2015 7:00 pm

I was seeing the same results, so I started adjusting Komodo's draw score and started seeing Komodo winning again; but with different SF versions the results were different. For instance, with a draw score of -7 Komodo will beat sf15070415bmi2, but not sf040715mz. I had to drop the draw score to -2 before Komodo started winning, but then it would lose to the former SF. I DON'T get it.

TShackel · Post by **TShackel** » Wed Jul 08, 2015 8:22 pm

zullil wrote:Random variation? That is, perhaps your sample sizes are too low to measure anything but "noise"? Before looking further, can you rule this out?

You do understand that I'm a long time control tester, right? I think you and I have had similar discussions in the past. All long time control rating lists post official ratings after 300 games or so. Sure, there may be error bars for the lists, but everyone respects the ratings at CCRL, IPON and CEGT in spite of their error bars after only hundreds of games.

Sure, it's easy to play thousands of games at a 1 min + 1 sec increment time control to test every change. But the chess is not as high quality as at long time control, and long time control is where you test the engine's "chess muscle". I produce rating lists with 300 long time control games played between each pair of engines, with error bars, sure, but my results are still as valid as CCRL's and CEGT's.

So I'm not sure how you know my sample sizes, since I didn't tell you how many games were played between Komodo and Stockfish. I think 300 games is a respectable number at long time control to start seeing which engines are stronger; the CEGT 40/120 lists, for instance, only have hundreds of games, and yet everybody looks at them to see which engines are the best in spite of the error bars.

It seems like every time I ask this forum about a problem in results, the answer I always get is "more games required". But tell that to CEGT and CCRL, who post ratings with error bars after 300 games. In other words, my results are valid, and I'm wondering why Komodo 9.1's results have not been panning out as well as Komodo 9.0's.

Sincerely,

Tim.

zullil · Post by **zullil** » Wed Jul 08, 2015 8:35 pm

TShackel wrote:
zullil wrote:Random variation? That is, perhaps your sample sizes are too low to measure anything but "noise"? Before looking further, can you rule this out?
Sure, there may be error bars for the lists,

So how large are yours? Large enough to account for what you've posted about or not?

Not trying to be a pain.

kranium · Post by **kranium** » Wed Jul 08, 2015 9:25 pm

TShackel wrote:
zullil wrote:Random variation? That is, perhaps your sample sizes are too low to measure anything but "noise"? Before looking further, can you rule this out?
You do understand that I'm a long time control tester, right? I think you and I have had similar discussions in the past. All long time control rating lists post official ratings after 300 games or so. Sure, there may be error bars for the lists, but everyone respects the ratings at CCRL, IPON and CEGT in spite of their error bars after only hundreds of games.

Sure, it's easy to play thousands of games at a 1 min + 1 sec increment time control to test every change. But the chess is not as high quality as at long time control, and long time control is where you test the engine's "chess muscle". I produce rating lists with 300 long time control games played between each pair of engines, with error bars, sure, but my results are still as valid as CCRL's and CEGT's.

So I'm not sure how you know my sample sizes, since I didn't tell you how many games were played between Komodo and Stockfish. I think 300 games is a respectable number at long time control to start seeing which engines are stronger; the CEGT 40/120 lists, for instance, only have hundreds of games, and yet everybody looks at them to see which engines are the best in spite of the error bars.

It seems like every time I ask this forum about a problem in results, the answer I always get is "more games required". But tell that to CEGT and CCRL, who post ratings with error bars after 300 games. In other words, my results are valid, and I'm wondering why Komodo 9.1's results have not been panning out as well as Komodo 9.0's.

Sincerely,

Tim.

Tim-
Louis has very likely hit the nail on the head.
300 games is only enough to draw a firm conclusion if the score difference is pretty large.

Before drawing any conclusion concerning your results...calculate the Likelihood of Superiority (LOS).
It's an important formula to help you determine (after a match) if one engine is stronger than the other with a high level of certainty.
https://chessprogramming.wikispaces.com ... Statistics

I've compiled an executable based on Álvaro Begué's sample code presented on that page...
I tweaked it a little to take console input, write the results to disk (LOS.txt) as well as printing them to screen, etc.
All you need to do is enter the number of wins, draws, and losses from your match.

You can be reasonably "sure" that one engine is stronger than the other if you get LOS percentages in the high 90s, e.g. 95% or 97%.
If your result is near 50%, you might as well flip a coin to determine superiority.
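
For readers who would rather not run a downloaded binary, the LOS calculation can be sketched in a few lines of Python. This is only a minimal sketch using the commonly quoted normal approximation, LOS = 0.5 * (1 + erf((wins - losses) / sqrt(2 * (wins + losses)))), in which draws do not appear. It is not the code of the executable linked in this post; the function name `likelihood_of_superiority` and the sample counts are hypothetical.

```python
import math


def likelihood_of_superiority(wins: int, losses: int) -> float:
    """Approximate probability that the first engine is the stronger one.

    Uses the normal approximation; draws do not enter the formula.
    """
    decisive = wins + losses
    if decisive == 0:
        return 0.5  # no decisive games: no evidence either way
    z = (wins - losses) / math.sqrt(2.0 * decisive)
    return 0.5 * (1.0 + math.erf(z))


if __name__ == "__main__":
    # Hypothetical 300-game match result, for illustration only.
    wins, draws, losses = 60, 200, 40
    los = likelihood_of_superiority(wins, losses)
    print(f"+{wins} ={draws} -{losses}: LOS = {los:.1%}")
```

Because only the decisive games count, a 300-game match with many draws carries less evidence than its length suggests.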

Here's a link to download it if interested:
http://www.chesslogik.com/downloads/LOS.rar

Regards-
Norm

PS - If anyone is aware of restrictions concerning distributing this compiled snippet of code for download, please let me know, I'll remove it immediately.

syzygy · Post by **syzygy** » Wed Jul 08, 2015 10:23 pm

TShackel wrote:It seems like every time I ask this forum about a problem in results, the answer I always get is "more games required". But tell that to CEGT and CCRL, who post ratings with error bars after 300 games.

There is no need to tell them anything, as they report ratings with error bars and do not post their concerns unsupported by solid statistics.

What you could do is simply post your results, including error bars.

TShackel wrote:My results are valid in other words, and I'm wondering why Komodo 9.1 results have not been panning out as well as Komodo 9.0 results.

That's the point of statistics and error bars.

Darkmoon · Post by **Darkmoon** » Thu Jul 09, 2015 5:28 am

I'm not sure I understand the issue. I'm running Komodo 9.1 on default parameters, as I am Stockfish 290615. Ponder off. Gambitlines.

I do understand that the match is limited to only 1hr/15sec repeated and cannot be taken seriously.

But at game 58 Komodo is leading albeit by a slim margin of +8=44-6. This has to count for something!

But maybe I am misunderstanding the purport of the argument in the thread.

zullil · Post by **zullil** » Thu Jul 09, 2015 11:33 am

Darkmoon wrote: But at game 58 Komodo is leading albeit by a slim margin of +8=44-6. This has to count for something!

I just flipped a coin 58 times. It landed heads 30 times and tails 28 times. What do you conclude about the coin?
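
The coin analogy can be made concrete by putting an error bar on the +8=44-6 score quoted above. The following is a minimal Python sketch, assuming the usual logistic Elo model and a normal approximation for the 95% interval; the function names `score_to_elo` and `elo_with_error_bar` are hypothetical, and the rating lists mentioned in this thread may compute their error bars differently.

```python
import math


def score_to_elo(score: float) -> float:
    """Convert a score fraction in (0, 1) to an Elo difference (logistic model)."""
    return 400.0 * math.log10(score / (1.0 - score))


def elo_with_error_bar(wins: int, draws: int, losses: int, z: float = 1.96):
    """Return (elo, elo_low, elo_high) for a match result.

    Assumes a normal approximation and that the interval for the score
    stays strictly inside (0, 1); very lopsided results will raise an error.
    """
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    variance = (
        wins * (1.0 - score) ** 2
        + draws * (0.5 - score) ** 2
        + losses * (0.0 - score) ** 2
    ) / n
    margin = z * math.sqrt(variance / n)
    return (
        score_to_elo(score),
        score_to_elo(score - margin),
        score_to_elo(score + margin),
    )


if __name__ == "__main__":
    # Match result quoted in this thread: +8 =44 -6 after 58 games.
    elo, low, high = elo_with_error_bar(8, 44, 6)
    print(f"Elo difference: {elo:+.0f} (95% interval {low:+.0f} to {high:+.0f})")
```

For this 58-game sample the interval spans zero, so the score is consistent with either engine being the stronger one.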

Waschbaer · Post by **Waschbaer** » Thu Jul 09, 2015 11:41 am

I killed myself by shooting, 100 times.
But I'm not dead until I did it > 300 times?
