Toga The Killer 1Y MP 4CPU is the strongest Toga....

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by bob »

krazyken wrote:
bob wrote:
krazyken wrote:
As a final note, if I post 10 games and a few others post their 10-game matches, I would have much more info than if nobody posted their 10 games for fear of having their time called worthless.
However, that is _not_ what is happening. Someone is posting ten game results, and drawing a conclusion that is not supported by any statistical analysis of any kind. 10 games. Almost a +/- 200 error bar. Not very informative or useful.
I didn't see anybody drawing conclusions in this thread from only 10 games. It seems the person who was being ridiculed for posting only 10 games has already posted 70, and still nobody has drawn any conclusions based on the data.
Then why are we having this discussion?
krazyken

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by krazyken »

bob wrote:
krazyken wrote:
bob wrote:
krazyken wrote:
As a final note, if I post 10 games and a few others post their 10-game matches, I would have much more info than if nobody posted their 10 games for fear of having their time called worthless.
However, that is _not_ what is happening. Someone is posting ten game results, and drawing a conclusion that is not supported by any statistical analysis of any kind. 10 games. Almost a +/- 200 error bar. Not very informative or useful.
I didn't see anybody drawing conclusions in this thread from only 10 games. It seems the person who was being ridiculed for posting only 10 games has already posted 70, and still nobody has drawn any conclusions based on the data.
Then why are we having this discussion?
It might be for the camaraderie?
krazyken

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by krazyken »

bob wrote:
krazyken wrote:
bob wrote:
bob wrote: Tell me what you can determine from 10 games between A and B. You can't, with any confidence at all, decide which is best. You can't, with any confidence, decide that A and B are at least reliable and can play long matches without one crashing. You can't, with any confidence, decide that neither A nor B will uncork an illegal move or fail to recognize an opponent's legal move in a long match.

So exactly _what_ can you conclude from 10 games?
That will depend on the results. 10-0-0 tells you one thing, 0-0-10 tells you another.

Yes, one tells me I won ten games, the other tells me I lost ten games. I can draw _no_ conclusions about which is better. I've shown the data in the past. Many times the first 100 games show something _entirely_ different from what I find after 32,000 games.
Your setup is different. You are using various starting positions, so your results will be dependent upon the order of the starting positions.
Except that the tests start in a random order... so there is no predicting.
I suppose that's why not all of your tests take unpredicted turns.
bob

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by bob »

krazyken wrote:
bob wrote:
krazyken wrote:
bob wrote:
krazyken wrote:
As a final note, if I post 10 games and a few others post their 10-game matches, I would have much more info than if nobody posted their 10 games for fear of having their time called worthless.
However, that is _not_ what is happening. Someone is posting ten game results, and drawing a conclusion that is not supported by any statistical analysis of any kind. 10 games. Almost a +/- 200 error bar. Not very informative or useful.
I didn't see anybody drawing conclusions in this thread from only 10 games. It seems the person who was being ridiculed for posting only 10 games has already posted 70, and still nobody has drawn any conclusions based on the data.
Then why are we having this discussion?
It might be for the camaraderie?
Let's go back to "here":
Dr.Wael Deeb wrote:
jpqy wrote:More results:

Blitz 5min Core i7 @3.89Ghz 2009
  • TTK.cirebonb1Y.st.4cpu_b-Stockfish_13_win32_ja 5.0 - 5.0 +4/-4/=2 50.00%
    TTK.cirebonb1Y.st.4cpu_b-Grapefruit 1.0 alpha 3 5.0 - 5.0 +3/-3/=4 50.00%
    TTK.cirebonb1Y.st.4cpu_b-MP-x86-Inert---Thinker 5.4D 5.5 - 4.5 +4/-3/=3 55.00%
    TTK.cirebonb1Y.st.4cpu_b-TogaII141SE6-4cpu 3.0 - 7.0 +1/-5/=4 30.00%
    TTK.cirebonb1Y.st.4cpu_b-Glaurung22_win32_ja 6.0 - 4.0 +3/-1/=6 60.00%
JP.
Thanks,a pretty good results I assume....
Dr.D
"Thanks, a pretty good results I assume...."

Is that not a "conclusion"? Based on ten games per opponent? With an astronomical error bar?
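
To put a rough number on that error bar, here is a minimal sketch that turns each of the quoted 10-game results into an Elo estimate with an approximate 95% interval. It assumes games are independent and uses a plain normal approximation on the score, which is crude at 10 games and is not what BayesElo does; the function names and the shortened opponent labels are made up for illustration.

```python
import math

def elo_from_score(score):
    """Elo difference implied by an expected score (clamped away from 0 and 1)."""
    score = min(max(score, 1e-6), 1.0 - 1e-6)
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_interval(wins, losses, draws, z=1.96):
    """Return (estimate, low, high) in Elo using a normal approximation."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result, counting win=1, draw=0.5, loss=0.
    variance = (wins + 0.25 * draws) / n - score ** 2
    half_width = z * math.sqrt(variance / n)
    return (
        elo_from_score(score),
        elo_from_score(score - half_width),
        elo_from_score(score + half_width),
    )

# (opponent label, wins, losses, draws) from the first-named engine's side,
# copied from the results quoted above.
MATCHES = [
    ("Stockfish", 4, 4, 2),
    ("Grapefruit", 3, 3, 4),
    ("Thinker", 4, 3, 3),
    ("TogaII141SE6", 1, 5, 4),
    ("Glaurung22", 3, 1, 6),
]

for name, w, l, d in MATCHES:
    est, low, high = elo_interval(w, l, d)
    print(f"{name:14s} {est:+7.0f}  [{low:+7.0f}, {high:+7.0f}]")
```

Under this approximation every one of the five intervals includes zero, so none of the matches separates the two engines, not even the 3.0 - 7.0 result.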
bob

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by bob »

krazyken wrote:
bob wrote:
krazyken wrote:
bob wrote:
bob wrote: Tell me what you can determine from 10 games between A and B. You can't, with any confidence at all, decide which is best. You can't, with any confidence, decide that A and B are at least reliable and can play long matches without one crashing. You can't, with any confidence, decide that neither A nor B will uncork an illegal move or fail to recognize an opponent's legal move in a long match.

So exactly _what_ can you conclude from 10 games?
That will depend on the results. 10-0-0 tells you one thing, 0-0-10 tells you another.

Yes, one tells me I won ten games, the other tells me I lost ten games. I can draw _no_ conclusions about which is better. I've shown the data in the past. Many times the first 100 games show something _entirely_ different from what I find after 32,000 games.
Your setup is different. You are using various starting positions, so your results will be dependent upon the order of the starting positions.
Except that the tests start in a random order... so there is no predicting.
I suppose that's why not all of your tests take unpredicted turns.
Not sure what you mean. I produce a ton of scripts to play a set of positions. I queue them up and they are assigned semi-random priorities, mixed in with other jobs. I never know which single run (which will be one opponent, 8 positions, 2 games per position) will be run first or last. Early results _always_ jump all over the place, with the "jumping around" smoothing out as we pass the 5,000-10,000 game level, until we reach the final 32,000 game result, which is very repeatable within the +/-4 error bar. I can eliminate all but one opponent and see the _same_ jumping around until we reach a significant number of games. Crafty vs Glaurung 2.2 is a good example. G2.2 is about +60 better in my testing. But it is not uncommon for Crafty to start off +100 after a few dozen games. Or sometimes -200. And then as the games pile up, the result settles in at around -60 as it should.

It is just the nature of computer vs computer chess games.
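
For anyone who wants to see the effect without a cluster, here is a small simulation sketch. It is an assumption-laden toy, not the actual test setup: each game is drawn independently with a fixed win/draw/loss probability derived from a true difference of -60 Elo and an assumed 25% draw rate, and the running score is converted back to Elo at a few checkpoints. All names here are hypothetical.

```python
import math
import random

def expected_score(elo_diff):
    """Logistic expected score for a given Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def elo_from_score(score):
    """Inverse of expected_score, clamped so 0% and 100% stay finite."""
    score = min(max(score, 1e-6), 1.0 - 1e-6)
    return -400.0 * math.log10(1.0 / score - 1.0)

def running_estimates(true_elo, draw_rate, checkpoints, seed=0):
    """Simulate independent games and report the Elo estimate at each checkpoint."""
    rng = random.Random(seed)
    win_p = expected_score(true_elo) - draw_rate / 2.0
    if win_p < 0.0 or win_p + draw_rate > 1.0:
        raise ValueError("draw_rate is too large for this Elo difference")
    points = 0.0
    played = 0
    estimates = []
    for target in sorted(checkpoints):
        while played < target:
            r = rng.random()
            if r < win_p:
                points += 1.0          # win
            elif r < win_p + draw_rate:
                points += 0.5          # draw
            played += 1                # a loss adds nothing
        estimates.append((target, elo_from_score(points / played)))
    return estimates

if __name__ == "__main__":
    # Assumed values: -60 Elo true difference, 25% draws.
    for seed in range(3):
        row = running_estimates(-60.0, 0.25, [10, 50, 200, 1000, 32000], seed)
        print(seed, " ".join(f"{n}:{elo:+.0f}" for n, elo in row))
```

Running it with different seeds shows how far the early checkpoints can sit from the assumed true value compared with the last one. Real engine games are not perfectly independent draws, so treat this only as an illustration of sampling noise.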
krazyken

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by krazyken »

bob wrote:
krazyken wrote:
bob wrote:
krazyken wrote:
bob wrote:
bob wrote: Tell me what you can determine from 10 games between A and B. You can't, with any confidence at all, decide which is best. You can't, with any confidence, decide that A and B are at least reliable and can play long matches without one crashing. You can't, with any confidence, decide that neither A nor B will uncork an illegal move or fail to recognize an opponent's legal move in a long match.

So exactly _what_ can you conclude from 10 games?
That will depend on the results. 10-0-0 tells you one thing, 0-0-10 tells you another.

Yes, one tells me I won ten games, the other tells me I lost ten games. I can draw _no_ conclusions about which is better. I've shown the data in the past. Many times the first 100 games show something _entirely_ different from what I find after 32,000 games.
Your setup is different. You are using various starting positions, so your results will be dependent upon the order of the starting positions.
Except that the tests start in a random order... so there is no predicting.
I suppose that's why not all of your tests take unpredicted turns.
Not sure what you mean. I produce a ton of scripts to play a set of positions. I queue them up and they are assigned semi-random priorities, mixed in with other jobs. I never know which single run (which will be one opponent, 8 positions, 2 games per position) will be run first or last. Early results _always_ jump all over the place, with the "jumping around" smoothing out as we pass the 5,000-10,000 game level, until we reach the final 32,000 game result, which is very repeatable within the +/-4 error bar. I can eliminate all but one opponent and see the _same_ jumping around until we reach a significant number of games. Crafty vs Glaurung 2.2 is a good example. G2.2 is about +60 better in my testing. But it is not uncommon for Crafty to start off +100 after a few dozen games. Or sometimes -200. And then as the games pile up, the result settles in at around -60 as it should.

It is just the nature of computer vs computer chess games.
What I mean is that by introducing a 3rd variable, with starting positions, your early results can easily vary widely based on the order of the positions. You won't reach a good representation of the 32000 positions until after you've run a few thousand games. What you see in your practice runs due to the way you have set them up will not necessarily apply to tests set up in a different manner.
Jaimes Conda
Posts: 921
Joined: Mon May 29, 2006 11:18 pm
Location: For now the planet Earth

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by Jaimes Conda »

bob wrote:
krazyken wrote:
bob wrote:
krazyken wrote:
bob wrote:
bob wrote: Tell me what you can determine from 10 games between A and B. You can't, with any confidence at all, decide which is best. You can't, with any confidence, decide that A and B are at least reliable and can play long matches without one crashing. You can't, with any confidence, decide that neither A nor B will uncork an illegal move or fail to recognize an opponent's legal move in a long match.

So exactly _what_ can you conclude from 10 games?
That will depend on the results. 10-0-0 tells you one thing, 0-0-10 tells you another.

Yes, one tells me I won ten games, the other tells me I lost ten games. I can draw _no_ conclusions about which is better. I've shown the data in the past. Many times the first 100 games show something _entirely_ different from what I find after 32,000 games.
Your setup is different. You are using various starting positions, so your results will be dependent upon the order of the starting positions.
Except that the tests start in a random order... so there is no predicting.
I suppose that's why not all of your tests take unpredicted turns.
Not sure what you mean. I produce a ton of scripts to play a set of positions. I queue them up and they are assigned semi-random priorities, mixed in with other jobs. I never know which single run (which will be one opponent, 8 positions, 2 games per position) will be run first or last. Early results _always_ jump all over the place, with the "jumping around" smoothing out as we pass the 5,000-10,000 game level, until we reach the final 32,000 game result, which is very repeatable within the +/-4 error bar. I can eliminate all but one opponent and see the _same_ jumping around until we reach a significant number of games. Crafty vs Glaurung 2.2 is a good example. G2.2 is about +60 better in my testing. But it is not uncommon for Crafty to start off +100 after a few dozen games. Or sometimes -200. And then as the games pile up, the result settles in at around -60 as it should.

It is just the nature of computer vs computer chess games.
Bob,
Two questions.

1) When you are playing 10,000+ computer matches with given opponents,
what is the game time? E.g., game in one minute or game in sixty minutes?

2) Will the Elo results be approximately the same whether the game is one minute or sixty minutes? Just as an example, Crafty plays 20,000 games at game in one minute and let's say this gives an Elo of 2700. What would you expect Crafty's Elo to be after 20,000 games with the same opponents at game in two hours?

Jaimes
Veritas Vos Liberabit
gerold
Posts: 10121
Joined: Thu Mar 09, 2006 12:57 am
Location: van buren,missouri

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by gerold »

Jaimes Conda wrote:
bob wrote:
krazyken wrote:
bob wrote:
krazyken wrote:
bob wrote:
bob wrote: Tell me what you can determine from 10 games between A and B. You can't, with any confidence at all, decide which is best. You can't, with any confidence, decide that A and B are at least reliable and can play long matches without one crashing. You can't, with any confidence, decide that neither A nor B will uncork an illegal move or fail to recognize an opponent's legal move in a long match.

So exactly _what_ can you conclude from 10 games?
That will depend on the results. 10-0-0 tells you one thing, 0-0-10 tells you another.

Yes, one tells me I won ten games, the other tells me I lost ten games. I can draw _no_ conclusions about which is better. I've shown the data in the past. Many times the first 100 games show something _entirely_ different from what I find after 32,000 games.
Your setup is different. You are using various starting positions, so your results will be dependent upon the order of the starting positions.
Except that the tests start in a random order... so there is no predicting.
I suppose that's why not all of your tests take unpredicted turns.
Not sure what you mean. I produce a ton of scripts to play a set of positions. I queue them up and they are assigned semi-random priorities, mixed in with other jobs. I never know which single run (which will be one opponent, 8 positions, 2 games per position) will be run first or last. Early results _always_ jump all over the place, with the "jumping around" smoothing out as we pass the 5,000-10,000 game level, until we reach the final 32,000 game result, which is very repeatable within the +/-4 error bar. I can eliminate all but one opponent and see the _same_ jumping around until we reach a significant number of games. Crafty vs Glaurung 2.2 is a good example. G2.2 is about +60 better in my testing. But it is not uncommon for Crafty to start off +100 after a few dozen games. Or sometimes -200. And then as the games pile up, the result settles in at around -60 as it should.

It is just the nature of computer vs computer chess games.
Bob,
Two questions.

1) When you are playing 10,000+ computer matches with given opponents,
what is the game time? E.g., game in one minute or game in sixty minutes?

2) Will the Elo results be approximately the same whether the game is one minute or sixty minutes? Just as an example, Crafty plays 20,000 games at game in one minute and let's say this gives an Elo of 2700. What would you expect Crafty's Elo to be after 20,000 games with the same opponents at game in two hours?

Jaimes
Very good question. I have been testing engines for 5 years now.
Playing engines vs engines at 10 mins and at 40 mins. Getting
mixed results on your question.

Best,

Gerold.
bob

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by bob »

Jaimes Conda wrote:
bob wrote:
krazyken wrote:
bob wrote:
krazyken wrote:
bob wrote:
bob wrote: Tell me what you can determine from 10 games between A and B. You can't, with any confidence at all, decide which is best. You can't, with any confidence, decide that A and B are at least reliable and can play long matches without one crashing. You can't, with any confidence, decide that neither A nor B will uncork an illegal move or fail to recognize an opponent's legal move in a long match.

So exactly _what_ can you conclude from 10 games?
That will depend on the results. 10-0-0 tells you one thing, 0-0-10 tells you another.

Yes, one tells me I won ten games, the other tells me I lost ten games. I can draw _no_ conclusions about which is better. I've shown the data in the past. Many times the first 100 games show something _entirely_ different from what I find after 32,000 games.
Your setup is different. You are using various starting positions, so your results will be dependent upon the order of the starting positions.
Except that the tests start in a random order... so there is no predicting.
I suppose that's why not all of your tests take unpredicted turns.
Not sure what you mean. I produce a ton of scripts to play a set of positions. I queue them up and they are assigned semi-random priorities, mixed in with other jobs. I never know which single run (which will be one opponent, 8 positions, 2 games per position) will be run first or last. Early results _always_ jump all over the place, with the "jumping around" smoothing out as we pass the 5,000-10,000 game level, until we reach the final 32,000 game result, which is very repeatable within the +/-4 error bar. I can eliminate all but one opponent and see the _same_ jumping around until we reach a significant number of games. Crafty vs Glaurung 2.2 is a good example. G2.2 is about +60 better in my testing. But it is not uncommon for Crafty to start off +100 after a few dozen games. Or sometimes -200. And then as the games pile up, the result settles in at around -60 as it should.

It is just the nature of computer vs computer chess games.
Bob,
Two questions.

1) When you are playing 10,000+ computer matches with given opponents,
what is the game time? E.g., game in one minute or game in sixty minutes?
Depends on what I am testing. For evaluation ideas, I start with very fast games (10 secs on clock, 10ms increment) so that I can complete the 32,000 games in an hour or so. For changes that look good, I then will play slower games. 1+1 (one min on clock, 1 sec increment) takes about 12 hours to complete. If I am testing search ideas, I usually run 1+1 games, and for changes that look reasonable, or which intuition suggests might be better at longer time controls, I play longer games. I rarely go past 5+5 except for final verification. 5+5 takes about 2 days to finish. 60+60 is almost 2 weeks.

2) Will the Elo results be approximately the same whether the game is one minute or sixty minutes? Just as an example, Crafty plays 20,000 games at game in one minute and let's say this gives an Elo of 2700. What would you expect Crafty's Elo to be after 20,000 games with the same opponents at game in two hours?

Jaimes
The last question points out a real problem. There are changes that look good at fast time controls and bad at long ones, and vice-versa. But those are fairly rare. I see more of the case where a change either helps at fast games and does little or nothing for longer games, or else it doesn't make any difference in fast games, but helps in longer games.
bob

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by bob »

krazyken wrote:
bob wrote:
krazyken wrote:
bob wrote:
krazyken wrote:
bob wrote:
bob wrote: Tell me what you can determine from 10 games between A and B. You can't, with any confidence at all, decide which is best. You can't, with any confidence, decide that A and B are at least reliable and can play long matches without one crashing. You can't, with any confidence, decide that neither A nor B will uncork an illegal move or fail to recognize an opponent's legal move in a long match.

So exactly _what_ can you conclude from 10 games?
That will depend on the results. 10-0-0 tells you one thing, 0-0-10 tells you another.

Yes, one tells me I won ten games, the other tells me I lost ten games. I can draw _no_ conclusions about which is better. I've shown the data in the past. Many times the first 100 games show something _entirely_ different from what I find after 32,000 games.
Your setup is different. You are using various starting positions, so your results will be dependent upon the order of the starting positions.
Except that the tests start in a random order... so there is no predicting.
I suppose that's why not all of your tests take unpredicted turns.
Not sure what you mean. I produce a ton of scripts to play a set of positions. I queue them up and they are assigned semi-random priorities, mixed in with other jobs. I never know which single run (which will be one opponent, 8 positions, 2 games per position) will be run first or last. Early results _always_ jump all over the place, with the "jumping around" smoothing out as we pass the 5,000-10,000 game level, until we reach the final 32,000 game result, which is very repeatable within the +/-4 error bar. I can eliminate all but one opponent and see the _same_ jumping around until we reach a significant number of games. Crafty vs Glaurung 2.2 is a good example. G2.2 is about +60 better in my testing. But it is not uncommon for Crafty to start off +100 after a few dozen games. Or sometimes -200. And then as the games pile up, the result settles in at around -60 as it should.

It is just the nature of computer vs computer chess games.
What I mean is that by introducing a 3rd variable, with starting positions, your early results can easily vary widely based on the order of the positions. You won't reach a good representation of the 32000 positions until after you've run a few thousand games. What you see in your practice runs due to the way you have set them up will not necessarily apply to tests set up in a different manner.
Couple of points.

1. I have 4,000 positions. Each position can be played in any order, but as a position is played, two games alternating sides are always played back-to-back. No matter what the order, things are always very volatile for the first 500-1000 games at least. I suppose it might be possible to "replay" a complete match, running samples thru BayesElo every 30 seconds to see how things start and finish and vary over the course of 32,000 games, to show the point.

2. What kind of "different manner" could you use? If you use one computer, you could play the positions in the same order each time. But if you use more than one machine, inherent randomness is still going to get the machines "out of step" pretty quickly. Using an opening book is certainly going to increase the instability, not help it.

Here's some sample data I did have. This represents the results over the first 1,500 games played in one of my matches. More after the data.

                        --error- total   win   avg  draw
                   Elo   +    -  games   pct  opp.   pct
Crafty-23.1       2608  145  146     3   50%  2596   33% 
Crafty-23.1       2591  130  134     5   40%  2608   40% 
Crafty-23.1       2586  114  118     9   39%  2614   33% 
Crafty-23.1       2562  109  115    11   32%  2623   27% 
Crafty-23.1       2582  103  107    13   38%  2606   31% 
Crafty-23.1       2557   84   88    23   37%  2603   30% 
Crafty-23.1       2591   74   75    34   46%  2594   26% 
Crafty-23.1       2604   68   67    49   53%  2551   20% 
Crafty-23.1       2628   60   59    67   57%  2534   19% 
Crafty-23.1       2639   50   49    92   57%  2556   22% 
Crafty-23.1       2629   43   43   122   54%  2567   21% 
Crafty-23.1       2636   39   38   156   54%  2578   19% 
Crafty-23.1       2659   36   36   188   57%  2580   16% 
Crafty-23.1       2663   34   34   212   58%  2579   16% 
Crafty-23.1       2665   33   32   230   58%  2578   16% 
Crafty-23.1       2659   31   31   244   58%  2580   18% 
Crafty-23.1       2663   31   30   254   58%  2580   18% 
Crafty-23.1       2661   29   29   277   59%  2578   19% 
Crafty-23.1       2656   29   28   297   58%  2580   19% 
Crafty-23.1       2645   27   27   317   56%  2582   19% 
Crafty-23.1       2639   27   26   341   56%  2585   19% 
Crafty-23.1       2642   25   25   378   56%  2585   19% 
Crafty-23.1       2642   24   24   397   57%  2585   20% 
Crafty-23.1       2644   24   24   417   57%  2586   19% 
Crafty-23.1       2648   24   23   440   58%  2584   19% 
Crafty-23.1       2645   23   23   463   57%  2585   19% 
Crafty-23.1       2647   22   22   483   58%  2583   19% 
Crafty-23.1       2646   22   22   500   58%  2584   20% 
Crafty-23.1       2650   22   22   517   58%  2581   19% 
Crafty-23.1       2650   21   21   531   58%  2582   20% 
Crafty-23.1       2648   21   21   556   58%  2584   21% 
Crafty-23.1       2644   20   20   592   57%  2585   21% 
Crafty-23.1       2649   20   20   617   58%  2583   21% 
Crafty-23.1       2649   19   19   634   58%  2584   21% 
Crafty-23.1       2651   19   19   651   58%  2584   21% 
Crafty-23.1       2648   19   19   674   58%  2584   23% 
Crafty-23.1       2644   18   18   704   57%  2585   23% 
Crafty-23.1       2642   18   18   731   57%  2585   23% 
Crafty-23.1       2642   18   18   742   57%  2586   23% 
Crafty-23.1       2642   18   17   765   57%  2586   24% 
Crafty-23.1       2642   17   17   788   57%  2587   24% 
Crafty-23.1       2643   17   17   810   57%  2586   24% 
Crafty-23.1       2644   17   17   839   57%  2586   23% 
Crafty-23.1       2644   17   16   867   57%  2586   23% 
Crafty-23.1       2644   16   16   887   57%  2586   23% 
Crafty-23.1       2643   16   16   911   57%  2586   23% 
Crafty-23.1       2644   16   16   933   57%  2586   23% 
Crafty-23.1       2643   16   16   949   57%  2586   23% 
Crafty-23.1       2642   16   16   972   57%  2586   23% 
Crafty-23.1       2640   15   15  1000   57%  2587   23% 
Crafty-23.1       2640   15   15  1028   57%  2586   23% 
Crafty-23.1       2640   15   15  1047   57%  2586   23% 
Crafty-23.1       2640   15   15  1070   57%  2586   22% 
Crafty-23.1       2640   15   15  1086   57%  2586   22% 
Crafty-23.1       2638   15   15  1115   56%  2587   22% 
Crafty-23.1       2637   15   14  1137   56%  2587   22% 
Crafty-23.1       2636   14   14  1154   56%  2588   22% 
Crafty-23.1       2637   14   14  1177   56%  2587   22% 
Crafty-23.1       2637   14   14  1191   56%  2588   22% 
Crafty-23.1       2636   14   14  1218   56%  2588   22% 
Crafty-23.1       2635   14   14  1249   56%  2588   22% 
Crafty-23.1       2634   14   14  1271   56%  2588   22% 
Crafty-23.1       2632   14   14  1295   55%  2589   22% 
Crafty-23.1       2630   14   13  1327   55%  2589   22% 
Crafty-23.1       2628   13   13  1343   55%  2590   22% 
Crafty-23.1       2627   13   13  1365   55%  2590   22% 
Crafty-23.1       2627   13   13  1385   55%  2590   22% 
Crafty-23.1       2626   13   13  1405   54%  2591   22% 
Crafty-23.1       2626   13   13  1426   54%  2591   22% 
Crafty-23.1       2627   13   13  1445   55%  2590   22% 
Crafty-23.1       2625   13   13  1469   54%  2591   22% 
                       ...
Crafty-23.1       2611    5    4 31128   53%  2591   22% 
After 3 games, the results are actually pretty close to their final value. After 23 games we are at a low point of 2557. Then we climb back to well above reality and hover there from around 200 games until almost 300, where it begins to drop back down slowly (with a couple of jumps back beyond 2650). By 600 games we are still almost +40 too high. Over time the results settle to 2611, which changes little near the end.

But notice the "early returns". If you ran just 10 games, as someone here did, you are -50 (way under-estimating the final result). If you had run just 200 games, you are now +50 (way over-estimating final result).

This is the reason I said 10 games mean absolutely nothing. All this takes is some "balanced" positions where the games can go either way. And they will, thanks to the inherent randomness caused by using time to constrain the search. Adding opening books makes this worse, which is why I started testing without books a long time back.
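
As a rough companion to the table, the sketch below estimates how the 95% Elo margin shrinks with the number of games, and how many games a target margin would take. It is a back-of-the-envelope normal approximation that assumes independent games with a fixed score and draw rate; it is not BayesElo's calculation and will not reproduce the +/- columns above exactly. The function names are hypothetical, and the 53% score and 22% draw rate are taken from the last row of the data.

```python
import math

def per_game_sd(score, draw_rate):
    """Standard deviation of one game's result (win=1, draw=0.5, loss=0)."""
    win_rate = score - draw_rate / 2.0
    second_moment = win_rate + 0.25 * draw_rate
    return math.sqrt(second_moment - score ** 2)

def elo_margin(games, score, draw_rate, z=1.96):
    """Approximate +/- Elo margin after `games` games."""
    # Slope of Elo = 400 * log10(s / (1 - s)) with respect to the score s.
    slope = 400.0 / (math.log(10.0) * score * (1.0 - score))
    return z * per_game_sd(score, draw_rate) / math.sqrt(games) * slope

def games_needed(margin, score, draw_rate, z=1.96):
    """Approximate number of games for the margin to fall to `margin` Elo."""
    one_game = elo_margin(1, score, draw_rate, z)
    return math.ceil((one_game / margin) ** 2)

if __name__ == "__main__":
    score, draw_rate = 0.53, 0.22  # from the final row of the table
    for n in (10, 100, 1000, 10000, 32000):
        print(f"{n:6d} games: about +/- {elo_margin(n, score, draw_rate):.0f} Elo")
    print("games for +/- 4 Elo:", games_needed(4.0, score, draw_rate))
```

The margin falls with the square root of the game count, so cutting the error bar in half costs four times as many games.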