Sedat,
Sedat Canbaz wrote:
Laskos wrote:
You said that based on the score 3.5:2.5 between two engines separated by no more than 50 Elo points.
Sedat Canbaz wrote:
One thing more (for those who does no agree with me),
I just wonder, what is wrong with the current my below statement ???
We need more games...,but as i expected,Komodo is started to show its real power
Really i dont want to loose more time over this issue...
Best,
Sedat
After I gave you examples of what to do with these numbers, you are coming again with some pretty silly statements.
I never said that 10 games are enough data or even 300-500 games are not enough data to show the engines real strenght
And i strongly believe in that:
-Minimum 1.000 games per player is required for reliable rating
Kai
Its ok...my statements can be silly for you,no problem !
As far as i remember,(in the past) you was also against my views...it seems you did not changed a lot
And i strongly believe that Perfect 2012 book series are very well optimized, safe and i recommend to be used by Top Chess Programs
Actually i was just trying to explain that WE NEED MORE GAMES and the openings can be serious factor of any engine playing strenght
If you still have different view than mine,then i challenge you to register in Testo Third Book League
Who knows, maybe you are very good Book Maker...and one day i can see you in SCCT Super League Tournament
Best,
Sedat
I'm sure your book is good, and you put a lot of effort into it. I personally use it, as well as the Stockfish book from Silvo Spitaleri. As for your previous disagreements with Kai, I don't know (or care).
But what Kai is trying to say is that it would really be beneficial for you testers to have some background in probability and statistics, to avoid drawing hasty conclusions.
Let X(i) be the random variable equal to the score of engine A versus B in the i-th game (values 0, 0.5 or 1). And let S(n)=(X(1)+...+X(n))/n.
Let's assume that both engines are equal, in other words E(X(i))=0.5. If you stop the experiment whenever you see a value S(i)>0.5, you're in for some serious problems... It can be shown that the probability of the event {there exists an i such that S(i)>0.5} is 1. In other words you're almost surely going to conclude that A is better (the same could be done to conclude that B is better). Intuitively, an unbiased random walk will almost surely cross the X-axis (in fact it will almost surely cross it an infinite number of times).
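To put numbers on this, here is a small sketch under a simplifying assumption that is mine: no draws, so each game is a fair coin flip. The function name `prob_ever_ahead` is made up for the illustration. For a fair walk the reflection principle gives P(A is never ahead during the first n games) = C(n, n//2) / 2^n, so the chance of seeing S(i)>0.5 at some point creeps up to 1 as n grows.

```python
from math import comb

def prob_ever_ahead(n_games):
    """Probability that A leads at least once during the first n_games.

    Assumption (not from the thread): two equal engines and no draws,
    so the score difference is a fair +1/-1 random walk.
    Reflection principle: P(never ahead) = C(n, n//2) / 2**n.
    """
    never_ahead = comb(n_games, n_games // 2) / 2 ** n_games
    return 1.0 - never_ahead

if __name__ == "__main__":
    for n in (1, 3, 10, 100, 1000, 10000):
        print(n, round(prob_ever_ahead(n), 4))
```

With draws the walk just moves more slowly, the limit is still 1.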
There are different problems:
1/ Fixed sample testing: you decide the number of games to be played before the experiment and you stick to it no matter what scores you see! This problem is simple, as the central limit theorem provides a good approximation of the confidence interval at the end of the experiment, provided N is "sufficiently large".
2/ You play on, and only decide to stop by following a predetermined stopping algorithm. This problem is far more complicated, and as far as I know the most efficient and practical stopping rule is the sequential Wald test (or Sequential Probability Ratio Test). Even this test however has type I and type II errors, so it shouldn't be used unless you understand fully how it works. There are also algorithms w/o type II error. I remember the empirical Bernstein stopping rule for example. But those take considerably longer to terminate (they terminate almost surely if E(X)!=0.5).
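For the curious, the Wald test in 2/ can be sketched in a few lines. This is a simplified version under my own assumptions: draws are ignored (only wins and losses are counted), H0 says A wins with probability `p0`, H1 says `p1`, and the stopping bounds are Wald's usual approximations log(beta/(1-alpha)) and log((1-beta)/alpha). The function name and the default values are picked for illustration only.

```python
import math

def sprt_decision(wins, losses, p0=0.5, p1=0.55, alpha=0.05, beta=0.05):
    """Simplified SPRT on decisive games only (draws ignored, an assumption).

    H0: P(A wins) = p0, H1: P(A wins) = p1.
    alpha / beta are the accepted type I / type II error rates.
    Returns "accept H1", "accept H0" or "continue".
    """
    # Log-likelihood ratio of H1 versus H0 for the observed results.
    llr = wins * math.log(p1 / p0) + losses * math.log((1 - p1) / (1 - p0))
    lower = math.log(beta / (1 - alpha))   # at or below this: accept H0
    upper = math.log((1 - beta) / alpha)   # at or above this: accept H1
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"

if __name__ == "__main__":
    print(sprt_decision(10, 10))    # too early to say anything
    print(sprt_decision(100, 50))
    print(sprt_decision(50, 100))
```

The point is that the decision to stop depends on the likelihood ratio crossing a bound fixed in advance, not on the score "looking good" at some moment.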
I suggest you only focus on 1/, as 2/ requires some deeper knowledge and understanding of probability/statistics. But you have to follow the recipe rigorously, or you'll end up making erroneous conclusions.
Let's say (in the context of 1/) you're doing N=10 games (you decide on that and stick to it no matter what the intermediate scores are). Then a score of 9-1 is significant at the 95% confidence level. If we remove the possibility of draws to simplify the problem, this can be shown by an elementary calculation (binomial law, a classic from high school that everyone should know). It can also be shown in the real world by reducing the problem to one dimension and using a formula where P(win) and P(draw) are functions of E(X). Remi Coulom made an estimation of such a function for Bayeselo.
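The binomial calculation for the no-draw case looks like this (a sketch, the function name `binomial_tail` is mine). Under the hypothesis that both engines are equal, the chance of A scoring 9 or more out of 10 is (C(10,9)+C(10,10))/2^10 = 11/1024, about 1.1%, which is below the 5% threshold.

```python
from math import comb

def binomial_tail(n_games, min_wins, p_win=0.5):
    """P(at least min_wins wins in n_games) when each game is won with
    probability p_win and there are no draws (simplifying assumption)."""
    return sum(
        comb(n_games, k) * p_win ** k * (1 - p_win) ** (n_games - k)
        for k in range(min_wins, n_games + 1)
    )

if __name__ == "__main__":
    p_value = binomial_tail(10, 9)
    print(p_value)          # 11/1024
    print(p_value < 0.05)   # significant at the 95% level
```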
**But** do not go and think now that every time you see 9-1 you can stop the experiment. If you decided on N=100 at the beginning, and you see 9-1, you shouldn't stop! With N=100, stopping as soon as you see that A has 8 points more than B, for example, would lead to a very high risk of error (again the error probability tends to 1 if N goes to infinity).
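Here is a rough way to check that claim exactly rather than by simulation. It's a sketch with assumptions of mine: two equal engines, and by default no draws (you can pass other win/loss probabilities, the rest is draws). The function `prob_lead_reached` is a made-up name; it tracks the distribution of the point difference game by game and adds up the probability of runs that hit the lead at some point.

```python
def prob_lead_reached(n_games, lead, p_win=0.5, p_loss=0.5):
    """Probability that A is `lead` points ahead of B at some moment
    within n_games. The draw probability is 1 - p_win - p_loss.
    Equal engines (p_win == p_loss) is the case of interest here."""
    p_draw = 1.0 - p_win - p_loss
    dist = {0: 1.0}   # point difference -> probability, for runs not yet stopped
    stopped = 0.0     # probability mass that already reached the lead
    for _ in range(n_games):
        nxt = {}
        for diff, prob in dist.items():
            for step, p in ((1, p_win), (0, p_draw), (-1, p_loss)):
                if p <= 0.0:
                    continue
                new_diff = diff + step
                if new_diff >= lead:
                    stopped += prob * p   # the tester would stop here
                else:
                    nxt[new_diff] = nxt.get(new_diff, 0.0) + prob * p
        dist = nxt
    return stopped

if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, round(prob_lead_reached(n, 8), 3))
```

For N=100 and a lead of 8 the result should come out far above 5%, and it keeps growing with N.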
No rule can be given in terms of absolute number of games either. If you use N=1000, and you see 501-499, no conclusion can be drawn, obviously. Whilst if you see 600-400, then you can easily say that A is stronger than B.
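For the fixed-sample case, the central limit theorem interval can be sketched as follows. Assumptions of mine: the win/draw/loss split is known (in the example I take zero draws, since only the totals 501-499 and 600-400 are given), and z=1.96 is used for a two-sided 95% interval. `score_interval` is a hypothetical helper name.

```python
import math

def score_interval(wins, draws, losses, z=1.96):
    """Approximate confidence interval for the mean score E(X) of A,
    using the normal (CLT) approximation. Only valid for a number of
    games fixed in advance and "sufficiently large"."""
    n = wins + draws + losses
    mean = (wins + 0.5 * draws) / n
    # Empirical variance of the per-game score (values 1, 0.5, 0).
    var = (wins * (1.0 - mean) ** 2
           + draws * (0.5 - mean) ** 2
           + losses * (0.0 - mean) ** 2) / n
    half_width = z * math.sqrt(var / n)
    return mean - half_width, mean + half_width

if __name__ == "__main__":
    print(score_interval(501, 0, 499))  # interval contains 0.5: no conclusion
    print(score_interval(600, 0, 400))  # interval entirely above 0.5
```

If 0.5 is inside the interval you cannot separate the engines at that confidence level; if the whole interval is above 0.5, A is stronger.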
In the context of 2/, the stopping rule will on average trigger earlier the farther E(X) is from 0.5 (and will almost surely not terminate if E(X)=0.5). So even in the context of 2/, you cannot say that 1000 games is enough. Maybe you reached such a score that 500 was enough, and maybe the stopping rule will only be triggered after 20,000 games...
Anyway, all that matters is understanding what you do, and never following rules, or what people say (especially me).