Could you explain why LOS-based stopping is wrong? If it is wrong theoretically because it has an unbounded Type I error, then I will say that in practice, over a certain range of game counts, say 50-50,000, the Type I error is easily controlled. I think you are mixing things up: an LOS-based stopping rule is good at detecting true positives for unknown Elo changes with a desired Type I error (false positive rate). It is not an optimal stopping rule, and maybe that is what you are saying; it has only one objective, detecting the true positive, while SPRT optimizes for two objectives.
Michel wrote: I think that is overstating it.
I am amazed how very smart people don't understand that principle.
It is not hard to explain why LOS-based stopping is wrong. The LOS computation is based on a uniform (or almost uniform) prior. If it were really true that, for example, a 10000 elo patch is just as likely as a 0 elo patch, then LOS-based stopping would be correct.
In practice you do not know the true elo distribution of patches, so your testing methodology must be robust enough to cope with that.
sprt and margin of error
Moderator: Ras
- 
				Laskos  
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: sprt and margin of error
- 
				Michel
- Posts: 2292
- Joined: Mon Sep 29, 2008 1:50 am
Re: sprt and margin of error
I meant to say that stopping at a "Likelihood of Superiority" of 95% does not at all imply that a patch is an improvement with probability 95%. 
The 95% is computed with respect to a uniform prior which does not correspond to the "real" elo distribution of patches (the latter is unknown).
If you read back then you will see that I gave Larry a reference for the theoretical treatment of stopping rules based on p-value.
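To make the point about the prior concrete, here is a minimal Python sketch. It uses the usual normal-approximation formula for LOS and compares it with the probability of improvement under a normal prior on the per-game (win minus loss) rate. The function names, the sample score and the "skeptical" prior values are made up for illustration; they are not taken from any real patch distribution.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def los(wins, losses):
    """LOS under a flat prior; draws carry no information in this formula."""
    return normal_cdf((wins - losses) / math.sqrt(wins + losses))

def prob_improvement(wins, draws, losses, prior_mean, prior_sd):
    """P(effect > 0 | data) with a normal prior on the per-game (win - loss) rate.

    Uses a normal approximation for the observed rate (an assumption of this sketch).
    """
    games = wins + draws + losses
    observed = (wins - losses) / games
    obs_var = (wins + losses) / games ** 2
    precision = 1.0 / obs_var + 1.0 / prior_sd ** 2
    post_mean = (observed / obs_var + prior_mean / prior_sd ** 2) / precision
    return normal_cdf(post_mean * math.sqrt(precision))

w, d, l = 120, 285, 95  # hypothetical match result
print("LOS:", los(w, l))
# An almost flat prior reproduces LOS.
print("flat prior:", prob_improvement(w, d, l, prior_mean=0.0, prior_sd=1e6))
# Hypothetical prior: most patches are slightly harmful and effects are small.
print("skeptical prior:", prob_improvement(w, d, l, prior_mean=-0.005, prior_sd=0.01))
```

With the almost flat prior the two numbers agree, while the narrow prior centred slightly below zero pulls the probability of improvement well below the LOS value for the same score.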
- 
				Laskos  
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: sprt and margin of error
Michel wrote: I meant to say that stopping at a "Likelihood of Superiority" of 95% does not at all imply that a patch is an improvement with probability 95%.
The 95% is computed with respect to a uniform prior which does not correspond to the "real" elo distribution of patches (the latter is unknown).
If you read back then you will see that I gave Larry a reference for the theoretical treatment of stopping rules based on p-value.
Ah, ok. It's amazing how stopping at an LOS of 95% accumulates Type I errors. Run up to 3,000 games, this stopping rule gives a false positive 68% of the time. But I am managing to dampen this error with a high LOS threshold of 99.9% or 99.95%. These values control the Type I error over the target span of game counts, say 50-50,000.
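The accumulation of Type I errors under LOS-based stopping can be checked with a small simulation. The sketch below plays two equal engines against each other and stops as soon as LOS reaches the threshold. The draw ratio, the rule of checking after every game from game 50 onward, and the trial count are assumptions of this sketch, so the rates it prints need not match the 68% figure quoted above.

```python
import math
import random

def los(wins, losses):
    """LOS from the normal approximation; 0.5 when there are no decisive games."""
    decisive = wins + losses
    if decisive == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))

def false_positive_rate(threshold, max_games=3000, min_games=50,
                        draw_ratio=0.6, trials=200, seed=1):
    """Fraction of equal-strength matches that stop with LOS >= threshold."""
    rng = random.Random(seed)
    p_win = (1.0 - draw_ratio) / 2.0  # equal engines: P(win) == P(loss)
    hits = 0
    for _ in range(trials):
        wins = losses = 0
        for game in range(1, max_games + 1):
            r = rng.random()
            if r < p_win:
                wins += 1
            elif r < 2.0 * p_win:
                losses += 1
            # Check the stopping rule after every game once min_games is reached.
            if game >= min_games and los(wins, losses) >= threshold:
                hits += 1
                break
    return hits / trials

for threshold in (0.95, 0.999):
    print(threshold, false_positive_rate(threshold))
```

Raising the threshold lowers the false positive rate for a fixed maximum number of games, which is the damping effect described above; the rate still grows if the maximum number of games is increased.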
- 
				Ajedrecista  
- Posts: 2135
- Joined: Wed Jul 13, 2011 9:04 pm
- Location: Madrid, Spain.
Re: SPRT and margin of error.
Hello Larry:
lkaufman wrote: In order to better understand the behavior of SPRT, we ran the following test: Komodo at 8 ply vs Komodo at 7 ply. SPRT (using the Stockfish parameters of -1.5 and +4.5) stopped the test when the score was 149 wins, 30 losses, and 94 draws.
Ajedrecista wrote: I am very busy so I can not provide numbers right now. I will try to run some simulations the next weekend but I do not promise anything.
I ran some SPRT simulations:
Code: Select all
alpha = beta = 5%
drawelo = 165; bayeselo = 200 (it gives around 160.9 Elo difference)
SPRT(-1.5, 4.5) --> <Games>/simulation ~ 274 (100% of passes after 80000 simulations).
Shortest simulation: 171 games (+111 -5 =55).
Longest simulation: 432 games (+193 -72 =167).
SPRT(-1, 3) --> <Games>/simulation ~ 408 (100% of passes after 80000 simulations).
Shortest simulation: 276 games (+177 -12 =87).
Longest simulation: 574 games (+266 -86 =222).
SPRT(0, 6) --> <Games>/simulation ~ 276 (100% of passes after 80000 simulations).
Shortest simulation: 176 games (+118 -7 =51).
Longest simulation: 414 games (+189 -66 =159).
SPRT(0, 4) --> <Games>/simulation ~ 410 (100% of passes after 80000 simulations).
Shortest simulation: 269 games (+173 -10 =86).
Longest simulation: 643 games (+288 -106 =249).
My old suggestion of SPRT(-15, 45) would bring an average number of games of 33 or 34 (too few games). Anyway, someone here said that SPRT is more intended for small Elo changes and I must agree.
If I simulate SPRT for smaller Elo changes:
Code: Select all
alpha = beta = 5%
drawelo = 240; bayeselo = 7.79 (it gives around 5 Elo difference)
SPRT(-1.5, 4.5) --> <Games>/simulation ~ 9209 (99.92% of passes after 10000 simulations).
Shortest simulation: 1329 games (+331 -220 =778).
Longest simulation: 50337 games (+10300 -10052 =29985).
SPRT(0, 6) --> <Games>/simulation ~ 12011 (99.14% of passes after 10000 simulations).
Shortest simulation: 1554 games (+383 -266 =905).
Longest simulation: 86150 games (+17624 -17039 =51487).
-----------------------------------------------------------------------------------------
alpha = beta = 5%
drawelo = 270; bayeselo = 8.68 (it gives around 5 Elo difference)
SPRT(-1.5, 4.5) --> <Games>/simulation ~ 8740 (99.95% of passes after 10000 simulations).
Shortest simulation: 1161 games (+279 -170 =712).
Longest simulation: 46453 games (+8288 -8068 =30097).
SPRT(0, 6) --> <Games>/simulation ~ 11023 (99.65% of passes after 10000 simulations).
Shortest simulation: 1669 games (+358 -245 =1066).
Longest simulation: 56650 games (+9983 -9600 =37067).
-----------------------------------------------------------------------------------------
alpha = beta = 5%
drawelo = 240; bayeselo = 15.58 (it gives around 10 Elo difference)
SPRT(-1.5, 4.5) --> <Games>/simulation ~ 4130 (100% of passes after 10000 simulations).
Shortest simulation: 916 games (+239 -130 =547).
Longest simulation: 16773 games (+3398 -3245 =10130).
SPRT(0, 6) --> <Games>/simulation ~ 4621 (100% of passes after 10000 simulations).
Shortest simulation: 1197 games (+295 -182 =720).
Longest simulation: 18676 games (+3803 -3594 =11279).
-----------------------------------------------------------------------------------------
alpha = beta = 5%
drawelo = 270; bayeselo = 17.36 (it gives around 10 Elo difference)
SPRT(-1.5, 4.5) --> <Games>/simulation ~ 3955 (100% of passes after 10000 simulations).
Shortest simulation: 1197 games (+269 -162 =766).
Longest simulation: 13266 games (+2367 -2231 =8668).
SPRT(0, 6) --> <Games>/simulation ~ 4371 (100% of passes after 10000 simulations).
Shortest simulation: 999 games (+238 -129 =632).
Longest simulation: 14152 games (+2532 -2359 =9261).
I hope that my results are in line with others' results and wish that they are somewhat useful to you.
------------------------
Regarding error bars in SPRT, I recall that Michel made some graphs of 95% confidence error bars with LLR/games on the x axis, but I have not been able to find them right now.
Regards from Spain.
Ajedrecista.
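For readers who want to reproduce figures of this kind, here is a minimal Python sketch of an SPRT simulation under the BayesElo win/draw/loss model. It assumes that elo0, elo1 and the true strength are all in BayesElo units and that drawelo is known and fixed rather than estimated from the sample; how the simulations above were actually implemented is not stated in the thread. The helper `bayeselo_to_elo` is a linearised conversion that happens to reproduce the "160.9" and "around 5" values quoted in the tables, which is also an assumption about how those values were derived.

```python
import math
import random

def bayeselo_probs(elo, drawelo):
    """(win, draw, loss) probabilities under the BayesElo model."""
    p_win = 1.0 / (1.0 + 10.0 ** ((drawelo - elo) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((drawelo + elo) / 400.0))
    return p_win, 1.0 - p_win - p_loss, p_loss

def bayeselo_to_elo(bayeselo, drawelo):
    """Linearised BayesElo -> Elo conversion (assumed, see lead-in)."""
    x = 10.0 ** (-drawelo / 400.0)
    return bayeselo * 4.0 * x / (1.0 + x) ** 2

def run_sprt(true_elo, drawelo, elo0, elo1, alpha=0.05, beta=0.05, rng=random):
    """Play games until the log-likelihood ratio leaves (lower, upper).

    Returns (passed, games), where passed means H1 (elo1) was accepted.
    """
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    p_win, p_draw, _ = bayeselo_probs(true_elo, drawelo)
    p0 = bayeselo_probs(elo0, drawelo)
    p1 = bayeselo_probs(elo1, drawelo)
    # LLR increment for a win, a draw and a loss respectively.
    step = [math.log(b / a) for a, b in zip(p0, p1)]
    llr, games = 0.0, 0
    while lower < llr < upper:
        r = rng.random()
        outcome = 0 if r < p_win else (1 if r < p_win + p_draw else 2)
        llr += step[outcome]
        games += 1
    return llr >= upper, games

rng = random.Random(0)
results = [run_sprt(200.0, 165.0, -1.5, 4.5, rng=rng) for _ in range(400)]
print("pass rate:", sum(passed for passed, _ in results) / len(results))
print("average games:", sum(games for _, games in results) / len(results))
print("Elo for bayeselo=200, drawelo=165:", bayeselo_to_elo(200.0, 165.0))
```

Under these assumptions the average length for the bayeselo = 200, drawelo = 165 case should land in the same region as the first table. The small-Elo cases can be run the same way, but they need many more games per simulation and are correspondingly slower in pure Python.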
- 
				Michel
- Posts: 2292
- Joined: Mon Sep 29, 2008 1:50 am
Re: SPRT and margin of error.
Ajedrecista wrote: My old suggestion of SPRT(-15, 45) would bring an average number of games of 33 or 34 (too few games). Anyway, someone here said that SPRT is more intended for small Elo changes and I must agree.
On the other hand, the concept of LOS starts from a uniform prior. This may be the reason that, when elo is very different from elo0 and elo1, LOS-based stopping (with suitable thresholds) performs better than the SPRT (as has been reported by Kai; I have not verified this).
[[ The theory behind the 2-SPRT starts with a prior which can take three values: elo0, elo1 and the elo point at which you want to optimize the average sample number. ]]
- 
				Michel
- Posts: 2292
- Joined: Mon Sep 29, 2008 1:50 am
Re: SPRT and margin of error.
Ajedrecista wrote: Regarding error bars in SPRT, it is familiar to me that Michel did some graphs for 95% confidence error bars and the x axis had the variable LLR/games, but I have not found this graph right now.
They are here. Note that they are 90% confidence intervals.
http://hardy.uhasselt.be/Toga/cb_fishtest_10_STC.png
http://hardy.uhasselt.be/Toga/cb_fishtest_10_LTC.png
- 
				Michel
- Posts: 2292
- Joined: Mon Sep 29, 2008 1:50 am
Re: SPRT and margin of error.
Note that if you are _really_ concerned about performance for large elo differences you _should_ use the Schwarz sequential test.
This is a 2-SPRT where you optimize at the elo value estimated from the sample.
The added noise that goes into the estimated elo value makes it less efficient than either the SPRT or pure 2-SPRT at small elo values (although it is still much better than fixed length). However it should be better for large elo values.
Here is the paper that introduces the Schwarz test:
http://projecteuclid.org/euclid.aoms/1177704726
- 
				Michel
- Posts: 2292
- Joined: Mon Sep 29, 2008 1:50 am
Re: SPRT and margin of error.
Contrary to what is sometimes claimed on talkchess, statistical tests are never plucked out of thin air. They are optimal answers to certain well-posed problems.
So what optimization problem does the Schwarz test solve?
The answer is this: among the tests with prescribed error probabilities at elo0,elo1 it is the one with asymptotically minimal expected average sample number assuming a uniform prior.
This statement is still true if we replace "uniform prior" with any prior that is non-zero everywhere.
- 
				Michel
- Posts: 2292
- Joined: Mon Sep 29, 2008 1:50 am
Re: SPRT and margin of error.
Michel wrote: The answer is this: among the tests with prescribed error probabilities at elo0,elo1 it is the one with asymptotically minimal expected average sample number assuming a uniform prior.
Sigh... "expected average" is of course redundant. One of those words is enough.
- 
				Michel
- Posts: 2292
- Joined: Mon Sep 29, 2008 1:50 am
Re: SPRT and margin of error.
I wrote a small simulator for the Schwarz test in Python. The simulator is too slow for release, but some testing shows that the Schwarz test is actually quite good. For small elo differences its performance seems to be similar to that of the SPRT, and for large elo differences it finishes much more quickly.
For example, for the 200 (Bayes)elo difference that started this topic, it finishes in 47 games on average.