How much stronger is the new Stockfish 2.1?

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

lmader
Posts: 154
Joined: Fri Mar 10, 2006 1:20 am
Location: Sonora, Mexico

Re: How much stronger is the new Stockfish 2.1?

Post by lmader »

Good call. Norman is approaching troll status. Crikey, even trying out the idea that Stockfish should have derivative status?! Really?! Gimme a break. But Bob got fished :) Not that one can really blame him, it's tough dealing with the like of that. All I can say is, don't feed the trolls.

So, any new stats on the strength of Stockfish 2.1??
"The foundation of morality is to have done, once for all, with lying; to give up pretending to believe that for which there is no evidence, and repeating unintelligible propositions about things beyond the possibilities of knowledge." - T. H. Huxley
kranium
Posts: 2129
Joined: Thu May 29, 2008 10:43 am

Re: How much stronger is the new Stockfish 2.1?

Post by kranium »

lmader wrote:Good call. Norman is approaching troll status. Crikey, even trying out the idea that Stockfish should have derivative status?! Really?! Gimme a break. But Bob got fished :) Not that one can really blame him, it's tough dealing with the like of that. All I can say is, don't feed the trolls.

So, any new stats on the strength of Stockfish 2.1??
Yep
Typical Talkchess behaviour...

Have nothing of substance to add?
Don't like someone's opinion...?
Log in and attack him.
Call him names!

Thanks for the substantive, interesting, and informative post Lar!
Kaj Soderberg
Posts: 137
Joined: Sat Jan 01, 2011 7:33 pm

Re: How much stronger is the new Stockfish 2.1?

Post by Kaj Soderberg »

kranium wrote:
lmader wrote:Good call. Norman is approaching troll status. Crikey, even trying out the idea that Stockfish should have derivative status?! Really?! Gimme a break. But Bob got fished :) Not that one can really blame him, it's tough dealing with the like of that. All I can say is, don't feed the trolls.

So, any new stats on the strength of Stockfish 2.1??
Yep
Typical Talkchess behaviour...

Have nothing of substance to add?
Don't like someone's opinion...?
Log in and attack him.
Call him names!

Thanks for the substantive, interesting, and informative post Lar!
Now, without sarcasm, since there must be things that can be discussed besides cloning: when you mention having nothing of substance to add, one wonders what's going on with Fire 1.4. It's been a while since the last release. Usually that means one of two things: lack of inspiration/time, or something interesting cooking.

You could add some weight to your arguments by showing that you are capable of improving existing software significantly. That would give the discussion something to make it more interesting; endlessly repeating points of view does not.

Let me be clear that stealing software is, for me, unacceptable. Producing something definitively better, however, could be interesting. I'd rather see a discussion on the latter, as the former has been repeated a thousand times, and nothing will ever change history or the simple souls who are easily seduced or misguided. So, something worthwhile coming up?

Ambivalent regards,
Kaj
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: How much stronger is the new Stockfish 2.1?

Post by Don »

mcostalba wrote:
bob wrote: Any more personal pot-shots you care to toss out while you are on such a roll???

If the only way to make progress is to copy the code of others, I would simply work on something else. Fortunately, at least for some of us, that is not the case...
Actually I am not on a roll. The last release was very disappointing in terms of ELO gain and, what's worse, it proves our testing scheme needs to be deeply rethought.
I think the problem for us is that with a well-developed program such as yours or ours it becomes more and more necessary to test at longer time controls - but that makes it more difficult to get enough samples to have much confidence in the changes you promote. A lot of our improvements nowadays affect deeper searches very differently from shallow ones.

We used to get great results with super-fast testing, but we find now that it does not always translate. For example, one level we test at to get a quick estimate of performance is game in 1 second + 0.01 increment (we like Fischer clocks). If we are patient we can get reasonably low error margins in about an hour, as we get over 5 games per second on my 6-core machine. This and time controls like it used to give relatively high correlation with longer-time-control testing, but now they don't as much. We have discovered that whether the result is good or bad, it may not translate to more realistic time controls, closer to the fastest ones humans use.

Out of superstition, in most of our serious tests we do not play Komodo vs Komodo; we use 3 different strong programs (one of them is Stockfish) to test against. We do this because "everybody knows" that self-testing is invalid, and we are being conservative. We don't see any real evidence that it is actually a factor, but it seems like it should be, so we do it.

There is no solution except to obtain more CPU power for testing. When you consider that most candidate changes will yield plus or minus about 1 to 3 ELO, it's like trying to measure the width of a human hair with a yardstick. We cannot wait for 200,000 games. Also to be considered is the noise produced by the hardware and compiler: I can make an inconsequential change to the program and it's 1/4 percent faster or slower for no obvious reason. (About 15 years ago it was much worse, but I think compilers are much better now at laying out the code to avoid caching anomalies and such.) The tester also introduces some anomalies into the results.
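
To put rough numbers on the ELO resolution problem, here is a back-of-the-envelope sketch in Python (not our actual tester; the 40% draw rate and the 1.96 factor for a 95% interval are assumptions I'm making for illustration):

import math

def elo_error_95(games, draw_rate=0.4):
    """Approximate 95% error margin in ELO for an evenly matched pair after `games` games."""
    # Per-game variance of the score (win=1, draw=0.5, loss=0) when both sides score 50%.
    win = (1.0 - draw_rate) / 2.0
    var = win * 1.0 + draw_rate * 0.25 - 0.5 ** 2
    se = math.sqrt(var / games)                 # standard error of the mean score
    # Slope of the ELO curve at 50%: d(ELO)/d(score) = 400 / (ln 10 * 0.5 * 0.5), about 695
    slope = 400.0 / (math.log(10) * 0.25)
    return 1.96 * se * slope

for n in (1000, 18000, 200000):                 # 18,000 is roughly an hour at 5 games/second
    print(f"{n:>7} games -> +/- {elo_error_95(n):.1f} ELO")

With those assumptions you get roughly +/- 17 ELO after 1,000 games, +/- 4 after 18,000 (the hour of fast games mentioned above), and only around +/- 1.2 after 200,000, which is why 1-3 ELO changes are so hard to resolve.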

Unfortunately, for us testing is a major bottleneck as we can produce versions with new ideas much faster than we can test them. We spend more time discussing what to test next than actually coding it up because we have to wait anyway.

One other comment. I see things in other people's programs (and my own) that I don't believe can be reasonably tested - things whose impact on the program at high levels cannot really be known unless you test those single changes for weeks at a time (or have a few hundred cores to work with). It's hard to know, for example, whether certain types of aggressive extensions hurt more than they help at long time controls. Anything that has a strong impact on the branching factor is scary, even if it tests well at game in 15 seconds.


Yes, we did a lot of clean-up / refactoring work, and I am a bit proud of this because I think the SF sources are now better than ever from a code-style point of view (I particularly like the unification of the search() functions; it required a lot of work and regression testing to do it right), but the engine is hardly improved since the previous version.

I think you take yourself a bit too seriously. Regarding the reading of the Ippo sources, I remember you wrote that you gave them a quick look but found them such a mess that you decided not to spend time on them.
I think we ALL take ourselves too seriously :-) I know for sure that I do.

Finally, regarding copying ideas (after verification testing, of course): my point of view is that I really don't care to be the hero, the lone author who does everything by himself. If there are good ideas around I am happy to give them a look, and in Ippo there were many good ideas, buried under a ton of crappy code. I think one of the merits of SF has been to dig out some of those ideas and rewrite them in a clean and documented form so that all engine developers can easily access them: like when archaeologists dig out a beautiful ancient Egyptian treasure, remove the dust and put it under the lights in a museum for everybody to see.
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: How much stronger is the new Stockfish 2.1?

Post by hgm »

I still think that the answer is 'tree matches', or related techniques. Just have a test-suite of several thousand representative games played by some previous version, and when you want to evaluate such a small 1-3 Elo change, run all positions of these games through the engine again, to see where their moves would differ. Then only focus on the positions where this happened, and start a tree match from those to see which was the objectively better move.

Now that I refactored the XBoard code such that it can load engines at run time, and I can easily add a third engine, I am finally in a position where I can implement the tree matches easily. So I will probably try that in the near future.
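
For illustration only, the first step (finding the positions where the changed engine deviates from the reference games) could be sketched like this in Python with the python-chess library; the engine path, time limit and PGN file name are placeholders, and the real implementation would of course sit inside XBoard:

import chess
import chess.engine
import chess.pgn

ENGINE_PATH = "./engine-new"      # placeholder: the changed version under test
THINK_TIME = 0.1                  # placeholder: seconds per position

def divergent_positions(pgn_path):
    """Yield (fen, played_move, new_move) wherever the engine disagrees with the reference game."""
    engine = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
    with open(pgn_path) as pgn:
        while (game := chess.pgn.read_game(pgn)) is not None:
            board = game.board()
            for played in game.mainline_moves():
                result = engine.play(board, chess.engine.Limit(time=THINK_TIME))
                if result.move != played:
                    yield board.fen(), played, result.move
                board.push(played)
    engine.quit()

# The positions collected here are the starting points for the tree matches,
# where both candidate moves are played out to see which one scores better.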
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: How much stronger is the new Stockfish 2.1?

Post by Don »

hgm wrote:I still think that the answer is 'tree matches', or related techniques. Just have a test-suite of several thousand representative games played by some previous version, and when you want to evaluate such a small 1-3 Elo change, run all positions of these games through the engine again, to see where their moves would differ. Then only focus on the positions where this happened, and start a tree match from those to see which was the objectively better move.

Now that I refactored the XBoard code such that it can load engines at run time, and I can easily add a third engine, I am finally in a position where I can implement the tree matches easily. So I will probably try that in the near future.
This is something we tried years ago, and it had a huge amount of appeal to me then. However, the problem is how to interpret the results. It is very difficult to determine which of 2 different moves is better - on manual inspection of many positions we found it was hard to say with any objectivity which move was actually better.

Also, ELO strength is impossible to measure with any serious resolution when looking at a few tens of thousands of positions that differ. ELO cannot be measured move by move; it has more to do with following through on ideas and such - a slightly weaker plan better executed, for instance.

This could be automated and checked out, however. It's certainly an appealing idea. If you have 2 similar programs that vary, let a deeper-searching version of both programs see if they can agree on which move is better. If it's one of the moves one of them played, as a first-order estimate it is likely to be the better move. Then you tally which program played the better move.
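
As a rough sketch of that tally (using the python-chess library purely for illustration; the oracle engine path and search depth are made-up placeholders, not anything we actually run):

import chess
import chess.engine

def better_move_tally(cases, oracle_path="./oracle-engine", depth=24):
    """cases: iterable of (fen, move_a, move_b), the moves chosen by versions A and B."""
    oracle = chess.engine.SimpleEngine.popen_uci(oracle_path)
    tally = {"A": 0, "B": 0, "tie": 0}
    for fen, move_a, move_b in cases:
        board = chess.Board(fen)
        scores = []
        for mv in (move_a, move_b):
            # Restrict the deeper search to the candidate move and take its score.
            info = oracle.analyse(board, chess.engine.Limit(depth=depth), root_moves=[mv])
            scores.append(info["score"].relative.score(mate_score=100000))
        if scores[0] > scores[1]:
            tally["A"] += 1
        elif scores[1] > scores[0]:
            tally["B"] += 1
        else:
            tally["tie"] += 1
    oracle.quit()
    return tally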

You could use a single 3rd program as an oracle, perhaps a stronger program. However, the similarity tester I created indicates that move choice is more a matter of style than strength when the programs are within a few hundred ELO of each other. Most programs will agree on what the best move is 95 percent of the time if there really is a best move.

I think chess strength is more about how you play move sequences and cannot be effectively isolated to a tally of how many moves you "match." It's like tennis: John McEnroe would not have been an exceptional player if he had decided to always play from the baseline, but that worked great for Ivan Lendl. You cannot say one style is objectively superior to the other, because it has to be compatible with the entire approach to the game each player uses and is good at.
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: How much stronger is the new Stockfish 2.1?

Post by hgm »

This is why I want to test the moves that differ to see which is better for the engine. So for every move that differs, I let both versions continue the game with both moves (recursively, in case they differ again later).

This should be really easy to implement in XBoard, with everything I already have. Just add a third engine, which always plays the same color as the second engine. When that color gets to move, set both second and third engine thinking. Probably one after the other, to prevent a varying CPU load. So if the second engine produces a move in this mode, don't play it, but save it on a stack, and send the previous (opponent) move to the third engine. If the third engine produces a move, force it into the second engine instead of its original move if they were different, and pop a move from the stack if they were the same. Also send the move to the first engine.
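
Outside XBoard, the same branching logic can be sketched as a stand-alone driver; this is Python with the python-chess library just to show the shape of it (engine handles and the time limit are placeholders, and hash-table carry-over between searches is ignored):

import chess
import chess.engine

LIMIT = chess.engine.Limit(time=1.0)   # placeholder time control

def play_tree(board, opponent, a1, a2, our_color):
    """Play A'/A'' (as our_color) against opponent from `board`, branching where they disagree."""
    if board.is_game_over():
        return {"result": board.result(), "children": []}
    if board.turn != our_color:
        move = opponent.play(board, LIMIT).move
        board.push(move)
        node = play_tree(board, opponent, a1, a2, our_color)
        board.pop()
        return node
    m1 = a1.play(board, LIMIT).move    # second engine's choice
    m2 = a2.play(board, LIMIT).move    # third engine's choice
    children = []
    for mv in {m1, m2}:                # one branch if they agree, two if they differ
        board.push(mv)
        children.append((mv, play_tree(board, opponent, a1, a2, our_color)))
        board.pop()
    return {"moves": (m1, m2), "children": children}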
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: How much stronger is the new Stockfish 2.1?

Post by Don »

hgm wrote:This is why I want to test the moves that differ to see which is better for the engine. So for every move that differs, I let both versions continue the game with both moves (recursively, in case they differ again later).

This should be really easy to implement in XBoard, with everything I already have. Just add a third engine, which always plays the same color as the second engine. When that color gets to move, set both second and third engine thinking. Probably one after the other, to prevent a varying CPU load. So if the second engine produces a move in this mode, don't play it, but save it on a stack, and send the previous (opponent) move to the third engine. If the third engine produces a move, force it into the second engine instead of its original move if they were different, and pop a move from the stack if they were the same. Also send the move to the first engine.
So basically you have 2 engines (presumably closely related) and you explore every game they would play if they vary, right?

That is probably a good way to do this.
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: How much stronger is the new Stockfish 2.1?

Post by hgm »

You could play two versions of an engine against each other, but, considering the bad reputation of self-play, it seems better to play two versions of one engine (A' and A") against a completely unrelated engine B. What I basically want is to construct the tree that arises when engine B plays all moves from positions where one color has the move, and A' and A" play the moves from all positions where the other color has the move (potentially causing the tree to branch if their moves differ).

The score would be calculated by averaging the scores of the sub-trees in every branch node, and the score in favor of A' and A" by taking the difference of the scores of the sub-trees they prefer.
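
In sketch form (continuing the assumed tree layout from the driver sketch above), the scoring could look like this, with leaves scored 1, 1/2 or 0 from the A side's point of view:

import chess

def score_tree(node, our_color):
    """Return (score, a1_minus_a2) for a node of the tree built by play_tree()."""
    if "result" in node:                           # leaf: a finished game
        white = {"1-0": 1.0, "0-1": 0.0}.get(node["result"], 0.5)
        return (white if our_color == chess.WHITE else 1.0 - white), 0.0
    child_score = {}
    diff = 0.0
    for mv, child in node["children"]:
        s, d = score_tree(child, our_color)
        child_score[mv] = s
        diff += d                                  # carry differences up from deeper branch points
    m1, m2 = node["moves"]
    if m1 != m2:                                   # a real branch: credit A' minus A''
        diff += child_score[m1] - child_score[m2]
    return sum(child_score.values()) / len(child_score), diff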
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: How much stronger is the new Stockfish 2.1?

Post by Don »

hgm wrote:You could play two versions of an engine against each other, but, considering the bad reputation of self-play,
Self-play does not have a bad reputation with me. In our serious testing we avoid it purely out of superstition, but I have yet to see convincing evidence that self-testing is bad. I know it's POSSIBLE, but I also know that computer vs computer is just another kind of self-testing. I believe this is a myth fueled more by imagination than reality.

At any rate, it sounds interesting however you do it and I would believe that either method is perfectly valid.


it seems better to play two versions of one engine (A' and A") against a completely unrelated engine B. What I basically want is to construct the tree that arises when engine B plays all moves from positions where one color has the move, and A' and A" play the moves from all positions where the other color has the move (potentially causing the tree to branch if their moves differ).

The score would be calculated by averaging the scores of the sub-trees in every branch node, and the score in favor of A' and A" by taking the difference of the scores of the sub-trees they prefer.