SIMEX 2.1

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: SIMEX 2.1

Post by chrisw »

Rebel wrote: Sun Sep 01, 2019 5:44 pm Made an update, changes:

1. A new MEA that fixes a speed issue when running under Win10. MEA was extremely slow; it is OK now and even runs faster on Win7 systems.

2. Cosmetic changes to the HTML output that make it more readable. Versions of the same engine (detected by comparing the first 2 characters of the engine names) are no longer marked orange (similarity 60-64%) or red (65% or higher).

Example: http://rebel13.nl/html/ccrl.html

41 high-rated CCRL engines and not a single 60% found!

One sane list.

My congratulations to the current programmers of these engines, especially the new kids on the block, given how many strong open-source engines are available on GitHub.

Download SIMEX 2.1 - http://rebel13.nl/dl/simex2.7z

Documentation - http://rebel13.nl/misc/simex.html
Looking at your ccrl.html, the massive cross-table of engine results from the SimTester, I’m wondering what the similarity numbers mean. In general this is a list of low-similarity programs, but one or two things jump out.

Booot6 shows very consistent results right across the board: it gets 25% to 30% similarity against just about all other programs.

Hiarcs14 is likewise very consistent: 30% to 35% similarity, right across the board, against everything.

They both match other programs on maybe 1 in 3 or 1 in 4 positions, choosing the same move. Can we assume this is how genuinely independent engines behave? It’s baseline behaviour: some moves are bound to coincide, since these are fairly normal chess positions and you expect the choice of plausible candidate moves to be not so wide.

Then there are other engines that usually score sim values of around 30% or so, but in the crosstable with Stockfish move up to 50% similarity; now they choose the same move 1 time in 2. What does this jump to 50% mean? It’s not an uncommon trend; find your own examples.

How to disentangle all the information in that great mass of numbers ....
User avatar
Rebel
Posts: 6991
Joined: Thu Aug 18, 2011 12:04 pm

Re: SIMEX 2.1

Post by Rebel »

chrisw wrote: Tue Sep 03, 2019 1:26 am Looking at your ccrl.html, the massive cross-table of engine results from the SimTester, I’m wondering what the similarity numbers mean. In general this is a list of low-similarity programs, but one or two things jump out.

Booot6 shows very consistent results right across the board: it gets 25% to 30% similarity against just about all other programs.

Hiarcs14 is likewise very consistent: 30% to 35% similarity, right across the board, against everything.

They both match other programs on maybe 1 in 3 or 1 in 4 positions, choosing the same move. Can we assume this is how genuinely independent engines behave? It’s baseline behaviour: some moves are bound to coincide, since these are fairly normal chess positions and you expect the choice of plausible candidate moves to be not so wide.

Then there are other engines that usually score sim values of around 30% or so, but in the crosstable with Stockfish move up to 50% similarity; now they choose the same move 1 time in 2. What does this jump to 50% mean? It’s not an uncommon trend; find your own examples.

How to disentangle all the information in that great mass of numbers ....
I hate long forum postings but I am afraid this is going to be long in order to address your points and questions properly.

Computer chess has a long history of people cloning open-source code, making changes, and calling the result their own intellectual property, often in breach of copyright, the GPL, or an EULA.

It started back in the 90s, when Bob Hyatt made his Crafty open source; several programmers were exposed breaching Crafty's EULA. And it was not always easy to detect clones: to prove a derivative work one had to study the disassembled code.

Things got worse in 2005 when Fabien's Fruit 2.1 came out under the GPL; unlike the popular Crafty, Fruit was a top engine. Many couldn't resist the temptation, and Fruit 2.1 became the source of many clones and derivatives breaching the GPL; some even commercialized their derivative work, and most of the time it was hard to unmask them. On the other hand there were also honest programmers who obeyed the Fruit GPL and made wonderful successors, giving credit to Fabien and releasing their source code.

A few years later a guy named Yuri Osipov did something remarkable and disturbing at the same time: he disassembled the then-leading program Rybka 1.0, turned it into readable source code, called it STRELKA, and wanted to sell it to the Chess Assistant folks for distribution.

The same thing happened to Rybka 3 in 2009: disassembled, readable source code that became known as Ippolit, later Robbolito, and the chess community was flooded with Rybka 3 derivatives, commercial ones included. It was a total mess.

But help was on the way. Around the same time (2010 if I remember correctly) Don Dailey released his SIM03 program, which measured the similarity between two engines by letting each engine analyse the same set of about 8238 positions and counting the equal moves, resulting in a similarity percentage. Brilliance is always simple.
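
For illustration, a minimal sketch in Python of that counting idea (my own sketch, not Don Dailey's actual code; moves_a and moves_b are hypothetical names for the per-position move choices):

Code: Select all

# Minimal sketch of the SIM03 counting idea (not the actual SIM03 code).
# moves_a, moves_b: the move each engine chose, one entry per test position.
def similarity(moves_a, moves_b):
    assert len(moves_a) == len(moves_b)
    matches = sum(1 for a, b in zip(moves_a, moves_b) if a == b)
    return 100.0 * matches / len(moves_a)

print(similarity(["e2e4", "g1f3"], ["e2e4", "d2d4"]))  # -> 50.0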

By that point in time many engines had already been exposed as derivative works the hard way (disassembly or programmer confessions), and SIM03 simply confirmed it! No more disassembling needed.

And a number of legitimate questions came up, such as:

1. Above which similarity percentage marker is an engine suspected of being a derivative work?
2. How reliable is the tool? Can it, for instance, produce a false positive?
3. When an engine scores below the similarity percentage marker, does that mean the engine is an original one?

On (1): there was a big discussion at the time; some wanted the marker at 55%, others at 65%, but the majority was in favor of 60%, a rough general consensus.

On (2): after 10 years no false positive has shown up. There was a discussion claiming that a Fritz 11 version (the standard version or the deep version) had produced a false positive, because its high similarity percentage was at odds with the CCRL ponder-hit statistics, and SIM03 came under pressure. As it later turned out, that particular Fritz version really was producing an exceptionally high similarity percentage, verified by extracting the 8238 positions from SIM03 to PGN and then running them under Fritz. The reputation of SIM03 was safe again.

On (3): similarity testing can only incriminate suspect engines, never exonerate them. An engine can have a low similarity and still be a derivative work; similarity can be engineered down without much Elo loss.
90% of coding is debugging, the other 10% is writing bugs.
dkappe
Posts: 1631
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: SIMEX 2.1

Post by dkappe »

Someone over at the lczero discord ran this on a variety of nets, including some supervised-learning ones. Overall they seemed to have much higher similarity than alpha-beta engines.
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: SIMEX 2.1

Post by Dann Corbit »

Here is a question:
Is it wrong to use another engine's evaluation terms, if you wrote your own code to implement them?
If it is, where is the cutoff?

For instance, lots of engines will have a wood value of about 3 for knights and bishops, pulled right out of the chess books.
Lots of engines will have a bishop pair bonus.
Lots of engines will have lots of interesting evaluation terms that are similar from engine to engine.
Different things that go into king safety are similar from engine to engine.
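
For concreteness, a generic sketch of the kind of shared terms meant here (textbook values, not taken from any particular engine):

Code: Select all

# Generic material terms of the sort many engines share (textbook values).
PIECE_VALUE = {"P": 1.0, "N": 3.0, "B": 3.0, "R": 5.0, "Q": 9.0}
BISHOP_PAIR_BONUS = 0.5  # a common, roughly half-pawn bonus

def material(counts):
    # counts: piece counts for one side, e.g. {"P": 8, "N": 2, "B": 2, ...}
    score = sum(PIECE_VALUE[p] * n for p, n in counts.items())
    if counts.get("B", 0) >= 2:
        score += BISHOP_PAIR_BONUS  # the bishop-pair bonus mentioned above
    return score

print(material({"P": 8, "N": 2, "B": 2, "R": 2, "Q": 1}))  # -> 39.5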

So, is a similar evaluation wrong, and if so, what makes it wrong?

To me, using someone else's code without permission is what would make it wrong.
But the ideas are free. At least, they ought to be.

I am interested to hear other opinions about what makes similar evaluation (and similar search for that matter) right or wrong.
I see people trying to draw a line, but I do not clearly understand exactly what the line is supposed to mean.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
dkappe
Posts: 1631
Joined: Tue Aug 21, 2018 7:52 pm
Full name: Dietrich Kappe

Re: SIMEX 2.1

Post by dkappe »

Looking at Good Gyal 7, a supervised-learning net trained on lichess and Stockfish data, and testing it against a collection of diverse Leela nets, it scores about 55%-60% similarity against most other nets. That's fairly low within the group; values over 60% are not uncommon. Compare this with SF, whose similarity scores against these nets hover in the 40s.

Again, is the nature of the search (MCTS) a reason for this similarity?
Fat Titz by Stockfish, the engine with the bodaciously big net. Remember: size matters. If you want to learn more about this engine just google for "Fat Titz".
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: SIMEX 2.1

Post by chrisw »

Dann Corbit wrote: Tue Sep 03, 2019 9:10 pm Here is a question:
Is it wrong to use another engine's evaluation terms, if you wrote your own code to implement them?
If it is, where is the cutoff?

For instance, lots of engines will have a wood value of about 3 for knights and bishops, pulled right out of the chess books.
Lots of engines will have a bishop pair bonus.
Lots of engines will have lots of interesting evaluation terms that are similar from engine to engine.
Different things that go into king safety are similar from engine to engine.

So, is a similar evaluation wrong, and if so, what makes it wrong?

To me, using someone else's code without permission is what would make it wrong.
But the ideas are free. At least, they ought to be.

I am interested to hear other opinions about what makes similar evaluation (and similar search for that matter) right or wrong.
I see people trying to draw a line, but I do not clearly understand exactly what the line is supposed to mean.
Well, there’s a lot of data being collected. Let’s see where the data takes us.
User avatar
mclane
Posts: 18748
Joined: Thu Mar 09, 2006 6:40 pm
Location: US of Europe, germany
Full name: Thorsten Czub

Re: SIMEX 2.1

Post by mclane »

Hiarcs is a unique chess program; I guess MChess is similar in that respect. But we cannot test it because it is not UCI.
I like programmers who go their own ways with their own ideas.
I remember when I first tested an Ed Schroeder chess engine: Rebel and MM4, and later MM5.
And many others followed.
I personally find Monte Carlo the best. But Ed Schroeder brought very good progress to computer chess.
I remember when I first saw a program by Chris Whittington, I instantly knew this guy was cool. It had a very shallow search but always a very good move; it was Chess Player soandso or whatever the name was.
But I knew this program had big potential.

Same with Hiarcs.
Mark Uniacke was impressed by Marty Hirsch's work on MChess,
so he began to build his own way of doing what MChess did.
And IMO he did it very well; in 1993 he won a championship title.
Don Dailey was IMO involved in so many computer chess projects, many we never registered in those old days; today we know better.
I met Don Dailey at the championship in Cologne in 1986 and at the Aegon tournament, and one day we began testing together.
His strength was that he was open to anything; he was always friendly, never jealous or angry.
One day he decided to begin again from scratch. He called this new 64-bit engine DOCH; it is the engine we today call Komodo.

He was a pioneer.
Like so many others I met over that long period of time.

Computer chess has given me so much.
I learned that the programmers are very nice guys, and I was very happy that we could share the same things. Games. Games, games. Zillions of chess games. And the programs often played like their programmers.
And all these programs helped make computer chess the best hobby I know.

This is why, IMO, cloning makes no sense.

The best thing is that the engines are part of the programmer's ideology.
If he is safety-minded, the engine will play safe.
If he is a gambler, the engine will play like one.

The baby, the engine, is a child of the programmer.
It’s part of him. And I can see the programmer in the games the engine plays.
What seems like a fairy tale today may be reality tomorrow.
Here we have a fairy tale of the day after tomorrow....
User avatar
Rebel
Posts: 6991
Joined: Thu Aug 18, 2011 12:04 pm

Re: SIMEX 2.1

Post by Rebel »

Dann Corbit wrote: Tue Sep 03, 2019 9:10 pm Here is a question:
Is it wrong to use another engine's evaluation terms, if you wrote your own code to implement them?
If it is, where is the cutoff?

For instance, lots of engines will have a wood value of about 3 for knights and bishops, pulled right out of the chess books.
Lots of engines will have a bishop pair bonus.
Lots of engines will have lots of interesting evaluation terms that are similar from engine to engine.
Different things that go into king safety are similar from engine to engine.

So, is a similar evaluation wrong, and if so, what makes it wrong?

To me, using someone else's code without permission is what would make it wrong.
But the ideas are free. At least, they ought to be.

I am interested to hear other opinions about what makes similar evaluation (and similar search for that matter) right or wrong.
I see people trying to draw a line, but I do not clearly understand exactly what the line is supposed to mean.
The line was drawn in the past, with 16 programmers signing a letter.
90% of coding is debugging, the other 10% is writing bugs.
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: SIMEX 2.1

Post by chrisw »

Dann Corbit wrote: Tue Sep 03, 2019 9:10 pm Here is a question:
Is it wrong to use another engine's evaluation terms, if you wrote your own code to implement them?
If it is, where is the cutoff?

For instance, lots of engines will have a wood value of about 3 for knights and bishops, pulled right out of the chess books.
Lots of engines will have a bishop pair bonus.
Lots of engines will have lots of interesting evaluation terms that are similar from engine to engine.
Different things that go into king safety are similar from engine to engine.

So, is a similar evaluation wrong, and if so, what makes it wrong?

To me, using someone else's code without permission is what would make it wrong.
But the ideas are free. At least, they ought to be.

I am interested to hear other opinions about what makes similar evaluation (and similar search for that matter) right or wrong.
I see people trying to draw a line, but I do not clearly understand exactly what the line is supposed to mean.
Ed has already posted that the line the original Simex drew for derivative/clone and so on was really quite arbitrary, agreed after 'discussion'; in other words, it's a political line.

It ought to be possible, with all the data we have, to rationalise the 'line in the sand', or at least give it some meaning. So, to this end:

We compare SF10 with SF9 and so on: pairs one development year apart, two development years apart, and so on back to SF1. Then, averaging the similarity scores for one year back, two years back, etc., we get a metric for what the various similarity scores mean in terms of years of SF development (a sketch of this averaging step follows the table below). NB: I'm assuming Stockfish releases a new version every year or so.

Code: Select all

Stockfish development years           Sim score
1                                        73.69
2                                        64.24
3                                        58.66
4                                        53.71
5                                        49.95
6                                        46.51
7                                        44.02
8                                        40.92
9                                        39.14
[Figure: SF-dev-years.png, a plot of the table above: similarity score versus years of Stockfish development]
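
A sketch of the averaging step described above, assuming pairwise SIMEX scores are available (sim() below is a hypothetical stub, not part of SIMEX; the real numbers would come from the crosstable):

Code: Select all

# Group pairwise sim scores by how many development years apart the two
# versions are, then average each group.
from collections import defaultdict
from itertools import combinations

def sim(a, b):
    # Hypothetical stub so the sketch runs; in reality, look up the
    # SIMEX percentage for SF<a> vs SF<b> in the crosstable.
    return 100.0 - 7.0 * abs(a - b)

versions = range(1, 11)  # SF1 .. SF10
by_gap = defaultdict(list)
for a, b in combinations(versions, 2):
    by_gap[b - a].append(sim(a, b))

for gap in sorted(by_gap):
    print(gap, round(sum(by_gap[gap]) / len(by_gap[gap]), 2))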

From the table we can read off roughly how many years of continuous Stockfish development a given sim score represents:

65% would be about 1.9 years of separation.
60% would be about 2.5 years of separation.
55% would be about 3.5 years of separation.
50% would be about 5.0 years of separation, from, say, SF5 to SF10.
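
To read intermediate scores off the table, a minimal sketch (my own linear interpolation between neighbouring rows, nothing more):

Code: Select all

# Interpolate "years of SF development" from a sim score, using the
# averaged table above (linear interpolation between neighbouring rows).
TABLE = [(1, 73.69), (2, 64.24), (3, 58.66), (4, 53.71), (5, 49.95),
         (6, 46.51), (7, 44.02), (8, 40.92), (9, 39.14)]

def years_for_sim(score):
    for (y1, s1), (y2, s2) in zip(TABLE, TABLE[1:]):
        if s2 <= score <= s1:
            return y1 + (s1 - score) / (s1 - s2) * (y2 - y1)
    return None  # outside the table's range

print(years_for_sim(65.0))  # -> ~1.9 years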

Of course, all Stockfishes are derivatives of earlier Stockfishes.
As programs such as Booot6 and Hiarcs14 show us, the similarity score for independent programs (well, I assume so, and the data confirms it) is down in the 30-40% range. Many, many programs that Ed has tested score 30-40% across the board against everything they've been tested against. Others don't. Well, now you have a kind of comparison metric: if an engine gets a Simex score of 65% against some other engine, that's comparable to the kind of score Stockfish 8 and Stockfish 10, or SF6 and SF4, and so on, would get against each other.
User avatar
Rebel
Posts: 6991
Joined: Thu Aug 18, 2011 12:04 pm

Re: SIMEX 2.1

Post by Rebel »

chrisw wrote: Wed Sep 04, 2019 1:04 am Of course, all Stockfishes are derivatives of earlier Stockfishes.
As programs such as Booot6 and Hiarcs14 show us, the similarity score for independent programs (well, I assume so, and the data confirms it) is down in the 30-40% range. Many, many programs that Ed has tested score 30-40% across the board against everything they've been tested against. Others don't. Well, now you have a kind of comparison metric: if an engine gets a Simex score of 65% against some other engine, that's comparable to the kind of score Stockfish 8 and Stockfish 10, or SF6 and SF4, and so on, would get against each other.
^^^^
That.
90% of coding is debugging, the other 10% is writing bugs.