Is It Me or is the Source Comparison Page Rigged?

John · Post by **John** » Mon Sep 01, 2008 8:50 am

bob wrote: ... We might find that significant parts can be made to match up, so we know that significant parts came from A. Or we might not be able to show much at all came from A ... .

Is there any quantitative definition or consensus regarding the meaning of the word "significance" in the above passage? Or does it simply mean "Our common sense and experience will tell us what is significant"?

There is strong evidence that the latter view is mistaken ... no matter that the people who hold that belief are intelligent, experienced, and working in good faith. Because when the tools of evaluation and their associated criteria for significance are not specified in advance, the door is open wide to fallacies and pseudoscience.

Mike S. · Post by **Mike S.** » Mon Sep 01, 2008 11:37 am

tiger wrote: The side by side comparison had originally been posted by Norman but looked horrible because the formatting had been destroyed by the CCC message parser. I tried to make it look better.

And as a result of this beautification, just by correcting a formatting problem, the Fruit 2.1 source code examples suddenly were "rybkanized", as if by magic?!

tiger · Post by **tiger** » Mon Sep 01, 2008 12:09 pm

Mike S. wrote:
tiger wrote: The side by side comparison had originally been posted by Norman but looked horrible because the formatting had been destroyed by the CCC message parser. I tried to make it look better.
And as a result of this beautification, just by correcting a formatting problem, the Fruit 2.1 source code examples suddenly were "rybkanized", as if by magic?!

Not by magic. By operations that do not change the semantics of the program.

The courts do not work at the source code level. They work at least one the level of abstraction higher by looking at the semantics of the program (what it does, not how it is written exactly).

The semantical analysis allows to avoid being fooled by changed variable names, inversed lines and so on.

The semantical analysis can also be done by changing one of the source, or both, so they look as identical as possible. It cannot be done in any way however: only changes that do not affect the semantics of the program are allowed.

The courts have chosen this method because... it works! If you take two programs that have been written independantly, you will not manage to make them look appear identical or even nearly identical. For example with the semantical analysis, both program created in this thread by Bob and ChrisW cannot be made to look identical.

On the other hand, if you modify a program by changing names, reordering some lines, changing the formatting and such techniques designed to hide the origin of the program, then the semantical analysis will find that your modified program is identical to the original.

// Christophe

chrisw · Post by **chrisw** » Mon Sep 01, 2008 1:27 pm

RegicideX wrote:
bob wrote: Fine. you are unable to follow the discussion.
Baseless assertion, followed by verbiage.

Aha. so you do _not_ understand "semantical equivalence". This is simply proving that for the same inputs, the two pieces of code produce the same output. "
Then the Rybka code and the Fruit code are not semantically equivalent -- there are plenty of differences in their outcome for the same input.

But of course, that's not what you mean. You mean that you should get to ignore differences --including semantic non-equivalence-- and focus only on what you find similar.

Order is immaterial so long as changes do not violate data depenencies, name dependencies or control dependencies.
Order makes the source codes look much more similar that they really are. If ten variables in a row are initialized in the same order in machine code then that's alarming. Having variables initialized all over the place, and many of them nonexistent is not at all alarming.

Changing the order without saying so is at best sneaky, at worst dishonest.

It is worth going back to the first presentation of the misleading source comparison data, the "Here's something to start with" thread ....

Norman Schmidt explains the "methodology", with absolutely no mention at all of any reordering of either variable or program flow. No mention of "semantics".

Norman wrote:
Please note that: Fruits ASSERTs have been removed comments have been removed there are differences between the two...mainly: where Fruit has TRUE, Rybka has 1 where Fruit has FALSE, Rybka has 0
where Fruit has double data types, Rybka uses integer
search->info struct has been replaced with a direct reference to the independant variable.
truthfully many of the differences reflect the implementation of backend C bitboard routines in place of object-oriented C++ class references
Fruit often does error checking that is absent in Rybka,
if (ptr == NULL) my_fatal("parse_go(): missing argument\n");
Another difference:
Fruit calls string_equal to accept input, whereas Rybka uses the C function:
int strcmp( const char *string1, const char *string2 )
but these functions are essentially equivilent...
string_equal is simply a Fabien re-write with ASSERTs included:
string_equal( const char s1 [], const char s2 [] )
{
ASSERT(s1 != NULL);
ASSERT(s2 != NULL);
return strcmp(s1, s2) == 0;
}
which is something he did often...see Fruit 2.1 util.cpp
one click...a global search and replace (for ex: replace true with 1, would recifiy some of the larger differences).

Sven Schule challenged on the variable ordering:

Sven wrote:
since you don't have the Rybka source, how can you know about the location and order of variable declarations in Rybka? Declarations of local variables often do not produce assembler code, you just see the variables when they are used.

Hyatt replied without any reference to the creative reordering that had been done:

Bob wrote:
If you look at the order of the variables as they either appear in memory or on the stack, you can almost infer the order they were declared.

And Alexander Schmidt was convinced:

Alexander wrote:
TY Norman, this is completely convincing. I have not much programming experience (mainly vba) but this are too many similaries for a coincidence.
Unfortunately people will not read it and not believe it...

And Christophe Theron remained absolutely silent on the creative reordering:

Christophe wrote:
Nothing at all.

Sven · Post by **Sven** » Mon Sep 01, 2008 1:34 pm

Once more, I want to add my point of view to the discussion without addressing anyone personally, I am only interested in keeping the facts higher than the speculations.

I doubt that we can draw any serious conclusion from the fact that the piece of code from Rybka that implements the same functionality as the "parse_go()" function of Fruit may be partially "semantically equivalent" to the corresponding Fruit code. My reason is simple:

That code is supposed to have exactly this functionality, and if it wouldn't, then the program would not work as is should!

Let me comment on this in more detail. I start with the Fruit part, then I continue commenting on the Rybka side and "semantic equivalence". When line numbers are given, I refer to http://pagesperso-orange.fr/ct_chess/Fr ... rt_go.html as this is the page many of us seem to be looking at in most cases.

"parse_go()" in Fruit 2.1 can be regarded as consisting of three major parts:

part 1: code to parse the parameter 'string[]' in order to recognize which UCI parameters have been set (and to which values) along with the "go" command (lines 1-143); parsing results are stored in local variables which are declared and initialized only for this purpose in the beginning of the function, and will be evaluated in the next step;

part 2: code to initialize the global data structures SearchInput, SearchInfo, SearchRoot, SearchCurrent, and SearchBest; all these are cleared first with the function call "search_clear()" (line 144), then the SearchInput struct is filled with data derived from the local variables containing the parsing results (lines 145-204);

part 3: code to actually perform the search, and when done, to send the search result (best move) via the interface.

Now we come to the other side.

Part 1: The UCI protocol specifies exactly the meaning of parameter names like "binc", "movestogo", or "movetime". For programming this part, there is very little room for different but still correct implementations. It is obvious that you use strcmp() combined with strtok() here since you are going through a string consisting of space-separated tokens from left to right. Using something else than strcmp() together with strtok() would be possible but would require useless re-invention of the wheel. (Fruit's "string_equal()" internally uses strcmp().)

The fact that the order of handling the different UCI parameters is identical on both sides can have a trivial explanation: the parameters are handled in alphabetical order of their names, which is the obvious way of doing it.

So it should not surprise anyone that the code of part 1 shows strong similarities between the Rybka side and the Fruit side. Rybka also has no error handling in this part as opposed Fruit, that's only a minor additional detail.

Part 2: As shown for part 1, there is some information to be stored that results from parsing the input. Again, there is not much room to choose here for the programmer. Certain data has to be stored.

But: please note that in the Rybka code there seems to be no equivalent to "search_clear()" from Fruit, nor do I see an equivalent for the five data structures mentioned above.

So even if Rybka does something in this part 2 that is "semantically equivalent" to the Fruit 2.1 side, I can only say that
a) this is necessary since the information derived from input parsing, mostly the time control settings, have to be made available for the search, and
b) Rybka seems to use at least slightly different data structures for this type of information.

The only thing that I see here is that the structure of the part 2 implementation appears to be similar on both sides. It is of course possible that the Rybka author borrowed this structure from Fruit, but this is not a proof of copy and paste, and even if there were a proof, it would be ridiculous to base on it since this is most trivial code.

Also I see some small semantic differences, like the missing "Ponder" UCI option (line 191), the different time_max calculation (lines 184-186), the missing global flag time_is_limited (lines 178, 188), or the difference of using ">=" or ">" at two places (lines 182, 193).

Part 3: When parsing is done and information derived from it is available, you start the search and afterwards send the best move. That's what has to be done here, no reason to do something else. Note, however, that Rybka (as someone already pointed out in some posting some days or weeks ago) seems to have no equivalent to search_update_current() from Fruit.

As I showed, I see no reason to claim full semantic equivalence of Fruit 2.1 and Rybka 1.0 beta regarding the "parse_go()" code since there are enough differences. Furthermore I showed that semantic equivalences are partially there but are implied by the very purpose of that piece of code.

Only for a very limited portion of part 2 I can imagine that the Rybka author may have been heavily inspired by Fruit when designing his code to store the data derived from input parsing in his own data structures. IMHO this is far away from GPL violation.

Sven

GenoM · Post by **GenoM** » Mon Sep 01, 2008 3:39 pm

I've tried to follow this scientific discussion about methodology of research.
I can guess John Sidles and Alex K. are mostly worried about that translated code of Rybka is not exactly the same as the original code of Rybka. They are insisting on view that translated code can not be exactly the same as the original source code so from the results of comparision can not be drawn any valid (by scientific means) conclusions.
Have I got it right?

RegicideX · Post by **RegicideX** » Mon Sep 01, 2008 4:10 pm

GenoM wrote:I've tried to follow this scientific discussion about methodology of research.
I can guess John Sidles and Alex K. are mostly worried about that translated code of Rybka is not exactly the same as the original code of Rybka. They are insisting on view that translated code can not be exactly the same as the original source code so from the results of comparision can not be drawn any valid (by scientific means) conclusions.
Have I got it right?

No, you have not. It is perfectly true that the compiler adds a layer of uncertainty by changing the code around -- that's just part of the picture though. The real point is that a certain degree of similarity in not unlikely in programs that are short, simple and need to provide extremely similar if not identical output.

The actual source comparison shows some similarities but also lots of dissimilarities -- which is what is to be expected.

But if you see 10 or 15 variables which are initialized in the same order in machine code, then despite the changes made by the compiler, something is fishy -- not necessarily conclusive, but fishy.

It turns out though, that the order of variable initialization was faked to make the code look more similar than the disassembler made it look -- the same goes for the order in which various "if" clauses are checked. So there is really nothing fishy in the code regarding the order.

There is plenty of semantic non-equivalent code, and what similarity there is, is mostly about the general structure of the program -- making allowance for the fact that even in the structure there are dissimilarities.

GenoM · Post by **GenoM** » Mon Sep 01, 2008 4:21 pm

RegicideX wrote:
GenoM wrote:I've tried to follow this scientific discussion about methodology of research.
I can guess John Sidles and Alex K. are mostly worried about that translated code of Rybka is not exactly the same as the original code of Rybka. They are insisting on view that translated code can not be exactly the same as the original source code so from the results of comparision can not be drawn any valid (by scientific means) conclusions.
Have I got it right?
No, you have not.<...>

So you agree that valid conclusions can be drawn from comparison between an actual source code and the disassembled one.

Thanks.

RegicideX · Post by **RegicideX** » Mon Sep 01, 2008 4:38 pm

GenoM wrote:
So you agree that valid conclusions can be drawn from comparison between an actual source code and the disassembled one.

Thanks.

Yes -- I also agree that invalid conclusions can be drawn from comparison between an actual source code and the disassembled one.

Without specifics this is meaningless.

bob · Post by **bob** » Mon Sep 01, 2008 4:51 pm

RegicideX wrote:
bob wrote: But that is difficult to see for the casual person, so moving things around, while maintaining that semantic equivalency, so that we get identical source code finishes the project off and leaves a clearly identifiable connection.
The poor casual person should be told that things were switched around a lot to make it look more similar than it is. That's the fishy part. (Not to mention that at least a variable actually got deleted in the process.)

And since the codes being compared are not semantically equivalent and contain lots of semantically in-equivalent parts, it is the shuffling around that does a lot if not most of the work of making the codes look similar.

And the point would be? Let's take a student who can't write his own program, and so he plagiarizes something from someone who took the class sometime in the past.

In the source, he is going to make several transformations:

(1) change variable and procedure names.

(2) completely rewrite the comments

(3) change some blocks of code into procedures that are called to further obfuscate the origin.

(4) change the order of instructions so that things are done in different orders when possible, without violating semantic equivalence so that the program will still work as required by the assignment definition.

(5) change code structure when possible. Replace a for-loop with an equivalent while() loop, Look for other ways to rewrite small sections of code to further hide the origin.

To catch the plagiarism, the steps have to be reversed. I'm not going to do this for all programs, so I look for some key similarities first, and if I spot some, then I start unwinding the changes or trying to change the source program to the plagiarized program, depending on which looks easier.

There is a _ton_ of stuff that gets moved around. And anyone with experience in doing this clearly understands this issue. If you join in to a discussion about such issues, the low-level methodology is not going to be discussed among people that have done this many times. When discussing programming, you won't find people continually writing "of course, the for() loop might not be executed at all, if the terminating condition is satisfied before the loop starts..." That is just basic knowledge about the topic at hand. Just as renaming variables, reordering instructions, etc, is a common plan for verifying plagiarism.

So "outsiders" see a discussion and lack some background? That happens. You could say the "insiders" ought to discuss things more clearly, making sure there are no actions that an "outsider" would not understand. I would go the opposite way and say that an "outsider" ought to do some research on the topic _first_. The programming section of this message board would be completely useless if every post had to explain every high-level topic that is mentioned, every time it is mentioned.

Is It Me or is the Source Comparison Page Rigged?

Re: Is It Me or is the Source Comparison Page Rigged?

Re: Is It Me or is the Source Comparison Page Rigged?

Re: Is It Me or is the Source Comparison Page Rigged?

Re: Is It Me or is the Source Comparison Page Rigged?

Re: Is It Me or is the Source Comparison Page Rigged?

Re: Is It Me or is the Source Comparison Page Rigged?

Re: Is It Me or is the Source Comparison Page Rigged?

Re: Is It Me or is the Source Comparison Page Rigged?

Re: Is It Me or is the Source Comparison Page Rigged?

Re: Is It Me or is the Source Comparison Page Rigged?