Is It Me or is the Source Comparison Page Rigged?

RegicideX · Post by **RegicideX** » Mon Sep 01, 2008 9:04 pm

Having the source code is not needed at all when a court tries to evaluate the similarities in two programs, because the courts work at a higher abstraction level.

They work at the semantical level or even higher in some cases.

But from the quote someone provided here, the courts also discard code that is in some way "forced" to be similar semantically -- for instance the implementation of a simple predefined interface protocol must be semantically similar at a higher level of abstraction. Thus the courts might very well throw out such evidence.

The courts might also want to find similarity for significant portions of code.

It's just speculation since we're not lawyers, but I don't think that bringing the court practice into this advances the anti-Rybka case.

tiger · Post by **tiger** » Mon Sep 01, 2008 9:11 pm

RegicideX wrote:
Because 99% of the people here really do not understand the process of taking a source program and translating it to an optimized machine language program.
You can claim all the expertise in the world -- it will not make the claim that there is no creativity involved in constructing a C source code from machine code anything less that sheer baloney.

Bob is not claiming all the expertise in the world. What he explains to you is basic knowledge of information technology.

I do not know if you have this knowledge yourself. Actually from the points you make I doubt you have. I may be wrong naturally.

// Christophe

bob · Post by **bob** » Mon Sep 01, 2008 9:16 pm

RegicideX wrote:
Having the source code is not needed at all when a court tries to evaluate the similarities in two programs, because the courts work at a higher abstraction level.

They work at the semantical level or even higher in some cases.
But from the quote someone provided here, the courts also discard code that is in some way "forced" to be similar semantically -- for instance the implementation of a simple predefined interface protocol must be semantically similar at a higher level of abstraction. Thus the courts might very well throw out such evidence.

The courts might also want to find similarity for significant portions of code.

It's just speculation since we're not lawyers, but I don't think that bringing the court practice into this advances the anti-Rybka case.

I didn't see that, but whomever said it is wrong. The mechanism for disassembling is well-known. And the concept of "semantic equivalence" has been around forever...

tiger · Post by **tiger** » Mon Sep 01, 2008 9:20 pm

RegicideX wrote:

Alex, if somebody takes the source code of a program and starts switching variable names and starts moving lines of code up and down (while making sure those moves do not change anything significant), they will come up with a new file.c.
You are adding the completely unwarranted assumption that Vas was actually doing code reshuffling from the Fruit source. You have zero evidence of that. Indeed, given the completely trivial nature of the UCI parser programm it is probably less of a hassle to simply write the programm from scratch.

And it's not at all true that only variable reordering is different -- there are many differences.

Claiming that you do have identical blocks of code when you don't have them is misleading -- no matter how much bullshit gathers in this thread.

The process is not misleading. It is an accepted process in trying to detect similar programs.

It is "semantical abstraction".

The assumption that line reordering has happened must be done. As long as this reordering does not change the semantics of the program, it could for example be done automatically by a program. Some reordering is done by the compiler.

The concern you have is that the process could somehow lead to false positives. It could in a very few cases, but could not happen by chance when several routines are taken into account. That is why the courts use this process.

The case of the UCI parser is interesting because it has brought up this issue. The methodology can be better explained because it has been correctly pointed out that we should have mentionned it. However the methodology is correct. You may argue that it can give a false positive on the UCI parser, but that does not make it invalid even in this case. A false positive can happen, but many false positive cannot.

Once we agree on the methodology, I guess that analyzing more evidence will be less painful.

// Christophe

bob · Post by **bob** » Mon Sep 01, 2008 9:21 pm

Uri Blass wrote:
bob wrote:
Uri Blass wrote:
bob wrote:
RegicideX wrote:
GenoM wrote:I've tried to follow this scientific discussion about methodology of research.
I can guess John Sidles and Alex K. are mostly worried about that translated code of Rybka is not exactly the same as the original code of Rybka. They are insisting on view that translated code can not be exactly the same as the original source code so from the results of comparision can not be drawn any valid (by scientific means) conclusions.
Have I got it right?
No, you have not. It is perfectly true that the compiler adds a layer of uncertainty by changing the code around -- that's just part of the picture though. The real point is that a certain degree of similarity in not unlikely in programs that are short, simple and need to provide extremely similar if not identical output.
OK. but we are talking about chess programs. My program (crafty) currently has 44,864 lines of code. That is hardly "small". the final part about "similar output" is meaningless. We have so many ways to search trees, so many ways to extend and reduce, so many ways to order the moves to make the search more efficient, a nearly limitless number of ways to turn a position into a numeric value that can be passed thru the search to find the best move, that this point is simply meaningless. Finding semantically equivalent code can happen on two levels. Nobody cares about tiny pieces of code such as PopCnt(), LSB(), MSB() and the like. But when you get into procedures that are hundreds of lines long, big chunks of semantically identical code leads to but one conclusion. Again, for a given algorithm, there are an infinite number of ways syntactically to express it. So duplication is _highly_ improbable. And that is the basis that caused a couple of people to start to look at this more closely. It is simply so improbable unless code was physically copied.

The actual source comparison shows some similarities but also lots of dissimilarities -- which is what is to be expected.
Partially true. There will necessarily be parts of the code that are different, since program B (supposedly derived from A) is significantly stronger. But chunks of duplicated code is _not_ normal in any program of more than a few lines. So that concept I do not agree with.

But if you see 10 or 15 variables which are initialized in the same order in machine code, then despite the changes made by the compiler, something is fishy -- not necessarily conclusive, but fishy.
Agreed. And if they are all initialized, even if _not_ in the same order, it is _still_ fishy In fact, the compiler could change that order for other reasons if it wanted to For example, it is more efficient to initialize things in the order they appear in memory to take advantage of cache prefetching. And if you are trying to de-compile, you see the final order, not the programmed order, and rearranging makes perfect sense when trying to match things up.

It turns out though, that the order of variable initialization was faked to make the code look more similar than the disassembler made it look -- the same goes for the order in which various "if" clauses are checked. So there is really nothing fishy in the code regarding the order.
EXTREMELY poor choice of wording with "faked". Nothing being done in this investigation is "faked". If you want to use words that are inflammatory or accusatory, feel free. But don't expect much in the way of discussion if that is the way you want to proceed. Get some experience or knowledge about the field and you will see how wrong "faked" is.

There is plenty of semantic non-equivalent code, and what similarity there is, is mostly about the general structure of the program -- making allowance for the fact that even in the structure there are dissimilarities.
again, that is baloney. yes there is non-equivalent code. But in a large program, one expects to find very little semantically equivalent code. That is the issue. Do you actually believe that in writing a 44,000 line program, that I will by pure chance duplicate what others have done _exactly_ here and there? when I don't even see this happen in 100-1000 line student programs???
The situation with your students is different because one student does not study the work of another student.

Uri
That has to be the single most naive statement I have ever read. I asked another faculty member his opinion, and his reply was "what planet is _this_ guy from?" that pretty much says it all. Fraternities have copies of every test I have ever given, every assignment I have given and member's papers that go with them. Foreign students have the same sort of "underground". That is why I keep old assignments.

It is _really_ time to get serious here and stop that kind of nonsensical remark. Again, talk to _anybody_ that works in a University CS department. This is a problem _everywhere_.
I guess that I assumed wrongly that the students do not have access to programs that do the same task.

In this case I wonder if you do not blame some students for copying when they did not copy and maybe programmers who see source try to avoid some structure that they remember from another code because they are afraid that if they do not do it they are going to be blamed for copying.

Maybe the only good test is giving students with similiar code an exam
to see if they understand the code that they give or to ask them to start to build the task from scratch and compare what they do in some hours of work without copying.

Uri

Easy question, deserves an answer. I have _never_ had a single circumstance where the student said "you are wrong". Most admit it when I show them the two programs side by side. The most common response is "OK, I made a mistake, can I have another chance to write it on my own?"

BTW, on final projects, which are generally turned in around the last class, I have on _several_ occasions created a set of "choose one of the following and answer based on your final project code" such as "how did you guarantee that your program produces identical results with 1, 2, 3, ... n processors (has to do with generating streams of random numbers)?" There are _many_ ways to do this, but the most direct is to compare.

RegicideX · Post by **RegicideX** » Mon Sep 01, 2008 9:23 pm

Bob is not claiming all the expertise in the world. What he explains to you is basic knowledge of information technology.

Bob is making long winded explanations of the obvious and then he slips in logical mistakes which make the source reconstruction look more objective than it is.

It is complete nonsense to claim that there is no creativity involved in writing a source C code from machine code. In fact, Bob often agrees with this, saying correctly that there are many C source codes corresponding to assembly -- and then in the next sentence he claims that he can determine the source completely objectively.

That's simply contradictory.

bob · Post by **bob** » Mon Sep 01, 2008 9:24 pm

RegicideX wrote:
Because 99% of the people here really do not understand the process of taking a source program and translating it to an optimized machine language program.
You can claim all the expertise in the world -- it will not make the claim that there is no creativity involved in constructing a C source code from machine code anything less that sheer baloney.

Believe what you want. I _know_. If you want to live in a state of total denial about this particular subject, that's fine. But you are in the same boat as someone trying to convince a physicist that an atom is the smallest particle that exists. They just _know_ better. Same here.

tiger · Post by **tiger** » Mon Sep 01, 2008 9:25 pm

chrisw wrote:
bob wrote:
chrisw wrote:
RegicideX wrote:
bob wrote: Fine. you are unable to follow the discussion.
Baseless assertion, followed by verbiage.

Aha. so you do _not_ understand "semantical equivalence". This is simply proving that for the same inputs, the two pieces of code produce the same output. "
Then the Rybka code and the Fruit code are not semantically equivalent -- there are plenty of differences in their outcome for the same input.

But of course, that's not what you mean. You mean that you should get to ignore differences --including semantic non-equivalence-- and focus only on what you find similar.

Order is immaterial so long as changes do not violate data depenencies, name dependencies or control dependencies.
Order makes the source codes look much more similar that they really are. If ten variables in a row are initialized in the same order in machine code then that's alarming. Having variables initialized all over the place, and many of them nonexistent is not at all alarming.

Changing the order without saying so is at best sneaky, at worst dishonest.

It is worth going back to the first presentation of the misleading source comparison data, the "Here's something to start with" thread ....

Norman Schmidt explains the "methodology", with absolutely no mention at all of any reordering of either variable or program flow. No mention of "semantics".
Norman wrote:
Please note that: Fruits ASSERTs have been removed comments have been removed there are differences between the two...mainly: where Fruit has TRUE, Rybka has 1 where Fruit has FALSE, Rybka has 0
where Fruit has double data types, Rybka uses integer
search->info struct has been replaced with a direct reference to the independant variable.
truthfully many of the differences reflect the implementation of backend C bitboard routines in place of object-oriented C++ class references
Fruit often does error checking that is absent in Rybka,
if (ptr == NULL) my_fatal("parse_go(): missing argument\n");
Another difference:
Fruit calls string_equal to accept input, whereas Rybka uses the C function:
int strcmp( const char *string1, const char *string2 )
but these functions are essentially equivilent...
string_equal is simply a Fabien re-write with ASSERTs included:
string_equal( const char s1 [], const char s2 [] )
{
ASSERT(s1 != NULL);
ASSERT(s2 != NULL);
return strcmp(s1, s2) == 0;
}
which is something he did often...see Fruit 2.1 util.cpp
one click...a global search and replace (for ex: replace true with 1, would recifiy some of the larger differences).
Sven Schule challenged on the variable ordering:
Sven wrote:
since you don't have the Rybka source, how can you know about the location and order of variable declarations in Rybka? Declarations of local variables often do not produce assembler code, you just see the variables when they are used.
Hyatt replied without any reference to the creative reordering that had been done:
Bob wrote:
If you look at the order of the variables as they either appear in memory or on the stack, you can almost infer the order they were declared.
And Alexander Schmidt was convinced:
Alexander wrote:
TY Norman, this is completely convincing. I have not much programming experience (mainly vba) but this are too many similaries for a coincidence.
Unfortunately people will not read it and not believe it...
And Christophe Theron remained absolutely silent on the creative reordering:
Christophe wrote:
Nothing at all.
Can we possibly use the _same_ vocabulary here, which includes the definition of words.

We have a set of sticks in pile A that are colored, where each stick is a different color. We have a second set of sticks and have been asked to answer the question "are the sticks in pile B identical to the sticks in pile A".
Of course your sets of sticks match. They're the necessary and forced sub-components of the Go Parser.

Zach has shown his pile of stick (his UCI parser) and they did not match at all.

Or maybe you can take Zach listing and make it match? Why don't you try?

// Christophe

bob · Post by **bob** » Mon Sep 01, 2008 9:28 pm

RegicideX wrote:
Bob is not claiming all the expertise in the world. What he explains to you is basic knowledge of information technology.
Bob is making long winded explanations of the obvious and then he slips in logical mistakes which make the source reconstruction look more objective than it is.

It is complete nonsense to claim that there is no creativity involved in writing a source C code from machine code. In fact, Bob often agrees with this, saying correctly that there are many C source codes corresponding to assembly -- and then in the next sentence he claims that he can determine the source completely objectively.

That's simply contradictory.

No it isn't. this expression [a-e][a-e]*[x-z] is a well-known syntax for a regular expression.

true or false, one can write a program to enumerate all strings that will match that expression, starting in lexical order, or in order shortest-to-longest, or in any other order one wants to impose. Does that require any creativity? Not in my book. If you call that creativity, then you are using a different definition of the word than I use. And the word I use comes from "create" as in "make new". Is counting from 1 to N "creative"? Neither is enumerating strings from the above regular expression. neither is enumerating possible source codes that could compile to a fixed binary format... It is enumeration, plain and simple...

bob · Post by **bob** » Mon Sep 01, 2008 9:31 pm

RegicideX wrote:

Alex, if somebody takes the source code of a program and starts switching variable names and starts moving lines of code up and down (while making sure those moves do not change anything significant), they will come up with a new file.c.
You are adding the completely unwarranted assumption that Vas was actually doing code reshuffling from the Fruit source. You have zero evidence of that. Indeed, given the completely trivial nature of the UCI parser programm it is probably less of a hassle to simply write the programm from scratch.

And it's not at all true that only variable reordering is different -- there are many differences.

Claiming that you do have identical blocks of code when you don't have them is misleading -- no matter how much bullshit gathers in this thread.

None of the above is "unwarranted". First we know that rybka beta was stronger than fruit. So if you accept the assumption that rybka came from fruit, then the natural consequence of that is that somehow the fruit code was modified. I am not so concerned by "hiding" because he never intended to release the source code, although he knew it was always a potential issue at a ICGA tournament where source can be demanded. But if you accept the first premise that rybka came from fruit, all the other possibilities fall out naturally.

Is It Me or is the Source Comparison Page Rigged?

Re: Is It Me or is the Source Comparison Page Rigged?

Re: Is It Me or is the Source Comparison Page Rigged?

Re: Is It Me or is the Source Comparison Page Rigged?

Re: Is It Me or is the Source Comparison Page Rigged?

Re: Is It Me or is the Source Comparison Page Rigged?

Re: Is It Me or is the Source Comparison Page Rigged?

Re: Is It Me or is the Source Comparison Page Rigged?

Re: Is It Me or is the Source Comparison Page Rigged?

Re: Is It Me or is the Source Comparison Page Rigged?

Re: Is It Me or is the Source Comparison Page Rigged?