Is It Me or is the Source Comparison Page Rigged?


RegicideX

Re: Is It Me or is the Source Comparison Page Rigged?

Post by RegicideX »

This expression, [a-e][a-e]*[x-z], is written in well-known regular-expression syntax.

True or false: one can write a program to enumerate all strings that match that expression, in lexical order, in shortest-to-longest order, or in any other order one wants to impose.
Yes, but in fact we are not presented with several possible source codes that correspond to the assembly code -- we are presented with a single source code, with the pretense that it is the objective, "no creativity involved" source code.

It is like saying that "abz" is the only possible string derived from "[a-e][a-e]*[x-z]".
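(For what it's worth, nobody disputes that such an enumerator is trivial to write. A minimal sketch in C, hard-coded for this one toy pattern rather than a general regex engine, and capped at length 4 only for the demo, might look like this:)

```c
#include <stdio.h>
#include <string.h>

/* Print every string matching [a-e][a-e]*[x-z], shortest first and in
   lexical order within each length.  Hard-coded for this toy pattern. */
static void emit(char *buf, int pos, int len)
{
    if (pos == len - 1) {                  /* final character: [x-z] */
        for (char c = 'x'; c <= 'z'; c++) {
            buf[pos] = c;
            puts(buf);
        }
        return;
    }
    for (char c = 'a'; c <= 'e'; c++) {    /* leading characters: [a-e] */
        buf[pos] = c;
        emit(buf, pos + 1, len);
    }
}

int main(void)
{
    char buf[16];
    for (int len = 2; len <= 4; len++) {   /* shortest-to-longest, capped for the demo */
        memset(buf, 0, sizeof buf);
        emit(buf, 0, len);
    }
    return 0;
}
```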
User avatar
tiger
Posts: 819
Joined: Sat Mar 11, 2006 3:15 am
Location: Guadeloupe (french caribbean island)

FIRST objective critical analysis

Post by tiger »

Sven Schüle wrote: Once more, I want to add my point of view to the discussion without addressing anyone personally; I am only interested in keeping the facts above the speculations.

I doubt that we can draw any serious conclusion from the fact that the piece of code from Rybka that implements the same functionality as the "parse_go()" function of Fruit may be partially "semantically equivalent" to the corresponding Fruit code. My reason is simple:


That code is supposed to have exactly this functionality, and if it didn't, the program would not work as it should!


Let me comment on this in more detail. I start with the Fruit part, then continue with the Rybka side and "semantic equivalence". When line numbers are given, I refer to http://pagesperso-orange.fr/ct_chess/Fr ... rt_go.html as this is the page many of us seem to be looking at in most cases.

"parse_go()" in Fruit 2.1 can be regarded as consisting of three major parts:

part 1: code to parse the parameter 'string[]' in order to recognize which UCI parameters have been set (and to which values) along with the "go" command (lines 1-143); parsing results are stored in local variables which are declared and initialized only for this purpose in the beginning of the function, and will be evaluated in the next step;

part 2: code to initialize the global data structures SearchInput, SearchInfo, SearchRoot, SearchCurrent, and SearchBest; all these are cleared first with the function call "search_clear()" (line 144), then the SearchInput struct is filled with data derived from the local variables containing the parsing results (lines 145-204);

part 3: code to actually perform the search, and when done, to send the search result (best move) via the interface.


Now we come to the other side.

Part 1: The UCI protocol specifies exactly the meaning of parameter names like "binc", "movestogo", or "movetime". For programming this part, there is very little room for different but still correct implementations. It is obvious that you use strcmp() combined with strtok() here, since you are going through a string of space-separated tokens from left to right. Using something other than strcmp() together with strtok() would be possible but would be a useless re-invention of the wheel. (Fruit's "string_equal()" internally uses strcmp().)

The fact that the order of handling the different UCI parameters is identical on both sides can have a trivial explanation: the parameters are handled in alphabetical order of their names, which is the obvious way of doing it.

So it should not surprise anyone that the code of part 1 shows strong similarities between the Rybka side and the Fruit side. Rybka, as opposed to Fruit, also has no error handling in this part, but that is only a minor additional detail.
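To illustrate how little freedom there is, here is a minimal sketch of such a token loop. The variable names, the parameter subset, and the printf at the end are placeholders for illustration; this is neither Fruit's nor Rybka's actual code:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse a UCI "go" command such as "go wtime 300000 btime 300000 movestogo 40".
   Parameters are handled in alphabetical order, as discussed above, and
   (like the code under discussion) there is no error handling. */
static void parse_go(char string[])
{
    int btime = -1, depth = -1, movestogo = -1, movetime = -1, wtime = -1;
    int infinite = 0;

    char *ptr = strtok(string, " ");             /* skip the "go" token itself */
    for (ptr = strtok(NULL, " "); ptr != NULL; ptr = strtok(NULL, " ")) {
        if (strcmp(ptr, "btime") == 0) {
            btime = atoi(strtok(NULL, " "));
        } else if (strcmp(ptr, "depth") == 0) {
            depth = atoi(strtok(NULL, " "));
        } else if (strcmp(ptr, "infinite") == 0) {
            infinite = 1;
        } else if (strcmp(ptr, "movestogo") == 0) {
            movestogo = atoi(strtok(NULL, " "));
        } else if (strcmp(ptr, "movetime") == 0) {
            movetime = atoi(strtok(NULL, " "));
        } else if (strcmp(ptr, "wtime") == 0) {
            wtime = atoi(strtok(NULL, " "));
        }                                        /* ...and so on for the rest */
    }
    printf("btime=%d depth=%d infinite=%d movestogo=%d movetime=%d wtime=%d\n",
           btime, depth, infinite, movestogo, movetime, wtime);
}

int main(void)
{
    char cmd[] = "go wtime 300000 btime 300000 movestogo 40";
    parse_go(cmd);
    return 0;
}
```

Any correct implementation of this part will end up looking very much like this, which is exactly the point.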

Part 2: As shown for part 1, there is some information to be stored that results from parsing the input. Again, there is not much room to choose here for the programmer. Certain data has to be stored.

But: please note that in the Rybka code there seems to be no equivalent to "search_clear()" from Fruit, nor do I see an equivalent for the five data structures mentioned above.

So even if Rybka does something in this part 2 that is "semantically equivalent" to the Fruit 2.1 side, I can only say that
a) this is necessary since the information derived from input parsing, mostly the time control settings, has to be made available for the search, and
b) Rybka seems to use at least slightly different data structures for this type of information.

The only thing that I see here is that the structure of the part 2 implementation appears to be similar on both sides. It is of course possible that the Rybka author borrowed this structure from Fruit, but this is not proof of copy and paste, and even if it were, it would be ridiculous to base a case on it, since this is mostly trivial code.

Also, I see some small semantic differences, like the missing "Ponder" UCI option (line 191), the different time_max calculation (lines 184-186), the missing global flag time_is_limited (lines 178, 188), and the use of ">=" versus ">" in two places (lines 182, 193).
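For the sake of discussion, here is a hedged sketch of what part 2 amounts to. The struct, the field names, and the time formula (the "/30" fallback and the "*4" hard limit) are invented for illustration; they are neither Fruit's SearchInput nor whatever Rybka keeps internally:

```c
#include <string.h>

/* Illustrative container for the search settings derived from parsing. */
typedef struct {
    int    depth_is_limited;
    int    depth_limit;
    int    time_is_limited;
    double time_limit_soft;   /* normal budget, in seconds */
    double time_limit_hard;   /* absolute cutoff, in seconds */
    int    infinite;
} search_input_t;

static void setup_search(search_input_t *si, int depth, int movetime,
                         int mytime, int myinc, int movestogo, int infinite)
{
    memset(si, 0, sizeof *si);                  /* the "clear" step */

    if (depth >= 0) {                           /* fixed-depth search */
        si->depth_is_limited = 1;
        si->depth_limit = depth;
    }
    if (movetime >= 0) {                        /* fixed time per move */
        si->time_is_limited = 1;
        si->time_limit_soft = si->time_limit_hard = movetime / 1000.0;
    } else if (mytime >= 0) {                   /* derive a budget from the clock */
        int mtg = (movestogo > 0) ? movestogo : 30;
        si->time_is_limited = 1;
        si->time_limit_soft = (mytime / 1000.0) / mtg + myinc / 1000.0;
        si->time_limit_hard = 4.0 * si->time_limit_soft;
    }
    si->infinite = infinite;
}
```

The overall shape is forced by the task; whether the ">=" tests, the flags, and the exact constants match between two programs is where the interesting comparison lies.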

Part 3: When parsing is done and information derived from it is available, you start the search and afterwards send the best move. That's what has to be done here, no reason to do something else. Note, however, that Rybka (as someone already pointed out in some posting some days or weeks ago) seems to have no equivalent to search_update_current() from Fruit.


As I have shown, I see no reason to claim full semantic equivalence of Fruit 2.1 and Rybka 1.0 Beta regarding the "parse_go()" code, since there are enough differences. Furthermore, I showed that the semantic equivalences which are partially there are implied by the very purpose of that piece of code.

Only for a very limited portion of part 2 can I imagine that the Rybka author may have been heavily inspired by Fruit when designing his code to store the data derived from input parsing in his own data structures. IMHO this is far from a GPL violation.

Sven


Thank you, Sven, for your critical analysis of this part of the evidence. As far as I can tell, this is the FIRST time this evidence has been looked at without bias.

So it took one week or so to have someone look at it objectively and report here.

What is important is that we agree on the methodology. The page you are referring to was just supposed to be posted in a message here on CCC. It is not complete because it does not mention the methodology or link to it. This can be seen as a mistake and will be corrected, but the page was only uploaded to my server because I could not insert HTML code in a CCC message.

If only more people would agree to look at the evidence, current and future, the way you have, that would be huge progress.



// Christophe
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Is It Me or is the Source Comparison Page Rigged?

Post by bob »

RegicideX wrote:
This expression, [a-e][a-e]*[x-z], is written in well-known regular-expression syntax.

True or false: one can write a program to enumerate all strings that match that expression, in lexical order, in shortest-to-longest order, or in any other order one wants to impose.
Yes, but in fact we are not presented with several possible source codes that correspond to the assembly code -- we are presented with a single source code, with the pretense that it is the objective, "no creativity involved" source code.

It is like saying that "abz" is the only possible string derived from "[a-e][a-e]*[x-z]".
What is the goal again? To determine whether machine code B is semantically equivalent to source code A? And to accomplish this we have to decompile B, which will at times lead to multiple alternatives, and it is important to show those, which prove nothing, as opposed to showing that there is at least one decompilation of B that leads directly back to the source of A, establishing semantic equivalence.

So do we need to publish the potentially millions of alternatives, so that we can provide the one that is of interest???

I guess _I_ am not "getting it"...

BTW, you want (like others) to put words into my mouth. You would not see me say "this is the _only_ output from this expression". You would see me say "this is _the_ interesting output that matches the questionable case exactly, showing that that line is semantically equivalent to this expression." That's all anyone is trying to do: find the needle in the haystack. To present the entire haystack is completely hopeless; that could be terabytes and nobody would be interested.
RegicideX

Re: Is It Me or is the Source Comparison Page Rigged?

Post by RegicideX »

What is the goal again? To determine whether machine code B is semantically equivalent to source code A? And to accomplish this we have to decompile B, which will at times lead to multiple alternatives, and it is important to show those, which prove nothing, as opposed to showing that there is at least one decompilation of B that leads directly back to the source of A, establishing semantic equivalence.
But we have not established semantic equivalence between any significant parts of the code -- except that a lot of similar variables are present, which is a result of the UCI specification. At best there is semantic similarity between various parts.

And if semantic similarity is all we have, then it does matter how you translate the source code and what other possible sources are compatible with the machine code.
User avatar
Rolf
Posts: 6081
Joined: Fri Mar 10, 2006 11:14 pm
Location: Munster, Nuremberg, Princeton

Re: Is It Me or is the Source Comparison Page Rigged?

Post by Rolf »

bob wrote:
RegicideX wrote:
This expression, [a-e][a-e]*[x-z], is written in well-known regular-expression syntax.

True or false: one can write a program to enumerate all strings that match that expression, in lexical order, in shortest-to-longest order, or in any other order one wants to impose.
Yes, but in fact we are not presented with several possible source codes that correspond to the assembly code -- we are presented with a single source code, with the pretense that it is the objective, "no creativity involved" source code.

It is like saying that "abz" is the only possible string derived from "[a-e][a-e]*[x-z]".
What is the goal again? To determine whether machine code B is semantically equivalent to source code A? And to accomplish this we have to decompile B, which will at times lead to multiple alternatives, and it is important to show those, which prove nothing, as opposed to showing that there is at least one decompilation of B that leads directly back to the source of A, establishing semantic equivalence.

So do we need to publish the potentially millions of alternatives, so that we can provide the one that is of interest???

I guess _I_ am not "getting it"...

BTW, you want (like others) to put words into my mouth. You would not see me say "this is the _only_ output from this expression". You would see me say "this is _the_ interesting output that matches the questionable case exactly, showing that that line is semantically equivalent to this expression." That's all anyone is trying to do: find the needle in the haystack. To present the entire haystack is completely hopeless; that could be terabytes and nobody would be interested.
I don't know if computer science has a neutral vocabulary. If so, you "all" should have used it, but what you did was write normal English with disrespectful speech against and about Vas and his Rybka. THAT is the evil in your campaign, and it's the only point I have made for weeks now, without success. If it had always gone like "I saw some code in Rybka 1 and then I compared it with ... and the result IMO was as follows... what do people think about my findings?" I would never have dared to enter such a tech debate. But it wasn't led like this. And this is your failure, which will stay with you all, even if something comes out in your favor. If not, then IMO you are as done as Theron & Schmidt anyway. IMO.
-Popper and Lakatos are good but I'm stuck on Leibowitz
User avatar
tiger
Posts: 819
Joined: Sat Mar 11, 2006 3:15 am
Location: Guadeloupe (french caribbean island)

Re: Is It Me or is the Source Comparison Page Rigged?

Post by tiger »

RegicideX wrote:
Having the source code is not needed at all when a court tries to evaluate the similarities in two programs, because the courts work at a higher abstraction level.

They work at the semantical level or even higher in some cases.
But from the quote someone provided here, the courts also discard code that is in some way "forced" to be similar semantically -- for instance the implementation of a simple predefined interface protocol must be semantically similar at a higher level of abstraction. Thus the courts might very well throw out such evidence.

The courts might also want to find similarity for significant portions of code.

It's just speculation since we're not lawyers, but I don't think that bringing the court practice into this advances the anti-Rybka case.


I see a big advance if the points you are making help everybody to understand and accept the methodology.

I realize it had not been explained thoroughly before. It is true that some listings have been posted, and maybe we wrongly assumed that people would understand why it is correct to present them this way.

So even if the UCI protocol parser were thrown out by a court, the benefit of presenting its analysis is that the methodology has a chance to be discussed before we apply it to more significant evidence.



// Christophe
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Is It Me or is the Source Comparison Page Rigged?

Post by bob »

RegicideX wrote:
What is the goal again? To determine whether machine code B is semantically equivalent to source code A? And to accomplish this we have to decompile B, which will at times lead to multiple alternatives, and it is important to show those, which prove nothing, as opposed to showing that there is at least one decompilation of B that leads directly back to the source of A, establishing semantic equivalence.
But we have not established semantic equivalence between any significant parts of the code -- except that a lot of similar variables are present, which is a result of the UCI specification. At best there is semantic similarity between various parts.

And if semantic similarity is all we have, then it does matter how you translate the source code and what other possible sources are compatible with the machine code.
Again, give them _time_. The process is not so much complicated as it is time-consuming. As for the rest of your post, I agree with it in general. But this is not a completed project yet. There are lots of other ways to compare programs. A simple one is evaluation terms: what is the probability that two different authors end up with hundreds of identical evaluation numbers? Not very high. We have seen lots of "give-aways" in the past: duplicate internal strings for messages (identical wording is improbable), filenames duplicated from the original, duplicate module names, duplicate data structures. There are lots of axes to examine in this n-dimensional space; each will be covered in time. The more intersections between the two programs, the greater the probability that they have a common ancestry.
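To make one such axis concrete, here is a toy sketch; the two tables and the values in them are made up for illustration, they are not data from either program:

```c
#include <stdio.h>

#define N_TERMS 8

/* Hypothetical evaluation constants from two programs, flattened into arrays
   (material values, a few positional bonuses, and so on). */
static const int eval_a[N_TERMS] = { 100, 325, 325, 500, 975, 12, -8, 20 };
static const int eval_b[N_TERMS] = { 100, 325, 335, 500, 975, 12, -8, 18 };

int main(void)
{
    int identical = 0;
    for (int i = 0; i < N_TERMS; i++)
        if (eval_a[i] == eval_b[i])
            identical++;
    printf("%d of %d evaluation terms identical (%.0f%%)\n",
           identical, N_TERMS, 100.0 * identical / N_TERMS);
    return 0;
}
```

With hundreds of independently chosen terms rather than eight, a very high overlap becomes increasingly hard to explain by coincidence.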
User avatar
tiger
Posts: 819
Joined: Sat Mar 11, 2006 3:15 am
Location: Guadeloupe (french caribbean island)

Re: Is It Me or is the Source Comparison Page Rigged?

Post by tiger »

RegicideX wrote:
What is the goal again? to determine if machine code B is semantically equivalent with source code A? And to accomplish this we have to de-compile B which will lead to multilpe alternatives at times, and it is important to show those, which prove nothing, as opposed to showing that there is at least one decompilation of B that leads directly back to the source of A, establishing semantic equivalence.
But we have not established semantic equivalence between any significant parts of the code -- except that a lot of similar variables are present, which is a result of the UCI specification. At best there is semantic similarity between various parts.

And if semantic similarity is all we have, then is does matter how you translate the source code and what other possible sources are compatible with the machine code.


Semantic similarity is what the courts are looking for in order to find copyright infringement.

I hope you were not thinking that they required identical source codes?

I do not know how many times I have repeated this, so it looks like you have not read much of the discussion.



// Christophe
User avatar
GenoM
Posts: 911
Joined: Wed Mar 08, 2006 9:46 pm
Location: Plovdiv, Bulgaria
Full name: Evgenii Manev

Re: Is It Me or is the Source Comparison Page Rigged?

Post by GenoM »

John wrote:
bob wrote:"... I know how to discover plagiarism. So do thousands of others. It is done daily. All done the same way...."
Yes it is. And that way is "Show me your source code." But the present case is completely different, isn't it? Because modern optimizing compilers scrub away the idiosyncratic human traits that teachers (and automated tools too) use to detect plagiarism.
RegicideX wrote:
Because 99% of the people here really do not understand the process of taking a source program and translating it to an optimized machine language program.
You can claim all the expertise in the world -- it will not make the claim that there is no creativity involved in constructing a C source code from machine code anything less than sheer baloney.
Compare with that:
RegicideX wrote:
GenoM wrote: I've tried to follow this scientific discussion about research methodology.
I can guess that John Sidles and Alex K. are mostly worried that the translated code of Rybka is not exactly the same as the original code of Rybka. They insist that the translated code cannot be exactly the same as the original source code, so no valid (scientifically speaking) conclusions can be drawn from the results of the comparison.
Have I got it right?
No, you have not.<...>
Seems I have got it pretty much right, haven't I?

Your way of arguing pretends to be very scientific, but as we see, you are not clear about your own position and views.
take it easy :)
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Is It Me or is the Source Comparison Page Rigged?

Post by michiguel »

bob wrote:
John wrote:
bob wrote: ... We are talking about a direct and well known process for translating a high-level language into machine language, and then back again ...
Bob, you could serve the CCC well by providing a link to peer-reviewed descriptions of this process.

Especially vital are reliable estimates of Type I versus Type II errors, and equally important, inter-rater reliability.

Minimizing both kinds of error, and maximizing the reliability, is where the advice of statisticians and psychologists is indispensable.
What on earth are you talking about? There is no "error" in this process. Given the line of C code "a = b * c + 1;", there is one way to translate that into assembly language. You might slightly alter the order of the instructions to improve speed, but if the compiler does not have a bug, then the result must be semantically equivalent. Given the assembly code the compiler produced, it is again a direct translation to go from the assembly language back to semantically equivalent C.
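For instance, a typical optimizing compiler might produce something like the following for that statement; the exact instructions and registers depend on the compiler and flags:

```c
int a, b, c;

void example(void)
{
    a = b * c + 1;
    /* One plausible x86-64 translation (AT&T syntax):
           movl  b(%rip), %eax
           imull c(%rip), %eax
           addl  $1, %eax
           movl  %eax, a(%rip)
       Reading those four instructions back, the natural C is again
       "a = b * c + 1;" (or a trivially reordered equivalent). */
}
```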

So where are you trying to take this? I have no idea who you are or what your background is. You can find mine easily enough. I have written assemblers and compilers, I have taught the courses for many years, and the others involved are also quite good and have several people, including myself, looking over their shoulders to make sure that translation errors do not occur.

So, where are you trying to go with this? And why?


There is no error in the process, but there might be in the interpretation.
When someone asks "What are the chances that code A has not been copied and derived from code B?", you open the door to statistics with the word "chances". The process has no error, but you end up with a similarity that may be quantifiable. Code that is 100% semantically identical is one thing, but what if it is not? Where do you draw the line? How do you define "% of similarity"? We know that it is easy to define 100%, but anything else might not be trivial. You certainly cannot emphatically deny that statistics play a role. A quick search led me to this paper:

Shared Information and Program Plagiarism Detection
Xin Chen, Brent Francia, Ming Li, Brian McKinnon, Amit Seker
University of California, Santa Barbara
http://citeseerx.ist.psu.edu/viewdoc/su ... .1.1.10.76

It may not be the best paper, but it is the first one I found in which people try to put all this into quantifiable terms. This may be far from solved, but as I said, if things can be quantified, statistics have a role.

I quote two paragraphs. Note that the problem resembles genome, or DNA sequence, comparison, something I already pointed out but which received no attention:

"A common thread between information theory and computer science is the study of the amount of information
contained in an ensemble [17, 18] or a sequence [9]. A fundamental and very practical question has challenged
us for the past 50 years: Given two sequences, how do we measure their similarity in the sense that the measure
captures all of our intuitive concepts of “computable similarities”? Practical reincarnations of this question
abound. In genomics, are two genomes similar? On the internet, are two documents similar? Among a pile of
student Java programming assignments, are some of them plagiarized?
This paper is a part of our continued effort to develop a general and yet practical theory to answer the
challenge. We have proposed a general concept of sequence similarity in [3, 11] and further developed more
suitable theories in [8] and then in [10]. The theory has been successfully applied to whole genome phylogeny
[8], chain letter evolution [4], language phylogeny [2, 10], and more recently classification of music pieces in
MIDI format [6]. In this paper, we report our project of the past three years aimed at applying this general
theory to the domain of detecting programming plagiarisms."
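To give a flavor of the compression-based similarity measures this line of work builds on, here is a minimal sketch of the related "normalized compression distance" using zlib (link with -lz). The function names zsize() and ncd() and the sample strings are illustrative; this is not the exact metric from the paper:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

/* Compressed size of a buffer, using zlib at maximum compression. */
static size_t zsize(const unsigned char *buf, size_t len)
{
    uLongf dest_len = compressBound(len);
    unsigned char *dest = malloc(dest_len);
    compress2(dest, &dest_len, buf, len, Z_BEST_COMPRESSION);
    free(dest);
    return (size_t)dest_len;
}

/* Normalized compression distance: near 0 = very similar, near 1 = unrelated. */
static double ncd(const char *x, const char *y)
{
    size_t lx = strlen(x), ly = strlen(y);
    unsigned char *xy = malloc(lx + ly);
    memcpy(xy, x, lx);
    memcpy(xy + lx, y, ly);

    size_t cx  = zsize((const unsigned char *)x, lx);
    size_t cy  = zsize((const unsigned char *)y, ly);
    size_t cxy = zsize(xy, lx + ly);
    free(xy);

    size_t cmin = cx < cy ? cx : cy;
    size_t cmax = cx > cy ? cx : cy;
    return (double)(cxy - cmin) / (double)cmax;
}

int main(void)
{
    /* In a real comparison, x and y would be the two programs' token streams. */
    const char *x = "if (time >= 0) movetime = time / movestogo + inc;";
    const char *y = "if (time > 0)  movetime = time / movestogo + inc;";
    printf("NCD = %.3f\n", ncd(x, y));
    return 0;
}
```

A value near 0 says the two inputs compress almost as well together as separately, i.e. they share a lot of structure; whether such a number means "derived work" is exactly the statistical question raised above.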

Miguel