Is It Me or is the Source Comparison Page Rigged?

kranium · Post by **kranium** » Mon Sep 01, 2008 3:44 am

RegicideX wrote:
Dr.Wael Deeb wrote: With all my respect, we don't need yet another Rybka flamewar thread....
We've had more than enough....
I agree -- and I would ask everyone to be civil about this.

But people are forming an opinion about the similarity of two pieces of code by studying source code that was modified to look more similar than it is. This is not right.

Names are in fact different -- but making them similar anyway is understandable. But changing the order of variable declaration, the order of variable initialization and the order of "if" statements is a bit too much. This on top of the fact that various blocks of code do not actually correspond to each other.

I think it's important to point this out given that the good name of a programmer is involved.

Hi Alex-

according to the legal sources we consulted:
- it is perfectly correct and legally sound to reconstruct the source code and make it appear as similar as possible as long as the semantics are left untouched

http://www.linux.com/feature/113252/

i.e. you can and must move instructions around, pick variable names and use any formatting you see fit in order to make the reconstructed code look as similar as possible. However, you must not change the semantics.
The reconstructed code, when compiled and run, must work exactly as the original.

for ex: this means that
bool variable = 1;
is the same as
bool variable = true;

or if the same line of code appears on line 26 in one source, but on line 28 in another source...it doesn't mean the comparison is suddenly invalid.

if the lines of code are semantically identical, and the order of variable declarations is meaningless, then aligning two lines of code that are the same next to each other only aids the comparison.
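A minimal sketch of that point in C (C99 with `<stdbool.h>`). The function and variable names are made up for illustration and are not taken from either engine's source. Both versions declare the same independent variables in a different order and spell the boolean initializer differently, yet they return the same value for every input:

```c
#include <stdbool.h>

/* Hypothetical version A: one declaration order, flag written as 1. */
int score_a(int material) {
    bool endgame = 1;
    int bonus = 10;
    int penalty = 3;
    return endgame ? material + bonus - penalty : material;
}

/* Hypothetical version B: same declarations reordered, flag written as true. */
int score_b(int material) {
    int penalty = 3;
    int bonus = 10;
    bool endgame = true;
    return endgame ? material + bonus - penalty : material;
}

/* Returns true if both versions agree on every input in [lo, hi]. */
bool agree_on_range(int lo, int hi) {
    for (int m = lo; m <= hi; m++) {
        if (score_a(m) != score_b(m)) return false;
    }
    return true;
}
```

The declarations do not read each other, so their order carries no meaning; only the final expression determines the result.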

bob · Post by **bob** » Mon Sep 01, 2008 4:24 am

John wrote:Bob, your chess and programming skills are undoubted, but have you ever taught a statistics course?

I commend this page on Bonferroni corrections to all.

Modern computerized analysis methods allow (literally) millions of hypotheses to be searched, and in consequence, it is infeasible to evaluate the significance of any criteria applied ex post facto.

That is why ex post facto analysis reliably confirms investigator prejudices, and in consequence, yields results that are without scientific merit.
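For readers who have not met the correction John refers to, a minimal sketch in C: when m hypotheses are tested at an overall significance level alpha, the Bonferroni correction requires each individual test to pass at alpha / m. The function names and the numbers in the usage note are illustrative assumptions, not anything taken from the comparison page.

```c
/* Per-test significance threshold under a Bonferroni correction:
   overall level alpha divided by the number of hypotheses tested. */
double bonferroni_threshold(double alpha, unsigned long num_hypotheses) {
    if (num_hypotheses == 0) return alpha;  /* nothing tested: no correction */
    return alpha / (double)num_hypotheses;
}

/* Returns 1 if a single p-value is still significant after correction. */
int significant_after_correction(double p_value, double alpha,
                                 unsigned long num_hypotheses) {
    return p_value <= bonferroni_threshold(alpha, num_hypotheses);
}
```

With alpha = 0.05, a p-value of 0.01 passes when it is the only hypothesis tested, but not when it is one of a million, where the per-test threshold drops to 0.05 / 1,000,000.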

This is not about "statistics" in any way, so I don't see the relevance. This is about people proficient in C and assembly language programming comparing two programs to determine if there are significant blocks of code that are semantically equivalent. That doesn't require any math, or anything else, other than the requisite programming expertise to make the comparison. So I have absolutely no idea where this discussion is supposed to be heading...

RegicideX · Post by **RegicideX** » Mon Sep 01, 2008 4:42 am

bob wrote: Fine. you are unable to follow the discussion.

Baseless assertion, followed by verbiage.

Aha. so you do _not_ understand "semantical equivalence". This is simply proving that for the same inputs, the two pieces of code produce the same output.

Then the Rybka code and the Fruit code are not semantically equivalent -- there are plenty of differences in their outcome for the same input.

But of course, that's not what you mean. You mean that you should get to ignore differences --including semantic non-equivalence-- and focus only on what you find similar.

Order is immaterial so long as changes do not violate data dependencies, name dependencies or control dependencies.

Order makes the source codes look much more similar than they really are. If ten variables in a row are initialized in the same order in machine code then that's alarming. Having variables initialized all over the place, and many of them nonexistent is not at all alarming.

Changing the order without saying so is at best sneaky, at worst dishonest.

John · Post by **John** » Mon Sep 01, 2008 4:44 am

bob wrote:"...I have absolutely no idea where this discussion is supposed to be heading..."

Bob, I commend to you, and all who are looking at code, the illuminating case study of "tuning" that begins at the bottom of page 15 of this (peer-reviewed) study of research bias.

In the end, the main scientific failings found were biased selection of criteria combined with what the authors called "naive statistical expectations".

These failings were minor, but their consequence was major: an egregious and highly public episode of pseudoscience.

The people most likely to commit this class of errors are those who know a great deal of mathematics and computer science, but less about statistics and cognition.

RegicideX · Post by **RegicideX** » Mon Sep 01, 2008 4:49 am

according to the legal sources we consulted:
- it is perfectly correct and legally sound to reconstruct the source code

This is not a question of legality -- it is at best a question of the ethics of discourse.

If your purpose is to present the case so that everyone can form his/her opinion, then at the very least you should say that you changed the order of the lines a lot in order to make the sources look similar.

John · Post by **John** » Mon Sep 01, 2008 5:09 am

Norman Schmidt wrote: "... according to the legal sources we consulted ..."

May I respectfully suggest, that the quality of your investigation might be improved, and the CCC community better served, if you consulted instead a statistician and a cognitive psychologist, as is routine practice in medical outcomes research?

bob · Post by **bob** » Mon Sep 01, 2008 5:52 am

John wrote:
bob wrote:"...I have absolutely no idea where this discussion is supposed to be heading..."
Bob, I commend to you, and all who are looking at code, the illuminating case study of "tuning" that begins at the bottom of page 15 of this (peer-reviewed) study of research bias.

In the end, the main scientific failings found were biased selection of criteria combined with what the authors called "naive statistical expectations".

These failings were minor, but their consequence was major: an egregious and highly public episode of pseudoscience.

The people most likely to commit this class of errors are those who know a great deal of mathematics and computer science, but less about statistics and cognition.

Again, that is simply irrelevant here. This is a direct process, comparing A to B. It would be far easier if we had the source to both, we could then run it thru one of several semantic analysis software tools and get a quick "these are close". But we don't have that, so someone has to reverse-engineer and compare. There is no subjective component, it is just hard work. There is no scientific or statistical method for counting the number of shingles on the roof of a house. You just go count 'em. There is no sampling error, no sampling bias, no testing bias, it is just a pure translation process. I've never heard of someone trying to decipher some ancient writing and worrying about "bias". Just precise scientific work.

The properties you are mentioning just do not apply here.

bob · Post by **bob** » Mon Sep 01, 2008 5:58 am

RegicideX wrote:
bob wrote: Fine. you are unable to follow the discussion.
Baseless assertion, followed by verbiage.

Aha. so you do _not_ understand "semantical equivalence". This is simply proving that for the same inputs, the two pieces of code produce the same output.
Then the Rybka code and the Fruit code are not semantically equivalent -- there are plenty of differences in their outcome for the same input.

But of course, that's not what you mean. You mean that you should get to ignore differences --including semantic non-equivalence-- and focus only on what you find similar.

That is _exactly_ the point: we are searching for semantic equivalences inside two programs that are themselves _not_ semantically equivalent, because semantic equivalence equates to related source code.

So this time around, you actually "got it right". Nobody has even begun to say the two programs are equivalent. The goal is to see whether Rybka 1 was derived from the Fruit source by copying, or was written from scratch, only borrowing ideas. "Ideas" do not produce blocks of code with semantic equivalence; there are too many ways to write the same idea in programming code for that to be viable. And the more equivalent blocks that are found, the _lower_ the probability of "accidental duplication" becomes, even if the two programs are not equivalent in total (which no one has claimed, btw).

Order is immaterial so long as changes do not violate data dependencies, name dependencies or control dependencies.
Order makes the source codes look much more similar than they really are. If ten variables in a row are initialized in the same order in machine code then that's alarming. Having variables initialized all over the place, and many of them nonexistent is not at all alarming.

Again, before making statements, do a little research on translating source code to machine language. Instructions get re-ordered by the compiler, in fact, to improve speed and reduce pipeline stalls. The hardware reorders instructions internally using what is called "out of order execution". The only rule is that this can not change the semantic meaning of the code or it will break things. So this happens _everywhere_ already. If you don't understand that, some research will clear it up and end these tangents that waste time and bandwidth.
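A small C sketch of that rule, with hypothetical function names: the first two functions differ only in the order of two independent statements and always agree, while the last two swap statements that share a data dependency and so compute different values. This illustrates the dependency constraint only; it is not code from Fruit or Rybka.

```c
/* Original order: the two assignments do not depend on each other. */
int eval_original(int a, int b) {
    int x = a * 2;      /* independent of y */
    int y = b + 7;      /* independent of x */
    return x - y;
}

/* Reordered: legal, because no data, name, or control dependency is crossed. */
int eval_reordered(int a, int b) {
    int y = b + 7;
    int x = a * 2;
    return x - y;
}

/* Dependent statements: each update reads the value written just before it. */
int dependent_original(int a) {
    int t = a;
    t = t + 1;          /* reads t */
    t = t * 2;          /* reads the incremented t */
    return t;           /* (a + 1) * 2 */
}

/* Swapping the two updates violates the data dependency and changes the result. */
int dependent_swapped(int a) {
    int t = a;
    t = t * 2;
    t = t + 1;
    return t;           /* a * 2 + 1 */
}
```

A compiler is free to turn `eval_original` into `eval_reordered`, but it may never turn `dependent_original` into `dependent_swapped`.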

Changing the order without saying so is at best sneaky, at worst dishonest.

It is actually neither. Unless you consider your compiler and processor to be "sneaky or dishonest"...

bob · Post by **bob** » Mon Sep 01, 2008 6:00 am

John wrote:
Norman Schmidt wrote: "... according to the legal sources we consulted ..."
May I respectfully suggest, that the quality of your investigation might be improved, and the CCC community better served, if you consulted instead a statistician and a cognitive psychologist, as is routine practice in medical outcomes research?

Might I respectfully suggest that your suggestion has absolutely no place in this discussion. We are talking about a direct and well-known process for translating a high-level language into machine language, and then back again. Can we possibly get back to the real topic? There is _zero_ statistics in this process. There is _zero_ medical outcomes research that is relevant. I can hardly believe some of the comments that we see in a supposedly technical forum. It is just nonsense.

John · Post by **John** » Mon Sep 01, 2008 6:08 am

bob wrote: ... "I've never heard of someone trying to decipher some ancient writing and worrying about 'bias' " ....

My wife has a degree in Egyptology, and so I can testify that your example is *exceedingly* ill-chosen! The Egyptology community squabbles even more passionately ... and upon even flimsier evidence ... than the chess programmers.

My point remains ... no amount of intelligence and experience can compensate for a bad protocol. Because as Mark Twain said "science needs only a spoonful of supposition to build a mountain of demonstrated fact!"
