Is It Me or is the Source Comparison Page Rigged?

bob · Post by **bob** » Mon Sep 01, 2008 8:07 pm

I believe that 99% of this discussion is the result of a huge communication gap, because 99% of the people here really do not understand the process of taking a source program and translating it to an optimized machine language program. Or, to make it even more complex, to a machine language program optimized for a specific Intel processor type (PIV vs PII, vs core 1, vs core 2, vs SSE, vs SSE2, etc.).

So it is a complicated process, but not something that can't be understood and reversed. My first lecture to students in my first class tries to explain "why you are here". And it is about taking the black box (the computer, the o/s, the compilers, the applications, the graphical device, network cards, etc.) and opening it up to remove the magic and see what is _really_ going on. We still have a number here that see a black box, where everything is magic, and a few that understand everything inside.

RegicideX · Post by **RegicideX** » Mon Sep 01, 2008 8:09 pm

Alex, if somebody takes the source code of a program and starts switching variable names and starts moving lines of code up and down (while making sure those moves do not change anything significant), they will come up with a new file.c.

You are adding the completely unwarranted assumption that Vas was actually doing code reshuffling from the Fruit source. You have zero evidence of that. Indeed, given the completely trivial nature of the UCI parser program, it is probably less of a hassle to simply write the program from scratch.

And it's not at all true that only variable reordering is different -- there are many differences.

Claiming that you do have identical blocks of code when you don't have them is misleading -- no matter how much bullshit gathers in this thread.
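For readers who have not written one, the following is a minimal sketch, in Python, of what a UCI "go" parser does: it reads a line made of known keywords and sets the matching search parameters. It is an illustration only and is not taken from Fruit or Rybka. The keyword names come from the UCI protocol; the `GoParams` class, its field names, and the `parse_go` function are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# UCI "go" keywords that are followed by a single integer argument.
INT_KEYWORDS = {"wtime", "btime", "winc", "binc", "movestogo",
                "depth", "nodes", "mate", "movetime"}
# UCI "go" keywords that are plain flags with no argument.
FLAG_KEYWORDS = {"infinite", "ponder"}


@dataclass
class GoParams:
    """Search limits collected from one "go" line (hypothetical container)."""
    wtime: Optional[int] = None
    btime: Optional[int] = None
    winc: Optional[int] = None
    binc: Optional[int] = None
    movestogo: Optional[int] = None
    depth: Optional[int] = None
    nodes: Optional[int] = None
    mate: Optional[int] = None
    movetime: Optional[int] = None
    infinite: bool = False
    ponder: bool = False
    searchmoves: List[str] = field(default_factory=list)


def parse_go(line: str) -> GoParams:
    """Parse a UCI "go" command line into a GoParams object."""
    tokens = line.split()
    if not tokens or tokens[0] != "go":
        raise ValueError("not a go command")
    params = GoParams()
    i = 1
    while i < len(tokens):
        tok = tokens[i]
        if tok in FLAG_KEYWORDS:
            setattr(params, tok, True)
            i += 1
        elif tok in INT_KEYWORDS:
            if i + 1 >= len(tokens):
                raise ValueError(f"missing value for {tok}")
            setattr(params, tok, int(tokens[i + 1]))
            i += 2
        elif tok == "searchmoves":
            # Collect moves until the next known keyword or end of line.
            i += 1
            while i < len(tokens) and tokens[i] not in INT_KEYWORDS | FLAG_KEYWORDS:
                params.searchmoves.append(tokens[i])
                i += 1
        else:
            i += 1  # skip unknown tokens
    return params
```

Calling `parse_go("go wtime 60000 btime 55000 movestogo 40")` fills in the three time-control fields and leaves the rest at their defaults. The sketch shows why two independently written parsers will share the keyword list and the overall loop shape; it says nothing about how much further similarity should be expected.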

Uri Blass · Post by **Uri Blass** » Mon Sep 01, 2008 8:12 pm

bob wrote:
Uri Blass wrote:
bob wrote:
RegicideX wrote:
GenoM wrote:I've tried to follow this scientific discussion about methodology of research.
I can guess John Sidles and Alex K. are mostly worried that the translated code of Rybka is not exactly the same as the original code of Rybka. They insist that translated code cannot be exactly the same as the original source code, so no valid (by scientific means) conclusions can be drawn from the results of the comparison.
Have I got it right?
No, you have not. It is perfectly true that the compiler adds a layer of uncertainty by changing the code around -- that's just part of the picture though. The real point is that a certain degree of similarity is not unlikely in programs that are short, simple and need to provide extremely similar if not identical output.
OK, but we are talking about chess programs. My program (crafty) currently has 44,864 lines of code. That is hardly "small". The final part about "similar output" is meaningless. We have so many ways to search trees, so many ways to extend and reduce, so many ways to order the moves to make the search more efficient, a nearly limitless number of ways to turn a position into a numeric value that can be passed thru the search to find the best move, that this point is simply meaningless. Finding semantically equivalent code can happen on two levels. Nobody cares about tiny pieces of code such as PopCnt(), LSB(), MSB() and the like. But when you get into procedures that are hundreds of lines long, big chunks of semantically identical code lead to but one conclusion. Again, for a given algorithm, there are an infinite number of ways syntactically to express it. So duplication is _highly_ improbable. And that is the basis that caused a couple of people to start to look at this more closely. It is simply so improbable unless code was physically copied.

The actual source comparison shows some similarities but also lots of dissimilarities -- which is what is to be expected.
Partially true. There will necessarily be parts of the code that are different, since program B (supposedly derived from A) is significantly stronger. But chunks of duplicated code are _not_ normal in any program of more than a few lines. So that concept I do not agree with.

But if you see 10 or 15 variables which are initialized in the same order in machine code, then despite the changes made by the compiler, something is fishy -- not necessarily conclusive, but fishy.
Agreed. And if they are all initialized, even if _not_ in the same order, it is _still_ fishy. In fact, the compiler could change that order for other reasons if it wanted to. For example, it is more efficient to initialize things in the order they appear in memory to take advantage of cache prefetching. And if you are trying to de-compile, you see the final order, not the programmed order, and rearranging makes perfect sense when trying to match things up.
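The distinction drawn here, same order versus same set, can be made concrete with a small Python sketch. It assumes the stores seen in the disassembly have already been mapped to named fields, which is itself part of the disputed manual work; the field names and the `compare_init_sequences` helper are hypothetical.

```python
from collections import Counter


def compare_init_sequences(a, b):
    """Return (same_order, same_set) for two lists of initialized field names."""
    same_order = list(a) == list(b)
    # Counter compares contents while ignoring order (and keeps duplicates).
    same_set = Counter(a) == Counter(b)
    return same_order, same_set


# Hypothetical example: order as written in one source file ...
source_order = ["depth", "nodes", "movetime", "infinite", "ponder"]
# ... and order as emitted by a compiler for another program.
binary_order = ["infinite", "depth", "movetime", "nodes", "ponder"]

same_order, same_set = compare_init_sequences(source_order, binary_order)
```

Here `same_order` is `False` and `same_set` is `True`: the two listings initialize exactly the same fields in a different sequence. Whether a set-level match counts as evidence is the point under debate in this thread; the sketch only separates the two questions.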

It turns out though, that the order of variable initialization was faked to make the code look more similar than the disassembler made it look -- the same goes for the order in which various "if" clauses are checked. So there is really nothing fishy in the code regarding the order.
EXTREMELY poor choice of wording with "faked". Nothing being done in this investigation is "faked". If you want to use words that are inflammatory or accusatory, feel free. But don't expect much in the way of discussion if that is the way you want to proceed. Get some experience or knowledge about the field and you will see how wrong "faked" is.

There is plenty of semantic non-equivalent code, and what similarity there is, is mostly about the general structure of the program -- making allowance for the fact that even in the structure there are dissimilarities.
Again, that is baloney. Yes, there is non-equivalent code. But in a large program, one expects to find very little semantically equivalent code. That is the issue. Do you actually believe that in writing a 44,000 line program, I will by pure chance duplicate what others have done _exactly_ here and there? When I don't even see this happen in 100-1000 line student programs???
The situation with your students is different because one student does not study the work of another student.

Uri
That has to be the single most naive statement I have ever read. I asked another faculty member his opinion, and his reply was "what planet is _this_ guy from?" That pretty much says it all. Fraternities have copies of every test I have ever given, every assignment I have given and members' papers that go with them. Foreign students have the same sort of "underground". That is why I keep old assignments.

It is _really_ time to get serious here and stop that kind of nonsensical remark. Again, talk to _anybody_ that works in a University CS department. This is a problem _everywhere_.

I guess that I assumed wrongly that the students do not have access to programs that do the same task.

In that case I wonder whether you sometimes blame students for copying when they did not copy. Perhaps programmers who have seen a source also try to avoid structures they remember from other code, because they are afraid that otherwise they will be blamed for copying.

Maybe the only good test is to give students with similar code an exam to see if they understand the code they handed in, or to ask them to build the task from scratch and compare what they produce in a few hours of work without copying.

Uri

RegicideX · Post by **RegicideX** » Mon Sep 01, 2008 8:16 pm

Because 99% of the people here really do not understand the process of taking a source program and translating it to an optimized machine language program.

You can claim all the expertise in the world -- it will not make the claim that there is no creativity involved in constructing a C source code from machine code anything less than sheer baloney.

Rolf · Post by **Rolf** » Mon Sep 01, 2008 8:33 pm

bob wrote:
RegicideX wrote:
GenoM wrote:I've tried to follow this scientific discussion about methodology of research.
I can guess John Sidles and Alex K. are mostly worried that the translated code of Rybka is not exactly the same as the original code of Rybka. They insist that translated code cannot be exactly the same as the original source code, so no valid (by scientific means) conclusions can be drawn from the results of the comparison.
Have I got it right?
No, you have not. It is perfectly true that the compiler adds a layer of uncertainty by changing the code around -- that's just part of the picture though. The real point is that a certain degree of similarity is not unlikely in programs that are short, simple and need to provide extremely similar if not identical output.
OK, but we are talking about chess programs. My program (crafty) currently has 44,864 lines of code. That is hardly "small". The final part about "similar output" is meaningless. We have so many ways to search trees, so many ways to extend and reduce, so many ways to order the moves to make the search more efficient, a nearly limitless number of ways to turn a position into a numeric value that can be passed thru the search to find the best move, that this point is simply meaningless. Finding semantically equivalent code can happen on two levels. Nobody cares about tiny pieces of code such as PopCnt(), LSB(), MSB() and the like. But when you get into procedures that are hundreds of lines long, big chunks of semantically identical code lead to but one conclusion. Again, for a given algorithm, there are an infinite number of ways syntactically to express it. So duplication is _highly_ improbable. And that is the basis that caused a couple of people to start to look at this more closely. It is simply so improbable unless code was physically copied.

The actual source comparison shows some similarities but also lots of dissimilarities -- which is what is to be expected.
Partially true. There will necessarily be parts of the code that are different, since program B (supposedly derived from A) is significantly stronger. But chunks of duplicated code are _not_ normal in any program of more than a few lines. So that concept I do not agree with.

But if you see 10 or 15 variables which are initialized in the same order in machine code, then despite the changes made by the compiler, something is fishy -- not necessarily conclusive, but fishy.
Agreed. And if they are all initialized, even if _not_ in the same order, it is _still_ fishy. In fact, the compiler could change that order for other reasons if it wanted to. For example, it is more efficient to initialize things in the order they appear in memory to take advantage of cache prefetching. And if you are trying to de-compile, you see the final order, not the programmed order, and rearranging makes perfect sense when trying to match things up.

It turns out though, that the order of variable initialization was faked to make the code look more similar than the disassembler made it look -- the same goes for the order in which various "if" clauses are checked. So there is really nothing fishy in the code regarding the order.
EXTREMELY poor choice of wording with "faked". Nothing being done in this investigation is "faked". If you want to use words that are inflammatory or accusatory, feel free. But don't expect much in the way of discussion if that is the way you want to proceed. Get some experience or knowledge about the field and you will see how wrong "faked" is.

There is plenty of semantic non-equivalent code, and what similarity there is, is mostly about the general structure of the program -- making allowance for the fact that even in the structure there are dissimilarities.
Again, that is baloney. Yes, there is non-equivalent code. But in a large program, one expects to find very little semantically equivalent code. That is the issue. Do you actually believe that in writing a 44,000 line program, I will by pure chance duplicate what others have done _exactly_ here and there? When I don't even see this happen in 100-1000 line student programs???

Can I help you out? What is faked is the whole public campaign, because it is conducted as if you already had the verdict in advance and as if you were police and judge in the same person. That is a fake, at least for experts of such smear acts, as Ed called it. All of this in public is intentional destruction of character, no matter whether your basic assumptions are correct. It's the manner in which you proceed here. Is it a surprise that you must be as experienced as Ed to know this?

kranium · Post by **kranium** » Mon Sep 01, 2008 8:34 pm

RegicideX wrote:

Alex, if somebody takes the source code of a program and starts switching variable names and starts moving lines of code up and down (while making sure those moves do not change anything significant), they will come up with a new file.c.
You are adding the completely unwarranted assumption that Vas was actually doing code reshuffling from the Fruit source. You have zero evidence of that. Indeed, given the completely trivial nature of the UCI parser program, it is probably less of a hassle to simply write the program from scratch.

And it's not at all true that only variable reordering is different -- there are many differences.

Claiming that you do have identical blocks of code when you don't have them is misleading -- no matter how much bullshit gathers in this thread.

Alex,

there are identical blocks of code, and 'equivalent' blocks of code.
we discussed this thoroughly (and I mean thoroughly!) during the past weeks, during the flame wars, which pretty much blanketed the whole discussion.

must we do it all again?

despite the differences of opinion, it seemed we did come to some 'understanding' on the difference. you missed that part.

...i would urge you to try to find it and read it, but good luck, it's buried in the mountain of posts that drowned out any serious debate.

The code comparison was never labeled as 'everything is identical'..., although there are also certainly a significant number of lines that 'are identical'.

PS we also had that discussion, chrisw said 33, i said 41, and Bob says 50...
you came into the debate late...but i'm going to provide a link, please do the work yourself and catch up with everything, because otherwise we're just going over and over about the same things.

tiger · Post by **tiger** » Mon Sep 01, 2008 8:47 pm

John wrote:
bob wrote:"... I know how to discover plagiarism. So do thousands of others. It is done daily. All done the same way...."
Yes it is. And that way is "Show me your source code."

That must be the 10th time or so I repeat it, but let's go.

Having the source code is not needed at all when a court tries to evaluate the similarities in two programs, because the courts work at a higher abstraction level.

They work at the semantical level or even higher in some cases.

But the present case is completely different, isn't it? Because modern optimizing compilers scrub away the idiosyncratic human traits that teachers (and automated tools too) use to detect plagiarism.

To the best of my knowledge, there are no peer-reviewed accounts, and no automated tools, and few or no people with professional experience, relating to the machine-code detection methods being discussed here on the CCC.

Automatic tools are not necessary here because we generally work on functions of reasonable size.

Peer review is done before the evidence is presented, and we hoped that it would also have happened after the evidence had been published. Unfortunately that has not been the case, because all we have seen are arguments denying the existence of any copying in any program (I'm exaggerating a little bit, but that was the idea).

// Christophe

That is not to say that such detection is not possible. But it would be exceedingly difficult---many orders of magnitude more difficult than showing classroom plagiarism via source code comparisons. Even a long list of identical numerical coefficients would *not* constitute evidence of plagiarism, because it is neither illegal nor immoral to copy algorithms.

To assert otherwise, would imply that the On-Line Encyclopedia of Integer Sequences is evidence of a massive global conspiracy of plagiarism among mathematicians ... hmmm ... wait a minute ... who *are* these mathematicians? ... what is their motive in setting up their web site? ... maybe they *are* plagiarizing one another?

It is remarkable how avidly the human mind embraces conspiracy-based explanations. Careful protocol design helps guard against this.

RegicideX · Post by **RegicideX** » Mon Sep 01, 2008 8:54 pm

but i'm going to provide a link, please do the work yourself and catch up with everything, because otherwise we're just going over and over about the same things.

I've read most of the threads about Rybka in the last week or so -- I was just a lurker, because I was not registered. While it's possible that I missed something --there is a lot of noise to wade through-- I doubt that I'll find anything that will surprise me.

Rolf · Post by **Rolf** » Mon Sep 01, 2008 8:55 pm

tiger wrote:
John wrote:
bob wrote:"... I know how to discover plagiarism. So do thousands of others. It is done daily. All done the same way...."
Yes it is. And that way is "Show me your source code."

That must be the 10th time or so I repeat it, but let's go.

Having the source code is not needed at all when a court tries to evaluate the similarities in two programs, because the courts work at a higher abstraction level.

They work at the semantical level or even higher in some cases.

But the present case is completely different, isn't it? Because modern optimizing compilers scrub away the idiosyncratic human traits that teachers (and automated tools too) use to detect plagiarism.

To the best of my knowledge, there are no peer-reviewed accounts, and no automated tools, and few or no people with professional experience, relating to the machine-code detection methods being discussed here on the CCC.

Automatic tools are not necessary here because we generally work on functions of reasonable size.

Peer review is done before the evidence is presented, and we hoped that it would also have happened after the evidence had been published. Unfortunately that has not been the case, because all we have seen are arguments denying the existence of any copying in any program (I'm exaggerating a little bit, but that was the idea).

// Christophe

That is not to say that such detection is not possible. But it would be exceedingly difficult---many orders of magnitude more difficult than showing classroom plagiarism via source code comparisons. Even a long list of identical numerical coefficients would *not* constitute evidence of plagiarism, because it is neither illegal nor immoral to copy algorithms.

To assert otherwise, would imply that the On-Line Encyclopedia of Integer Sequences is evidence of a massive global conspiracy of plagiarism among mathematicians ... hmmm ... wait a minute ... who *are* these mathematicians? ... what is their motive in setting up their web site? ... maybe they *are* plagiarizing one another?

It is remarkable how avidly the human mind embraces conspiracy-based explanations. Careful protocol design helps guard against this.

How can you write such a weak message? He claims you work on the assumption of a campaign whose end is that the accused must defend himself, and how could he do it? How could someone defend his own innocence? Well, in CC, by showing his code. So that is what you hoped for, also Bob, because you all want to know the newest tricks in CC. Very decent method! Bravo!

tiger · Post by **tiger** » Mon Sep 01, 2008 8:57 pm

chrisw wrote:
bob wrote:
chrisw wrote:
bob wrote:
RegicideX wrote:
GenoM wrote:I've tried to follow this scientific discussion about methodology of research.
I can guess John Sidles and Alex K. are mostly worried that the translated code of Rybka is not exactly the same as the original code of Rybka. They insist that translated code cannot be exactly the same as the original source code, so no valid (by scientific means) conclusions can be drawn from the results of the comparison.
Have I got it right?
No, you have not. It is perfectly true that the compiler adds a layer of uncertainty by changing the code around -- that's just part of the picture though. The real point is that a certain degree of similarity is not unlikely in programs that are short, simple and need to provide extremely similar if not identical output.
OK. but we are talking about chess programs.
No we're not. We're talking about the anti chief piece of evidence, the corrupted, manipulated, misleading, transmogrified and cheated alleged source code listing of "Go Parser". A piece of code quite forced, maybe 50 lines long, that takes a text string of known possible words and sets appropriate variables for the engine. All forced and all trivial in the sub-components.

Your side deliberately, secretly, misleadingly, manipulated the source listing to line it up with a target in order to fool the forum and discredit a fellow programmer and damage his business.
OK, for starters, all the programmers out there with a procedure named "go_parser" please raise your hand. Mine is not up. If you had any idea what was going on, you would know that this has not been a "misleading" procedure to this point. It has been done the only way it _can_ be done, the same way it has been done thousands of times in the past.

But you do know that.
For the Rybka alleged source code, you can't even state it is called Go Parser. It might be called Fred. Zach does something along those lines, he posted some source with "infinite", "ponder" and so on.

I don't think you did the misleading, Bob. You were misled.

Fooled once. Shame on them. You have an algorithm for case twice, I believe.

So once again you show that you want to confuse the debate, arguing that a different variable or function name means original code.

Copy a novel. Change the names of the characters. You have written an original novel.

// Christophe
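The "change the names of the characters" point can be illustrated with a small Python sketch that renames every identifier in a C-like snippet to a placeholder based on first appearance, then compares the resulting token streams. This is a simplified illustration, not the method used in the comparison under discussion: the keyword list is partial, string literals and comments are not handled, and the snippets and the `normalize` function are made up for this example.

```python
import re

# Partial list of C keywords that must not be treated as renameable identifiers.
C_KEYWORDS = {"if", "else", "for", "while", "do", "return", "int", "char",
              "void", "long", "unsigned", "struct", "static", "const",
              "break", "continue", "switch", "case", "default", "sizeof"}

# Identifiers, integer literals, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalize(source: str) -> list:
    """Replace each identifier with ID0, ID1, ... in order of first use."""
    mapping = {}
    out = []
    for tok in TOKEN_RE.findall(source):
        if re.match(r"[A-Za-z_]", tok) and tok not in C_KEYWORDS:
            if tok not in mapping:
                mapping[tok] = f"ID{len(mapping)}"
            out.append(mapping[tok])
        else:
            out.append(tok)
    return out


# Two hypothetical snippets that differ only in the names chosen.
snippet_a = "int total = 0; for (i = 0; i < n; i++) total += vals[i]; return total;"
snippet_b = "int sum = 0; for (k = 0; k < count; k++) sum += data[k]; return sum;"
# A third snippet with a genuinely different computation.
snippet_c = "int sum = 1; for (k = 0; k < count; k++) sum *= data[k]; return sum;"

renamed_only = normalize(snippet_a) == normalize(snippet_b)
really_different = normalize(snippet_a) != normalize(snippet_c)
```

Both `renamed_only` and `really_different` come out `True`: renaming alone does not change the normalized form, while a different computation does. The sketch does not address reordered statements or compiler transformations, which are the harder part of the argument above.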
