Is It Me or is the Source Comparison Page Rigged?

bob · Post by **bob** » Mon Sep 01, 2008 4:55 pm

RegicideX wrote:
And when we decompile, we end up with the "C program" the compiler created from our original, rather than our less efficient original. Then we have to continue the conversion, maintaining semantic equivalence to try to work our way back to the original code.

I don't see any "creativity" in that. We are not creating _anything_.
So let me get this straight: You agree (as you should) that you can only reconstruct a probably truncated piece of code from the machine code. But you don't think that there is any creativity involved in recreating the original code -- the original code that was truncated and rewritten by the compiler.

The mind boggles.

That's your problem, not mine. I don't follow the "truncated piece of code". I can reconstruct _exactly_ what the compiler turned into machine language, with 100% accuracy. Then I can reconstruct many semantically equivalent programs, one of which the compiler actually used to create the source it produced object code for (unrolled loops have to be re-rolled, common subexpressions that were removed have to be re-duplicated, etc.). And I have a target, the program I suspect was the original source that was copied.
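To make "semantically equivalent" concrete, here is a minimal sketch of the two transformations mentioned above: loop unrolling and common-subexpression elimination. All function names are hypothetical and none of this code comes from Fruit, Rybka, or Crafty. It only shows that the rolled and unrolled forms return the same result for the same input.

```cpp
#include <cstddef>

// Hypothetical "as written" form: a plain rolled loop.
int sum_rolled(const int* v, std::size_t n) {
    int total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += v[i];
    }
    return total;
}

// Hypothetical "as decompiled" form: the same loop unrolled by four,
// with a clean-up loop for the leftover elements.
int sum_unrolled(const int* v, std::size_t n) {
    int total = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        total += v[i] + v[i + 1] + v[i + 2] + v[i + 3];
    }
    for (; i < n; ++i) {
        total += v[i];
    }
    return total;
}

// Source form with a duplicated subexpression (a + b).
int square_sum_duplicated(int a, int b) {
    return (a + b) * (a + b);
}

// Form after common-subexpression elimination: (a + b) computed once.
int square_sum_cse(int a, int b) {
    int t = a + b;
    return t * t;
}
```

For inputs that do not overflow `int`, each pair returns identical values. Going from the second form of each pair back to the first is the "re-rolling" and "re-duplicating" step described in the post.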

How you think that is creative actually _does_ "boggle the mind". But then, I do understand the process.

bob · Post by **bob** » Mon Sep 01, 2008 5:00 pm

John wrote:
bob wrote: ... We might find that significant parts can be made to match up, so we know that significant parts came from A. Or we might not be able to show much at all came from A ... .
Is there any quantitative definition or consensus regarding the meaning of the word "significance" in the above passage? Or does it simply mean "Our common sense and experience will tell us what is significant"?

There is strong evidence that the latter view is mistaken ... no matter that the people who hold that belief are intelligent, experienced, and working in good faith. Because when the tools of evaluation and their associated criteria for significance are not specified in advance, the door is open wide to fallacies and pseudoscience.

I believe that is the only "subjective" point that might be argued. Given (say) a 10,000 line program, what is the chance that (say) 1,000 lines (10%) are absolutely identical semantically? Or, put another way, what is the chance that, given a 10,000 sentence book, 1,000 sentences are semantically equivalent? I maintain that is essentially an impossible event.

But there is no point in arguing this until the process is completed, would you not agree? There is no need to argue about 10% or 20% if there is only 1%. Or 50%. So yes, that will be a point for debate, and, based on past discussions here, it will likely never collapse to a single agreed-on number. But we need to get a ball-park number first. And we also need to see exactly where the matches are: in the core engine code, or in the user interface stuff that is less important.

RegicideX · Post by **RegicideX** » Mon Sep 01, 2008 5:12 pm

bob wrote: Then I can reconstruct many semantically equivalent programs, one of which the compiler actually used to create the source it produced object code for

Therefore, you must use creativity to simply guess what was the original source code among the many possible variants. Q.E.D.

And I have a target, the program I suspect was the original source that was copied.

Come on! You don't see how that makes your procedure entirely unscientific?

Confirmation bias --trying to prove your hunch instead of trying to disprove it-- is a basic and fundamental thing you should avoid in an investigation. Everyone is guilty of it -- but we should try to avoid it.

Not only do you not try to avoid it -- but you think that it is good methodology to choose exactly the code reconstruction that justifies your hunch!

chrisw · Post by **chrisw** » Mon Sep 01, 2008 5:13 pm

GenoM wrote: I've tried to follow this scientific discussion about methodology of research.
I gather that John Sidles and Alex K. are mostly worried that the translated code of Rybka is not exactly the same as the original code of Rybka. They insist that translated code cannot be exactly the same as the original source code, so no valid (by scientific standards) conclusions can be drawn from the results of the comparison.
Have I got it right?

Well, they're saying several things.

The Go Parser has, by force, to carry out several operations to parse the input text string. The input text string contains words from a restricted list of words, such as "infinite" or "ponder". The Go Parser parses the text and sets variables for the engine to tell it what to do, again forced.

Go Parser must carry out operations A,B,C,D,E,F,G,H and it must leave the results of its parse in variables, call them a,b,c,d,e,f,g,h

Operation A might be comparestrings("infinite", inputstring); if true, set variable a.

Operation A's across a range of programs doing Go Parser will be semantically identical usually. Likewise B,C,D,E,F,G,H. No reason why not. Too trivial for anything else.

Fruit does them in order ABCDEFGH, declaring variables abcdefgh
Rybka does them in order BDCEFGHA, declaring variables dcbfghab
Even Zach's ZCT does them, slightly differently, but basically if you look at his code, there they are. "infinite", "ponder" etc. Has to be so.
Each program has its own idiosyncrasies in addition, some more so, some less so.

We have Go Parsers, all doing the same thing, by force, and composed of sub-components also doing the same things by force. The sub-components being so trivial and so forced that they are reasonably likely to be semantically identical. Not always, not necessarily, but usually.

The main likely differences are in the ordering of it all. It matters very little in what order the parsing is done and the variables filled.

It's mostly in the ordering that differences between program writers will emerge. With some differences in style. Some differences in variable type. Some differences in initial values. All of these differences we see in the Rybka and Fruit code.
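The ordering point can be sketched with two toy "go" parsers that perform the same forced checks in different orders. This is only an illustration under stated assumptions: `GoFlags`, `parse_go_ab`, `parse_go_ba` and `same_flags` are hypothetical names, and neither function is Fruit's or Rybka's actual code.

```cpp
#include <cstring>

// Hypothetical flags a "go" parser might set for the engine.
struct GoFlags {
    bool infinite;
    bool ponder;
};

// Variant 1: initialises and tests "infinite" first, then "ponder".
GoFlags parse_go_ab(const char* token) {
    GoFlags f;
    f.infinite = false;
    f.ponder = false;
    if (std::strcmp(token, "infinite") == 0) {
        f.infinite = true;
    } else if (std::strcmp(token, "ponder") == 0) {
        f.ponder = true;
    }
    return f;
}

// Variant 2: the same operations, initialised and tested in the opposite order.
GoFlags parse_go_ba(const char* token) {
    GoFlags f;
    f.ponder = false;
    f.infinite = false;
    if (std::strcmp(token, "ponder") == 0) {
        f.ponder = true;
    } else if (std::strcmp(token, "infinite") == 0) {
        f.infinite = true;
    }
    return f;
}

// Two results are the same when every flag agrees.
bool same_flags(const GoFlags& x, const GoFlags& y) {
    return x.infinite == y.infinite && x.ponder == y.ponder;
}
```

Both variants return the same flags for every token, so the order of the checks carries no semantic information. Whether the order carries evidential information about authorship is the question the thread is arguing about, and the sketch does not settle it.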

What the antis did was to take the ordering, both of variables and program flow, and make them identical. They converted Rybka's BDCEFGHA to be identical to Fruit's ABCDEFGH, didn't reveal that they'd done this, implying that the Rybka source was just a natural disassembly, and misled the forum and the computer chess community in order to cause damage to Vas and Rybka.

The Go Parser evidence is completely busted. If this were a court case and this misleading, hidden manipulation of evidential data were revealed part way through the case, the prosecution barristers would resign and the judge would throw the case out. With costs.

bob · Post by **bob** » Mon Sep 01, 2008 5:14 pm

RegicideX wrote:
GenoM wrote: I've tried to follow this scientific discussion about methodology of research.
I gather that John Sidles and Alex K. are mostly worried that the translated code of Rybka is not exactly the same as the original code of Rybka. They insist that translated code cannot be exactly the same as the original source code, so no valid (by scientific standards) conclusions can be drawn from the results of the comparison.
Have I got it right?
No, you have not. It is perfectly true that the compiler adds a layer of uncertainty by changing the code around -- that's just part of the picture though. The real point is that a certain degree of similarity is not unlikely in programs that are short, simple and need to provide extremely similar if not identical output.

OK, but we are talking about chess programs. My program (Crafty) currently has 44,864 lines of code. That is hardly "small". The final part about "similar output" is meaningless. We have so many ways to search trees, so many ways to extend and reduce, so many ways to order the moves to make the search more efficient, and a nearly limitless number of ways to turn a position into a numeric value that can be passed through the search to find the best move, that this point is simply meaningless. Finding semantically equivalent code can happen on two levels. Nobody cares about tiny pieces of code such as PopCnt(), LSB(), MSB() and the like. But when you get into procedures that are hundreds of lines long, big chunks of semantically identical code lead to but one conclusion. Again, for a given algorithm, there are an infinite number of ways to express it syntactically. So duplication is _highly_ improbable. And that is the basis that caused a couple of people to start to look at this more closely. It is simply so improbable unless code was physically copied.

The actual source comparison shows some similarities but also lots of dissimilarities -- which is what is to be expected.

Partially true. There will necessarily be parts of the code that are different, since program B (supposedly derived from A) is significantly stronger. But chunks of duplicated code are _not_ normal in any program of more than a few lines. So that concept I do not agree with.

But if you see 10 or 15 variables which are initialized in the same order in machine code, then despite the changes made by the compiler, something is fishy -- not necessarily conclusive, but fishy.

Agreed. And if they are all initialized, even if _not_ in the same order, it is _still_ fishy. In fact, the compiler could change that order for other reasons if it wanted to. For example, it is more efficient to initialize things in the order they appear in memory to take advantage of cache prefetching. And if you are trying to de-compile, you see the final order, not the programmed order, and rearranging makes perfect sense when trying to match things up.
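A minimal sketch of why a compiler (or someone matching decompiled output) may reorder independent initializations, and why it may not reorder dependent ones. The `State` struct and every function name here are hypothetical and chosen only for illustration.

```cpp
// Hypothetical engine state with three fields.
struct State {
    int a;
    int b;
    int c;
};

// Independent initialisations: no statement reads what another writes,
// so any ordering yields the same final state.
State init_order_1() {
    State s = {0, 0, 0};
    s.a = 1;
    s.b = 2;
    s.c = 3;
    return s;
}

State init_order_2() {
    State s = {0, 0, 0};
    s.c = 3;
    s.a = 1;
    s.b = 2;
    return s;
}

// Dependent initialisations: b is computed from a, so a must be set first.
State init_dependent() {
    State s = {0, 0, 0};
    s.a = 5;
    s.b = s.a + 1;  // reads the value just written to a
    return s;
}

// Swapping the two statements violates the data dependency and changes b.
State init_dependent_swapped() {
    State s = {0, 0, 0};
    s.b = s.a + 1;  // reads a before it has been set to 5
    s.a = 5;
    return s;
}

bool same_state(const State& x, const State& y) {
    return x.a == y.a && x.b == y.b && x.c == y.c;
}
```

The first pair ends in the same state, so either order is a valid reconstruction. The second pair does not, because the swap breaks a data dependency.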

It turns out though, that the order of variable initialization was faked to make the code look more similar than the disassembler made it look -- the same goes for the order in which various "if" clauses are checked. So there is really nothing fishy in the code regarding the order.

EXTREMELY poor choice of wording with "faked". Nothing being done in this investigation is "faked". If you want to use words that are inflammatory or accusatory, feel free. But don't expect much in the way of discussion if that is the way you want to proceed. Get some experience or knowledge about the field and you will see how wrong "faked" is.

There is plenty of semantic non-equivalent code, and what similarity there is, is mostly about the general structure of the program -- making allowance for the fact that even in the structure there are dissimilarities.

Again, that is baloney. Yes, there is non-equivalent code. But in a large program, one expects to find very little semantically equivalent code. That is the issue. Do you actually believe that in writing a 44,000 line program, I will by pure chance duplicate what others have done _exactly_ here and there? When I don't even see this happen in 100-1000 line student programs???

bob · Post by **bob** » Mon Sep 01, 2008 5:15 pm

GenoM wrote:
RegicideX wrote:
GenoM wrote: I've tried to follow this scientific discussion about methodology of research.
I gather that John Sidles and Alex K. are mostly worried that the translated code of Rybka is not exactly the same as the original code of Rybka. They insist that translated code cannot be exactly the same as the original source code, so no valid (by scientific standards) conclusions can be drawn from the results of the comparison.
Have I got it right?
No, you have not.<...>
So you agree that valid conclusions can be drawn from comparison between an actual source code and the disassembled one.

Thanks.

Anyone that has ever taken a course in compiler writing would agree with this.

bob · Post by **bob** » Mon Sep 01, 2008 5:19 pm

chrisw wrote:
RegicideX wrote:
bob wrote: Fine. you are unable to follow the discussion.
Baseless assertion, followed by verbiage.

Aha. So you do _not_ understand "semantical equivalence". This is simply proving that for the same inputs, the two pieces of code produce the same output.
Then the Rybka code and the Fruit code are not semantically equivalent -- there are plenty of differences in their outcome for the same input.

But of course, that's not what you mean. You mean that you should get to ignore differences --including semantic non-equivalence-- and focus only on what you find similar.

Order is immaterial so long as changes do not violate data dependencies, name dependencies or control dependencies.
Order makes the source codes look much more similar than they really are. If ten variables in a row are initialized in the same order in machine code then that's alarming. Having variables initialized all over the place, and many of them nonexistent, is not at all alarming.

Changing the order without saying so is at best sneaky, at worst dishonest.

It is worth going back to the first presentation of the misleading source comparison data, the "Here's something to start with" thread ....

Norman Schmidt explains the "methodology", with absolutely no mention at all of any reordering of either variable or program flow. No mention of "semantics".
Norman wrote:
Please note that:
Fruit's ASSERTs have been removed
comments have been removed
there are differences between the two...mainly:
where Fruit has TRUE, Rybka has 1
where Fruit has FALSE, Rybka has 0
where Fruit has double data types, Rybka uses integer
search->info struct has been replaced with a direct reference to the independent variable.
truthfully many of the differences reflect the implementation of backend C bitboard routines in place of object-oriented C++ class references
Fruit often does error checking that is absent in Rybka,
if (ptr == NULL) my_fatal("parse_go(): missing argument\n");
Another difference:
Fruit calls string_equal to accept input, whereas Rybka uses the C function:
int strcmp( const char *string1, const char *string2 )
but these functions are essentially equivalent...
string_equal is simply a Fabien re-write with ASSERTs included:
bool string_equal( const char s1 [], const char s2 [] )
{
   ASSERT(s1 != NULL);
   ASSERT(s2 != NULL);
   return strcmp(s1, s2) == 0;
}
which is something he did often...see Fruit 2.1 util.cpp
one click...a global search and replace (for example, replacing true with 1) would rectify some of the larger differences.
Sven Schule challenged on the variable ordering:
Sven wrote:
since you don't have the Rybka source, how can you know about the location and order of variable declarations in Rybka? Declarations of local variables often do not produce assembler code, you just see the variables when they are used.
Hyatt replied without any reference to the creative reordering that had been done:
Bob wrote:
If you look at the order of the variables as they either appear in memory or on the stack, you can almost infer the order they were declared.
And Alexander Schmidt was convinced:
Alexander wrote:
TY Norman, this is completely convincing. I have not much programming experience (mainly vba) but these are too many similarities for a coincidence.
Unfortunately people will not read it and not believe it...
And Christophe Theron remained absolutely silent on the creative reordering:
Christophe wrote:
Nothing at all.

Can we possibly use the _same_ vocabulary here, which includes the definition of words?

We have a set of sticks in pile A that are colored, where each stick is a different color. We have a second set of sticks and have been asked to answer the question "are the sticks in pile B identical to the sticks in pile A".

That is not a process that requires any creativity whatsoever. I just start re-ordering the sticks in pile B so that they match, one by one, with pile A. When I finish, if everything matches, they are identical. If not, they are different. I did nothing creative here.

In the case of comparing two different programs for semantic equivalence, re-ordering is not dishonest, it is a _BASE_ part of the process. You know that, you just want to continue to inflame and distort.
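The stick comparison can be sketched as an order-insensitive comparison of two collections. This is a simplified analogy only: the function name `same_pile` is hypothetical, and the sketch sorts both piles into one canonical order instead of moving the sticks of pile B one at a time, which gives the same answer.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns true when both piles hold exactly the same colours with the
// same counts, regardless of the order in which the sticks were stacked.
// The piles are taken by value so the callers' piles are left untouched.
bool same_pile(std::vector<std::string> pile_a, std::vector<std::string> pile_b) {
    std::sort(pile_a.begin(), pile_a.end());
    std::sort(pile_b.begin(), pile_b.end());
    return pile_a == pile_b;
}
```

With pile A as red, green, blue and pile B as blue, red, green, `same_pile` reports a match. Replacing one stick in B with yellow makes it report a mismatch. Unlike sticks, program statements can only be re-ordered this freely where no data, name or control dependency is violated.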

GenoM · Post by **GenoM** » Mon Sep 01, 2008 5:20 pm

RegicideX wrote:
bob wrote: Then I can reconstruct many semantically equivalent programs, one of which the compiler actually used to create the source it produced object code for
Therefore, you must use creativity to simply guess what was the original source code among the many possible variants. Q.E.D.

And I have a target, the program I suspect was the original source that was copied.
Come on! You don't see how that makes your procedure entirely unscientific?

Confirmation bias --trying to prove your hunch instead of trying to disprove it-- is a basic and fundamental thing you should avoid in an investigation. Everyone is guilty of it -- but we should try to avoid it.

Not only do you not try to avoid it -- but you think that it is good methodology to choose exactly the code reconstruction that justifies your hunch!

As far as I understand it, every scientist setting up an experiment has expectations. Can anyone do it without expectations? No, I think one cannot avoid expectations.
But the difference between a good scientist and a bad one is that the good scientist checks his expectations against the results and corrects his expectations to match the results, while the bad one corrects the results to match his expectations.
Some sort of bias is unavoidable, but the final conclusion is what matters. So far no one here has drawn one, so your speculations are misplaced.

bob · Post by **bob** » Mon Sep 01, 2008 5:20 pm

Mike S. wrote:
tiger wrote: The side by side comparison had originally been posted by Norman but looked horrible because the formatting had been destroyed by the CCC message parser. I tried to make it look better.
And as a result of this beautification, just by correcting a formatting problem, the Fruit 2.1 source code examples suddenly were "rybkanized", as if by magic?!

So some would have you believe, yes. The entire process being used is flawed and dishonest, apparently.

chrisw · Post by **chrisw** » Mon Sep 01, 2008 5:24 pm

bob wrote:
RegicideX wrote:
GenoM wrote: I've tried to follow this scientific discussion about methodology of research.
I gather that John Sidles and Alex K. are mostly worried that the translated code of Rybka is not exactly the same as the original code of Rybka. They insist that translated code cannot be exactly the same as the original source code, so no valid (by scientific standards) conclusions can be drawn from the results of the comparison.
Have I got it right?
No, you have not. It is perfectly true that the compiler adds a layer of uncertainty by changing the code around -- that's just part of the picture though. The real point is that a certain degree of similarity is not unlikely in programs that are short, simple and need to provide extremely similar if not identical output.
OK, but we are talking about chess programs.

No we're not. We're talking about the anti chief piece of evidence, the corrupted, manipulated, misleading, transmogrified and cheated alleged source code listing of "Go Parser". A piece of code quite forced, maybe 50 lines long, that takes a text string of known possible words and sets appropriate variables for the engine. All forced and all trivial in the sub-components.

Your side deliberately, secretly, misleadingly, manipulated the source listing to line it up with a target in order to fool the forum and discredit a fellow programmer and damage his business.
