Is It Me or is the Source Comparison Page Rigged?

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

John

Re: Is It Me or is the Source Comparison Page Rigged?

Post by John »

bob wrote: ... We are talking about a direct and well known process for translating a high-level language into machine language, and then back again ...
Bob, you could serve the CCC well, by providing a link to peer-reviewed descriptions of this process.

Especially vital are reliable estimates of Type I versus Type II errors, and equally important, inter-rater reliability.

Minimizing both kinds of error, and maximizing the reliability, is where the advice of statisticians and psychologists is indispensable.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Is It Me or is the Source Comparison Page Rigged?

Post by bob »

John wrote:
bob wrote: ... "I've never heard of someone trying to decipher some ancient writing and worrying about 'bias' " ....
My wife has a degree in Egyptology, and so I can testify that your example is *exceedingly* ill-chosen! The Egyptology community squabbles even more passionately ... and upon even flimsier evidence ... than the chess programmers. :)

My point remains: no amount of intelligence and experience can compensate for a bad protocol. As Mark Twain said, "science needs only a spoonful of supposition to build a mountain of demonstrated fact!"
Here there is no bad protocol. There are dozens of current textbooks on compiling. There are hundreds of papers on optimization techniques, and hundreds more on microprocessor design, out-of-order execution, dataflow analysis, semantic analysis; the list goes on and on. There is _zero_ statistical process in any of this. Compilation is a direct and straightforward process. Perhaps deciphering was a bad example. It is more like taking a specific problem definition and turning it into a set of equations to perform that work. It is a direct process. No double-blind placebo drug trials, no Monte Carlo processes. Just direct translation. There is no "interpretation" when it comes to either C code or assembly language. Each is absolutely direct and constant in how it behaves, with no variability whatsoever.

I have no idea where this is meant to lead. But there is too much technical expertise here in these specific areas to allow the topic to be hijacked into something closer to voodoo. There is no supposition, no guesswork, no alternative meanings of instructions. Do I need statistical analysis to prove that the integral of 2x is x^2 + C? There is a direct proof that is not open to alternative interpretations. The same holds for turning C into assembler and back again.
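(The direct proof is nothing more than differentiation; in LaTeX form:

    \frac{d}{dx}\left(x^2 + C\right) = 2x
    \quad\Longrightarrow\quad
    \int 2x\,dx = x^2 + C

and no sampling or statistics is needed to check it.)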
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Is It Me or is the Source Comparison Page Rigged?

Post by bob »

John wrote:
bob wrote: ... We are talking about a direct and well known process for translating a high-level language into machine language, and then back again ...
Bob, you could serve the CCC well, by providing a link to peer-reviewed descriptions of this process.

Especially vital are reliable estimates of Type I versus Type II errors, and equally important, inter-rater reliability.

Minimizing both kinds of error, and maximizing the reliability, is where the advice of statisticians and psychologists is indispensable.
What on earth are you talking about? There is no "error" in this process. Given the line of C code a = b * c + 1;, there is one way to translate that to assembly language. You might slightly alter the order of the instructions to improve speed, but if the compiler does not have a bug, then the result must be semantically equivalent. Given the assembly code the compiler produced, it is again a direct translation to go from the assembly language back to semantically equivalent C.
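As a concrete sketch (the exact registers, addressing, and instruction ordering below are illustrative; real output depends on the compiler, target, and flags):

    /* The C statement under discussion, with a, b, c as global ints: */
    a = b * c + 1;

    /*
     * One plausible x86-64 translation (Intel syntax, illustrative only):
     *
     *   mov  eax, DWORD PTR [b]   ; load b
     *   imul eax, DWORD PTR [c]   ; eax = b * c
     *   add  eax, 1               ; eax = b * c + 1
     *   mov  DWORD PTR [a], eax   ; store the result into a
     */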

So where are you trying to take this? I have no idea who you are or what your background is; you can find mine easily enough. I have written assemblers and compilers, I have taught the courses for many years, and the others involved are also quite good, with several people (myself included) looking over their shoulders to make sure that translation errors do not occur.

So, where are you trying to go with this? And why?
User avatar
tiger
Posts: 819
Joined: Sat Mar 11, 2006 3:15 am
Location: Guadeloupe (french caribbean island)

Re: Is It Me or is the Source Comparison Page Rigged?

Post by tiger »

RegicideX wrote:

according to the legal sources we consulted:
- it is perfectly correct and legally sound to reconstruct the source code
This is not a question of legality -- it is at best a question of the ethics of discourse.

If your purpose is to present the case so that everyone can form his/her opinion, then at the very least you should say that you changed the order of the lines a lot in order to make the sources look similar.


You are correct in this regard. I have temporarily hosted the side-by-side listing you are referring to on my web site, because all I wanted to do at first was present it in a message on this board. The side-by-side comparison comes from a spreadsheet that was exported to HTML with OpenOffice. But you cannot paste HTML into a message, and I had no other solution than to upload it to my site and insert a link from the CCC message to my site.

The side-by-side comparison had originally been posted by Norman, but it looked horrible because the formatting had been destroyed by the CCC message parser. I tried to make it look better. It should be considered within the context of Norman's original message.

Thank you for pointing out that we should always give our methodology along with the data; the documents are going to be improved in this regard.



// Christophe
RegicideX

Re: Is It Me or is the Source Comparison Page Rigged?

Post by RegicideX »

bob wrote:
Order makes the source codes look much more similar than they really are. If ten variables in a row are initialized in the same order in machine code, then that's alarming. Having variables initialized all over the place, with many of them nonexistent, is not at all alarming.
Again, before making statements, do a little research on translating source code to machine language. Instructions get re-ordered by the compiler, in fact, to improve speed and reduce pipeline stalls.
Maybe there would be something to your "correction" if it were not for the fact that I said the same thing about the compiler changing the code in a previous post.

It still doesn't change the fact that a long run of identically ordered variable initializations in machine code should be alarming. Reading what I actually wrote should help you.

Changing the order without saying so is at best sneaky, at worst dishonest.
It is actually neither. Unless you consider your compiler and processor to be "sneaky or dishonest"...

That's pretty silly. I don't expect my compiler to try to make arguments and present evidence in an honest and straightforward manner -- I do expect that from human interlocutors, and it's clear that humans changed the code to make it look more similar than it is, without mentioning anything about it.
RegicideX

Re: Is It Me or is the Source Comparison Page Rigged?

Post by RegicideX »

tiger wrote:
RegicideX wrote:
This is not a question of legality -- it is at best a question of the ethics of discourse.

If your purpose is to present the case so that everyone can form his/her opinion, then at the very least you should say that you changed the order of the lines a lot in order to make the sources look similar.


You are correct in this regard.
[...]

Thank you for pointing out that we should always give our methodology along with the data, the documents are going to be improved in this regard.


// Christophe
Thank you for that. There will probably still be disagreements about the similarity of the code, but at least we should agree on what we're comparing.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Is It Me or is the Source Comparison Page Rigged?

Post by bob »

RegicideX wrote:
bob wrote:
Order makes the source codes look much more similar than they really are. If ten variables in a row are initialized in the same order in machine code, then that's alarming. Having variables initialized all over the place, with many of them nonexistent, is not at all alarming.
Again, before making statements, do a little research on translating source code to machine language. Instructions get re-ordered by the compiler, in fact, to improve speed and reduce pipeline stalls.
Maybe there would be something to your "correction" if it were not for the fact that I said the same thing about the compiler changing the code in a previous post.

It still doesn't change the fact that a long run of identically ordered variable initializations in machine code should be alarming. Reading what I actually wrote should help you.
Again, I am not sure where this is supposed to go. We _know_ how to compile from C to assembly, and we _know_ how to "uncompile" back to the C code. Let's say we take a C source program A and compile it to machine language, and call this "B". Then we "decompile" the machine language B and end up with C. A and C are semantically equivalent, by definition. But they might not be identical, for several reasons:

1. Variable names (non-global ones, anyway) get lost during compilation unless the compiler is told to keep them around for debugging use.

2. We have only the compiler's "final product" to look at, and we can't possibly know how it rearranged the source code to make it execute faster.

So once we have C, which absolutely came from A, it is just a matter of massaging C to make it look like A, or massaging A to make it look like C, and we eventually end up with a perfect match. We established semantic equality to start with, between A, B and C. But that is difficult for the casual person to see, so moving things around, while maintaining that semantic equivalence, until we get identical source code finishes the project off and leaves a clearly identifiable connection.
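As a toy illustration of how A and its decompiled counterpart C can be semantically equivalent yet textually different (the function, names, and address below are all invented for the example):

    /* A: the hypothetical original source. */
    int material(int white_pawns, int black_pawns) {
        int score = 0;
        score += 100 * (white_pawns - black_pawns);
        return score;
    }

    /* C: what a decompiler might hand back after the compiler folded the
     * arithmetic and discarded the local's name -- semantically equivalent
     * to A, but textually quite different. */
    int sub_401a30(int arg0, int arg1) {
        return (arg0 - arg1) * 100;
    }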

Nothing sinister. Nothing dishonest. Those of us who do this understand the difficulty of the "decompiling", because the compiler does far more than just translate C to machine language. It can re-order code, move or eliminate common sub-expressions to save time, and unroll loops. And when we decompile, we end up with the "C program" the compiler created from our original, rather than our less efficient original. Then we have to continue the conversion, maintaining semantic equivalence, to try to work our way back to the original code.
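A purely illustrative sketch of two such transformations (the function names and fixed sizes are invented for the example):

    /* Loop unrolling. Before: the loop as the programmer wrote it. */
    void scale_loop(int dst[4], const int src[4], int scale, int offset) {
        for (int i = 0; i < 4; i++)
            dst[i] = src[i] * scale + offset;
    }

    /* After: the form an optimizer might effectively leave behind; a
     * decompiler recovers something like this, not the original loop. */
    void scale_unrolled(int dst[4], const int src[4], int scale, int offset) {
        dst[0] = src[0] * scale + offset;
        dst[1] = src[1] * scale + offset;
        dst[2] = src[2] * scale + offset;
        dst[3] = src[3] * scale + offset;
    }

    /* Common sub-expression elimination: (a + b) is computed once into a
     * temporary instead of twice. */
    int cse(int a, int b, int c, int d) {
        int t = a + b;          /* the shared sub-expression */
        return t * c + t * d;
    }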

I don't see any "creativity" in that. We are not creating _anything_. It is quite technical in nature and requires specific skills, but it is _not_ creative. If there were any justification, I suppose a large project group could automate the process, but it would only be useful for this one task, when most people are only interested in the first half of the process: getting from source to fast machine language. But clearly, in theory, this is a two-way algorithm. If you can go from A to B, you can go from B back to A. This must be true. It is absolutely true. It might be one tough job, but so is building a fence across the USA's southern border: hard, but nobody would say "impossible" or "improbable that it could be completed". Just a big job. And somehow this gets run into statistical sampling issues and clinical trials and the like, when it is a direct transformation from A to B with no witchcraft or voodoo involved at any point.

Changing the order without saying so is at best sneaky, at worst dishonest.
It is actually neither. Unless you consider your compiler and processor to be "sneaky or dishonest"...

That's pretty silly. I don't expect my compiler to try to make arguments and present evidence in an honest and straightforward manner -- I do expect that from human interlocutors, and it's clear that humans changed the code to make it look more similar than it is, without mentioning anything about it.
That's the problem. We were having a technical discussion between people who _understood_ how this worked. We tried having it here, and there was a demand to see what was being done. Several people produced the incomplete results available so far. And now it seems we were dishonest for showing data that anybody familiar with the process would instantly understand. It was not intended to be dishonest, and that is why more data is not presently forthcoming: so that more can be completed, cross-checked, and displayed in a way that won't generate hundreds of questions and claims of dishonesty.

If you ask a good compiler guy about this, he would not think twice about what is being discussed here; it would be expected. At some point, the ideal solution would be to take source A, and executable B that some believe contains parts of A in it, and decompile B to C. And show that step first. Then start the "massaging" to undo the various tricks the compiler used to speed up the code, and show that. Then try to massage the resulting C' (order, names, and such) to maintain semantic equality with B while attempting to make it match A as closely as possible. The closer they can be made to match, the more code A and B have in common. If they could be made to match perfectly (which we do not expect, since we know A and B play differently) then we would have established absolute proof that B came from A. As it is, we might find that significant parts can be made to match up, so we know that significant parts came from A. Or we might not be able to show that much came from A at all.
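A purely illustrative example of that "massaging" step (all names below are invented): the two functions are semantically identical; only the identifiers and the order of two independent initializations differ.

    /* C': as recovered from the decompiler. */
    int sub_4020f0(int v1, int v2) {
        int v3 = v2 * 2;
        int v4 = v1 + 1;
        return v3 + v4;
    }

    /* After renaming and reordering the independent initializations to
     * line up with the suspected original A -- behavior is unchanged. */
    int evaluate(int pawns, int knights) {
        int pawn_term   = pawns + 1;    /* was v4 */
        int knight_term = knights * 2;  /* was v3 */
        return knight_term + pawn_term;
    }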

Work continues. New results are coming daily. And they are being carefully checked. And one day everyone will see everything that has been found and can make an intelligent judgement on the results, without all the name-calling, claims of dishonesty, dark motives and such.

Just let the "process" proceed at the only pace it can, which is limited by individuals' ability to spend X hours a day on this. And sooner, rather than later, there will be something to look at that is more polished and easier to follow than what has been shown to date. I'd rather see the ongoing discussions carried out here, but these threads simply make that impossible.

In the past, we were able to do this. For Crafty clone claims, someone would post some evidence, I would analyze it and post more evidence, and we would carry out the investigation in the open, where everyone could follow it in real time. But that didn't work in this case, for obvious reasons...

So, we wait.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Is It Me or is the Source Comparison Page Rigged?

Post by bob »

RegicideX wrote:
tiger wrote:
RegicideX wrote:
This is not a question of legality -- it is at best a question of the ethics of discourse.

If your purpose is to present the case so that everyone can form his/her opinion, then at the very least you should say that you changed the order of the lines a lot in order to make the sources look similar.


You are correct in this regard.
[...]

Thank you for pointing out that we should always give our methodology along with the data, the documents are going to be improved in this regard.


// Christophe
Thank you for that. There will probably still be disagreements about the similarity of the code, but at least we should agree on what we're comparing.
And you have to remember "assumed context". You overhear two doctors talking about using dopamine to stabilize a stroke victim. You listen in but don't say anything, since it is way over your head (mine too). And then later you complain to them: "But you didn't say how dangerous this drug is, and that its use is reserved for critical cases." And they say, "But we _knew_ that, and the conversation was between us; we didn't need to tell each other the obvious."

That is where some of this comes from. When an inexperienced person asks me a question here, I always explain the "why" as well as the "how", since I don't know their "context" and want to be clear. But in this discussion we were not "inexperienced"; the ideas were well-known, and everyone working on this was doing it the _same_ way. That others were not following along never occurred to us in the discussions we were having.

And even more importantly, some did not _want_ to follow along, and continued to try to blow the conversations out of the water with misquotes and other such attempts to side-track and divert attention from the real issue being discussed.
RegicideX

Re: Is It Me or is the Source Comparison Page Rigged?

Post by RegicideX »

bob wrote: ... But that is difficult for the casual person to see, so moving things around, while maintaining that semantic equivalence, until we get identical source code finishes the project off and leaves a clearly identifiable connection.
The poor casual person should be told that things were switched around a lot to make the code look more similar than it is. That's the fishy part. (Not to mention that at least one variable actually got deleted in the process.)

And since the codes being compared are not semantically equivalent and contain lots of semantically inequivalent parts, it is the shuffling around that does a lot, if not most, of the work of making the codes look similar.
RegicideX

Re: Is It Me or is the Source Comparison Page Rigged?

Post by RegicideX »

bob wrote: And when we decompile, we end up with the "C program" the compiler created from our original, rather than our less efficient original. Then we have to continue the conversion, maintaining semantic equivalence, to try to work our way back to the original code.

I don't see any "creativity" in that. We are not creating _anything_.
So let me get this straight: You agree (as you should) that you can only reconstruct a probably truncated piece of code from the machine code. But you don't think that there is any creativity involved in recreating the original code -- the original code that was truncated and rewritten by the compiler.

The mind boggles.