michiguel wrote:
bob wrote:
What on earth are you talking about? There is no "error" in this process. Given the line of C code a = b * c + 1;, there is one way to translate that to assembly language. You might slightly alter the order of the instructions to improve speed, but if the compiler does not have a bug, then the result must be semantically equivalent. Given the assembly code the compiler produced, it is again a direct translation to go from the assembly language back to the semantically equivalent C.
John wrote:
Bob, you could serve the CCC well by providing a link to peer-reviewed descriptions of this process.
bob wrote:
... We are talking about a direct and well known process for translating a high-level language into machine language, and then back again ...
Especially vital are reliable estimates of Type I versus Type II errors, and equally important, inter-rater reliability.
Minimizing both kinds of error, and maximizing the reliability, is where the advice of statisticians and psychologists is indispensable.
So where are you trying to take this? I have no idea who you are, or what your background is. You can find mine easily enough. I have written assemblers and compilers, I have taught the courses for many years, and the others involved are also quite good, with several people (myself included) looking over their shoulders to make sure that translation errors do not occur.
So, where are you trying to go with this? And why?
There is no error in the process, but there might be in the interpretation.
When someone asks "What are the chances that code A has not been copied and derived from code B?", you open the door to statistics with the word "chances". The process has no error, but you end up with a similarity that may be quantifiable. If the code is 100% identical in semantics, that is one thing, but what if it is not? Where do you draw the line? How do you define "% of similarity"? We know that it is easy to define 100%, but anything else might not be trivial. You certainly cannot deny emphatically that statistics play a role. A quick search led me to this paper:
Shared Information and Program Plagiarism Detection
Xin Chen, Brent Francia, Ming Li, Brian Mckinnon, Amit Seker
University of California, Santa Barbara
http://citeseerx.ist.psu.edu/viewdoc/su ... .1.1.10.76
It may not be the best paper, but it is the first I found in which people are trying to put all this in quantifiable terms. This may be far from solved, but as I said, if things can be quantified, statistics have a role.
I quote two paragraphs. Note that the problem resembles genome, or DNA sequence, comparison, something I already pointed out but which was not given attention:
"A common thread between information theory and computer science is the study of the amount of information contained in an ensemble [17, 18] or a sequence [9]. A fundamental and very practical question has challenged us for the past 50 years: Given two sequences, how do we measure their similarity in the sense that the measure captures all of our intuitive concepts of “computable similarities”? Practical reincarnations of this question abound. In genomics, are two genomes similar? On the internet, are two documents similar? Among a pile of student Java programming assignments, are some of them plagiarized?

This paper is a part of our continued effort to develop a general and yet practical theory to answer the challenge. We have proposed a general concept of sequence similarity in [3, 11] and further developed more suitable theories in [8] and then in [10]. The theory has been successfully applied to whole genome phylogeny [8], chain letter evolution [4], language phylogeny [2, 10], and more recently classification of music pieces in MIDI format [6]. In this paper, we report our project of the past three years aimed at applying this general theory to the domain of detecting programming plagiarisms."
Miguel
First, for plagiarism in the classroom, the process is much simpler. You can either do semantic analysis by hand, or run both programs through an automated tool.
Here there is no room for "interpretation". Yes, one can make _mistakes_, and that is a reason for having multiple people double-check. Going from C to assembly, or from assembly back to C, is not magic or voodoo; each is a well-defined process. This investigation goes one level further, which is to go from asm to a specific C source and match them up. That is also a well-defined process. There is a little "search" involved, but it is not any sort of "creative" process, as the relationship between C and assembly is pretty straightforward in either direction.
Also, while a general-purpose tool would be wonderful, it would also be _some_ project. Here we are looking at a specific machine-language instance and trying to determine how well it matches a specific C instance. That is a precisely defined goal, which simplifies things greatly compared with the issues a general-purpose automated tool must handle.
There are tools around, but they are aimed at comparing multiple programs in a common language, say C. Machine code to C is a different problem entirely, and far less common, which is probably why there is not a lot of work done on it.

I had a PhD student a few years ago who looked at "process migration on heterogeneous processors", and the problems there were quite interesting. Given a machine-language program on machine A, and a corresponding state S that defines where the program is at this instant in time, he wanted to migrate that to a different architecture with a different machine language. So he had to first map A to A' (a machine-language translation), but then, _much_ more interesting, was mapping the state S to S'. Different numbers of registers, different instruction sets; it was an interesting study. And probably more related to the current discussion than other things I have worked on.