bob wrote:RegicideX wrote:GenoM wrote:I've tried to follow this scientific discussion about methodology of research.
I can guess John Sidles and Alex K. are mostly worried about that translated code of Rybka is not exactly the same as the original code of Rybka. They are insisting on view that translated code can not be exactly the same as the original source code so from the results of comparision can not be drawn any valid (by scientific means) conclusions.
Have I got it right?
No, you have not. It is perfectly true that the compiler adds a layer of uncertainty by changing the code around -- that's just part of the picture though. The real point is that a certain degree of similarity in not unlikely in programs that are short, simple and need to provide extremely similar if not identical output.
OK. but we are talking about chess programs. My program (crafty) currently has 44,864 lines of code. That is hardly "small". the final part about "similar output" is meaningless. We have so many ways to search trees, so many ways to extend and reduce, so many ways to order the moves to make the search more efficient, a nearly limitless number of ways to turn a position into a numeric value that can be passed thru the search to find the best move, that this point is simply meaningless. Finding semantically equivalent code can happen on two levels. Nobody cares about tiny pieces of code such as PopCnt(), LSB(), MSB() and the like. But when you get into procedures that are hundreds of lines long, big chunks of semantically identical code leads to but one conclusion. Again, for a given algorithm, there are an infinite number of ways syntactically to express it. So duplication is _highly_ improbable. And that is the basis that caused a couple of people to start to look at this more closely. It is simply so improbable unless code was physically copied.
The actual source comparison shows some similarities but also lots of dissimilarities -- which is what is to be expected.
Partially true. There will necessarily be parts of the code that are different, since program B (supposedly derived from A) is significantly stronger. But chunks of duplicated code is _not_ normal in any program of more than a few lines. So that concept I do not agree with.
But if you see 10 or 15 variables which are initialized in the same order in machine code, then despite the changes made by the compiler, something is fishy -- not necessarily conclusive, but fishy.
Agreed. And if they are all initialized, even if _not_ in the same order, it is _still_ fishy In fact, the compiler could change that order for other reasons if it wanted to For example, it is more efficient to initialize things in the order they appear in memory to take advantage of cache prefetching. And if you are trying to de-compile, you see the final order, not the programmed order, and rearranging makes perfect sense when trying to match things up.
It turns out though, that the order of variable initialization was faked to make the code look more similar than the disassembler made it look -- the same goes for the order in which various "if" clauses are checked. So there is really nothing fishy in the code regarding the order.
EXTREMELY poor choice of wording with "faked". Nothing being done in this investigation is "faked". If you want to use words that are inflammatory or accusatory, feel free. But don't expect much in the way of discussion if that is the way you want to proceed. Get some experience or knowledge about the field and you will see how wrong "faked" is.
There is plenty of semantic non-equivalent code, and what similarity there is, is mostly about the general structure of the program -- making allowance for the fact that even in the structure there are dissimilarities.
again, that is baloney. yes there is non-equivalent code. But in a large program, one expects to find very little semantically equivalent code. That is the issue. Do you actually believe that in writing a 44,000 line program, that I will by pure chance duplicate what others have done _exactly_ here and there? when I don't even see this happen in 100-1000 line student programs???