Many to One

AdminX · Post by **AdminX** » Wed Sep 02, 2009 2:39 pm

Is it possible to create a virtual processor cpu (1 core) that a single processor chess engine could use to take advantage of a quad core or better CPU? Thus allowing that one virtual cpu to have close to the power of all real cores combined.

plattyaj · Post by **plattyaj** » Wed Sep 02, 2009 2:56 pm

AdminX wrote:Is it possible to create a virtual processor cpu (1 core) that a single processor chess engine could use to take advantage of a quad core or better CPU? Thus allowing that one virtual cpu to have close to the power of all real cores combined.

No.

Andy.

bob · Post by **bob** » Wed Sep 02, 2009 7:52 pm

AdminX wrote:Is it possible to create a virtual processor cpu (1 core) that a single processor chess engine could use to take advantage of a quad core or better CPU? Thus allowing that one virtual cpu to have close to the power of all real cores combined.

It is remotely possible, since computer science theory has a proof that a single-core CPU and an N-core CPU are equivalent, assuming you match up the clock rates. ie 8ghz x 1 and 1 ghz by 8 are computationally equivalent, and it is theoretically possible to write a program to make the 8-core box look like a single-core box.

But that is theory.

In practice, the 8 cores offer 8 program counters, 8 sets of registers, etc. And sequential program flow can affect the ability to execute 8 streams of instructions simultaneously. In fact, today's super-scalar architectures that can execute 3 instruction streams at once find it almost impossible to do this for significant parts of any real program because of data dependencies (a=b*c; d=a*d; where you can not do the a*d until you have completed the b*c) or control dependencies (if (a > 0) x = y / a;) You can't do the x=y/a until you know that a is > 0. So while it is possible to design such a piece of many-to-one software mapping, using it would be nearly impossible for today's programs, particularly computer chess which depends on a sequential algorithm (alpha/beta) as a basic part of the code.

bob · Post by **bob** » Wed Sep 02, 2009 7:53 pm

plattyaj wrote:
AdminX wrote:Is it possible to create a virtual processor cpu (1 core) that a single processor chess engine could use to take advantage of a quad core or better CPU? Thus allowing that one virtual cpu to have close to the power of all real cores combined.
No.

Andy.

It has already been done. Look carefully at the concept of "multiple pipelines". It is just not a given that you can keep all the pipes busy.

hgm · Post by **hgm** » Wed Sep 02, 2009 8:02 pm

I wonder how they prove that. Because to me it seems false. How could you get around the latency problem? If I want to take the tangent of the tangent of the tangent, etc., a billion times, and the CPU has a long-latency instruction to compute a tangent... How would 8 CPUs help me?

I can see that one 8GHz can do everything what eight 1GHz CPUs can do (by time-division multiplexing). But the opposite is not so clear.

bob · Post by **bob** » Wed Sep 02, 2009 9:48 pm

hgm wrote:I wonder how they prove that. Because to me it seems false. How could you get around the latency problem? If I want to take the tangent of the tangent of the tangent, etc., a billion times, and the CPU has a long-latency instruction to compute a tangent... How would 8 CPUs help me?

I can see that one 8GHz can do everything what eight 1GHz CPUs can do (by time-division multiplexing). But the opposite is not so clear.

You did see the word "theory"? In theory latency goes away. This is another example of the concept of a single-tape turing machine, and then a multiple tape turing machine (for those that know what I am talking about). And then you prove that the multiple-tape machine offers no functional advantage over the single tape turing machine.

In a single tape TM, you have a current state, you can read a character from the tape, or write a character, and change to a new state. On a multiple tape turing machine you can read from any of N tapes, or write to one of them, and change state. The proof has to do with combining the 4 possible tape characters into one "big" character" that has 4 x X bits (4 for a 4-tape TM, X is the number of bits per character). The 1 tape machine produces horrifically complex programs compared to the 4-tape, but functionally, they are proven to be equivalent.

The idea of using 4 cores to simulate 1 core is an old concept. In fact, early supercomputers did exactly this. You could buy a TI ASC machine, with 1, 2, 3 or 4 pipes, depending on how much you wanted to spend. And it could execute 1, 2, 3 or 4 instructions per clock, if your program did not run into data dependencies or control dependencies that prevented this. the 4 core problem is just another example of the same issue. 4 separate cores offer better performance than the 4-to-1 simulator approach, because you have a better chance of finding the necessary independent instruction streams if you have independent programs running.

In any case, a tour thru a good architecture book will give examples for everything from SIMD (which is very much akin to the topic at hand) compared to MIMD (which is how multiple-core boxes are currently viewed).

AdminX · Post by **AdminX** » Thu Sep 03, 2009 11:57 am

bob wrote:
AdminX wrote:Is it possible to create a virtual processor cpu (1 core) that a single processor chess engine could use to take advantage of a quad core or better CPU? Thus allowing that one virtual cpu to have close to the power of all real cores combined.
It is remotely possible, since computer science theory has a proof that a single-core CPU and an N-core CPU are equivalent, assuming you match up the clock rates. ie 8ghz x 1 and 1 ghz by 8 are computationally equivalent, and it is theoretically possible to write a program to make the 8-core box look like a single-core box.

But that is theory.

In practice, the 8 cores offer 8 program counters, 8 sets of registers, etc. And sequential program flow can affect the ability to execute 8 streams of instructions simultaneously. In fact, today's super-scalar architectures that can execute 3 instruction streams at once find it almost impossible to do this for significant parts of any real program because of data dependencies (a=b*c; d=a*d; where you can not do the a*d until you have completed the b*c) or control dependencies (if (a > 0) x = y / a;) You can't do the x=y/a until you know that a is > 0. So while it is possible to design such a piece of many-to-one software mapping, using it would be nearly impossible for today's programs, particularly computer chess which depends on a sequential algorithm (alpha/beta) as a basic part of the code.

Thanks Bob for that very informative answer.

hgm · Post by **hgm** » Thu Sep 03, 2009 1:14 pm

bob wrote:You did see the word "theory"? In theory latency goes away.

Yes, I did see the word "theory". But a gather this is a special kind of theory, then. Like: "In theory, all things fall upward". Yet there are plenty of theories that state the opposite, e.g. by Newton or Einstein. In fact they are so common that the "fall-upwards' thingy would more commonly be referred to as "nonsense" rather than "theory".

I am pretty sure that information science has developed to the pointthat there are also theories that do take latency into account.

bob · Post by **bob** » Thu Sep 03, 2009 8:24 pm

hgm wrote:
bob wrote:You did see the word "theory"? In theory latency goes away.
Yes, I did see the word "theory". But a gather this is a special kind of theory, then. Like: "In theory, all things fall upward". Yet there are plenty of theories that state the opposite, e.g. by Newton or Einstein. In fact they are so common that the "fall-upwards' thingy would more commonly be referred to as "nonsense" rather than "theory".

I am pretty sure that information science has developed to the pointthat there are also theories that do take latency into account.

Latency is everywhere. What is to say it would be significantly worse in this specific example? It certainly is not going to be a 50% penalty. So the answer to the original question remains, "it is theoretically possible, there is a proof that shows equivalency from a pure capability standpoint, what the ultimate real performance would be is harder to predict.

But again, this is absolutely possible. It would not be that hard to expand on a driver I already have that takes independent chess games and farms them out to different nodes by using a simple fork() and then tcp/ip communication. The same approach would work for this, except that the "independent chess games" are replaced by "independent instructions from the chess engine instruction stream". The problem, as I mentioned, is that it gets more and more difficult to find independent instructions as the number of streams/pipes/etc goes up. But it can certainly be done

BTW, when I used the term "theory" I did not imply any sort of bogus theory. Rather theory that has been subjected to a rigorous proof. The TM example I gave is in most good CS theory textbooks. Which is far different from a theory about negative or inverse gravity. The original pentium processor (and every processor since) did exactly this, by fetching from a single instruction stream and them farming the different instructions out to different pipelines. This idea is simply a clear extension of that idea, which has been proven to work, obviously.

Many to One

Many to One

Re: Many to One

Re: Many to One

Re: Many to One

Re: Many to One

Re: Many to One

Re: Many to One

Re: Many to One

Re: Many to One