UCI protocol and SMP

bob · Post by **bob** » Sun Nov 16, 2008 6:55 am

hgm wrote:
bob wrote: Typically, to most users, a "chip" is a thing that goes into one socket. We now have a single chip that has two internal pieces of silicon and 4 central processing units. When the original Pentium Pro came out, we had a single chip with two pieces of silicon (one for the CPU, one for the L2 cache) and just one processor.

However, I challenge _anyone_ to find any computer architecture book that uses the term "core" to mean "processor" or "cpu". The term CPU is the generally accepted technical term which is often shortened to "processor".
You don't seem very consistent in your argument. Who do you want to please? The users that think a chip is something that goes into a socket, will not be the people reading books on computer architecture. They are the people that read the ads for "Intel inside" computers that talk about "cores".

I don't see why it would be so important what books on computer architecture say. In principle, architecture is an orthogonal dimension to technology. The same architectural design can be distributed over many chips, or integrated in a single chip, depending on the capacity of the technology used, and the complexity of the design. "Socket" and "package" are not concepts from computer architecture, but very useful concepts in the technological implementation. Of course computer architecture has no need to distinguish CPUs sharing a package ("cores") from CPUs that are a package, or even a complete printed-circuit board. Not knowing the concept of "package", it is blind to the difference. But that doesn't mean the difference does not exist...

I believe I have been very consistent. The term "cpu" or "processor" is well-known and well-understood. When you boot an operating system, that's the first thing you are told about during the boot-up process: how many physical processors/cpus were found... Either term seems like a natural idea. Tell the engine how many processors/cpus to use for the parallel search, not how many threads. Older posix threads always started one more than you specified anyway. And many programs use a separate thread for I/O to read moves. Do you count those or not? I've seen users run more threads/processes than they have physical processors/cpus, killing parallel search performance.

I think it is simply cleaner to say "I want to use 2 cpus to play chess," whether the machine has 2 or 8 cores. Then you let the program start enough threads to use those processors. Ferret used to start many more processes (threads) than physical processors, but most were kept "idle" until a split was done; then one would block at the split point and another would join in to keep the same number of _active_ threads. I think that is too complicated for a user, and it is easier/clearer to specify the number of processors to use and let the program deal with threads/processes however it wants...

As far as the "difference" in packages, why does it matter to the end user or the chess engine? My program runs on Crays, which use _many_ circuit boards for a single CPU, to single chip / single-cpu, to single-chip / multiple cpu. It doesn't care, and neither do I at this level of abstraction. I just want to tell it how much of my computing resources I am willing to dedicate to playing chess, which is, I believe, pretty simple to understand.

mcostalba · Post by **mcostalba** » Sun Nov 16, 2008 10:05 pm

bob wrote: I just want to tell it how much of my computing resources I am willing to dedicate to playing chess, which is, I believe, pretty simple to understand.

If this is the goal then I agree with you that nr. of CPU is very natural.

But I have the impression that people are talking about two different concepts (and possibly parameters).

In your case you are talking about a resource limiter: "what fraction of my processing power do I want to dedicate to chess?"

That's the question that, for you, the nr. of CPU parameter is going to answer.

Other people, me included, were talking about an internal setting of the engine directly linked to the number of parallel searchers the engine will spawn (please allow me the term parallel; we could argue whether it makes sense on a single-core CPU, but that would add noise to the discussion without adding much).

In your case, setting "nr. CPU" to 3 when running on a single core should be disabled, while it is possible in the other case. Why would we do so? For testing, for one: to find races or locking bugs, someone might artificially raise the number of threads to a very high number to stress-test the thread code.

My preference for threads, in the case of Glaurung, is due to a very simple reason: they are threads (pthreads under Linux, otherwise Windows threads), so the parameter is clear, unambiguous, and well defined, having the _same_ name as the quantity it is going to tweak.

I agree this is not universally valid for _any_ engine.

Marco
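
To make the two readings concrete, here is a minimal sketch of how an engine might treat the same number differently depending on which parameter carries it. The `setoption name <id> value <x>` command shape comes from the UCI protocol; the option names `Threads` and `CPUs`, the helper names, and the clamping policy are assumptions for illustration, not the behavior of Glaurung, Crafty, or any other engine.

```python
def parse_setoption(line):
    """Split a UCI 'setoption name <id> [value <x>]' line into (name, value)."""
    tokens = line.split()
    if len(tokens) < 3 or tokens[0] != "setoption" or tokens[1] != "name":
        raise ValueError("not a setoption command: " + line)
    if "value" in tokens[2:]:
        split_at = tokens.index("value", 2)
        name = " ".join(tokens[2:split_at])
        value = " ".join(tokens[split_at + 1:])
    else:
        # Button-style options carry no value.
        name, value = " ".join(tokens[2:]), None
    return name, value


def searchers_for(name, value, available_cpus):
    """Hypothetical policy: how many searchers to spawn for each reading."""
    requested = max(1, int(value))
    key = name.lower()  # option names are compared case-insensitively here
    if key == "threads":
        # Internal setting: spawn exactly what was asked for,
        # even if that oversubscribes the machine (useful for stress tests).
        return requested
    if key == "cpus":
        # Resource limiter: never claim more processors than exist.
        return min(requested, available_cpus)
    raise ValueError("unknown option: " + name)


if __name__ == "__main__":
    for cmd in ("setoption name Threads value 3", "setoption name CPUs value 3"):
        name, value = parse_setoption(cmd)
        print(name, "->", searchers_for(name, value, available_cpus=1))
```

On a single-core machine the `Threads` reading still yields 3 searchers, while the `CPUs` reading is clamped to 1, which is one way to interpret "should be disabled" above.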

bob · Post by **bob** » Mon Nov 17, 2008 12:25 am

mcostalba wrote:
bob wrote: I just want to tell it how much of my computing resources I am willing to dedicate to playing chess, which is, I believe, pretty simple to understand.
If this is the goal then I agree with you that nr. of CPU is very natural.

But I have the impression that people are talking about two different concepts (and possibly parameters).

In your case you are talking about a resource limiter: "what fraction of my processing power do I want to dedicate to chess?"

That's the question that, for you, the nr. of CPU parameter is going to answer.

Other people, me included, were talking about an internal setting of the engine directly linked to the number of parallel searchers the engine will spawn (please allow me the term parallel; we could argue whether it makes sense on a single-core CPU, but that would add noise to the discussion without adding much).

In your case, setting "nr. CPU" to 3 when running on a single core should be disabled, while it is possible in the other case. Why would we do so? For testing, for one: to find races or locking bugs, someone might artificially raise the number of threads to a very high number to stress-test the thread code.

My preference for threads, in the case of Glaurung, is due to a very simple reason: they are threads (pthreads under Linux, otherwise Windows threads), so the parameter is clear, unambiguous, and well defined, having the _same_ name as the quantity it is going to tweak.

I agree this is not universally valid for _any_ engine.

Marco

The problem with threads is that you are not going to be able to set the _true_ number of threads. Some engines use an extra one for I/O. At least one uses way more threads than processors but never has more than NCPUS of them active at any instant of time.

The idea is just like hash tables. You don't tell me how many _entries_ to use, or how many tables to use. Or if I have two tables (the Belle approach) you don't tell me how big to make each. You just say "use 256MB of memory or less." I think that is the correct approach here as well. You tell me how many processors I can try to burn, and how I accomplish that is up to me, whether I use N threads or 4*N threads. You are trying to control/use resources. Don't try to figure out how a program accomplishes that, just tell it what to use of your physical resources...

BTW running way more threads than processors is not a good way to debug. You burn 90% of the total time spinning on locks which is a total waste. If you want to stress it, run it on 8 or 16 cores.
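
The hash-table analogy can be sketched in a few lines: the user states a budget, and the engine derives its own internal quantities from it. Everything below is an assumption for illustration (the 16-byte entry size, the power-of-two table, the `helpers_per_cpu` knob, and the function names); it is not how Crafty or any other engine actually sizes things.

```python
def hash_entries(budget_mb, entry_bytes=16):
    """Largest power-of-two entry count that fits within the memory budget.

    The user only says "use budget_mb or less"; the entry size and the
    power-of-two rounding are the engine's private business.
    """
    max_entries = (budget_mb * 1024 * 1024) // entry_bytes
    if max_entries < 1:
        raise ValueError("budget too small for a single entry")
    return 1 << (max_entries.bit_length() - 1)


def thread_plan(cpu_budget, helpers_per_cpu=1, io_threads=1):
    """Translate a CPU budget into whatever thread layout the engine prefers.

    The user only says "you may burn cpu_budget processors"; how many
    threads exist in total is an implementation detail.
    """
    if cpu_budget < 1:
        raise ValueError("need at least one processor")
    return {
        "search_threads": cpu_budget * helpers_per_cpu,  # may exceed the budget
        "io_threads": io_threads,                        # mostly blocked on input
        "max_active_searchers": cpu_budget,              # the actual promise
    }


if __name__ == "__main__":
    print(hash_entries(256))                    # 16777216 entries at 16 bytes each
    print(thread_plan(2, helpers_per_cpu=4))    # 8 search threads, at most 2 active
```

The point of the sketch is that a "threads" parameter would have to pick one of the numbers in the returned plan, whereas a "CPUs" parameter maps onto the one figure the user actually cares about.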

Michel · Post by **Michel** » Mon Nov 17, 2008 8:12 am

BTW running way more threads than processors is not a good way to debug. You burn 90% of the total time spinning on locks which is a total waste

We all know that running more active threads than processors costs performance. But we are talking about testing here. Running an engine with debug code enabled also costs performance. But for testing that's ok.

And it is not 90%. On a single CPU you should of course not use spinlocks. Even when testing.

bob · Post by **bob** » Mon Nov 17, 2008 5:14 pm

Michel wrote:
BTW running way more threads than processors is not a good way to debug. You burn 90% of the total time spinning on locks which is a total waste
We all know that running more active threads than processors costs performance. But we are talking about testing here. Running an engine with debug code enabled also costs performance. But for testing that's ok.

And it is not 90%. On a single CPU you should of course not use spinlocks. Even when testing.

So you change your program between testing and normal running, which can hide additional bugs...

Harald · Post by **Harald** » Mon Nov 17, 2008 10:35 pm

bob wrote: The problem with threads is that you are not going to be able to set the _true_ number of threads. Some engines use an extra one for I/O. At least one uses way more threads than processors but never has more than NCPUS of them active at any instant of time.

The idea is just like hash tables. You don't tell me how many _entries_ to use, or how many tables to use. Or if I have two tables (the Belle approach) you don't tell me how big to make each. You just say "use 256MB of memory or less." I think that is the correct approach here as well. You tell me how many processors I can try to burn, and how I accomplish that is up to me, whether I use N threads or 4*N threads. You are trying to control/use resources. Don't try to figure out how a program accomplishes that, just tell it what to use of your physical resources...

Hm, I'm just trying to find out how a program that uses 4*N+1 threads can restrict itself to N CPUs.

Harald

bob · Post by **bob** » Tue Nov 18, 2008 12:13 am

Harald wrote:
bob wrote: The problem with threads is that you are not going to be able to set the _true_ number of threads. Some engines use an extra one for I/O. At least one uses way more threads than processors but never has more than NCPUS of them active at any instant of time.

The idea is just like hash tables. You don't tell me how many _entries_ to use, or how many tables to use. Or if I have two tables (the Belle approach) you don't tell me how big to make each. You just say "use 256MB of memory or less." I think that is the correct approach here as well. You tell me how many processors I can try to burn, and how I accomplish that is up to me, whether I use N threads or 4*N threads. You are trying to control/use resources. Don't try to figure out how a program accomplishes that, just tell it what to use of your physical resources...
Hm, I'm just trying to find out how a program that uses 4*N+1 threads can restrict itself to N CPUs.

Harald

Bruce did this in Ferret. The issue is you have a thread searching and you want to split the work with other processors. You can do it as I did it in Crafty, which is very complicated, or you can just ask N threads to help you, and block the current thread until the search is done. Bruce chose to have lots of threads idle, but _never_ more than N (where you have N cpus) busy at any instant in time. That's why I said this is really more of a "use this many resources" rather than a "use this many threads"...
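
As a rough illustration of "many threads, but never more than N busy", the sketch below caps the number of simultaneously active workers with a counting semaphore. This is a swapped-in technique chosen for brevity, not Ferret's actual mechanism (which, as described above, blocks the splitting thread and lets helpers take over); the function name and the dummy workload are hypothetical. It caps activity rather than pinning threads to particular CPUs, and in CPython the interpreter lock already serializes the work, so it only demonstrates the bookkeeping.

```python
import threading


def run_capped(total_threads, cpu_budget, work_items=50):
    """Start total_threads workers but let at most cpu_budget run at once.

    Returns (peak_active, units_done) so the cap can be checked.
    """
    active_slots = threading.Semaphore(cpu_budget)  # one slot per allowed CPU
    stats_lock = threading.Lock()
    stats = {"active": 0, "peak": 0, "done": 0}

    def worker():
        for _ in range(work_items):
            with active_slots:  # idle here until a slot frees up
                with stats_lock:
                    stats["active"] += 1
                    stats["peak"] = max(stats["peak"], stats["active"])
                sum(i * i for i in range(100))  # stand-in for searching a subtree
                with stats_lock:
                    stats["active"] -= 1
                    stats["done"] += 1

    threads = [threading.Thread(target=worker) for _ in range(total_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return stats["peak"], stats["done"]


if __name__ == "__main__":
    n = 2
    peak, done = run_capped(total_threads=4 * n, cpu_budget=n)
    print("peak active:", peak, "units done:", done)
```

With 4*N threads created and a budget of N, the peak number of active workers never exceeds N, which is the sense in which the user's setting describes resources rather than a thread count.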
