Core i7 and chess

cyberfish · Post by **cyberfish** » Tue Jan 20, 2009 10:24 am

Again. Time to stop thinking. This is _not_ how it works. If you only have 4 threads, and you turn hyper-threading on, you may well see two threads scheduled on different logical processors, but the two logical processors are actually one physical CPU. That would be bad. Absolutely no question about it...

I thought I saw an option like "make the scheduler aware of HT" (and presumably try to balance load across physical CPUs) in Linux kernel config. So maybe schedulers are aware of HT (and won't treat every logical processor equally)?

hgm · Post by **hgm** » Tue Jan 20, 2009 11:16 am

I would expect a good OS to have this option. It is not difficult to implement at all: if the processor with the smallest number of active hyperthreads has two less threads than the one with the largest number, terminate one hyperthread of the latter, and reschedule the process on a newly activated hyperthread of the former.

bob · Post by **bob** » Tue Jan 20, 2009 6:30 pm

Karmazen & Oliver wrote:
bob wrote: With one thread, HT on or HT off will have absolutely no effect. So I am not sure what your point is here... you have two logical processors. One is used, one is not. So how can this possibly do anything useful???
oh apologize here, perhaps I don't explain to me with clarity, we have reached the conclusion that in certain processes the HT can harm, what I meant is that if "an engine", in a single processor, with two logical cores, to use a version deep, it can harm to the yield, we can leave activated the HT, and to force to the engine to use one alone of the processors, that will avoid problems associated to the use of two logical cores and system left "something" useful for other different processes in second plane...

OK, I would agree with that...

I meant that in that way, HT=on, Cores=1 forced in setup engine, at least we will get that it doesn't harm to the yield, the HT is useful in a great majority of applications, let us say that in 5 of each 10 applications the HT is useful, therefore negotiating the modules in that way doesn't need to give up the HT in all the processes...

When thinking in that way I try to extrapolate that behavior to a processor more complex, as an icore 7, but you were guessed right when indicating that the operating system doesn't go that is to distinguish and it could balance the load incorrectly. Therefore the only thing that I seek with that idea of forcing the number of processors in the setup of the engine, it is that at least, don't harm to the yield in that application and we can have the HT in other different processes.

In a linux box, with NUMA support built in, you could make that work as well. The kernel recognizes the processors in pairs. So with a 4-core I7, the O/S would see processors 0 and 1, then 2 and 3, etc. Where 0 and 1 are two hyper-threaded processors on one physical processor. One can set the processor affinity to bind thread 0 to processor 0, thread 1 to processor 2, etc, and end up with the perfect mapping of threads to physical processors... I am not sure if you can do that in windows as well, but I have done it on Linux boxes like AMD where it is important for processes to not bounce around the processor pool because of memory latency issues...

Only for some old system, but a lot of people have semi-olds cpu that have one core and HT...

bob wrote: I think it is worthless in the case of chess, period. And in general, it is a poor idea as well. There are some applications that run faster with hyper-threading. There are some that run far slower...
good, I personally think that there is more applications than the HT takes advantage to applications that don't take advantage of it... and some programs exist in those that the improvement is tiny, less than 1%... resume for me: (50% software better, 30 % worthless or zero, 20 worse %... bad result)

for example...

http://www.devhardware.com/c/a/Computer ... 5nm-CPU/2/

in an intel D 955 EE, 2 cores + 2 cores logic... the test super PI, is very good for HT, this cpu have to very good test when we test 3 or 4 programs superpi test.

2 superpi.exe programs

40.484 seconds (intel D 955 EE 2+2cores)
35.813 seconds. (AMD Fx60 2 cores)

but when we use:

4 superpi.exe programs...

53.218 seconds Intel Pentium D 955 EE
74.177 seconds. AMD FX60

Regarding in a normal use, I perceive, but it is a personal opinion that if the processor not this busy one to 100% of the load, when the processor has HT=on, the behavior it is more slight, its answer is more sensitive and more quick to the changes of applications, ...

What is not very appropriate is that to activate the HT in a processor iCore7 takes among 1/3~1/4 bigger consumption of power, if you estimate you some 90 W of consumption max in some reviews, to have active the HT implies among 20-30 W this is a true atrocity.

Can't understand the increased power. The processor has the same number of pipelines, functional units, etc with or without HT. You duplicate some state information (program counter, registers, etc) but I don't see how that would turn it into a power burner...

I think that serious possible to take advantage in a better way the use of the HT, in fact, programs calculate intensive as superpi.exe they take advantage of the HT correctly, but evidently that perhaps is because they are 4 different instances of programs, in those that each one has access different to an area by program and they don't mix among them.

The only advantage hyperthreading offers is when a thread/process stalls on one processor because of a cache miss, the other thread can continue to run. On a normal CPU memory stalls cause the process running on that CPU to sit and wait on memory. Hyperthreading lets the other process use the processor's resources and continue executing. If you do a lot of tuning so that you minimize cache misses, hyperthreading offers little or nothing. If you have a _lot_ of memory accessing in one process, and a lot of cache hits in the other, they can both run on one physical CPU and share it quite well and see a speed gain when averaged across both.

it is possible that the only form of taking advantage of the HT a little, be with the simultaneous load of different engines that uses a different area by process... and is with the ponderyn=off.

The problem more big that find the HT, it is that the system overloads one of the cores (core1+core1virtual to 95%, and leave the alone core2+core2 to 5%), that is a great nonsense...

The Linux scheduler addresses this, or at least it did. Ingo Molnar worked on this for quite a while and did a nice job in handling this specific case. When I tested, two processors on a dual-cpu HT box would see one process run on each physical CPU, which was perfect. I tested this on my office box (a dual PIV xeon with hyperthreading) when he was developing the code. I have not tried it since however...

greetings, THE chat-comunication with you has been very interesting.

Sincerely, Oliver

bob · Post by **bob** » Tue Jan 20, 2009 6:32 pm

cyberfish wrote:
Again. Time to stop thinking. This is _not_ how it works. If you only have 4 threads, and you turn hyper-threading on, you may well see two threads scheduled on different logical processors, but the two logical processors are actually one physical CPU. That would be bad. Absolutely no question about it...
I thought I saw an option like "make the scheduler aware of HT" (and presumably try to balance load across physical CPUs) in Linux kernel config. So maybe schedulers are aware of HT (and won't treat every logical processor equally)?

I mentioned this already. Ingo Molnar made some changes that addressed this issue several years ago. But for windows, I did not see anything that worked, although I have not tried it in a long time since I don't run windows myself...

I assume this is still in the Linux kernel, but since I always run with HT off, I have not looked in quite a while...

Karmazen & Oliver · Post by **Karmazen & Oliver** » Tue Jan 20, 2009 11:12 pm

The problem more big that find the HT, it is that the system overloads one of the cores (core1+core1virtual to 95%, and leave the alone core2+core2virtual to 5%), that is a great nonsense...

bob wrote: The Linux scheduler addresses this, or at least it did. Ingo Molnar worked on this for quite a while and did a nice job in handling this specific case. When I tested, two processors on a dual-cpu HT box would see one process run on each physical CPU, which was perfect. I tested this on my office box (a dual PIV xeon with hyperthreading) when he was developing the code. I have not tried it since however...

When I said that use the max cores = 8, ht=on, and limit the setup engine to a value < 8 cores, was based in this web-text:

http://www.tomshardware.com/reviews/Int ... 041-5.html

To remedy that situation, SMT looks for instruction parallelism in two threads instead of just one, with the goal of leaving as few units unused as possible. This approach can be extremely effective when the two threads are executing tasks that are highly separate. On the other hand, two threads involving intensive calculation, for example, will only increase the pressure on the same calculating units, putting them in competition with each other for access to the cache. It goes without saying that SMT is of no interest in this type of situation, and can even negatively impact performance...

in other theme, it´s interesting your coment previus... linux is flexible & great... and the programers in linux too ...

but, maybe you find interesting this:

http://www.tomshardware.com/reviews/Int ... 041-6.html

Still, the impact of SMT on performance is positive most of the time and the cost in terms of resources is still very limited, which explains why the technology is making a comeback. But programmers will have to pay attention because with Nehalem, all threads are not created equal. To help solve this puzzle, Intel provides a way of precisely determining the exact topology of the processor (the number of physical and logical processors), and programmers can then use the operating system affinity mechanism to assign each thread to a processor. This kind of thing shouldn’t be a problem for game programmers, who are already in the habit of working that way because of the way the Xenon processor (the one used in the Xbox 360) works. But unlike consoles, where programmers have very low-level access, on a PC the operating system’s thread scheduler will always have the last word.

Since SMT puts a heavier load on the out-of-order execution engine, Intel has increased the size of certain internal buffers to avoid turning them into bottlenecks. So the reorder buffer, which keeps track of all the instructions being executed in order to reorder them, has increased from 96 entries on the Core 2 to 128 entries on Nehalem. In practice, since this buffer is partitioned statically to keep any one thread from monopolizing all the resources, its size is reduced to 64 entries for each thread with SMT. Obviously, in cases where a single thread is executed, it has access to all the entries, which should mean that there won’t be any specific situations where Nehalem turns out to have worse performance than its predecessor.

After everything, perhaps the programmers if they can make something, so that the chess engines is executed in the good cores, leaving the other virtual 4 for tasks in plane segund, in that way, it´s possible to use 5 or 6 cores (HT=on, setup engine, 4+1,4+2, with a better result that alone 4 and with the HT=off)... in others system diferents linux, similar windows 7... Mac Os X...

http://www.xbitlabs.com/articles/cpu/di ... -i7_8.html

However, as we remember from our experience with Pentium 4 processors that supported a similar technology called Hyper-Threading, it may also have a negative effect in some cases. It usually happens for two reasons. The first reason why the performance may drop is the silly work of the operating system task manager that may not distinguish between the physical and virtual CPU cores and assign a pair of threads to the same physical core despite the fact that other cores are not utilized at the time. The second reason has to do with the fact that some of the internal processor buffers are shared equally between the threads when SMT is enabled. Therefore, the performance of a core processing only one thread may sometimes be lower with enabled SMT.

it is possible that this doesn't have to be this way now, in the new iCore 7 and new operating systems... at least in website Tomsharware, they indicate that NOW it is possible that the programmer makes something in this respect... the improvement can be substantial, 5~20%. Although I think that that alone it would be useful if it is not overloaded completely to the cpu, and the chess engine it uses alone among 1~3 cores virtuality... to max load 4+(1~3) cores.

bob wrote: I think it is worthless in the case of chess, period. And in general, it is a poor idea as well. There are some applications that run faster with hyper-threading. There are some that run far slower...

oliver wrote: good, I personally think that there is more applications than the HT takes advantage to applications that don't take advantage of it... and some programs exist in those that the improvement is tiny, less than 1%... resume for me: (50% software better, 30 % worthless or zero, 20 worse %... bad result)
for example...
http://www.devhardware.com/c/a/Computer ... 5nm-CPU/2/

in an intel D 955 EE, 2 cores + 2 cores logic... the test super PI, is very good for HT, this cpu have to very good test when we test 3 or 4 programs superpi test.

2 superpi.exe programs

40.484 seconds (intel D 955 EE 2+2cores)
35.813 seconds. (AMD Fx60 2 cores)

but when we use: 4 superpi.exe programs...

53.218 seconds Intel Pentium D 955 EE
74.177 seconds. AMD FX60

What is not very appropriate is that to activate the HT in a processor iCore7 takes among 1/3~1/4 bigger consumption of power, if you estimate you some 90 W of consumption max in some reviews, to have active the HT implies among 20-30 W this is a true atrocity.

bob wrote: Can't understand the increased power. The processor has the same number of pipelines, functional units, etc with or without HT. You duplicate some state information (program counter, registers, etc) but I don't see how that would turn it into a power burner...

Good, I associate that increase from the consumption to an use more effective of the cache, ok, we know that if the things don't work well it can go worse, but I believe that there are areas of the cpu that are busier, in many applications something the used of the data it is optimized in the cache, the transfer of data is possibly more intensive (date corrects or incorrect use, but cache have better use...) causing an it increases of the real use of the processor, that will avoid that the systems of energy saving are not activated and for a extreme example the voltage of the processor, in these case, it´s more time to the max-load in the processor, because it´s ... I this more used (100% load) ... let us say that although it is small fractions of seconds (ms), "those dedicated" to that the cache exchanges useful data, those small whiles that the cpu used to rest in peace or pace,

, in the case of being activated the HT, they are used to calculate a little more, certain it is that there are times it calculates bad, but when it makes it well, (well be for the software type, or for the programming style (system+engine) and correct use of the cores), when it´s makes it well, the result is interesting... This would make sense with the personal perception that when the HT this ON, the instantaneous answer is more "quick", more sensitive, this has observed it in some old micros... let us say that it´s gets jammed less...; -)

Here, I read something of this, on the increase of the consumption some months ago:

http://www.matbe.com/articles/lire/1131 ... /page8.php

link img

Good, it is certain that the HT is a patch, but a patch that it can improve 30% the yield, that is interesting, my personal opinion is that that patch is necessary due to the intel architecture, and in it finishes instance due to the architecture x86, I believe that that is due to that enormous cache that traditionally use the intel cpus, in parallel prosess those new micros, although very quick for the current software, is not correctly designed, with what is allowed that you solve (sort-times) as the HT they give good result in some ( a lot of ) cases/examples.

in other theme, other desing of multiple cpus cores, (that they have not HT options), Clearly, the new micros of IBM, powerPC, is more fast in Ghz (4~5Ghz) and more efficient in the use and administration of watts and with the data, they are evidently also a lot more expensive. (the power of calculation of those processors is impressive, (Gflops), power 6 ~ 7 IMB, and if was possible to dedicate the massive cpu contained in the cards of video of high range, software CUDA, AMD-ATI etc.. the yield increase would be exponential...after all, the new cards of high range, have 1 GByte or 2 GB of memory and the cost is 3 or 4 more cheap that the expensive cpu.s)

example powerfull GPU´s:

Link 1 video converters software... link 2 fast.

But I believe that the programmers should show interest to take advantage of the HT, a 20 or 30% it is not a worthless value and I don't think that to think in it is it useless "programming", when micros exist of more cores, the operation doesn't have to worsen, after all, alone it is that the system allows to distinguish among real or virtual cores... if we recognize that the increase of energy consumption, (HT=ON), is due to an use more complete (sometimes efficient~100%) of some parts of the processor, we are recognizing that that cpu could make it better... when HT=off...

in a pure and perfect parallel architecture, that would not have to cause any difference, but we are not in a perfect world, intel is not perfect, but it is a lot more cheap that an old digital_Alpha or a new IBM power 6~7... and intel is compatible with our current software...

...

a greeting. Oliver. From Spain.

PS: excuse if it has seemed it long text, but I believe that it was interesting to remark some points.

terminator · Post by **terminator** » Wed Jan 21, 2009 12:14 pm

How then does non-chess software get a speedup on the Intel Core i7? I was reading of the claims of 1.7 to 3.5 speedup for commercial applications on the new RHEL 5.3 Presumably all those applications are not written specifically for the Intel Core i7 and have been out for years and years?

Red Hat wrote:In internal testing, the Red Hat Engineering Performance Group has measured exceptional gains with the new Nehalem processors, with unaudited results showing gains of 1.7x for commercial applications and gains up to 3.5x for high-performance technical computing applications compared to the previous generation of Intel processors.

Gian-Carlo Pascutto · Post by **Gian-Carlo Pascutto** » Thu Jan 22, 2009 8:02 am

terminator wrote:How then does non-chess software get a speedup on the Intel Core i7? I was reading of the claims of 1.7 to 3.5 speedup for commercial applications on the new RHEL 5.3 Presumably all those applications are not written specifically for the Intel Core i7 and have been out for years and years?

1) They are comparing to the previous generation processors. Some maths problems had very bad performance on Intel Core2 because of lack of memory bandwidth. Chess programs don't care so much about that.

2) Chess programs are inherently serial, so they do not parallellize well. Some programs are inherently parallel, and will get great speedups.

In my opinion, once say 64 cores or more becomes common, we might need to look at other search algorithms for chess. But the ICGA has made it clear they do not want research into this.

bob · Post by **bob** » Thu Jan 22, 2009 5:28 pm

terminator wrote:How then does non-chess software get a speedup on the Intel Core i7? I was reading of the claims of 1.7 to 3.5 speedup for commercial applications on the new RHEL 5.3 Presumably all those applications are not written specifically for the Intel Core i7 and have been out for years and years?

Red Hat wrote:In internal testing, the Red Hat Engineering Performance Group has measured exceptional gains with the new Nehalem processors, with unaudited results showing gains of 1.7x for commercial applications and gains up to 3.5x for high-performance technical computing applications compared to the previous generation of Intel processors.

To get a speedup on hyperthreading, you can do one of two things:

(1) run two different applications simultaneously. One application must do a _lot_ of memory accesses, the other should run mostly out of cache. That lets the memory traffic thread make progress, but while it is waiting on memory, the other thread keeps the CPU busy.

(2) run an application that uses two threads, again one or the other or both doing excessive memory accesses. While one is blocked waiting for memory, the other uses the wasted cycles, and vice versa.

If you eliminate the memory traffic and run mostly out of cache, hyper-threading offers exactly nothing except for a slight amount of overhead that actually hurts a few percentage points..

bob · Post by **bob** » Thu Jan 22, 2009 5:30 pm

Gian-Carlo Pascutto wrote:
terminator wrote:How then does non-chess software get a speedup on the Intel Core i7? I was reading of the claims of 1.7 to 3.5 speedup for commercial applications on the new RHEL 5.3 Presumably all those applications are not written specifically for the Intel Core i7 and have been out for years and years?
1) They are comparing to the previous generation processors. Some maths problems had very bad performance on Intel Core2 because of lack of memory bandwidth. Chess programs don't care so much about that.

2) Chess programs are inherently serial, so they do not parallellize well. Some programs are inherently parallel, and will get great speedups.

In my opinion, once say 64 cores or more becomes common, we might need to look at other search algorithms for chess. But the ICGA has made it clear they do not want research into this.

Fortunately the ICGA can only try to "lead" us to architectural mediocrity, but we don't have to follow and can go our own direction as we please...

terminator · Post by **terminator** » Tue Feb 10, 2009 3:54 pm

As hyper-threading technology allows the acceleration of multi-threaded applications which chess programs are multi-threaded?

Core i7 and chess

Re: Core i7 and chess

Re: Core i7 and chess

Re: Core i7 and chess

Re: Core i7 and chess

Re: Core i7 and chess

Re: Core i7 and chess

Re: Core i7 and chess

Re: Core i7 and chess

Re: Core i7 and chess

Re: Core i7 and chess