Re: RAM speed and engine strength

Posted: Thu May 04, 2017 4:35 am
by Zenmastur
Milos wrote:
Zenmastur wrote:
mjlef wrote:... Where a CPU can do an instruction in a clock cycle or two involving on chip operations (so a couple of billion operations a second), external access, especially access from another NUMA node can take 100 clock cycles or more, depending on the hardware.
On a large TT, TLB misses are VERY expensive. IIRC they can actually be the limiting factor on nps. I don't remember the exact figures, but I think a TLB miss can cost 600+ cycles. When using a large TT, most TT probes will produce a TLB miss. Even at 10,000,000 nps this causes problems and becomes a limiting factor for node processing.

I get about 5% increase in NPS by going from 2133 to 2666 or 2800. But these figures will vary depending on how well your system is tuned and what the latency of the ram is. In some cases it's actually better to run a lower clock frequency on the ram if this allows the ram to be run at lower latencies.
That doesn't make much sense.
It makes perfect sense to me.
Milos wrote:TLB misses should cost roughly the same in terms of latency no matter the size of TT.
They do cost the same for the most part. There is a difference in the lookup if you get a hit for which the full decode isn't stored. But on a miss they all take a huge number of clock cycles. The problem comes when you have multiple outstanding TLB misses that need to be translated. Older Intel CPUs (Sandy Bridge, Ivy Bridge, Haswell) have one page miss handler per core (I'm not 100% sure it's per core, so I'll have to look it up). The misses get put in a queue to be serviced. If you have a lot of TLB misses this can cause the wait time to grow quite large. Can you see 1600+ clocks? Sure you can! Additionally, the page table resides in memory, which means each translation must access main memory multiple times. If the memory bus is heavily loaded, get ready to wait even longer, because these accesses are queued as well.
Milos wrote: Most of TT probes can never be TLB misses.
Oh really???

The fact is, any time the page count for the TT exceeds the TLB size there will be misses. A 16 GB TT has a total of 2^22 4k pages. A Haswell CPU has a TLB of 1024 entries. This number assumes 4k pages, as the number of entries decreases as page size is increased. TT probes access pages in a pseudo-random manner. This means that on average the referenced page will only be in the TLB about 1/(2^22/2^10) = 0.0244% of the time. The other 99.976% of the time a TLB miss will occur. So I have absolutely no idea why you think “Most TT probes can never be a TLB miss”!
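To make the arithmetic above easy to check, here is a minimal Python sketch. It assumes probes are spread uniformly over the table and that every TLB entry is available for TT pages, which is optimistic. The function name and parameters are my own, purely for illustration.

```python
def tlb_hit_fraction(table_bytes: int, page_bytes: int, tlb_entries: int) -> float:
    """Chance that a uniformly random probe hits a page already in the TLB.

    Assumes all TLB entries map pages of the table (optimistic).
    """
    pages = table_bytes // page_bytes
    return min(1.0, tlb_entries / pages)


GIB = 2**30
pages = (16 * GIB) // 4096                      # 16 GB table, 4k pages
hit = tlb_hit_fraction(16 * GIB, 4096, 1024)    # 1024-entry TLB

print(f"4k pages: {pages} = 2^{pages.bit_length() - 1}")  # 4k pages: 4194304 = 2^22
print(f"hit:  {hit:.4%}")                                  # hit:  0.0244%
print(f"miss: {1 - hit:.4%}")                              # miss: 99.9756%
```

Plugging in a larger page size and the matching (smaller) entry count for a given CPU shows how much large pages would help under the same assumptions.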
Milos wrote: And why would TLB miss probability or latency depend on RAM frequency?
Because each TLB miss generates multiple main memory accesses. Therefore the speed and latency of main memory coupled with the % of main memory bus loading become the limiting factors to page miss handling.
Milos wrote: I also doubt you could gain much by running RAM on lower frequency. Usually total latency is pretty much constant for quite some range of frequencies.


“IF” you have the right motherboard (and I do) you can adjust the main memory frequency and individual timing parameters as you see fit. Of course your memory has to be capable of actually running reliably at the speeds and latencies selected.

You can trust me on this one: The difference between running DDR4 2133 at 15-15-15-35 and running DDR4 3600 at 15-15-15-35 is a big deal. You have less than 60% of the latency and over 168% of the available bandwidth when using the faster memory. Of course this will make little difference if your application is neither latency bound nor bandwidth bound. For chess I would say latency is more important than bandwidth by a moderately large margin, i.e. something like 60/40, but this depends on how fast the CPU is; with very fast CPUs bandwidth becomes somewhat more important.
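The 60% and 168% figures follow from the DDR clocking alone. A rough sketch, assuming the usual DDR convention that the memory clock is half the data rate, and looking only at CAS latency (real access latency also includes controller and fabric overhead that this ignores):

```python
def cas_latency_ns(data_rate_mts: float, cas_cycles: int) -> float:
    """CAS latency in nanoseconds for a DDR module.

    DDR transfers data twice per clock, so clock (MHz) = data rate (MT/s) / 2.
    """
    clock_mhz = data_rate_mts / 2
    return cas_cycles / clock_mhz * 1000


slow = cas_latency_ns(2133, 15)   # DDR4 2133 at CL15
fast = cas_latency_ns(3600, 15)   # DDR4 3600 at CL15

print(f"2133 CL15: {slow:.2f} ns")
print(f"3600 CL15: {fast:.2f} ns")
print(f"latency ratio:   {fast / slow:.1%}")     # a bit under 60%
print(f"bandwidth ratio: {3600 / 2133:.1%}")     # a bit over 168%
```

With identical timings, the latency ratio is simply 2133/3600 and the peak bandwidth ratio is its inverse.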
Milos wrote:Wiki quotes 10-100 cycles for TLB miss.


I don't know who wrote that on wiki or when it was written. It may have been true when it was written, but this has changed over time due to the huge gains in CPU clock rates while only modest increases in memory clock rates have occurred.
Milos wrote: And if you use large or huge pages there won't be many misses. So even if you have 20% misses and like 1 clk cycle of TLB hit and 30 clk cycles for TLB miss that would produce 6 additional cpu cycles on average per each node evaluated.
As I noted earlier, as page size increases so does the entry size in the TLB. This yields fewer large-page entries in the TLB and even fewer huge-page entries. So while large and huge pages do help, they don't help as much as you might think. In fact, I have read several research papers that claim that for general purposes large and huge pages can actually hurt performance.
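The disagreement over TLB cost comes down to two inputs: the miss rate and the cycles per miss. A small sketch of the expected-penalty arithmetic, using the numbers quoted in this exchange (20% and 30 cycles from Milos; roughly 99.98% and 600 cycles from my earlier estimate). It assumes one TT probe per node and ignores queuing of outstanding misses, so treat it as a lower bound on the heavy-load case:

```python
def added_cycles_per_probe(miss_rate: float, miss_penalty_cycles: float) -> float:
    """Average extra cycles per TT probe caused by TLB misses."""
    return miss_rate * miss_penalty_cycles


# Figures quoted by Milos: 20% misses, 30 cycles per miss.
milos = added_cycles_per_probe(0.20, 30)

# Figures argued in this post: 4k pages on a 16 GB TT, 600 cycles per miss.
large_tt = added_cycles_per_probe(1 - 2**-12, 600)

print(f"Milos's inputs:   {milos:.1f} cycles per probe")     # 6.0
print(f"this post's case: {large_tt:.1f} cycles per probe")  # 599.9
```

The formula is the same in both cases; the two orders of magnitude between the results come entirely from the inputs.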

But, I should note here that different CPU's have different spec's and behave differently under various loads. e.g. Intel's Lake series CPU's have a 1.5k TLB and a second page miss handler, so under very heavy load they will perform differently than older Intel CPU's as will various AMD CPU's. i.e. this is very CPU dependent.

Regards,

Forrest

Re: RAM speed and engine strength

Posted: Thu May 04, 2017 8:03 am
by shrapnel
Leo wrote:I thought you were the die hard Intel man. Did you finally see the light?
And you should "see" properly what I've written.
I'm building a SECOND System, more out of curiosity really.
My PRIMARY System will always be INTEL !

Re: RAM speed and engine strength

Posted: Thu May 04, 2017 3:02 pm
by Leo
shrapnel wrote:
Leo wrote:I thought you were the die hard Intel man. Did you finally see the light?
And you should "see" properly what I've written.
I'm building a SECOND System, more out of curiosity really.
My PRIMARY System will always be INTEL !
OK

Re: RAM speed and engine strength

Posted: Thu May 04, 2017 5:29 pm
by yurikvelo
RAM speed, measured in GB/s, IOPS and latency, affects engine NPS.

But RAM clock rate does not affect GB/s, IOPS and latency directly and proportionally.

RAM bandwidth is limited by a lot of things: RAM controller clock rate and bus width, the controller<->L3 bus, and the L3<->L2 bus.

If you overclock, all buses are overclocked proportionally, and you will see a performance increase. But if you change one of the many ratios/dividers to boost the RAM clock rate, that by itself doesn't guarantee a gain in RAM read/write speed.

AMD has had the RAM controller integrated on the CPU die since K8.

AMD Ryzen integrated RAM controller has its own MB-independent limitation for RAM dividers:

Image

The number of ranks on any DIMM is the number of independent sets of DRAMs that can be accessed for the full data bit-width of the DIMM, i.e. 64 bits. The ranks cannot be accessed simultaneously as they share the same datapath.

To get 2667 for Ryzen you have to use only 2 slots and fill them with 1Rx RAM.

Re: RAM speed and engine strength

Posted: Thu May 04, 2017 5:46 pm
by shrapnel
yurikvelo wrote:AMD has had the RAM controller integrated on the CPU die since K8.

AMD Ryzen integrated RAM controller has its own MB-independent limitation for RAM dividers:

Yep, something like that.
The general idea being that since RAM is so closely associated with the CPU in the case of Ryzen, faster RAM becomes even more important for overall performance, more so than in the case of Intel.
One guy even went so far as to suggest that in the case of the 1800X, given very fast RAM, there would be almost no need to overclock the CPU!!
This assumes even more importance, given that Ryzen processors are not too overclock-friendly...

Re: RAM speed and engine strength

Posted: Fri May 05, 2017 4:56 am
by schack
Curious. Using four DIMMs (G.Skill, single-sided, 2667 rated) I got to 2400 MHz. I guess the BIOS updates really do make a difference?

Re: After re-reading my previous post...

Posted: Sat May 06, 2017 12:40 am
by Zenmastur
...I see I didn't answer one of the questions very well.
Namely this one:
Milos wrote: I also doubt you could gain much by running RAM on lower frequency. Usually total latency is pretty much constant for quite some range of frequencies.


I don't know why you think "total latency" is constant.

An extreme example might help clarify this question. Running DDR4 2400 at 10-12-12-28 1T vice running DDR4 2666 at 16-18-18-35 2T will show significant increases in NPS even though the 2400 is running at a lower speed. The difference is purely a function of latency. In less extreme examples, using the exact same dimms, small advantages can be had IN SOME CASES by running high speed dimms at lower frequencies and much lower latencies. This is mostly for those that are on a budget and want to get the highest possible NPS from their equipment. I've done this on a couple of my systems. This can require quite a bit of time tweaking the ram setting to get them just right. This is why not many people do it.
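For anyone who wants to see the example in absolute terms, here is a small sketch that converts a full timing set (CL-tRCD-tRP-tRAS, given in clock cycles) to nanoseconds. It assumes the memory clock is half the DDR data rate and ignores the command rate (1T/2T) and everything outside the DRAM itself:

```python
def timings_ns(data_rate_mts: float, timings: tuple) -> list:
    """Convert DRAM timings from clock cycles to nanoseconds.

    One clock period is 2000 / data_rate ns, because DDR moves
    two transfers per clock.
    """
    clock_period_ns = 2000 / data_rate_mts
    return [round(t * clock_period_ns, 2) for t in timings]


tight = timings_ns(2400, (10, 12, 12, 28))   # DDR4 2400 at 10-12-12-28
loose = timings_ns(2666, (16, 18, 18, 35))   # DDR4 2666 at 16-18-18-35

for name, a, b in zip(("CL", "tRCD", "tRP", "tRAS"), tight, loose):
    print(f"{name:5s} 2400: {a:6.2f} ns   2666: {b:6.2f} ns")
```

Every timing comes out shorter in nanoseconds on the 2400 configuration, even though its clock is slower.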

Regards,

Forrest

Re: After re-reading my previous post...

Posted: Mon May 08, 2017 10:19 am
by corres
Zenmastur wrote:...I see I didn't answer one of the questions very well.
Namely this one:
Milos wrote: I also doubt you could gain much by running RAM on lower frequency. Usually total latency is pretty much constant for quite some range of frequencies.


I don't know why you think "total latency" is constant.

An extreme example might help clarify this question. Running DDR4 2400 at 10-12-12-28 1T vice running DDR4 2666 at 16-18-18-35 2T will show significant increases in NPS even though the 2400 is running at a lower speed. The difference is purely a function of latency. In less extreme examples, using the exact same dimms, small advantages can be had IN SOME CASES by running high speed dimms at lower frequencies and much lower latencies. This is mostly for those that are on a budget and want to get the highest possible NPS from their equipment. I've done this on a couple of my systems. This can require quite a bit of time tweaking the ram setting to get them just right. This is why not many people do it.

Regards,

Forrest

I have an example to back up your statements.
I use Corsair 16GB 3000 MHz DDR4 RAM in my Ryzen 7 1800x rig.
The default RAM speed on Ryzen motherboards is 2133 MHz. I raised it to 2933 MHz, and that yielded only about a 2% speedup for my PC.
Because higher RAM speed means faster RAM degradation and a higher processor temperature, I returned to the default RAM speed.

Re: After re-reading my previous post...

Posted: Mon May 08, 2017 11:41 am
by Zenmastur
corres wrote:
Zenmastur wrote:...I see I didn't answer one of the questions very well.
Namely this one:
Milos wrote: I also doubt you could gain much by running RAM on lower frequency. Usually total latency is pretty much constant for quite some range of frequencies.


I don't know why you think "total latency" is constant.

An extreme example might help clarify this question. Running DDR4 2400 at 10-12-12-28 1T vice running DDR4 2666 at 16-18-18-35 2T will show significant increases in NPS even though the 2400 is running at a lower speed. The difference is purely a function of latency. In less extreme examples, using the exact same dimms, small advantages can be had IN SOME CASES by running high speed dimms at lower frequencies and much lower latencies. This is mostly for those that are on a budget and want to get the highest possible NPS from their equipment. I've done this on a couple of my systems. This can require quite a bit of time tweaking the ram setting to get them just right. This is why not many people do it.

Regards,

Forrest
I have an example to back up your statements.
I use Corsair 16GB 3000 MHz DDR4 RAM in my Ryzen 7 1800x rig.
The default RAM speed on Ryzen motherboards is 2133 MHz. I raised it to 2933 MHz, and that yielded only about a 2% speedup for my PC.
Because higher RAM speed means faster RAM degradation and a higher processor temperature, I returned to the default RAM speed.
I would get CPUID and see what speed it says you are running the CPU and RAM at. Also check what the SPD chip has the RAM programmed for, i.e. what JEDEC and XMP settings are listed. Then I would select a well-known chess position and test your favorite chess engine's average NPS on that position. Then change the memory to run 2400 @ 12-14-14-28 to see if it will run with these settings. It shouldn't have any problems if it's designed to run 3000 @ 15-17-17-35. If it runs at this speed, then test the engine again on the same position for NPS.
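If the engine speaks UCI, the average NPS can be pulled from its `info` lines rather than read off the screen. A minimal sketch in Python; the sample lines below are made up for illustration and are not real engine output, and how you capture the engine's output (log file, pipe) is left open:

```python
import re
from statistics import mean

# UCI engines report speed as "nps <number>" inside "info" lines.
NPS_RE = re.compile(r"\bnps (\d+)")


def parse_nps(info_line: str):
    """Return the nps value from one UCI output line, or None if absent."""
    match = NPS_RE.search(info_line)
    return int(match.group(1)) if match else None


def average_nps(lines):
    """Average all nps values found in the given lines (None if there are none)."""
    values = [v for v in map(parse_nps, lines) if v is not None]
    return mean(values) if values else None


# Hypothetical output captured from one search of the test position.
sample = [
    "info depth 20 seldepth 28 score cp 31 nodes 5120000 nps 2560000 time 2000 pv e2e4",
    "info depth 21 seldepth 30 score cp 29 nodes 8400000 nps 2800000 time 3000 pv e2e4",
    "bestmove e2e4",
]
print(average_nps(sample))
```

Run the same position, hash size and thread count for each RAM setting, so that the only thing that changes between measurements is the memory configuration.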

I would be interested in the results. Thanks.

Regards,

Forrest

Re: After re-reading my previous post...

Posted: Mon May 08, 2017 5:27 pm
by corres
Zenmastur wrote:
I would get CPUID and see what speed it says you are running the CPU and RAM at. Also check what the SPD chip has the RAM programmed for, i.e. what JEDEC and XMP settings are listed. Then I would select a well-known chess position and test your favorite chess engine's average NPS on that position. Then change the memory to run 2400 @ 12-14-14-28 to see if it will run with these settings. It shouldn't have any problems if it's designed to run 3000 @ 15-17-17-35. If it runs at this speed, then test the engine again on the same position for NPS.

I would be interested in the results. Thanks.

Regards,

Forrest

Before I ran the RAM at 2933 MHz I tried running it at 2666 MHz, and it worked well. The clk-s were 18 and 40, and the CPU ran at 3.7 GHz (default value!) in both cases. To investigate the changes in PC speed I used the Fritz Benchmark tool from Deep Fritz 14. The run-to-run variation in Fritz Benchmark was higher than 2%, so the speed enhancement I mentioned before is an estimated value.
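When the run-to-run variation is about as large as the gain, a quick check like the following shows whether the difference can be trusted. This is only a sketch: the benchmark scores are invented for illustration, and the rule of requiring the gain to exceed twice the relative spread is an arbitrary assumption, not a proper statistical test.

```python
from statistics import mean, stdev


def relative_spread(samples):
    """Standard deviation of the samples as a fraction of their mean."""
    return stdev(samples) / mean(samples)


def compare_runs(baseline, tuned, factor=2.0):
    """Return (gain, noise, trusted) for two sets of benchmark scores.

    'trusted' is True only if the gain exceeds factor * noise
    (the factor of 2 is an arbitrary illustrative threshold).
    """
    gain = mean(tuned) / mean(baseline) - 1
    noise = max(relative_spread(baseline), relative_spread(tuned))
    return gain, noise, gain > factor * noise


# Hypothetical benchmark scores, five runs per RAM setting.
baseline = [100, 102, 98, 101, 99]
tuned = [102, 104, 100, 103, 101]

gain, noise, trusted = compare_runs(baseline, tuned)
print(f"gain {gain:.1%}, noise {noise:.1%}, trusted: {trusted}")
```

With these invented numbers the 2% gain sits inside the noise, so more runs would be needed before drawing a conclusion.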
Now my Ryzen 7 1800x processor is running at 4.0 GHz with the default (2133 MHz) RAM speed in a very stable manner.

So sorry, but I do not want to disturb it with newer experiments.

Best regards,

Robert