mjlef wrote: ... Where a CPU can do an instruction in a clock cycle or two involving on-chip operations (so a couple of billion operations a second), external access, especially access from another NUMA node, can take 100 clock cycles or more, depending on the hardware.

Zenmastur wrote: On a large TT, TLB misses are VERY expensive. IIRC they can actually be the limiting factor on nps. I don't remember the exact figures, but I think a TLB miss can cost 600+ cycles. When using a large TT, most TT probes will produce a TLB miss. Even at 10,000,000 nps this causes problems and becomes a limiting factor to node processing.

Milos wrote: That doesn't make much sense.

It makes perfect sense to me.
I get about a 5% increase in NPS by going from 2133 to 2666 or 2800. But these figures will vary depending on how well your system is tuned and on the latency of the RAM. In some cases it's actually better to run the RAM at a lower clock frequency if that allows it to run at lower latencies.
Milos wrote: TLB misses should cost roughly the same in terms of latency no matter the size of TT.

They do cost the same for the most part. There is a difference in the lookup if you get a hit, in that the full decode isn't stored, but on a miss they all take a huge number of clock cycles. The problem comes when you have multiple outstanding TLB misses that need to be translated. Older Intel CPUs (Sandy Bridge, Ivy Bridge, Haswell) have one page miss handler per core (not 100% sure on this being per core, so I'll have to look it up). The misses get put in a queue to be serviced. If you have a lot of TLB misses, this can cause the wait time to grow quite large. Can you hit 1600+ clocks? Sure you can! Additionally, the page table resides in memory, which means each translation must access main memory multiple times. If the memory bus is heavily loaded, get ready to wait even longer, because those accesses are queued as well.
Milos wrote: Most of TT probes can never be TLB misses.

Oh really???
The fact is, any time the page count for the TT exceeds the TLB size there will be misses. A 16 GB TT spans a total of 2^22 4 KB pages. A Haswell CPU has a TLB of 1024 entries (this figure assumes 4 KB pages; the number of entries decreases as page size is increased). TT probes access pages in a pseudo-random manner, which means that on average the referenced page will be in the TLB only 1/(2^22/2^10) = 0.0244% of the time. The other 99.976% of the time a TLB miss will occur. So I have absolutely no idea why you think "Most TT probes can never be a TLB miss"!
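A quick sanity check of the arithmetic above (a minimal sketch; the 1024-entry figure is the Haswell TLB size quoted above, and probes are assumed to be uniformly random):

```python
# Expected TLB hit rate for uniformly random probes into a large table.
def tlb_hit_rate(tt_bytes: int, page_bytes: int, tlb_entries: int) -> float:
    pages = tt_bytes // page_bytes          # pages spanned by the table
    return min(1.0, tlb_entries / pages)    # fraction of pages resident in the TLB

# 16 GB TT, 4 KB pages, 1024-entry TLB -> 2^10 / 2^22
rate = tlb_hit_rate(16 * 2**30, 4 * 2**10, 1024)
print(f"{rate:.4%}")  # 0.0244% hit rate, i.e. ~99.976% of probes miss
```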
Milos wrote: And why would TLB miss probability or latency depend on RAM frequency?

Because each TLB miss generates multiple main-memory accesses. Therefore the speed and latency of main memory, coupled with the percentage of memory-bus loading, become the limiting factors in page miss handling.
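To see why memory latency dominates here: on x86-64 a TLB miss triggers a page-table walk of up to four dependent memory reads, one per paging level. A rough worst-case sketch (the 80 ns loaded memory latency and 3.5 GHz clock are assumed figures for illustration, not measurements):

```python
def walk_cost_cycles(levels: int, mem_latency_ns: float, cpu_ghz: float) -> float:
    # Worst case: every level of the page-table walk misses all caches,
    # so the walk becomes a chain of dependent main-memory reads.
    return levels * mem_latency_ns * cpu_ghz

# 4-level walk, ~80 ns per uncached read, 3.5 GHz core
print(walk_cost_cycles(4, 80.0, 3.5))  # 1120.0 cycles -- same ballpark as the 600+ figure above
```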
Milos wrote: I also doubt you could gain much by running RAM on lower frequency. Usually total latency is pretty much constant for quite some range of frequencies.
“IF” you have the right motherboard (and I do), you can adjust the main memory frequency and the individual timing parameters as you see fit. Of course, your memory has to be capable of actually running reliably at the speeds and latencies selected.
You can trust me on this one: the difference between running DDR4-2133 at 15-15-15-35 and running DDR4-3600 at 15-15-15-35 is a big deal. You have less than 60% of the latency and over 168% of the available bandwidth with the faster memory. Of course, this will make little difference if your application is neither latency bound nor bandwidth bound. For chess I would say latency is more important than bandwidth by a moderately large margin, i.e. something like 60/40, but this depends on how fast the CPU is; with very fast CPUs bandwidth becomes somewhat more important.
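Those two percentages can be checked directly. DDR transfers twice per clock, so the first-word CAS latency in nanoseconds is CL x 2000 / (MT/s), and bandwidth scales with the transfer rate (a sketch of the arithmetic, nothing more):

```python
def cas_ns(cl: int, mt_s: int) -> float:
    # DDR is double data rate: clock period in ns = 2000 / (MT/s)
    return cl * 2000 / mt_s

slow = cas_ns(15, 2133)  # ~14.07 ns at DDR4-2133 CL15
fast = cas_ns(15, 3600)  # ~8.33 ns at DDR4-3600 CL15
print(f"latency ratio:   {fast / slow:.1%}")   # ~59% -- "less than 60% of the latency"
print(f"bandwidth ratio: {3600 / 2133:.1%}")   # ~169% -- "over 168% of the bandwidth"
```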
Milos wrote:Wiki quotes 10-100 cycles for TLB miss.
I don't know who wrote that on the wiki or when it was written. It may have been true at the time, but this has changed: CPU clock rates have made huge gains over the years while memory clock rates have increased only modestly.
Milos wrote: And if you use large or huge pages there won't be many misses. So even if you have 20% misses and like 1 clk cycle of TLB hit and 30 clk cycles for TLB miss that would produce 6 additional cpu cycles on average per each node evaluated.

As I noted earlier, as page size increases, the TLB holds fewer entries: there are fewer large-page entries than 4 KB entries, and fewer huge-page entries still. So while large and huge pages do help, they don't help as much as you might think. In fact, I have read several research papers claiming that for general-purpose workloads large and huge pages can actually hurt performance.
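The same arithmetic as for 4 KB pages shows why large pages only partially fix this. For illustration, assume a 16 GB TT, 2 MB pages, and 32 large-page TLB entries (a figure in the range of older Intel first-level data TLBs; check your own CPU's documentation for the real number):

```python
# 16 GB TT split into 2 MB large pages
pages_2m = (16 * 2**30) // (2 * 2**20)   # 8192 large pages
hit_rate = min(1.0, 32 / pages_2m)       # assumed 32 large-page TLB entries
print(f"{hit_rate:.2%}")  # 0.39% -- still almost every probe misses the TLB
```

Better than 0.0244%, but nowhere near "not many misses".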
But I should note here that different CPUs have different specs and behave differently under various loads. E.g., Intel's Lake-series CPUs have a 1.5k-entry TLB and a second page miss handler, so under very heavy load they will perform differently than older Intel CPUs, as will various AMD CPUs. I.e., this is very CPU dependent.