I use Fedora.
Maybe this helps: https://community.linuxmint.com/softwar ... s-unstable
Otherwise, try compiling it directly from source. Apparently the source code is part of the Linux kernel source, under tools/perf.
I just tried this myself. I got a bunch of warnings, probably because the version of the kernel source I downloaded does not correspond to the kernel headers currently installed on my machine, but it produced a binary that seems to work.
Code:
#include <array>
#include <thread>

class Search;                    // the engine's per-thread search state, defined elsewhere

constexpr int MAX_THREADS = 8;   // assumed limit; the engine defines its own value

struct Thread {
    std::thread thread;          // the OS thread
    Search* search;              // this thread's search state
};

class ThreadPool
{
    std::array<Thread, MAX_THREADS> threads;
};
Carbec wrote: ↑Tue Mar 28, 2023 10:46 am
Hello,
I now have a dual boot Windows/Linux, so I made a little experiment. The same test: positions from BT2630.epd, maximum depth 10.

threads : Mnps (Linux) : Mnps (Windows)
---------------------------------------
1 : 5.2 : 4.5
2 : 10.0 : 8.6
3 : 14.7 : 13.3
4 : 19.0 : 16.1
5 : 20.9 : 19.1
6 : 22.0 : 21.6
8 : 22.9 : 23.7

Don't ask me why there is a difference!

First, your Windows executable is slower than the Linux one (it is often the case, and I guess because of a different ABI).
The gain in time is a bad metric. The gain in strength is the right metric. Here lazy SMP shines.
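For anyone wanting to reproduce this kind of scaling measurement, here is a minimal, self-contained sketch of the usual method: run the same fixed workload on n threads, count "nodes" in a shared atomic counter, and report aggregate nodes per wall-clock second. The workload below is a dummy stand-in for a fixed-depth search of the BT2630 positions; only the measurement scaffolding is the point.

Code:

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Dummy stand-in for a fixed-depth search; a real run would search the
// BT2630 positions to depth 10 and count actual nodes.
static void searchWorkload(std::atomic<long long>& nodes) {
    volatile unsigned x = 1;
    for (long long i = 0; i < 200'000'000; ++i) {
        x = x * 1664525u + 1013904223u;        // busy work, one "node" each
        if ((i & 4095) == 0)
            nodes.fetch_add(4096, std::memory_order_relaxed);
    }
}

int main() {
    for (int n : {1, 2, 3, 4, 5, 6, 8}) {
        std::atomic<long long> nodes{0};
        auto t0 = std::chrono::steady_clock::now();

        std::vector<std::thread> pool;
        for (int i = 0; i < n; ++i)
            pool.emplace_back(searchWorkload, std::ref(nodes));
        for (auto& t : pool)
            t.join();

        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
        std::printf("%d threads : %.1f Mnps\n", n, nodes.load() / dt.count() / 1e6);
    }
}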
Carbec wrote: ↑Thu Mar 30, 2023 3:04 pm
Hello,
I finally changed my distribution to Ubuntu Cinnamon. I love it :)
I redid my little test: same positions, but 10 s per move; 1 s is too short to get a good idea. These are the values for Linux, I didn't do the tests for Windows. I'm too lazy :)
I could also use perf c2c; from what I understand, there is no false sharing!

threads : Mnps (Linux)
----------------------
1 : 5.1
2 : 10.2
3 : 15.0
4 : 19.5
6 : 22.7
8 : 24.9

The speeds you are reporting now for 1-4 threads indeed suggest that everything is fine.
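For context on what perf c2c looks for: false sharing happens when data written by different threads lands on the same 64-byte cache line (for example, adjacent per-thread node counters), so the line ping-pongs between cores even though no datum is actually shared. The usual fix is to align each per-thread structure to its own cache line. A minimal sketch of the pattern, with illustrative names that are not from Carbec's engine; on the unpadded layout, perf c2c record / perf c2c report would be expected to flag the counters' line:

Code:

#include <atomic>
#include <thread>
#include <vector>

// One counter per thread, each aligned to its own 64-byte cache line
// (64 bytes is the line size on x86), so a thread writing its own counter
// never invalidates the line another thread is writing.
// (The over-aligned element type needs C++17's aligned operator new.)
struct alignas(64) PaddedCounter {
    std::atomic<long long> nodes{0};
};

int main() {
    constexpr int N = 8;
    std::vector<PaddedCounter> counters(N);  // without alignas(64), neighbours
                                             // would share a line: false sharing

    std::vector<std::thread> pool;
    for (int i = 0; i < N; ++i)
        pool.emplace_back([&counters, i] {
            for (int k = 0; k < 50'000'000; ++k)
                counters[i].nodes.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& t : pool)
        t.join();
}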
syzygy wrote: ↑Sun Mar 26, 2023 1:36 am
Or is this wrong if the cores are accessing the same memory channel?
Does someone know how this works? Are memory accesses effectively pipelined by the memory controller, so that access latencies overlap? Or does the memory controller handle simultaneous cache misses by different cores/threads strictly sequentially?

Interesting question, I don't know.
From the abstract of the SALP paper (Kim et al., "A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM", ISCA 2012):

Modern DRAMs have multiple banks to serve multiple memory requests in parallel. However, when two requests go to the same bank, they have to be served serially, exacerbating the high latency of off-chip memory. Adding more banks to the system to mitigate this problem incurs high system cost. Our goal in this work is to achieve the benefits of increasing the number of banks with a low-cost approach. To this end, we propose three new mechanisms, SALP-1, SALP-2, and MASA (Multitude of Activated Subarrays), to reduce the serialization of different requests that go to the same bank. The key observation exploited by our mechanisms is that a modern DRAM bank is implemented as a collection of subarrays that operate largely independently while sharing few global peripheral structures.
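One way to probe syzygy's question empirically is a pointer-chase microbenchmark: within one chain every load depends on the previous one, so a single chain exposes the full DRAM miss latency, while several independent chains give the core and the memory controller misses they can overlap. If k chains take much less than k times as long as one chain, the accesses are effectively pipelined. A rough, self-contained sketch; the buffer size and step count are arbitrary and machine-dependent:

Code:

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    // Buffer much larger than the last-level cache, so chased loads miss to DRAM.
    const std::size_t N = std::size_t{1} << 25;       // 32M entries * 8 B = 256 MB
    std::vector<std::size_t> next(N);
    std::iota(next.begin(), next.end(), std::size_t{0});
    std::shuffle(next.begin(), next.end(), std::mt19937_64{42});  // random permutation

    for (int chains : {1, 2, 4, 8}) {
        std::vector<std::size_t> p(chains);
        for (int c = 0; c < chains; ++c) p[c] = static_cast<std::size_t>(c);

        auto t0 = std::chrono::steady_clock::now();
        const long steps = 5'000'000;                 // dependent loads per chain
        for (long s = 0; s < steps; ++s)
            for (int c = 0; c < chains; ++c)
                p[c] = next[p[c]];    // each chain is serially dependent, but the
                                      // chains are independent, so misses can overlap
        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;

        std::size_t checksum = std::accumulate(p.begin(), p.end(), std::size_t{0});
        std::printf("%d chains: %.2f s, %.1f ns per step (checksum %zu)\n",
                    chains, dt.count(), dt.count() / steps * 1e9, checksum);
    }
}

On most modern machines one would expect the time per step to stay nearly flat from 1 up to several chains, i.e. the misses overlap rather than queue strictly sequentially; the subarray/bank parallelism described in the abstract above is what provides that concurrency at the DRAM end.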