Zach Wegner wrote:
diep wrote:
In principle you should ALWAYS avoid accessing the same cachelines whenever possible, as you never know which entity owns that cacheline and what that entity has done with it. With local cachelines there is no such problem.
Maybe in principle, but there are plenty of exceptions where you can do reads on shared memory without fear of hitting a dirty cache line. For example, if I'm deciding whether to split or not based on the current iteration number, the chance that it is in our cache but has been modified is virtually nil. In many other instances it's simply unavoidable, short of doing without shared memory entirely.

Dead wrong. Realize how much system time in a program like Crafty goes to cache snoops on a 2-socket system. Do you want to increase that?
It's easier to keep data local and not shared. AMD uses a different, better protocol for this on 4-socket systems (which, I was told, is most likely the reason the Xeon MP hasn't been released yet, as this is tough to get bug-free). It derives from the DEC Alpha approach, which nowadays is a tad more complex than it was some years ago. That's true for both Intel and AMD.
They try all kinds of tricks to avoid shipping a coherency message from one memory controller to a remote memory controller. Let me try to put it simply.
The obvious trick is: when a cacheline is held only locally and is not in any other L3 cache, you don't need to ship any "possible update" message for that cacheline at all.
In any test you seriously measure, the effect is very significant: doing things ONLY locally at a core is ALWAYS better on multi-socket systems that are NUMA.
Nowadays that means both Intel and AMD.
NUMA is simply the cheapest way of building these machines, and by now the OSes schedule reasonably well for it (not perfectly).
So in a model with two memory controllers A and B, the best you can do is obviously to take care that a cacheline lives only at A and is not shared with cores at B.
In Diep the evaluation tables and pawn tables are local, of course: every core allocates its own tables. The hashtable also gets allocated locally but is shared across all cores. Note that the hashtable ALSO stores evaluation information.
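On Linux this kind of per-core allocation works out through the first-touch page policy: whichever thread first writes a page gets that page placed on its own NUMA node. A minimal sketch of the idea; the table size and function name are made up for illustration, and the worker is assumed to already be pinned to its core:

Code:
#include <stdlib.h>
#include <string.h>

#define EVALTABLE_BYTES (4u << 20)  /* hypothetical size: 4 MB per core */

/* Called by each worker thread AFTER it has been pinned to its core.
   malloc only reserves address space; the memset is the first touch,
   so the kernel places the pages on the toucher's local NUMA node. */
static void *alloc_local_table(void)
{
    void *table = malloc(EVALTABLE_BYTES);
    if (table != NULL)
        memset(table, 0, EVALTABLE_BYTES);
    return table;
}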
If I used shared evaluation tables, in theory that could improve the evaluation-table hit rate by 1% or so.
So if I were guaranteed fast shared memory, that would certainly be a point of improvement on single-socket machines, but the way things are moving currently it's not worth doing.
I'm not locking a single table, of course. Tim Mann's XOR trick gets used everywhere.
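For reference, the XOR trick (described by Hyatt and Mann for lockless transposition tables) stores the key XORed with the data, so a torn write by another core is detected at probe time instead of being locked against. A minimal sketch with hypothetical 64-bit entry fields:

Code:
#include <stdint.h>

typedef struct {
    uint64_t key_xor_data;  /* key ^ data, written without any lock */
    uint64_t data;
} HashEntry;

/* Store: no lock taken; a concurrent writer can interleave with us,
   but then key_xor_data and data will no longer match up. */
static void hash_store(HashEntry *e, uint64_t key, uint64_t data)
{
    e->data = data;
    e->key_xor_data = key ^ data;
}

/* Probe: recompute the XOR; if another core tore the entry apart,
   the check fails and we simply treat it as a miss. */
static int hash_probe(const HashEntry *e, uint64_t key, uint64_t *data)
{
    uint64_t d = e->data;
    if ((e->key_xor_data ^ d) != key)
        return 0;  /* corrupted or different position: miss */
    *data = d;
    return 1;
}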
In Diep, when I want to be sure that nothing else can get written into a given cacheline, I manually do something like this:
Code:
struct MovesMade {
    /* DIEP NUMA SMP */
    /* The dummy arrays pad each hot variable into its own cacheline,
       so a write to one cannot invalidate its neighbours. */
    char dummycachelineabxle[CACHELINELENGTH];
    lockvar lock_initidlelist;
    char dummycachelinefoslo[CACHELINELENGTH];
    volatile int totalinitidle;
    volatile int initidleproccies[MAXPROCESSES];
    char dummycachelinexoiws[CACHELINELENGTH];
    int nidlelists;          // number of idle lists
    volatile int uselistnr;  // process number where the idle list for this process is located
    char dummycachelineolkeh[CACHELINELENGTH];
    // tricky: everything below shares the same cacheline now
    volatile int totalidle;  // total number of CPUs that are idle
    char dummycachelineikqzw[CACHELINELENGTH];
    ...
The first variable is a lock. Of course other processes might also lock it.
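The actual lockvar type isn't shown here, but for illustration a classic test-and-test-and-set spinlock along these lines would fit, sketched with GCC's legacy atomic builtins:

Code:
typedef volatile int lockvar;

static void lock_acquire(lockvar *l)
{
    /* __sync_lock_test_and_set atomically writes 1 and returns the old
       value; spinning on a plain read in between keeps bus traffic down. */
    while (__sync_lock_test_and_set(l, 1)) {
        while (*l)
            ;
    }
}

static void lock_release(lockvar *l)
{
    __sync_lock_release(l);  /* atomically writes 0 with release semantics */
}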
Let's suppose some other process P1 tries to grab the lock from P0.
P0 is just modifying the data structure and writing to the cacheline.
That means, theoretically speaking, it is obviously possible for P1 to overwrite data that P0 just tried to modify. To avoid that, I keep a cacheline-sized gap between the variables.
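For comparison, a C11 compiler can insert those gaps itself: aligning every hot member to the cacheline size forces each one onto its own line, with the padding computed for you. A minimal sketch, assuming 64-byte cachelines and illustrative field names:

Code:
#include <stdalign.h>

#define CACHELINELENGTH 64  /* assumption: 64-byte cachelines */

struct IdleStateAligned {
    /* Each member starts on a cacheline boundary, so the compiler
       pads between them and no two members share a line. */
    alignas(CACHELINELENGTH) volatile int lock_initidlelist;
    alignas(CACHELINELENGTH) volatile int totalinitidle;
    alignas(CACHELINELENGTH) volatile int totalidle;
};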
Note the odds of this happening are rather tiny. In reality things work a tad more complexly than the above theory, so it goes wrong on fewer occasions than in theory it should.
Which is why so many crappy programs don't crash often, just seldom.
I hope that answers some questions I also saw in another thread here.