Memory usage in make/unmake vs logic complexity

Discussion of chess software programming and technical issues.

Moderator: Ras

matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Memory usage in make/unmake vs logic complexity

Post by matthewlai »

Aleks Peshkov wrote:
matthewlai wrote:I am rewriting my engine because I want to use it as basis for my master's research (I have a few novel ideas on how to apply machine learning), and I don't want to be handicapped by slow low level stuff.
You would be handicapped anyway unless you start from Stockfish.
I don't mind it not being as fast as Stockfish. I just need it to not be something like 90% slower, which is the case with my old engine. Right now it's about 33% slower than Crafty, and I'm OK with that.

Yes, I thought about starting from SF. But my modifications will be so extensive that it wouldn't really save any time. It would be better to build a codebase from the ground up exactly the way I like it.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Memory usage in make/unmake vs logic complexity

Post by matthewlai »

syzygy wrote:
matthewlai wrote:That means even an L1 hit is equivalent to about 10 instructions, and an L2 hit is about 24 instructions.
Not correct. The CPU can execute instructions while waiting for L1 or L2.

With copy/make, both the old and the new state memory will, as a rule, be in L1. The state is copied many bytes at a time. This is known to be quite competitive on modern CPUs.
I rewrote mine to store and load the entire position state wholesale (about 210 bytes). Same move generation routine, same ApplyMove logic, and I get about 900Knps.

This is with 2 copies (one on make, one on unmake), while a more sensible implementation would require just 1. But even with the theoretical maximum 2x speedup from that, it would still only be in the range of 2Mnps. That's still an order of magnitude slower than my selective-copy approach (I have no idea what to call that anymore).

I basically memcpy all the board state (~210 bytes) into a stack of statically allocated buffers at the beginning of ApplyMove, and memcpy it back for unmake.
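
In sketch form, it's something like this (illustrative names only; PositionState, SaveState, RestoreState and MAX_PLY are made up, not my actual code):

Code: Select all

#include <string.h>

#define MAX_PLY 128

// Whole position state, copied wholesale (~210 bytes in my case)
typedef struct
{
  unsigned char bytes[210];
} PositionState;

static PositionState stateStack[MAX_PLY];  // statically allocated, nothing allocated during search
static int stackTop = 0;

// Called at the beginning of ApplyMove, before the position is mutated
void SaveState(const PositionState *pos)
{
  memcpy(&stateStack[stackTop++], pos, sizeof(PositionState));
}

// Unmake is just copying the whole state back
void RestoreState(PositionState *pos)
{
  memcpy(pos, &stateStack[--stackTop], sizeof(PositionState));
}
The 1-copy variant would have the search work directly on stateStack[ply], so unmake becomes just --stackTop with no copy back.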

Is there a copy/make engine that actually achieves reasonable NPS in perft with no bulk counting and no hash? (with bulk counting the speed of make/unmake would have a much lower impact on total perft time)
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Memory usage in make/unmake vs logic complexity

Post by sje »

Some time ago, I did extensive testing comparing incremental restore vs copy restore. Essentially, incremental restore wins for both bitboard and mailbox chess programs. I suspect that the margin of advantage increases over time as memory speeds increasingly lag behind CPU speeds.

Of course, there are some values which must be copy restored, as they aren't time-reversibly computable. The half move counter is one of these, and there are others, like the en passant target square (and its associated hash contributions).

But for the board and any bitboard database, it's faster to do incremental restore; there's just too much data to be moved otherwise.
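
For the simplest case, a non-capture move, the incremental update is its own inverse, so execute and retract can share it (a sketch; the names are illustrative):

Code: Select all

#include <stdint.h>

typedef uint64_t Bitboard;

// Toggling the from and to bits both executes and retracts the motion,
// since b ^ m ^ m == b.
void ToggleMotion(Bitboard *bb, int from, int to)
{
  *bb ^= (1ULL << from) | (1ULL << to);
}
Captures additionally toggle the captured man's bitboard, which requires saving the captured man's identity; but that is a few bytes, not a whole bitboard set.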
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Memory usage in make/unmake vs logic complexity

Post by matthewlai »

sje wrote:Some time ago, I did extensive testing comparing incremental restore vs copy restore. Essentially, incremental restore wins for both bitboard and mailbox chess programs. I suspect that the margin of advantage increases over time as memory speeds increasingly lag behind CPU speeds.

Of course, there are some values which must be copy restored, as they aren't time-reversibly computable. The half move counter is one of these, and there are others, like the en passant target square (and its associated hash contributions).

But for the board and any bitboard database, it's faster to do incremental restore; there's just too much data to be moved otherwise.
In my program I am basically copying any field (bitboard or mailbox square) that has been touched (usually that's about 5 fields). That seems to work as well, and is probably easier than incrementally updating bitboards.

It feels like the extra CPU instructions (to determine what needs restoring) are almost free now, with superscalar execution and shorter pipelines (compared to Pentium 4 era stuff).
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
User avatar
Daniel Mehrmann
Posts: 858
Joined: Wed Mar 08, 2006 9:24 pm
Location: Germany
Full name: Daniel Mehrmann

Re: Memory usage in make/unmake vs logic complexity

Post by Daniel Mehrmann »

Aleks Peshkov wrote:You would be handicapped anyway unless you start from Stockfish.
:shock:

That's wrong. There are so many ways to implement move_do() or move_undo(), depending on the framework you like to use.

If you use a different framework, you might find approaches as fast as SF's. SF is not god-level.
syzygy
Posts: 5896
Joined: Tue Feb 28, 2012 11:56 pm

Re: Memory usage in make/unmake vs logic complexity

Post by syzygy »

matthewlai wrote:Is there a copy/make engine that actually achieves reasonable NPS in perft with no bulk counting and no hash? (with bulk counting the speed of make/unmake would have a much lower impact on total perft time)
Perft speed is pretty much irrelevant. Komodo, at least several versions ago, was said to use copy/make. I don't know perft results for Komodo.

SF uses copy/make for part of the state.

http://talkchess.com/forum/viewtopic.php?t=50805
bob wrote:First point of notice, copy/make STILL slows me down by almost 10%, (...)
Of course that's overall nps, not perft.

Another thread:
http://talkchess.com/forum/viewtopic.php?t=39938

A pure make/unmake might have the advantage that all state is in one struct pointed to by one register. With a hybrid approach as in my engine (but also in SF), the state is divided over two structs. This might speak for saving the value of the hash key during make and restoring it during unmake instead of creating the new value in the next element of a state array and using the value in that next element. But I doubt that it would be faster.
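
In sketch form, such a hybrid looks something like this (illustrative names; not SF's actual code):

Code: Select all

#include <stdint.h>

// Big state: updated and restored incrementally, never copied
typedef struct
{
  int board[64];
  // bitboards etc.
} Board;

// Small state: copied forward into the next array element on every make
typedef struct
{
  uint64_t key;   // hash key
  int castling;   // castling rights
  int epSquare;   // en passant square
  int rule50;     // halfmove clock
} StateInfo;

typedef struct
{
  Board     board;
  StateInfo states[128];
  int       ply;
} Position;

void make_move(Position *pos /*, Move m */)
{
  pos->states[pos->ply + 1] = pos->states[pos->ply];  // copy the small struct forward
  pos->ply++;
  // ... incrementally update pos->board and pos->states[pos->ply] for the move ...
}

void unmake_move(Position *pos /*, Move m */)
{
  // ... incrementally restore pos->board ...
  pos->ply--;  // the small state simply pops; nothing to copy back
}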
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Memory usage in make/unmake vs logic complexity

Post by bob »

syzygy wrote:
matthewlai wrote:Is there a copy/make engine that actually achieves reasonable NPS in perft with no bulk counting and no hash? (with bulk counting the speed of make/unmake would have a much lower impact on total perft time)
Perft speed is pretty much irrelevant. Komodo, at least several versions ago, was said to use copy/make. I don't know perft results for Komodo.

SF uses copy/make for part of the state.

http://talkchess.com/forum/viewtopic.php?t=50805
bob wrote:First point of notice, copy/make STILL slows me down by almost 10%, (...)
Of course that's overall nps, not perft.

Another thread:
http://talkchess.com/forum/viewtopic.php?t=39938

A pure make/unmake might have the advantage that all state is in one struct pointed to by one register. With a hybrid approach as in my engine (but also in SF), the state is divided over two structs. This might speak for saving the value of the hash key during make and restoring it during unmake instead of creating the new value in the next element of a state array and using the value in that next element. But I doubt that it would be faster.
I have two hash signatures, one for all pieces, one for pawns. I don't unmake either; I just save the original and then restore it.
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Oscar's ten scalars

Post by sje »

Oscar performs a save/restore of ten scalars for each move execute/retract operation. The move is saved, too. Together, these ten variables reside in two structures:

Code: Select all

// FEN scalar set; saved/restored on execute/retract

typedef struct
{
  Color good;  // Color on the move
  Color evil;  // Color not on the move
  Cabs  cabs;  // Castling availability bits
  Sq    epsq;  // En passant target square
  ui    hmvc;  // Half move counter
  ui    fmvn;  // Full move number
} Fss;

// Auxiliary position data; saved/restored on execute/retract

typedef struct
{
  bool incheck;  // In check status
  bool dbcheck;  // Double check status
  TiBV pinned;   // Target indexed bit vector: pinned targets
  TiBV frozen;   // Target indexed bit vector: frozen targets
} Apd;
A total of forty bytes, which could be compressed, albeit at the cost of added complexity. The saved values are stored in a stack, which in turn is a component of the enclosing position object:

Code: Select all

// Preserved position data

typedef struct
{
  Fss  fss;   // Saved FEN scalar set record
  Apd  apd;   // Saved auxiliary position data record
  Move move;  // Saved played move
} Ppd;

// Preserved position data stack

typedef struct
{
  Ppd ppdvec[PlyLen];  // Vector of saved preserved position data records
  ui  ppdcount;        // Count of stacked entries; index of next free element
} PpdStk;
And the rest:

Code: Select all

// Chessboard

typedef struct
{
  Man manvec[SqLen]; // Chessboard contents (men), indexed by square
} Board;

// Target tracker

typedef struct
{
  Sq sqvec[TiLen];            // Map: target index -> square
  Ti tivec[SqLen];            // Map: square -> target index
  Sq captsqvec[ChessSetLen];  // Vector of saved squares, one element per capture
  Ti capttivec[ChessSetLen];  // Vector of saved target indexes, one element per capture
  ui captcount;               // Count of stacked captures; index of next free element
} Tracker;

// Chess position

typedef struct
{
  Fss     fss;      // FEN scalar set
  Apd     apd;      // Auxiliary position data
  Board   board;    // Chessboard
  Tracker tracker;  // Target tracker
  PpdStk  ppdstk;   // Preserved position data stack, one entry per played move
} Position;
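The save and restore then reduce to a push and a pop (a sketch; PushPpd and PopPpd are illustrative names, and Oscar's actual execute/retract routines do more than this):

Code: Select all

// Push: save scalars before a move is played (called from execute)

void PushPpd(Position *posptr, Move move)
{
  Ppd *ppdptr = &posptr->ppdstk.ppdvec[posptr->ppdstk.ppdcount++];

  ppdptr->fss  = posptr->fss;  // Structure assignment copies all six FEN scalars
  ppdptr->apd  = posptr->apd;  // Likewise for the four auxiliary values
  ppdptr->move = move;
}

// Pop: restore scalars after a move is retracted (called from retract)

void PopPpd(Position *posptr)
{
  Ppd *ppdptr = &posptr->ppdstk.ppdvec[--posptr->ppdstk.ppdcount];

  posptr->fss = ppdptr->fss;
  posptr->apd = ppdptr->apd;
}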
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Memory usage in make/unmake vs logic complexity

Post by sje »

bob wrote:I have two hash signatures, one for all pieces, one for pawns. I don't unmake either; I just save the original and then restore it.
In Symbolic, the chessman insertion/deletion/motion routines are time-symmetric, with hash signatures updated appropriately. There's a little magic in the Execute()/Retract() routines that updates hash signatures for castling and en passant changes.

So, Symbolic doesn't save/restore any hash signatures. However, there is a compile time debug switch that will do a save/restore just for the purpose of checking that the incremental hash code did its job properly.

Edit: Symbolic does store the chessman location hash signature, but does not restore from the saved value. The saved values are used for repetition detection only.
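
The time symmetry falls out of XOR being its own inverse (a sketch; the table name is illustrative):

Code: Select all

#include <stdint.h>

extern uint64_t zobrist[12][64];  // One random key per (man, square) pair

// The same routine both inserts and deletes a man from the signature,
// because h ^ k ^ k == h; so execute and retract can share it.
void ToggleMan(uint64_t *hash, int man, int sq)
{
  *hash ^= zobrist[man][sq];
}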
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Memory usage in make/unmake vs logic complexity

Post by matthewlai »

syzygy wrote: Perft speed is pretty much irrelevant. Komodo, at least several versions ago, was said to use copy/make. I don't know perft results for Komodo.
I would say it's relevant as a benchmark for move generation/apply/unapply, as long as no one is doing bulk counting, hashing, or other perft-specific optimizations.

Each internal node would be exactly 1 full move generation, 1 apply, and 1 un-apply. Why is that not relevant?
SF uses copy/make for part of the state.
I think there's some misunderstanding here.

My definition (and Bob's definition, too, judging by the thread you linked) of copy/make is copying the entire position state (~200 bytes in most engines) before making a move.

Copy/make for part of the state is not what I would call copy/make.

It's what I am doing with my engine, too. I only copy the changed fields, which is about 30 bytes on average.
http://talkchess.com/forum/viewtopic.php?t=50805
bob wrote:First point of notice, copy/make STILL slows me down by almost 10%, (...)
Of course that's overall nps, not perft.
With Crafty 24.0, on my machine, I get about 32Mnps perft (31ns per node), and 360ns per node in "analyze" from start position.

So move generation/apply/unapply only accounted for less than 10% of each node's time (31ns out of 360ns is under 9%). A 10% slowdown in total NPS would mean generation/apply/unapply is taking more than 2x as long (a 10% NPS drop adds about 40ns per node, on top of the original 31ns).
Another thread:
http://talkchess.com/forum/viewtopic.php?t=39938

A pure make/unmake might have the advantage that all state is in one struct pointed to by one register. With a hybrid approach as in my engine (but also in SF), the state is divided over two structs. This might speak for saving the value of the hash key during make and restoring it during unmake instead of creating the new value in the next element of a state array and using the value in that next element. But I doubt that it would be faster.
On x86-64 (unlike x86-32), there is a crapload of registers. I don't think register pressure would really be a problem on x86-64.

With my engine, on each apply, I record the list of changed fields and their old values. That minimizes copying at the expense of some extra logic, while not being nearly as complicated as make/unmake with absolutely minimal information. It seems to be reasonably fast as well (about 2/3 as fast as Crafty, without any serious optimization effort).
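
In sketch form (illustrative names, and assuming all tracked fields are 64-bit for simplicity; my actual code differs):

Code: Select all

#include <stdint.h>

// One undo record per touched field: where it lives and what it held
typedef struct
{
  uint64_t *field;    // address of the modified field
  uint64_t oldValue;  // value before the move
} UndoEntry;

typedef struct
{
  UndoEntry entries[16];  // ~5 fields change on a typical move
  int count;
} UndoList;

// Record the old value, then write the new one
static inline void SetField(UndoList *ul, uint64_t *field, uint64_t newValue)
{
  ul->entries[ul->count].field = field;
  ul->entries[ul->count].oldValue = *field;
  ++ul->count;
  *field = newValue;
}

// Un-apply: walk the records backwards, restoring old values
static inline void RestoreFields(UndoList *ul)
{
  while (ul->count > 0)
  {
    --ul->count;
    *ul->entries[ul->count].field = ul->entries[ul->count].oldValue;
  }
}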
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.