TalkChess.com

Posted: **Mon Nov 26, 2018 9:36 pm**

Hi all,

I am steadily making my Java engine Axolotl less horrible, and there is one particular bit of code that bloats my engine a heck of a lot.

Code: Select all

if (white){
            knights = board.WHITE_KNIGHTS;
        }
        else {
            knights = board.BLACK_KNIGHTS;
        }

Code: Select all

long myPieces = white ? board.WHITE_BISHOPS : board.BLACK_BISHOPS;

And similar. While this might not slow me down a lot (I am not sure about though) it is ugly and I would like to get rid of it.

In your (Java) engine, how do you get around this kind of checking?

Kind regards,
Louis

Posted: **Mon Nov 26, 2018 10:12 pm**

Hi Louis,

in my engine 'white' is a number (0=white, 1=black)

So I just have an array for knights

Code: Select all

board.knights[white]

Actually I have a double array for pieces

Code: Select all

board.pieces[Constants.KNIGHT][white]

--Laurens

Posted: **Mon Nov 26, 2018 10:20 pm**

or alternatively you can do this:

Code: Select all


const WHITE = 0;
const BLACK = 1;

const PAWN = 0;
...
const KING = 5;

struct Position
{
	Bitboard player[2];
	Bitboard pieces[6];
};

So to get the white bishops:

Code: Select all

white_bishops = position.player[WHITE] & position.pieces[BISHOP]

Thats how I do it anyway.

Posted: **Mon Nov 26, 2018 10:39 pm**

Thank you for the feedback! Two very good ideas, I will get on it.
Best,
Louis

Posted: **Mon Nov 26, 2018 11:18 pm**

ljgw wrote: ↑Mon Nov 26, 2018 10:12 pm in my engine 'white' is a number (0=white, 1=black)

Two problems with this in Java:

boolean is not implicitly convertible to int; you'll need b ? 1 : 0.
Arrays are bounds-checked, which takes time (although the check may be optimized out)

Posted: **Tue Nov 27, 2018 12:09 am**

Sesse wrote: ↑Mon Nov 26, 2018 11:18 pm Two problems with this in Java:

boolean is not implicitly convertible to int; you'll need b ? 1 : 0.

Arrays are bounds-checked, which takes time (although the check may be optimized out)

I think the JIT might generate cmov or some other magic like that in case 1, perhaps even the frontend could optimize it (but not sure if Java bytecode has support for this, WebAssembly has "select" opcode I think).

Bounds checks are certainly a big problem in C# (I bet Java faces the same problems) due to the way arrays are represented. Typically extra pointer to "array object" with data and size members so that it can be GC-d and safe (+most likely no way to use references as class members because otherwise you couldn't track aliasing which can be hard problem in general, imagine you'd store reference to array[index] in a class, somewhere, I believe C# allows some sort of refs but only on stack).

There's no nice way to define static constant arrays in these languages which does't suck; even if you have say PST and the compiler might actually determine that you'll never index it out of bounds, it still generates the bounds check, because it's still technically something allocated on heap and GC-d later.
So it caches the object pointer in a register, but then does fetch size member via indirection anyway (doesn't even know the array has 64 elements even if it's
readonly or immutable or whatever).
I'm not sure it caches the array data pointer; but I don't remember seeing it cached (maybe that was some corner case I was looking at).
Sometimes the compiler is smart enough to move the bounds check outside of a loop, but in C# that implies that you iterate from 0 too array.length and they stll have a long way to go to improve this (for example, if you use backward iteration then it doesn't optimize this)

Indexing multi-dimensional arrays is even worse than that and the bounds check is generated all the time, one per dimension. I'm not sure Java does better here than C#.

The price to pay for safety, in C# you can't turn off bounds checking and I bet the same holds for Java.
Now compare that to C/C++ where indexing the static array itself is a single instruction which can be easily pipelined.

In C#, you might be able to work around this by using unsafe/fixed? and use pointers directly, if you could extract data pointer from array object somehow, which I'm not sure is possible.

Either way, I don't believe that bounds checking is the whole story why these managed languages perform so badly on chess engines, most likely their JIT compiler is not as good as fanboys claim (but they went a long way and can generate very good code if you don't need arrays).
For bitboard engines, I'm not sure there are intrinsics for bit scan/pop count (maybe there's is some way, I don't know).

Or maybe it's because AOT compilers can spend any time they like optimizing the code, they can optimize across compilation units and more.

Long story short: if you want maximum performance for your chess program, you don't want to use Java or C#.

Posted: **Tue Nov 27, 2018 6:18 am**

mar wrote: ↑Tue Nov 27, 2018 12:09 am In C#, you might be able to work around this by using unsafe/fixed? and use pointers directly, if you could extract data pointer from array object somehow, which I'm not sure is possible.

Possible and a requirement for interop. Both unsafe and fixed required. Long-standing fixed's can foil the GC's allocation strategy and fragment memory, however, so this really isnt a great solution.

mar wrote: ↑Tue Nov 27, 2018 12:09 am For bitboard engines, I'm not sure there are intrinsics for bit scan/pop count (maybe there's is some way, I don't know).

There arent in the .NET Framework, but in .NET Core 2.1 / Platform Extensions 2.1 there is System.Runtime.Intrinsics.X86 complete with popcount and leading zero count operations. There are also intrinsics for ARM.

I have not played at all with .NET Core yet, so dont know if they end up just being horrible calls to external libs or what.

mar wrote: ↑Tue Nov 27, 2018 12:09 am Or maybe it's because AOT compilers can spend any time they like optimizing the code, they can optimize across compilation units and more.

At least for C#, a big part of the performance issues is the very conservative inlining. Microsoft continues to recommend fat functions because of it.

mar wrote: ↑Tue Nov 27, 2018 12:09 am Long story short: if you want maximum performance for your chess program, you don't want to use Java or C#.

Given that TANSTATFC is true, I wouldnt worry about the small stuff. Sure its nice to eliminate a bounds check here and there, have constants folded across modules, and all that jazz.. but these are all part of narrow subsets of all possible optimizations. Compilers dont optimize 8x8 engines into bitboard engines, after all. The biggest optimizations are never handled by compilers.

Posted: **Tue Nov 27, 2018 11:55 am**

DustyMonkey wrote: ↑Tue Nov 27, 2018 6:18 am Given that TANSTATFC is true, I wouldnt worry about the small stuff. Sure its nice to eliminate a bounds check here and there, have constants folded across modules, and all that jazz.. but these are all part of narrow subsets of all possible optimizations. Compilers dont optimize 8x8 engines into bitboard engines, after all. The biggest optimizations are never handled by compilers.

I wouldn't consider say 100 elo small stuff, but let's say it's implied handicap you'll never get back by using these languages. Let's consider it a challenge.

Posted: **Tue Nov 27, 2018 2:27 pm**

mar wrote: ↑Tue Nov 27, 2018 11:55 am I wouldn't consider say 100 elo small stuff, but let's say it's implied handicap you'll never get back by using these languages. Let's consider it a challenge.

First:

After my last post I toiled for a few hours getting .NET Core and the intrinsics extensions up and running in visual studio community edition. It was harder and more complicated than it should have been, but new info:

They are effectively real intrinsics (they get inlined to the intended instruction) and seem to include everything important on the x86 side (including PEXT.) On the ARM side I dont know because I would not know what is missing but should be there, the bulk of the arm intrinsics seeming to be very specific to encryption.

Second:

You say languages, but what you are talking about is a function of the compiler not the language. Just because the compiler isnt doing an optimization, that doesnt prevent you from implementing that optimization anyways.

I highly suspect that the conservative inlining and the limitations it brings to compiler-offered optimizations is your issue. What did you do before your C++ compiler had whole program optimization? Surely you didnt just suffer bad performance because the compiler wasnt very good at some things.. surely you worked with the compiler to still produce efficient binaries, right?

Posted: **Tue Nov 27, 2018 3:20 pm**

DustyMonkey wrote: ↑Tue Nov 27, 2018 2:27 pm You say languages, but what you are talking about is a function of the compiler not the language. Just because the compiler isnt doing an optimization, that doesnt prevent you from implementing that optimization anyways.

I highly suspect that the conservative inlining and the limitations it brings to compiler-offered optimizations is your issue. What did you do before your C++ compiler had whole program optimization? Surely you didnt just suffer bad performance because the compiler wasnt very good at some things.. surely you worked with the compiler to still produce efficient binaries, right?

If what you say was true, then these languages (I mean C# and Java) would run as fast as native code. Except for some synthetic benchmarks, this is clearly not the case.

I'm more concerned about bounds checks rather than inlining - sometimes I even explicitly tell the C++ compiler to NOT inline certain functions to reduce code bloat. I also think you can tag some functions in C# to use aggressive inlining.
I also doubt that you get more than a couple of % by using whole program optimization (you can always put stuff in the headers, this is implied by using templates anyway so the compiler knows about what these functions).
Question is how costly is a (non-virtual) call in these managed languages. x64 ABI passes first n args in registers so it shouldn't be a big deal.

The problem is that arrays in C# or Java are objects, this is both for safety and it simplifies languages design, but the compiler no matter how good can't do much about it - this is why they write Burst compiler in Unity now which operates on a subset of C# with more vector-like arrays and other stuff because performance matters for some applications.

Chess programs use tables a lot so I suspect part of the performance loss is in boundchecks, but I don't believe that it's the whole story

Even if you can hack your way through pointers in unsafe code to faster binary (I think this isn't possible in Java but only in C#), the resulting code
won't be pretty and you would do better by using C right from the start.

I remember there used to be a project called PortFish in the past, back in 2012. The author did a very good job optimizing it.
Still it ran 2 times slower than the C++ version (I compared to SF 3 which should be close but perhaps newer).
I got 900kn/s C# vs 2Mn/s C++. I used the non-pocnt version compiled by Jim Ablett.

TalkChess.com

Java: white ? repeat : repeat;

Java: white ? repeat : repeat;

Re: Java: white ? repeat : repeat;

Re: Java: white ? repeat : repeat;

Re: Java: white ? repeat : repeat;

Re: Java: white ? repeat : repeat;

Re: Java: white ? repeat : repeat;

Re: Java: white ? repeat : repeat;

Re: Java: white ? repeat : repeat;

Re: Java: white ? repeat : repeat;

Re: Java: white ? repeat : repeat;