[.Net only] - fast bit operations

Discussion of chess software programming and technical issues.

Moderator: Ras

bpfliegel
Posts: 71
Joined: Fri Mar 16, 2012 10:16 am

[.Net only] - fast bit operations

Post by bpfliegel »

Does anyone have fast x86/x64 bitscan functions (fsb, lsb, popfirst) in C# (which are not Mono bound)? What do you use?
Playing around with these ideas, but not sure that they are worth the possible overhead (like having a delegate):

http://blogs.msdn.com/b/devinj/archive/ ... 38323.aspx
http://stackoverflow.com/questions/3216 ... in-c-sharp
http://blogs.microsoft.co.il/blogs/sash ... rom-c.aspx

Unsafe code ideas are welcome aswell.

Any thoughts?
Thanks, Balint
bpfliegel
Posts: 71
Joined: Fri Mar 16, 2012 10:16 am

Re: [.Net only] - fast bit operations

Post by bpfliegel »

Any dirty tricks regarding biboard related operations are welcome. Checked/unchecked, safe/unsafe.
My current ones are available in Portfish: https://github.com/bpfliegel/Portfish

Cheers, Balint
bpfliegel
Posts: 71
Joined: Fri Mar 16, 2012 10:16 am

Re: [.Net only] - fast bit operations

Post by bpfliegel »

Okay, so here we go - going for an x64 pop first bit function.

Idea from here:
http://blogs.msdn.com/b/devinj/archive/ ... 38323.aspx

Original C code looks like this:

Code: Select all

#include "stdafx.h"
#include <intrin.h>

int __declspec(noinline) __stdcall pop_1st_bit_64(unsigned long& b) {
	unsigned long index;
	_BitScanForward64(&index, b);
	b &= ~(1ULL<<(index));
	return index;
}
C# code with some tests like this (can't even use FastCall...)

Code: Select all

using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.InteropServices;
using System.Text;
using System.Threading.Tasks;

namespace Popcount
{
	// Original idea here: http://stackoverflow.com/questions/3216535/x86-x64-cpuid-in-c-sharp
	public static class PopFirst
	{
		#region Interop

		[DllImport("kernel32.dll", SetLastError = true)]
		private static extern IntPtr VirtualAlloc(IntPtr lpAddress, UIntPtr dwSize, AllocationType flAllocationType,
			MemoryProtection flProtect);

		[DllImport("kernel32")]
		private static extern bool VirtualFree(IntPtr lpAddress, UInt32 dwSize, UInt32 dwFreeType);

		[Flags()]
		private enum AllocationType : uint
		{
			COMMIT = 0x1000,
			RESERVE = 0x2000,
			RESET = 0x80000,
			LARGE_PAGES = 0x20000000,
			PHYSICAL = 0x400000,
			TOP_DOWN = 0x100000,
			WRITE_WATCH = 0x200000
		}

		[Flags()]
		private enum MemoryProtection : uint
		{
			EXECUTE = 0x10,
			EXECUTE_READ = 0x20,
			EXECUTE_READWRITE = 0x40,
			EXECUTE_WRITECOPY = 0x80,
			NOACCESS = 0x01,
			READONLY = 0x02,
			READWRITE = 0x04,
			WRITECOPY = 0x08,
			GUARD_Modifierflag = 0x100,
			NOCACHE_Modifierflag = 0x200,
			WRITECOMBINE_Modifierflag = 0x400
		}

		#endregion

		[UnmanagedFunctionPointerAttribute(CallingConvention.Cdecl)]
		public delegate int PopFirstBit(ref UInt64 b);

		static byte[] pop_first_x64 = new byte[]
		{
			0x48 ,0x89 ,0x4C ,0x24 ,0x08       //mov         qword ptr [rsp+8],rcx  
			,0x57                   //push        rdi  
			,0x48 ,0x83 ,0xEC ,0x40          //sub         rsp,40h  
			,0x48 ,0x8B ,0xFC             //mov         rdi,rsp  
			,0xB9 ,0x10 ,0x00 ,0x00 ,0x00       //mov         ecx,10h  
			,0xB8 ,0xCC ,0xCC ,0xCC ,0xCC       //mov         eax,0CCCCCCCCh  
			,0xF3 ,0xAB                //rep stos    dword ptr [rdi]  
			,0x48 ,0x8B ,0x4C ,0x24 ,0x50       //mov         rcx,qword ptr [b]  
			//unsigned long index;
			//_BitScanForward64(&index, b);
			,0x48 ,0x8B ,0x44 ,0x24 ,0x50       //mov         rax,qword ptr [b]  
			,0x8B ,0x00                //mov         eax,dword ptr [rax]  
			,0x48 ,0x0F ,0xBC ,0xC0          //bsf         rax,rax  
			,0x89 ,0x44 ,0x24 ,0x24          //mov         dword ptr [index],eax  
			//b &= ~(1ULL<<(index));
			,0x8B ,0x44 ,0x24 ,0x24          //mov         eax,dword ptr [index]  
			,0xB9 ,0x01 ,0x00 ,0x00 ,0x00       //mov         ecx,1  
			,0x48 ,0x89 ,0x4C ,0x24 ,0x38       //mov         qword ptr [rsp+38h],rcx  
			,0x0F ,0xB6 ,0xC8             //movzx       ecx,al  
			,0x48 ,0x8B ,0x44 ,0x24 ,0x38       //mov         rax,qword ptr [rsp+38h]  
			,0x48 ,0xD3 ,0xE0             //shl         rax,cl  
			,0x48 ,0xF7 ,0xD0             //not         rax  
			,0x48 ,0x8B ,0x4C ,0x24 ,0x50       //mov         rcx,qword ptr [b]  
			,0x8B ,0x09                //mov         ecx,dword ptr [rcx]  
			,0x48 ,0x23 ,0xC8             //and         rcx,rax  
			,0x48 ,0x8B ,0xC1             //mov         rax,rcx  
			,0x48 ,0x8B ,0x4C ,0x24 ,0x50       //mov         rcx,qword ptr [b]  
			,0x89 ,0x01                //mov         dword ptr [rcx],eax  
			//return index; 
			,0x8B ,0x44 ,0x24 ,0x24          // mov         eax,dword ptr [index]  
			//}
			,0x8B ,0xF8                // mov         edi,eax  
			,0x48 ,0x8B ,0xCC             // mov         rcx,rsp  
			//,0x48 ,0x8D ,0x15 ,0x83 ,0x5F ,0x00 ,0x00 // lea         rdx,[std::numeric_limits<float>::min_exponent10+26Ch (013FACBEB0h)]  
			//,0xE8 ,0xFE ,0xDB ,0xFF ,0xFF       // call        _RTC_CheckStackVars (013FAC3B30h)  
			,0x8B ,0xC7                // mov         eax,edi  
			,0x48 ,0x83 ,0xC4 ,0x40          // add         rsp,40h  
			,0x5F                   // pop         rdi  
			,0xC3                   // ret  
		};

		// Taken from Portfish/Stockfish
		internal static readonly int[] PopTable = new int[64];
		internal static int pop_first_x64_sw(ref UInt64 b)
		{
			UInt64 bb = b;
            b &= (b - 1);
            return (PopTable[((bb & (0xffffffffffffffff - bb + 1)) * 0x218A392CD3D5DBFUL) >> 58]);
		}

		public static void PopFirstTest()
		{
			// Init (taken from Portfish/Stockfish)
			for (int i = 0; i < 64; i++)
			{
				PopTable[((1UL << i) * 0x218A392CD3D5DBFUL) >> 58] = i;
			}

			IntPtr codepointer = IntPtr.Zero;
			try
			{
				byte[] codeBytes = pop_first_x64;
				codepointer = VirtualAlloc(
					IntPtr.Zero,
					new UIntPtr((uint)codeBytes.Length),
					AllocationType.COMMIT,
					MemoryProtection.EXECUTE_READWRITE
				);

				Marshal.Copy(codeBytes, 0, codepointer, codeBytes.Length);

				PopFirstBit pop1st = (PopFirstBit)Marshal.GetDelegateForFunctionPointer(codepointer, typeof(PopFirstBit));

				Console.WriteLine("test prepared, press any key");
				Console.ReadKey();

				for (UInt64 i = 0; i < 5000000; i++)
				{
					UInt64 target = i;
					while (target != 0) { int first = pop1st(ref target); }
				}

				Console.WriteLine("first passed");
				Console.ReadKey();

				for (UInt64 i = 0; i < 5000000; i++)
				{
					UInt64 target = i;
					while (target != 0) { int first = pop_first_x64_sw(ref target); }
				}

				Console.WriteLine("second passed");
				Console.ReadKey();
			}
			finally
			{
				if (codepointer != IntPtr.Zero)
				{
					VirtualFree(codepointer, 0, 0x8000);
					codepointer = IntPtr.Zero;
				}
			}
		}
	}
}
Slow as hell! (~around 1:5)

There is still some hope: the bytecode is full of junk for the first look (the 2 lines for checking was removed by me btw) . If someone could help me out to replace it into a specific optimized bytecode - we might be able to move forward...

Anyone in?

Cheers, Balint
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: [.Net only] - fast bit operations

Post by Gerd Isenberg »

bpfliegel wrote:
C# code with some tests like this (can't even use FastCall...)

Code: Select all

		[UnmanagedFunctionPointerAttribute(CallingConvention.Cdecl)]
		public delegate int PopFirstBit(ref UInt64 b);

		static byte[] pop_first_x64 = new byte[]
		{
			0x48 ,0x89 ,0x4C ,0x24 ,0x08       //mov         qword ptr [rsp+8],rcx  
			,0x57                   //push        rdi  
			,0x48 ,0x83 ,0xEC ,0x40          //sub         rsp,40h  
			,0x48 ,0x8B ,0xFC             //mov         rdi,rsp  
			,0xB9 ,0x10 ,0x00 ,0x00 ,0x00       //mov         ecx,10h  
			,0xB8 ,0xCC ,0xCC ,0xCC ,0xCC       //mov         eax,0CCCCCCCCh  
			,0xF3 ,0xAB                //rep stos    dword ptr [rdi]  
			,0x48 ,0x8B ,0x4C ,0x24 ,0x50       //mov         rcx,qword ptr [b]  
			//unsigned long index;
			//_BitScanForward64(&index, b);
			,0x48 ,0x8B ,0x44 ,0x24 ,0x50       //mov         rax,qword ptr [b]  
			,0x8B ,0x00                //mov         eax,dword ptr [rax]  
			,0x48 ,0x0F ,0xBC ,0xC0          //bsf         rax,rax  
			,0x89 ,0x44 ,0x24 ,0x24          //mov         dword ptr [index],eax  
			//b &= ~(1ULL<<(index));
			,0x8B ,0x44 ,0x24 ,0x24          //mov         eax,dword ptr [index]  
			,0xB9 ,0x01 ,0x00 ,0x00 ,0x00       //mov         ecx,1  
			,0x48 ,0x89 ,0x4C ,0x24 ,0x38       //mov         qword ptr [rsp+38h],rcx  
			,0x0F ,0xB6 ,0xC8             //movzx       ecx,al  
			,0x48 ,0x8B ,0x44 ,0x24 ,0x38       //mov         rax,qword ptr [rsp+38h]  
			,0x48 ,0xD3 ,0xE0             //shl         rax,cl  
			,0x48 ,0xF7 ,0xD0             //not         rax  
			,0x48 ,0x8B ,0x4C ,0x24 ,0x50       //mov         rcx,qword ptr [b]  
			,0x8B ,0x09                //mov         ecx,dword ptr [rcx]  
			,0x48 ,0x23 ,0xC8             //and         rcx,rax  
			,0x48 ,0x8B ,0xC1             //mov         rax,rcx  
			,0x48 ,0x8B ,0x4C ,0x24 ,0x50       //mov         rcx,qword ptr [b]  
			,0x89 ,0x01                //mov         dword ptr [rcx],eax  
			//return index; 
			,0x8B ,0x44 ,0x24 ,0x24          // mov         eax,dword ptr [index]  
			//}
			,0x8B ,0xF8                // mov         edi,eax  
			,0x48 ,0x8B ,0xCC             // mov         rcx,rsp  
			//,0x48 ,0x8D ,0x15 ,0x83 ,0x5F ,0x00 ,0x00 // lea         rdx,[std::numeric_limits<float>::min_exponent10+26Ch (013FACBEB0h)]  
			//,0xE8 ,0xFE ,0xDB ,0xFF ,0xFF       // call        _RTC_CheckStackVars (013FAC3B30h)  
			,0x8B ,0xC7                // mov         eax,edi  
			,0x48 ,0x83 ,0xC4 ,0x40          // add         rsp,40h  
			,0x5F                   // pop         rdi  
			,0xC3                   // ret  
		};

		// Taken from Portfish/Stockfish
		internal static readonly int[] PopTable = new int[64];
		internal static int pop_first_x64_sw(ref UInt64 b)
		{
			UInt64 bb = b;
            b &= (b - 1);
            return (PopTable[((bb & (0xffffffffffffffff - bb + 1)) * 0x218A392CD3D5DBFUL) >> 58]);
		}

		public static void PopFirstTest()
		{
			// Init (taken from Portfish/Stockfish)
			for (int i = 0; i < 64; i++)
			{
				PopTable[((1UL << i) * 0x218A392CD3D5DBFUL) >> 58] = i;
			}

			IntPtr codepointer = IntPtr.Zero;
			try
			{
				byte[] codeBytes = pop_first_x64;
				codepointer = VirtualAlloc(
					IntPtr.Zero,
					new UIntPtr((uint)codeBytes.Length),
					AllocationType.COMMIT,
					MemoryProtection.EXECUTE_READWRITE
				);

				Marshal.Copy(codeBytes, 0, codepointer, codeBytes.Length);

				PopFirstBit pop1st = (PopFirstBit)Marshal.GetDelegateForFunctionPointer(codepointer, typeof(PopFirstBit));

				Console.WriteLine("test prepared, press any key");
				Console.ReadKey();

				for (UInt64 i = 0; i < 5000000; i++)
				{
					UInt64 target = i;
					while (target != 0) { int first = pop1st(ref target); }
				}

				Console.WriteLine("first passed");
				Console.ReadKey();

				for (UInt64 i = 0; i < 5000000; i++)
				{
					UInt64 target = i;
					while (target != 0) { int first = pop_first_x64_sw(ref target); }
				}

				Console.WriteLine("second passed");
				Console.ReadKey();
			}
			finally
			{
				if (codepointer != IntPtr.Zero)
				{
					VirtualFree(codepointer, 0, 0x8000);
					codepointer = IntPtr.Zero;
				}
			}
		}
	}
}
Slow as hell! (~around 1:5)

There is still some hope: the bytecode is full of junk for the first look (the 2 lines for checking was removed by me btw) . If someone could help me out to replace it into a specific optimized bytecode - we might be able to move forward...

Anyone in?

Cheers, Balint
Hi Balint,

The machine language PopFirstBit seems to come from a debug version. There is much to throw out - a naked function without locals and stackframe should suffice. In bitboard serialization, I prefer a (inlined) routine for bitscan forward via call by value to reset the bit with independend "and" with one's decrement.

Code: Select all

if ( x ) do {
   int idx = bitScanForward(x); // square index from 0..63
   *list++ = foo(idx, ...);
} while (x &= x-1); // reset LS1B
see also
https://chessprogramming.wikispaces.com/BitScan
https://chessprogramming.wikispaces.com ... %20Version

why (0xffffffffffffffff - bb + 1) intstead of (-bb) or (0-bb)?

Gerd
bpfliegel
Posts: 71
Joined: Fri Mar 16, 2012 10:16 am

Re: [.Net only] - fast bit operations

Post by bpfliegel »

Thanks Gerd, agree with all of what you wrote!

- the code itself is just a lame initial version, I have to work on the .Net related parts.
- right, bb & ~(0-bb) might be used aswell, bb & ~-bb won't compile as I remember.
- the while trick is lovely!
- I have near zero x86/x64 asm knowledge, so the code might be best replaced anyhow by something written by a human expert and not by any compiler at the end, I think. But this link is great: https://chessprogramming.wikispaces.com ... %20Version

Have to run,
Balint
zongli
Posts: 13
Joined: Sat May 12, 2012 9:45 pm

Re: [.Net only] - fast bit operations

Post by zongli »

I think you mean -bb or ~bb + 1. :) It is indeed annoying that you can't just negate a UInt64 value. For the pop function I found that isolating the LSB first was a bit faster in practice (perft) even though it was slower when testing with random values in a loop:

Code: Select all

public static Int32 pop_first_x64_sw(ref UInt64 b) {
     UInt64 bb = b & (0UL - b);
     b &= b - 1;
     return BitIndex[(bb * 0x218A392CD3D5DBFUL) >> 58];
}
bpfliegel
Posts: 71
Joined: Fri Mar 16, 2012 10:16 am

Re: [.Net only] - fast bit operations

Post by bpfliegel »

zongli wrote:I think you mean -bb or ~bb + 1. :) It is indeed annoying that you can't just negate a UInt64 value. For the pop function I found that isolating the LSB first was a bit faster in practice (perft) even though it was slower when testing with random values in a loop:

Code: Select all

public static Int32 pop_first_x64_sw(ref UInt64 b) {
     UInt64 bb = b & (0UL - b);
     b &= b - 1;
     return BitIndex[(bb * 0x218A392CD3D5DBFUL) >> 58];
}
Correct. Thanks for the runtime info!

Created a new code from release build and - hopefully - proper optimizations set. Performance is still around 1:2,5 even with code security supressed, too much overhead...

Balint

Code: Select all

using System;
using System.Security;
using System.Runtime.InteropServices;

namespace Popcount
{
	// Idea: http://ybeernet.blogspot.hu/2011/03/techniques-of-calling-unmanaged-code.html

	// Original idea here: http://stackoverflow.com/questions/3216535/x86-x64-cpuid-in-c-sharp
	public static class PopFirst
	{
		#region Interop

		[DllImport("kernel32.dll", SetLastError = true)]
		private static extern IntPtr VirtualAlloc(IntPtr lpAddress, UIntPtr dwSize, AllocationType flAllocationType,
			MemoryProtection flProtect);

		[DllImport("kernel32")]
		private static extern bool VirtualFree(IntPtr lpAddress, UInt32 dwSize, UInt32 dwFreeType);

		[Flags()]
		private enum AllocationType : uint
		{
			COMMIT = 0x1000,
			RESERVE = 0x2000,
			RESET = 0x80000,
			LARGE_PAGES = 0x20000000,
			PHYSICAL = 0x400000,
			TOP_DOWN = 0x100000,
			WRITE_WATCH = 0x200000
		}

		[Flags()]
		private enum MemoryProtection : uint
		{
			EXECUTE = 0x10,
			EXECUTE_READ = 0x20,
			EXECUTE_READWRITE = 0x40,
			EXECUTE_WRITECOPY = 0x80,
			NOACCESS = 0x01,
			READONLY = 0x02,
			READWRITE = 0x04,
			WRITECOPY = 0x08,
			GUARD_Modifierflag = 0x100,
			NOCACHE_Modifierflag = 0x200,
			WRITECOMBINE_Modifierflag = 0x400
		}

		#endregion

		[SuppressUnmanagedCodeSecurity]
		[UnmanagedFunctionPointerAttribute(CallingConvention.Cdecl)]
		public delegate int PopFirstBit(ref UInt64 b);

		static byte[] pop_first_x64 = new byte[]
		{
			0x44 ,0x8B ,0x01					//mov         r8d,dword ptr [rcx]  
			,0x4C ,0x8B ,0xC9					//mov         r9,rcx  
			,0xBA ,0x01 ,0x00 ,0x00 ,0x00       //mov         edx,1  
			,0x49 ,0x0F ,0xBC ,0xC0				//bsf         rax,r8  
			,0x8B ,0xC8							//mov         ecx,eax  
			,0x48 ,0xD3 ,0xE2					//shl         rdx,cl  
			,0xF7 ,0xD2							//not         edx  
			,0x41 ,0x23 ,0xD0					//and         edx,r8d  
			,0x41 ,0x89 ,0x11					//mov         dword ptr [r9],edx  
			,0xC3								//ret 
		};

		// Taken from Portfish/Stockfish
		internal static readonly int[] PopTable = new int[64];
		internal static int pop_first_x64_sw(ref UInt64 b)
		{
			UInt64 bb = b;
            b &= (b - 1);
            return (PopTable[((bb & (0xffffffffffffffff - bb + 1)) * 0x218A392CD3D5DBFUL) >> 58]);
		}

		public static void PopFirstTest()
		{
			// Init (taken from Portfish/Stockfish)
			for (int i = 0; i < 64; i++)
			{
				PopTable[((1UL << i) * 0x218A392CD3D5DBFUL) >> 58] = i;
			}

			IntPtr codepointer = IntPtr.Zero;
			try
			{
				byte[] codeBytes = pop_first_x64;
				codepointer = VirtualAlloc(
					IntPtr.Zero,
					new UIntPtr((uint)codeBytes.Length),
					AllocationType.COMMIT,
					MemoryProtection.EXECUTE_READWRITE
				);

				Marshal.Copy(codeBytes, 0, codepointer, codeBytes.Length);

				PopFirstBit pop1st = (PopFirstBit)Marshal.GetDelegateForFunctionPointer(codepointer, typeof(PopFirstBit));

				Console.WriteLine("test prepared, press any key");
				Console.ReadKey();

				for (UInt64 i = 0; i < 5000000; i++)
				{
					UInt64 target = i;
					while (target != 0)
					{ 
						int first = pop1st(ref target);
					}
				}

				Console.WriteLine("first passed");
				Console.ReadKey();

				for (UInt64 i = 0; i < 5000000; i++)
				{
					UInt64 target = i;
					while (target != 0)
					{ 
						int first = pop_first_x64_sw(ref target); 
					}
				}

				Console.WriteLine("second passed");
				Console.ReadKey();
			}
			finally
			{
				if (codepointer != IntPtr.Zero)
				{
					VirtualFree(codepointer, 0, 0x8000);
					codepointer = IntPtr.Zero;
				}
			}
		}
	}
}
bpfliegel
Posts: 71
Joined: Fri Mar 16, 2012 10:16 am

Re: [.Net only] - fast bit operations

Post by bpfliegel »

I retested UInt64 bb = b & (0UL - b); instead of the old code and performs somewhat better. I have a very minimal movegen code and now it's visible - did not detect it when I tested this change on Portfish earlier. But is very logical, one op less.

Thanks anyway! Balint
zongli
Posts: 13
Joined: Sat May 12, 2012 9:45 pm

Re: [.Net only] - fast bit operations

Post by zongli »

Well, assuming the compiler understands the commutative property of addition, it's the same number of operations. :) I think the gain comes from slightly better parallelism once the method gets inlined.

I'm not sure if you'll be able to get much out of the assembly code; it might look terse but there's a lot of instructions needed to put it in a delegate. Also, it probably can't get inlined.
bpfliegel
Posts: 71
Joined: Fri Mar 16, 2012 10:16 am

Re: [.Net only] - fast bit operations

Post by bpfliegel »

zongli wrote:Well, assuming the compiler understands the commutative property of addition, it's the same number of operations. :) I think the gain comes from slightly better parallelism once the method gets inlined.

I'm not sure if you'll be able to get much out of the assembly code; it might look terse but there's a lot of instructions needed to put it in a delegate. Also, it probably can't get inlined.
Agree again. Let's hope someone will come up with a better technical idea - I'm abandoning this attempt.
Thanks, Balint