Does anyone have fast x86/x64 bitscan functions (fsb, lsb, popfirst) in C# (which are not Mono bound)? What do you use?
Playing around with these ideas, but not sure that they are worth the possible overhead (like having a delegate):
http://blogs.msdn.com/b/devinj/archive/ ... 38323.aspx
http://stackoverflow.com/questions/3216 ... in-c-sharp
http://blogs.microsoft.co.il/blogs/sash ... rom-c.aspx
Unsafe code ideas are welcome aswell.
Any thoughts?
Thanks, Balint
[.Net only] - fast bit operations
Moderator: Ras
-
bpfliegel
- Posts: 71
- Joined: Fri Mar 16, 2012 10:16 am
Re: [.Net only] - fast bit operations
Any dirty tricks regarding biboard related operations are welcome. Checked/unchecked, safe/unsafe.
My current ones are available in Portfish: https://github.com/bpfliegel/Portfish
Cheers, Balint
My current ones are available in Portfish: https://github.com/bpfliegel/Portfish
Cheers, Balint
-
bpfliegel
- Posts: 71
- Joined: Fri Mar 16, 2012 10:16 am
Re: [.Net only] - fast bit operations
Okay, so here we go - going for an x64 pop first bit function.
Idea from here:
http://blogs.msdn.com/b/devinj/archive/ ... 38323.aspx
Original C code looks like this:
C# code with some tests like this (can't even use FastCall...)
Slow as hell! (~around 1:5)
There is still some hope: the bytecode is full of junk for the first look (the 2 lines for checking was removed by me btw) . If someone could help me out to replace it into a specific optimized bytecode - we might be able to move forward...
Anyone in?
Cheers, Balint
Idea from here:
http://blogs.msdn.com/b/devinj/archive/ ... 38323.aspx
Original C code looks like this:
Code: Select all
#include "stdafx.h"
#include <intrin.h>
int __declspec(noinline) __stdcall pop_1st_bit_64(unsigned long& b) {
unsigned long index;
_BitScanForward64(&index, b);
b &= ~(1ULL<<(index));
return index;
}Code: Select all
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.InteropServices;
using System.Text;
using System.Threading.Tasks;
namespace Popcount
{
// Original idea here: http://stackoverflow.com/questions/3216535/x86-x64-cpuid-in-c-sharp
public static class PopFirst
{
#region Interop
[DllImport("kernel32.dll", SetLastError = true)]
private static extern IntPtr VirtualAlloc(IntPtr lpAddress, UIntPtr dwSize, AllocationType flAllocationType,
MemoryProtection flProtect);
[DllImport("kernel32")]
private static extern bool VirtualFree(IntPtr lpAddress, UInt32 dwSize, UInt32 dwFreeType);
[Flags()]
private enum AllocationType : uint
{
COMMIT = 0x1000,
RESERVE = 0x2000,
RESET = 0x80000,
LARGE_PAGES = 0x20000000,
PHYSICAL = 0x400000,
TOP_DOWN = 0x100000,
WRITE_WATCH = 0x200000
}
[Flags()]
private enum MemoryProtection : uint
{
EXECUTE = 0x10,
EXECUTE_READ = 0x20,
EXECUTE_READWRITE = 0x40,
EXECUTE_WRITECOPY = 0x80,
NOACCESS = 0x01,
READONLY = 0x02,
READWRITE = 0x04,
WRITECOPY = 0x08,
GUARD_Modifierflag = 0x100,
NOCACHE_Modifierflag = 0x200,
WRITECOMBINE_Modifierflag = 0x400
}
#endregion
[UnmanagedFunctionPointerAttribute(CallingConvention.Cdecl)]
public delegate int PopFirstBit(ref UInt64 b);
static byte[] pop_first_x64 = new byte[]
{
0x48 ,0x89 ,0x4C ,0x24 ,0x08 //mov qword ptr [rsp+8],rcx
,0x57 //push rdi
,0x48 ,0x83 ,0xEC ,0x40 //sub rsp,40h
,0x48 ,0x8B ,0xFC //mov rdi,rsp
,0xB9 ,0x10 ,0x00 ,0x00 ,0x00 //mov ecx,10h
,0xB8 ,0xCC ,0xCC ,0xCC ,0xCC //mov eax,0CCCCCCCCh
,0xF3 ,0xAB //rep stos dword ptr [rdi]
,0x48 ,0x8B ,0x4C ,0x24 ,0x50 //mov rcx,qword ptr [b]
//unsigned long index;
//_BitScanForward64(&index, b);
,0x48 ,0x8B ,0x44 ,0x24 ,0x50 //mov rax,qword ptr [b]
,0x8B ,0x00 //mov eax,dword ptr [rax]
,0x48 ,0x0F ,0xBC ,0xC0 //bsf rax,rax
,0x89 ,0x44 ,0x24 ,0x24 //mov dword ptr [index],eax
//b &= ~(1ULL<<(index));
,0x8B ,0x44 ,0x24 ,0x24 //mov eax,dword ptr [index]
,0xB9 ,0x01 ,0x00 ,0x00 ,0x00 //mov ecx,1
,0x48 ,0x89 ,0x4C ,0x24 ,0x38 //mov qword ptr [rsp+38h],rcx
,0x0F ,0xB6 ,0xC8 //movzx ecx,al
,0x48 ,0x8B ,0x44 ,0x24 ,0x38 //mov rax,qword ptr [rsp+38h]
,0x48 ,0xD3 ,0xE0 //shl rax,cl
,0x48 ,0xF7 ,0xD0 //not rax
,0x48 ,0x8B ,0x4C ,0x24 ,0x50 //mov rcx,qword ptr [b]
,0x8B ,0x09 //mov ecx,dword ptr [rcx]
,0x48 ,0x23 ,0xC8 //and rcx,rax
,0x48 ,0x8B ,0xC1 //mov rax,rcx
,0x48 ,0x8B ,0x4C ,0x24 ,0x50 //mov rcx,qword ptr [b]
,0x89 ,0x01 //mov dword ptr [rcx],eax
//return index;
,0x8B ,0x44 ,0x24 ,0x24 // mov eax,dword ptr [index]
//}
,0x8B ,0xF8 // mov edi,eax
,0x48 ,0x8B ,0xCC // mov rcx,rsp
//,0x48 ,0x8D ,0x15 ,0x83 ,0x5F ,0x00 ,0x00 // lea rdx,[std::numeric_limits<float>::min_exponent10+26Ch (013FACBEB0h)]
//,0xE8 ,0xFE ,0xDB ,0xFF ,0xFF // call _RTC_CheckStackVars (013FAC3B30h)
,0x8B ,0xC7 // mov eax,edi
,0x48 ,0x83 ,0xC4 ,0x40 // add rsp,40h
,0x5F // pop rdi
,0xC3 // ret
};
// Taken from Portfish/Stockfish
internal static readonly int[] PopTable = new int[64];
internal static int pop_first_x64_sw(ref UInt64 b)
{
UInt64 bb = b;
b &= (b - 1);
return (PopTable[((bb & (0xffffffffffffffff - bb + 1)) * 0x218A392CD3D5DBFUL) >> 58]);
}
public static void PopFirstTest()
{
// Init (taken from Portfish/Stockfish)
for (int i = 0; i < 64; i++)
{
PopTable[((1UL << i) * 0x218A392CD3D5DBFUL) >> 58] = i;
}
IntPtr codepointer = IntPtr.Zero;
try
{
byte[] codeBytes = pop_first_x64;
codepointer = VirtualAlloc(
IntPtr.Zero,
new UIntPtr((uint)codeBytes.Length),
AllocationType.COMMIT,
MemoryProtection.EXECUTE_READWRITE
);
Marshal.Copy(codeBytes, 0, codepointer, codeBytes.Length);
PopFirstBit pop1st = (PopFirstBit)Marshal.GetDelegateForFunctionPointer(codepointer, typeof(PopFirstBit));
Console.WriteLine("test prepared, press any key");
Console.ReadKey();
for (UInt64 i = 0; i < 5000000; i++)
{
UInt64 target = i;
while (target != 0) { int first = pop1st(ref target); }
}
Console.WriteLine("first passed");
Console.ReadKey();
for (UInt64 i = 0; i < 5000000; i++)
{
UInt64 target = i;
while (target != 0) { int first = pop_first_x64_sw(ref target); }
}
Console.WriteLine("second passed");
Console.ReadKey();
}
finally
{
if (codepointer != IntPtr.Zero)
{
VirtualFree(codepointer, 0, 0x8000);
codepointer = IntPtr.Zero;
}
}
}
}
}There is still some hope: the bytecode is full of junk for the first look (the 2 lines for checking was removed by me btw) . If someone could help me out to replace it into a specific optimized bytecode - we might be able to move forward...
Anyone in?
Cheers, Balint
-
Gerd Isenberg
- Posts: 2251
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: [.Net only] - fast bit operations
Hi Balint,bpfliegel wrote:
C# code with some tests like this (can't even use FastCall...)
Slow as hell! (~around 1:5)Code: Select all
[UnmanagedFunctionPointerAttribute(CallingConvention.Cdecl)] public delegate int PopFirstBit(ref UInt64 b); static byte[] pop_first_x64 = new byte[] { 0x48 ,0x89 ,0x4C ,0x24 ,0x08 //mov qword ptr [rsp+8],rcx ,0x57 //push rdi ,0x48 ,0x83 ,0xEC ,0x40 //sub rsp,40h ,0x48 ,0x8B ,0xFC //mov rdi,rsp ,0xB9 ,0x10 ,0x00 ,0x00 ,0x00 //mov ecx,10h ,0xB8 ,0xCC ,0xCC ,0xCC ,0xCC //mov eax,0CCCCCCCCh ,0xF3 ,0xAB //rep stos dword ptr [rdi] ,0x48 ,0x8B ,0x4C ,0x24 ,0x50 //mov rcx,qword ptr [b] //unsigned long index; //_BitScanForward64(&index, b); ,0x48 ,0x8B ,0x44 ,0x24 ,0x50 //mov rax,qword ptr [b] ,0x8B ,0x00 //mov eax,dword ptr [rax] ,0x48 ,0x0F ,0xBC ,0xC0 //bsf rax,rax ,0x89 ,0x44 ,0x24 ,0x24 //mov dword ptr [index],eax //b &= ~(1ULL<<(index)); ,0x8B ,0x44 ,0x24 ,0x24 //mov eax,dword ptr [index] ,0xB9 ,0x01 ,0x00 ,0x00 ,0x00 //mov ecx,1 ,0x48 ,0x89 ,0x4C ,0x24 ,0x38 //mov qword ptr [rsp+38h],rcx ,0x0F ,0xB6 ,0xC8 //movzx ecx,al ,0x48 ,0x8B ,0x44 ,0x24 ,0x38 //mov rax,qword ptr [rsp+38h] ,0x48 ,0xD3 ,0xE0 //shl rax,cl ,0x48 ,0xF7 ,0xD0 //not rax ,0x48 ,0x8B ,0x4C ,0x24 ,0x50 //mov rcx,qword ptr [b] ,0x8B ,0x09 //mov ecx,dword ptr [rcx] ,0x48 ,0x23 ,0xC8 //and rcx,rax ,0x48 ,0x8B ,0xC1 //mov rax,rcx ,0x48 ,0x8B ,0x4C ,0x24 ,0x50 //mov rcx,qword ptr [b] ,0x89 ,0x01 //mov dword ptr [rcx],eax //return index; ,0x8B ,0x44 ,0x24 ,0x24 // mov eax,dword ptr [index] //} ,0x8B ,0xF8 // mov edi,eax ,0x48 ,0x8B ,0xCC // mov rcx,rsp //,0x48 ,0x8D ,0x15 ,0x83 ,0x5F ,0x00 ,0x00 // lea rdx,[std::numeric_limits<float>::min_exponent10+26Ch (013FACBEB0h)] //,0xE8 ,0xFE ,0xDB ,0xFF ,0xFF // call _RTC_CheckStackVars (013FAC3B30h) ,0x8B ,0xC7 // mov eax,edi ,0x48 ,0x83 ,0xC4 ,0x40 // add rsp,40h ,0x5F // pop rdi ,0xC3 // ret }; // Taken from Portfish/Stockfish internal static readonly int[] PopTable = new int[64]; internal static int pop_first_x64_sw(ref UInt64 b) { UInt64 bb = b; b &= (b - 1); return (PopTable[((bb & (0xffffffffffffffff - bb + 1)) * 0x218A392CD3D5DBFUL) >> 58]); } public static void PopFirstTest() { // Init (taken from Portfish/Stockfish) for (int i = 0; i < 64; i++) { PopTable[((1UL << i) * 0x218A392CD3D5DBFUL) >> 58] = i; } IntPtr codepointer = IntPtr.Zero; try { byte[] codeBytes = pop_first_x64; codepointer = VirtualAlloc( IntPtr.Zero, new UIntPtr((uint)codeBytes.Length), AllocationType.COMMIT, MemoryProtection.EXECUTE_READWRITE ); Marshal.Copy(codeBytes, 0, codepointer, codeBytes.Length); PopFirstBit pop1st = (PopFirstBit)Marshal.GetDelegateForFunctionPointer(codepointer, typeof(PopFirstBit)); Console.WriteLine("test prepared, press any key"); Console.ReadKey(); for (UInt64 i = 0; i < 5000000; i++) { UInt64 target = i; while (target != 0) { int first = pop1st(ref target); } } Console.WriteLine("first passed"); Console.ReadKey(); for (UInt64 i = 0; i < 5000000; i++) { UInt64 target = i; while (target != 0) { int first = pop_first_x64_sw(ref target); } } Console.WriteLine("second passed"); Console.ReadKey(); } finally { if (codepointer != IntPtr.Zero) { VirtualFree(codepointer, 0, 0x8000); codepointer = IntPtr.Zero; } } } } }
There is still some hope: the bytecode is full of junk for the first look (the 2 lines for checking was removed by me btw) . If someone could help me out to replace it into a specific optimized bytecode - we might be able to move forward...
Anyone in?
Cheers, Balint
The machine language PopFirstBit seems to come from a debug version. There is much to throw out - a naked function without locals and stackframe should suffice. In bitboard serialization, I prefer a (inlined) routine for bitscan forward via call by value to reset the bit with independend "and" with one's decrement.
Code: Select all
if ( x ) do {
int idx = bitScanForward(x); // square index from 0..63
*list++ = foo(idx, ...);
} while (x &= x-1); // reset LS1B
https://chessprogramming.wikispaces.com/BitScan
https://chessprogramming.wikispaces.com ... %20Version
why (0xffffffffffffffff - bb + 1) intstead of (-bb) or (0-bb)?
Gerd
-
bpfliegel
- Posts: 71
- Joined: Fri Mar 16, 2012 10:16 am
Re: [.Net only] - fast bit operations
Thanks Gerd, agree with all of what you wrote!
- the code itself is just a lame initial version, I have to work on the .Net related parts.
- right, bb & ~(0-bb) might be used aswell, bb & ~-bb won't compile as I remember.
- the while trick is lovely!
- I have near zero x86/x64 asm knowledge, so the code might be best replaced anyhow by something written by a human expert and not by any compiler at the end, I think. But this link is great: https://chessprogramming.wikispaces.com ... %20Version
Have to run,
Balint
- the code itself is just a lame initial version, I have to work on the .Net related parts.
- right, bb & ~(0-bb) might be used aswell, bb & ~-bb won't compile as I remember.
- the while trick is lovely!
- I have near zero x86/x64 asm knowledge, so the code might be best replaced anyhow by something written by a human expert and not by any compiler at the end, I think. But this link is great: https://chessprogramming.wikispaces.com ... %20Version
Have to run,
Balint
-
zongli
- Posts: 13
- Joined: Sat May 12, 2012 9:45 pm
Re: [.Net only] - fast bit operations
I think you mean -bb or ~bb + 1.
It is indeed annoying that you can't just negate a UInt64 value. For the pop function I found that isolating the LSB first was a bit faster in practice (perft) even though it was slower when testing with random values in a loop:
Code: Select all
public static Int32 pop_first_x64_sw(ref UInt64 b) {
UInt64 bb = b & (0UL - b);
b &= b - 1;
return BitIndex[(bb * 0x218A392CD3D5DBFUL) >> 58];
}-
bpfliegel
- Posts: 71
- Joined: Fri Mar 16, 2012 10:16 am
Re: [.Net only] - fast bit operations
Correct. Thanks for the runtime info!zongli wrote:I think you mean -bb or ~bb + 1.It is indeed annoying that you can't just negate a UInt64 value. For the pop function I found that isolating the LSB first was a bit faster in practice (perft) even though it was slower when testing with random values in a loop:
Code: Select all
public static Int32 pop_first_x64_sw(ref UInt64 b) { UInt64 bb = b & (0UL - b); b &= b - 1; return BitIndex[(bb * 0x218A392CD3D5DBFUL) >> 58]; }
Created a new code from release build and - hopefully - proper optimizations set. Performance is still around 1:2,5 even with code security supressed, too much overhead...
Balint
Code: Select all
using System;
using System.Security;
using System.Runtime.InteropServices;
namespace Popcount
{
// Idea: http://ybeernet.blogspot.hu/2011/03/techniques-of-calling-unmanaged-code.html
// Original idea here: http://stackoverflow.com/questions/3216535/x86-x64-cpuid-in-c-sharp
public static class PopFirst
{
#region Interop
[DllImport("kernel32.dll", SetLastError = true)]
private static extern IntPtr VirtualAlloc(IntPtr lpAddress, UIntPtr dwSize, AllocationType flAllocationType,
MemoryProtection flProtect);
[DllImport("kernel32")]
private static extern bool VirtualFree(IntPtr lpAddress, UInt32 dwSize, UInt32 dwFreeType);
[Flags()]
private enum AllocationType : uint
{
COMMIT = 0x1000,
RESERVE = 0x2000,
RESET = 0x80000,
LARGE_PAGES = 0x20000000,
PHYSICAL = 0x400000,
TOP_DOWN = 0x100000,
WRITE_WATCH = 0x200000
}
[Flags()]
private enum MemoryProtection : uint
{
EXECUTE = 0x10,
EXECUTE_READ = 0x20,
EXECUTE_READWRITE = 0x40,
EXECUTE_WRITECOPY = 0x80,
NOACCESS = 0x01,
READONLY = 0x02,
READWRITE = 0x04,
WRITECOPY = 0x08,
GUARD_Modifierflag = 0x100,
NOCACHE_Modifierflag = 0x200,
WRITECOMBINE_Modifierflag = 0x400
}
#endregion
[SuppressUnmanagedCodeSecurity]
[UnmanagedFunctionPointerAttribute(CallingConvention.Cdecl)]
public delegate int PopFirstBit(ref UInt64 b);
static byte[] pop_first_x64 = new byte[]
{
0x44 ,0x8B ,0x01 //mov r8d,dword ptr [rcx]
,0x4C ,0x8B ,0xC9 //mov r9,rcx
,0xBA ,0x01 ,0x00 ,0x00 ,0x00 //mov edx,1
,0x49 ,0x0F ,0xBC ,0xC0 //bsf rax,r8
,0x8B ,0xC8 //mov ecx,eax
,0x48 ,0xD3 ,0xE2 //shl rdx,cl
,0xF7 ,0xD2 //not edx
,0x41 ,0x23 ,0xD0 //and edx,r8d
,0x41 ,0x89 ,0x11 //mov dword ptr [r9],edx
,0xC3 //ret
};
// Taken from Portfish/Stockfish
internal static readonly int[] PopTable = new int[64];
internal static int pop_first_x64_sw(ref UInt64 b)
{
UInt64 bb = b;
b &= (b - 1);
return (PopTable[((bb & (0xffffffffffffffff - bb + 1)) * 0x218A392CD3D5DBFUL) >> 58]);
}
public static void PopFirstTest()
{
// Init (taken from Portfish/Stockfish)
for (int i = 0; i < 64; i++)
{
PopTable[((1UL << i) * 0x218A392CD3D5DBFUL) >> 58] = i;
}
IntPtr codepointer = IntPtr.Zero;
try
{
byte[] codeBytes = pop_first_x64;
codepointer = VirtualAlloc(
IntPtr.Zero,
new UIntPtr((uint)codeBytes.Length),
AllocationType.COMMIT,
MemoryProtection.EXECUTE_READWRITE
);
Marshal.Copy(codeBytes, 0, codepointer, codeBytes.Length);
PopFirstBit pop1st = (PopFirstBit)Marshal.GetDelegateForFunctionPointer(codepointer, typeof(PopFirstBit));
Console.WriteLine("test prepared, press any key");
Console.ReadKey();
for (UInt64 i = 0; i < 5000000; i++)
{
UInt64 target = i;
while (target != 0)
{
int first = pop1st(ref target);
}
}
Console.WriteLine("first passed");
Console.ReadKey();
for (UInt64 i = 0; i < 5000000; i++)
{
UInt64 target = i;
while (target != 0)
{
int first = pop_first_x64_sw(ref target);
}
}
Console.WriteLine("second passed");
Console.ReadKey();
}
finally
{
if (codepointer != IntPtr.Zero)
{
VirtualFree(codepointer, 0, 0x8000);
codepointer = IntPtr.Zero;
}
}
}
}
}
-
bpfliegel
- Posts: 71
- Joined: Fri Mar 16, 2012 10:16 am
Re: [.Net only] - fast bit operations
I retested UInt64 bb = b & (0UL - b); instead of the old code and performs somewhat better. I have a very minimal movegen code and now it's visible - did not detect it when I tested this change on Portfish earlier. But is very logical, one op less.
Thanks anyway! Balint
Thanks anyway! Balint
-
zongli
- Posts: 13
- Joined: Sat May 12, 2012 9:45 pm
Re: [.Net only] - fast bit operations
Well, assuming the compiler understands the commutative property of addition, it's the same number of operations.
I think the gain comes from slightly better parallelism once the method gets inlined.
I'm not sure if you'll be able to get much out of the assembly code; it might look terse but there's a lot of instructions needed to put it in a delegate. Also, it probably can't get inlined.
I'm not sure if you'll be able to get much out of the assembly code; it might look terse but there's a lot of instructions needed to put it in a delegate. Also, it probably can't get inlined.
-
bpfliegel
- Posts: 71
- Joined: Fri Mar 16, 2012 10:16 am
Re: [.Net only] - fast bit operations
Agree again. Let's hope someone will come up with a better technical idea - I'm abandoning this attempt.zongli wrote:Well, assuming the compiler understands the commutative property of addition, it's the same number of operations.I think the gain comes from slightly better parallelism once the method gets inlined.
I'm not sure if you'll be able to get much out of the assembly code; it might look terse but there's a lot of instructions needed to put it in a delegate. Also, it probably can't get inlined.
Thanks, Balint