Optimizing bitboards for ARM

mar
Posts: 2672
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Optimizing bitboards for ARM

Post by mar »

Hi all,
unless you are already using it, try __builtin_ffsll for getLSBit (=bsf) and 63-__builtin_clzll for getMSBit (=bsr). There's really no point in messing with inline assembly clz/rbit manually (note that rbit is only available on ARMv6 and higher AFAIK). Also it may be worth investigating whether using popMSB instead of popLSB would save some additional processing time.
Anyway I got a nice 25% speedup on ARM now.

Martin
mar

Re: Optimizing bitboards for ARM

Post by mar »

Correction: in fact I meant __builtin_ffsll-1
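
A minimal sketch of the wrappers described above, assuming GCC or Clang and a 64-bit bitboard. The names `Bitboard`, `get_lsb`, `get_msb`, `pop_lsb` and `pop_msb` are illustrative only, not taken from any particular engine. All four assume a non-zero board, since `__builtin_clzll(0)` is undefined.

```c
#include <stdint.h>

typedef uint64_t Bitboard;  /* hypothetical name for a 64-bit board */

/* Index of the lowest set bit (like x86 bsf); b must be non-zero.
   __builtin_ffsll returns 1 + index, hence the -1. */
static inline int get_lsb(Bitboard b) {
  return __builtin_ffsll((long long)b) - 1;
}

/* Index of the highest set bit (like x86 bsr); b must be non-zero. */
static inline int get_msb(Bitboard b) {
  return 63 - __builtin_clzll(b);
}

/* Return the index of the lowest set bit and clear it from *b. */
static inline int pop_lsb(Bitboard *b) {
  int sq = get_lsb(*b);
  *b &= *b - 1;  /* clears the lowest set bit */
  return sq;
}

/* Return the index of the highest set bit and clear it from *b. */
static inline int pop_msb(Bitboard *b) {
  int sq = get_msb(*b);
  *b ^= (Bitboard)1 << sq;
  return sq;
}
```

Whether `pop_msb` is actually cheaper than `pop_lsb` on a given ARM core is something to measure rather than assume.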
ZirconiumX
Posts: 1361
Joined: Sun Jul 17, 2011 11:14 am
Full name: Hannah Ravensloft

Re: Optimizing bitboards for ARM

Post by ZirconiumX »

mar wrote:Hi all,
unless you are already using it, try __builtin_ffsll for getLSBit (=bsf) and 63-__builtin_clzll for getMSBit (=bsr). There's really no point in messing with inline assembly clz/rbit manually (note that rbit is only available on ARMv6 and higher AFAIK). Also it may be worth investigating whether using popMSB instead of popLSB would save some additional processing time.
Anyway I got a nice 25% speedup on ARM now.

Martin
I have done some tests.

ARM is a RISC computer. This means you have to think about everything in a different manner.

Taking a simple example:

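Code:

#include <stdint.h>

/* Index (0..63) of the highest set bit of hi:lo. ARM state, ARMv5 or later
   (clz); result is meaningless when hi == lo == 0. */
int msb(uint32_t hi, uint32_t lo) {
  uint32_t res;
  /* One asm statement, so the flags set by cmp are still valid for the
     conditional instructions; separate asm() statements give no such guarantee. */
  asm("cmp   %1, #0\n\t"
      "clzne %0, %1\n\t"
      "eorne %0, %0, #63\n\t"  /* hi != 0: 63 - clz(hi) */
      "clzeq %0, %2\n\t"
      "eoreq %0, %0, #31"      /* hi == 0: 31 - clz(lo) */
      : "=&r"(res) : "r"(hi), "r"(lo) : "cc");
  return (int)res;
}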
BSR in ~5 instructions = 5 cycles. Not bad. (Note the code is branchless, taking advantage of ARM's conditional execution instructions)
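
For cross-checking the assembly on any machine, the same hi/lo split can be written in portable C with `__builtin_clz`, assuming GCC or Clang; `msb_ref` is a hypothetical name. For a count in 0..31, `clz ^ 31` equals `31 - clz` and `clz ^ 63` equals `63 - clz`, which is why a single `eor` is enough after the `clz`. A sketch:

```c
#include <stdint.h>

/* Index (0..63) of the highest set bit of the 64-bit value hi:lo.
   Undefined when both halves are zero (__builtin_clz(0) is undefined). */
static inline int msb_ref(uint32_t hi, uint32_t lo) {
  if (hi != 0)
    return __builtin_clz(hi) ^ 63;  /* 32 + (31 - clz(hi)) */
  return __builtin_clz(lo) ^ 31;    /* 31 - clz(lo) */
}
```

The compiler is free to turn the `if` into conditional instructions on its own, so this form is also a reasonable baseline to benchmark against.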

RBIT is ARMv7, not 6.

Matthew:out
tu ne cede malis, sed contra audentior ito
mar

Re: Optimizing bitboards for ARM

Post by mar »

ZirconiumX wrote: I have done some tests.

ARM is a RISC computer. This means you have to think about everything in a different manner.

Taking a simple example:

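Code:

#include <stdint.h>

/* Index (0..63) of the highest set bit of hi:lo. ARM state, ARMv5 or later
   (clz); result is meaningless when hi == lo == 0. */
int msb(uint32_t hi, uint32_t lo) {
  uint32_t res;
  /* One asm statement, so the flags set by cmp are still valid for the
     conditional instructions; separate asm() statements give no such guarantee. */
  asm("cmp   %1, #0\n\t"
      "clzne %0, %1\n\t"
      "eorne %0, %0, #63\n\t"  /* hi != 0: 63 - clz(hi) */
      "clzeq %0, %2\n\t"
      "eoreq %0, %0, #31"      /* hi == 0: 31 - clz(lo) */
      : "=&r"(res) : "r"(hi), "r"(lo) : "cc");
  return (int)res;
}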
BSR in ~5 instructions = 5 cycles. Not bad. (Note the code is branchless, taking advantage of ARM's conditional execution instructions)

RBIT is ARMv7, not 6.

Matthew:out
Thanks Matthew

The code looks nice; I'll try it once I have some time.
Let's see whether you can beat the GCC intrinsics :twisted:
LSB is still needed though (at least in my case).
As for RBIT, yes, you're right: it has been available since ARMv6T2.
There are things to avoid when writing for ARM, like integer division
(do the newer ARM versions already have an instruction for it?). Division is useless for a chess engine,
but there are other areas where a modulo/division is handy ;)
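
On the LSB point: where `rbit` is not available, one common approach is to isolate the lowest set bit with `x & -x` and then reuse `clz`. A sketch in portable C, assuming GCC or Clang, with `lsb_ref` as a hypothetical name:

```c
#include <stdint.h>

/* Index (0..63) of the lowest set bit of the 64-bit value hi:lo.
   Undefined when both halves are zero (__builtin_clz(0) is undefined). */
static inline int lsb_ref(uint32_t hi, uint32_t lo) {
  if (lo != 0)
    return 31 - __builtin_clz(lo & (0u - lo));  /* isolate lowest bit, then clz */
  return 63 - __builtin_clz(hi & (0u - hi));    /* same trick on the high half */
}
```

On cores that do have `rbit`, `clz(rbit(x))` gives the same index without the isolate step.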

Martin
ZirconiumX

Re: Optimizing bitboards for ARM

Post by ZirconiumX »

It comes with no guarantee of fitness for purpose, just so you know.

I'm still trying to get crosstool to compile an armhf compiler *sigh*

Matthew:out
tu ne cede malis, sed contra audentior ito
mar

Re: Optimizing bitboards for ARM

Post by mar »

ZirconiumX wrote:It comes with no guarantee of fitness for purpose, just so you know.

I'm still trying to get crosstool to compile an armhf compiler *sigh*

Matthew:out
No problem, perhaps the operands are in reverse order? I thought AT&T syntax puts destination operand last.

Martin
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Optimizing bitboards for ARM

Post by diep »

ZirconiumX wrote:
mar wrote:Hi all,
unless you are already using it, try __builtin_ffsll for getLSBit (=bsf) and 63-__builtin_clzll for getMSBit (=bsr). There's really no point in messing with inline assembly clz/rbit manually (note that rbit is only available on ARMv6 and higher AFAIK). Also it may be worth investigating whether using popMSB instead of popLSB would save some additional processing time.
Anyway I got a nice 25% speedup on ARM now.

Martin
I have done some tests.

ARM is a RISC computer. This means you have to think about everything in a different manner.

Taking a simple example:

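Code:

#include <stdint.h>

/* Index (0..63) of the highest set bit of hi:lo. ARM state, ARMv5 or later
   (clz); result is meaningless when hi == lo == 0. */
int msb(uint32_t hi, uint32_t lo) {
  uint32_t res;
  /* One asm statement, so the flags set by cmp are still valid for the
     conditional instructions; separate asm() statements give no such guarantee. */
  asm("cmp   %1, #0\n\t"
      "clzne %0, %1\n\t"
      "eorne %0, %0, #63\n\t"  /* hi != 0: 63 - clz(hi) */
      "clzeq %0, %2\n\t"
      "eoreq %0, %0, #31"      /* hi == 0: 31 - clz(lo) */
      : "=&r"(res) : "r"(hi), "r"(lo) : "cc");
  return (int)res;
}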
BSR in ~5 instructions = 5 cycles. Not bad. (Note the code is branchless, taking advantage of ARM's conditional execution instructions)

RBIT is ARMv7, not 6.

Matthew:out
The original Reduced Instruction Set Computing designs historically
executed 1 instruction per clock.

Today's ARMs are not even close to that. They're much closer to
modern CPUs.

Modern ARMs are all out of order, just like x64 is.
The A9, for example, which is much used now, has an L1 of 32+32 and an L2,
and can also store instructions in the L2, just like modern x64 CPUs can.

The ARM Cortex A9 executes 2 instructions a cycle.

If we look at the number of transistors, which depends heavily upon the size of the L2 cache (between 256KB and 1MB most of the time), then it's a big chip.

The real big difference between an x64 CPU and a modern ARM is that the ARMs are currently mostly 32 bits and they are low power.

The Cortex A15 will get 64 bits by the way, so the biggest difference then is the fact that they are low power.

Yet that's changing as well.

A quad-core Cortex A9 (most are dual core) under full load is already 3 watts,
and the A15 is expected to eat 6 watts.

Still a big difference with x64s.

Note there are very few products featuring quad-core A9s.

The quad-core A9s are very expensive to buy. If you buy 100 of them they're around $30 to $40 each. They run at 1.0GHz then. The higher
clocks are difficult to buy in small quantities, and those are all dual core.

I didn't find a cheap motherboard/SoC/development board for them.

That seems to be the real challenge for ARM. Of course if you design one yourself and produce 100k of them they're $6 each or so, yet I'm currently looking for a good one and cannot find any for a QUAD CORE ARM A9.

It's all dual core nonsense.
mar

Re: Optimizing bitboards for ARM

Post by mar »

diep wrote:Today's ARMs are not even close to that. They're much closer to
modern CPUs.

Modern ARMs are all out of order, just like x64 is.
The A9, for example, which is much used now, has an L1 of 32+32 and an L2,
and can also store instructions in the L2, just like modern x64 CPUs can.

The ARM Cortex A9 executes 2 instructions a cycle.
I can only add that I was a bit disappointed with ARM performance.
I understand that ARMs are designed for very low power consumption,
but still. A Cortex A8 (600MHz to 1GHz; I'm not sure at what frequency the
iPhone 4 runs, say 800MHz) runs a factor of 7 slower than a 2GHz Core 2 Duo (T7200) (1 core each).
Frankly I expected it to be slower but not that much slower.
diep

Re: Optimizing bitboards for ARM

Post by diep »

mar wrote:
diep wrote:Today's ARMs are not even close to that. They're much closer to
modern CPUs.

Modern ARMs are all out of order, just like x64 is.
The A9, for example, which is much used now, has an L1 of 32+32 and an L2,
and can also store instructions in the L2, just like modern x64 CPUs can.

The ARM Cortex A9 executes 2 instructions a cycle.
I can only add that I was a bit disappointed with ARM performance.
I understand that ARMs are designed for very low power consumption,
but still. A Cortex A8 (600MHz to 1GHz; I'm not sure at what frequency the
iPhone 4 runs, say 800MHz) runs a factor of 7 slower than a 2GHz Core 2 Duo (T7200) (1 core each).
Frankly I expected it to be slower but not that much slower.
Apple is notorious, of course. I remember how, before the launch of the iPad 1, they bragged about a quad-core ARM CPU that would be inside.

In reality they put in a dual core CPU.

Realize that of those 2 CPUs you can easily lose up to 1 core to all the telecommunication protocols and other protocols (GPS, huh?).

Then Apple, of course, is always a real cheapskate with its big success stories.

The CPUs they put in their hardware today are still DUAL CORE.

If I recall, their latest iPad model has the Cortex A9 (OMAP bla bla or something).

So you were benchmarking 1 core, I suppose?

Now another problem with embedded hardware is the big variety between the CPUs.

Some A8s have an L2 cache, others do NOT. I remember how I tested that with the Pentium Pro at the time. When I turned off the L2 cache, Diep got a factor of 3 slower back then...

With the A9 this should go a lot better than with the A8.

They will keep putting dual core ARMs in all that hardware, simply because they get away with it.

They bragged big time about the iPad, and then when it was released it had an older chip with fewer cores and an old iOS that couldn't multitask.

Yet it sold well.

As long as that keeps happening, they will keep putting inferior CPUs in it. There are many ARMs, you know, which do not have RAM at all. They instead use some sort of flash memory both as RAM and as permanent storage...

Most A9s are also without RAM, in fact.

Needless to say this is really slow for chess.

I find these CPUs a tad expensive. Just 4 ARM cores for $40, and then you do not have a board yet. You have to design such a board yourself, I suppose.

I have NO IDEA why they aren't selling development boards cheaply, for $20 or so, for those A9s. Most I see are in the hundreds of dollars. Real sick.

I want something with a bunch of USB connectors and 1 hole to put the power adapter into, and/or AA batteries, that's all.

Then I can connect that to a controller steering the robot to be built here.

Only GCC can be used as a compiler for these ARMs, and I wouldn't give it the all-time award for how well it is doing.

P.S. Note that the iphone 4 touch has an 800MHz A8 and the iPhone 4 has a 1GHz A8 (Apple A4).

"The A4 processor package does not contain RAM,"

http://en.wikipedia.org/wiki/Apple_A4

Just a factor of 7 slower is still pretty good without RAM. I bet Diep is 20 times slower at this :)