Hi all,
unless you are already using it, try __builtin_ffsll for getLSBit (=bsf) and 63-__builtin_clzll for getMSBit (=bsr). There's really no point in messing with inline assembly clz/rbit manually (note that rbit is only available on ARMv6 and higher AFAIK). Also it may be worth investigating whether using popMSB instead of popLSB would save some additional processing time.
Anyway I got a nice 25% speedup on ARM now.
Martin
Optimizing bitboards for ARM
-
mar
- Posts: 2672
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
-
mar
- Posts: 2672
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
Re: Optimizing bitboards for ARM
Correction: in fact I meant __builtin_ffsll-1
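A minimal sketch of what the suggestion and the correction amount to, assuming GCC or Clang builtins and a 64-bit bitboard type. The names getLSBit, getMSBit, popLSB and popMSB come from the post; the signatures and the asserts are assumptions made for illustration, not the engine's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Index (0..63) of the least significant set bit; bb must be non-zero.
   __builtin_ffsll returns a 1-based index, hence the -1. */
static inline int getLSBit(uint64_t bb) {
    assert(bb != 0);
    return __builtin_ffsll((long long)bb) - 1;
}

/* Index (0..63) of the most significant set bit; bb must be non-zero,
   because __builtin_clzll(0) is undefined. */
static inline int getMSBit(uint64_t bb) {
    assert(bb != 0);
    return 63 - __builtin_clzll(bb);
}

/* Return the index of the lowest set bit and clear it in *bb. */
static inline int popLSB(uint64_t *bb) {
    int sq = getLSBit(*bb);
    *bb &= *bb - 1;              /* clears the lowest set bit */
    return sq;
}

/* Return the index of the highest set bit and clear it in *bb. */
static inline int popMSB(uint64_t *bb) {
    int sq = getMSBit(*bb);
    *bb ^= (uint64_t)1 << sq;    /* clears exactly that bit */
    return sq;
}
```

Whether popMSB is actually cheaper than popLSB on a given ARM core is something to measure; the sketch only shows that both fit the same interface.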
-
ZirconiumX
- Posts: 1361
- Joined: Sun Jul 17, 2011 11:14 am
- Full name: Hannah Ravensloft
Re: Optimizing bitboards for ARM
I have done some tests.
ARM is a RISC computer. This means you have to think about everything in a different manner.
Taking a simple example:
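Code:
#include <stdint.h>

/* Index (0..63) of the top set bit of hi:lo; undefined if both are 0. */
int msb(uint32_t hi, uint32_t lo) {
    uint32_t res;
    /* One asm statement, so the flags set by cmp are still valid for the
       conditional instructions; "cc" tells GCC the flags are clobbered. */
    asm("cmp   %1, #0\n\t"
        "clzne %0, %1\n\t"
        "eorne %0, %0, #63\n\t"  /* hi != 0: 63 - clz(hi) */
        "clzeq %0, %2\n\t"
        "eoreq %0, %0, #31"      /* hi == 0: 31 - clz(lo) */
        : "=&r"(res) : "r"(hi), "r"(lo) : "cc");
    return (int)res;
}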
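BSR in ~5 instructions = 5 cycles. Not bad. (Note the code is branchless, taking advantage of ARM's conditional execution instructions)
RBIT is ARMv7, not 6.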
Matthew:out
tu ne cede malis, sed contra audentior ito
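For comparison with the inline assembly, the same split into two 32-bit halves can be written in plain C and left to the compiler. This is only a sketch: msb_c is a hypothetical name, and it assumes GCC or Clang, where __builtin_clz operates on a 32-bit unsigned int. A result of 32..63 means the bit was found in hi, 0..31 means it was found in lo.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: index (0..63) of the top set bit of the
   64-bit value hi:lo. At least one half must be non-zero. */
static inline int msb_c(uint32_t hi, uint32_t lo) {
    assert((hi | lo) != 0);          /* __builtin_clz(0) is undefined */
    return hi ? 63 - __builtin_clz(hi)   /* bit lives in the upper half */
              : 31 - __builtin_clz(lo);  /* bit lives in the lower half */
}
```

Because it returns the same values the assembly version is meant to produce, it can serve as a reference when testing the hand-written routine.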
-
mar
- Posts: 2672
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
Re: Optimizing bitboards for ARM
ZirconiumX wrote: BSR in ~5 instructions = 5 cycles. Not bad. (Note the code is branchless, taking advantage of ARM's conditional execution instructions)
Thanks Matthew,
the code looks nice, I'll try it once I have some time.
Let's see whether you can beat the GCC intrinsics.
LSB is still needed though (at least in my case).
As for RBIT, yes, you're right: it is available since ARMv6T2.
There are things to avoid when writing for ARM, like integer division
(do the newer ARM versions already have an instruction for it?). Division is useless for a chess engine,
but there are other areas where a modulo/division is handy.
Martin
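On the LSB point: where RBIT is not available, CLZ alone is enough once the lowest set bit has been isolated with x & -x. A rough sketch, assuming GCC or Clang builtins; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: index (0..31) of the lowest set bit of x,
   using only a count-leading-zeros operation. x must be non-zero. */
static inline int lsb32_via_clz(uint32_t x) {
    assert(x != 0);
    uint32_t lowest = x & (0u - x);     /* keeps only the lowest set bit */
    return 31 - __builtin_clz(lowest);  /* single bit: msb index == lsb index */
}

/* Same idea for a 64-bit board stored as two 32-bit halves. */
static inline int lsb64_via_clz(uint32_t hi, uint32_t lo) {
    return lo ? lsb32_via_clz(lo) : 32 + lsb32_via_clz(hi);
}
```

This is the kind of sequence the compiler may well emit on its own for __builtin_ffsll, which is one reason to benchmark the intrinsics before hand-writing it.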
-
ZirconiumX
- Posts: 1361
- Joined: Sun Jul 17, 2011 11:14 am
- Full name: Hannah Ravensloft
Re: Optimizing bitboards for ARM
It comes with no guarantee of fitness for purpose, just so you know.
I'm still trying to get crosstool to compile an armhf compiler *sigh*
Matthew:out
-
mar
- Posts: 2672
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
Re: Optimizing bitboards for ARM
No problem, perhaps the operands are in reverse order? I thought AT&T syntax puts destination operand last.
Martin
-
diep
- Posts: 1822
- Joined: Thu Mar 09, 2006 11:54 pm
- Location: The Netherlands
Re: Optimizing bitboards for ARM
The original Reduced Instruction Set Computing historically was executing 1 instruction a clock.
Today's ARMs are not even close to that; they're much closer to
modern CPUs.
Modern ARMs are all out of order, just like x64 is.
The A9, for example, which is much used now, has an L1 of 32+32 and an L2,
and can store instructions in the L2 as well, just like modern x64 CPUs can.
The ARM Cortex A9 executes 2 instructions a cycle.
If we look at the number of transistors, which depends heavily on the size of the L2 cache (between 256KB and 1MB most of the time), then it's a big chip.
The real big difference between an x64 CPU and a modern ARM is that the ARMs are currently mostly 32 bits and they are low power.
The Cortex A15 will get 64 bits by the way, so the biggest difference then is the fact that they are low power.
Yet that's changing as well.
A quad-core Cortex A9 (most are dual core) under full load is already 3 watts,
and the A15 is expected to eat 6 watts.
Still a big difference from x64s.
Note there are very few products featuring quad-core A9s.
The quad-core A9s are very expensive to buy. If you buy 100 of them they're around $30 to $40 each. They run at 1.0GHz then. The higher
clocks are difficult to buy in small quantities, and those are all dual core.
I didn't find a cheap motherboard/SoC/development board for them.
That seems to be the real challenge for ARM. Of course if you design one yourself and produce 100k of them they're $6 each or so, yet I'm looking for a good one currently and cannot find one for a QUAD CORE ARM A9.
It's all dual core nonsense.
-
mar
- Posts: 2672
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
Re: Optimizing bitboards for ARM
I can only add that I was a bit disappointed with ARM performance.
I understand that ARMs are designed for very low power consumption,
but still. A Cortex A8 (600MHz to 1GHz; I'm not sure at what frequency the
iPhone 4 runs, say 800MHz) runs a factor of 7 slower than a 2GHz Core 2 Duo (T7200) (1 core each).
Frankly, I expected it to be slower, but not that much slower.
-
diep
- Posts: 1822
- Joined: Thu Mar 09, 2006 11:54 pm
- Location: The Netherlands
Re: Optimizing bitboards for ARM
Apple is notorious of course. I remember how before launch of ipad-1, they bragged about a quadcore ARM cpu that would be inside.
In reality they put in a dual core CPU.
Realize that of those 2 cores you can easily lose up to 1 core to all the telecommunication protocols and other protocols (GPS, huh?).
Then Apple of course is always a real cheapskate with their big success stories.
The CPUs they put in their hardware today are still DUAL CORE.
If I recall correctly, their latest iPad model has the Cortex A9 (OMAP bla bla or something).
So you were benchmarking 1 core, I suppose?
Now another problem with embedded hardware is the big variety between the CPUs.
Some A8s have an L2 cache, others do NOT. I remember how I tested that with the Pentium Pro at the time. When I turned off the L2 cache, Diep got a factor of 3 slower...
With the A9 this should go a lot better than with the A8.
They will keep putting dual core ARMs in all that hardware, simply because they get away with it.
They bragged big time about the iPad, and then when it was released it had an older chip with fewer cores and an old iOS that couldn't multitask.
Yet it sold well.
As long as that keeps happening, they will keep putting inferior CPUs in it. There are many ARMs, you know, which do not have RAM at all. They instead use some sort of flash memory both as RAM and as permanent storage...
Most A9s are in fact also without RAM.
Needless to say this is really slow for chess.
I find these CPUs a tad expensive. Just 4 ARM cores for $40, and then you do not have a board yet. You have to design such a board yourself, I suppose.
I have NO IDEA why they aren't selling development boards cheaply, for $20 or so, for those ARM9's. Most I see are in the hundreds of dollars. Real sick.
I want something with a bunch of USB connectors and 1 hole to plug the power adapter into, and/or AA batteries, that's all.
Then I can connect that to a controller steering the robot to be built here.
Only GCC can be used as a compiler for these ARMs, and I wouldn't give it the all-time award for how well it is doing.
P.S. Note that the iphone 4 touch has an 800MHz A8 and the iPhone 4 has a 1GHz A8 (Apple A4).
"The A4 processor package does not contain RAM,"
http://en.wikipedia.org/wiki/Apple_A4
Just a factor of 7 slower is still pretty good without RAM. I bet Diep is 20 times slower at this