Arduino Uno Tournament

stegemma · Post by **stegemma** » Sat Jan 31, 2015 9:56 am

hgm wrote:Too bad. Yet 13 people expressed interest in the poll. Isn't that enough, or are most of those voters people that want to watch the tournament, rather than participate?

It sounded like a real fun challenge to design something that would only need 2KB of RAM. Parhaps I am going to do it anyway.

I've ordered an Arduino UNO on Amazon just last night, in a little kit with some sensors and accessories, for about 51 euros. You can get an Arduino UNO for 20$ too. I'm thinking that if we can write a C program that fit in the limited available memory of Arduino, then we can even compile the same software on a modern computer... and see what happens. I think that it could fit completely in the CPU cache and play very well (maybe better than Satana 2.0.7).

I planned to try designing a "chess-machine" for a FPGA, in some far future, and maybe Arduino could be an interesting option to use as a controller for the FPGA (or an FPGA array), having standard interfaces available (lcd screens and so on).

matthewlai · Post by **matthewlai** » Sat Jan 31, 2015 3:48 pm

stegemma wrote: I've ordered an Arduino UNO on Amazon just last night, in a little kit with some sensors and accessories, for about 51 euros. You can get an Arduino UNO for 20$ too. I'm thinking that if we can write a C program that fit in the limited available memory of Arduino, then we can even compile the same software on a modern computer... and see what happens. I think that it could fit completely in the CPU cache and play very well (maybe better than Satana 2.0.7).

I planned to try designing a "chess-machine" for a FPGA, in some far future, and maybe Arduino could be an interesting option to use as a controller for the FPGA (or an FPGA array), having standard interfaces available (lcd screens and so on).

I made a thread about FPGA chess a little while ago - just thought you may be interested. There's some good discussion in there.
http://talkchess.com/forum/viewtopic.php?t=54474

If you are using an FPGA anyways, I don't think you'd really need an Arduino. You can easily synthesize a much more capable CPU on the FPGA itself.

I've also thought about making an engine fit in cache, and made a thread about it many years ago, and did some experiments.

It's actually very easy if you just make the hash tables very small. But there was a net speed decrease.

As it turned out, accessing RAM is expensive, but having to search the same position twice is even worse. Reducing the size of t-table also means some positions would now not have a hash move from previous iterations. The speed lost from that is also very high.

hgm · Post by **hgm** » Sat Jan 31, 2015 4:27 pm

matthewlai wrote:As it turned out, accessing RAM is expensive, but having to search the same position twice is even worse. Reducing the size of t-table also means some positions would now not have a hash move from previous iterations. The speed lost from that is also very high.

Sure, but how many entries could you store in 2KB of RAM, (after subtracting the necessary space for board and stack)? I don't think you would get enough hits from such a hash table to make a measurable difference. Transpositions can only occur in trees of 3-ply or more, when you span 2 all-nodes (the moves of which you exchange) and one cut-node (using the same cut move in both cases). Such a tree will typically have 35^2 ~ 1200 nodes, of which at most half are transpositions. So to get any hits you would need at least 600 entries in the table for that ply. Otherwise they would be replaced before they could be hit.

So for a microcontroller like the AtMega328 or PIC18 this seems to be out of the question, you would need at least 6 KB to only hash ply 3, which is 3 to 6 times more as your total RAM.

Of course when you are talking about running in a 3MB L2 cache of an x64 CPU, it becomes quite a different matter. Even with 1MB hash you could have 100K entries, which is 10K entries per ply if you spread them out over 10 ply. That way you could remember trees with 3 levels of all-nodes, (so 5 ply deep in total), where each leaf would be reached through 6 different transpositions. (Mostly move A - null - move B - null - move C.) That would save you a factor 3 of work for every hashed level. (So 3^10 in this optimistic scenario where all moves are refuted by null moves, and every move can be transposed with every earlier move.)

matthewlai · Post by **matthewlai** » Sat Jan 31, 2015 4:33 pm

hgm wrote:
matthewlai wrote:As it turned out, accessing RAM is expensive, but having to search the same position twice is even worse. Reducing the size of t-table also means some positions would now not have a hash move from previous iterations. The speed lost from that is also very high.
Sure, but how many entries could you store in 2KB of RAM, (after subtracting the necessary space for board and stack)? I don't think you would get enough hits from such a hash table to make a measurable difference. Transpositions can only occur in trees of 3-ply or more, when you span 2 all-nodes (the moves of which you exchange) and one cut-node (using the same cut move in both cases). Such a tree will typically have 35^2 ~ 1200 nodes, of which at most half are transpositions. So to get any hits you would need at least 600 entries in the table for that ply. Otherwise they would be replaced before they could be hit.

So for a microcontroller like the AtMega328 or PIC18 this seems to be out of the question, you would need at least 6 KB to only hash ply 3, which is 3 to 6 times more as your total RAM.

Of course when you are talking about running in a 3MB L2 cache of an x64 CPU, it becomes quite a different matter. Even with 1MB hash you could have 100K entries, which is 10K entries per ply if you spread them out over 10 ply. That way you could remember trees with 3 levels of all-nodes, (so 5 ply deep in total), where each leaf would be reached through 6 different transpositions. (Mostly move A - null - move B - null - move C.) That would save you a factor 3 of work for every hashed level. (So 3^10 in this optimistic scenario where all moves are refuted by null moves, and every move can be transposed with every earlier move.)

On an Atmega328 or PIC18, this wouldn't be a concern, as they have no cache. Their SRAM runs at the same speed as the CPU (or I supposed you could say they have 2KB non-backed cache and no RAM).

hgm · Post by **hgm** » Sat Jan 31, 2015 5:12 pm

Indeed, I considered the on-chip SRAM as all cache. You can of course always add external storage through the I/O ports (e.g DRAM).

matthewlai · Post by **matthewlai** » Sat Jan 31, 2015 5:21 pm

hgm wrote:Indeed, I considered the on-chip SRAM as all cache. You can of course always add external storage through the I/O ports (e.g DRAM).

It would be EXTREMELY slow, though, since those chips don't have an exposed address and data bus. That means even if you hook up all the address bits and data bits to the same IO port so they can be read/written to in one instruction, sending one DRAM command would still be something like -

- set data/address multiplexer to address
- set clock low (another IO pin)
- write register address
- set clock high
- toggle multiplexer to data
- set clock low
- write register content
- set clock high

And accesses may require multiple commands, and a refresh needs to be done at least every 64ms.

At that point, it probably makes more sense to go up to a ARM Cortex-M3. They still only cost a few dollars/euros, are much faster, have a much larger range of IO options, and a real hardware memory controller that can do all the dirty work.

hgm · Post by **hgm** » Sat Jan 31, 2015 6:19 pm

Well, compared to the ratio DRAM/cache access in an Intel i7 that is still pretty fast. And you would not have to do it entirely in software. You could use one of the output pins on the address port to trigger a hardware clock generator based on a dual one-shot, so that whenever you write a new address there it automatically administers the corresponding clock to the DRAM after the address stabilizes to latch it in.

But yes, it will take a few instructions to fetch a word. But I don't think that will slow you down much, if you only use it for hashing. These thinsg are inherently slow, like 4 MIPS, or perhaps 10 if you are licky, while on an i7 we are used to 10 GIPS per core...

matthewlai · Post by **matthewlai** » Sun Feb 01, 2015 3:38 pm

hgm wrote:Well, compared to the ratio DRAM/cache access in an Intel i7 that is still pretty fast. And you would not have to do it entirely in software. You could use one of the output pins on the address port to trigger a hardware clock generator based on a dual one-shot, so that whenever you write a new address there it automatically administers the corresponding clock to the DRAM after the address stabilizes to latch it in.

But yes, it will take a few instructions to fetch a word. But I don't think that will slow you down much, if you only use it for hashing. These thinsg are inherently slow, like 4 MIPS, or perhaps 10 if you are licky, while on an i7 we are used to 10 GIPS per core...

That's true. Though in that case I would think it's better to use asynchronous SRAM instead of DRAM. They don't require a clock, don't require refreshing, and don't require implementing a complex command system.

At an atmega's clock rate, we can pretty much ignore all the timing requirements, since it will most certainly be less than 1 clock cycle (these chips are designed for 100+MHz operation).

They are available up to a few megabytes. That's way more than what an atmega can reasonably use effectively, since ideal t-table size depends on NPS, and they have very low NPS.

Arduino Uno Tournament

Is an arduino Uno Chess engine tournament an interesting idea

Re: Arduino Uno Tournament

Re: Arduino Uno Tournament

Re: Arduino Uno Tournament

Re: Arduino Uno Tournament

Re: Arduino Uno Tournament

Re: Arduino Uno Tournament

Re: Arduino Uno Tournament

Re: Arduino Uno Tournament