nvidia tesla

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

MM
Posts: 766
Joined: Sun Oct 16, 2011 11:25 am

nvidia tesla

Post by MM »

Hi, can nvidia tesla be helpful for chess engines, now or in the future?

Best
MM
smatovic
Posts: 2658
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: nvidia tesla

Post by smatovic »

Hi, can nvidia tesla be helpful for chess engines, now or in the future?
http://zeta-chess.blogspot.com/

"Zeta" would profit from a Tesla GPU :)

It uses a best-first search algorithm and is weaker than a comparable (same evaluation, no special moves) depth-first searcher.

--
Srdja
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: nvidia tesla

Post by diep »

MM wrote: Hi, can nvidia tesla be helpful for chess engines, now or in the future?

Best
Yes Tesla is very useful:

a) for parameter tuning
b) it is possible to write a strong chess program for Tesla, but it will cost you dearly, and you had better hire me, as parallel search is what matters on it.

I have a few Tesla C2075s here that Nvidia gave me; each has 6GB of RAM, which is more than enough for hash tables.

Each has 448 cores @ 1.15GHz, which is a lot more raw power than CPUs offer.

Where they deliver a lot more power than CPUs, they are also harder to program. CPUs can run a different program on each core, each busy at its own spot in the program code.

GPUs are typical vectorized manycore processors, which have the property that within each compute unit (that's around 32 cores) all the cores execute the same instruction at the same time. The compute unit also gets called a SIMD, not to be confused with the SIMD of CPUs, which truly is a vector unit within each CPU core.
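To picture the lockstep execution described above, here is a toy host-side simulation (plain Python, not CUDA; the 32-lane warp size and the instruction-count cost model are illustrative assumptions, not hardware specs):

```python
# Toy model of SIMD lockstep execution: one "compute unit" of 32 lanes
# runs a single instruction stream; when lanes disagree on a branch, the
# hardware executes BOTH sides, masking off the inactive lanes each time.

WARP_SIZE = 32

def warp_cost(predicates, then_len, else_len):
    """Cycles to execute an if/else across one warp of lanes.

    predicates: per-lane branch outcomes (True -> 'then' side).
    then_len / else_len: instruction counts of the two sides.
    """
    any_then = any(predicates)
    any_else = not all(predicates)
    cost = 0
    if any_then:
        cost += then_len   # all lanes step through 'then', some masked off
    if any_else:
        cost += else_len   # ...and then through 'else' as well
    return cost

# Uniform branch: every lane takes the same side -> only one side is paid.
uniform = warp_cost([True] * WARP_SIZE, then_len=10, else_len=40)

# Divergent branch: even one dissenting lane forces both sides.
divergent = warp_cost([True] * (WARP_SIZE - 1) + [False], then_len=10, else_len=40)

print(uniform, divergent)
```

The point: one dissenting lane makes the whole compute unit pay for both sides of the branch, which is why branchy chess code maps poorly onto SIMD hardware.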

Most toy at home with gamer cards, which is not so clever, as those are lobotomized in several ways, among which fewer double-precision units, but sometimes also less bandwidth - and that hurts big time when programming something like chess, as you need that bandwidth to regularly consult the hash tables using a distributed (layered) model.

Tesla does not have these problems; it is also more reliable and therefore clocked LOWER than the gamer cards.
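The layered hash-table consultation mentioned above can be sketched as a two-level probe (a Python illustration; the table sizes, the dict-backed tables, and the function names are invented for this sketch, not taken from any engine):

```python
# Sketch of a "layered" transposition table: a small fast table per compute
# unit (standing in for on-chip shared memory) in front of one big slow
# global table (standing in for device RAM). Sizes are illustrative.

LOCAL_SLOTS = 1 << 10      # tiny: what would fit on-chip
GLOBAL_SLOTS = 1 << 22     # big: device RAM

local_table = {}           # dicts stand in for fixed-size probe tables
global_table = {}

def tt_store(key, entry):
    """Store in both layers; entries carry their full key for verification."""
    global_table[key % GLOBAL_SLOTS] = (key, entry)
    local_table[key % LOCAL_SLOTS] = (key, entry)

def tt_probe(key):
    """Probe the cheap local layer first; only on a miss pay for global RAM."""
    hit = local_table.get(key % LOCAL_SLOTS)
    if hit and hit[0] == key:
        return hit[1]                        # fast path: no bandwidth spent
    hit = global_table.get(key % GLOBAL_SLOTS)
    if hit and hit[0] == key:
        local_table[key % LOCAL_SLOTS] = hit  # promote into the fast layer
        return hit[1]
    return None

tt_store(0xDEADBEEF, {"depth": 12, "score": 35})
print(tt_probe(0xDEADBEEF))
print(tt_probe(0x12345678))
```

The design point is simply that the expensive global-memory trip is paid only on local misses, which is what makes regular hash probing affordable at all on this kind of hardware.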

Writing a chess program for Tesla is not easy and will eat up a lot of time.

Having a program made for Tesla for yourself that can play chess will cost you, in my case, around 40k euro for version 1.0.

It will play very strongly though; realize that the number of chess positions per second it gets will be huge, and the potential to hammer away all competition is there.

Guarantees can be given there.

So this is not a $100 project.

Realize that each core of a Tesla is a bit weaker than a normal CPU core, as each core of a normal CPU is out-of-order, in fact has a lot of execution units, and can issue a lot of instructions per clock tick, which a Tesla core cannot.

Yet this is a constant loss, which I have calculated, based upon experiments of others and my own CUDA toying in 2007, at a factor of 5.

Today's Tesla is a lot more powerful, however, so getting that huge nps is not such a big problem in itself. As some here might or might not know, already in the 90s there were chess programs on GPUs getting massive amounts of nps. So getting a huge nps (chess positions per second, or more accurately nodes per second) is just some work; the real hard work is having all the cores cooperate efficiently to effectively search deep and get that great search depth.

If you read this forum you will learn that programming GPUs is a special expertise; most who try are laymen, not professional, let alone talented, programmers.

This is an expertise job.

Don't let some beginners tell you it's too difficult. In the 90s impressive results were already achieved on professional video cards, and some engines actually played quite strong - all these kids here toying with gamer cards have no idea what has been achieved there.

So the real problem never was getting a superior nodes-per-second; in the 90s the GPU designers themselves already achieved that hands down, and in Germany a team also had some impressive results there. Nowadays you can find some of this on the internet if you search well. It is the combination of the hash table with the parallel search that is the real challenge to solve for today's gpgpu.

In 2007 I made a design on paper for Tesla, which took some months, that can really scale well and achieve a big speedup out of a GPU, at a small constant overhead.

This 'small constant overhead' is too much on a quadcore CPU, yet with thousands of GPU cores, that overhead is easy to pay for.

Tesla is the professional gpgpu card to program for.

It is a professional platform.

Professional platforms need professional payments however.

Speculating about the new tesla:

The new Tesla platform has not officially been released yet, but from the gamer card, which has 1536 cores at 1GHz, we can see that it will probably be a 1.7-3.4 Tflop (double precision) chip, thereby of course convincingly beating any projection given by Intel for their upcoming platform.

Now Intel very carefully showed 3 disclaimers at each presentation (also at the recent one in Switzerland - basically half the speech was just showing disclaimers), so probably they can start over; and some years from now, when Intel shows up with a 4 Tflop chip, Nvidia will also have their umpteenth-generation new chip, which is again double the upcoming Tesla :)

So Intel can already shred their upcoming 1 Tflop chip, as Nvidia already got near that a year ago, and the new chip, whose gamer card has already been released, can be projected to be much faster than something Intel still hasn't even publicly shown a prototype of.

Kind Regards,
Vincent Diepeveen
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: nvidia tesla

Post by Daniel Shawul »


GPUs are typical vectorized manycore processors, which have the property that within each compute unit (that's around 32 cores) all the cores execute the same instruction at the same time. The compute unit also gets called a SIMD, not to be confused with the SIMD of CPUs, which truly is a vector unit within each CPU core.

Most toy at home with gamer cards, which is not so clever, as those are lobotomized in several ways, among which fewer double-precision units, but sometimes also less bandwidth - and that hurts big time when programming something like chess, as you need that bandwidth to regularly consult the hash tables using a distributed (layered) model.

This is very wrong. Memory bandwidth has never been the problem.
Tesla does not have these problems; it is also more reliable and therefore clocked LOWER than the gamer cards.
This is simply false. Hash tables have always been a problem for GPUs, even for Teslas. There is some improvement in Tesla through the addition of the L1 cache, but it is still far from anything astounding. Read this 2011 PhD research to see how everyone who does GPU work is expecting a breakthrough in that aspect.

http://idav.ucdavis.edu/~dfalcant//down ... tation.pdf

Get your hands dirty instead of never-ending planning and insulting everyone else who actually tries something...
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: nvidia tesla

Post by Daniel Shawul »

If you can find a suitable algorithm for the hardware, it is indeed possible.
In fact I believe a good engine that plays checkers is very realistic. I am working on that right now. However, the algorithm I have is not suitable for chess.

Don't believe the rants some people make here... You can tell if you ask "what have you done?" :) Possible answer: "Pay me"
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: nvidia tesla

Post by diep »

Daniel Shawul wrote:

GPUs are typical vectorized manycore processors, which have the property that within each compute unit (that's around 32 cores) all the cores execute the same instruction at the same time. The compute unit also gets called a SIMD, not to be confused with the SIMD of CPUs, which truly is a vector unit within each CPU core.

Most toy at home with gamer cards, which is not so clever, as those are lobotomized in several ways, among which fewer double-precision units, but sometimes also less bandwidth - and that hurts big time when programming something like chess, as you need that bandwidth to regularly consult the hash tables using a distributed (layered) model.

This is very wrong. Memory bandwidth has never been the problem.
Tesla does not have these problems; it is also more reliable and therefore clocked LOWER than the gamer cards.
This is simply false. Hash tables have always been a problem for GPUs, even for Teslas. There is some improvement in Tesla through the addition of the L1 cache, but it is still far from anything astounding. Read this 2011 PhD research to see how everyone who does GPU work is expecting a breakthrough in that aspect.

http://idav.ucdavis.edu/~dfalcant//down ... tation.pdf

Get your hands dirty instead of never-ending planning and insulting everyone else who actually tries something...
I was comparing with the 90s GPUs that got some tens of millions of nps playing chess programs. They didn't have any form of cache/device RAM.

It's not easy to hash on Teslas, as I said before - but with a layered system you can get very good performance - I wrote that here before already - yet it seems you're not in the same league.

We are talking at totally different levels - from what I understand, you are still trying to build a move generator that has any decent performance.

But well, again - the hardware guys are expensive. The guys who got it to work in the 90s are in the highest league, and you are not. Some of them have a salary, bonuses counted, averaging over the past 20 years at 7 digits. That's US dollars, yeah.

Most of them were hardware engineers - there is a lot of money to make for hardware engineers now - they are much wanted.

So my 40k euro estimate is the utmost cheapest price you can get it for; they laugh at this, and one of them sometimes burns that up within 1 day fixing the aircraft he flies just for fun.

Don't act as if you know how to program in that league, ok.

We shouldn't act as if gpgpu programming is easy. It isn't. But this hardware can give magnificent performance. A few PhDs who just learned how to program you can't take seriously here. In fact you can easily speed up most of what they do by factors in efficiency just by looking at the code.

Note that in the 90s there were at least 2 projects I played against and whose programmers I communicated with. For one of the projects, some results were posted online a few years ago.

The reason that very few top programmers tried to build a chess program on modern hardware is that no one wants to pay the bill.

Therefore i also wrote down the price here - to not scare off investors.

Let me quote one of the top programmers, someone actively busy with this for years already, right from the start when gpgpu programming in CUDA became public: he told me he had some ideas for how to achieve big search depths on it, yet didn't see who would buy it.

This is also the reason why back in 2000 it was Donninger who got the FPGA job. In this case it was ChessBase who first asked a range of other programmers, who would have been better choices to carry it out in the first place; yet looking back, Donninger was a bad choice.

a) he doesn't know much about SMP coding and never got it going well there
b) it loses big time in efficiency
c) a stand-alone card back then could not beat other programs, 1 card against 1 program.

In short, in software Brutus/Hydra would have played a lot stronger.

Please note that Chrilly has some very good excuses why some things didn't go as they should have. But I'm not sure I can post that here.

I bet Julien will remove the posting right away then.

It has to do with specific companies simply not paying at all what was agreed, and specific universities which have no clue how to parallelize a chess program (the 16 CPUs at Paderborn university could all together handle fewer searches per second than a single FPGA card of Chrilly's delivered; one FPGA card got nearly 100k nps, versus the 16 CPUs together doing 16k searches a second, thanks to the slowdown of their parallel software framework).
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: nvidia tesla

Post by diep »

Daniel Shawul wrote:If you can find a suitable algorithm for the hardware, it is indeed possible.
Infact I belive a good engine that plays checkers is very real. I am working on that right now. However the algorithm I have is not suitable for chess.

Don't believe the rant some people make here... You can tell if you ask "what have you done?" :) Possible answer "Pay me"
8x8 checkers has been kind of solved.

With the first 10x10 international checkers program, which I made in the 90s, when I showed up at a tournament I outsearched everyone by a factor of 2 in plies on average, over the entire game.

That was full-width.

No pruning whatsoever.

Some move-ordering tricks though, and a high nps.

They didn't know how to generate moves fast, and they still don't know how to do that really quickly.

From what you posted, you now have the same problem for gpgpu chess. Fix that, is my tip.

As for the 10x10 checkers program, it took me 3 weeks of full-time hard work to get the first version going. After that, at most 1 day of work a year or so.

Some of these guys were busy every evening with their program.

I don't write this to spit on them - on the contrary. In fields where there has not been very big competition, it's possible to really outdo others by factors.

Right now, what's there in gpgpu for chess is not very well optimized for vector processing.

In the 90s they already knew how to solve this, you know; this is not rocket science, but it shows how much creativity someone possesses, as there is no downloadable example of how to do it.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: nvidia tesla

Post by diep »

Apologies - as usual I had posted and edited later:
This is also the reason why back in 2000 it was Donninger who got the FPGA job. In this case it was ChessBase who first asked a range of other programmers, who would have been better choices to carry it out in the first place; yet looking back, Donninger was a bad choice.

a) he doesn't know much about SMP coding and never got it going well there
b) it loses big time in efficiency
c) a stand-alone card back then could not beat other programs, 1 card against 1 program.
d) The biggest advantage of the FPGA hardware wasn't used, namely that you can build a very huge evaluation function for it more or less losslessly; Chrilly in fact had built kind of the world's smallest evaluation function for Hydra

e) computer chess requires good testing, and Chrilly wasn't capable of testing well

You cannot blame Chrilly for A nor E. E is a money question: if a sponsor makes promises at the start of a project, then so be it when those don't materialize - I don't know all the details about E.

A was to be solved by a specific university - let's not quote names here. But this was someone who is not even remotely capable of writing a good parallel search.

Hydra's parallel search is really bad and doesn't scale, as it requires O(n^2) communication.

So when I discussed building a 1024-node machine for Hydra with the Sheikh, I had to advise against it, of course. Besides, as I told him, Chrilly had worked in Petten, doing nuclear calculations on a computer there. FPGA cards on what back then would have been one of the world's strongest supercomputers is asking for trouble, as I explained to him; for example, Israel might have gotten very upset about the possibilities back then of a nuclear engineer who knows how to do those calculations writing software for a 1024-node supercomputer with 1 or more FPGA cards per node.

The initial goal, however - namely running on 1024 processors - could not have been achieved with Hydra, as the algorithm didn't scale. They basically scaled by not storing the last 6 plies or so in the transposition table.

Try such an experiment with your chess program and run it on 16 cores or more.

You will see that you instantly start losing a factor of 10.

Now add in some latency to the hash table - bigger machines have much worse latency for fetching a hash table entry.

That factor of 10 is about what Hydra lost in efficiency compared to the software search; with around 90k-100k searches a second per card, it did of course get a huge nps.

As usual they had an algorithm that scaled well for the small number of nodes they could test on (8 or so), but it did so by burning too much overhead, and they wrote something that in general doesn't scale at all.

All speedup comparisons of Hydra I also want to void, as they basically compared 1 Hydra processor not storing the last 6 plies in the hash table against n of them.

So they got the full nps, just like Deep Blue, but not the search depth they could have gotten.

Obviously, if you use just 1 processor, you can easily store all software plies in the hash table. So the correct comparison would have been 1 CPU hashing at all ply depths, as 1 CPU could do that easily, versus n CPUs not hashing the last n plies.

First losing a few plies of search depth in order to CLAIM a good speedup I find very bad science - and I'm really using a huge understatement here.
In short, in software Brutus/Hydra would have played a lot stronger.

Please note that Chrilly has some very good excuses why some things didn't go as they should have. But I'm not sure I can post that here.

I bet Julien will remove the posting right away then.

It has to do with specific companies simply not paying at all what was agreed, and specific universities which have no clue how to parallelize a chess program (the 16 CPUs at Paderborn university could all together handle fewer searches per second than a single FPGA card of Chrilly's delivered; one FPGA card got nearly 100k nps, versus the 16 CPUs together doing 16k searches a second, thanks to the slowdown of their parallel software framework).
Of course, with 100k nps I mean: 100k searches per second carried out by the hardware (each machine had 1 FPGA card).

The overall conclusion there, too, is that the focus should have been the SMP programming and building a big evaluation in the FPGA - that is what would really have profited from the FPGA.

Neither of those advantages was realized by the Hydra team.

gpgpu programming also has a lot of technical difficulties - but those Tesla cards offer possibilities you can exploit that you cannot easily exploit on your CPU.

Using the advantages of the Tesla takes professionals, and those really can do well with these cards.

See it like this: it only requires 1 professional to write something that works really well, and then everyone can profit from that - in theory.

The big problem in gpgpu programming is the parallel search; it's easy to prove that the best way to solve this requires at least a 3-level solution:

a) an SMP search between the GPUs, using the DDR3 RAM of the CPUs
b) an SMP search between the compute units (= SIMDs; that's around 32 cores each)
c) an SMP search within 1 compute unit

For comparison, normal SMP searches in software have just 1 layer of SMP search, which already isn't easy to build. So there are 3 hurdles here, with an efficiency loss at each layer. Minimizing that loss determines how good the speedup is over 1 CPU core doing the same search with a very efficient shared hash table (namely, shared everywhere).
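The nesting of the three layers can be sketched as a static work split (Python; the GPU/unit/lane counts and the round-robin deal are illustrative only - a real SMP search shares work dynamically rather than dealing out root moves once):

```python
# Minimal sketch of the 3-level decomposition described above: work is
# split first across GPUs, then across compute units within a GPU, then
# across the lanes of one compute unit. Purely illustrative numbers.

def split(items, n):
    """Deal items round-robin over n workers."""
    return [items[i::n] for i in range(n)]

def three_level_plan(root_moves, n_gpus=2, units_per_gpu=14, lanes=32):
    """Map (gpu, unit, lane) -> the root moves that worker searches."""
    plan = {}
    for g, per_gpu in enumerate(split(root_moves, n_gpus)):          # layer a
        for u, per_unit in enumerate(split(per_gpu, units_per_gpu)): # layer b
            for l, per_lane in enumerate(split(per_unit, lanes)):    # layer c
                if per_lane:
                    plan[(g, u, l)] = per_lane
    return plan

moves = [f"move{i}" for i in range(40)]
plan = three_level_plan(moves)

# every root move is assigned to exactly one worker
assigned = [m for chunk in plan.values() for m in chunk]
print(len(assigned), sorted(assigned) == sorted(moves))
```

Even this toy version shows where the per-layer efficiency loss comes from: each layer multiplies the number of workers that must coordinate, and the static split leaves many lanes idle once the move list thins out.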

The software search of Diep on a shared-memory machine - just the 2010 addition alone is 40-50 A4 pages full of proof.

Such a search you need to design on paper, not surprisingly, or it won't work.

I get the impression that the big paperwork required for this gets underestimated and laughed away by those who simply have no idea what you need to do to get the maximum out of the hardware.

That requires a paper design that you PROVE.

How to prove software programs correct on paper is a course given at some of the better universities. You can order books showing you how to do that. Usually it is the Einstein-level guys who are good at this.



Vincent
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: nvidia tesla

Post by Daniel Shawul »

diep wrote:
Daniel Shawul wrote:

GPUs are typical vectorized manycore processors, which have the property that within each compute unit (that's around 32 cores) all the cores execute the same instruction at the same time. The compute unit also gets called a SIMD, not to be confused with the SIMD of CPUs, which truly is a vector unit within each CPU core.

Most toy at home with gamer cards, which is not so clever, as those are lobotomized in several ways, among which fewer double-precision units, but sometimes also less bandwidth - and that hurts big time when programming something like chess, as you need that bandwidth to regularly consult the hash tables using a distributed (layered) model.

This is very wrong. Memory bandwidth has never been the problem.
Tesla does not have these problems; it is also more reliable and therefore clocked LOWER than the gamer cards.
This is simply false. Hash tables have always been a problem for GPUs, even for Teslas. There is some improvement in Tesla through the addition of the L1 cache, but it is still far from anything astounding. Read this 2011 PhD research to see how everyone who does GPU work is expecting a breakthrough in that aspect.

http://idav.ucdavis.edu/~dfalcant//down ... tation.pdf

Get your hands dirty instead of never-ending planning and insulting everyone else who actually tries something...
From now on I am going to point out your every BS, so please make your posts shorter so you don't waste my time and yours. All this hardware talk is nonsense. You are given a device and asked to use it. FPGA has nothing to do with it, since that is configurable. All you need to do is understand how the hardware works, which you don't seem to understand at all, judging from the mistakes you make again and again.

I asked you last time how you are going to do move generation for alpha-beta, since you claimed you use that. You can't do alpha-beta without putting the moves in global memory. The caches available to you are 64kb per SM and some 700kb of L2 for the whole device. Now please do the move generation for 64 plies and 1024 threads and see how much you are left with for hash tables. If you are thinking about running, say, 12 threads per SM, then there is no argument: that already gives up hugely on occupancy, so whatever speedup you get is far, far from what is achievable.
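The shared-memory argument above can be worked out as back-of-the-envelope arithmetic (the per-ply move budget and packed move size are illustrative assumptions; the 64 KB figure is the Fermi-era shared memory + L1 per SM mentioned in the post):

```python
# Why per-thread move stacks for deep alpha-beta cannot live on-chip:
# multiply out the storage and compare it with one SM's fast memory.
# All figures are illustrative assumptions, not measurements.

MAX_PLY        = 64          # search stack depth
MOVES_PER_PLY  = 12          # modest average move-list allowance per ply
BYTES_PER_MOVE = 4           # packed from/to/flags
THREADS        = 1024        # one full thread block
SHARED_PER_SM  = 64 * 1024   # Fermi-era shared memory + L1 per SM

per_thread = MAX_PLY * MOVES_PER_PLY * BYTES_PER_MOVE   # bytes per thread
total = per_thread * THREADS                            # bytes per block

print(per_thread)            # per-thread move storage in bytes
print(total // 1024)         # whole-block storage in KB
print(total > SHARED_PER_SM) # True -> the stacks spill to global memory
```

Even with these generous simplifications the block needs megabytes of move storage against kilobytes of on-chip memory, which is the core of the objection to naive per-thread alpha-beta on a GPU.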

The only reason I got my nps very, very high is that I used a monte-carlo approach that doesn't require you to store moves at all plies, but you never understood that... Srdja has tried to use alpha-beta, but he got a lower nps per SM. Still, he did not use more than one SM, since there is no synchronization between them at all...
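The memory argument for the monte-carlo approach, in a sketch: a rollout only ever materializes the move list of the current position, never a stack of them per ply (the toy game and all names here are invented for illustration):

```python
# Why a monte-carlo rollout is friendlier to a GPU thread: it needs only
# ONE position's moves at a time, so no per-ply move stacks have to live
# in (global) memory, unlike recursive alpha-beta.

import random

def random_playout(position, legal_moves, apply_move, max_len=60, seed=None):
    """Play random moves until the game ends; return the final position.

    Memory use is O(1) in depth: no move stack, no undo stack.
    """
    rng = random.Random(seed)
    for _ in range(max_len):
        moves = legal_moves(position)
        if not moves:
            break                      # terminal position reached
        position = apply_move(position, rng.choice(moves))
    return position

# Toy game: a counter that may step +1 or +2 until it reaches 10.
legal = lambda p: [1, 2] if p < 10 else []
apply = lambda p, m: p + m

final = random_playout(0, legal, apply, seed=42)
print(final >= 10)
```

The trade-off, of course, is that rollouts buy this memory frugality by giving up the exact backed-up values alpha-beta provides.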
I was comparing with the 90s GPUs that got some tens of millions of nps playing chess programs. They didn't have any form of cache/device RAM.

It's not easy to hash on Teslas, as I said before - but with a layered system you can get very good performance - I wrote that here before already - yet it seems you're not in the same league.
Get any textbook on GPUs and read about hash tables. Random access/store is a no-no. Did you even read the abstract of the paper I linked to?
We are talking at totally different levels - from what I understand, you are still trying to build a move generator that has any decent performance.
See above and answer the question instead of talking about other people like you always do. It doesn't make you any better to say "I know that guy who did this/that."
But well, again - the hardware guys are expensive. The guys who got it to work in the 90s are in the highest league, and you are not. Some of them have a salary, bonuses counted, averaging over the past 20 years at 7 digits. That's US dollars, yeah.
Again, you don't need hardware guys. You just need to understand the hardware.
Most of them were hardware engineers - there is a lot of money to make for hardware engineers now - they are much wanted.

So my 40k euro estimate is the utmost cheapest price you can get it for; they laugh at this, and one of them sometimes burns that up within 1 day fixing the aircraft he flies just for fun.
Dream on...
Don't act as if you know how to program in that league, ok.
I know you don't ...
We shouldn't act as if gpgpu programming is easy. It isn't. But this hardware can give magnificent performance. A few PhDs who just learned how to program you can't take seriously here. In fact you can easily speed up most of what they do by factors in efficiency just by looking at the code.
Laughable... what are you, some guy who has been programming for years and thinks academia is worthless? Think again.
Note that in the 90s there were at least 2 projects I played against and whose programmers I communicated with. For one of the projects, some results were posted online a few years ago.
The usual crap... "in the 90s"... etc. Vincent, let me just say that I respect you and everyone else who did chess programming. Everybody learns from the past, that is a given, so there is no point debating that.
The reason that very few top programmers tried to build a chess program on modern hardware is that no one wants to pay the bill.
You are so wrong about programmers. Programmers as such rarely push the AI, which is mostly done by some academician from a well-recognized institution. Just because you had a good alpha-beta searcher and know C doesn't mean you are a top programmer. If the hardware and algorithms are known, I am sure many programmers will do GPU programming, but they wait :)
Therefore I also wrote down the price here - to not scare off investors.
If you are really serious about it, I could even program it for you if you explain it well.
Let me quote one of the top programmers, someone actively busy with this for years already, right from the start when gpgpu programming in CUDA became public: he told me he had some ideas for how to achieve big search depths on it, yet didn't see who would buy it.
Your obsession with top programmers is ridiculous. How many do you know? :)
This is also the reason why back in 2000 it was Donninger who got the FPGA job. In this case it was ChessBase who first asked a range of other programmers, who would have been better choices to carry it out in the first place; yet looking back, Donninger was a bad choice.

a) he doesn't know much about SMP coding and never got it going well there
b) it loses big time in efficiency
c) a stand-alone card back then could not beat other programs, 1 card against 1 program.

In short, in software Brutus/Hydra would have played a lot stronger.

Please note that Chrilly has some very good excuses why some things didn't go as they should have. But I'm not sure I can post that here.

I bet Julien will remove the posting right away then.
But don't you think it is enough to just mention what you think the problem with Hydra is? I know you are one of the best in parallel search, but please make sure you have your facts straight when you say someone doesn't understand something.
It has to do with specific companies simply not paying at all what was agreed, and specific universities which have no clue how to parallelize a chess program (the 16 CPUs at Paderborn university could all together handle fewer searches per second than a single FPGA card of Chrilly's delivered; one FPGA card got nearly 100k nps, versus the 16 CPUs together doing 16k searches a second, thanks to the slowdown of their parallel software framework).
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: nvidia tesla

Post by diep »

Daniel, quiet down and don't say things I didn't write.

In all the postings I made, I speak of a 3-layer approach.

So now please write me a PROOF that this requires all cores to slow down by a factor of 100-1000 or so from reading device RAM and/or global shared RAM.

Your proof is absent. As long as you cannot prove things on paper, and laugh at me for proving my SMP search on paper first, you might never make it to the Einstein league.