This post is from a blog called Adventures in GraphicsLand that I’m writing with two fellow CS grad students, Chris Gibson and Ryan Schmitt. Articles about anything related to my graduate work in graphics or my thesis will be posted there and then cross-posted here. Articles about handy tips (like fixing bugs with VirtualBox or software setup on Fedora) will remain here. This post, which I wrote for AIGFX, originally appeared here.
Learning CUDA has definitely been an interesting experience. As much as they make it sound like it’s simple to get started (and for the most part, it is), there are lots of little traps that can keep you frustrated for hours… or days. Here are four problems that stumped me during initial development of Haste (which is now on GitHub!), along with workarounds that might be helpful to you.
Long running kernels on a desktop workstation
In Linux, X’s driver watchdog will kill a process that leaves the driver hanging for too long, so any GPU kernel you launch on a card that is driving a display has to return within a few seconds or it gets killed. (This happens in Windows, too, but I’m working mainly in Linux at the moment.) However, you might still want to test long-running kernels on your workstation. The way around this is to switch to a text-only terminal before running your CUDA program. On most Linux distributions, you can swap between terminals using Ctrl-Alt-F2 through Ctrl-Alt-F6, where each is a different terminal. If you hit Ctrl-Alt-F1 in Fedora 14, it will take you back to your X session (you’re still logged in and everything).
So, all you need to do is write code in your graphical desktop, compile, hit Ctrl-Alt-F2 to switch to a text-only terminal, then run your program for testing. When you want to go back to graphical mode to fix bugs, just Ctrl-Alt-F1 back and off you go.
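If you want to know whether a given card is subject to the watchdog at all, the runtime can tell you. Here is a minimal sketch, assuming a toolkit new enough to have cudaDeviceGetAttribute() (CUDA 5.0 or later); the function name countWatchdogDevices is my own, not part of any API. Whether the reported value changes after switching terminals is something I'd check on your own machine rather than assume.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints, for each CUDA device, whether the driver enforces a run-time limit
// on kernels (the watchdog). Returns how many devices have the limit enabled,
// or -1 if the device list could not be queried.
// NOTE: countWatchdogDevices is a hypothetical helper name for this sketch.
int countWatchdogDevices() {
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return -1;
    }

    int limited = 0;
    for (int dev = 0; dev < deviceCount; ++dev) {
        int timeoutEnabled = 0;
        err = cudaDeviceGetAttribute(&timeoutEnabled, cudaDevAttrKernelExecTimeout, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "device %d: %s\n", dev, cudaGetErrorString(err));
            continue;
        }
        printf("device %d: kernel run-time limit %s\n", dev,
               timeoutEnabled ? "ENABLED" : "disabled");
        if (timeoutEnabled) ++limited;
    }
    return limited;
}

int main() {
    return countWatchdogDevices() < 0 ? 1 : 0;
}
```

The same information shows up as "Run time limit on kernels" in the deviceQuery sample that ships with the CUDA SDK.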
Slow device info queries
If you’re doing development on a headless compute box (like our Tesla machine at Cal Poly), you might have noticed that querying device information takes a long time. This is compounded if it’s a multi-device machine. Our box at Poly has four Tesla GPUs, and Haste startup was frustratingly slow. All we did was query the device list once, then query each device individually using cudaGetDeviceProperties(). It usually took on the order of 30 to 45 seconds at program startup to get all the device information and allocate memory before we were off to the races launching kernels.
The problem is that the NVIDIA drivers normally maintain a lot of state about the GPUs in memory. However, this state is only there if there’s some resident process keeping it there, like X. If X is not running (or not even installed, like on our headless compute box), that state will need to get reinitialized every time you make a call that requires it. This can be excruciatingly slow, especially on multi-device machines.
The solution? Well, the easiest one is to just install and leave X running, even on a headless machine. Just make sure it’s not driving a display, or better yet switch it over to a text-only terminal with Ctrl-Alt-F2 to keep X around but not have it interfere with your kernels.
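To see whether this is what's biting you, it helps to time the queries directly. The following is a rough sketch of the same query pattern (device list once, then each device individually) with wall-clock timing wrapped around each call; timeDeviceQueries and msSince are names I made up for the example, and the numbers you get will depend on your driver and hardware.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Wall-clock milliseconds elapsed since `start`.
static double msSince(std::chrono::steady_clock::time_point start) {
    return std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - start).count();
}

// Times device enumeration and each per-device property query.
// Returns the number of devices found, or -1 if enumeration failed.
// NOTE: timeDeviceQueries is a hypothetical helper name for this sketch.
int timeDeviceQueries() {
    auto start = std::chrono::steady_clock::now();
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    printf("cudaGetDeviceCount: %d device(s) in %.1f ms\n", deviceCount, msSince(start));

    for (int dev = 0; dev < deviceCount; ++dev) {
        start = std::chrono::steady_clock::now();
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        double ms = msSince(start);
        if (err != cudaSuccess) {
            fprintf(stderr, "device %d: %s\n", dev, cudaGetErrorString(err));
            continue;
        }
        printf("device %d (%s): properties in %.1f ms\n", dev, prop.name, ms);
    }
    return deviceCount;
}

int main() {
    return timeDeviceQueries() < 0 ? 1 : 0;
}
```

Run it once with X stopped and once with X running; if the driver state explanation above holds for your machine, the second run should be noticeably faster.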
Printing debug info in device kernels
I must admit, while debuggers are neat, I tend to like printf() debugging. It’s not that I don’t see the value of debuggers; for some problems they’re really the only way to solve things. Maybe it has something to do with the fact that cuda-gdb inexplicably crashes on every machine and kernel I try to run it on.
With the Fermi architecture, available in cards of compute capability 2.0 and higher, you can actually call printf() directly from your device code now, without having to jump through any strange library hoops. Initially, however, I was never able to get it to work. I couldn’t find which CUDA header I needed to include to get things off the ground, and even when it seemed to compile it didn’t print anything.
Well, it sounds silly, but just #include <stdio.h> and away you go. I never tried this initially because I thought that didn’t make any sense. The C standard library doesn’t have CUDA device code! As best I can tell, nvcc is rewriting these standard calls from device code behind the scenes.
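For reference, here is about the smallest example I can come up with. The kernel name helloKernel and the launch size are arbitrary choices for illustration. One thing worth knowing if your output never shows up: device-side printf() writes into a buffer that only gets flushed at certain points, such as an explicit synchronization, so the sketch calls cudaDeviceSynchronize() before exiting. With toolkits from the Fermi era you also had to target the right architecture (something like -arch=sm_20), since I believe the default target was older than 2.0; newer toolkits no longer need that.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread prints its own block and thread index from device code.
__global__ void helloKernel() {
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

// Launches the kernel and waits for it, which also flushes the device-side
// printf buffer to stdout. Returns 0 on success, 1 on any CUDA error.
// NOTE: runHello is a hypothetical helper name for this sketch.
int runHello() {
    helloKernel<<<2, 4>>>();

    // Catch launch errors (bad configuration, no device, ...).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Without a synchronization point the program can exit before the
    // buffered output is ever printed.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}

int main() {
    return runHello();
}
```

The order of the lines across blocks is not guaranteed, so don't read anything into it when debugging.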
The device info’s maxThreadsPerBlock lies!
This one really irks me. If you query a device’s properties, it reports the maximum number of threads per block in a cudaDeviceProp struct member called, shockingly, maxThreadsPerBlock. The problem is that this is not necessarily the number of threads you can actually launch for a given kernel. That depends on how many registers and how much shared memory your kernel uses, which you can work through using the difficult-to-find occupancy calculator spreadsheet. You’ll also want to compile your kernel with the nvcc option --ptxas-options=-v to see the shared memory and register usage for your kernel. You’ll need it in the spreadsheet.
The occupancy limit doesn’t bug me so much as the fact that this is not mentioned anywhere in the documentation where maxThreadsPerBlock is mentioned. One would think that would be a great place to throw up a warning flag, letting developers know that that number is purely speculative, and that they need to do some real benchmarking of their kernel to find the best occupancy and thread launch combination. Essentially, the maxThreadsPerBlock element is entirely superfluous, since its only real use would be in scaling kernel launch sizes by number of device threads available. However, instead we should apparently embed the Excel worksheet in our program and have the device properties chug through the macros to provide any runtime adjustments based on the hardware we’re running on. (</sarcasm>) Yeesh.
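For what it’s worth, the runtime does expose a per-kernel version of this number through cudaFuncGetAttributes(), which reports the register count, static shared memory, and the largest block size that particular kernel can be launched with on the current device. A minimal sketch follows; scaleKernel is a made-up stand-in for your real kernel, kernelMaxThreads is my own helper name, and the device-wide comparison assumes a toolkit with cudaDeviceGetAttribute(). Note that this gives you a launch limit, not the best-performing block size, so the benchmarking still has to happen.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; substitute the kernel you actually launch.
__global__ void scaleKernel(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Reports the resource usage of scaleKernel and the largest block size it can
// be launched with on the current device. Returns that block size, or -1 on error.
// NOTE: kernelMaxThreads is a hypothetical helper name for this sketch.
int kernelMaxThreads() {
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, scaleKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return -1;
    }

    // Device-wide upper bound, for comparison (device 0 assumed here).
    int deviceMax = 0;
    err = cudaDeviceGetAttribute(&deviceMax, cudaDevAttrMaxThreadsPerBlock, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetAttribute failed: %s\n", cudaGetErrorString(err));
        return -1;
    }

    printf("registers per thread:        %d\n", attr.numRegs);
    printf("static shared memory:        %zu bytes\n", attr.sharedSizeBytes);
    printf("device maxThreadsPerBlock:   %d\n", deviceMax);
    printf("kernel maxThreadsPerBlock:   %d\n", attr.maxThreadsPerBlock);
    return attr.maxThreadsPerBlock;
}

int main() {
    return kernelMaxThreads() < 0 ? 1 : 0;
}
```

A tiny kernel like this one will usually report the same limit as the device; the two numbers start to diverge once a kernel uses a lot of registers per thread.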
Hopefully these tips help you out. As I continue to bang my head against the wall and find new tidbits I’ll be keeping track of them on my GitHub wiki page. Happy holidays!













