On the fourth day of Christmas my true love gave to me… four CUDA tips…

This post is from a blog called Adventures in GraphicsLand that I’m writing with two fellow CS grad students, Chris Gibson and Ryan Schmitt. Articles about anything related to my graduate work in graphics or my thesis will be posted there and then cross-posted here. Articles about handy tips (like fixing bugs with VirtualBox or software setup on Fedora) will remain here. This post, which I wrote for AIGFX, originally appeared here.

Learning CUDA has definitely been an interesting experience. As much as they make it sound like it’s simple to get started (and for the most part, it is), there are lots of little traps that can keep you frustrated for hours… or days. Here are four tips, each covering something that stumped me during initial development of Haste (which is now on GitHub!), that might be helpful to you.

Long running kernels on a desktop workstation

In Linux, X’s driver watchdog will kill a process that keeps the GPU busy for too long, so on a GPU that is also driving your display you can’t launch a kernel unless it returns within a few seconds. (This happens in Windows, too, but I’m working mainly in Linux at the moment.) However, you might want to test long-running kernels on your workstation. The way around this is to switch to a text-only terminal before running your CUDA program. On most Linux distributions, you can swap between terminals using Ctrl-Alt-F2 through Ctrl-Alt-F6, where each is a different terminal. If you hit Ctrl-Alt-F1 in Fedora 14, it will take you back to your X session (you’re still logged in and everything).

So, all you need to do is write code in your graphical desktop, compile, hit Ctrl-Alt-F2 to switch to a text-only terminal, then run your program for testing. When you want to go back to graphical mode to fix bugs, just Ctrl-Alt-F1 back and off you go.
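If you’re not sure whether a given GPU is subject to the watchdog at all, the runtime can tell you. Here’s a minimal sketch, assuming a reasonably recent toolkit (cudaDeviceGetAttribute() and cudaDevAttrKernelExecTimeout aren’t in the 3.2-era toolkit used elsewhere in this post); it only reports what the driver says and doesn’t change anything:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        // Non-zero means the driver enforces a run time limit on kernels
        // launched on this device.
        int timeoutEnabled = 0;
        err = cudaDeviceGetAttribute(&timeoutEnabled, cudaDevAttrKernelExecTimeout, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "Device %d: query failed: %s\n", dev, cudaGetErrorString(err));
            continue;
        }
        printf("Device %d: kernel run time limit %s\n",
               dev, timeoutEnabled ? "enabled" : "disabled");
    }
    return 0;
}
```

Running it once from your graphical desktop and once from the text-only terminal is a quick way to see what the driver reports in each case on your own setup.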

Slow device info queries

If you’re doing development on a headless compute box (like our Tesla machine at Cal Poly), you might have noticed that querying device information takes a long time. This is compounded if it’s a multi-device machine. Our box at Poly has four Tesla GPUs, and Haste startup was frustratingly slow. All we did was query the device list once, then query each device individually using cudaGetDeviceProperties(). It usually took on the order of 30 to 45 seconds at program startup to get all the device information and allocate memory before we were off to the races launching kernels.
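For reference, the query pattern described above looks roughly like the sketch below. This is not the actual Haste startup code, just an illustration of the same calls with a wall-clock timer around them so you can measure the delay on your own machine. The timer uses std::chrono, which assumes a host compiler with C++11 support.

```cuda
#include <stdio.h>
#include <chrono>
#include <cuda_runtime.h>

int main(void)
{
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();

    // Query the device list once...
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // ...then query each device individually.
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "Device %d: %s\n", dev, cudaGetErrorString(err));
            continue;
        }
        printf("Device %d: %s, %.0f MB global memory, %d multiprocessors\n",
               dev, prop.name,
               (double)prop.totalGlobalMem / (1024.0 * 1024.0),
               prop.multiProcessorCount);
    }

    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(end - start).count();
    printf("Queried %d device(s) in %.2f seconds\n", count, seconds);
    return 0;
}
```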

The problem is that the NVIDIA drivers normally maintain a lot of state about the GPUs in memory. However, this state is only there if there’s some resident process keeping it there, like X. If X is not running (or not even installed, like on our headless compute box), that state will need to get reinitialized every time you make a call that requires it. This can be excruciatingly slow, especially on multi-device machines.

The solution? Well, the easiest one is to just install and leave X running, even on a headless machine. Just make sure it’s not driving a display, or better yet switch it over to a text-only terminal with Ctrl-Alt-F2 to keep X around but not have it interfere with your kernels.

Printing debug info in device kernels

I must admit, while debuggers are neat, I tend to like printf() debugging. It’s not that I don’t see the value of debuggers; for some problems they’re really the only way to solve things. Maybe it has something to do with the fact that cuda-gdb inexplicably crashes on every machine and kernel I try to run it on.

With the Fermi architecture, available in cards of compute capability 2.0 and higher, you can actually call printf() directly from your device code now, without having to jump through any strange library hoops. Initially, however, I was never able to get it to work. I couldn’t find which CUDA header I needed to include to get things off the ground, and even when it seemed to compile it didn’t print anything.

Well, it sounds silly, but just #include <stdio.h> and away you go. I never tried this initially because I thought that didn’t make any sense. The C standard library doesn’t have CUDA device code! As best I can tell, nvcc is rewriting these standard calls from device code behind the scenes.
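Here’s a minimal sketch of what that looks like. The kernel name and launch dimensions are made up for illustration. Two assumptions to note: on toolkits of this era you’d compile for a Fermi target (for example, nvcc -arch=sm_20), and the output from device-side printf() is buffered and only flushed at certain points such as a synchronization call, so a program that exits without synchronizing may print nothing. The sketch uses cudaDeviceSynchronize(); on toolkits older than 4.0 the equivalent call is cudaThreadSynchronize().

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Each thread prints its own block and thread index.
__global__ void helloKernel(void)
{
    printf("Hello from block %u, thread %u\n", blockIdx.x, threadIdx.x);
}

int main(void)
{
    // 2 blocks of 4 threads each: 8 lines of output, in no guaranteed order.
    helloKernel<<<2, 4>>>();

    // Wait for the kernel to finish; this also flushes the device printf buffer.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "Kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```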

The device info’s maxThreadsPerBlock lies!

This one really irks me. If you query a device’s properties, it reports the maximum number of threads per block in a cudaDeviceProp struct member called, shockingly, maxThreadsPerBlock. The problem is that this is not necessarily the number of threads you can actually launch. That depends on how many registers and how much shared memory your kernel uses, the same numbers that determine its occupancy, which you can figure out using the difficult-to-find occupancy calculator spreadsheet. You’ll also want to compile your kernel with the nvcc option --ptxas-options=-v to see the shared memory and register usage for your kernel. You’ll need it in the spreadsheet.

The occupancy limit doesn’t bug me so much as the fact that this is not mentioned anywhere in the documentation where maxThreadsPerBlock is mentioned. One would think that would be a great place to throw up a warning flag, letting developers know that that number is purely speculative, and that they need to do some real benchmarking of their kernel to find the best occupancy and thread launch combination. Essentially, the maxThreadsPerBlock element is entirely superfluous, since its only real use would be in scaling kernel launch sizes by number of device threads available. However, instead we should apparently embed the Excel worksheet in our program and have the device properties chug through the macros to provide any runtime adjustments based on the hardware we’re running on. (</sarcasm>) Yeesh.
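One way to get a per-kernel number at run time, rather than the device-wide ceiling, is cudaFuncGetAttributes(). It reports a maxThreadsPerBlock for a specific compiled kernel, along with its register and shared memory usage. Here’s a minimal sketch; scaleKernel is a made-up kernel used purely for illustration, and this is only a launch limit, not a substitute for benchmarking different block sizes.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel, here only so there is something to query.
__global__ void scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

int main(void)
{
    // Device-wide ceiling, the same for every kernel.
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Per-kernel limit, which accounts for this kernel's resource usage.
    cudaFuncAttributes attr;
    err = cudaFuncGetAttributes(&attr, scaleKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("Device maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    printf("Kernel maxThreadsPerBlock: %d\n", attr.maxThreadsPerBlock);
    printf("Registers per thread:      %d\n", attr.numRegs);
    printf("Static shared memory:      %zu bytes\n", attr.sharedSizeBytes);
    return 0;
}
```

Newer toolkits (6.5 and later) also provide cudaOccupancyMaxPotentialBlockSize(), which suggests a block size for a given kernel without the spreadsheet.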

Hopefully these tips help you out. As I continue to bang my head against the wall and find new tidbits I’ll be keeping track of them on my GitHub wiki page. Happy holidays!

CUDA Development Environment Setup Under Windows

Getting a complete CUDA development environment up and running under Windows can be a bit… daunting. Between all the dev drivers, SDKs, toolkits, and other trimmings it can take several hours to get your workstation up and running. However, the results are pretty nice.

This guide will give you the following setup:

  • Visual Studio 2010 (C/C++ compiler and IDE)
  • CUDA Toolkit 3.2 (CUDA C compiler and runtime)
  • GPU Computing SDK 3.2 (sample code and utility libraries)
  • Parallel Nsight 1.5 (live debugging and performance analysis of CUDA code)
  • Visual Assist X 10.6 (syntax highlighting and completion goodies)

There are, however, some drawbacks that you should be aware of:

  • In addition to Visual Studio 2010, you need Visual Studio 2008. This is because the CUDA compiler (nvcc) only supports the VS 9.0 build tools at the moment. You can still develop in and compile from VS 2010, however.
  • You need two CUDA-capable GPUs in your machine to do CUDA debugging with Parallel Nsight.
  • Installation and setup will take the whole afternoon.

For reference, my machine is running Windows 7 Professional x64. Your mileage may vary.

Step 1: Get everything downloaded.

Make sure you have downloaded installers or installation disks handy for everything in the setup list above, plus Visual Studio 2008 (with the SP1 update) and the NVIDIA developer driver.

Step 2: Install Visual Studio(s).

VS 2008 and VS 2010 can coexist side-by-side. Install VS 2008, then the SP1 update. Lastly, install VS 2010. Be sure to launch Windows Update afterwards to pull down any patches to the dev tools.

Step 3: Install Visual Assist X.

If you haven’t used Visual Assist X before, you’ve been missing out. It has lots of refactoring and code exploration features, but its killer feature for me is how well it complements Visual Studio’s IntelliSense. When you run the installer, you’ll be able to select whether you want to install it for both versions of VS or just 2010. It’s up to you, but we only need VS 2008 for the compiler, so you can get away with just installing it for VS 2010.

Step 4: Install the CUDA tools.

First install the developer driver, then the CUDA Toolkit and the GPU Computing SDK. The default installation paths are fine. Lastly, install the Parallel Nsight Host and Parallel Nsight Monitor.

Step 5: Configure Visual Assist X to know about CUDA.

To get full syntax highlighting and include support, we need to tell VAX about our CUDA libraries as well as the fact that it should treat CUDA files as C/C++ files.

Launch VS 2010 and open the Options screen (Tools > Options). Under Projects and Solutions > VC++ Project Settings add the following entries to the Extensions to Include item:

  • .cu
  • .cu.h (or whatever you use for your CUDA headers)

It should look something like this:

[Screenshot: Extensions to Include]

Now close VS 2010 and open up the registry editor (Start > regedit.exe). Browse to the following key:

  • HKEY_CURRENT_USER\Software\Whole Tomato\Visual Assist X\VANet10

Now look for the ExtHeader value and add .cu.h to the list. Make sure the whole line ends with a semicolon. It should look something like this:

[Screenshot: Regedit ExtHeader]

Look a little further down and you should see the ExtSource value. Add .cu to the list in the same way. Again, make sure the line ends with a semicolon.

[Screenshot: Regedit Source]

Now relaunch VS 2010 and open the VAX options (VAssistX > Visual Assist X Options). Open the Projects group on the left and select C/C++ Directories. Under Platform select Custom. Then, select Stable Include Files from the drop down on the right and add the paths to your CUDA toolkit includes and GPU Computing SDK includes. If you used the default installation directories, these are:

  • C:\Program Files (x86)\NVIDIA GPU Computing Toolkit\CUDA\v3.2\include
  • C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 3.2\C\common\inc

In other words, your screen should look something like this:

[Screenshot: Stable Includes]

Now switch the drop down to Source files and add the following paths:

  • C:\Program Files (x86)\NVIDIA GPU Computing Toolkit\CUDA\v3.2\src
  • C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 3.2\C\common\src

[Screenshot: Source Files]

Lastly, select the Performance item from the left and click the Rebuild Symbol Databases button.

[Screenshot: Rebuild Symbol Databases]

You should be good to go now with Visual Assist X and CUDA.

Step 6: A bare bones CUDA project.

To take you through the process of setting up a new CUDA project in VS 2010, here’s a simple bare bones console application that adds two numbers on the GPU.

First, go to File > New > Project. Select Win32 Console Application from the Visual C++ category. Enter a location and a name and click OK. In the wizard, uncheck Precompiled Headers and check Empty Project.

Now in the Solution Explorer, right click on the name of your project and select Build Customizations. In the box that pops up, select CUDA 3.2 and click OK.

[Screenshot: Build Customizations]

Now add a new source file to your project. Let’s call it hello.cu. Once it’s added, right click on it in the solution explorer and select Properties. Select the General item on the left and make sure the Item Type is set to CUDA C/C++.

[Screenshot: CUDA Item Type]

Lastly, we need to make sure we’re including the GPU Computing SDK headers and linking to the CUDA runtime library, as well as tell Visual Studio to use the 2008 (9.0) version of the compiler.

Right click on your project and select Properties. From the Configuration drop down, select All Configurations. Under Configuration Properties > General, select v90 from the Platform Toolset item.

[Screenshot: Platform Toolset]

Under Configuration Properties > CUDA C/C++ > Common, add the GPU Computing SDK include path to Additional Include Directories. If you chose the default installer path, it will be:

  • C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 3.2\C\common\inc

[Screenshot: CUDA Additional Include Directories]

Now under Configuration Properties > Linker > Input add cudart.lib to your Additional Dependencies.

[Screenshot: Linking CUDArt]

Apply the settings and click OK. All we need now is to flesh out our hello.cu file with a sample program and we’re good to go. Here’s a sample program that adds two integers on the GPU and prints the result. (Note that this is not the default VS 2010 or VAX syntax highlighting; I’ve done some heavy customization.)

[Screenshot: CUDA Example Program]
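The listing above was posted as an image, so here’s a text version you can paste into hello.cu. It’s a minimal sketch of a program that adds two integers on the GPU, not necessarily identical to the one in the screenshot; the kernel name, the CHECK macro, and the input values are all made up for illustration.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Bail out with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e = (call);                                       \
        if (e != cudaSuccess) {                                       \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e));                           \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Runs on the GPU: adds a and b and stores the sum in device memory.
__global__ void add(int a, int b, int *result)
{
    *result = a + b;
}

int main(void)
{
    int hostResult = 0;
    int *devResult = NULL;

    // Allocate one int on the device to hold the answer.
    CHECK(cudaMalloc((void **)&devResult, sizeof(int)));

    // A single block with a single thread is plenty for one addition.
    add<<<1, 1>>>(2, 7, devResult);
    CHECK(cudaGetLastError());

    // Copy the answer back; this blocks until the kernel has finished.
    CHECK(cudaMemcpy(&hostResult, devResult, sizeof(int), cudaMemcpyDeviceToHost));
    CHECK(cudaFree(devResult));

    printf("2 + 7 = %d\n", hostResult);
    return 0;
}
```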

Hit the run button and away you go. For fun (and to learn more about Parallel Nsight), try setting a breakpoint in your GPU code and debugging it. :)

Killing Trees with Maximum Efficiency

So hopefully it doesn’t seem like we’ve jumped the shark before things have even gotten rolling, but today I’m going to write about… printers.

“But Bob,” you ask, “this is a blog about graphics. Why are you writing about printers?”

Well, first of all, because my trusty HP PSC 1315 All-in-One is wheezing its way out of existence at the moment. I’m shocked that a $50 printer/scanner lasted me over 5 years, but it’s held up quite well.

[Image: HP PSC 1315]

Secondly, because this blog is about our graphics research and research means one thing… printing out boatloads of other people’s research to read through. Sure, everything’s a PDF these days so I could read it on the screen, but screens mean internet access and internet access means hours wasted on Reddit, Facebook, or Minecraft instead of reading. Eventually you realize that while you just spent hours building TNT-powered sheep cannons in a pixelated voxel world, your PDF, sadly, did not read itself. I know that I have a much higher chance of actually reading something if I can kick back on my couch or bed away from those horrible backlit time sinks and read something on a good ole’ stack of dead trees and ink.

Convinced yet? If not, then this post was written by Chris. If you are, check out the sweet laser printer I just ordered from Newegg.

[Image: Brother HL-2270DW]

Looks pretty humble, but check out the specs:

  • 27 pages per minute (a page about every 2 seconds!)
  • Wired and wireless network connectivity
  • Auto-duplex (prints on both sides without having to refeed!)
  • Only $150 shipped.

On top of that, the toner cartridges are only $57. That’s what it already costs me to replace my inkjet cartridges, except the toner cartridges will print about 2600 pages before needing to be replaced, and you can set the printer to toner-save mode to get even better mileage out of them.

Now, granted, it is only a monochrome printer, but I’m primarily just printing text these days anyway (research papers and whatnot). I’m particularly excited about the auto-duplexing, because I can save half my paper that way and I might actually have a fighting chance of getting a staple through some of these research papers. In fact, I’m kinda surprised at how excited I am over… a printer.

As a disclaimer, I have not received the Brother HL-2270DW yet, so I can’t comment on whether or not it’s actually all that and a bag of chips, but the Amazon and Newegg reviews are pretty positive. Assuming it doesn’t catch fire and burn my house down* when I plug it in, I think I’ll be happy with it.

* If it does catch fire and burn my house down, I reserve the right to edit this post with vicious commentary representative of only a single user’s poor experience with the HL-2270DW, neglecting everyone else’s glowing reviews. You know, like a real blogger would do.

EDIT: I’ve had the HL-2270DW for over a month now. Yes, it is all that and a bag of chips. Very happy with it, and Newegg has even run a few specials where you can pick it up for around $90. It’s an absolute steal at that price.