On the fourth day of Christmas my true love gave to me… four CUDA tips…

This post is from a blog called Adventures in GraphicsLand that I’m writing with two fellow CS grad students, Chris Gibson and Ryan Schmitt. Articles about anything related to my graduate work in graphics or my thesis will be posted there and then cross-posted here. Articles about handy tips (like fixing bugs with VirtualBox or software setup on Fedora) will remain here. This post, which I wrote for AIGFX, originally appeared here.

Learning CUDA has definitely been an interesting experience. As much as they make it sound like it’s simple to get started (and for the most part, it is), there are lots of little traps that can keep you frustrated for hours… or days. Here are four tips about things that stumped me during initial development of Haste (which is now on GitHub!) that might be helpful to you.

Long running kernels on a desktop workstation

In Linux, X’s driver watchdog will kill a kernel that keeps the display driver busy for too long, so on a GPU that is also driving your desktop, a kernel has to return within the watchdog’s time limit (typically a few seconds). (This happens in Windows, too, but I’m working mainly in Linux at the moment.) However, you might want to test long-running kernels on your workstation. The way around this is to switch to a text-only terminal before running your CUDA program. On most Linux distributions, you can swap between terminals using Ctrl-Alt-F2 through Ctrl-Alt-F6, where each is a different terminal. If you hit Ctrl-Alt-F1 in Fedora 14, it will take you back to your X session (you’re still logged in and everything).

So, all you need to do is write code in your graphical desktop, compile, hit Ctrl-Alt-F2 to switch to a text-only terminal, then run your program for testing. When you want to go back to graphical mode to fix bugs, just Ctrl-Alt-F1 back and off you go.

Slow device info queries

If you’re doing development on a headless compute box (like our Tesla machine at Cal Poly), you might have noticed that querying device information takes a long time. This is compounded if it’s a multi-device machine. Our box at Poly has four Tesla GPUs, and Haste startup was frustratingly slow. All we did was query the device list once, then query each device individually using cudaGetDeviceProperties(). It usually took on the order of 30 to 45 seconds at program startup to get all the device information and allocate memory before we were off to the races launching kernels.

The problem is that the NVIDIA drivers normally maintain a lot of state about the GPUs in memory. However, this state is only there if there’s some resident process keeping it there, like X. If X is not running (or not even installed, like on our headless compute box), that state will need to get reinitialized every time you make a call that requires it. This can be excruciatingly slow, especially on multi-device machines.

The solution? Well, the easiest one is to just install X and leave it running, even on a headless machine. Just make sure it’s not driving a display, or better yet switch it over to a text-only terminal with Ctrl-Alt-F2 to keep X around but not have it interfere with your kernels.

Printing debug info in device kernels

I must admit, while debuggers are neat, I tend to like printf() debugging. It’s not that I don’t see the value of debuggers; for some problems they’re really the only way to solve things. Maybe it has something to do with the fact that cuda-gdb inexplicably crashes on every machine and kernel I try to run it on.

With the Fermi architecture, available in cards of compute capability 2.0 and higher, you can actually call printf() directly from your device code now, without having to jump through any strange library hoops. Initially, however, I was never able to get it to work. I couldn’t find which CUDA header I needed to include to get things off the ground, and even when it seemed to compile it didn’t print anything.

Well, it sounds silly, but just #include <stdio.h> and away you go. I never tried this initially because I thought that didn’t make any sense. The C standard library doesn’t have CUDA device code! As best I can tell, nvcc is rewriting these standard calls from device code behind the scenes.

The device info’s maxThreadsPerBlock lies!

This one really irks me. If you query a device’s properties, it reports the maximum number of threads per block in a cudaDeviceProp struct member called, shockingly, maxThreadsPerBlock. The problem is that this is not necessarily the number of threads you can actually launch. That depends on how many registers and how much shared memory your kernel uses, which you can work through using the difficult-to-find occupancy calculator spreadsheet. You’ll also want to compile your kernel with the nvcc option --ptxas-options=-v to see the shared memory and register usage for your kernel. You’ll need it in the spreadsheet.

The occupancy limit doesn’t bug me so much as the fact that this is not mentioned anywhere in the documentation where maxThreadsPerBlock is mentioned. One would think that would be a great place to throw up a warning flag, letting developers know that the number is purely speculative, and that they need to do some real benchmarking of their kernel to find the best occupancy and thread launch combination. Essentially, the maxThreadsPerBlock element is entirely superfluous, since its only real use would be in scaling kernel launch sizes by the number of device threads available. However, instead we should apparently embed the Excel worksheet in our program and have the device properties chug through the macros to provide any runtime adjustments based on the hardware we’re running on. (</sarcasm>) Yeesh.
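For what it’s worth, the register-bound part of that spreadsheet arithmetic fits in a few lines. What follows is only a rough sketch in plain C++, not what the occupancy calculator actually does: the function name is made up for illustration, it ignores shared memory and the granularity in which real hardware hands out registers, and it assumes you feed it maxThreadsPerBlock, regsPerBlock and warpSize from cudaDeviceProp plus the per-thread register count that --ptxas-options=-v prints.

```cpp
#include <algorithm>
#include <cassert>

// Rough upper bound on threads per block once register pressure is counted.
// Hypothetical helper: every input is supplied by the caller, and nothing
// here talks to the GPU or models shared memory.
int RegisterLimitedThreadsPerBlock(int maxThreadsPerBlock,
                                   int regsPerBlock,
                                   int regsPerThread,
                                   int warpSize) {
    assert(regsPerThread > 0 && warpSize > 0);

    // How many threads the block's register budget can feed at all.
    int byRegisters = regsPerBlock / regsPerThread;

    // Never exceed the limit the device itself reports.
    int limit = std::min(maxThreadsPerBlock, byRegisters);

    // Round down to a whole number of warps.
    return (limit / warpSize) * warpSize;
}
```

With numbers that are typical of a Fermi card but are assumptions here (1024 threads per block, 32768 registers per block, warps of 32), a kernel using 40 registers per thread comes out at 800 threads rather than 1024. Treat the result as a starting point for benchmarking, since the real limit can be lower.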

Hopefully these tips help you out. As I continue to bang my head against the wall and find new tidbits I’ll be keeping track of them on my GitHub wiki page. Happy holidays!

Include dependencies

Generally when I write software, I try to keep things relatively well organized. Inevitably, however, things are going to get a bit messy, especially if you’re working on a large, disorganized codebase that you didn’t write to begin with… say, oh… something like the Source SDK.

Frequently you have some class which is composed inside another class, but occasionally needs to access the class it’s composed inside of. Basically, the classes are composed inside each other, though the abstraction really only makes sense in one direction. Confused yet?

In this example, we’ll use a Refrigerator class which stores inside it an instance of a Cheese class. Why cheese, you ask? Because cheese is delicious. Also, our refrigerator is from the future and can slice and serve cheese just like the built-in ice maker and water dispenser. It’s a pretty sweet fridge.

Now, we were all taught to keep our #includes in our header files, not the implementation files, so like good little programmers we construct our classes like so:

refrigerator.h

#include "cheese.h"

class Refrigerator {
private:
    Cheese *pCheese;
    int temp = 35;

public:
    void ServeCheese();
    int GetTemp();
};

refrigerator.cpp

#include <stdio.h>  // for printf()
#include "refrigerator.h"

void Refrigerator::ServeCheese() {
    printf("Now dispensing %s cheese!\n", pCheese->GetFlavor());
}

int Refrigerator::GetTemp() {
    return temp;
}

cheese.h

class Cheese {
private:
    Refrigerator *pFridge;
public:
    const char *GetFlavor();  // const: the flavor is a string literal
    void CheckTemp();
    void BeginMolding();
};

cheese.cpp

#include "cheese.h"

void Cheese::CheckTemp() {
    if (pFridge->GetTemp() > 45) {
        BeginMolding();
    }
}

const char *Cheese::GetFlavor() {
    return "cheddar";
}

I’ve left out the constructors in this example for brevity, but let’s assume that they get the pointers set up correctly so that our instance of the Refrigerator class has a correct pointer to an instance of the Cheese class and vice versa.

Now, at this point you may be screaming that this needs to be refactored and reorganized. Yes, it probably does. But there are many instances where you simply can’t, and in fact the abstraction really only makes sense one way. The fridge has cheese in it, but the cheese certainly doesn’t have a fridge in it. We just need that pesky reference around so we can check the temperature of the fridge every so often.

(Yes, I am aware that the fridge could push its temperature down to all the items in it, à la the Observer Pattern. Yes, I am aware that would be a better solution. But this is a contrived example anyway, so stick with me here.)

Now, the code given above doesn’t compile, because the Cheese class has no idea what the heck this Refrigerator class is, so we either need to include it or forward declare it. If we try to do this:

cheese.h

#include "refrigerator.h"

our compiler (more specifically, the preprocessor) is going to get very angry at us, because the two headers now include each other, and exactly how it fails depends on which order it decides to compile refrigerator and cheese. The solution is a forward declaration:

cheese.h

class Refrigerator;

class Cheese {
    // ...etc...
};

Basically what this does is tell the compiler, “Hey! There’s this class called Refrigerator that I might talk about, so here’s an empty declaration of it!”

The problem, though, is that this is rather limiting. Within the Cheese class, we can declare pointers to the Refrigerator class, no problem. Pointers are of fixed size, so the compiler doesn’t much care what it’s a pointer to, since it knows how much memory it needs to hold the pointer. When we try to access members of the class, though, like properties or methods, it falls apart because as far as the compiler knows, the class is empty. After all, we told it the Refrigerator class didn’t have anything in it.
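To make that limitation concrete, here is a small sketch of what a forward declaration does and doesn’t allow. CheeseSlot, AttachFridge, HasFridge and FridgePointerSize are names I made up for illustration; only Refrigerator and GetTemp() come from the example above.

```cpp
#include <cassert>
#include <cstddef>

class Refrigerator;  // forward declaration only: no members are known here

// Holding a pointer to an incomplete type is fine.
struct CheeseSlot {
    Refrigerator *pFridge = nullptr;
};

// Storing, copying and comparing the pointer never looks inside the object.
void AttachFridge(CheeseSlot &slot, Refrigerator *pFridge) {
    assert(pFridge != nullptr);
    slot.pFridge = pFridge;
}

bool HasFridge(const CheeseSlot &slot) {
    return slot.pFridge != nullptr;
}

// sizeof a pointer works even though sizeof(Refrigerator) would not compile.
std::size_t FridgePointerSize() {
    return sizeof(Refrigerator *);
}

// Uncommenting this function gives an "incomplete type" error, because the
// compiler has not seen Refrigerator's members yet:
//
// int PeekTemp(const CheeseSlot &slot) { return slot.pFridge->GetTemp(); }
```

Everything that only moves the pointer around compiles; the moment you dereference it, the compiler needs the full class definition.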

So if we can’t #include it and forward declaring it doesn’t give us what we want, what can we do?

Well, we can do both. Kind of.

The solution is to forward declare in your header file, and #include in your implementation file. This will avoid the preprocessor headaches of the chicken-and-egg #include, while allowing us to access the members of the forward-declared class in the implementation. In other words, here’s the fix:

cheese.h

class Refrigerator;

class Cheese {
    // ...etc...
};

cheese.cpp

#include "cheese.h"
#include "refrigerator.h"
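If you want to convince yourself this works without juggling four files, the same arrangement can be condensed into a single translation unit. This is a trimmed-down sketch rather than the full example: the constructors, the molding flag and IsMolding() are additions of mine so the effect is observable, and the fridge’s own Cheese pointer is left out.

```cpp
#include <cassert>

// ---- cheese.h: a forward declaration is enough for the pointer member ----
class Refrigerator;

class Cheese {
private:
    Refrigerator *pFridge;
    bool molding = false;                       // hypothetical flag
public:
    explicit Cheese(Refrigerator *fridge) : pFridge(fridge) {}
    void CheckTemp();
    void BeginMolding();
    bool IsMolding() const { return molding; }  // hypothetical helper
};

// ---- refrigerator.h: the full definition of Refrigerator ----
class Refrigerator {
private:
    int temp;
public:
    explicit Refrigerator(int startTemp) : temp(startTemp) {}
    int GetTemp() const { return temp; }
};

// ---- cheese.cpp: both "headers" are visible, so member access compiles ----
void Cheese::CheckTemp() {
    assert(pFridge != nullptr);
    if (pFridge->GetTemp() > 45) {
        BeginMolding();
    }
}

void Cheese::BeginMolding() {
    molding = true;
}
```

Cheese’s member functions are only defined after Refrigerator is complete, which is exactly what putting the #include in cheese.cpp buys you. A Cheese attached to a fridge at 35 degrees stays fresh after CheckTemp(); one attached to a fridge at 50 degrees starts molding.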

Again, it goes without saying, the better solution is to refactor or rearchitect your code if you can. These kinds of hacks can get really out of hand and are usually a code smell telling you that something needs to be fixed at a deeper level. However, if you’re working on a large codebase that you can’t change, this can help out a lot.

Removing entities in the Source SDK

I haven’t written for a while, mainly because I’ve been busy with classes and studying for the GRE for my grad school applications, but here’s a quick tip for those of you meddling around with the Source SDK.

How to spawn entities is well documented, but I couldn’t find a good place explaining how to remove spawned entities through code.

Don’t try meddling with the global entity list (gEntList) or calling its RemoveEntity() method. It doesn’t do what you want.

Instead, use one of the super-handy UTIL_* functions. Given a pointer to the entity you want to remove, simply use:

UTIL_Remove(pEntity);

Poof. Entity gone. Remember, entities are created and destroyed on the server side only. The server will automatically broadcast any changes to the entity list to its connected clients for you.