Archive | Valgrind RSS for this section

At The Gates Of Valhalla

In my post the other day, I talked about using Valgrind and GProf to debug DMP Photo Booth. As tends to be the way of most articles on the Internet, I didn’t actually spend any time talking about how to use said programs. You are, however, on notice that they are Good.

Well, I promised I’d tell you how to use them. I may be many things, but a liar I am not. Today, I’ll start with Valgrind.

Valgrind

Installation

First things first, we need to install. Once again, Ubuntu comes through for us. It’s just a simple matter of …

sudo apt-get install valgrind

… and we’re off to the races. Unfortunately for those of you in the audience running Windows, Valgrind is not available for Windows.

Valgrind Vs. GLib

You may remember in my last post a lot of talk about difficulties with GTK. Per the Gnome Wiki, the recommended way to launch Valgrind with a GLib/GTK application is:

G_DEBUG=resident-modules valgrind \ --tool=memcheck \ --leak-check=full \ --leak-resolution=high \ --num-callers=20 \ --log-file=vgdump \ [your-program]

Let’s talk about these options.

  • G_DEBUG=resident-modules
    • This is not actually an argument to Valgrind, but an environment variable. This is used by GLib. You need to use this if your application makes use of GModule. For instance, if you’re implementing a module-based Photo Booth application. If you don’t use this, your modules may get unloaded prematurely. If this doesn’t apply to your application, feel free to omit this.
  • valgrind
    • Calls valgrind
  • –tool=memcheck
    • Valgrind is actually a suite of tools. Memcheck is the tool that checks for memory leaks
  • –leak-check=full
    • Enables searching for leaks on exit
  • –leak-resolution=high
    • Sets the detail level of leak stack traces
  • –num-callers=20
    • This sets the depth of the stack traces. For instance, setting this to 20 will show: foo() called by bar() … and so on up to 20 times
  • –log-file=vgdump
    • The file to output to. In its current configuration, a file named vgdump will be placed in the current directory
  • [your-program]
    • Finally, your program. Place in whatever command you’d call to start your program.

…so you do all that and execute your program. Valgrind will monitor your program while you interact with it. Go ahead and run some test cases. It’ll be a little slow due to the monitoring, but that’s ok. Finally, when you’re done, exit your program and open the log file. The first thing you should notice is that it is long. Really long.

This is due to GTK. GTK doesn’t clean up after itself on exit, instead relying on the OS to clean up on process termination. While this behavior is considered to be fine by most, it makes this step difficult. On the Gnome Wiki, you’ll find a Suppression file that is used to mitigate some of this. My experience with this is that it doesn’t do much.

The best way I’ve found is to just search for “definitely lost”. These are most likely to be caused by your program. You can go line by line and check each possibly lost section, but I’ve found that this isn’t practical as 99/100 of these originate from gtk_main().

DMP Photo Booth: Underwater

You’ve heard it before: “Premature optimization is the root of all Evil.” Capital Evil. So you go on about your day, arranging the ones and zeros in pretty christmas tree shapes and suddenly the day arrives: your program is slow as molasses. What are you going to do now?

Last monday was that day for me, and I’ve been underwater ever since. “Why is this happening to me?!” I thought. While not prematurely optimizing, I thought I did things right. I have no nested for loops. I’m not using an array when I need a list. Threads aren’t modifying the UI willy-nilly. Why has God forsaken me?

The Symptoms

I first noticed it while working on the printer module. After the program is open for some length of time, my whole computer begins to lag. Not just a little bit either; things completely fall apart. In the space of about 5 minutes, the computer becomes unusably slow. Killing the Photo Booth process doesn’t help; only physically shutting the computer off helps. Of course, the computer is so slow that I can’t use the shutdown option; I have to press The Button.

At this point, I feel some context is in order. I had been trying to figure out how to make my printer print on photo paper. Apparently printing is one of the areas Linux still hasn’t caught up to windows on, so this was proving to be difficult. After printing a few strips, I realized that my low-res photo strips weren’t going to cut it, so I bumped the resolution from 100 pixels wide to 1000. It was then that I noticed things were off.

Ten years of troubleshooting experience kicked in: “what changed?” I thought. The obvious answer was the image size. Clearly my photo strip assembly algorithm was operating at O(n^n^n) or something. What can be done?

Doing It Wrong

I took a look at my assemble strips function. After poking around for a while, I zeroed in on something that had been bugging me for a while. I had been using a function MagickResetImagePage combined with MagickCoalesceImages to composite images over each other. I had decided to use these functions before I knew this operation was called “compositing”, and I had found them in a tutorial on making animated .gif files in MagickWand. At the time, I was never really happy with this implementation, so I went back to the API docs to see if there was a function with “composite” in its name. There was.

MagickCompositeImage is a lot more intuitive to use than MagickResetImagePage. It doesn’t have that Magickal formatting string that MagickResetImagePage uses, it just takes coordinates. Perhaps this was the solution to my problem. I refactored, and recompiled.

Still broke.

Measure, Don’t Guess

That old gem: I’m sure you’ve heard it too. I decided that maybe this was my best course of action. I decided it was time to learn how to use this Valgrind thing all the Cool Kids are talking about these days. For those of you not in the know, Valgrind is a utility that will tell you various things about your program. The most important/most well-known thing that it can do for you is identify memory leaks. Thinking that prehaps I have a memory leak, I installed Valgrind and got to work.

It turns out that GTK has more than a few memory leaks. Allegedly this is due to the fact that it doesn’t cleanup on exit, relying on the OS to free the memory on program termination. While the general consensus is that this is fine, it doesn’t help us. The folks at Gnome are aware of this, and there is even a Wiki page on ways to mitigate this. The cliff’s notes version of that page being: “Just search for ‘definitely lost'”.

Armed with this piece of wisdom, I set off. I ran the Photo Booth in Valgrind, and examined the results. Valgrind actually turned up some memory leaks, which I corrected. Maybe now we’re set!

Nope.

Breaking Out The Profiler

This is what they usually want you to do when they tell you to Measure. Unfortunately for me, NetBeans’ built-in profiler is only for Java. After some google searching, I found gprof. Gprof is a pretty bare-bones profiler. It does what it says and not much else, which is fine. I hooked my program into the profiler and got to work. The results? Nothing. My two GTK idle functions ran some 7 million times, returning basically immediately each time as expected. Every other function performed as expected.

What now?

Trying The Process Monitor

Having run through Valgrind and GProf, coming out empty-handed, I was at a loss. I got into development because I wanted to fix my own broken code instead of mitigate somebody else’s, and fix it I will. Luckily I have 10 years of sysadmin experience to fall back on. I dusted off my process monitor and got to work.

I fired up DMP Photo Booth, and watched it in the process monitor. I pushed the button. I pushed it again. And again. memory use rose and fell predictably as the strip was assembled, but CPU usage stayed relatively low. Then boom!

I tried again, this time doing literally nothing. Still my computer sputtered and died. I killed the process, but again it was too late.

But wait, isn’t the OS supposed to clean up after me when my process ends? Something fishy is going on.

Have I Mentioned That Threads Are Hard?

Having eliminated all other possibilities, I was forced to consider that I was having a threading issue. “But I was so careful!” I thought. Shortly thereafter I noticed it: I was getting random pthread mutex errors on my console. Clearly I had a threading issue on my hand. Was I spawning extra threads? Was something not releasing its lock? Was I being victimized by gremlins? I set a break point on line one of main() and fired up my debugger. It was time to see just what was being done when nothing was being done.

So, I stepped through my program. Whenever I got to a g_thread_new call, I made sure the thread function was solid. Finally, I got to my g_idle_add calls. I had two of them, one to monitor the status indicators, and one to retrieve photo strip thumbnails. Both of these functions pop from a result from a GAsyncQueue. These Queues are fed by worker threads. I thought back to my profiler output and remembered how often these are called. Looking a few lines down I saw a call to g_timeout_add_seconds. This function is basically adds an idle function, but is only called at most X seconds. Maybe replacing the g_idle_add calls with g_timeout_add_seconds was my answer. I refactored and reran.

Nope.

Well, crud. “Are these functions even my problem?” I thought. I commented them out, recompiled and reran.

Fixed.

“So, what’s the difference?” I wondered. All three of these functions rely on the same basic behavior: pop from a GAsyncQueue some result placed there by a worker thread. I looked at the three threads: the thread that was working properly calls g_async_queue_ref/unref, and the two that don’t work do not take a reference, instead accessing the static global variable in their module. I refactored all thread functions that access a GAsyncQueue to take a reference and work on their local copy only. I recompiled, reran, and went to bed. 46,100 seconds later, everything was humming along just fine.

Wait, So I Just Had To Increment A Reference Count?

It certainly seemed odd. That’s like your car not starting if the headlights are out. Sure, they’re important, but the car should still start right?

Looking through the source of glib didn’t help. So far as I can tell, all that does is increment the reference count, and return a pointer. I turned to the documentation, which says “… Whenever another thread is creating a new reference of (that is, pointer to) the queue, it has to increase the reference count (using g_async_queue_ref()). Also, before removing this reference, the reference count has to be decreased (using g_async_queue_unref()). …” While not definitive, this certainly seems to indicate that taking a reference is important.

Frankly, I’m not happy about this answer. This is just the sort of magic solution that I hate; it’s fixed, but I’m not sure why. For the time being, I won’t dwell on it. Moving forward, I’ll be sure that my threads take a reference of a GAsyncQueue before calling methods on it. At some point when all of this is said and done, perhaps I’ll investigate this mysterious reference count.

I have taken away from this a new appreciation of just how brittle threads are. Sure, they are powerful, but shooting yourself in the foot with a 50 cal hurts a lot more than with a 9 mm. I’ll have to be more careful.

It was also a good introduction to GProf and Valgrind. Expect blog posts on the usage of each of these tools soon!