A Bug’s Life : The One Sample Solution

May 23, 2014 Pek

At one point in my life I found myself working for a company that made synthesisers of the musical instrument kind. The guts of the synths consisted of a bunch of DSPs that did all the audio processing, some memory, and a general purpose CPU that handled the rest: scanning the keyboard, scanning the knobs and buttons on the control panel and providing MIDI and USB connectivity.

One of my jobs was to clean out a list of bugs that users had reported on a released product. Most of these bugs were minor and sometimes obscure but the company prided itself on high quality so they would strive to fix all reported bugs, no matter how minor. Unless the bug report was that “the product should cost 50% less, be all-analogue, and be able to transport me to my place of work”. Some people have strong feelings of entitlement.

One of the bugs on my list was that the delay glitched if you turned it off and on again quickly. A delay is a type of audio effect that simulates an echo: if you send a sound into it then the sound will be repeated some number of times while fading out. Early delays were implemented using magnetic tape. If the recording head was placed some distance ahead of the playback head then a sound that was recorded into the delay unit would play back as an echo a short while later as the tape passed by the playback head. In this particular case the delay was digital and consisted of a circular buffer of audio samples that the DSP would continuously feed data into (mixed together with feedback from audio that was already in the buffer). The buffer would then be played back by being mixed with the main audio output signal.
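To make the circular buffer idea concrete, here is a minimal sketch in Python. It is not the synth's actual DSP code (that was hand-written assembly); the class name, the buffer length of 4 and the feedback amount of 0.5 are all made up for illustration.

```python
class Delay:
    """Feedback delay built on a circular buffer of samples (illustrative only)."""

    def __init__(self, length, feedback=0.5):
        self.buffer = [0.0] * length   # one slot per sample of delay time
        self.pos = 0                   # shared read/write position
        self.feedback = feedback       # how much of each echo is fed back in

    def process(self, sample):
        """Take one dry sample, return it mixed with the delayed signal."""
        delayed = self.buffer[self.pos]                 # written `length` samples ago
        self.buffer[self.pos] = sample + self.feedback * delayed
        self.pos = (self.pos + 1) % len(self.buffer)    # wrap around at the end
        return sample + delayed


delay = Delay(length=4, feedback=0.5)
# A single impulse followed by silence: an echo appears every 4 samples,
# and each echo is half the level of the echo before it.
output = [delay.process(x) for x in [1.0] + [0.0] * 11]
print(output)
```

The impulse comes back after four samples and then again after eight at half the level, which is the fading repeat described above.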

The delay had an on/off button on the front panel of the synth and when you switched it off the DSP would set the input volume to the delay buffer to 0 so that it would fill up with silent samples. However, since this happened at the normal audio rate it would take a few seconds before the whole buffer was filled with zeros. A user had discovered that if you played something with the delay enabled, then switched it off and then quickly switched it on again then parts of the delay buffer would still contain sound that you would hear, sometimes with nasty transients. The solution was to program the DSP to quickly zero the delay buffer using a DMA transfer whenever the delay was turned off.
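The glitch and the fix can be sketched in a few lines of Python. This is a toy model under some assumptions: the buffer is only 8 samples long, switching the delay off is modelled as an input gain of 0 applied to whatever is written, and a slice assignment stands in for the DMA transfer.

```python
BUFFER_LEN = 8   # toy size; the real buffer held seconds of audio


def write(buffer, pos, samples, input_gain):
    """Write samples into the circular buffer at audio rate, one slot per sample."""
    for sample in samples:
        buffer[pos] = input_gain * sample
        pos = (pos + 1) % len(buffer)
    return pos


# Delay on: the buffer fills up with audio.
buffer = [0.0] * BUFFER_LEN
pos = write(buffer, 0, [0.9] * BUFFER_LEN, input_gain=1.0)

# Delay off for only 3 samples: just 3 slots have been silenced so far.
pos = write(buffer, pos, [0.9] * 3, input_gain=0.0)
leftover = sum(1 for s in buffer if s != 0.0)   # slots still holding old audio

# The fix: clear the whole buffer in one go when the delay is switched off.
buffer[:] = [0.0] * BUFFER_LEN
leftover_after_clear = sum(1 for s in buffer if s != 0.0)

print(leftover, leftover_after_clear)
```

Switching the delay back on after those three samples would replay the five slots of old audio, which is what the user heard.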

This may sound trivial but the code running on the DSP was hand-coded assembly optimised to within an inch of its life. The DSP had one job: to present a pair of completely processed 24-bit samples — a stereo signal — to the Digital-Analogue converter inputs at the sample rate, which was 44100 times per second. If a sample wasn’t ready in time then digital noise would result at the output. This was frowned upon because it would sound horrible and if it happened while the instrument was being played on stage, through a serious sound system, you might as well poke out the eardrums of your audience using an ice pick. This made the rules of the game for the DSP pretty simple: if it could execute one instruction per cycle and was running at F cycles per second then that meant that it could spend N=F/(2*44100) instructions per sample. In fact it had to spend exactly N instructions or the timing would drift off. Any unused cycles had to be filled in with “No-op” instructions that do nothing but waste time. In this case N was a couple of hundred cycles. This meant that the DSP code was a couple of hundred instructions long, which was good because there was not so much of it, but bad because there were very few unused cycles left in which to set up the DMA.
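As a worked example of that budget, here is the arithmetic in Python. The clock frequency is hypothetical (the post does not name the DSP or its clock); 22,579,200 Hz is chosen only because it is a whole multiple of 44,100 Hz and gives a budget of "a couple of hundred" instructions. The figure of 249 instructions already in use is likewise invented.

```python
SAMPLE_RATE = 44_100      # stereo pairs per second (from the text)
CHANNELS = 2              # left and right
CLOCK_HZ = 22_579_200     # hypothetical DSP clock: 512 * 44,100 Hz


def instruction_budget(clock_hz, sample_rate=SAMPLE_RATE, channels=CHANNELS):
    """N = F / (2 * 44100): instructions per sample at one instruction per cycle."""
    budget, remainder = divmod(clock_hz, channels * sample_rate)
    if remainder:
        raise ValueError("clock is not a whole multiple of the sample rate")
    return budget


def nops_needed(budget, used):
    """No-op instructions required to pad the code out to exactly `budget`."""
    if used > budget:
        raise ValueError("code is too long: the sample would be late")
    return budget - used


budget = instruction_budget(CLOCK_HZ)
padding = nops_needed(budget, used=249)   # invented figure for the existing code
print(budget, padding)
```

With these made-up numbers there are 7 spare instructions per sample, which gives a feel for how little room there was for new DMA setup code.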

This type of DSP is built to do one thing: to perform arithmetic on 24-bit fixed point numbers. It is a multiply-add engine. Multiplying and offsetting 24-bit fixed point numbers is easy and everything else is a pain in the upholstery. Instructions are often very limited as to which registers they can operate on and the data you want to operate on is therefore, as per Murphy’s law, always in the wrong type of register.

After scrounging up some spare cycles I managed to set up a DMA that zeroed the delay buffer whenever it was turned off. Apparently. I tested it: no problem. I told my boss and he came in and listened. Now, my boss had worked with these synths for years and years and he immediately heard what I didn’t: a diminutive “click” sound that was so weak that I couldn’t hear it at all. “Probably you didn’t turn off the input to the delay buffer so a couple of samples get in there while the DMA is running.” I verified it but no, the input was properly turned off.

Everyone needs Wilson’s Common Sense Ear Drums.

Now that I knew what to listen for I could hear the click if I put on headphones and turned the volume up. Headphones are always scary when programming audio applications because if you screwed up the code somewhere you might very well suddenly get random data at the DACs which means very loud digital noise in the headphones which means broken eardrums and soiled pants in the programmer. In contrast, a single sample of random data at 44.1kHz sounds a little bit like a hair being displaced close to your ear. In the beginning I had to stop breathing and sit absolutely still to hear the click noise. Moving the head just a little would make mechanical noise as the headphones resettled, noise that would drown out the click. After a while though, my brain became better at picking out the clicks, and after a day or so I could hear them through loudspeakers. Unfortunately I soon started to hear single-sample clicks everywhere…

“Wait, wait! I think I can hear it!”

The good thing about this bug was that it was fairly repeatable. I like to think that no repeatable bug is fundamentally hard. Unless the system is very opaque you just have to keep digging and eventually you’ll get there. It may take a long time, but you’ll get there.

So what was going on? Was the delay buffer size off by one? No. Was the playback pointer getting clobbered? No. Did the DMA do anything at all? Yes, by filling the buffer with a non-zero value and then running the DMA I could see that the buffer was zeroed afterwards. Was it some sort of timing glitch that caused a bad value to appear at the DACs? Not that I could tell. Blaming the compiler is always an option but in this case the code was hand-written assembly so that alternative turned out to be less satisfying. A compiler can’t defend itself but your co-worker can…

The debugging environment was pretty basic. The only tool available was a home-grown debugger that could display something like 8 words of memory together with the register contents and not much else. Looking through a megabyte or so of delay buffer data through an 8 word window might sound like fun but I guess I’m easily bored…

One thing that became apparent after a while was that the click sound would appear even when the delay buffer was empty to begin with. This indicated that the bad data was put there instead of being left behind. At this point I started to throw things out of the code. I find this to be a pretty good way forward if a bug checks your early advances. Remove stuff until the bug goes away, then you can start to add stuff back again. After a while I had just the delay code left. I wrote a routine that filled the delay buffer with a known value, ran the DMA, and then stepped through the whole buffer, verifying the contents word-by-word. And bingo, it found a bad sample.
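The verification routine was DSP assembly, but the idea fits in a few lines of Python. Everything named here is an assumption for illustration: the fill pattern, the buffer size, and above all `faulty_clear`, which simply fakes a DMA that corrupts one word so that the checker has something to find.

```python
KNOWN_VALUE = 0x5A5A5A   # arbitrary 24-bit fill pattern (assumed)


def faulty_clear(buffer):
    """Stand-in for the misbehaving DMA: zeroes the buffer but corrupts one word."""
    for i in range(len(buffer)):
        buffer[i] = 0
    buffer[len(buffer) // 2] = 0x123456   # made-up garbage value


def find_bad_words(buffer, expected=0):
    """Step through the buffer word by word and collect every mismatch."""
    return [(i, w) for i, w in enumerate(buffer) if w != expected]


buffer = [KNOWN_VALUE] * 1024   # fill with a known value first
faulty_clear(buffer)            # then run the clear that is under suspicion
bad = find_bad_words(buffer)

for index, word in bad:
    kind = "left behind" if word == KNOWN_VALUE else "put there by something"
    print(f"word {index}: {word:#08x} ({kind})")
```

Filling with a known value first is what separates the two failure modes: a leftover word would still hold the fill pattern, while a word holding anything else must have been written after the fill.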

Seeing it made it real. As long as I only heard it, the defect could theoretically have been introduced somewhere later in the processing or in the output chain. But now I could see the offending sample through my tiny window, sitting there in the middle of a sea of zeros. What is more, the bad sample did not contain the known value that the delay buffer was initialised with. At this point suspicions were raised against the DMA itself. A quick look through the DSP manual revealed an errata list on the DMA block several pages long. This DSP had a horrendously complex DMA engine that could do all sorts of things. For example, it could copy a buffer to a destination using only odd or even destination addresses; in other words it could copy a flat buffer into a single channel of a stereo pair. It seemed like half of these modes didn’t really work.

None of the issues listed in the errata list fit what I was seeing but I still eyed the DMA engine suspiciously. I therefore tried to zero the delay buffer using a different DMA mode and it worked! Ah, hardware: can’t love it, can’t kill it. Hardware designers on the other hand…

So the bug was put to rest and I moved on to the next issue on my list. Did we learn something? When you can’t blame the compiler, blaming hardware remains an option. In the end I came away pretty impressed by the dedication to quality and stability that this small company displayed. The original bug report was on a glitch that many companies wouldn’t bother to correct at all. The initial fix took maybe half a day to implement and took the amount of unwanted data in the delay from 1-2 seconds (between 90000 and 180000 samples) down to just a single, barely audible, sample in what was already a rare corner case. Fixing it completely took about a week. In other words, it took four hours to fix roughly 99.999% of the bug and 36 hours to fix the rest of it. But the message was pretty clear : fix it and fix it right.

And to all the semiconductor companies out there: don’t let your intern design the DMA engine.

Programming

The Philosophy of Bug-fixing

May 7, 2014 Pek

I spend way too much time fixing bugs. I don’t say that because my code is excessively buggy (although that is also true), I say that because quite a few of the bugs I fix are probably not worth fixing. Not really. Let’s imagine…

Your customer has submitted, along with vague threats of ninja assassins being dispatched to your company headquarters, a bug report about a crash in your software. He cannot really supply any useful information on what led up to the crash but the error message tells you exactly where in the code the crash occurred. Firing up your favourite editor you take a look. The crash happens because an unallocated resource is freed.

A bug :

mystificate(wizard)
 ...

 fire(wizard)

The crash is easily prevented, just check if the resource is allocated before freeing it. The thing is, that resource should be allocated at this point in the code. You know that it usually is or you would have seen this crash before. The crash, while a bug in itself, is clearly also a symptom of another bug. What do you do and why?

A bug fix :

mystificate(wizard)
 ...

 if (hired(wizard))
    fire(wizard)

Any diligent programmer would at least spend a little time considering the implications of this discovery. If you can easily figure out how the resource came to be unallocated then perhaps you can fix that too. And you should. You may also realise that the unallocated resource is a fatal error which absolutely, positively, must be fixed or the plague breaks out. Put on your mining helmet and start digging. Perhaps your product is a pacemaker. Fix the bug.

Most of the time though, the situation is not as clear cut. You just don’t know what the implications are. The bug could be a corner case with no other consequences than the crash that you just fixed. Or it could be a symptom of some intricate failure mode that may cause other bad things to happen randomly. I would guess that most competent programmers would feel decidedly uncomfortable papering over the problem. Some may even find it hard to sleep at night, knowing that the bug is out there, plotting against you. They lie awake worrying about when and how the bug will crop up next. So you start to investigate.

You rule out the most likely causes. You try to repeat the problem. You try to make it happen more often. If you’re an embedded programmer you verify five times that RX and TX pins haven’t been swapped somewhere. You instrument the code. You meditate over reams of trace printouts. You start to suspect that this is a compiler bug, but that is almost never really the case. You read the code (desperation has taken hold). You poke around in a debugger. You remove more and more code, trying to reduce the problem. You start to become certain that this is a compiler bug (it never is). You dream about race conditions. You stop showering. You start to have conversations with an interrupt handler like it was a person. You dig into the assembly. You now know that it is a compiler bug (it isn’t). You mutter angrily to yourself on the subway. A colleague rewrites a completely unrelated part of the code and the symptom inexplicably goes away. Your cat shuns you. You start to question everything; how certain are you that a byte is 8 bits on this machine?

After aeons of hell a co-worker drops by. She looks over your shoulder and asks “isn’t that resource freed right there?” And sure enough — in the code that you’ve been staring at for weeks, a mere 4 lines above the bad free, the resource is freed the first time. To say that the error had been staring you in the face the whole time would be understating it. You kill yourself. The cycle starts again.

The terrible truth:

mystificate(wizard)
   fire(wizard)

   hire(izzard)
       if (flag)
           occupy(india)

   if (hired(wizard))
       fire(wizard)

Or maybe the bug turned out to be something really gnarly that could explain dozens of different bug reports.

Should you have fixed that bug? The thing about bugs is that they are unknown. Until you know what is going on you can’t tell whether the bug will cause catastrophic errors or is relatively harmless, right? The bug is what Donald Rumsfeld would call a “known unknown,” something that you know that you don’t know. Not tracking down the bug means living with uncertainty. Fixing bugs increases the quality of the software so fixing any bug (correctly) makes the software better. You can easily convince yourself that fixing the bug was the right thing to do, even if it turned out to be relatively harmless. But you’re probably wrong.

The “known unknown” argument cuts both ways. Even if the bug turns out to be serious you didn’t know that beforehand. Putting in serious effort before you know what the payoff would be may not be the wisest way to allocate your time. And the question isn’t really “did the fix make the software better” but “was the time spent on the fix the best way to make the software better”.

Living with uncertainty can be difficult but let’s face it, your code probably has hundreds or thousands of unknown bugs. Does it actually make a difference that you know that this one is out there?

Am I saying that it was wrong to start looking for the bug? Definitely not. If you suspect that something is amiss, you should absolutely strive to understand what is going on. The difficulty lies in knowing when to stop. There’s always a sense that you are this close to finding the problem. If you would just spend 10 more minutes you would figure it out, or stumble upon some clue that would really help. Now you’ve spent half a day on the bug, might as well spend the whole day since you’ve just managed to fit the problem inside your head. You’ve spent a day on it, might as well put in another hour just to see where this latest lead goes. You’ve spent three days and if you don’t fix it now you’ve just wasted three days. You’ve spent two weeks and dropping the whole thing without resolution at this point would cause irreparable psychological damage. The more you dig the deeper the hole becomes.

So what do you do to prevent this downward spiral? Talk to a co-worker. I know it might seem uncomfortable and unsanitary and possibly even unpossible, but it can be an invaluable strategy even if your colleagues are all muppets. Take ten minutes a day to talk about where you are and how it is going. Just describing the problem often makes you realise what the solution is. “So the flag is unset when frobnitz() is called and… and… never mind, I am an idiot” is a common conversation pattern (although “never mind, you are an idiot” is idiomatic in some places). Sometimes this even works with inanimate objects like managers. This is sometimes called “Rubber duck debugging”: carefully explain your problem to your rubber duck and the fix will become apparent. Don’t own a rubber duck? What is wrong with you??

Even if describing the problem doesn’t help, a sentient co-worker has a tendency to ask if you have checked 10 different things that are so obvious that how can you even ask, and do you think I’m completely incompetent and, by the way, no, I meant to get around to it. “Did I recompile the code after changing it? No, do you have to?” If your colleague is very polite you may have to insult him a bit before he asks the obvious questions. Obvious questions can be hard on the ego but very useful because the odds are in favour of the obvious things. For some reason the human mind seems to prefer the mystical and complex (“it’s a compiler bug”) to the simple and likely (“I forgot to recompile”).

The final reason it is good to talk to a co-worker is that you get some distance from the problem. Just discussing loosely where you are and what you know and how much time you’ve spent can really help when you start to suspect that you should strap your bug-fixing project to an ejection seat.

Conclusions:

Don’t get lost fixing a hard bug, keep an eye on the big picture. This is a lot easier to write than to do.
Learn to live with uncertainty. If this is impossible: embrace misery.
Co-workers can serve a useful purpose other than being on the opposite side of indentation wars.

Eurorack, Synths

Living Without Oscillators

May 9, 2014 Pek

My oscillators have still not arrived but I managed to get the USBStreamer B to work properly by reflashing its ADAT firmware. This required a rather sordid excursion into Windows 7, which I will spare you, but the upshot is that the ES-3 now works as intended. I’ve written a guide on how to integrate a modular synth with Reason using the ES-3 here.

Happily this means that I can trigger the envelopes from Reason devices as well as using the modular filters on sounds from Reason. First impressions: the Z2040 sounds pretty nice, especially when you turn up the gain knob so it starts to overdrive.

Remember how I wrote that “I won’t be digging into any of the esoteric math modules…”? It turns out that I was wrong and that I will be digging into, specifically, an esoteric math module. This is what happens when you order a spanking new toy and a vital part (the oscillators) takes a month longer to arrive than the rest of it. You can rationalise pretty much anything. I’m getting a Maths module from Make Noise. It’s a flexible little thing that can work as a multiple, LFO, sub-oscillator and envelope among other things. It also seems to be just the right kind of puzzling.

The Maths “Analogue Computer” module from Make Noise.

On the oscillator side I finally got bored waiting for the Z3000s (they would be 2-3 weeks more). I found a Z3000 in stock at Escape From Noise and decided to let my sudden craving for a crazier oscillator set the agenda. Schneidersladen had a Make Noise DPO in stock so I switched my order there from the two Z3000 oscillators to the DPO, which is a dual oscillator module, and ordered one Z3000 from Escape From Noise. The upshot is that my modular lost a Z3000 but gained a DPO. +1 oscillator! The Z3000 should arrive any day while the DPO and the missing Doepfer modules will hopefully arrive next week. Luckily for me I’m running out of rack space so there’s a natural barrier to further indulgence.

I’ll write some more about the Maths and DPO modules as I get them. You may notice that both of them are present in my parodical West Coast example system. I’m quickly seeing how addictive this hobby really is. How the hell can my brain confuse synth modules with nuts and berries? Collecting Eurorack modules won’t help me get through the winter without starving! The collecting instinct is truly a fucked up piece of neurological wiring…

The Dual Prismatic Oscillator from Make Noise.

Eurorack, Synths

Getting Maths

May 12, 2014 Pek

If you go on a Eurorack forum and ask people to name one essential module that everyone needs in their rack I can almost guarantee that the most common suggestion would be for Maths from Make Noise. I think it is popular for three main reasons: it is very flexible, it is compositional (meaning that it consists of a bunch of basic functions that can be combined to create more advanced things) and it is a bit inscrutable — very much like modular synthesisers in general!

Make Noise describes the Maths variously as a signal generator and as an analogue computer which can be impressive and unenlightening in equal measure.

So what the heck is it? Think of it as the Swiss Army Knife of your rack; it does a little bit of everything. It can generate envelopes. It can repeat those envelopes, creating an oscillator. It can scale, invert and combine voltages in various ways. And it makes you feel clever.

Maths consists of four separate channels that can affect an input signal in various ways. The input channel jacks are “normalled” which means that they have an internal connection that is broken when a plug is inserted. This is used to provide one function when the input is connected and another function when the input is unconnected.

When the inputs are unconnected channels 1 and 4 can generate a simple Attack-Decay envelope with adjustable attack and decay times. The envelope can be triggered on a CV pulse (like a normal envelope) or cycle continuously while the Cycle button is engaged or when the CV Cycle input is high. The length of the envelope varies from the glacial (about 2 cycles per hour) to the quite fast (1kHz) so it can function both as an LFO and as a sub-oscillator.

When the Signal input of channel 1 or 4 is connected then that channel functions as an envelope follower.

Channel 1 has a curious but useful feature where the End Of Rise (EOR) output generates a pulse signal when the channel 1 envelope has reached its maximum value. This can be used as a clock signal or a pulse wave with the length of the envelope determining the frequency. Channel 4 has a similar feature but it generates the pulse at the End Of Cycle (EOC) output when the channel 4 envelope has reached the end of one whole cycle.

Channels 2 and 3 are less sophisticated. Channel 2 generates a constant +10V signal and channel 3 generates a constant +5V signal. All channels pass through knob-controlled “attenuverters” which are circuits that can amplify, attenuate or invert a signal. This can be used to adjust the output voltage from any channel so, for example, if you need a +2V offset signal you could pick channel 2 or 3 and adjust the channel attenuverter until the channel outputs the desired offset. If a signal is sent to a channel input then the attenuverter for that channel affects the incoming signal instead (as expected).

All channels are by default routed to three logic function buses called OR, SUM and INV. This may sound complicated but isn’t really.

OR means that the highest voltage is present at the output. For example, if the channel 1 input is 0V, channel 2 is 1V, channel 3 is 0V and channel 4 is 3V then the OR output would be 3V. If the channel 2 input rises to 4V and the others stay the same then the OR output will change to 4V.

SUM is the sum of the input voltages. In the example above the SUM output would be 4V (1V + 3V) initially and then 7V (4V + 3V) after the channel 2 input has risen. Since the attenuverters can invert signals you can also use this with the SUM bus to subtract one signal from another.

INV is the inverted SUM, so it would start at -4V and end up at -7V.
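The three buses are easy to model in Python using the numbers from the example above. This is only a sketch of the behaviour as described in this post, not of the actual circuit: the function names are made up, the attenuverter is reduced to a plain gain factor where a negative value inverts, and each bus is treated as ideal.

```python
def attenuvert(voltage, gain):
    """Scale a voltage by a gain factor; a negative gain inverts the signal."""
    return voltage * gain


def buses(channels):
    """Return the OR, SUM and INV outputs for a list of channel voltages."""
    total = sum(channels)
    return {"OR": max(channels), "SUM": total, "INV": -total}


before = buses([0.0, 1.0, 0.0, 3.0])   # channel 2 at 1V, channel 4 at 3V
after = buses([0.0, 4.0, 0.0, 3.0])    # channel 2 has risen to 4V
print(before)
print(after)

# Subtraction on the SUM bus: invert channel 2 with its attenuverter first,
# so the bus adds 3V and -1V.
difference = buses([0.0, attenuvert(1.0, -1.0), 0.0, 3.0])["SUM"]
print(difference)
```

The last line is the "subtract one signal from another" trick: channel 4 minus channel 2 shows up on the SUM output as 2V.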

So far a bit tricky, but not mind-meltingly hard. The fun stuff begins when you start to combine the different channels, feeding them back into each other. If you want to understand what goes on with that stuff I highly recommend getting an oscilloscope. The Maths manual has lots of examples of things you can do with the module: ADSR envelopes, signal peak detection, signal rectification and many others.

I bought a Maths primarily because it can generate envelopes and audio signals and perform signal amplification and inversion. It packs a lot of functionality into a reasonably sized and priced module. And yes, it makes me feel like a rocket scientist when I have to plan my envelope using pen and paper instead of just clicking somewhere.

Make Noise modules have a very peculiar look that some people love, some people hate and some people feel ambivalent about. Sort of like absolutely anything then. I’m in the ambivalent camp. In my eyes some modules, like the Dual Prismatic Oscillator, look great and some modules look confusing or downright ugly. I didn’t really like the look of the Maths module but fortunately a company called Grayscale offers alternate panels for many of the Make Noise modules so I ordered a Maths panel from them. I must admit that the unadulterated Maths module looked a lot better in real life than it does in photos, so I could probably have lived with the original panel, but the Grayscale alternate is a lot neater and easier to understand.