Category Archives: Programming

Making Caps-Lock useful : Untangling the Mystery of Key-remapping in OS X and Linux

Executive summary: How to get the Caps-Lock key to act as the mod key in Xmonad when Xmonad is running in VirtualBox on an OS X host. Or, if the previous sentence is gibberish to you, how to remap keyboard keys on Macs and in Linux.

I did this once on my MacBook and then it stopped working when I upgraded the OS, by which point I had completely forgotten what I had done. So now I’m redoing it and documenting it here for posterity.

I’ll start with the short turbo-version for the impatient; a more detailed explanation follows after.

The short turbo-version for the impatient

  • On the Mac, go into System Preferences -> Keyboard -> Modifier Keys and set Caps-Lock to “No Action”.
  • Get the OS X utility “Seil”. Use it to change the Caps-Lock key so it generates the PC_APPLICATION keycode (which has the value 110).
  • Go into Linux in your VirtualBox VM. In my case the PC_APPLICATION keycode was translated into keycode 135 in X. Your Linux might translate it into something else. You can run the “xev” program in a terminal in Linux and then press Caps-Lock to see which value it generates. Look for “keycode” in the output when you press Caps-Lock. If it generates something different from 135, then replace 135 with your keycode in the steps below.
  • Take the keycode you figured out in the previous step and check which keysym it generates by running
# xmodmap -pke | grep 135
keycode 135 = Menu NoSymbol Menu

In my case the keycode maps to the keysym “Menu”.

  • Add the keysym from the previous step to modifier mod3 by running the following in a Linux terminal:
xmodmap -e "add mod3 = Menu"
  • Turn off key repeat for keycode 135 by running
xset -r 135
  • Test that everything works by running “xev” in a Linux terminal, mousing over the created window and pressing the Caps-Lock key. You should see something like “KeyPress event … keycode 135 (keysym 0xffeb, Menu)” when pressing the key, no repeats while the key is being held down, and a corresponding “KeyRelease event…” when the key is released.
  • Get xmonad to use the mod3 modifier as the “mod” key by adding the following to your config in .xmonad/xmonad.hs: "modMask = mod3Mask,"

If you wanted to do exactly this, and it worked, and you don’t have an unhealthy compulsion to understand the workings of the world, you can stop reading here.

What the hell did you just do?

Key remapping gets confusing because of the indirection and terminology involved.

When you press a key on a keyboard, the keyboard will generate a scan code. The scan code is one or more bytes that identify the physical key on the keyboard. The scan code also says if the key was pressed (called “make” for some obscure reason) or released (called “break”, probably for the same obscure reason). So far things are fairly simple. Different types of keyboards have different sets of scan codes. For example, a USB keyboard generates this set of scan codes.

The scan code is converted by the operating system into a keycode. The keycode is one or more bytes, like the scan code. The difference between the scan code and the keycode is that the meaning of the scan code depends on the type of keyboard but the meaning of the keycode is the same for all keyboards.

A keycode also has a mapping to a key symbol or keysym. The keysym is a name that describes what the key is supposed to do. The keysym is used by applications to identify particular key presses, or by confused programmers who want to use Caps-Lock for something other than what God/IBM intended.

For example, if I press the ‘A’ key on my MacBook keyboard it will generate the scan code 0x04. I’m actually just guessing because 0x04 is the USB keyboard scan code for ‘A’ and I don’t know for sure that the internal keyboard in the MacBook conforms to the USB keyboard standard, but let’s assume that it does. The scan code 0x04 is translated by the operating system into the keycode 0x00, which is translated into the keysym kVK_ANSI_A.

If I somehow managed to attach an old PS/2 keyboard to the MacBook and press the ‘A’ key on that, it would generate the scan code 0x1C, which would be translated into the keycode 0x00, which would still be translated into kVK_ANSI_A.

Suppose that I’m writing an application and I want the ‘A’ key to activate some sort of spinning ball-type animation. My application would register an event handler for the kVK_ANSI_A key down event. This would then work correctly no matter what keyboard was used (as long as it actually had an ‘A’ key).
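
The whole translation chain can be pictured as a couple of lookup tables. Here is a tiny conceptual sketch in Python, reusing the (guessed) values from the example above:

# Conceptual sketch only: real tables have hundreds of entries, and
# the scan code values are the guesses from the example above.
USB_SCANCODE_TO_KEYCODE = {0x04: 0x00}    # keyboard-specific
PS2_SCANCODE_TO_KEYCODE = {0x1C: 0x00}    # keyboard-specific
KEYCODE_TO_KEYSYM = {0x00: "kVK_ANSI_A"}  # the same for all keyboards

def keysym_for(scancode, scancode_table):
    return KEYCODE_TO_KEYSYM[scancode_table[scancode]]

# Different keyboards, different scan codes, same keysym:
assert keysym_for(0x04, USB_SCANCODE_TO_KEYCODE) == "kVK_ANSI_A"
assert keysym_for(0x1C, PS2_SCANCODE_TO_KEYCODE) == "kVK_ANSI_A"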

In addition to the normal keys, some keys act like modifiers: holding them down changes the meaning of other keys. These keys are called modifiers (I know!). Typical modifier keys are the Shift, Control and Alt keys. Modifier keys can be regular modifiers that need to be held down at the same time as the key being modified (like Shift normally behaves). Modifiers can also be Lock modifiers that toggle their function on and off (like Caps-Lock). There is a third variant called Sticky which behaves like a Lock modifier that is automatically unlocked after the next key press, so that the modifier is only applied to that key. Sticky modifiers save some key presses if the modifier is frequently applied to only a single key (like Shift) and rarely applied to several keys in a row (unlike rage-Shift).

So far, so general. Much of the tricky stuff happens in X so let’s look at how X handles key mapping.

X has a database with different key mappings. Each map converts keycodes to keysyms. These maps are useful because they allow you to remap the keyboard to different language layouts. My MacBook has a US keyboard. If I press the semicolon key it will generate keycode 47 and with a US keyboard map this will generate a semicolon character, which is great for programming. On a Swedish keyboard the key in that same position instead generates an ‘Ö’, otherwise known as a “heavy metal O” (not really). If I’m not programming but, perhaps, working on the script for a Swedish erotic movie, then I would want the semicolon key to generate the ‘Ö’ character because a semicolon is a lot less useful in porn than you might think, while ‘Ö’ is widely considered the most erotic vowel (again: not really. ‘Ö’ is the sound you make when you say “Duh!” but without the ‘D’). With the Swedish keyboard map, keycode 47 generates an ‘Ö’ character instead of the semicolon.

So you can load different keyboard maps depending on what you want to do, effectively switching between different keyboard layouts or languages. You can load an entire map using the “setxkbmap” command, or you can modify the current keymap with the “xmodmap” command.

By default the “xmodmap” command lists the available modifiers and the keysyms assigned to them. For example

shift       Shift_L (0x32),  Shift_R (0x3e)
lock        Caps_Lock (0x42)
control     Control_L (0x25),  Control_R (0x69)
mod1        Alt_L (0x40),  Alt_R (0x6c),  Meta_L (0xcd)
mod2        Num_Lock (0x4d)
mod3
mod4        Super_L (0x85),  Super_R (0x86),  Super_L (0xce),  Hyper_L (0xcf)
mod5        ISO_Level3_Shift (0x5c),  Mode_switch (0xcb)

Here we see the eight available modifiers. The assigned keysyms are listed together with their matching keycodes in parentheses. In this case you can see that a single keysym (like Super_L) can be mapped to several keycodes (0x85, 0xce).

“xmodmap -pke” lists the keycode-keysym mapping:

keycode 8 =
keycode 9 = Escape NoSymbol Escape
keycode 10 = 1 exclam 1 exclam
…

So let’s walk through what we want to do.

The Xmonad Mod key is aptly named because it works like a modifier key: it should only do something when combined with another key. The default key used as Mod in Xmonad is the left Alt key. Unfortunately the Alt key is perhaps the most important key in the whole universe if you run EMACS, which I do. So I really need Xmonad to use some other key for Mod. On normal keyboards you can usually find a couple of rarely used keys, Scroll-Lock and Print-Screen come to mind, but the MacBook keyboard lacks these. The only really useless key (for me) is Caps-Lock. Caps-Lock is also a good choice because it is well placed on the keyboard.

Caps-Lock is a bad choice because it is weird. Let’s say that we want to use the left Control key for Mod in Xmonad. This would be easy to do: just get Xmonad to use the “mod3” modifier as the Mod key (see above) and then remap the Control_L keysym to the mod3 modifier using xmodmap:

  • Clear all existing keysyms that might generate mod3:
 xmodmap -e "clear mod3"
  • Remove the Control_L keysym from any existing modifier (or it will do multiple things, which might be confusing):
 xmodmap -e "remove control = Control_L"
  • Assign the Control_L keysym to the mod3 modifier:
 xmodmap -e "add mod3 = Control_L"

This process does not work with Caps_Lock. If you run the “xev” program in an X terminal it will show you information about the mouse and keyboard events that are sent when you press keys (and move the mouse). What this program shows is that when the Caps-Lock key is pressed the X server will first send a Key Press event to say that the Caps-Lock key was pressed, immediately followed by a Key Release event to say that the key was released. The release event is sent right away, even if the key is being held down. This makes the Caps-Lock key not work as the Xmonad modifier since the modifier key needs to be held down to be combined with other keys.

I don’t know for sure which layer makes the Caps-Lock key behave in this way (OS X, VirtualBox or Linux/X), but I suspect that it is OS X. If I change the keysym being generated by the keycode (66) of the physical Caps-Lock key to another keysym, then that other keysym behaves in the same way.

Fortunately there is a solution: get OS X to generate another keycode when the physical Caps-Lock key is pressed. Note the difference between keysym and keycode here. It is relatively straightforward to map the keycode that Caps-Lock generates to a keysym different from Caps_Lock, but as we saw in the previous paragraph that did not fix the key press/release problem. If we can manage to convince the OS to generate a different keycode when the Caps-Lock scan code arrives from the keyboard, then no one will think that it still has the “Caps-Lock magic”. It is easy to get the Caps-Lock key to generate the keycodes for Control, Command and Option; just open the keyboard preferences and click on the “Modifiers” button on the first tab. This works fine unless you happen to want to use all of those buttons for other things (I use them as Control, Alt and “Escape from the VM”).

If you want to remap Caps-Lock to a keycode that is not Control, Command or Option you’ll need a program called “Seil”. First go into the Modifiers section of the keyboard preferences, as just mentioned, and change the Caps-Lock key to “No Action”. Then you can use Seil to get the Caps-Lock key to generate a different keycode.

The trick now is to find a keycode that is unused both by OS X and by X. I picked the PC_APPLICATION keycode (110) because the keyboard does not have such a key. Open Seil and go to Settings -> “Change the caps lock key” and check the “Change the caps lock key” box. Then change the “keycode” column to say “110”.

You can now go into your Linux virtual machine and run the “xev” program and verify that the Caps-Lock key generates the correct keycode. You will probably discover two things.

First, the Caps-Lock key does not, in fact, generate keycode 110. In my case it generates keycode 135 in X. This is not actually all that unexpected. Remember that X maps scan codes to keycodes, and that the VM translates the keycode 110 that OS X sends it into some simulated scan code for whatever keyboard the VM emulates. X then takes that emulated scan code and translates it into a keycode that X feels comfortable with. With the keyboard layout that I use in X, the PC_APPLICATION key is mapped to the keysym “Menu”, which seems reasonable.

The second thing your “xev” test may uncover is that keeping the key pressed will generate a stream of key press/release events: keycode 135 has key repeat enabled. This is easily fixed; just run the command

xset -r 135

to disable key repeat for keycode 135.

Now we’re getting close. When we press the Caps-Lock key on the Mac, the X server in the Linux VM generates a 135 keycode, which by default maps to the “Menu” keysym. Now all we have to do is to assign the “Menu” keysym to the “mod3” modifier, which brings us back to our old friend xmodmap.

xmodmap -e "clear mod3"
xmodmap -e "add mod3 = Menu"

And Bob’s your uncle! Or not, as the case may be. But mission accomplished!

The changes to the X key mapping will be lost on reboot and every time the keymap is reloaded, such as when switching language layouts. For this reason I’ve put the X configuration in a script that I execute after switching layouts. I have a file called .xmodmap with the following contents:

clear mod3
add mod3 = Menu

An executable bash script bin/xmodmap.sh with the contents

xmodmap ~/.xmodmap
xset -r 135

and then I run bin/xmodmap.sh from my .Xsession and every time I switch keyboard layouts.

An additional useful thing is to add the line

keycode 134 = Pointer_Button2

at the end of the .xmodmap file. This will make the right Command button emulate the middle mouse button, which is nice if you use it for pasting things.


Finally, here are two illuminating links if you feel like digging deeper:

https://www.berrange.com/tags/scan-codes/

http://www.win.tue.nl/~aeb/linux/kbd/scancodes.html


Update

The Menu keysym can cause extra characters to appear, so I switched it to Hyper_L instead. My input file to xmodmap became this (keycode 207 is where Hyper_L lives by default, per the xmodmap listing above, so it is cleared to make sure the keysym only comes from keycode 135):

keycode 207 =
keycode 135 = Hyper_L
clear mod4
clear mod3
add mod3 = Hyper_L

A Brief Guide to Wireless Security

Illustrations by Jonas Ekman.

WiFi security is a mess. The subject has an acronym ratio that would make a NASA space shuttle manual proud, and it seems like most areas of it overlap in some non-obvious fashion. I find that understanding why something strange is strange helps, but the “why” is never clearly explained in the various standards and design guidelines that constitute the pillars of the WiFi security morass. Yes, I prefer my metaphors mixed and nonsensical.

So the purpose of this text is to try to bring some clarity into the alphabet soup — without delving too deeply into the details — while also illuminating some of the reasons things ended up the way they are. There are a couple of good texts that explain the nitty-gritty, but if you just want to know how the hell TKIP relates to RC4 or what the difference is between WPA2 and RSN: read on.

Houston, we have a problem.

A concept that is important to keep straight is the difference between authentication and confidentiality. Usually people think “encryption” and are satisfied with that, because an encrypted communications channel of any quality provides both authentication and confidentiality. But for this discussion it is important to remember that the two concepts are different. Authentication means that you know with whom you are communicating. Confidentiality means that only the intended recipient can understand your communication. Clearly, confidentiality without authentication is a non-starter. After all, if you don’t know with whom you are communicating, you don’t know who is reading your messages. Authentication without confidentiality however, is conceivably useful. For instance, if you don’t care about keeping your messages secret, but want to prohibit your neighbour from freeloading on your WiFi network, then you might implement authentication without confidentiality.

I’ll be using the words “access point” and “station” a lot. An access point (or AP) is the WiFi “base station” — it might be a wireless router or an Apple TimeCapsule or a dedicated access point. It is the thing that provides WiFi access (hence the name) to a wired network. A station is the thing that connects to the access point — a PC or a tablet or a phone, for example. If you want to be strict about it an access point is also a station, but outside of the actual standard “station” usually means “a WiFi device that is not an access point”. That is also what I will mean by “station”.

The need for WiFi security

Let’s begin at the bottom: the ethernet layer. In a normal ethernet network, data is transmitted in the clear over physical wires. And people seem to be fine with that. Anyone with physical access can eavesdrop on all the traffic, but on the other hand anyone with physical access can cart off your whole computer. In a WiFi network, traffic ends up all over the place. You can easily eavesdrop on WiFi traffic from a nearby location. You can also easily send fake traffic into the WiFi network. The designers of WiFi thought that this was a problem (because the only way to live with the dullness of standards work is vigorous porn surfing), so they added some security measures to the standard: something called Wired Equivalent Privacy, or WEP.

WEP

WEP is pretty simple to understand. It deals only with the ethernet layer. It provides two authentication methods called Open System (which is actually no authentication at all, so that makes sense…) and Shared Key. It also provides confidentiality by encryption using the RC4 cipher. Note the difference between confidentiality and authentication — it is perfectly fine to use Shared Key authentication but no encryption. If your WiFi adapter is steam powered perhaps traffic encryption is prohibitively slow but since the authentication procedure happens only once you could probably live with that being a bit sluggish.
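
To make the “stream cipher” part concrete, here is a minimal RC4 sketch in Python. This is the bare algorithm for illustration only; WEP’s real weakness lies largely in how it builds per-packet keys by gluing a 24-bit IV onto the shared key, which is not shown here.

def rc4(key, data):
    # Key-scheduling algorithm (KSA): initialise the permutation S
    # from the key.
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) % 256
        S[i], S[j] = S[j], S[i]
    # Pseudo-random generation algorithm (PRGA): XOR a keystream with
    # the data. Encryption and decryption are the same operation.
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        out.append(byte ^ S[(S[i] + S[j]) % 256])
    return bytes(out)

ciphertext = rc4(b"a 13 byte key", b"attack at dawn")
assert rc4(b"a 13 byte key", ciphertext) == b"attack at dawn"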

I should mention here that Shared Key authentication has a flaw that makes it easier to crack a WEP encryption key. For this reason you should never use Shared Key authentication. If you must use WEP, use Open System authentication with traffic encryption (and also be aware that WEP as a whole is easily cracked these days).

Both the Shared Key authentication system and the RC4 traffic encryption need some sort of key to work. The WiFi standard, IEEE 802.11, specifies what the keys should look like but not (yeah, I’m laying down the law here) how you get them. This is important. How the keys end up in the encryption engine is “out of scope of the standard”, which is standardese for “if you ask us about this we’ll set fire to your car”. What this means in practice is that it is up to the designer of the WiFi adapter to decide on a way to enter encryption keys into the system. It might be via a keyboard (sensible), via punch card (impractical) or via Aldis lamp (silly).

WEP keys can be 40 or 104 bits long. Honest people call that WEP-40 and WEP-104, while people who can’t count call it WEP-64 and WEP-128 (because that sounds more secure). The 24 extra bits are the per-packet initialisation vector, which both key lengths use and which isn’t secret, so only the marketing department counts it.

WEP, if Swedish was a cipher. And let’s face it, Swedish is a cipher.

It is worth mentioning that WEP keys are strings of binary data, not “passwords” in the usual sense (like your mother’s maiden name, for instance). The security professional, being a bit of a bastard, calls what you normally think of as a password a “pass phrase”. People normally find it easier to remember some kind of word than a string of hexadecimal digits, so it would be nice to have some way to turn a pass phrase into a WEP key, and there are indeed many ways that this can be accomplished. Unfortunately, translating a pass phrase to a WEP key is “out of scope of the standard”. Which means that any access point might do it its own way, resulting in incompatibilities and chaos. In fact, what most manufacturers do (if they support pass phrases at all) is to allow fixed-length pass phrases where the ASCII representation of the pass phrase is used directly as the key. This is pretty simple to implement but it significantly weakens WEP because it effectively restricts the number of actual bits in the key, which makes it easier to crack.
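
In Python, that naive vendor scheme might look something like this (a sketch, not any particular vendor’s implementation):

def wep104_key(passphrase):
    # A 13-character pass phrase gives 13 bytes = 104 bits of key.
    if len(passphrase) != 13:
        raise ValueError("WEP-104 wants exactly 13 characters here")
    # Printable ASCII never sets the high bit of a byte, so each key
    # byte carries at most 7 bits of entropy -- part of why this
    # scheme weakens the key.
    return passphrase.encode("ascii")

print(wep104_key("correct horse").hex())  # 13 characters, as it happens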

As you may have gathered by now, WEP is a bit crap. The standards guys soon started getting angry notes from their bosses about the porn surfing again. Being engineers, they immediately set to work trying to solve the problem properly by introducing lots of acronyms.

In all fairness it should be stated that WEP was never intended to be a super-secure solution. As the name implies it was intended to merely provide “wired-equivalent” security. Something some people, red-faced and with veins pumping at the temples, will loudly denounce as “no security at all!”

802.11i – RSN

The 802.11i standards team meant to fix the security problems present in WEP. They defined something called a Robust Security Network, or RSN, by which they mean a WiFi network using the new security measures described in the 802.11i standard, as opposed to pre-RSN security measures (meaning WEP). In order for us to have another acronym, an RSN is established through an RSN Association, or RSNA.

802.11i defines two new types of confidentiality, essentially types of encryption, although calling them types of encryption can lead to some confusion, as we will see.

The first type is called Temporal Key Integrity Protocol, or TKIP for short. TKIP is intended to be a transitional encryption type (translation: TKIP is also a bit crap, but less crap than WEP). TKIP uses the same type of encryption algorithm, RC4, as WEP, which is what’s transitional about it — it is possible to implement TKIP on hardware designed for the original 802.11 standard, hardware that normally has RC4 encryption implemented in… er… hardware.

Why not call TKIP RC4, you ask? In fact, why not call WEP RC4 as well and get rid of two acronyms at once? Well, as the P in TKIP implies, TKIP is really a protocol. It specifies what the keys should look like. RC4 is a stream cipher and the data to be encrypted is packet based, so the protocol also specifies such things as how the data to be encrypted should be “chunked” before being encrypted, how to pad any chunks that are too small, and so on. These details are different for WEP and TKIP, thus the different names. Another way of looking at it is that if the hardware supports RC4 encryption, TKIP and WEP can both be implemented in software on top of it.

The second type of confidentiality is the splendidly named Counter Mode with Cipher-Block Chaining Message Authentication Code Protocol, or CCMP (if you don’t think the acronym makes sense, CCMP is actually short for CTR with CBC-MAC Protocol; “My acronym is too long guys. This calls for a new acronym!”). In the same way as TKIP is a protocol using the RC4 encryption algorithm, so CCMP is a protocol using the AES encryption algorithm.

CCMP — now that is a confidential type of confidentiality if you ever saw one. Rest assured, your porn is safe here! At least for the moment. At least as far as I know.

One important aspect of both TKIP and CCMP is that they use so-called Temporal Keys. A temporal key is only used for a short time and is then discarded. Limiting the amount of data being encrypted by the same key makes the key harder to break. Two parties that are to communicate using TKIP or CCMP need to do a little dance called a “four-way handshake” during which they negotiate a set of temporal keys that are used only for this session. The four-way handshake derives the temporal keys from the “normal” key. This “normal” key is called a master key and is normally equivalent to a pass phrase (with some extra processing to make a binary blob out of it). There are two types of master keys. Pairwise Master Keys are used for unicast data and Group Master Keys are used for multicast data. Avoiding acronyms is just not serious security, so these keys are called PMKs and GMKs for short. For the sake of this discussion the pairwise and group master keys behave in the same way, so I’ll just say “master key” and you can substitute it with “PMK or GMK” if you like. How the master key ends up in the system is… you guessed it, “out of scope of the standard”, but it is commonly entered by the user (or the user’s children).
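
For the pre-shared key case (more on that later) the “extra processing” is actually standardised, and it is short enough to show. A sketch in Python, with a made-up pass phrase and network name:

import hashlib

def wpa_psk_pmk(passphrase, ssid):
    # WPA-PSK derives the 256-bit Pairwise Master Key from the pass
    # phrase using PBKDF2-HMAC-SHA1, salted with the network name and
    # iterated 4096 times to slow down brute-force guessing.
    return hashlib.pbkdf2_hmac("sha1", passphrase.encode("utf-8"),
                               ssid.encode("utf-8"), 4096, 32)

print(wpa_psk_pmk("correct horse battery staple", "MyHomeNet").hex())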

EAPOL

There is a slight problem: the four-way handshake happens before encryption can start, because the encryption needs the key that is the result of the four-way handshake. This means that the four-way handshake must happen in the clear. This is such a massively complicated concept (“First unencrypted? THEN encrypted??? My God, man! I need to go lie down for a bit.”), that we apparently needed another standard for it. 802.1x (not a typo, 802.1x is part of the 802.1 family of standards, not the 802.11 family, and is applicable to other types of networks as well) essentially describes, in a very complicated way, a model where you communicate using two ports: one encrypted port and one unencrypted port. These ports are opened and closed magically.

I’ll summarise the important points about 802.1x here: An 802.1x port has nothing to do with ports in a switch, ports in a firewall, ports in a storm, disgusting beverages or the left side of a boat. To quote Monty Python and the Holy Grail: “It’s only a model.” 802.1x defines a packet type, called EAPOL (for EAP Over LAN (or Extensible Authentication Protocol Over Local Area Network (are we LISP yet?))), which is used to negotiate authentication parameters in the clear. The four-way handshake is performed using EAPOL packets. Once a temporal key has been generated, traffic starts being encrypted. When the 802.1x standard says “sending to the unauthenticated port” it means sending data without encryption, and when it says “sending to the authenticated port” it means sending data with encryption. The unauthenticated port only allows EAPOL packets; all other packets are discarded.

Establishing an RSN assumes that both parties (the access point and the station) have copies of the same master key, which has been put there through occult means. The station connects to the access point using the Open System authentication method (in other words: the connection is unencrypted and unauthenticated). When the connection has been made, a program called a “supplicant” on the station performs the four-way handshake with a program called an “authenticator” on the access point. As we’ve already seen, this handshake is performed by exchanging EAPOL packets. When the handshake has been completed, the station and access point each have an agreed-upon temporal key which they then start to use. If you sniff the WiFi traffic when an RSN connection is set up you can see the EAPOL packets in the clear, and then encrypted packets after the handshake has been completed.
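
For the curious, the temporal key material is derived roughly like this. This is a simplified sketch of the 802.11i pairwise key expansion; the real standard also slices the result into separate confirmation, encryption and temporal keys:

import hashlib
import hmac

def prf_512(pmk, label, data):
    # The 802.11i PRF: HMAC-SHA1 in counter mode, four 160-bit rounds
    # to produce 512 bits of key material.
    blob = b"".join(
        hmac.new(pmk, label + b"\x00" + data + bytes([i]),
                 hashlib.sha1).digest()
        for i in range(4))
    return blob[:64]

def pairwise_transient_key(pmk, ap_mac, sta_mac, anonce, snonce):
    # The nonces are random values exchanged in the first two EAPOL
    # messages. This is why a sniffer must capture the handshake:
    # without the nonces the temporal key cannot be recomputed from
    # the pass phrase alone.
    data = (min(ap_mac, sta_mac) + max(ap_mac, sta_mac)
            + min(anonce, snonce) + max(anonce, snonce))
    return prf_512(pmk, b"Pairwise key expansion", data)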

An 802.1x four-way handshake, juvenile style.

One consequence of the use of temporal keys is that if you want to decrypt a TKIP or CCMP connection using a packet sniffer like Wireshark, it is not enough to have the pass phrase. The sniffer has to have access to all the packets in the four-way handshake because that information is necessary to generate the temporal key from the pass phrase.

WPA/WPA2

Now, you’d be excused for grumbling “Why are you going on and on about this RSN thing that no-one has ever heard of? I want to know about WPA.” There’s actually a funny story behind that…

Ok, the story is not that funny. What happened was that the IEEE group that developed the RSN-specification (called, as mentioned, 802.11i) took a long time to agree. The 802.11i specification spent years and years in “draft” status which meant that you could read it and comment on it but it was not approved as a finished standard. In stepped the WiFi Alliance: a group of super-villains^H^H^HWiFi Manufacturers that banded together to try to make WiFi easier to use. The WiFi Alliance actually owns the trademark on the word “WiFi”; if not for them you would have called it “802.11b and 802.11g and 802.11n and 802.11ac” instead, so be grateful.

The WiFi Alliance recognised the threat to porn surfing posed by the weakness of WEP and the delays of 802.11i, so they took action. By picking some of the most useful bits of the 802.11i draft they could set their own intermediate standard that manufacturers could start to use while waiting for the 802.11i working group to extract their digits from their posterior apertures. The result was named WiFi Protected Access, WPA for short. WPA is pretty similar to RSN but lacks some features such as pre-authentication. It uses the same four-way handshake, the same crypto (although CCMP is optional in WPA) and the same key types as RSN.

Access point capabilities are announced in something called “Information Elements”. RSN capabilities are announced in a special RSN information element, but WPA could not use this since it would conflict with 802.11i if that ever settled. Therefore WPA had to use a special information element that had been set aside for vendor-specific extensions: the so-called Vendor Specific Information Element. The vendor is identified in the vendor-specific information element by including an OUI prefix, and for some reason WPA uses the OUI prefix belonging to Microsoft. Don your tin-foil hat here.

The WiFi Alliance has also specified WPA2, which is based on the final 802.11i standard. WPA2 includes all parts of 802.11i that are mandatory and is fully compatible with 802.11i. So you might variously see security settings being named “802.11i”, “RSN”, “WPA2” or combinations thereof — they all describe the same thing. If you want to be strict about it, 802.11i _defines_ RSN and WPA2 uses 802.11i to define a limited but compatible RSN, so the proper name for this type of security would really be “RSN”. Since WPA isn’t compatible with the RSN specified by 802.11i (because it uses a different information element) it cannot really be called an RSN (in my humble opinion). Device manufacturers might be forgiven for presenting the options as “WEP, WPA or WPA2” instead of “WEP, WPA or RSN”.

WPA-Personal and WPA-Enterprise

You may remember me claiming that the way the master key is entered into the system is out of scope of the standard (how could you possibly forget?). So how would you actually manage the master key in the real world? The simplest way, which is probably the most common, is as a so-called Pre-Shared Key. The Pre-Shared Key is a pass phrase that you enter into your access point on the one hand and into the devices you want to connect to the access point on the other. It turns out that you sometimes need a more sophisticated solution though. For example, say that you manage a company with a few hundred employees. Every person who needs to use your WiFi network needs to know the pass phrase. This is already pretty bad because every employee can decrypt everyone else’s traffic when they know the pass phrase. Even worse, if you have to strap a minion to an ejection seat after they’ve sold company secrets to a competitor, you need to change the pass phrase everywhere to keep him or her out.

Wouldn’t it be pretty convenient to have a key management system where you could assign pass phrases to individual users in a centralised fashion and then use those pass phrases to generate unique master keys for users as they try to connect to the WiFi network? Is that a “Hells Yes!”? I thought so.

The master key is managed by the key master. Or a RADIUS server.

In order to support a centralised key management system the access point must allow the station to somehow communicate with a third party for the key exchange: the station needs to verify the user pass phrase with the key management system. This is where the layering abstraction gets a bit muddled. It would be nice and clean if the 802.11i standard could just leave the whole key-management issue to some upper layer, but unfortunately it can’t — it must provide a means for conveying the authentication traffic between the station and whatever key management system exists in the network. This traffic doesn’t really belong in the “ethernet” layer as it could span many different network technologies. If this makes you feel as if 802.11i is trespassing on OSI layers that it should leave well enough alone, I understand you. On the other hand, if the 802.11i working group had tried to fix it some other way they would have had to create a bunch of extra architecture-astronaut standards and there would probably have been XML in there somewhere. Let us count our blessings and live with the warts.

As fortune would have it, supporting authentication in a network is basically what our old friend 802.1x was designed for, so 802.11i uses that. In 802.1x the password management system is called an authentication server (AS). The authentication server works together with the supplicant (on the station) and the authenticator (on the access point).

The role of the authenticator is different in the pre-shared key case and in the 802.1x case. With a pre-shared key the authenticator is the party that negotiates the temporal key generation with the supplicant. With 802.1x the authenticator only acts as an intermediary: it allows the supplicant to communicate with the authentication server using EAPOL while blocking all other traffic to prevent the station (which is still unauthenticated, remember) from accessing the network in any other way.

Using 802.1x for key management is a great idea and also a cause of terminological befuddlement. You see, the 802.11i standard calls the pre-shared key system “PSK” (for Pre-Shared Key), which is fine. But then it calls the password management system “802.1x” which is technically correct but confusing because the pre-shared key system also uses 802.1x for the four-way handshake.

In the interest of keeping things simple (an admirable goal) the WiFi Alliance left the key-management parts of 802.11i out of the “basic” WPA specification, which they called “WPA-Personal”, sometimes also called “WPA-PSK”. Instead they defined an enhanced version of WPA called “WPA-Enterprise”, sometimes called “WPA-802.1x”, which includes the key-management support.

The most common key management system is called RADIUS. RADIUS is short for Remote Authentication Dial In User Service which tells you that RADIUS is not the most hip acronym on the block. Dial In was exciting back when porn was distributed on magnetic tape. In 802.11i RADIUS is not the only option for key management, other systems are allowed. However, WPA-Enterprise requires RADIUS and it is so common that when you speak of “RSN with 802.1x” (as you do) then RADIUS is often assumed.

The Brass Tacks

The brass… the… what? What? “Brass tacks”? Get it together English, that makes no sense at all.

What did we learn today? We learned that a RSNA using TKIP (RC4) or CCMP (AES) via PSK or a 802.1x AS is established through a WPA or RSN IE from STA to AP so the PTK/GTK can be derived from the PMK using EAPOL in IEEE 802.11i. FYI: that SOB is a real PITA.

Also, I lied when I titled this post “A Brief Guide…”

If you made it this far you deserve a cookie.

The Case for Simpler Tools

George R. R. Martin famously writes his books using WordStar 4.0 on a DOS-machine. He has a “real” computer on which he reads email and surfs the web and does all the other things required of modern man. His writing computer has no network and can only do one thing. It doesn’t try to spell check for him. It doesn’t pop up alerts that an email has arrived, or that Stephen King wants to chat. A focused tool allows him to focus. I think there’s a lesson here for all of us: when you focus you end up killing all your protagonists. No, that’s not it. When you focus you start thinking that the phrase “her nether lips” has no cheese in it? Not that either. Is it that when you focus you can write endless amounts of stuff? Yes! Exactly! Focus increases productivity.

Focus level one million.

I’m going to go ahead and postulate that focused tools help you be more productive, based on no evidence whatsoever. I feel this to be true and I find myself gravitating towards tools that are designed to do one thing well and that otherwise get out of my way as much as possible. I’m not trying to describe an objective reality: when I say that IDEs are evil I don’t mean that you are a bad person if you like IDEs. I’m just saying that I hate you and that your lineage is questionable.

Computer programs used to be very focused. This was because the computers of old were very limited, so their programs had to be simple. As computers and their operating systems got more powerful and more and more tasks were computerised, it became possible to try to do more than one thing at a time. And we did. I’m inclined to think that it is non-controversial to say that perhaps we overdid it.

I think I can detect a slight resurgence of the focused tools though. Smartphones and tablets have brought some of the old focus back. An “app” commonly occupies the whole screen and doesn’t trick you into trying to do five different things at once. And some people — like old George R.R. — have started to recognise that multitasking isn’t helping us do what we want to do (unless “what” is stressing out and dying at 45). Multitasking, by definition, removes focus. Less focus, less actual work gets done. As I’m sure is proven by a paper somewhere.

A rant for your entertainment

Most IDEs, or Integrated Development Environments (in case you’re reading this for the jokes), illustrate the lack of focus that I mean. I don’t know about you, but when I code I want to see code. As much of it as I possibly can. I have a seriously defective short-term memory so I’m constantly jumping around the code, checking what a function is doing; what a parameter is supposed to be; what that struct field is called. Sometimes I need to see some trace output and two or three source files at the same time. To me, screen real estate is precious.

An IDE commonly splits the screen into a series of panes or windows. There’s usually a project window on the left side where you can see files and folders. Why the heck would I want to see files and folders while I’m programming? If I’m looking for a file then a window for that could pop up, but it’s not like I’m constantly checking the file names while I’m writing code. On the bottom there may be some sort of message or status window. The status is that I’m programming! Leave me alone, message window! On the right side you might find a debugger window or a symbol browser or something like that. A GUI can be really great when debugging but that is only like 90% of the time so please get it out of my sight!

In the middle, fighting for relevance, is (if you are lucky) a poor source code editor in which you do your actual work. In the end your sweet 24″ monitor is reduced to a 17″ editor window and that is just no way to live. I realise that you can probably apply XML voodoo somewhere and get most IDEs to look like EMACS and that is great. My point is that viewing all the available tools all the time may look cool in a screenshot but is just distracting in practice.

A totally representative screenshot of an IDE.

I won’t go on and on about applications that I think are overly complicated or badly designed. I will instead write a little about those that get it right. Tools that I think exhibit an admirable level of focus and are a joy to use because of it. Things that help me focus. I apologise if this makes me sound like a management consultant peddling a trivial slogan, “simplicity and focus”, as some kind of road to success. I would never sell out like that unless someone offered me actual money.

EMACS

My preferred editor is EMACS. I recognise that I would absolutely hate EMACS if I had been introduced to it today. Back when I learned EMACS I was young and dynamic, curious, hungry for learning, and smelling slightly of peach. The inscrutability of EMACS made me want to learn it because knowing it would be impressive, and then (in the immortal words of Eddie Izzard), “people would shag me”. Now I have become a grumpy older guy. Some would say “mature” but that always makes me think of a particularly stinky cheese which is actually disturbingly accurate… These days I just want to get on with things. Write the code, fix the bug, have some coffee and get back home in time to mow the lawn and watch a bunch of identical TV shows about home styling. But I digress.

The reason I bring up EMACS is not because Meta-Control-Cokebottle-Hashtag-Detonate is the natural way to type the letter ‘n’, it is because EMACS illustrates the way good tools get out of the way. EMACS is pretty hard to learn, but that does not mean that it is hard to use. “Easy to use” is often implicitly taken to mean easy to use for the beginner. This is not at all the same thing. Great tools are not designed to be easy to learn, they are designed to be easy to use. Easy to use, easy to learn, powerful. Pick any two.

Once you know the basics of it, EMACS allows you to edit text without EMACS getting in your way. It doesn’t hide some things in obscure menus, it hides everything. There’s no menu bar to steal pixels from your precious text. Well, actually, there is, but fortunately you can turn that off with some LISP voodoo (clearly superior to XML voodoo). There’s no toolbar with stupid icons. Ok, it has that too, but it can thankfully be disabled with some LISP magic. There’s no annoying status area. Eh, ok, there is a status area and I’m not sure that you can turn that off. But it is really small! As I have convincingly demonstrated, EMACS is clearly superior to any IDE.

When all the voodoo has been applied EMACS really doesn’t get in your way.

Distraction-free EMACS. You can almost smell the awesome.

iA Writer

Another editor that I really like is the one I’m using to write this very text: a program called iA Writer. iA Writer is designed explicitly to embody the idea of focus and simplicity. In its full screen mode iA Writer displays only the text that you write. No menus, no widgets, just black text on a white background. Options are severely limited: no font selection, no layout, no pictures. Just text. It’s not that iA Writer has no features, but the features have been very carefully picked. I find that this really helps. iA Writer is my WordStar 4.0.

I’m writing this on a bus on my way to work. That means no Internet. The combination of iA Writer and no Internet really makes a huge difference to my productivity. Being on a bus with nowhere else to go also helps.

If you don’t get writer’s block by staring at this full-screen actual screenshot of iA Writer then you never will.

Xmonad

I recently ended a 13-year love affair with the Sawfish window manager when I left it for Xmonad. Xmonad is a tiling window manager, which means that it will put the windows on the screen according to a predefined layout, with no gaps. This means that there is no desktop, which I think is great. The desktop metaphor was useful when people didn’t know what a GUI was but it doesn’t provide any benefit to me. Why would I want a bunch of files lying around behind everything else?

The tiling aspect of Xmonad sort of forces you to be more organised. Since you cannot have windows that overlap, you necessarily always see every open application. Instead of having a stack of windows ten deep, you need to put things on separate desktops. While this can at times feel restrictive, and doesn’t work well for some applications that love to open gaggles of small windows (GIMP, I’m looking at you!), it is remarkably liberating once you get used to it.

My Xmonad desktop, not at all posing for the shot.

And the rest…

I’ll add two examples of simplicity and focus from the wonderful world of web.

This is the web page of Berkshire Hathaway. It is easily one of the 10 biggest publicly held companies in the world. Berkshire Hathaway owns, among many things, large parts of the Coca-Cola Company, Wells Fargo and American Express. A single share of Berkshire Hathaway is worth $190,000 at the time of writing. Look at that web page again. It takes a certain kind of self-confidence to maintain a web page that simple. And yet, nothing more is really needed.

“Programming in the twenty-first century” by James Hague is my favourite programming blog. The site is, apparently, generated by 269 lines of Perl, including the HTML templates. Both the site and its implementation embody simplicity and focus. Although I haven’t seen the Perl code I’m assuming that it is simple and pretty, based on Mr Hague’s blog posts. But I guess that it could also be Perl code.

Finally, I’ve written a little bit about analogue modular synthesisers. While a modular synth is by no means simple, I think the resurgence and current popularity (ok, “popular” is a rather relative concept) of analogue modulars at least in part reflects a wish to remove options. These days, when you make music in a computer your power is almost unlimited. Major music-making applications are huge and very complex and they encourage you to dive down and micro-edit the music. With a modular synth you remove a lot of the control. While you can nudge an analogue synth into a direction of your choice, you cannot force it to do your bidding exactly. You need to let go of some control and let the machine work on its own a little bit. This can be very freeing and creative. To be fair, a modular synth can also be an example of the exact opposite: an example of a tool that is so complex (and engrossing) that it consumes all your time and prevents you from spending any of it on the reason you bought the damned thing in the first place — the music.

Obligatory parting words of wisdom

Like anything in life, simplicity is best in moderation. A word processor in which you could only write the word “goat” would, perhaps, be too simple. A car without seat belts would be simpler, but not better. But in some cases, simpler really is better. The UNIX philosophy, if you will. The value of focus is readily evident in physical objects. Most people would probably agree that, all else being equal, a simple object is better than a complex one. Compare the Apple TV remote with the remote of any cheap TV. The TV remote isn’t complex because those who designed it thought that complex was better, it is complex because they couldn’t afford — or simply didn’t know how — to make it simple. Well-judged simplicity is a valuable skill. Treasure it.

A Bug’s Life : Security Level Cray

Illustrations by Jonas Ekman.

When the nineties were getting old and people could value a pet shop at 100 million dollars as long as it had a web page, I was missing out on the hype by working as a sysadmin at a supercomputing site. This was before supercomputing got boring; when a super was a really odd machine running a demented OS instead of just being a thousand of what you have on your desktop. One of my babies was a Cray J90, more specifically a J932SE.

The J90 was, admittedly, not one of the awesomest Crays. It did not have an integrated couch (Cray 1, Cray X-MP, Cray Y-MP).

Cray 1, when you want your super in the lobby.

You couldn’t serve beer on it (Cray 2).

The Cray 2 gets my vote as the coolest looking computer ever. It is a coolant aquarium!

It did not have a waterfall (Cray T90).

A Cray T90 with coolant waterfall behind it.

However, it was a 32-CPU shared-memory vector supercomputer with a gigantic (at the time) 8GB of RAM. It was the size of 2 largish fridges, ran UNICOS — the Cray flavour of UNIX — and could multiply 64-bit floating point numbers like nobody’s business.

A Cray J90 wishing it had more awesome looks.

Seymour Cray was famous for his disdain for virtual memory and thus UNICOS did not have any. If you had 8GB of RAM then that was all you could use. When you wanted to start a process you’d better hope that the system could find a free, contiguous chunk of RAM of the appropriate size. If your process ran out of memory it would be swapped out to disk in its entirety. Needless to say, swapping was death for performance.

A Cray machine was a reasonably expensive piece of equipment (some may even have called it unreasonably expensive) and most, very likely all, were bought to service many users at the same time. One of the system software packages Cray sold was called Multi-Level Security, MLS for short. MLS allowed you to virtually partition the machine in a fairly bullet-proof fashion. You could buy a 32-CPU system and let some users see 30 CPUs and half the memory while other users only saw 2 CPUs. If you were running in a partition you couldn’t access or even see processes running in another partition. In fact, you could be a system administrator and still have access to just a part of the physical machine. The effect was similar to virtualisation but without the performance penalty. MLS was a fabulous idea because it allowed, say, the NSA and the North Korean embassy to pool their resources and buy a Cray together and their users wouldn’t see each other. What could possibly go wrong?

MLS was a pig and a half and any sane sysadmin would run from it screaming. Some poor souls were forced to use it because it sounded like a great deal to management. We were lucky because our machine was open and we did not run MLS.

A typical sysadmin reaction.

One carefree day I came into the office and found an envelope in my mail box with a DAT tape in it. The label on the tape said UNICOS 10.0. UNICOS 10 had been out for a while and we could not put it off any longer: it was time to upgrade the OS on our machine from 9.0 to the latest and greatest.

An OS upgrade on a million dollar multi-user system is a pretty major undertaking. The machine ran batch jobs for dozens of users. Jobs could be queued for weeks waiting for a time slot to run in, and a single job could execute for days. Just turning the machine off would get you murdered by a PhD student (bad) or lectured at by some mechanics professor (worse). The OS upgrade had to be announced weeks in advance and planned like a Prussian wedding. The plan was written, the upgrade announced, the sacrificial goat was prepared and magical runes were appropriately carved on the computer room doors to ward off evil upgrade gremlins… the process for OS upgrades can be highly site-dependent.

The day arrived. We stopped the batch queues, rebooted the machine into single-user mode and went to work. The upgrade went well and after a few hours of testing we could restart the batch queue and let the users back on. Mission accomplished.

A week or so later I was doing some routine work (the technical term is “dicking around”) on the Cray and I discovered something strange: there was a file in the root directory that was owned by a normal user. The user was one of my co-workers and he did not have root access to the Cray. The file should have been unpossible. My immediate worry was that we had been hacked. A supercomputing site is a juicy target for hackers and a super, with its huge I/O capability and usually great network connectivity, can make an awesome denial-of-service booster.

The co-worker was trusted so I asked him about the file. He had been testing a script with a path bug in it, so he could explain where the file came from. He could not explain how it had managed to be written in a read-only directory. This pretty much discounted the theory that the file was part of a hacking/cracking/fracking operation. Looking around some more I quickly found other files that were owned by users in places where those users didn’t have write access. After some testing the horrible truth dawned: any user could write anywhere. While academic institutions have a reputation for openness, even Richard Stallman might agree that this was a step too far.

We had a common login system for all our machines. It was based on an implementation of Kerberos called Heimdal that had been developed locally, and luckily for me I shared an office with one of the main authors of Heimdal. He was the kind of wizard every computing site needs. If you complained vaguely about some proprietary protocol not working he would sit quietly for 30 minutes (not acknowledging that he had even heard you) and then suddenly announce that he had reverse-engineered the protocol and mailed you the source code for a replacement server that he had coded up from scratch. Once, when confronted with what appeared to be a corrupt file name in a directory, his first action was to hexdump the directory inode. While he did not have an actual, physical, beard, his virtual Unix beard was mighty to behold.

Everyone needs a UNIX beard.

I was a bit out of my depth so I recruited my wizardly office mate to the cause. The best way to do that was to not ask for help directly but to just complain generally until he either got tired of listening to it or got curious. After applying beard to the problem for a while he managed to figure out what was going on.

The Heimdal kxd program was responsible for accepting logins via the network, authenticating them and spawning a user login session on the machine. When doing this it had to change the owner of the spawned process to the UID of the user logging in, which it did with the setuid() system call. This used to work just fine. But now the setuid() call changed the UID and left the process with root file access permissions.

As a part of the planning for the OS upgrade I had pored over the release notes for UNICOS 10 time and again to make sure that I hadn’t missed something that would cause us problems. One of the changes that Cray had made was that MLS, the Multi-Level Security that everyone loved so much, had been included as a default component in the OS. It used to be a separate product but now everyone got it. The release notes promised that if you disabled MLS in the kernel it would work exactly as a non-MLS system did in UNICOS 9. What the release notes should have said was that if you disabled MLS it would work exactly as a non-MLS system did, except for the part where we changed the way access permissions are inherited by the setuid() system call. A consensus was developing around the view that MLS was the spawn of Satan.

Are you sure you didn’t order an MLRS?

The virtually bearded one went quiet for 15 minutes or so and then announced that he had fixed it. The fix can still be found in the Heimdal source code today:

if (setgid (passwd->pw_gid) ||
    initgroups(passwd->pw_name, passwd->pw_gid) ||
#ifdef HAVE_GETUDBNAM /* XXX this happens on crays */
    setjob(passwd->pw_uid, 0) == -1 ||
#endif
    setuid(passwd->pw_uid)) {
    syslog(LOG_ERR, "setting uid/groups: %m");
    fatal (kc, sock, "cannot set uid");
}

I would frankly have been a lot less restrained in the comment…

Cleaning up the mess took a while; we had to check all file systems for integrity. Happily, we had dumped all system disks to tape after the upgrade but prior to letting any users on, so we had a known-good set to compare against.

Lessons learned? If you change the semantics of setuid() then maybe, just maybe, drop a line about it in the release notes. Also, a paranoid mindset is a great asset to a system administrator (although it can definitely go off the rails without some healthy pushback from management) and preparation is key to any successful system upgrade.

While this MLS thing was a bit of a fiasco I have to say that UNICOS was incredibly stable. It just ran for years and years and the only crashes that I can remember were caused by hardware failures (typically hard drives). Oh, and that one time when I shorted out the entire machine with a carelessly applied screwdriver…

A Bug’s Life : The One Sample Solution

At one point in my life I found myself working for a company that made synthesisers of the musical instrument kind. The guts of the synths consisted of a bunch of DSPs that did all the audio processing, some memory, and a general purpose CPU that handled the rest: scanning the keyboard, scanning the knobs and buttons on the control panel and providing MIDI and USB connectivity.

One of my jobs was to clean out a list of bugs that users had reported on a released product. Most of these bugs were minor and sometimes obscure but the company prided itself on high quality so they would strive to fix all reported bugs, no matter how minor. Unless the bug report was that “the product should cost 50% less, be all-analogue, and be able to transport me to my place of work”. Some people have strong feelings of entitlement.

One of the bugs on my list was that the delay glitched if you turned it off and on again quickly. A delay is a type of audio effect that simulates an echo: if you send a sound into it then the sound will be repeated some number of times while fading out. Early delays were implemented using magnetic tape. If the recording head was placed some distance ahead of the playback head then a sound that was recorded into the delay unit would play back as an echo a short while later as the tape passed by the playback head. In this particular case the delay was digital and consisted of a circular buffer of audio samples that the DSP would continuously feed data into (mixed together with feedback from audio that was already in the buffer). The buffer would then be played back by being mixed with the main audio output signal.
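
A delay like this is simple to express in code. Here is a minimal sketch in C, assuming floating-point samples for clarity (the real thing was fixed-point DSP assembly, and all the names here are mine):

#define DELAY_LEN 44100              /* one second of delay at 44.1 kHz */

static float delay_buf[DELAY_LEN];   /* the circular buffer of samples */
static int   pos = 0;                /* current position in the buffer */

/* Process one input sample and return it with the echo mixed in. */
float delay_process(float in, float feedback, float mix)
{
    float echo = delay_buf[pos];            /* oldest sample in the buffer */
    delay_buf[pos] = in + echo * feedback;  /* feed input plus feedback back in */
    pos = (pos + 1) % DELAY_LEN;            /* advance and wrap around */
    return in + echo * mix;                 /* blend the echo into the output */
}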

A tape-based delay effect unit.

The delay had an on/off button on the front panel of the synth, and when you switched it off the DSP would set the input volume to the delay buffer to 0 so that it would fill up with silent samples. However, since this happened at the normal audio rate, it would take a few seconds before the whole buffer was filled with zeros. A user had discovered that if you played something with the delay enabled, switched it off, and then quickly switched it on again, parts of the delay buffer would still contain sound that you would hear, sometimes with nasty transients. The solution was to program the DSP to quickly zero the delay buffer using a DMA transfer whenever the delay was turned off.
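
The shape of the fix, sketched in C rather than the actual DSP assembly, was something like the following. Every register name and address below is invented for illustration; the chip's real DMA programming model is not described here:

#include <stdint.h>

/* Hypothetical memory-mapped DMA registers, invented for this sketch. */
#define DMA_SRC   (*(volatile uint32_t *)0x00200000u)
#define DMA_DST   (*(volatile uint32_t *)0x00200004u)
#define DMA_COUNT (*(volatile uint32_t *)0x00200008u)
#define DMA_CTRL  (*(volatile uint32_t *)0x0020000Cu)
#define DMA_FILL  0x2u  /* "fill destination from a single source word" */
#define DMA_GO    0x1u

#define DELAY_LEN 88200                       /* delay buffer length in words */
extern int32_t delay_buf[DELAY_LEN];          /* the delay line */
extern volatile int32_t delay_input_gain;     /* input level into the delay */

static const int32_t zero = 0;

/* On "delay off": cut the input, then wipe the buffer at memory speed
 * instead of streaming seconds of silence through it at audio rate. */
void delay_off(void)
{
    delay_input_gain = 0;
    DMA_SRC   = (uint32_t)(uintptr_t)&zero;   /* re-read the same zero word */
    DMA_DST   = (uint32_t)(uintptr_t)delay_buf;
    DMA_COUNT = DELAY_LEN;
    DMA_CTRL  = DMA_FILL | DMA_GO;            /* fire and forget */
}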

This may sound trivial but the code running on the DSP was hand-coded assembly optimised to within an inch of its life. The DSP had one job: to present a pair of completely processed 24-bit samples — a stereo signal — to the Digital-Analogue converter inputs at the sample rate, which was 44100 times per second. If a sample wasn’t ready in time then digital noise would result at the output. This was frowned upon because it would sound horrible and if it happened while the instrument was being played on stage, through a serious sound system, you might as well poke out the eardrums of your audience using an ice pick. This made the rules of the game for the DSP pretty simple: if it could execute one instruction per cycle and was running at F cycles per second then that meant that it could spend N=F/(2*44100) instructions per sample. In fact it had to spend exactly N instructions or the timing would drift off. Any unused cycles had to be filled in with “No-op” instructions that do nothing but waste time. In this case N was a couple of hundred cycles. This meant that the DSP code was a couple of hundred instructions long, which was good because there was not so much of it, but bad because there were very few unused cycles left in which to set up the DMA.
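
To put numbers on that budget (with an invented clock rate, since the real one isn't given here): a DSP executing one instruction per cycle at 20 MHz would get N = 20,000,000 / (2 × 44,100) ≈ 226 instructions per sample, which is indeed a couple of hundred.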

This type of DSP is built to do one thing: to perform arithmetic on 24-bit fixed point numbers. It is a multiply-add engine. Multiplying and offsetting 24-bit fixed point numbers is easy and everything else is a pain in the upholstery. Instructions are often very limited as to which registers they can operate on and the data you want to operate on is therefore, as per Murphy’s law, always in the wrong type of register.
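
As a rough illustration of what such a chip does all day, here is the multiply-add on 24-bit fixed-point numbers expressed in C. The Q1.23 format is an assumption on my part, and the real DSP does all of this in a single hardware instruction:

#include <stdint.h>

/* Q1.23 fixed point: values in [-1, 1) stored in the low 24 bits
 * of a 32-bit integer. */
typedef int32_t q23_t;

/* Multiply two Q1.23 numbers; the double-width product has 46
 * fractional bits, so shift back down to 23. */
static q23_t q23_mul(q23_t a, q23_t b)
{
    return (q23_t)(((int64_t)a * b) >> 23);
}

/* The operation the DSP lives for: acc += a * b. */
static q23_t q23_mac(q23_t acc, q23_t a, q23_t b)
{
    return acc + q23_mul(a, b);
}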

After scrounging up some spare cycles I managed to set up a DMA that zeroed the delay buffer whenever it was turned off. Apparently. I tested it: no problem. I told my boss and he came in and listened. Now, my boss had worked with these synths for years and years and he immediately heard what I didn’t: a diminutive “click” sound that was so weak that I couldn’t hear it at all. “Probably you didn’t turn off the input to the delay buffer, so a couple of samples get in there while the DMA is running.” I verified it, but no, the input was properly turned off.

Everyone needs Wilson’s Common Sense Ear Drums.

Now that I knew what to listen for I could hear the click if I put on headphones and turned the volume up. Headphones are always scary when programming audio applications because if you screwed up the code somewhere you might very well suddenly get random data at the DACs, which means very loud digital noise in the headphones, which means broken eardrums and soiled pants in the programmer. In contrast, a single sample of random data at 44.1 kHz sounds a little bit like a hair being displaced close to your ear. In the beginning I had to stop breathing and sit absolutely still to hear the click. Moving my head just a little would make mechanical noise as the headphones resettled, noise that would drown out the click. After a while, though, my brain became better at picking out the clicks; after a day or so I could hear it through loudspeakers. Unfortunately I soon started to hear single-sample clicks everywhere…

"Wait, wait! I think I can hear it!"
“Wait, wait! I think I can hear it!”
The good thing about this bug was that it was fairly repeatable. I like to think that no repeatable bug is fundamentally hard. Unless the system is very opaque you just have to keep digging and eventually you’ll get there. It may take a long time, but you’ll get there.

So what was going on? Was the delay buffer size off by one? No. Was the playback pointer getting clobbered? No. Did the DMA do anything at all? Yes, by filling the buffer with a non-zero value and then running the DMA I could see that the buffer was zeroed afterwards. Was it some sort of timing glitch that caused a bad value to appear at the DACs? Not that I could tell. Blaming the compiler is always an option but in this case the code was hand-written assembly so that alternative turned out to be less satisfying. A compiler can’t defend itself but your co-worker can…


The debugging environment was pretty basic. The only tool available was a home-grown debugger that could display something like 8 words of memory together with the register contents, and not much else. Looking through a megabyte or so of delay buffer data through an 8-word window might sound like fun but I guess I’m easily bored…

One thing that became apparent after a while was that the click sound would appear even when the delay buffer was empty to begin with. This indicated that the bad data was put there instead of being left behind. At this point I started to throw things out of the code. I find this to be a pretty good way forward if a bug checks your early advances. Remove stuff until the bug goes away, then you can start to add stuff back again. After a while I had just the delay code left. I wrote a routine that filled the delay buffer with a known value, ran the DMA, and then stepped through the whole buffer, verifying the contents word-by-word. And bingo, it found a bad sample.
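
The routine was along these lines (a C sketch of the idea; the original was DSP assembly, and run_zeroing_dma() here is a stand-in for kicking the actual DMA):

#include <stdint.h>
#include <string.h>

#define DELAY_LEN 88200              /* buffer length in words */
#define MARKER    0x5A5A5A           /* known non-zero fill pattern */

static int32_t delay_buf[DELAY_LEN];

/* Stand-in for starting the suspect DMA and waiting for completion. */
static void run_zeroing_dma(void)
{
    memset(delay_buf, 0, sizeof delay_buf);  /* placeholder for the DMA */
}

/* Fill the buffer with a known value, run the DMA, then check every
 * word. Returns the index of the first bad sample, or -1 if clean. */
static int verify_dma_zeroing(void)
{
    int i;
    for (i = 0; i < DELAY_LEN; i++)
        delay_buf[i] = MARKER;
    run_zeroing_dma();
    for (i = 0; i < DELAY_LEN; i++)
        if (delay_buf[i] != 0)
            return i;
    return -1;
}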


Seeing it made it real. As long as I only heard it, the defect could theoretically have been introduced somewhere later in the processing or in the output chain. But now I could see the offending sample through my tiny window, sitting there in the middle of a sea of zeros. What is more, the bad sample did not contain the known value that the delay buffer was initialised with. At this point suspicions were raised against the DMA itself. A quick look through the DSP manual revealed an errata list for the DMA block several pages long. This DSP had a horrendously complex DMA engine that could do all sorts of things. For example, it could copy a buffer to a destination using only odd or even destination addresses — in other words it could copy a flat buffer into a single channel of a stereo pair. It seemed like half of these modes didn’t really work.
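
In C terms, that odd/even destination mode buys you something like this (illustrative only; the DMA did it without the CPU's involvement):

#include <stdint.h>

/* Copy a flat (mono) buffer into the left channel of an interleaved
 * stereo buffer, i.e. write only the even destination addresses. */
void copy_to_left_channel(const int32_t *mono, int32_t *stereo, int n)
{
    for (int i = 0; i < n; i++)
        stereo[2 * i] = mono[i];   /* odd slots (right channel) untouched */
}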

None of the issues listed in the errata list fit what I was seeing but I still eyed the DMA engine suspiciously. I therefore tried to zero the delay buffer using a different DMA mode and it worked! Ah, hardware: can’t love it, can’t kill it. Hardware designers on the other hand…


So the bug was put to rest and I moved on to the next issue on my list. Did we learn something? When you can’t blame the compiler, blaming hardware remains an option. In the end I came away pretty impressed by the dedication to quality and stability that this small company displayed. The original bug report was on a glitch that many companies wouldn’t bother to correct at all. The initial fix took maybe half a day to implement and took the amount of unwanted data in the delay from 1-2 seconds (between 90000 and 180000 samples) down to just a single, barely audible, sample in what was already a rare corner case. Fixing it completely took about a week. In other words, it took four hours to fix 99.999% of the bug and 36 hours to fix the rest of it. But the message was pretty clear : fix it and fix it right.

And to all the semiconductor companies out there: don’t let your intern design the DMA engine.

The Philosophy of Bug-fixing

I spend way too much time fixing bugs. I don’t say that because my code is excessively buggy (although that is also true), I say that because quite a few of the bugs I fix are probably not worth fixing. Not really. Let’s imagine…

Your customer has submitted, along with vague threats of ninja assassins being dispatched to your company headquarters, a bug report about a crash in your software. He cannot really supply any useful information on what led up to the crash but the error message tells you exactly where in the code the crash occurred. Firing up your favourite editor you take a look. The crash happens because an unallocated resource is freed.

A bug :

mystificate(wizard)
    ...

    fire(wizard)

The crash is easily prevented: just check if the resource is allocated before freeing it. The thing is, that resource should be allocated at this point in the code. You know that it usually is or you would have seen this crash before. The crash, while a bug in itself, is clearly also a symptom of another bug. What do you do and why?

A bug fix :

mystificate(wizard)
    ...

    if (hired(wizard))
        fire(wizard)

Any diligent programmer would at least spend a little time considering the implications of this discovery. If you can easily figure out how the resource came to be unallocated then perhaps you can fix that too. And you should. You may also realise that the unallocated resource is a fatal error which absolutely, positively, must be fixed or the plague breaks out. Put on your mining helmet and start digging. Perhaps your product is a pacemaker. Fix the bug.

Most of the time though, the situation is not as clear-cut. You just don’t know what the implications are. The bug could be a corner case with no other consequences than the crash that you just fixed. Or it could be a symptom of some intricate failure mode that may cause other bad things to happen randomly. I would guess that most competent programmers would feel decidedly uncomfortable papering over the problem. Some may even find it hard to sleep at night, knowing that the bug is out there, plotting against them. They lie awake worrying about when and how the bug will crop up next. So you start to investigate.

Initial state.

You rule out the most likely causes. You try to repeat the problem. You try to make it happen more often. If you’re an embedded programmer you verify five times that RX and TX pins haven’t been swapped somewhere. You instrument the code. You meditate over reams of trace printouts. You start to suspect that this is a compiler bug, but that is almost never really the case. You read the code (desperation has taken hold). You poke around in a debugger. You remove more and more code, trying to reduce the problem. You start to become certain that this is a compiler bug (it never is). You dream about race conditions. You stop showering. You start to have conversations with an interrupt handler like it was a person. You dig into the assembly. You now know that it is a compiler bug (it isn’t). You mutter angrily to yourself on the subway. A colleague rewrites a completely unrelated part of the code and the symptom inexplicably goes away. Your cat shuns you. You start to question everything; how certain are you that a byte is 8 bits on this machine?

Intermediate stage.

After aeons of hell a co-worker drops by. She looks over your shoulder and asks “isn’t that resource freed right there?” And sure enough — in the code that you’ve been staring at for weeks, a mere 4 lines above the bad free, the resource is freed the first time. To say that the error had been staring you in the face the whole time would be understating it. You kill yourself. The cycle starts again.

The final state.

The terrible truth:

mystificate(wizard)
    fire(wizard)

    hire(izzard)
        if (flag)
            occupy(india)

    if (hired(wizard))
        fire(wizard)

Or maybe the bug turned out to be something really gnarly that could explain dozens of different bug reports.

Should you have fixed that bug? The thing about bugs is that they are unknown. Until you know what is going on you can’t tell whether the bug will cause catastrophic errors or is relatively harmless, right? The bug is what Donald Rumsfeld would call a “known unknown,” something that you know that you don’t know. Not tracking down the bug means living with uncertainty. Fixing bugs increases the quality of the software, so fixing any bug (correctly) makes the software better. You can easily convince yourself that fixing the bug was the right thing to do, even if it turned out to be relatively harmless. But you’re probably wrong.


The “known unknown” argument cuts both ways. Even if the bug turns out to be serious you didn’t know that beforehand. Putting in serious effort before you know what the payoff would be may not be the wisest way to allocate your time. And the question isn’t really “did the fix make the software better” but “was the time spent on the fix the best way to make the software better”.

Living with uncertainty can be difficult but let’s face it, your code probably has hundreds or thousands of unknown bugs. Does it actually make a difference that you know that this one is out there?

Am I saying that it was wrong to start looking for the bug? Definitely not. If you suspect that something is amiss, you should absolutely strive to understand what is going on. The difficulty lies in knowing when to stop. There’s always a sense that you are this close to finding the problem. If you would just spend 10 more minutes you would figure it out, or stumble upon some clue that would really help. Now you’ve spent half a day on the bug, might as well spend the whole day since you’ve just managed to fit the problem inside your head. You’ve spent a day on it, might as well put in another hour just to see where this latest lead goes. You’ve spent three days and if you don’t fix it now you’ve just wasted three days. You’ve spent two weeks and dropping the whole thing without resolution at this point would cause irreparable psychological damage. The more you dig the deeper the hole becomes.

So what do you do to prevent this downward spiral? Talk to a co-worker. I know it might seem uncomfortable and unsanitary and possibly even unpossible, but it can be an invaluable strategy even if your colleagues are all muppets. Take ten minutes a day to talk about where you are and how it is going. Just describing the problem often makes you realise what the solution is. “So the flag is unset when frobnitz() is called and… and… never mind, I am an idiot” is a common conversation pattern (although “never mind, you are an idiot” is idiomatic in some places). Sometimes this even works with inanimate objects like managers. This is sometimes called “Rubber duck debugging”: carefully explain your problem to your rubber duck and the fix will become apparent. Don’t own a rubber duck? What is wrong with you??

Even if describing the problem doesn’t help, a sentient co-worker has a tendency to ask if you have checked 10 different things that are so obvious that how can you even ask, and do you think I’m completely incompetent and, by the way, no, I meant to get around to it. “Did I recompile the code after changing it? No, do you have to?” If your colleague is very polite you may have to insult him a bit before he asks the obvious questions. Obvious questions can be hard on the ego but very useful because the odds are in favour of the obvious things. For some reason the human mind seems to prefer the mystical and complex (“it’s a compiler bug”) to the simple and likely (“I forgot to recompile”).

The final reason it is good to talk to a co-worker is that you get some distance from the problem. Just discussing loosely where you are and what you know and how much time you’ve spent can really help when you start to suspect that you should strap your bug-fixing project to an ejection seat.

Conclusions:
  • Don’t get lost fixing a hard bug, keep an eye on the big picture. This is a lot easier to write than to do.
  • Learn to live with uncertainty. If this is impossible: embrace misery.
  • Co-workers can serve a useful purpose other than being on the opposite side of indentation wars.