The Goo Software Blog

All The Goo That's Fit To Print

Passwords: Of MD5 and Mistresses

Errata Security have an interesting post on the hacking of a general’s mistress. In it, Robert David Graham looks at how long it would take someone to discover Paula Broadwell’s Yahoo! email password based on the hashed copy leaked in an email hack in late 2011. He states:

it’ll take 17 hours to crack her password using a GPU accelerator trying 3.5-billion password attempts per second, trying all combinations of upper/lower case and digits.

(My emphasis)

I read that and thought: clearly he’s making an assumption here about the (maximum) length of the password. I wonder what the assumption was?

We can figure it out. Using maths!

All combinations of upper/lower case and digits means the following character set:

abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890

That’s 62 characters in all. If passwords were only one character long, there would be 62 possible unique passwords (“a”, “b”, and so on). If they’re two characters long, there would be 62 * 62, or 3844 possible passwords (“aa”, “ab”, “ac”…). More generally, if a password is n characters long, there are 62^n (62 raised to the power of n) possible unique passwords. (Counting all the shorter passwords as well adds less than 2% to that total, so a maximum length of n gives roughly the same figure.)

So how many characters did Graham assume for Paula Broadwell’s password? Well, he said it took 17 hours at 3.5 billion attempts per second to try every possible combination.

  3,500,000,000 attempts per second
= 3,500,000,000 * 60 attempts per minute
= 3,500,000,000 * 60 * 60 attempts per hour
= 3,500,000,000 * 60 * 60 * 17 attempts in total
= c. 214,200,000,000,000 attempts

So what power of 62 is that? Let’s ask Wolfram Alpha.

But before we do, ask yourself: what do you think the answer is? Have a guess. What does your intuition tell you? More than 200 trillion guesses should let you crack a fairly meaty password, right?

OK, now for the answer. Hey Wolfram Alpha, what’s log base 62 of 214,200,000,000,000?

The answer: 8.00

So there we go, he was assuming a maximum of 8 characters in her password.
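
The same sums can be checked without Wolfram Alpha. Here’s a minimal Python sketch; the variable names are my own, and the only inputs are the rate and duration quoted from Graham’s post.

```python
import math

ATTEMPTS_PER_SECOND = 3_500_000_000  # rate quoted in the Errata Security post
HOURS = 17                           # cracking time quoted in the same post
CHARSET_SIZE = 62                    # 26 lower case + 26 upper case + 10 digits

total_attempts = ATTEMPTS_PER_SECOND * 60 * 60 * HOURS

# Solve 62^n = total_attempts for n, i.e. n = log base 62 of total_attempts
implied_length = math.log(total_attempts, CHARSET_SIZE)

print(f"{total_attempts:,} attempts")
print(f"implied password length: {implied_length:.2f}")
```

The result comes out a shade under 8, because 17 hours is itself a rounded figure.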

Then I actually read the rest of the blog post. :-)

As you see, it’ll take 17 hours to brute-force eight upper/lower case and digits, even though it tries 3.5-billion passwords/second… Had her password been one character longer, I wouldn’t have cracked it.

(My emphasis again)

Well, maybe he would, but it would have taken 62 times as long to crack – approximately 44 days. If she’d used a 12-character password – just four little characters extra – it would have taken over 29,000 years to crack it using the same brute force approach on the same hardware. Four characters to make the difference between an overnight job and one that your thousandth generation of descendants may not live to see completed. That’s pretty mind blowing.
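
To make the scaling concrete, here’s a rough Python sketch of the worst-case brute-force time for a given password length. It assumes the same 62-character set and the same 3.5 billion attempts per second; the function name is hypothetical.

```python
CHARSET_SIZE = 62
ATTEMPTS_PER_SECOND = 3_500_000_000

def seconds_to_exhaust(length):
    """Worst-case time to try every password of exactly `length` characters."""
    return CHARSET_SIZE ** length / ATTEMPTS_PER_SECOND

for length in (8, 9, 12):
    seconds = seconds_to_exhaust(length)
    hours = seconds / 3600
    years = seconds / 86_400 / 365.25
    print(f"{length} characters: {hours:,.0f} hours, or {years:,.2f} years")
```

Working from 62^n directly, rather than scaling up the rounded 17-hour figure, gives slightly larger numbers (a little over 17 hours for eight characters), but the conclusion is the same: each extra character multiplies the work by 62.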

So there you have it. Use a longer password, folks!

This article was originally published six months ago on gist.io. I wanted to give it a permanent home on this blog.

Unit-testing a CoreData Manager Class

I’m working on a CoreData-backed app that lets users create and manipulate… [checks terms of NDA…] ahem, “Widgets”. So I’ve got an NSManagedObject subclass, GSWidget, which I’d instantiate the usual CoreData way:

GSWidget *widget = [NSEntityDescription insertNewObjectForEntityForName:@"Widget"
                                                 inManagedObjectContext:aManagedObjectContext];

// Maybe I want to save to disk here, too
NSError *error = nil;
BOOL wasSaved = [aManagedObjectContext save:&error];
if (!wasSaved) {
    // Do stuff with error here
}

I don’t want my view layer to have to know how the data’s being persisted, and I definitely don’t want those big, ugly chunks of CoreData-specific boilerplate all over the place. In my view controllers I’d just like to be able to do this:

GSWidget *widget = [someThingy createAWidget];

So, I’ll hide all that CoreData-specific code away in a class that I’ll call GSWidgetManager. The header looks like this:

@interface GSWidgetManager : NSObject

+ (GSWidgetManager*)sharedInstance;

- (GSWidget*)createWidget;
- (NSArray*)allWidgets;
// etc...

@end

and GSWidgetManager.m looks something like this:

@interface GSWidgetManager() {
    // Declare some private instance variables for our CoreData stack
    NSManagedObjectContext *_managedObjectContext;
    NSManagedObjectModel *_managedObjectModel;
    NSPersistentStoreCoordinator *_persistentStoreCoordinator;
}
@end

@implementation GSWidgetManager

+ (GSWidgetManager*)sharedInstance {
    static GSWidgetManager *sharedManager;
    static dispatch_once_t onceToken;
    dispatch_once(&onceToken, ^{
        sharedManager = [[GSWidgetManager alloc] init];
    });
    return sharedManager;
}

- (GSWidget*)createWidget {
    return [NSEntityDescription insertNewObjectForEntityForName:@"Widget"
                                         inManagedObjectContext:[self GS_managedObjectContext]];
}

- (NSArray*)allWidgets {
    NSFetchRequest *request = [[NSFetchRequest alloc] initWithEntityName:@"Widget"];
    NSError *error;
    NSArray *widgets = [[self GS_managedObjectContext] executeFetchRequest:request
                                                                     error:&error];
    if (widgets == nil) {
        // Do stuff with error
    }

    return widgets;
}

#pragma mark - Private CoreData-related methods

- (NSManagedObjectContext *)GS_managedObjectContext {

    if (_managedObjectContext) {
        return _managedObjectContext;
    }

    NSPersistentStoreCoordinator *coordinator = [self GS_persistentStoreCoordinator];
    if (coordinator) {
        _managedObjectContext = [[NSManagedObjectContext alloc] init];
        [_managedObjectContext setPersistentStoreCoordinator:coordinator];
    }

    return _managedObjectContext;

}

- (NSManagedObjectModel *)GS_managedObjectModel {

    if (_managedObjectModel) {
        return _managedObjectModel;
    }

    NSURL *modelURL = [[NSBundle mainBundle] URLForResource:@"WidgetApp"
                                              withExtension:@"momd"];
    _managedObjectModel = [[NSManagedObjectModel alloc] initWithContentsOfURL:modelURL];

    return _managedObjectModel;

}

- (NSPersistentStoreCoordinator *)GS_persistentStoreCoordinator {

    if (_persistentStoreCoordinator) {
        return _persistentStoreCoordinator;
    }

    _persistentStoreCoordinator = [[NSPersistentStoreCoordinator alloc] initWithManagedObjectModel:[self GS_managedObjectModel]];

    NSError *error = nil;
    NSURL *storeURL = [[self GS_applicationDocumentsDirectory] URLByAppendingPathComponent:@"WidgetApp.sqlite"];
    if (![_persistentStoreCoordinator addPersistentStoreWithType:NSSQLiteStoreType
                                                   configuration:nil
                                                             URL:storeURL
                                                         options:nil
                                                           error:&error]) {
        // Do stuff with error
    }

    return _persistentStoreCoordinator;

}

- (void)GS_saveContext {

    NSError *error = nil;
    NSManagedObjectContext *managedObjectContext = [self GS_managedObjectContext];
    if ([managedObjectContext hasChanges] && ![managedObjectContext save:&error]) {
        // Do stuff with error here
    }

}

- (NSURL *)GS_applicationDocumentsDirectory {
    return [[[NSFileManager defaultManager] URLsForDirectory:NSDocumentDirectory inDomains:NSUserDomainMask] lastObject];
}

@end

OK, that’s some nice separation of concerns and it works well. Now I need to test it.

Unit testing the manager

When I test CoreData code, I want to use an in-memory data store rather than a SQLite one. I like the idea that at the end of the tests the store’s definitely not persisted on disk. In the unit tests for GSWidget, I can just create a separate NSManagedObjectContext object in -setUp, one that uses an in-memory store, and use that in my tests. (Graham Lee has a blog post describing this approach.)

But how about when I write the unit tests for GSWidgetManager? In my unit tests, when I call [[GSWidgetManager sharedInstance] createWidget] I want that widget associated with a nice, lightweight, throw-away-at-the-end-of-testing, in-memory store. This was not as straightforward as I’d expected. First, the way that didn’t work…

The approach that didn’t work

The first thing I tried was to define a GS_UNIT_TEST preprocessor macro in the build settings for my WidgetApp Tests target, as suggested in these Stack Overflow posts. I could then use that macro to create an in-memory store instead of a SQLite one:

- (NSPersistentStoreCoordinator *)GS_persistentStoreCoordinator {

    if (_persistentStoreCoordinator) {
        return _persistentStoreCoordinator;
    }

    _persistentStoreCoordinator = [[NSPersistentStoreCoordinator alloc] initWithManagedObjectModel:[self GS_managedObjectModel]];
    NSError *error = nil;
#ifdef GS_UNIT_TEST
    if (![_persistentStoreCoordinator addPersistentStoreWithType:NSInMemoryStoreType
                                                   configuration:nil
                                                             URL:nil
                                                         options:nil
                                                           error:&error]) {
        // Do stuff with error
    }
#else
    NSURL *storeURL = [[self GS_applicationDocumentsDirectory] URLByAppendingPathComponent:@"WidgetApp.sqlite"];
    if (![_persistentStoreCoordinator addPersistentStoreWithType:NSSQLiteStoreType
                                                   configuration:nil
                                                             URL:storeURL
                                                         options:nil
                                                           error:&error]) {
        // Do stuff with error
    }
#endif
    return _persistentStoreCoordinator;

}

However, this didn’t work. For apps, Xcode now configures the unit test target as application tests by default, which means the tests run within the context of the running app. When my manager class gets compiled, the target being built is WidgetApp, not WidgetApp Tests, so the GS_UNIT_TEST preprocessor macro is undefined.

You can change the test target to run logic tests by following the instructions in the Xcode Unit Testing Guide, under Setting Up Logic Unit Tests. If you do that, the macro solution works. However, switching to logic tests can introduce its own complications. Most notably for me, calls to [NSBundle mainBundle] no longer work; they need to be changed to [NSBundle bundleForClass:[self class]] in order to work within logic tests, per this Stack Overflow answer.

Meanwhile, Apple has decided in its wisdom that application tests are now the default, and a solution that works with the default project configuration is always worth knowing about. (A wise manager once told me, “don’t fight the default” – advice that’s saved me hours of frustration over the years.)

The approach that worked

Ideally I’d like my unit tests to use a tweaked version of GS_managedObjectContext, one that returns a context over an in-memory store. At first I tried using a category on GSWidgetManager that overrode GS_managedObjectContext. That works OK, but it’s an abuse of categories, which strictly speaking aren’t meant to be used to override existing methods, only to add new ones.

So, I turned to method swizzling, a technique that uses functions from the Objective-C runtime to switch the implementations of two methods around. First, in my unit test class I wrote a quick and dirty GS_managedObjectContext replacement that used an in-memory store:

- (NSManagedObjectContext *)managedObjectContextForTesting {
    NSManagedObjectContext *moc = [[NSManagedObjectContext alloc] init];

    NSURL *modelURL = [[NSBundle mainBundle] URLForResource:@"WidgetApp" withExtension:@"momd"];
    NSManagedObjectModel *mom = [[NSManagedObjectModel alloc] initWithContentsOfURL:modelURL];

    NSPersistentStoreCoordinator *psc = [[NSPersistentStoreCoordinator alloc] initWithManagedObjectModel:mom];

    [psc addPersistentStoreWithType:NSInMemoryStoreType configuration:nil URL:nil options:nil error:nil];

    [moc setPersistentStoreCoordinator:psc];

    return moc;
}

Ideally I’d like to cache the managed object context so that I don’t have to create it every time this method gets called. I can’t reference the instance variable _managedObjectContext from GSWidgetManager within my unit test class, and if I declare a separate instance variable within the unit test class it won’t be available at runtime when I’ve swizzled this method with the original GS_managedObjectContext. So, I declared a static variable within the method and used that instead:

- (NSManagedObjectContext *)managedObjectContextForTesting {
    static NSManagedObjectContext *_managedObjectContext = nil;
    if (_managedObjectContext == nil) {
        NSManagedObjectContext *moc = [[NSManagedObjectContext alloc] init];

        NSURL *modelURL = [[NSBundle mainBundle] URLForResource:@"WidgetApp" withExtension:@"momd"];
        NSManagedObjectModel *mom = [[NSManagedObjectModel alloc] initWithContentsOfURL:modelURL];

        NSPersistentStoreCoordinator *psc = [[NSPersistentStoreCoordinator alloc] initWithManagedObjectModel:mom];

        [psc addPersistentStoreWithType:NSInMemoryStoreType configuration:nil URL:nil options:nil error:nil];

        [moc setPersistentStoreCoordinator:psc];

        _managedObjectContext = moc;
    }

    return _managedObjectContext;
}

That’s a bit of a hack; it means that if I generate multiple GSWidgetManager objects in my unit tests they’ll all share the same managed object context. An instance variable would be better, but since GSWidgetManager is a singleton it’s a hack I can live with.

OK, now for the fun stuff: swizzling the methods. It’s pretty easy. At the top of the unit test class I added:

// Obj-C runtime stuff for method swizzling
#import <objc/runtime.h>
#import <objc/message.h>

Then in -setUp, I added this:

Method orig = class_getInstanceMethod([GSWidgetManager class], @selector(GS_managedObjectContext));
Method new = class_getInstanceMethod([self class], @selector(managedObjectContextForTesting));
method_exchangeImplementations(orig, new);

That was almost it, except when I ran my tests I could see that some were still using a SQLite store. The reason, once I realised it, was obvious. method_exchangeImplementations does exactly as the name suggests; it swaps implementations such that calling method A executes the implementation of method B, and vice versa. Calling it more than once switches your methods back to their original state. Wrapping those lines in a trusty dispatch_once fixed the issue.

static dispatch_once_t onceToken;
dispatch_once(&onceToken, ^{
    Method orig = class_getInstanceMethod([GSWidgetManager class], @selector(GS_managedObjectContext));
    Method new = class_getInstanceMethod([self class], @selector(managedObjectContextForTesting));
    method_exchangeImplementations(orig, new);
});
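
To see why the swap has to happen exactly once, here’s a rough analogy in Python rather than Objective-C. It is only a sketch of the toggling behaviour: the Manager class and exchange_implementations function are hypothetical stand-ins, not part of the Objective-C runtime, and for simplicity both methods live on one class.

```python
class Manager:
    def context(self):
        return "sqlite"

    def context_for_testing(self):
        return "in-memory"


def exchange_implementations(cls, name_a, name_b):
    """Swap two methods on cls, loosely mimicking method_exchangeImplementations."""
    impl_a, impl_b = cls.__dict__[name_a], cls.__dict__[name_b]
    setattr(cls, name_a, impl_b)
    setattr(cls, name_b, impl_a)


# First swap: context now runs the testing implementation
exchange_implementations(Manager, "context", "context_for_testing")
after_one_swap = Manager().context()

# Second swap: the methods are back where they started
exchange_implementations(Manager, "context", "context_for_testing")
after_two_swaps = Manager().context()

print(after_one_swap, after_two_swaps)
```

An odd number of swaps leaves the testing implementation in place; an even number silently restores the original, which is the behaviour the dispatch_once guards against.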

So, the final test class looked like this:

#import <SenTestingKit/SenTestingKit.h>
#import "GSWidgetManager.h"
#import "GSWidget.h"

// Obj-C runtime stuff for method swizzling
#import <objc/runtime.h>
#import <objc/message.h>

@interface GSWidgetManagerTests : SenTestCase

@end

@implementation GSWidgetManagerTests

- (NSManagedObjectContext *)managedObjectContextForTesting {
    static NSManagedObjectContext *_managedObjectContext = nil;
    if (_managedObjectContext == nil) {
        NSManagedObjectContext *moc = [[NSManagedObjectContext alloc] init];

        NSURL *modelURL = [[NSBundle mainBundle] URLForResource:@"WidgetApp" withExtension:@"momd"];
        NSManagedObjectModel *mom = [[NSManagedObjectModel alloc] initWithContentsOfURL:modelURL];

        NSPersistentStoreCoordinator *psc = [[NSPersistentStoreCoordinator alloc] initWithManagedObjectModel:mom];

        [psc addPersistentStoreWithType:NSInMemoryStoreType configuration:nil URL:nil options:nil error:nil];

        [moc setPersistentStoreCoordinator:psc];

        _managedObjectContext = moc;
    }

    return _managedObjectContext;
}

- (void)setUp {
    [super setUp];

    static dispatch_once_t onceToken;
    dispatch_once(&onceToken, ^{
        Method orig = class_getInstanceMethod([GSWidgetManager class], @selector(GS_managedObjectContext));
        Method new = class_getInstanceMethod([self class], @selector(managedObjectContextForTesting));
        method_exchangeImplementations(orig, new);
    });
}

- (void)testUsingInMemoryStore {
    GSWidget *widget = [[GSWidgetManager sharedInstance] createWidget];
    NSManagedObjectContext *moc = [widget managedObjectContext];
    NSPersistentStoreCoordinator *psc = [moc persistentStoreCoordinator];
    NSArray *stores = [psc persistentStores];
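    // -count returns NSUInteger, and STAssertEquals requires both values to have the same type
    STAssertEquals([stores count], (NSUInteger)1, @"Only one store in use");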
    NSPersistentStore *store = [stores objectAtIndex:0];
    STAssertEqualObjects([store type], NSInMemoryStoreType, @"Using an in-memory store for testing");
}

- (void)testCreateWidget {
    GSWidget *widget = [[GSWidgetManager sharedInstance] createWidget];
    STAssertNotNil(widget, @"createWidget returns a widget");

    // More tests here   
}

@end

And there you have it. I’m sure there’s a better way to do this, so do get in touch if you know one and I’ll add an update here.

Updates

  • Updated 11th Feb 2013 to explain more clearly why the preprocessor macro solution didn’t work

Project-wide Indentation Preferences in Xcode 4

Are you a tabs kinda developer but working on a project where the coding convention is to use spaces for indentation? Other way around? If so, you’ll be familiar with the five stages of coping with alien indentation:

  1. Denial
  2. I could always switch my system preference
  3. But what about all my existing source code?
  4. How about a pre-commit conversion script…?
  5. NYAAAAAAAAAAARRRRGGGGHHHH!!!!

What you really want is a way of setting your indentation preference on a per-project basis. That way you can set it to your non-preferred option for projects that require it, and let your default choice prevail otherwise. Luckily Xcode’s got you covered. Just select your project file, make sure the Utilities sidebar is visible, and choose the File Inspector. (Hit ALT-CMD-1 if you’re lazy.) In the Text Settings area there’s an “Indent Using” dropdown, which is populated with whatever you’ve set your system default to. Switch it to the alternative and it’ll affect all files you edit in this project. Bingo!

Xcode screenshot showing the per-project indentation setting

(The change is written to the project file, so will affect all other developers on the same project, but that’s fine – they’re either using the alternative option already, in which case there’s no change, or they’ll benefit from the per-project setting.)

Adding Dropbox to Zippity

Zippity 1.4 just went live on the App Store, and with it a really cool new feature of which I’m very proud: Dropbox uploads. You can now select files from within your zip files (or indeed, the zip files themselves) and upload them to Dropbox in a couple of taps.

The feature sounds deceptively simple: add a button that uploads stuff to Dropbox and that’s it, right? Well almost, but the devil’s in the detail; this change ended up encompassing over sixty git commits and spawning two new open source projects.

How it looks

In edit mode when you select some files and tap the share button you’ll now see a Dropbox option:

iOS 6 activity picker with Dropbox option

Tap it and, once you authenticate with Dropbox, you’ll be prompted to pick a destination for your upload:

Dropbox destination selection

Then your uploads start, and you’ll see a little status bar at the bottom of the screen to let you know where you’re up to.

Dropbox upload progress indicator Dropbox upload complete

The status bar persists across views, so you can navigate around the Zippity interface and you won’t lose it; it’ll slide out of view once the uploads are complete.

Adding the Dropbox uploader

Dropbox have an API and they publish an iOS SDK, making integration pretty straightforward. Authentication works the same way it does with the Facebook SDK: if you have the Dropbox app installed on your phone it handles authentication. If not, you get directed to the Dropbox website to sign in. In either case you get redirected back to the app that triggered the authentication challenge once you’re done.

So, getting a working prototype that could upload a file to Dropbox was the easy part. The complexity came in working out how the user experience would work.

Unobtrusive status

My early prototype just popped up a little temporary alert (using GSSmokedInfoView) once the upload was complete:

Dropbox upload complete - early prototype

That worked OK, but for uploads that took more than a couple of seconds the lack of feedback during upload was a really poor user experience. I wanted Zippity users to be able to see the progress of their Dropbox uploads.

After a bit of experimentation I settled on the concept of the main app UI sliding up to reveal a status bar beneath it, then sliding back down once the status bar was no longer needed.

Early sketch of the status bar idea

The result is clear and informative while remaining unobtrusive. Since Zippity requires iOS 5 as a minimum I could create this status bar as a container view controller. I use this view controller as the window’s root view controller, and the navigation controller that was the root view controller is now a child of the status bar view controller. That means that users can navigate around the app while uploads are in progress and the status bar persists at the bottom of the screen. I’m really happy with the way this worked out.

Final status bar implementation in Zippity

Per-file progress with GSProgressView

To the left of the status bar in the above screenshots you can see a “pie chart” progress indicator. This animates through 360 degrees as the file uploads. At 100% it shows a tick to confirm that the upload has completed. That’s GSProgressView, one of the open source projects to come out of this work. Using it looks something like this:

GSProgressView *pv = [[GSProgressView alloc] initWithFrame:CGRectMake(0, 0, 20, 20)];
pv.color = [UIColor redColor];
pv.progress = 0.6;
[myView addSubview:pv];

GSProgressView is implemented entirely using UIBezierPath, which means it uses vector graphics. You can tweak the scale to fit your app and it remains razor sharp. If you think you can find a use for GSProgressView, please go ahead and grab yourself a copy.

The iOS 6 activity sheet

On iOS 5, Zippity handles sharing of files by email using a trusty UIActionSheet with an “Email” button. I implemented Dropbox sharing by adding a “Dropbox” button to that sheet:

Dropbox share sheet in iOS 5

iOS 6 introduced the more flexible, powerful UIActivityViewController which presents a range of context-sensitive options such as email, SMS, uploading images to Facebook, and so on. Zippity ties into this new system on iOS 6 devices by adding a Dropbox icon:

Dropbox share sheet in iOS 6

Integration with UIActivityViewController is achieved by writing a service-specific UIActivity subclass. GSDropboxActivity was the result, the second open source project to come from this new release of Zippity. It includes everything you need to add a Dropbox upload activity to your iOS 6 project, including the Dropbox icon you can see in the above screenshot (even the Opacity source file).

Wrap-up

So there you have it; a little bit of insight into the work behind a “trivial” new feature in Zippity. I hope you like it! If you have suggestions for how Zippity can be improved, I’m s1mn on Twitter and I’d love to hear your ideas.

How ASCII Lost and Unicode Won

If you’ve ever heard someone complaining that “this system doesn’t support double-byte characters”, or asking whether “this data’s in Unicode”, and felt as though you really ought to understand what those things mean, then this post is for you.

This isn’t a technical discussion, but I will use the terms bits and bytes. So before we go on, let’s recap and ask:

What’s a bit?

A bit is a binary digit. It can store one of two possible values: 0 or 1. Computers like bits because there are plenty of mechanical ways you can distinguish between two possible values – for example, by switching a current on or off.

What’s a byte?

A byte is a set of 8 bits. Computers typically move data around a byte at a time. You can store 2-to-the-power-8 or 256 possible values in a single byte.

In the beginning there was ASCII

In order to exchange text with people using computers, we need a way of representing words as sequences of zeroes and ones.

ASCII is the American Standard Code for Information Interchange. It uses 7-bit numbers to represent the letters, numerals and common punctuation used in American English.

The fact that ASCII uses 7-bit numbers means there are 2-to-the-power-7 or 128 possible values it can represent, from 0 to 127 inclusive. Each of those 128 values is assigned to a character. For example, in ASCII the number 65 represents an upper-case letter A, 61 represents an equals sign, and so on. So if a word processor that displays ASCII text gets a byte with a value of 65, it displays an upper-case letter A on the screen.
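
If you have Python to hand you can look these numbers up yourself. This is just a small sketch; Python’s ord() and chr() actually work with Unicode numbers, which match ASCII for the first 128 characters.

```python
# ord() maps a character to its number; chr() maps a number back to its character.
code_for_a = ord("A")
char_for_61 = chr(61)
print(code_for_a)    # 65
print(char_for_61)   # =

# Seven bits give 2 ** 7 = 128 possible values, numbered 0 to 127.
ascii_values = 2 ** 7
print(ascii_values)  # 128
```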

These mappings of numbers to characters are just a convention that someone decided on when ASCII was developed in the 1960s. There’s nothing fundamental that dictates that a capital A has to be character number 65; that’s just the number they chose back in the day.

What’s wrong with ASCII?

Nothing whatsoever – as long as you speak English. Here’s the English word “Hello” with the ASCII numbers for each character shown beneath.

"Hello" in ASCII

Nothing wrong there! Let’s try something with some punctuation.

"ASCII rocks!" in ASCII

It certainly does! This is great. How about foreign languages?

"Es gefällt mir nicht" in ASCII

Oh.

Accented characters such as a-umlaut don’t exist in ASCII. You simply can’t represent them in text encoded as ASCII. The best you could do is use the unaccented equivalent and hope it doesn’t change the meaning to something rude.
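
The three examples above can be reproduced with a few lines of Python. This is a minimal sketch; ascii_codes is a hypothetical helper name, and it returns None when the text contains something ASCII can’t represent.

```python
def ascii_codes(text):
    """Return the ASCII number for each character, or None if the text isn't pure ASCII."""
    try:
        return list(text.encode("ascii"))
    except UnicodeEncodeError:
        return None

print(ascii_codes("Hello"))                 # [72, 101, 108, 108, 111]
print(ascii_codes("ASCII rocks!"))
print(ascii_codes("Es gefällt mir nicht"))  # None: a-umlaut has no ASCII number
```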

ASCII really should have been named ASCIIWOA: the American Standard Code for Information Interchange With Other Americans.

What’s the solution?

“Well,” thought our forebears, “we could just add an extra bit. If we made ASCII an 8-bit code, it could store another 128 values. That should be enough to store all those weird, accented characters, right?”

And this is where it started to go really wrong.

Different bodies came up with different extended ASCII character sets: they all had the same, standard ASCII characters in the first 128 character slots, but the additional 128 characters varied from one character set to the next.

That had two consequences. Firstly, you couldn’t simply take text encoded in an extended ASCII dialect and display it; you had to know which dialect of extended ASCII you were dealing with. For example, in the ISO 8859-1 (Western Europe) extended ASCII character set, number 224 represents a lower-case letter A with a grave accent. However, in ISO 8859-2 (Eastern Europe) the same number represents a lower-case letter R with an acute accent. Interpreting data using the wrong character set was a recipe for disaster.

The second consequence was that if, for example, your software only understood ISO 8859-1 and you received data in ISO 8859-2, and that data included an r-acute character, you couldn’t even display it, since r-acute doesn’t exist in the ISO 8859-1 character set. The best the software could do was to replace these unprintable characters with a question mark, an empty square, or something else to indicate that an encoding problem had occurred. More often than not, they replaced them with the character corresponding to the same number in some other extended ASCII character set, resulting in gibberish.
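
Both consequences are easy to demonstrate. The sketch below uses Python’s built-in iso-8859-1 and iso-8859-2 codecs; the variable names are mine.

```python
raw = bytes([224])

# Consequence one: the same byte means different things in different character sets.
as_western = raw.decode("iso-8859-1")   # a with a grave accent
as_eastern = raw.decode("iso-8859-2")   # r with an acute accent
print(as_western, as_eastern)

# Consequence two: r-acute simply doesn't exist in ISO 8859-1.
try:
    as_eastern.encode("iso-8859-1")
    representable = True
except UnicodeEncodeError:
    representable = False

# The usual fallback: substitute a question mark.
fallback = as_eastern.encode("iso-8859-1", errors="replace")
print(representable, fallback)
```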

You could argue that extended ASCII, flawed though it was, could at least provide a usable, basic computing experience for people in most parts of Europe, Africa and the Americas. Put another way, for Slovaks swapping data with other Slovaks, 256 characters gave them enough scope to express their language. It was only when they tried communicating with Greeks, Finns or Estonians that problems arose.

Meanwhile, in Asia…

However, a 256-character set is no use in languages like Chinese and Japanese, where there are thousands of characters in common use. Users of those languages had a whole new level of complexity to deal with.

The solution was multi-byte character sets (sometimes misleadingly referred to as double-byte character sets). These used sequences of bytes to represent individual characters, and specified rules that software developers could use to determine whether a particular byte in a stream of bytes was a continuation of the previous character or the start of a new one. Multi-byte character sets removed the 256-character limit imposed by extended ASCII.

However, like extended ASCII, they were focused on a particular language. And just like extended ASCII, multiple, incompatible standards emerged. The same code meant different things in different character sets, and as a result cross-encoding issues were as much a problem in Japan as they were in Johannesburg.

Unicode to the rescue

This is where Unicode comes in.

Designed as a single, global replacement for localised character sets, the Unicode standard is beautiful in its simplicity. In essence: collect all the characters in all the scripts known to humanity and number them in one single, canonical list. If new characters are invented or discovered, no problem, just add them to the list. The current Unicode standard has space for over a million different characters, only around 10% of which are in use.

So, in Unicode, character 341 is a lower-case r-acute. Wherever you are in the world, be it in Massachusetts, Mombasa or Myanmar, character 341 is lower-case r-acute. It is the one, true Unicode number for lower-case r-acute. Likewise, character number 2979 is the rather beautiful Tamil letter NNA, character number 5084 is the equally delightful Cherokee letter DLA, and so on.

Lower-case r-acute, Tamil NNA and Cherokee DLA with their Unicode code numbers
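
You can look up these three characters by number using Python’s standard unicodedata module. A small sketch:

```python
import unicodedata

names = {}
for number in (341, 2979, 5084):
    character = chr(number)
    names[number] = unicodedata.name(character)
    # Unicode numbers are conventionally written in hexadecimal with a U+ prefix
    print(number, f"U+{number:04X}", character, names[number])
```

Whether the characters display properly depends on the fonts installed on your system; the numbers and names are the same everywhere.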

Unicode contains entries not just for script characters but also mathematical symbols, box drawing elements, braille patterns, domino tiles and all other manner of stuff. The latest version of Unicode contains over 110,000 characters.

Encoding the code

OK, this sounds great. But computers still talk to each other by swapping information a byte at a time. How do I transfer a Cherokee DLA (Unicode character 5084) using bytes?

This is where the Unicode character encodings come in. The most common, UTF-8, is a multi-byte encoding standard. For characters in the original ASCII character set, UTF-8 only needs one byte per character. (In fact, it’s completely backwards compatible with the original 7-bit ASCII standard.) For other Unicode characters, UTF-8 uses two or more bytes per character, and just like the east Asian multi-byte character sets, it adopts a convention that dictates how to determine whether a byte is an ASCII character, the continuation of a previous character, or the start of a new one.

However, unlike those east Asian multi-byte character sets, which were locale-specific, UTF-8 encodes Unicode numbers. A particular UTF-8 byte sequence encodes a particular Unicode number, which in turn represents a particular character, regardless of where in the world you are, or which language you speak. No overlap, no ambiguity.
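To see that convention in action, here's a rough Python sketch that encodes an ASCII "a" followed by a Cherokee DLA and labels each resulting byte by its leading bits. The function classify_utf8_byte is a made-up helper for illustration, and it only looks at the top bits; it doesn't validate that the bytes form legal UTF-8.

```python
def classify_utf8_byte(byte: int) -> str:
    """Classify one byte of a UTF-8 stream by its leading bits (no validation)."""
    if byte < 0x80:   # 0xxxxxxx: a plain ASCII character
        return "ascii"
    if byte < 0xC0:   # 10xxxxxx: continues the previous character
        return "continuation"
    return "lead"     # 11xxxxxx: starts a new multi-byte character

text = "a" + chr(5084)          # ASCII 'a' followed by Cherokee DLA
encoded = text.encode("utf-8")  # four bytes in total

for byte in encoded:
    print(f"{byte:#04x} {byte:08b} {classify_utf8_byte(byte)}")

# Decoding the same bytes gives back the same two characters.
assert encoded.decode("utf-8") == text
```

The "a" comes out as one byte, and the Cherokee DLA as one lead byte followed by two continuation bytes.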

Alternative schemes for encoding Unicode numbers include UTF-16, also a variable-byte character encoding but one in which every character is represented by one or two pairs of bytes. That ends up being wasteful if you’re dealing predominantly with English text, which is one of the reasons why UTF-8 has become the dominant choice.
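To put rough numbers on that trade-off, here's a small Python sketch comparing encoded sizes. The helper encoded_sizes and the sample strings are my own choices for illustration. I've used the "utf-16-le" codec so that Python doesn't prepend a two-byte byte order mark and skew the counts.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for text."""
    # "utf-16-le" avoids the 2-byte byte order mark that "utf-16" adds.
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

samples = {
    "English": "Hello, world",
    "Cherokee DLA": chr(5084),
    "Domino tile": chr(0x1F030),  # beyond the first 65,536 code points
}

for label, text in samples.items():
    utf8, utf16 = encoded_sizes(text)
    print(f"{label}: {utf8} bytes in UTF-8, {utf16} bytes in UTF-16")
```

The English sample takes twice as many bytes in UTF-16 as in UTF-8. It isn't always that way round: the Cherokee DLA needs three bytes in UTF-8 but only two in UTF-16, and the domino tile needs four in both.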

So there you have it. The history of character encoding in a nutshell. I hope you found it useful. I’m @s1mn on Twitter, drop me a line if you have any comments.

Corrections

  1. The article originally said that Unicode “isn’t an 8-bit list, or a 16-bit list, it’s just a list, with no limit on its length”. That’s not correct; Unicode code numbers run from 0 to 1,114,111 (hexadecimal 10FFFF), which allows for 1,114,112 distinct values.

GSKeychain

Spurred on by reading Peeking Inside App Bundles by Nick Arnott, I went ahead and wrote a simple keychain wrapper class for iOS. GSKeychain makes storing secrets like passwords and access tokens in the keychain as easy as storing them in NSUserDefaults. Here’s a synopsis of its usage:

// Store a secret
[[GSKeychain systemKeychain] setSecret:@"t0ps3kr1t" forKey:@"myAccessToken"];

// Fetch a secret
NSString * secret = [[GSKeychain systemKeychain] secretForKey:@"myAccessToken"];

// Delete a secret
[[GSKeychain systemKeychain] removeSecretForKey:@"myAccessToken"];

Currently this class should be considered a work-in-progress. Feel free to use it, but please enhance and tweak as you see fit and send me pull requests.

How about OS X?

In theory this class will work fine with OS X. The functions it uses all exist in OS X as well as iOS. However, there are some caveats:

  1. I’ve never used this class on OS X. If you try it you’ll be my guinea pig! (Please let me know how you get on.)
  2. On iOS, keychain entries are specific to an app and can’t generally be read by other apps (unless you’ve used keychain-access-groups). That doesn’t apply on OS X, so you may want to add additional data to the lookup dictionary that GSKeychain uses. (See genericLookupDictionaryForIdentifier: in GSKeychain.m to see where that lookup dictionary is created.) I’ve never used Keychain Services in OS X though, so I’m not best placed to advise on this point. Read the docs.
  3. OS X has some utility functions (e.g. SecKeychainAddGenericPassword) that make interacting with the keychain a little easier and more task-centric, so you may want to use those instead.

Zsh – One More Thing…

Following on from my recent post on ten great zsh tips, here’s an eleventh: you can tab-complete remote paths when using scp. How cool is that? Commands like scp will also tab-complete hostnames at the appropriate juncture, so you might issue a typical scp command something like this:

$ scp s[TAB]
sample.mp3 some_file.txt
$ scp sample.mp3 [TAB]
ahost.com anotherhost.com example.org
$ scp sample.mp3 ahost.com:[TAB]
Desktop/ Downloads/ Library/
$ scp sample.mp3 ahost.com:Desktop/[RETURN]

If you’re on a high-latency network connection then obviously completing remote paths will be a bit laggy, but it remains in my mind a pretty cool tip.

Also, if you’re not using it already you should check out oh-my-zsh without delay. It’s like giving your zsh a steroid shot. Seriously, just install it now.

Bashisms – a Note on Creating Portable Shell Scripts

Earlier this week I finally scratched a long-standing itch and wrote a shell script for automating the fetching of gitignore boilerplates from GitHub’s gitignore library. It’s called gibo, and it works like this:

$ gibo Python vim OSX >> .gitignore

That line will squirt the GitHub gitignore templates for Python, vim and OSX into my .gitignore file. Nifty eh?

Having scratched the itch I did what any good itch-scratcher would do and uploaded it to GitHub. The Changelog blog picked up on it and it got a load of downloads. Cool!

And eventually, it got a pull request. Even better! I love it when my peers help to improve my code! The pull request read as follows:

if [ $# == 0 ] ; then
  # ...
fi

The above works in BASH (or probably similar shells). But it fails on DASH (default sh in debian).

This pull request fixes this.

It turns out that == is a ‘bashism’: a non-POSIX-compliant construct supported by bash and a handful of other shells, but not by all. (The portable way to compare strings inside [ ] is a single =.) Bashisms are bad; they make your script less portable. gibo’s shebang line tells it to run using /bin/sh, and on some flavours of Debian that’s a symlink to /bin/dash.
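To make the == problem concrete, here's a toy Python sketch that flags == inside single-bracket tests. The regular expression and the function name find_double_equals are my own inventions for illustration. It only catches this one pattern and will miss plenty of others, so treat it as a demonstration of the idea rather than a real linter.

```python
import re

# Hypothetical pattern: a single-bracket test of the form "[ ... == ... ]".
# Double-bracket tests "[[ ... ]]" are deliberately skipped; they are a
# bashism too, but a different one.
SINGLE_BRACKET_EQ = re.compile(r"(?<!\[)\[ [^\]]*? == [^\]]*? \]")

def find_double_equals(script: str) -> list[int]:
    """Return 1-based line numbers whose `[ ... ]` test uses `==`."""
    return [
        number
        for number, line in enumerate(script.splitlines(), start=1)
        if SINGLE_BRACKET_EQ.search(line)
    ]

script = '''#!/bin/sh
if [ $# == 0 ]; then
    echo "no arguments"
fi
if [ $# = 0 ]; then
    echo "portable"
fi
'''

print(find_double_equals(script))
```

Only the first test, on line 2 of the sample script, is reported. The second one uses a single = and passes.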

I fixed the bug, but it left me wondering whether there were any other bashisms in my script. Then I came across checkbashisms, a script that answers exactly that question. Here it is running on gibo before that pull request was raised:

$ checkbashisms gibo
possible bashism in gibo line 103 (should be 'b = a'):
if [ $# == 0 ]; then
possible bashism in gibo line 114 (should be 'b = a'):
    if [ $( echo $opt | cut -c1) == "-" ]; then
possible bashism in gibo line 121 (should be 'b = a'):
if [ $has_opts == 1 -a $has_args == 1 ]; then
possible bashism in gibo line 121 (should be 'b = a'):
if [ $has_opts == 1 -a $has_args == 1 ]; then

And here it is running on the latest version:

$ checkbashisms gibo

Perfect!

If you’re writing shell scripts and distributing them, checkbashisms is your friend! Check it out. (If you’re on Debian/Ubuntu you can install it as part of the devscripts package with apt-get install devscripts, otherwise you can just download the script from SourceForge and copy it to somewhere on your $PATH.)