Sunday, February 24, 2008

The downside of proprietary data

Mark Pilgrim is a published author, Google employee and long-time Apple Macintosh user and programmer. In the Macintosh universe, he's part of the pantheon: although never an Apple employee, and not quite up there with folks like Andy Hertzfeld, he's nevertheless one of the minor demi-gods of Apple mythology. He also helped create one of the few Mac viruses (the MBDF-A), but co-operated with police on his arrest and paid restitution for the damage done.

Putting aside his chequered past, Pilgrim was considered one of the Mac power-user evangelists, so it came as an unpleasant shock to the Mac community when he finally discarded his Mac in favour of Linux. There were tears and predictions of doom. Those predictions turned out to be wrong, and Pilgrim is predicting that 2008 will be the year of Linux on the Desktop. (With the sudden expansion of notebooks running Linux, like the Eee PC, I think those predictions will finally be right. And not before time.)

Not long after jumping ship to Linux, Pilgrim discussed his experiences with long-term data storage, and his frustration with the difficulty of keeping data accessible over a time frame measured in decades instead of months or years. The bottom line? Long-term storage of data is like a series of migrations from data format to data format. Anything which makes that migration harder is going to hurt you. Companies like Apple who don't grok openness are constantly trying to lock people into their products, then change the products. Every time they do that, there's pain and inconvenience for users, and usually the loss of data.

Pilgrim's conclusion is that using open source software and, more importantly, open formats, goes a long way to reducing this problem. You will still need to migrate data from computer to computer (anyone think that the computers of 2028 will still be running Windows Vista?) but the pain will be less.

There’s an important lesson in here somewhere. Long-term data preservation is like long-term backup: a series of short-term formats, punctuated by a series of migrations. But migrating between data formats is not like copying raw data from one medium to another. [...] But converting data into a different format is much trickier, and there’s the potential of data loss or data degradation at every turn.

Fidelity is not a binary thing. Data can gradually degrade with each conversion until you’re left with crap. People think this only affects the analog world, like copying cassette tapes for several generations. But I think digital preservation is actually much harder, in part because people don’t even realize that it has the same issues.


So if you care about long-term data preservation, your #1 goal should be to reduce the number of times you convert your data from one format to another. You should also strive to increase the fidelity of each conversion, but you may not have any control over that when the time comes. Plus, you may not know in advance how faithful the conversion will be, so planning ahead to reduce the number of conversions is a better bet.
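Pilgrim's point about fidelity quietly degrading shows up even in something as small as character encodings. Here's a minimal Python sketch (my own illustration, not from his post) of a "migration" into a narrower format, where the loss happens once and can never be undone:

```python
# A tiny illustration of lossy format conversion: each pass through a
# narrower format silently replaces what it cannot represent, and no
# later conversion can bring the lost characters back.

original = "naïve café résumé"

# Generation 1: "migrate" the text to ASCII, a format that cannot hold
# the accented characters. errors="replace" swaps each one for "?".
gen1 = original.encode("ascii", errors="replace").decode("ascii")

# Generation 2: migrate again; the damage is already permanent.
gen2 = gen1.encode("ascii", errors="replace").decode("ascii")

print(original)  # naïve café résumé
print(gen1)      # na?ve caf? r?sum?
print(gen2)      # na?ve caf? r?sum?  (identical: the loss happened once)
```

The same shape of failure plays out at every scale, whether it's accented characters, layer data in an image format, or formulas in a spreadsheet: the conversion that drops information is the one you can never reverse.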

Open source software is not a panacea for this sort of data loss: as Pilgrim discusses, the open source photo-editing software GIMP uses a deliberately undocumented file format that no other application can fully read.

If you care about accessing your data in ten years' time, then go read the rest of his conclusions. (And if you care about people accessing your data in 200 years' time, print it out on good acid-free paper and deposit it somewhere dry and safe.)
