Northern Planets: programming

Monday, November 12, 2007

Internet tutorials

Justin a.k.a. _harlequin_ on LiveJournal rants the good rant about technical Internet tutorials:

I've been reliving this experience recently by trying to learn to program AVR microcontrollers in C from internet tutorials for "beginners", written by adults with mental capabilities similar to those of the ten-year-old, who hadn't yet grasped the concept that beginners (funnily enough) don't have an expert’s vast array of existing expertise.
It’s cute in a ten-year-old. But coming from an adult, it makes you want to hit them.

Adding insult to injury, they focus on explaining the obvious as if you are a moron rather than a beginner, whilst being completely oblivious to the number of advanced, unexplained steps they unthinkingly used to get there. If these people wrote cooking tutorials, they would go something like:

First, we start with some flour. This is flour [example of flour]. It is white and powdery. You can buy it at a place called a "shop", or a "supermarket", trading for it using a thing called "money". Next, the muffins come out of the oven, cooked and ready. You tell when they are baked correctly because they are brown. Not too brown [example of too brown], and not too light [example of undercooked muffin], just right.

And now you know how to make muffins!

SOMEONE SLAP THIS IDIOT!

Thanks to Mrs Impala for pointing me at this one.

Friday, March 09, 2007

Precision

Some people think you can never have too much precision...

While writing data to a DVD with growisofs, it reported a transient error:

:-? the LUN appears to be stuck writing LBA=310h, retry in 141ms
141 milliseconds? Why not 140 milliseconds? Does the extra 0.001 of a second really make such a difference? 141ms isn't even a simple fraction of a second (it's a little less than one seventh).

Friday, March 02, 2007

How (not) to validate email addresses

A question that programmers often ask is "How do I validate an email address?"

At first glance that appears to be a sensible question. If you're writing a web form or some other application that needs to accept an email address, you might want to detect errors (say, typing fred4my.com instead of fred@my.com) and give the user a chance to correct the error.

But the question of what is a valid email address is much harder than you might expect. The official standard for email accepts a very broad range of email address formats.

[Aside: what's with Google? Try searching for "how to validate email addresses" (without the quotes). I get a 403 error page:

We're sorry...
... but your query looks similar to automated requests from a computer virus or spyware application. [...]

After some experiments, it looks like Google UK blocks (almost) any search containing "email" and "address". But Google Australia doesn't seem to care; and even Google UK will accept the query if it comes from Konqueror's toolbar.]

The best advice for validating email addresses is: Just Say No. At most, check that the email address isn't blank. If you absolutely know that the address can't be a local address, check for the presence of at least one at-sign @. (Yes, you read me right the first time: at least one.) And that's it -- leave the validation up to the mail server. If the mail server can deliver it, it is valid, and if it can't, it isn't.

If you want to guard against user typos, get the user to type the address twice, like they do for a password.
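
As a rough illustration of how little checking that advice amounts to, here is a minimal sketch in Python. The function name `check_email_fields`, its parameters and the wording of the messages are all made up for this example; the only rules it applies are the ones described above (not blank, at least one @ unless local addresses are possible, and typed twice).

```python
def check_email_fields(address, confirmation, allow_local=False):
    """Return an error message for the user, or None if the address can be passed on.

    Hypothetical helper: it is deliberately NOT a syntax validator.
    The mail server has the final say on whether an address is valid.
    """
    address = address.strip()
    if not address:
        return "Please enter an email address."
    # Only insist on an at-sign when a local address is impossible.
    # Note the test is "at least one @", not "exactly one @".
    if not allow_local and "@" not in address:
        return "That address has no @ in it."
    # Typo guard: the user types the address twice, as for a password.
    if address != confirmation.strip():
        return "The two addresses do not match."
    return None
```

Anything this function lets through still has to be proven by actually sending mail to it.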

But ignorant programmers -- and it's frightening how many programmers fall into that category -- insist on doing incorrect validation. This example shows the danger of false negatives: anyone using this code will wrongly reject perfectly valid email addresses like:

my.name@somedomain.info
professor@ancienthistory.museum
somebody (see me @ the pub) @somewhere.com

Yes, the third one is valid: the part between ( and ) is a comment, and is ignored by any compliant mail server.

Another common mistake is to reject emails like something+else@domain.com: plus signs in the user name part are allowed.

And then there are the commercial sites that won't let you register with a Hotmail, Yahoo or Gmail address. Don't get me started on the sheer pig-ignorance and stupidity of that...

But ultimately, even if an email address is syntactically valid (and it is a horrific task to check that!) there's no guarantee that the address is valid until you've actually sent to it successfully. fred@somedomain.com is syntactically valid, but you still have to send an email to that address to find out whether the address is valid or not! That's why using a validator that works for "99% of email addresses" is bad practice -- not only do you needlessly reject the 1% of valid email addresses that your software can't handle, but you still don't know whether the address is valid until you actually try it.

The only thing worse than people who insist on validating email addresses is people who insist on validating email addresses with a regular expression. To quote Jamie Zawinski:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Somebody, I think in the spirit of George Leigh Mallory ("because it's there"), wrote a regular expression to almost validate email addresses (it can't deal with comments, and naturally it can't tell whether or not the address actually exists). To give you a flavour of this regex, here are the first sixty-five characters of this 6343-character monster:

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]

Multiply that by a hundred. Now imagine trying to track down a bug in this beast. How confident are you that the creator of this regex has correctly dealt with all the odd corner cases?

Sunday, January 14, 2007

Immutable instances in Python

Imagine you have a class like Coordinate that implements a two-dimensional coordinate pair. You might want to use instances of that class in a dictionary, but the problem is that instance keys compare by identity, not equality:

>>> D = {Coordinate(2, 3): "something"}  # Coordinate is a custom class.
>>> Coordinate(2, 3) in D
False

This is not what you expect: even though the two instances of Coordinate(2, 3) have the same value, they don't have the same ID and therefore Python won't treat them as the same dictionary key.

The answer to this problem is to give the class __hash__ and __eq__ methods:

class Coordinate(object):
   def __init__(self, x, y):
       self.x = x
       self.y = y
   def __hash__(self):
       return hash(self.x) ^ hash(self.y)
   def __eq__(self, other):
       try:
           return self.x == other.x and self.y == other.y
       except AttributeError:
           return False

Now two instances that compare equal will also have the same hash, and Python will recognise them as the same dictionary key.
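
A quick check of that claim, repeating the class above so the snippet runs on its own. The variable names (`found`, `value`, `mirrored_collide`) are just for illustration.

```python
class Coordinate(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __hash__(self):
        return hash(self.x) ^ hash(self.y)
    def __eq__(self, other):
        try:
            return self.x == other.x and self.y == other.y
        except AttributeError:
            return False

D = {Coordinate(2, 3): "something"}

# A different instance with the same value now finds the entry.
found = Coordinate(2, 3) in D
value = D[Coordinate(2, 3)]

# XOR is symmetric, so mirrored points share a hash value. The dictionary
# still tells them apart using __eq__; the collision only costs a little speed.
mirrored_collide = hash(Coordinate(2, 3)) == hash(Coordinate(3, 2))
mirrored_found = Coordinate(3, 2) in D
```

If the collisions bother you, hashing a tuple of the attributes, hash((self.x, self.y)), is a common alternative that does not treat (2, 3) and (3, 2) alike.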

But there's a gotcha: unlike instances of built-in types like int and tuple, instances of ordinary Python classes are mutable. That's generally what you want, but in this case it can bite you. If an instance is changed while it is being used as a key, its hash changes too, the dictionary goes looking for it in the wrong place, and your code will probably suffer from difficult-to-track-down bugs.
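
Here is a small sketch of that failure, again repeating the mutable class so it stands alone. The particular values (2, 3 and 99) are arbitrary.

```python
class Coordinate(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __hash__(self):
        return hash(self.x) ^ hash(self.y)
    def __eq__(self, other):
        try:
            return self.x == other.x and self.y == other.y
        except AttributeError:
            return False

key = Coordinate(2, 3)
D = {key: "something"}

key.x = 99  # mutate the instance while it is in use as a key

# The entry is still in the dictionary...
still_stored = len(D) == 1
# ...but it was filed under the old hash, so looking it up by the
# old value or by the new value both fail.
old_value_found = Coordinate(2, 3) in D
new_value_found = Coordinate(99, 3) in D
```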

The solution is to make Coordinate immutable, or at least as immutable as any Python class can be. To make a class immutable, have the __setattr__ and __delattr__ methods raise exceptions. (But watch out -- that means that you can no longer write something like self.x = x, you have to delegate that to the superclass.)

class Coordinate(object):
   def __setattr__(*args):
       raise TypeError("can't change immutable class")
   __delattr__ = __setattr__
   def __init__(self, x, y):
       super(Coordinate, self).__setattr__('x', x)
       super(Coordinate, self).__setattr__('y', y)
   def __hash__(self):
       return hash(self.x) ^ hash(self.y)
   def __eq__(self, other):
       try:
           return self.x == other.x and self.y == other.y
       except AttributeError:
           return False

There are a few other things you can do as well: as a memory optimization, you can use __slots__ = ('x', 'y') to allocate memory for the two attributes you do use and avoid giving each instance an attribute dictionary it can't use. If the superclass defines in-place operators like __iadd__ etc. you should override them to raise exceptions. If your class is a container, you must also make sure that __setitem__ etc. either don't exist at all or raise exceptions.
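
Putting the immutable class and the __slots__ suggestion together, a minimal sketch might look like the following. The helper `try_to_mutate` is hypothetical and exists only to show which operations are refused.

```python
class Coordinate(object):
    __slots__ = ('x', 'y')  # no per-instance attribute dictionary

    def __setattr__(*args):
        raise TypeError("can't change immutable class")
    __delattr__ = __setattr__

    def __init__(self, x, y):
        # Delegate to the superclass, since our own __setattr__ always raises.
        super(Coordinate, self).__setattr__('x', x)
        super(Coordinate, self).__setattr__('y', y)

    def __hash__(self):
        return hash(self.x) ^ hash(self.y)

    def __eq__(self, other):
        try:
            return self.x == other.x and self.y == other.y
        except AttributeError:
            return False


def try_to_mutate(point):
    """Return the names of the mutations that were refused with TypeError."""
    refused = []
    try:
        point.x = 99
    except TypeError:
        refused.append("set")
    try:
        del point.y
    except TypeError:
        refused.append("delete")
    try:
        point.z = 1
    except TypeError:
        refused.append("new attribute")
    return refused
```

This is only as immutable as Python allows: anyone determined enough can still call object.__setattr__(point, 'x', 99) directly.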

(I am indebted to Python guru Alex Martelli's explanation about immutable instances.)

[Update, 2007-04-02: fixed a stupid typo where I called super(Immutable, ...) instead of super(Coordinate, ...).]

Tuesday, January 02, 2007

Asking why on technical forums

If you spend any time on technical mailing lists or newsgroups, you'll often come across conversations that go something like this:

"How do I frabulate the transfibulator?"

"Why do you want to do that?"

"Why do you care? Just tell me how to frabulate the transfibulator!"

Why should people on technical lists care about the why? Why not just answer the question?

Firstly, and most importantly, because people have an ethical obligation not to give bad advice.

It is foolish to assume that every random poster on the Internet or Usenet is a responsible, intelligent, clear-thinking, sufficiently cautious adult who knows what they are doing. In fact, if you were going to play the odds, you'd bet on them being the complete opposite. This is even true on many general-purpose technical mailing lists (although perhaps not so much on the more elite lists). If frabulating the transfibulator carries risks or serious costs, then the chances are very good that the person asking about it isn't aware of those risks.

It is one thing to give a straight technical answer if it seems that the poster knows what they're doing. There's no reason not to tell someone how to shoot themselves in the foot if they are fully aware of the consequences of doing so; it is another thing altogether if their post indicates that they haven't thought it through and have no idea that they are even pointing the gun at their foot.

If somebody asks for help writing a rotor-based encryption engine (like the World War Two Enigma), it would be sheer irresponsibility to answer their technical question without pointing out that Enigma was broken back in the 1940s and is not even close to secure today. So ask "Why do you want to do that?". If the answer is "I'm storing confidential medical records in a database", then you can gently apply the cluebat. It might be your own medical records you prevent from being stolen. But if the answer is "I'm doing it to obfuscate some data in a game, I know this is weak encryption, but it is good enough for a game", then that's a horse of a different colour.

The second reason for asking "why?" is that it is extremely common for people to ask the wrong question because of a misunderstanding or misapprehension. Some time ago I read an exchange of posts on comp.lang.python started by a programmer who was looking for a faster method to access items in a list. Eventually somebody asked him "Why?", and it turned out that he had assumed that Python lists are linked lists and that item access was a very slow procedure. In fact, Python lists are smart arrays, and item access is exceedingly fast.

If folks had merely answered his technical question, he would have solved a non-problem, learnt nothing, and ended up with slow and inefficient code. His real problem wasn't "How do I do this...?". His real problem was that he was labouring under false information, and by asking "Why do you want to do this?", people helped him to solve his real problem.

Wednesday, November 29, 2006

Random number generators

"Anyone who uses arithmetic methods to produce random numbers is in a state of sin."
-- John von Neumann.

Paradise Poker has an interesting page on how they generate random numbers for their Internet casinos, and why most random number generators can't shuffle even a small list of items properly. For example, a standard deck of cards can be shuffled 52! different ways (more than 10⁶⁷, or ten thousand billion billion billion billion billion billion billion). A random number generator with only 32 bits of state can produce at most 2³² (a little over four billion) different shuffles -- clearly inadequate.
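
The arithmetic is easy to check in Python. The names below are my own; the sketch simply compares the number of orderings of a deck with the number of states a 32-bit generator can be in, and works out the largest deck such a generator could shuffle into every possible order.

```python
import math

deck_orderings = math.factorial(52)  # ways to order a standard deck
generator_states = 2 ** 32           # distinct states of a 32-bit generator

# Number of bits needed to give every ordering of the deck its own number.
bits_needed = deck_orderings.bit_length()


def largest_fully_shufflable_deck(states):
    """Return the largest n such that n! <= states."""
    n = 1
    while math.factorial(n + 1) <= states:
        n += 1
    return n


small_deck = largest_fully_shufflable_deck(generator_states)
```

With these numbers, bits_needed comes out at 226 and small_deck at 12: a 32-bit generator cannot reach every ordering of even thirteen cards, let alone fifty-two.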

Tuesday, October 10, 2006

Major software projects

Lousy.
Late.
Expensive.
Choose any three.

(Not that I'm cynical or anything like that. Not in the least.)

Tuesday, September 19, 2006

What is a hacker?

Bruce Schneier has a great explanation of what a hacker is, why hackers are important, and why he doesn't buy into the "hackers good, crackers bad" meme.

A hacker is someone who thinks outside the box. It's someone who discards conventional wisdom, and does something else instead. It's someone who looks at the edge and wonders what's beyond. It's someone who sees a set of rules and wonders what happens if you don't follow them. A hacker is someone who experiments with the limitations of systems for intellectual curiosity.

[...]

Hackers are as old as curiosity, although the term itself is modern. Galileo was a hacker. Mme. Curie was one, too. Aristotle wasn't. (Aristotle had some theoretical proof that women had fewer teeth than men. A hacker would have simply counted his wife's teeth. A good hacker would have counted his wife's teeth without her knowing about it, while she was asleep. A good bad hacker might remove some of them, just to prove a point.)

Saturday, September 09, 2006

Video editing badness

The Linux video editing software Kino doesn't support AVI files natively; it works with camcorder DV files. However, it will import AVI files and convert them into DV format. I have had problems with Kino importing a 1GB AVI file.

I'm not specifically upset that after 15 minutes of processing, the temporary .DV file it created had expanded to 6GB. These things happen -- some data formats are bigger than others. I'm not even upset that processing hadn't finished -- some things take time.

But it is absolutely unforgivable that after not just clicking the Cancel button, but having quit the Kino application, the data import was still churning away in the background. What sort of jerry-built, buggy piece of crap software leaves processes running after you've not just explicitly said "Stop that!" but even quit the application?

I miss the days when Linux programmers actually had a clue. If your application launches a thread to run a job, and the user says cancel the job, CANCEL THE JOB. You don't need an IQ of 168 to know that.
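
For what it's worth, honouring a cancel request is not much code. Here is a minimal sketch in Python of a worker thread that checks a cancel flag between chunks of work. It is an illustration only and has nothing to do with how Kino is actually written; the class `CancellableJob` and the sleep standing in for real work are invented for the example.

```python
import threading
import time


class CancellableJob(object):
    """Hypothetical job runner: does `steps` chunks of work unless cancelled."""

    def __init__(self, steps):
        self._steps = steps
        self._cancelled = threading.Event()
        self._thread = threading.Thread(target=self._run)
        self.completed = 0

    def _run(self):
        for _ in range(self._steps):
            if self._cancelled.is_set():  # check the flag between chunks
                return
            time.sleep(0.01)              # stand-in for one chunk of real work
            self.completed += 1

    def start(self):
        self._thread.start()

    def cancel(self, timeout=None):
        """Ask the worker to stop, wait for it, and report whether it has."""
        self._cancelled.set()
        self._thread.join(timeout)
        return not self._thread.is_alive()
```

The application's Cancel button and its quit handler would both call cancel() and not exit until it returns True. If the work is farmed out to a child process instead of a thread, the equivalent is to terminate the child (for example with subprocess.Popen.terminate) before quitting.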

As it was, I was lucky that I knew what was happening. I had launched Kino from the command line, instead of from a menu command, and the status messages were flying thick and fast in the CL window. After much to-ing and fro-ing, from process manager to command line and back again, I eventually discovered that the process in question was the ffmpeg library, and was able to stop it.

The lesson from this is not that Linux isn't ready for use on desktop PCs: this sort of behaviour is no different from what goes on under Windows, except under Windows it is even harder to track down the rogue process. I've resorted to reboots under Windows to stop software running. The lesson is that a colourful animated user interface does not make quality software. I wish application developers would spend more time on getting the basics right.

Thursday, August 03, 2006

The Kingdom of Nouns

A lovely little tale about different stylistic conventions in computer languages, focusing on the Kingdom of Nouns (Java).

In the Kingdom of Javaland, where King Java rules with a silicon fist, people aren't allowed to think the way you and I do. In Javaland, you see, nouns are very important, by order of the King himself. [...]

In Javaland, by King Java's royal decree, Verbs are owned by Nouns. But they're not mere pets; no, Verbs in Javaland perform all the chores and manual labor in the entire kingdom. They are, in effect, the kingdom's slaves, or at very least the serfs and indentured servants. The residents of Javaland are quite content with this situation, and are indeed scarcely aware that things could be any different.

Verbs in Javaland are responsible for all the work, but as they are held in contempt by all, no Verb is ever permitted to wander about freely. If a Verb is to be seen in public at all, it must be escorted at all times by a Noun.

Of course "escort", being a Verb itself, is hardly allowed to run around naked; one must procure a VerbEscorter to facilitate the escorting. But what about "procure" and "facilitate?" As it happens, Facilitators and Procurers are both rather important Nouns whose job is the chaperonement of the lowly Verbs "facilitate" and "procure", via Facilitation and Procurement, respectively.

Alas, many of the ~~idiots~~ programmers commenting on the post miss the point of the tale, which is not that Java isn't English. Nor is it that Object Oriented design is a bad thing (although, OO is merely one means to a greater end, namely information hiding). The point is that Java's type system forces people to create classes solely for the purpose of carrying functions about, often to an absurdly exaggerated degree.

The question of nouns and verbs (data and functions) is important for programmers, as this article by Joel Spolsky explains.
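
To make the contrast concrete, here is a small sketch in Python, where functions are values in their own right. The names (`GarbageTakerOuter`, `take_out`, `do_chores`) are invented for the example and are not taken from either article.

```python
# Kingdom-of-Nouns style: a class that exists only to carry one function about.
class GarbageTakerOuter(object):
    def take_out(self, garbage):
        return "took out the " + garbage


# The same Verb, allowed to walk about unescorted.
def take_out(garbage):
    return "took out the " + garbage


def do_chores(verb, things):
    """Apply any callable Verb to each thing in turn."""
    return [verb(thing) for thing in things]


noun_style = do_chores(GarbageTakerOuter().take_out, ["rubbish", "recycling"])
verb_style = do_chores(take_out, ["rubbish", "recycling"])
```

Both calls produce the same list; the class adds ceremony but no information.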

Saturday, July 08, 2006

Nearly all binary searches are broken

Joshua Bloch from Google writes about the recent discovery of a bug in almost all implementations of binary search:

I was shocked to learn that the binary search program that Bentley proved correct and subsequently tested in Chapter 5 of Programming Pearls contains a bug. Once I tell you what it is, you will understand why it escaped detection for two decades. Lest you think I'm picking on Bentley, let me tell you how I discovered the bug: The version of binary search that I wrote for the JDK contained the same bug. It was reported to Sun recently when it broke someone's program, after lying in wait for nine years or so.
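
The quoted passage stops short of naming the bug. As Bloch goes on to describe it, the culprit is the midpoint calculation (low + high) / 2: with 32-bit signed integers the sum overflows once low + high exceeds 2³¹ - 1, which only happens with arrays of around a billion elements or more. Python integers never overflow, so the sketch below fakes 32-bit arithmetic with a helper, `to_int32`, which is my own invention for the purpose of the demonstration.

```python
def to_int32(n):
    """Wrap n the way a 32-bit signed (two's complement) integer would."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n


def broken_mid(low, high):
    # Equivalent of: int mid = (low + high) / 2;
    # The sum wraps around before the division happens.
    total = to_int32(low + high)
    return int(total / 2)  # truncate toward zero, as C and Java do


def safe_mid(low, high):
    # Equivalent of: int mid = low + (high - low) / 2;
    # No intermediate value is larger than high, so nothing overflows.
    return to_int32(low + (high - low) // 2)


# Both indices fit comfortably in 32 bits, but their sum does not.
low, high = 2 ** 30, 2 ** 30 + 2
bad = broken_mid(low, high)
good = safe_mid(low, high)
```

Here bad comes out negative, which in a real program becomes an out-of-range array index, while good is the expected 2³⁰ + 1. For small arrays the two functions agree, which is why ordinary testing never caught it.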

Sunday, July 02, 2006

The one good thing about Fortran

The one good thing about Fortran 77 is the ability to redefine the numeral 7. That and optional whitespace between tokens.

The two good things about Fortran are redefining integers, optional whitespace, and computed GOTOs.

Three good things! The three good things about Fortran are redefining integers, optional whitespace, computed GOTOs, and self-modifying code.

Four! Four good things! The four good things about Fortran are redefining integers, optional whitespace, computed GOTOs, self-modifying code, and six-character variable names.

Five! The five good things...
