Regular Expressions Are the Best! s/Best/Worst/

March 5, 2022
Regular Expressions. Ya love 'em or ya hate 'em. But it shouldn't be so black-or-white. Here's when they do, and don't, make sense.

Regular Expressions. Regex.

Ya love ’em or ya hate ’em, it seems.

In some languages (Perl, anyone?) they’re so central to the language that it’s almost impossible to imagine life without them. In others, they’re tacked on as a cumbersome library.

So should you use regular expressions? Should you avoid them? Does it depend on your language of choice?

My opinion on the matter has evolved over the years, to the point that I now fall firmly in the camp of…

“You probably should mostly, usually avoid regular expressions, except in some specific cases.”

Bleh.

Nuance. Don’t you hate that?

Let me break this down a bit, to explain my nuanced approach.

What’s all the fuss about?

First, for anyone here who’s not already familiar with regular expressions, it might be nice to know what they even are.

Wikipedia offers this explanation:

A regular expression is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for “find” or “find and replace” operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory.

Okay. So that’s abstract. What does a regular expression look like? Here are a few examples, which I won’t explain in detail, just to give you a sense:

  • ^Mr\. – Does the name begin with the title “Mr.”?
  • /(Na na na nananana, nannana, hey Jude)*/ – Are these the lyrics to Hey Jude?
  • ^$ – Is this the empty string?
  • \d{3}-?\d{3}-?\d{4} – A US phone number that may or may not contain dashes
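To make these a little more concrete, here is a minimal JavaScript sketch that runs a few of the patterns above against sample strings. The sample inputs are made up for illustration, and the phone pattern is written with explicit repetition counts and anchors, which is my assumption about what the pattern is meant to express.

```javascript
// Sample inputs below are invented for illustration only.
const startsWithMr = /^Mr\./;
const isEmpty = /^$/;
// Three digits, optional dash, three digits, optional dash, four digits.
const usPhone = /^\d{3}-?\d{3}-?\d{4}$/;

console.log(startsWithMr.test("Mr. Rogers")); // true
console.log(startsWithMr.test("Dr. Who"));    // false
console.log(isEmpty.test(""));                // true
console.log(usPhone.test("555-867-5309"));    // true
console.log(usPhone.test("5558675309"));      // true
console.log(usPhone.test("555-8675"));        // false
```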

Also of interest: regular expressions did not start out as a computer thing. They originated in the 1950s, and only became mainstream in the world of computers in the 1980s:

The concept of regular expressions began in the 1950s, when the American mathematician Stephen Cole Kleene formalized the description of a regular language. They came into common use with Unix text-processing utilities. Different syntaxes for writing regular expressions have existed since the 1980s, one being the POSIX standard and another, widely used, being the Perl syntax.

Regular Expressions are powerful!

So that’s a bit abstract, and maybe even confusing. Why would we even want to use such a thing in computing?

Well, the simple answer is that regular expressions are both powerful and flexible.

That is to say, we can write a lot of search or search/replace patterns using regular expressions. Want to validate a phone number that may or may not contain dashes? Or might even use periods in place of dashes? Do you want to extract the names of animals and the sounds they make from the lyrics to Old MacDonald Had a Farm? Regular expressions can do it.
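As a rough sketch of both ideas, the JavaScript below validates a phone number with optional dashes or periods, and pulls animal/sound pairs out of a couple of verses. The abbreviated lyrics, the variable names, and the exact patterns are my own assumptions for illustration, not a canonical solution.

```javascript
// Accept dashes or periods as optional separators:
// 555-867-5309, 555.867.5309 and 5558675309 all pass.
const phone = /^\d{3}[-.]?\d{3}[-.]?\d{4}$/;

// Capture the animal, then the sound that is repeated twice ("moo moo").
// \2 is a backreference to the second capture group; the "s" flag lets "." span lines.
const verse = /he had an? (\w+).*?[Ww]ith an? (\w+) \2 here/gs;

// Abbreviated lyrics, for illustration only.
const lyrics = `
And on that farm he had a cow, E-I-E-I-O.
With a moo moo here and a moo moo there.
And on that farm he had a pig, E-I-E-I-O.
With an oink oink here and an oink oink there.
`;

const animals = [...lyrics.matchAll(verse)].map(
  ([, animal, sound]) => ({ animal, sound })
);

console.log(phone.test("555.867.5309")); // true
console.log(animals); // cow/moo and pig/oink
```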

And this helps explain why Unix adopted regular expressions for text-based processing tools such as grep—a use case I’ll get back to at the end of this article.

Where regular expressions fail

That sounds nice. Powerful and flexible things are great in computing, right? So what’s the problem with regular expressions, then?

Well, categorically, regular expressions are applicable to one type of input, and the clue is in the name: a regular language.

Now that’s a bit of a circular definition… as a regular expression can parse a regular language, and a regular language is essentially defined as a language that can be parsed with a regular expression. But that’s okay for right now, since this isn’t an academic paper. The point here is: Not all inputs can be parsed with a regular expression.

And this point is lost on many people. Perhaps most famously, HTML is not a regular language, and therefore you cannot parse HTML with a regular expression.
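A small JavaScript sketch shows where this bites. The pattern and the sample markup are made up for illustration; the point is only that a pattern that looks fine on flat input gives the wrong answer as soon as tags nest.

```javascript
// A naive attempt to grab the contents of a <div> with a regex.
const divContents = /<div>(.*?)<\/div>/;

const flat = "<div>hello</div>";
const nested = "<div>outer <div>inner</div> tail</div>";

console.log(flat.match(divContents)[1]); // "hello"

// The lazy match stops at the first closing tag it finds,
// which belongs to the inner div, not the outer one.
console.log(nested.match(divContents)[1]); // "outer <div>inner"
```

Making the match greedy would fix this particular input, but would then swallow everything between two sibling divs. Keeping track of arbitrarily deep nesting is exactly what a regular language cannot do.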

Some other problems may technically be addressed with regular expressions, but often shouldn’t. Email validation is a great example.

But should we…?

Maybe you don’t need to validate all valid email addresses. Or you aren’t parsing all valid HTML. Maybe you have a limited range of inputs. Then maybe regular expressions can be used. But that doesn’t necessarily mean they should…

Your programmers were so preoccupied with whether they could use a regular expression, they didn’t stop to think if they should.

Even when they can be used, regular expressions have a few potential downsides.

Regular expressions can be inefficient

Several years ago I was working on a spam-filtering project where, one day, our mail server started crashing with an out-of-memory error.

Long story short, after debugging, we found that a customer had entered a blocking filter that looked something like this (the actual regex was more complex, and thus performed even worse, but this should illustrate the point):

/.*(viagra|cialis|levitra).*/

If you’re unfamiliar with regex, this pattern searches for any set of characters (.*) followed by any of the three words viagra, cialis or levitra, followed by any set of characters (.*).

When someone sent an email with a 10 GB attachment, for example, the regex engine would read the entire message into memory, then start looping through the entire message, repeatedly, one more character per iteration.

Our fix was to take any user-provided regex and replace .* with .{0,1024}, which effectively changed the search from “any set of characters” to “any set of up to 1024 characters.” This was still very inefficient, but was enough of an improvement that our customers could no longer DoS our mail servers.
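In JavaScript terms, that rewrite could look roughly like the sketch below. The helper name boundWildcards is hypothetical, and this is a simplified illustration of the idea rather than the code we actually ran.

```javascript
// Hypothetical helper: cap every unbounded ".*" in a user-supplied pattern.
// Simplified on purpose: it does not special-case an escaped dot ("\.*"),
// and it leaves other unbounded quantifiers (".+", "\w*", and so on) alone.
function boundWildcards(pattern, limit = 1024) {
  return pattern.replaceAll(".*", `.{0,${limit}}`);
}

const userPattern = ".*(viagra|cialis|levitra).*";
const bounded = boundWildcards(userPattern);

console.log(bounded); // .{0,1024}(viagra|cialis|levitra).{0,1024}
console.log(new RegExp(bounded).test("cheap viagra here")); // true
```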

While this is an extreme example, inefficiencies in regex abound.

In fact, regular expressions are almost always less efficient than their non-RE alternatives. While this is something to keep in mind, performance is usually not the most important aspect of our code.

Regular expressions are confusing

And this leads to the next big problem with regular expressions. They’re hard to read. Even a simple regular expression, like the one above, can be more clearly stated as a call to a function like strings.ContainsAny.
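As a sketch of what that might look like in JavaScript, the spam filter from earlier can be written as a plain substring check. The helper name containsAnyWord is hypothetical; I'm not quoting a real library function here.

```javascript
// Hypothetical helper: true if the text contains any of the given words.
// Like the original regex, this check is case-sensitive.
function containsAnyWord(text, words) {
  return words.some((word) => text.includes(word));
}

const blocked = ["viagra", "cialis", "levitra"];

console.log(containsAnyWord("cheap viagra, no prescription", blocked)); // true
console.log(containsAnyWord("meeting moved to 3pm", blocked));          // false
```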

When you get to any non-trivial regular expression, even the most experienced regex whisperers among us will take at least minutes to understand what a regex is doing. Never mind trying to debug it!

And because I put such a high premium on code readability, I tend to discourage the use of regular expressions in code, whenever there is an alternative.

For example, instead of:

name.match(/^Mr\./)

Prefer:

name.startsWith("Mr.")

Sometimes regex is still the best tool

Now that I’ve hopefully firmly established that I believe regular expressions are not readable, and therefore should usually be avoided, there are times when a regular expression in your code is still the best way to solve a problem.

If you’re parsing a log file, for example, it’s likely much easier (and more readable) to put your log format into a regular expression than to write some sort of character-based parser.
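Here is a minimal JavaScript sketch of that idea. The log format, the field names, and the parseLogLine helper are all assumptions I made up for illustration; your own format will differ. Named capture groups go a long way toward keeping the pattern readable.

```javascript
// Assumed log format (invented for this example):
//   2022-03-05T14:07:31Z ERROR [mailer] connection refused
const logLine =
  /^(?<timestamp>\S+) (?<level>[A-Z]+) \[(?<component>[^\]]+)\] (?<message>.*)$/;

// Hypothetical helper: returns the named fields, or null if the line
// does not follow the assumed format.
function parseLogLine(line) {
  const match = line.match(logLine);
  return match ? { ...match.groups } : null;
}

const entry = parseLogLine("2022-03-05T14:07:31Z ERROR [mailer] connection refused");
console.log(entry.level, entry.component, entry.message);
console.log(parseLogLine("not a log line")); // null
```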

If I were to summarize this, I’d say that regular expressions may be appropriate when parsing data structures more complex than CSV, but less complex than XML or HTML, when a stand-alone library is not available.

Handling user input

The other place where I’m actually a big fan of regular expressions is the same use case that earned them their popularity starting in the 1980s: arbitrary user input.

If you need a user of your system to be able to enter a search pattern, such as used when filtering email, querying a large dataset, or calling grep, a regular expression is a perfect tool. It can still be a bit dangerous, as I explained above. But with the proper sanitization and precautions, this great power can usually be wielded responsibly.
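What "precautions" means will depend on your system, but a minimal JavaScript sketch might look like the following. The limits and the helper names (compileUserPattern, matchesUserPattern) are hypothetical, chosen only to illustrate the shape of the idea.

```javascript
// Hypothetical limits; tune them for your own system.
const MAX_PATTERN_LENGTH = 200;
const MAX_INPUT_LENGTH = 100000;

// Compile a user-supplied pattern. Returns null if the pattern is
// too long or is not a valid regular expression.
function compileUserPattern(source) {
  if (source.length > MAX_PATTERN_LENGTH) return null;
  try {
    return new RegExp(source);
  } catch (err) {
    return null; // the RegExp constructor throws on invalid patterns
  }
}

// Only hand a bounded prefix of the input to the regex engine,
// which limits how much text it has to scan.
function matchesUserPattern(regex, text) {
  return regex.test(text.slice(0, MAX_INPUT_LENGTH));
}

const filter = compileUserPattern("viagra|cialis");
console.log(matchesUserPattern(filter, "buy cialis now")); // true
console.log(compileUserPattern("("));                      // null
```

Note that these checks do not protect against patterns that backtrack badly on short inputs. A time limit on matching, or a regex engine that does not backtrack, would be further options to consider.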

Of course there are simpler alternatives to regular expressions for many forms of user input. File globs (aka “wildcards”) are a great example, for certain use cases.

TL;DR

Here’s my simplified list of when to use a regular expression:

  1. Handling HTML, XML, email addresses, or any other input with complex or irregular parsing rules? Don’t use regex
  2. Is there a library or function to handle parsing/validation/search for you? Don’t use regex
  3. Are you parsing regular but complex input that would be tedious to parse without a regular expression? Maybe use regex
  4. Is your search/replace pattern provided by a user? Likely use regex

What do you think? Did I overlook any benefits (or drawbacks) to regular expressions that I should consider in my list?
