/Reg(ular)? ?[Ee]xp(res{2}ion)?s/

Dima Parzhitsky
Published in ITNEXT
15 min read · Jul 30, 2019

Read the last line of a file? Yep.
Count usage of a given word in a text? Check.
Validate user input? All the time!

These are tasks that we, as software engineers, come across pretty frequently. Chances are, you have already run into one of them at some point. And there is one thing that makes them very similar to each other: in most cases, they are better handled with regular expressions.

I know, I know, regular expressions have a reputation for being unnecessarily overcomplicated. The goal of this article is to show you that this is just a bias, far from the truth. I believe that the true power of regular expressions, once mastered, will give you previously unimagined control over the quality and quantity of your work.

Pro tip: If you are not satisfied with the scalability of regexps (and I think nobody is), try reScaled, a tool for building regular expressions seamlessly from reusable atoms.

Basic syntax

Let’s get a refresher on the basics that you’re probably already familiar with. Regexps (a shorter name for “regular expressions”) were originally created to be (and are still used as) a generalization tool for strings, sort of like a pattern for them. The logic of your program directly depends on whether a given string matches the pattern or not.

/abc/ is a regular expression: the text “abc” enclosed between two slashes
"abc" is a primitive string: the text “abc” enclosed in two quotes

And regexps are in many ways similar to strings. Just like strings are enclosed in two " quotes, regular expressions live inside two / slashes. Just like with strings, some characters need to be escaped with a \ backslash to be appropriately used as part of a regular expression.

I prefer using “regexp” instead of “regex” because JavaScript suggests the former through the name of its RegExp constructor.

Except for a handful of special characters (more on that later), most of what a regexp body contains is matched literally. It means that the regexp /abc/ matches only the text "abc" and not “ab”, “bc”, “Abc”, “abbc”, “xyz” or anything else.

/abc/.test("abc"); // true
/abc/.test("xyz"); // false

But if the text is known precisely in advance, you probably don’t need a regexp; just use the string itself. In order to do useful work, you need something more, like…

Quantifiers

These guys deal with repetition. They help you if, say, you’re not sure how many times the pattern will occur (consecutively) in a given string. Or the opposite: you know that this digit is repeated here exactly 16 times, no more, no less. Those kinds of things.

In regexp syntax, there are three basic quantifiers:

  • optional occurrence (0 or 1 time);
  • definite occurrence (1 or more times);
  • ambiguous occurrence (0 or more times);

These are not official terms, though.

Use ? for optionality, + for definitiveness and * (asterisk) for ambiguity.

Question mark before “message” parameter means that it is optional

I find these characters really easy to remember. ? is a yes/no question. Also, many programming languages use the question mark to indicate optional parameters.

PEGI 18 icon before June 2009

+ is often used in texts like “18+”, meaning that you have to be 18 or more years old to watch the movie / play the game / visit the site.

It is important to note here that + is not about math. The regexp /18+/ matches neither “19” nor “20”, but it does match "188" and "1888".
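Here is a small sketch of that claim, using the .test() method shown earlier. The sample strings are the ones from the text plus a plain "18"; the last two lines compare + with ? and *, anchored with ^ and $ (both explained later in this article) so that the whole string is checked.

```javascript
// + repeats the character right before it: here, the "8".
const ratings = ["18", "188", "1888", "19", "20"];
const matchesEighteenPlus = ratings.map((s) => /18+/.test(s));
console.log(matchesEighteenPlus); // [ true, true, true, false, false ]

// For comparison, ? and * also allow the "8" to be absent.
console.log(/^18?$/.test("1"), /^18?$/.test("188")); // true false
console.log(/^18*$/.test("1"), /^18*$/.test("188")); // true true
```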

Interface of GitLab.com allows using wildcards to protect several branches

Lastly, * could be thought of as a placeholder. If you think about it, zero or more times is really any number of times. And indeed, it is used quite often to say that “anything will do.”

Regular expression with quantifiers

If we changed our regexp from /abc/ to /a?b+c*/, it would be significantly more powerful. Besides "abc", it now also matches strings like "bc", "ab", "bbbb", "bcccccc" and more. It will not match “aabc” as a whole though, because ? takes one occurrence of the character “a” at most. Also, the regexp does not match the string “ac”, because “b” is expected by + to be there at least once.
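A minimal sketch of these claims. Keep in mind that .test() reports a match found anywhere in the input, so the sketch wraps the pattern in ^ and $ (covered later in this article) to check whole strings; the variable names are mine.

```javascript
// Anchored version of /a?b+c*/, so that the whole string has to obey the pattern.
const quantified = /^a?b+c*$/;

const accepted = ["abc", "bc", "ab", "bbbb", "bcccccc"].map((s) => quantified.test(s));
const rejected = ["aabc", "ac"].map((s) => quantified.test(s));

console.log(accepted); // [ true, true, true, true, true ]
console.log(rejected); // [ false, false ]

// Without the anchors, "aabc" passes, because the "abc" inside it is enough.
const unanchoredResult = /a?b+c*/.test("aabc");
console.log(unanchoredResult); // true
```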

Try it.

Quantifiers may be even more versatile. With custom quantifiers one could quantify stuff using formulas like “exactly x times”, “at least x times”, or “between x and y times”. In the following example we match exactly five “a”s, followed by at least six “b”s, followed by seven-to-twelve “c”s.

Example of a regular expression with all kinds of custom quantifiers

Notice that there are no spaces around commas. Adding spaces will make it stop being a quantifier.
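The pattern from the description above could look like the sketch below. The exact regexp is my reconstruction from the prose (five “a”s, at least six “b”s, seven-to-twelve “c”s), anchored with ^ and $ to check whole strings. The last part shows what a space after the comma does in a JavaScript regexp literal without the u flag.

```javascript
// Exactly five "a"s, at least six "b"s, then seven to twelve "c"s.
const custom = /^a{5}b{6,}c{7,12}$/;

const build = (a, b, c) => "a".repeat(a) + "b".repeat(b) + "c".repeat(c);

console.log(custom.test(build(5, 6, 7)));   // true
console.log(custom.test(build(4, 6, 7)));   // false: too few "a"s
console.log(custom.test(build(5, 60, 12))); // true: "at least six" has no upper bound
console.log(custom.test(build(5, 6, 13)));  // false: too many "c"s

// With a space after the comma, the braces are matched literally.
const notAQuantifier = /^c{7, 12}$/;
console.log(notAQuantifier.test("c".repeat(7))); // false
console.log(notAQuantifier.test("c{7, 12}"));    // true
```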

Try it.

Quantifying parts of strings properly is one of the most important things to get right in regular expressions. Also, you can already see that regexps are a kind of stencil for text, limiting or diversifying its acceptable variants.

Groups

You may have already noticed that we have been talking about characters. But what about words, can we quantify them? For example, is it possible to say something like: “At this point, there might or might not be this word” using regexps? Well, words are just ordered groups of characters (“ordered” is important here), so we just have to group a bunch of ’em. To do that, we pretty intuitively use parentheses ().

Regular expression with group

By creating a group we are creating a token, so from now on it can be treated almost as if it were a single character. E.g., by quantifying the group in the /hello( world)?/ regexp we’re saying that we accept both "hello" and "hello world" strings, with or without the second word.

Note that if there is no word, there is no whitespace before it. That’s why the space had to be included in the group too. Without this, the string "hello " would be accepted, which is bad since texts shouldn’t end with a whitespace.
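A short sketch of why the space belongs inside the group. Both patterns are anchored with ^ and $ (explained later in this article) to check whole strings; the "sloppy" variant is a hypothetical counterexample, not something to use.

```javascript
// The space lives inside the optional group.
const greeting = /^hello( world)?$/;
console.log(greeting.test("hello"));       // true
console.log(greeting.test("hello world")); // true
console.log(greeting.test("hello "));      // false

// Counterexample: the space is left outside the group.
const sloppy = /^hello (world)?$/;
console.log(sloppy.test("hello "));        // true: trailing whitespace is accepted
console.log(sloppy.test("hello"));         // false: the space became mandatory
```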

Try it.

Grouping also allows us to build “either this or that” logic with the | (vertical line) character. Let’s say we’ve invited both Tony and Howard Stark, the famous Iron Man and his father, to our backyard party. All they have to do is enter their full name on a special digital board, which has to be configured so that only the names “Tony Stark” and “Howard Stark” are considered valid.

Regular expression with alternation

Since we have only two names and a subtle difference between them, instead of a list of allowed names we can use a simple regexp. Both of them end with “Stark”, which makes it easy to extract the rest into the subpattern of a valid first name: /(Tony|Howard) Stark/

Now, if Eddard Stark sneakily enters his name, he will soon be kicked out of our awesome party.
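The digital board could be sketched like this. The anchors ^ and $ (explained later in this article) are my addition, so that the whole input has to be one of the two names; the last pattern is a hypothetical mistake that shows why the parentheses matter.

```javascript
const guestList = /^(Tony|Howard) Stark$/;

console.log(guestList.test("Tony Stark"));   // true
console.log(guestList.test("Howard Stark")); // true
console.log(guestList.test("Eddard Stark")); // false

// Without the group, | splits the WHOLE pattern in two:
// "starts with Tony" or "ends with Howard Stark".
const noGroup = /^Tony|Howard Stark$/;
console.log(noGroup.test("Tony Montana"));   // true, which is not what we wanted
```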

Try it.

Character sets

Okay, let’s return to individual characters.

It seems reasonable to use the | vertical line to vary usage of different letters in a word, for example, to accept both spellings of “-ize/-ise” words. Here you might decide to use the vertical line like this: /utili(z|s)e/, but… Although this is kinda fine and it does indeed work, still you can do better than that.

Another set of brackets that allows you to group characters is a set of two square brackets [].

Regexp with two consecutive character sets

Square brackets make an unordered set, and perform the “one of them” logic. It means that instead of explicitly saying “try z, if it fails, then try s” we can create a small set of two characters [zs], and make regexp machinery do the heavy work for us. The latter is also more optimized and it results in a regexp that looks cleaner and reads easier: /utili[zs]e/.

For reasons that I think are obvious, you should not put characters in the set if you don’t want them to be matched. Also, it is useless to put a character in such a set twice.
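A quick sketch of the “one of them” logic, anchored with ^ and $ (explained later in this article) to check whole words. The sample words other than “utilize” and “utilise” are made up.

```javascript
const spelling = /^utili[zs]e$/;

console.log(spelling.test("utilize"));  // true
console.log(spelling.test("utilise"));  // true
console.log(spelling.test("utilike"));  // false: "k" is not in the set
console.log(spelling.test("utilizse")); // false: a set stands for exactly one character

// Order and duplicates inside the set make no difference.
const sameSet = ["z", "s", "x"].every((c) => /[zs]/.test(c) === /[szz]/.test(c));
console.log(sameSet); // true
```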

Try it.

Ranges

But sometimes you need even more flexibility. If you expect one of not two but a whole bunch of related characters, wouldn’t it be cumbersome to list them all one by one while creating a set? Sure it would. That’s why regexps have yet another often used feature up their sleeve — ranges.

In the Unicode table, each character is associated with a particular number, and the further you go “right” or “down”, the higher that number is. Simply put, it increases by 1 from one character to the next. That allows locating a group of consecutively placed characters by range.

Regular expression with character range

Ranges are defined using the - hyphen, which is again very intuitive. It is also very simple: you just place the hyphen between two characters that will both be included in the range, like in the regular expression [3-8]. Since digits in the Unicode table are placed from “0” to “9”, this regexp matches any digit between “3” and “8”, inclusive.

Compare it to [345678] and (3|4|5|6|7|8). So concise, isn’t it‽

Ranges are useful also because they can be merged seamlessly. You can use several ranges right next to each other to broaden the options for that character.

More verbose representation of an alphanumeric character
Set of all alphanumeric characters

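For example, an alphanumeric character can be expressed as [A-Za-z0-9_], which is Regexpian for: “upper- or lowercase letter, digit, or the underscore character.” FYI, that is what regexps mean by an alphanumeric character (strictly speaking, the underscore is neither a letter nor a digit, but regexps count it in).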

There are some minor non-(very)-intuitive gotchas here though:

  • range boundaries have to be specified start-to-end, e.g., from “a” to “z”, not in reverse;
  • you have to take extra care to include the hyphen itself in the set: place it first or last, or escape it with a backslash;
  • hyphen defines ranges only when placed inside [] square brackets; outside of them, it is matched literally as a hyphen (duh!).
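The gotchas above can be sketched as follows. Patterns are anchored with ^ and $ (explained later in this article) to check whole strings, and the reversed range is built with the RegExp constructor so that the error can be caught at run time instead of breaking the whole script.

```javascript
// A range includes both of its boundaries.
const digit3to8 = /^[3-8]$/;
const rangeResults = ["2", "3", "8", "9"].map((s) => digit3to8.test(s));
console.log(rangeResults); // [ false, true, true, false ]

// Reversed boundaries are rejected by the engine.
let reversedError = null;
try {
  new RegExp("[z-a]");
} catch (error) {
  reversedError = error;
}
console.log(reversedError instanceof SyntaxError); // true

// A literal hyphen in a set: put it first or last, or escape it.
const hyphenSets = [/^[-a-z]$/, /^[a-z-]$/, /^[a\-z]$/];
console.log(hyphenSets.every((re) => re.test("-"))); // true
console.log(/^[a\-z]$/.test("b"));                   // false: no range here anymore

// Outside of square brackets the hyphen is just a hyphen.
console.log(/^3-8$/.test("3-8"), /^3-8$/.test("5")); // true false
```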

Try it.

Metacharacters

What if it doesn’t matter which character it is, as long as it is a letter? How do you match any letter exactly once? Or any digit? Surely, you could use a range, but wouldn’t it be nice to just say “letter” or “digit” without ranges or something?

Yep, exactly. That is just what metacharacters are for.

There are a bunch of metacharacters, and IMO the most useful of them are \d (for “digit”), \w (for “word”), \s (for “space”) and \n (for “new line”). Notice that they are all prefixed with a backslash; that, of all things, is what makes them metacharacters.

Some metacharacters can be substituted with ranges to demonstrate exactly what they do:

  • \d means “digit”, so it is simply the same as [0-9].
  • \w, despite the obvious guess, does not match words, but rather word-like a.k.a. alphanumeric characters (yep, those dudes from earlier in this article), and it can be safely substituted with [A-Za-z0-9_] (but why would one do that though?). Notice how \d is a subset of \w.
  • \s is a very convenient representation of whitespace characters, such as the actual space, tab, the new line and many others. In fact, it encapsulates unshowable characters, so the corresponding set looks like a bit of a mess: [ \f\n\r\t\v\u00A0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u2028\u2029\u202f\u205f\u3000\ufeff] Here you go, now memorize it 😜
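As a sanity check, the sketch below compares the metacharacters with their spelled-out sets on a handful of sample characters. The samples are my own choice and are by no means exhaustive.

```javascript
const sampleChars = ["7", "x", "_", " ", "\t", "\n", "-", "é"];

// \d and \w behave exactly like the sets they abbreviate.
const sameAsSets = sampleChars.every(
  (c) => /\d/.test(c) === /[0-9]/.test(c) && /\w/.test(c) === /[A-Za-z0-9_]/.test(c)
);
console.log(sameAsSets); // true

// Every digit is a word character, but not the other way around.
console.log(/\d/.test("_"), /\w/.test("_")); // false true

// \s covers much more than the plain space.
const whitespace = ["\t", "\n", "\u00A0", "\u3000"];
console.log(whitespace.every((c) => /\s/.test(c))); // true
```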

Other metacharacters are often the only way of doing something:

  • \n is not a range, but rather a single character (new line character), though very often used — to the point where it has a separate metacharacter for itself. Also, it is worth noting that \s contains \n, therefore it is pretty useless to write something like /[\s\n]/.
  • \b (for “boundary”) is a metacharacter used to match places where words start and end (by “words” I mean groups of alphanumeric characters). See, there isn’t even such a character in the table!
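A small sketch of \b in action. The word “cat” and the sample sentences are made up for illustration; the last line uses the g flag and the .replace() method (both appear later in this article) to make the boundaries visible.

```javascript
const wholeWord = /\bcat\b/;

console.log(wholeWord.test("the cat sat")); // true
console.log(wholeWord.test("concatenate")); // false: "cat" is buried inside a word
console.log(wholeWord.test("cat_food"));    // false: "_" counts as a word character

// \b matches a position, not a character, so nothing gets replaced away.
const marked = "the cat sat".replace(/\b/g, "|");
console.log(marked); // |the| |cat| |sat|
```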

To recap, metacharacters are useful because they a) are slim abbreviations for popular use cases; b) contain a lot of information, otherwise (if “otherwise” exists) expressed with great verbosity; c) are superstars of readability.

Try it.

Special characters

Throughout the article you may have noticed that regexps treat some characters in a special way, as is the case with |, ?, +, *, \, square brackets and parentheses. Not surprisingly, these characters are called special. They are what enriches regexps with all sorts of useful features, and still there are not that many of them. Besides those already familiar to us, there are ^ (circumflex), $ (dollar sign) and . (dot).

The first two are very often used together because of their complementary nature: ^ matches the beginning of string, and $ matches its end. This is very useful in cases when you have to match texts that start or end with a given pattern; or to ensure that the whole string obeys the pattern, rather than just some part of it.

This regular expression matches strings that either start with “Lorem…” or end with “…amet”
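The picture itself did not survive, so the pattern below is my guess at a regexp that fits the caption; treat it as an assumption. The second part shows how the two anchors together force the whole string to obey a pattern.

```javascript
// Either starts with "Lorem" or ends with "amet" (assumed pattern).
const startsOrEnds = /^Lorem|amet$/;

console.log(startsOrEnds.test("Lorem ipsum"));       // true
console.log(startsOrEnds.test("dolor sit amet"));    // true
console.log(startsOrEnds.test("ipsum Lorem dolor")); // false

// Without anchors a pattern may match anywhere in the string,
// with both anchors it has to cover the string from start to end.
console.log(/abc/.test("xabcx"));   // true
console.log(/^abc$/.test("xabcx")); // false
console.log(/^abc$/.test("abc"));   // true
```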

Lastly, the . dot is a “catch-all” guy: it matches any character, including the dot itself. If you use it in conjunction with the * quantifier, you express the “any amount of anything” logic, for when you don’t know or don’t really care about this particular part of the input.

This regexp matches a string that starts with “Lorem”, ends with “amet”, and maybe has something in between

The important point is that the . by default does not match the \n new line character, so you have to be a little bit creative to include it in the “anything” group.
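A sketch of the new line problem and two common ways around it. The pattern is my reconstruction from the caption above, and the s flag assumes an engine that supports ES2018 (any current browser or Node.js version does).

```javascript
const loremToAmet = /^Lorem.*amet$/;
const oneLine = "Lorem ipsum dolor sit amet";
const twoLines = "Lorem ipsum\ndolor sit amet";

console.log(loremToAmet.test(oneLine));     // true
console.log(loremToAmet.test("Loremamet")); // true: "maybe" something in between
console.log(loremToAmet.test(twoLines));    // false: the dot stops at the new line

// Option 1: a set of "whitespace or not whitespace", i.e. really anything.
console.log(/^Lorem[\s\S]*amet$/.test(twoLines)); // true

// Option 2: the s ("dotAll") flag makes the dot match new lines too.
console.log(/^Lorem.*amet$/s.test(twoLines));     // true
```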

Special characters by definition do not match themselves literally. You can already see where this might be a problem — that is, when they have to be matched literally, of course. In this case you should be able to somehow “turn off” their specialness.

Guess what, you don’t need any super-special character to do that (and yet another super-duper-special character to match this one, and so on), no. It would actually suck! Instead, you just use the \ backslash. Think of it as a character that “reverts” the specialness of its patient: d is literally the “d” character, but \d is a digit; s matches only itself literally, but \s stands for a space-like character. And at the same time, ? does a special function of quantifying stuff before it, but \? is just a question mark. You get the idea.

As you can see, \ backslash exhibits special behavior, therefore it is also a special character. Now, how do you think you would match the backslash itself in the input string? Think about that for a moment; as I’ve said earlier, you don’t need any fancy character for that… That’s right, you just use the backslash again!

Regexp for a single backslash
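In code, escaping could look like the sketch below. The sample strings are made up; note that string literals use the same backslash escaping, which is why the path is written with two backslashes in the source.

```javascript
// In a regexp literal, \\ stands for one literal backslash.
const singleBackslash = /\\/;

// "C:\\temp" holds exactly one backslash: C : \ t e m p
const windowsPath = "C:\\temp";
console.log(windowsPath.length);                // 7
console.log(singleBackslash.test(windowsPath)); // true
console.log(singleBackslash.test("C:/temp"));   // false

// Escaped special characters are matched literally.
console.log(/^1\+1$/.test("1+1")); // true
console.log(/^1+1$/.test("1+1"));  // false: here + quantifies the first "1"
console.log(/^1+1$/.test("111"));  // true

// A text that ends with a question mark.
console.log(/\?$/.test("Is this a question?")); // true
console.log(/\?$/.test("This is not."));        // false
```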

The regexp in the “Try it” section below aims to match sentences that end with the question mark (namely questions).

Try it.

Examples

This is the 20% of regular expressions that you have to know in order to squeeze 80% of the juice out of them. The rest will come later in this article, but for now, let’s have some fun with real-world use cases.

Before we start, I’d like to emphasize an important point:

Composing regular expressions starts with defining the model of ideal value — the more precise the model, the better regexp it will produce.

And, like with any model, there are no perfect ones.

Now, onto examples!

Validating user’s full name

In a lot of applications we trust users to fill in their full names. But not every user is trustworthy, so we have to consider malicious inputs and prevent those from being submitted.

In other words, we have to validate the user’s full name.

Most of the time the full name consists of two components: the first and the last name.

Surely, users’ full names do not always have exactly two parts. For example, my own full name also includes a patronymic. But let’s keep it simple.

The ideal text containing person’s full name would obey these rules:

  • it contains two words, separated by a space;
  • both parts are at least one character long;
  • each part starts uppercase and continues lowercase.

I think we’re now ready to compose the pattern (anchored with ^ and $, so that the whole input has to obey it):

/^[A-Z][a-z]* [A-Z][a-z]*$/

Since both parts of the full name have similar rules, the pattern becomes very repetitive. I’ve created a solution for this issue.

Here we used the [A-Z] and [a-z] ranges instead of \w, because the latter also allows digits and the underscore character, and we know those cannot be part of a name. Also, our choice is more precise, which is always good.
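A minimal sketch of such a validation, assuming the pattern is anchored with ^ and $ so that the whole input must match. The function name isValidFullName is hypothetical. The last two lines show where this simple model stops being ideal.

```javascript
const fullName = /^[A-Z][a-z]* [A-Z][a-z]*$/;

// Hypothetical helper: returns true when the input obeys the model.
function isValidFullName(input) {
  return fullName.test(input);
}

console.log(isValidFullName("Tony Stark"));  // true
console.log(isValidFullName("tony stark"));  // false: parts must start uppercase
console.log(isValidFullName("Tony"));        // false: two parts are required
console.log(isValidFullName("Tony  Stark")); // false: exactly one space
console.log(isValidFullName("Tony Stark3")); // false: digits are not allowed

// The model is not perfect: real names like these are rejected too.
console.log(isValidFullName("Jean-Luc Picard")); // false
console.log(isValidFullName("Conan O'Brien"));   // false
```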

Try it.

Locating Vue component usage

Visual Studio Code is one of the most popular IDEs right now. And Vue.js is one of the most popular frontend frameworks. So, let’s try and find all the occurrences of an arbitrary <base-input> component throughout an application written with these tools.

Vue allows both PascalCase and kebab-case references to the component, so we have to account for that. Let’s also consider camelCase, just for lulzies.

Now, open the search panel on the left (or press Ctrl+Shift+F), locate the topmost text input, and switch the regexp mode on (or try Alt+R). Now we can find component references by a pattern:

<[Bb]ase(-i|I)nput\b
Screenshot of a search panel in Visual Studio Code editor
Here’s what it should look like

This magic pattern will find us both <BaseInput> and <base-input> usages, as well as the <baseInput> variant — all of them with or without attributes. Also, notice that the pattern does not match the “<base-Input>” variant (with the uppercase “i”), since it is neither of the desired cases.
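To check the pattern outside the editor, here is a sketch that puts the same pattern into a JavaScript regexp literal and runs it against a few made-up tags. It only assumes that the editor and JavaScript agree on this basic syntax.

```javascript
const componentUsage = /<[Bb]ase(-i|I)nput\b/;

const tags = [
  "<BaseInput>",
  '<base-input v-model="name" />',
  "<baseInput>",
  "<base-Input>",     // neither PascalCase, kebab-case nor camelCase
  "<BaseInputGroup>", // a different component: \b fails between "t" and "G"
];

const found = tags.map((tag) => componentUsage.test(tag));
console.log(found); // [ true, true, true, false, false ]
```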

Finding trailing commas in JSON

This example was inspired by this answer given by me on Stack Overflow. The OP struggled with erroneous JSON output and wanted to get rid of the trailing commas in the response.

One of the possible solutions was to find all commas that are followed by a closing bracket of any kind. This is the easiest approach so far, nevertheless it has some minor complications:

  • both closing brackets ] and } are similar to regexp’s special characters, so they both have to be escaped;
  • the closing bracket could be placed after an unknown amount of spaces, tabs and new lines;
  • the closing bracket itself, as well as the preceding whitespace, should not be included in the regexp’s output (i.e. in the match), so that we don’t accidentally delete it.

The first one is trivial, but what about the second one? Does it sound familiar? Yes, this is where the * quantifier is a perfect fit: “zero or more whitespace characters” is the same as \s*.

The third one though is something new — we don’t know yet how to check for something without matching it. So, let me briefly introduce the (?=…) kind of group. It is called “positive lookahead” and it does exactly what we need: it expresses the “followed by” logic, but drops the contents of itself from the final match.

Also, we’ll have to use the g (for “global”) flag to indicate that we expect several occurrences of a pattern throughout the input.

The g flag is by far the most popular flag of them all, so I’ll mention flags later in the next section of this article. Just bear with me for a minute.

All of the above comes together in the following regexp:

/,(?=\s*[\}\]])/g

Used as a first argument of .replace(…) method of the string, this regexp will allow changing all that’s found to anything else; so how about substituting them with empty strings?
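Here is a sketch of that substitution on a made-up piece of broken JSON. One caveat I should state plainly: the regexp knows nothing about JSON strings, so a comma followed by a closing bracket inside a string value would be removed as well.

```javascript
const trailingCommas = /,(?=\s*[\}\]])/g;

// Made-up input with two trailing commas.
const broken = '{ "names": ["Tony", "Howard", ], "count": 2, }';

// Replace every match with an empty string.
const fixed = broken.replace(trailingCommas, "");
console.log(fixed); // { "names": ["Tony", "Howard" ], "count": 2 }

// The result is valid JSON now.
const parsed = JSON.parse(fixed);
console.log(parsed.names.length, parsed.count); // 2 2
```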

Try it.

The rest of it

If you followed along with the text and got here in one piece, then I think it is safe to say that you know regexps and can get the most out of them. Hopefully, you will make them your own and use them throughout the development process (say, when finding code snippets in the project).

As I’ve mentioned before, not every single feature is addressed and described here. If you’d like to increase your knowledge and strengthen your skills, consider reading about these topics:

Also, it is always useful to know when not to use regexps, or use them with a sign of caution:

P. S.

Thank you so much for reaching the end of the article! It is so cool that you’ve read the whole thing! 🎉🎊🎈🙏

I’ve tried to make it average-paced and easy to follow. If you find it helpful, consider giving some claps (several perhaps? maybe 50? anyone?). If you’d like to read more of my stuff, please contact me via email or by leaving a comment in the corresponding section below.

Also, don’t forget to take a look at the reScaled npm package. It has a lot of useful utility functions that together will make it much easier to write complex regular expressions by defining and reusing smaller atomic ones.
