
Advanced regular expression matching with NSRegularExpression

Match text using flexible search criteria

Paul Hudson @twostraws

Previously I wrote an article about how to use regular expressions in Swift, but I want to go a step further and discuss how to get more fine-grained control over your regexes by customizing the options used.

Whenever you create a regex you get an option set to work with: NSRegularExpression.Options. In this article we’ll be looking at what control each of those options gives us, with practical code examples along the way.

Setting up

We need a sandbox to work with, so please create a new playground in Xcode and give it this code:

import Foundation

// look for the substring "the"
let pattern = "the"

// we're starting with no options for creating the regex
let regexOptions: NSRegularExpression.Options = []
let regex = try NSRegularExpression(pattern: pattern, options: regexOptions)

// a nice multi-line string to work with
let testString = """
The cat
sat on
the mat
"""

// NSRange counts UTF-16 code units, so measure the string with utf16 rather than utf8
let range = NSRange(location: 0, length: testString.utf16.count)

// check whether the string matches, and print one of two messages
if regex.firstMatch(in: testString, options: [], range: range) != nil {
    print("Match!")
} else {
    print("No match.")
}

Our regex searches for “the” anywhere in the string, and it will be found because it exists on the third line. Xcode should print “Match!”; if it doesn't, the rest of this article will be very confusing indeed.
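
Because every section below changes only the pattern and the options, it can help to wrap the setup in a small function. This is a minimal sketch, not part of the original setup: the `matches` helper and the `sample` constant are hypothetical names introduced here for illustration.

```swift
import Foundation

// Hypothetical helper: returns true when `pattern` matches anywhere in `text`
// using the given options.
func matches(_ pattern: String,
             options: NSRegularExpression.Options = [],
             in text: String) throws -> Bool {
    let regex = try NSRegularExpression(pattern: pattern, options: options)
    // NSRange counts UTF-16 code units, so build it from the String's own range.
    let range = NSRange(text.startIndex..., in: text)
    return regex.firstMatch(in: text, options: [], range: range) != nil
}

let sample = "The cat\nsat on\nthe mat"
let found = try matches("the", in: sample)
print(found ? "Match!" : "No match.")
```

With something like this in place, trying a different option is a one-line change to the call.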

allowCommentsAndWhitespace

This option tells the regex engine to ignore whitespace inside the pattern itself, which lets you space out a complicated regex so it is easier to read. It does not change how whitespace in the text being searched is handled, and that distinction matters when parsing user-entered text, because whitespace can be anywhere. As an example, look at this function signature in Swift:

func getUsername(from: [String: String]) -> String

There are lots of ways of writing that, but even if you discount the extreme options you could still see code like this:

func getUsername ( from : [String : String]) -> String

To match both of those, your pattern still has to allow for the optional whitespace explicitly, for example with \s*. What .allowCommentsAndWhitespace gives you is the freedom to put whitespace anywhere in the regex without it being treated as something to match. So, this will match:

let pattern = "t h e"
let regexOptions: NSRegularExpression.Options = [.allowCommentsAndWhitespace]

As for the “comments” part of .allowCommentsAndWhitespace, once whitespace in the pattern is ignored you can start to use comments inside your regular expression. These start with a # symbol, and everything from there to the end of the line is ignored. Comments are tied to ignoring whitespace because they are usually written across lines to make the pattern easier to read.

So, this will match:

let pattern = """
t # look for a t
[a-z] # then any lowercase letter
e # then an e
"""
let regexOptions: NSRegularExpression.Options = [.allowCommentsAndWhitespace]
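
One detail worth spelling out: the option affects whitespace written in the pattern, so whitespace in the text being searched still needs something like \s* to be matched. The sketch below applies that to the two function signatures from earlier; the names `signatures`, `spacedPattern`, `spacedRegex`, and `signatureResults` are made up for this example.

```swift
import Foundation

let signatures = [
    "func getUsername(from: [String: String]) -> String",
    "func getUsername ( from : [String : String]) -> String"
]

// A raw string, so backslashes reach the regex engine unchanged.
// Spaces, line breaks, and # comments in the pattern are ignored by the engine.
let spacedPattern = #"""
func \s+ getUsername    # keyword and function name
\s* \( \s*              # opening parenthesis with optional spaces around it
from \s* :              # parameter label and colon
"""#

let spacedRegex = try NSRegularExpression(pattern: spacedPattern,
                                          options: [.allowCommentsAndWhitespace])

// Check each signature against the same regex.
let signatureResults = signatures.map { signature -> Bool in
    let range = NSRange(signature.startIndex..., in: signature)
    return spacedRegex.firstMatch(in: signature, options: [], range: range) != nil
}
print(signatureResults) // [true, true]
```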

anchorsMatchLines

The ^ and $ metacharacters allow us to match the start and end of lines, but this often doesn’t work quite as you’d expect.

Regexes were originally designed to handle one line of text at a time, but nowadays it’s much more common to parse hundreds or even thousands at a time. To preserve backwards compatibility, most programmatic regex engines (i.e., ones you use in code) consider the start and end of the line to be the start and end of your whole text no matter how many line breaks it has.

To demonstrate the problem, try using these settings:

let pattern = "^sat"
let regexOptions: NSRegularExpression.Options = []

That looks for “sat” at the start of a line, and we can see that our text string has just that – but it won’t match, because by default ^ and $ match the start and end of the whole string.

To fix the problem we need to use the .anchorsMatchLines option, like this:

let pattern = "^sat"
let regexOptions: NSRegularExpression.Options = [.anchorsMatchLines]

And that will now match correctly.
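
Counting matches makes the difference easy to see. This is a small sketch using a different pattern from the one above, ^\w+, which asks for a word at the start of a line; the constant names are illustrative only.

```swift
import Foundation

let lines = "The cat\nsat on\nthe mat"
let lineRange = NSRange(lines.startIndex..., in: lines)

// Same pattern, compiled once without and once with .anchorsMatchLines.
let wholeText = try NSRegularExpression(pattern: "^\\w+", options: [])
let perLine = try NSRegularExpression(pattern: "^\\w+", options: [.anchorsMatchLines])

let wholeTextCount = wholeText.numberOfMatches(in: lines, options: [], range: lineRange)
let perLineCount = perLine.numberOfMatches(in: lines, options: [], range: lineRange)

print(wholeTextCount) // 1: only "The" at the very start of the string
print(perLineCount)   // 3: "The", "sat", and "the"
```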

caseInsensitive

This is probably the most commonly used regular expression option, and unless you’re working with very large strings it doesn’t have much of a performance impact.

Right now, this will match because we have the substring “the” in our test string:

let pattern = "the"
let regexOptions: NSRegularExpression.Options = []

However, this will not match, because regexes are case-sensitive by default:

let pattern = "THE"
let regexOptions: NSRegularExpression.Options = []

If you want to search for “the”, “THE”, “tHe” and all other case variations, you can collapse the case by using the .caseInsensitive option like this:

let pattern = "THE"
let regexOptions: NSRegularExpression.Options = [.caseInsensitive]

That will match, because the regex treats “THE” and “the” as the same.

Although this is common, you might prefer to be clear about which case variations are allowed. For example, you might want to match precisely “The” with a capital T, but then any three-letter word after it regardless of case:

let pattern = "The [A-Za-z]{3}"
let regexOptions: NSRegularExpression.Options = []
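
If you do use .caseInsensitive, it can be useful to see exactly which variations were found. The following sketch collects the matched substrings rather than just checking for a match; `story`, `anyCase`, and `words` are names chosen for this example.

```swift
import Foundation

let story = "The cat\nsat on\nthe mat"
let storyRange = NSRange(story.startIndex..., in: story)
let anyCase = try NSRegularExpression(pattern: "the", options: [.caseInsensitive])

// Convert each NSRange back into a Swift range to pull out the matched text.
let words = anyCase.matches(in: story, options: [], range: storyRange).compactMap { match -> String? in
    guard let range = Range(match.range, in: story) else { return nil }
    return String(story[range])
}
print(words) // ["The", "the"]
```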

dotMatchesLineSeparators

By default, the . metacharacter matches any single character except for line breaks, and is commonly used with quantifiers like * and + to match runs of unknown text.

Because it doesn’t match line breaks, these settings won’t match anything:

let pattern = "The.+cat.+sat"
let regexOptions: NSRegularExpression.Options = []

That looks for “The”, then one or more characters that aren't line breaks, then “cat”, then one or more characters that aren't line breaks, then “sat”. In our test string “sat” appears on a new line, so . can't bridge the gap.

To fix this and make the test string match, add the .dotMatchesLineSeparators option, like this:

let pattern = "The.+cat.+sat"
let regexOptions: NSRegularExpression.Options = [.dotMatchesLineSeparators]
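
To confirm what the second version actually captured, you can pull the matched text out of the string. This is a sketch with illustrative names (`strictDot`, `looseDot`, `matchedText`).

```swift
import Foundation

let text = "The cat\nsat on\nthe mat"
let textRange = NSRange(text.startIndex..., in: text)

let strictDot = try NSRegularExpression(pattern: "The.+cat.+sat", options: [])
let looseDot = try NSRegularExpression(pattern: "The.+cat.+sat",
                                       options: [.dotMatchesLineSeparators])

let strictMatch = strictDot.firstMatch(in: text, options: [], range: textRange)
let looseMatch = looseDot.firstMatch(in: text, options: [], range: textRange)

// Turn the match's NSRange back into text, if there was a match at all.
let matchedText = looseMatch
    .flatMap { Range($0.range, in: text) }
    .map { String(text[$0]) }

print(strictMatch == nil)            // true: . stops at the line break
print(matchedText ?? "no match")     // the match spans the first line break
```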

ignoreMetacharacters

Metacharacters are characters that have a special meaning in a regex rather than their literal one, e.g. . matches any character that isn’t a line break, * is the zero-or-more quantifier, and \d matches any digit.

Very rarely – perhaps if you were mixing regexes with non-regexes – you might want to treat your pattern string as a literal sequence of characters, ignoring the special meaning of any metacharacters. To do that, add the .ignoreMetacharacters option to your regex, like this:

let pattern = "The.+cat.+sat"
let regexOptions: NSRegularExpression.Options = [.ignoreMetacharacters]

Because we’re ignoring the meanings of . and +, that pattern won’t match “The cat sat”, whether it is written on one line or split across two, but it will match the literal string “The.+cat.+sat”.
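
If only part of a pattern should be literal, one alternative is NSRegularExpression.escapedPattern(for:), which returns a copy of a string with its metacharacters escaped so it can be embedded in a larger regex. The sketch below compares the two approaches; `candidates`, `hasMatch`, and the "Pattern:" prefix are invented for the example.

```swift
import Foundation

let literalText = "The.+cat.+sat"
let candidates = ["The cat sat", "Pattern: The.+cat.+sat"]

// Option 1: treat the whole pattern as literal text.
let literalRegex = try NSRegularExpression(pattern: literalText,
                                           options: [.ignoreMetacharacters])

// Option 2: escape just the literal part and mix it with real regex syntax.
let escaped = NSRegularExpression.escapedPattern(for: literalText)
let mixedRegex = try NSRegularExpression(pattern: "^Pattern:\\s" + escaped + "$",
                                         options: [])

// Hypothetical helper for this example only.
func hasMatch(_ regex: NSRegularExpression, _ candidate: String) -> Bool {
    let range = NSRange(candidate.startIndex..., in: candidate)
    return regex.firstMatch(in: candidate, options: [], range: range) != nil
}

let literalResults = candidates.map { hasMatch(literalRegex, $0) }
let mixedResults = candidates.map { hasMatch(mixedRegex, $0) }
print(literalResults) // [false, true]
print(mixedResults)   // [false, true]
```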

useUnicodeWordBoundaries

Regular expressions were first used in code 50 years ago, and although they had a formal mathematical definition it took quite some time to add a formal lexical definition.

One gray area for a long time was word boundaries: what constitutes the start and end of a word? As an example, consider this test string:

let testString = """
The child's cat
sat on
the mat
"""

You can search for the word “child” in that string by using the word boundary metacharacter, \b:

let pattern = "\\bchild\\b"
let regexOptions: NSRegularExpression.Options = []

That will match our new test string. But should it match? Our test string has “child’s”, so if you were looking specifically for the string “child” as a standalone word it would match incorrectly.

Fortunately, the Unicode Consortium got busy doing their usual excellent work of studying language, and wrote a formal definition of what constitutes a word boundary. The result is called Unicode TR#29, and you can enable it with your regular expressions by adding the .useUnicodeWordBoundaries option like this:

let pattern = "\\bchild\\b"
let regexOptions: NSRegularExpression.Options = [.useUnicodeWordBoundaries]

That will no longer match, because “child” doesn’t appear as a standalone word in the test string.
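
Here is the same comparison as a runnable sketch, so both behaviors can be checked side by side. The constant names are illustrative, and the results in the comments are the ones described above.

```swift
import Foundation

let sentence = "The child's cat\nsat on\nthe mat"
let sentenceRange = NSRange(sentence.startIndex..., in: sentence)

let simpleBoundaries = try NSRegularExpression(pattern: "\\bchild\\b", options: [])
let unicodeBoundaries = try NSRegularExpression(pattern: "\\bchild\\b",
                                                options: [.useUnicodeWordBoundaries])

let simpleHit = simpleBoundaries.firstMatch(in: sentence, options: [], range: sentenceRange) != nil
let unicodeHit = unicodeBoundaries.firstMatch(in: sentence, options: [], range: sentenceRange) != nil

print(simpleHit)  // true: the apostrophe counts as a word boundary
print(unicodeHit) // false: "child's" is treated as one word
```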

useUnixLineSeparators

This is an esoteric option for most of us, but it becomes useful if you’re working in a cross-platform environment.

Historically line breaks have been represented in a number of ways, and regexes are designed to work with them all. For example, Unix and macOS line breaks are written as \n, but Windows line breaks are written as \r\n.

If you specifically want your regexes to treat only the Unix/macOS line break, \n, as a line separator, you should use the .useUnixLineSeparators option, like this:

let regexOptions: NSRegularExpression.Options = [.useUnixLineSeparators]
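
One place the difference shows up is in how . behaves around a Windows line break. The sketch below rests on a stated assumption: by default the engine treats \r as a line separator that . will not match, and with .useUnixLineSeparators only \n is treated that way. The names `windowsText`, `defaultLines`, and `unixLines` are made up for the example.

```swift
import Foundation

// A string with a Windows-style line break between the two words.
let windowsText = "cat\r\ndog"
let windowsRange = NSRange(location: 0, length: windowsText.utf16.count)

// "cat." needs one more character after "cat" that is not a line separator.
let defaultLines = try NSRegularExpression(pattern: "cat.", options: [])
let unixLines = try NSRegularExpression(pattern: "cat.", options: [.useUnixLineSeparators])

let defaultHit = defaultLines.firstMatch(in: windowsText, options: [], range: windowsRange) != nil
let unixHit = unixLines.firstMatch(in: windowsText, options: [], range: windowsRange) != nil

print(defaultHit)
print(unixHit)
```

If the assumption holds, the first regex finds nothing because \r is a line separator, while the second lets . consume the \r.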

Where next?

We’ve covered the full range of NSRegularExpression.Options here, but if you want even more control you might want to investigate NSRegularExpression.MatchingOptions as well – these let you manipulate specific match calls rather than the regular expression itself.
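
As a brief taste of matching options, here is a sketch using .anchored, which limits matches to the start of the search range for that one call. The regex itself is unchanged between the two calls; `phrase`, `catRegex`, `anywhere`, and `atStart` are illustrative names.

```swift
import Foundation

let phrase = "The cat\nsat on\nthe mat"
let phraseRange = NSRange(phrase.startIndex..., in: phrase)
let catRegex = try NSRegularExpression(pattern: "cat", options: [])

// Same regex, two different match calls.
let anywhere = catRegex.firstMatch(in: phrase, options: [], range: phraseRange) != nil
let atStart = catRegex.firstMatch(in: phrase, options: [.anchored], range: phraseRange) != nil

print(anywhere) // true: "cat" appears in the string
print(atStart)  // false: the string does not start with "cat"
```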

You can also mix together most of the options listed above: NSRegularExpression.Options is a Swift option set, which means you can specify them as single items:

let regexOptions: NSRegularExpression.Options = .caseInsensitive

…or as arrays:

let regexOptions: NSRegularExpression.Options = [.caseInsensitive, .useUnicodeWordBoundaries]

Do whichever feels most natural for you.
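
Because this is an option set, you can also build the value up step by step and query it with the usual set operations. A minimal sketch, with `poem`, `combined`, and `lineStarts` as names chosen for the example:

```swift
import Foundation

let poem = "The cat\nsat on\nthe mat"
let poemRange = NSRange(poem.startIndex..., in: poem)

// Start with one option, then add another later.
var combined: NSRegularExpression.Options = [.caseInsensitive]
combined.insert(.anchorsMatchLines)

let lineStarts = try NSRegularExpression(pattern: "^the", options: combined)
let lineStartCount = lineStarts.numberOfMatches(in: poem, options: [], range: poemRange)

print(combined.contains(.caseInsensitive)) // true
print(lineStartCount) // 2: "The" on the first line and "the" on the third
```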
