A Crash Course in Regular Expression (Regex)
As the title of this blog implies, I am nothing but honest about how much I still have left to learn as a programmer. One such thing, which I had managed to avoid using for a long time, is regular expression. As a game developer I haven’t found myself in a lot of situations where I need to do anything but the simplest of tasks with strings, and as such I have managed to get by without it. If I did need to use regular expression, then it was usually so simple that a quick Google search would provide the answer.
However, I was finding myself turning to Google more often, and I am not usually someone who likes to just use something without understanding how it works, so I finally spent some time getting to grips with the basics. This article is me sharing my understanding of regular expression as well as a breakdown of some of the most common language features.
What is Regular Expression?
As I understand it, regular expression, or regex for short, is a string of characters that represents a pattern. We use these patterns to match all or part of some input string for a variety of reasons. For me, I often used regular expression to help me find certain keywords within debug logs, but you can use regex for all sorts of reasons. One common use is validation: checking that user input such as an email address, phone number or password matches the requirements for that type of input.
```
// Example regex pattern that matches text that has 0 or
// more digits followed by a whitespace character followed by 1 or
// more alphanumeric characters e.g. 124 ab42c
\d*\s\w+
```
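To see the pattern above in action, here is a minimal sketch using Python's built-in re module. Python is simply my choice for the examples in this article, and the input string is made up for illustration; other engines should behave the same way for a pattern this basic.

```python
import re

# The pattern from above: 0 or more digits, one whitespace character,
# then 1 or more word characters.
pattern = re.compile(r"\d*\s\w+")

# A made-up input string for illustration.
text = "124 ab42c"

match = pattern.search(text)
found = match.group() if match else None
print(found)  # 124 ab42c

# Because \d* allows zero digits, a string with no leading number still matches.
no_digits = pattern.search(" abc")
print(no_digits is not None)  # True
```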
All regular expressions are interpreted by a regular expression engine. A lot of modern programming languages, such as C#, C++ and Perl, come with a regular expression engine included; however, they are not all equal.
There are two types of regular expression engines, deterministic and non-deterministic. I am not going to pretend to understand how they work but I know from a practical standpoint deterministic engines tend to be faster, and don’t support backtracking, something we will get into later.
The Big Benefit of Regular Expression
The main draw for using regular expressions is that it allows us to define a set of strings within a concise and simple format in situations where we would otherwise need to specify every single variant ourselves.
For example, let's imagine we want to write a password validator for a website. For simplicity's sake our password requirements are extremely insecure. Our website requires that all passwords follow the requirements below:
- a password must start with the characters a and b, each used exactly once, in either order
- a password must then end with a 1 or a 2
Every time a user enters a password we need to check the string input in that password field to see if it matches our requirements. In a world where regex doesn’t exist a simple way to do this would be to write a simple list of acceptable words. In our example there are only four possible choices:
- ab1
- ba1
- ab2
- ba2
In the scenario above this solution is more than acceptable and would work in a lot of scenarios where we only have a small finite list of possible strings. All we would need to do is iterate through the list comparing the inputted password with each string in the list until we found a match or reached the end.
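Obviously, in the real world we would need much less restrictive requirements for passwords, but as we relax the requirements the number of possibilities grows quickly. A password made up of 10 different characters, each used exactly once, can be arranged in 3,628,800 (10!) different ways, and a password made up of any 10 lowercase letters has 26^10 possibilities, which is over 141 trillion. Suddenly this becomes a lot less feasible to store and search through in a list.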
If we ignore symbols and focus on characters a-z then we could validate a password that meets these requirements with the following regex pattern.
[a-z]{10}
No large list of strings to manage. No long iteration times searching through the entire list. Just a nice simple pattern!
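As a rough sketch of the difference, here are the two approaches side by side in Python. The function names is_valid_from_list and is_valid_from_pattern are made up for this example, and I am assuming we want the whole input to be exactly ten lowercase letters, which is why it uses re.fullmatch rather than searching for the pattern somewhere inside a longer string.

```python
import math
import re

# The four allowed passwords from the small example, checked by list lookup.
ALLOWED = ["ab1", "ba1", "ab2", "ba2"]

def is_valid_from_list(password: str) -> bool:
    return password in ALLOWED

# Ten lowercase letters, checked by a pattern instead of a list.
TEN_LETTERS = re.compile(r"[a-z]{10}")

def is_valid_from_pattern(password: str) -> bool:
    # fullmatch requires the whole input to match, not just part of it.
    return TEN_LETTERS.fullmatch(password) is not None

print(is_valid_from_list("ba2"))            # True
print(is_valid_from_pattern("abcdefghij"))  # True
print(is_valid_from_pattern("abc"))         # False

# How many entries would a list need to hold instead?
print(math.factorial(10))  # orderings of 10 distinct characters
print(26 ** 10)            # strings made of any 10 lowercase letters
```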
How Does Regular Expression Work
The three things we require for regular expression are a string, a regex pattern and the regex engine itself. The engine will move linearly through a string, attempting to match parts of the string that fulfil the requirements defined in the pattern.
Example: Failing a Match
Input: Hello
Pattern: elo
The above pattern is as easy as it looks and tries to match elo in the string “Hello”. A regex engine will start from the left of the string, in this case H, and will advance character by character until it finds a match or reaches the end of the input.
e
Hello
So starting at the letter H the regex engine will try to match the letter e. Since H is not equal to e no match is found and the engine moves to the next character in the string.
e
Hello
The next character is e, which matches the first character in the pattern elo. Since we have a match the engine will move to the next character in the string and try to match that character with the next character in the pattern.
el
Hello
The next character is l which matches the next character in the pattern l so again the engine moves to the next character and tries to match that with the next character in the pattern.
elo
Hello
The next character is l but the next character in the pattern is o so we don’t have a match. In this case the engine gives up on the attempt that began at e. It goes back to the character after the one where that attempt started, which is the first l, and starts again from the beginning of the pattern.
e
Hello
e
Hello
In the above the engine tries to match the letter e with the letter l in the string. This match also fails so the engine will move to the next character in the string which is also the letter l, and will again fail.
e
Hello
Finally, we reach the end of the string. The last character is o which again does not match the starting character in the pattern e. As we are at the end of the string the regex engine stops attempting to find a match to the pattern in the string.
Example: Finding a Match
Input: Helo!
Pattern: elo
In this example we use the same pattern but remove an l from the original string Hello in order to allow the match to succeed as well as adding an exclamation mark to the end.
e
Helo!
So starting at the letter H the regex engine will try to match the letter e. Since H is not equal to e no match is found and the engine moves to the next character in the string.
e
Helo!
The next character is e, which matches the first character in the pattern elo. Since we have a match the engine will move to the next character in the string and try to match that character with the next character in the pattern.
el
Helo!
The next character is l which matches the next character in the pattern l so again the engine moves to the next character and tries to match that with the next character in the pattern.
elo
Helo!
The next character is o which matches the last character in the pattern. Now the regex engine has successfully matched the entire pattern. Depending on the settings being used for the engine, it will either move to the next character and start trying to match the pattern in the rest of the string, or simply return after the first match.
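The two walkthroughs above can be sketched as a small Python function. This is only my simplified model of the scanning behaviour for patterns made of literal characters; find_literal is a made-up helper, and a real regex engine is far more sophisticated than this.

```python
def find_literal(text: str, pattern: str) -> int:
    """Return the index where pattern first matches in text, or -1."""
    # Try each starting position from left to right. Positions too close
    # to the end to fit the whole pattern are skipped.
    for start in range(len(text) - len(pattern) + 1):
        matched = 0
        # Compare the pattern character by character from this start.
        while matched < len(pattern) and text[start + matched] == pattern[matched]:
            matched += 1
        if matched == len(pattern):
            return start  # every character in the pattern matched
        # Otherwise abandon this attempt and move on to the next start.
    return -1

print(find_literal("Hello", "elo"))  # -1, the attempt starting at e fails on the second l
print(find_literal("Helo!", "elo"))  # 1, the match starts at e
```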
Language Overview
Character Escapes
A nice and easy one to start off with: character escapes allow us to use special characters that are part of the regex language literally. For example, a full stop (.) has a special meaning in regex, but if we just want to match a full stop within a string we can escape it with a backslash and write \. instead.
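Here is a short Python sketch of the difference an escape makes. The input text is made up for illustration, and re.escape is a helper from Python's re module that adds the backslashes for us.

```python
import re

# A made-up input string for illustration.
text = "version 1.0 or 1x0"

# Unescaped, the full stop matches any character, so both candidates match.
unescaped = re.findall(r"1.0", text)
print(unescaped)  # ['1.0', '1x0']

# Escaped, only a literal full stop matches.
escaped = re.findall(r"1\.0", text)
print(escaped)  # ['1.0']

# re.escape builds the escaped form when the text comes from somewhere else.
print(re.escape("1.0"))  # 1\.0
```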
Character Classes
Character classes allow any character within a defined range to count as a match. So instead of saying I want to match elo we could say we want to match [a-z]lo, which would mean any three-letter word that starts with any lowercase letter, followed by the characters lo e.g. alo, blo, clo, dlo, elo, etc.
Something more practical could be to validate that some string input only contains numbers. If each digit can be in the range 0-9 we could just specify the pattern [0-9][0-9][0-9]. This would allow us to validate that the inputted string is a 3 digit number without caring about the value of each digit. So to specify a range we use square brackets [], but we can alternatively use some predefined character classes for common groups of characters: \d for the digits [0-9], \w for any alphanumeric character or underscore [A-Za-z0-9_], and \s for whitespace, to name a few. Note that these predefined classes also use backslashes.
In addition we can also use negation (a ^ placed straight after the opening bracket) to specify characters we don’t want to match, for example [^abc] will match any character that isn’t a, b, or c.
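A few quick checks of these character classes in Python, with made-up inputs. One caveat I am aware of: for ordinary Python strings \d and \w also accept non-English digits and letters, so the bracketed ranges are the stricter option.

```python
import re

# [a-z]lo: any lowercase letter followed by lo.
words = re.findall(r"[a-z]lo", "alo blo Elo 9lo")
print(words)  # ['alo', 'blo']

# Three digits and nothing else, written with a range and with \d.
print(re.fullmatch(r"[0-9][0-9][0-9]", "042") is not None)  # True
print(re.fullmatch(r"\d\d\d", "04a") is not None)           # False

# \w covers letters, digits and the underscore, but not spaces or hyphens.
word_chars = re.findall(r"\w", "a_1 -")
print(word_chars)  # ['a', '_', '1']

# Negation: every character that is not a, b or c.
others = re.findall(r"[^abc]", "aXbYc")
print(others)  # ['X', 'Y']
```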
Anchors
Anchors are useful when we want to match text at a specific position within a string, such as at the start of a string (^) or at the end of a string ($).
Hello World
World Hello
Goodbye Hello World
Given the three strings above, if we just wanted to match Hello we could use the regex pattern Hello. If we wanted to match Hello only when it appears at the start of a string we would use ^Hello; likewise we would use Hello$ to match when the word appears at the end of a string.
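Running the three example strings through each pattern in Python shows which ones are picked up. This is just a sketch using re.search on each string separately.

```python
import re

lines = ["Hello World", "World Hello", "Goodbye Hello World"]

# No anchor: Hello can appear anywhere in the string.
anywhere = [s for s in lines if re.search(r"Hello", s)]
# ^ anchors the match to the start of the string.
at_start = [s for s in lines if re.search(r"^Hello", s)]
# $ anchors the match to the end of the string.
at_end = [s for s in lines if re.search(r"Hello$", s)]

print(anywhere)  # all three strings
print(at_start)  # ['Hello World']
print(at_end)    # ['World Hello']
```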
Quantifiers
Quantifiers allow us to specify how many times in a pattern we want to match a character or group of characters. Don’t worry, we will cover groups in a later section. To define an exact amount we use curly braces {}.
a{3} // match a three times: aaa
a{1,3} // match a one to three times: a, aa, aaa
One benefit of using this type of quantifier is that it can simplify certain regular expressions. If we wanted to match a UK phone number which starts with 07 followed by 9 digits we could write 07\d\d\d\d\d\d\d\d\d or shorten it to 07\d{9}.
If for whatever reason we don’t know how many characters we want to match, then regex has that covered as well. Two popular quantifiers are *, which matches zero or more of a character, and +, which matches one or more.
The quantifier * is especially helpful in scenarios where you aren’t 100% sure of the input, or you want to support different types of input such as whole and decimal numbers.
// match a number with at least one digit which can also be a decimal number
\d+\.?\d*
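Here is how these quantifiers behave in Python, as a rough sketch with made-up inputs. I am using re.fullmatch so that the whole input has to fit the pattern, and the names uk_number and decimal_number are just ones I picked for this example.

```python
import re

# Exact and ranged counts.
print(re.fullmatch(r"a{3}", "aaa") is not None)     # True
print(re.fullmatch(r"a{1,3}", "aa") is not None)    # True
print(re.fullmatch(r"a{1,3}", "aaaa") is not None)  # False, four is too many

# 07 followed by exactly nine digits.
uk_number = re.compile(r"07\d{9}")
print(uk_number.fullmatch("07123456789") is not None)  # True
print(uk_number.fullmatch("0712345678") is not None)   # False, one digit short

# At least one digit, an optional full stop, then zero or more digits.
decimal_number = re.compile(r"\d+\.?\d*")
results = {s: decimal_number.fullmatch(s) is not None for s in ["42", "3.14", ".5"]}
print(results)  # {'42': True, '3.14': True, '.5': False}
```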
Greedy vs Lazy Quantifiers
Quantifiers come in two flavours, so to speak: greedy and lazy. A greedy quantifier tries to match the longest string possible, and a lazy one the shortest. I have found that the distinction between the two only really comes up when you are trying to match a number of characters between a start and end point.
Pattern: <.*>
String: <h1>heading</h1>
Result: <h1>heading</h1>
In the above we use a greedy quantifier to match anything within the tags in some HTML code, which results in us grabbing the entire string. We can look at why this happens when we explore backtracking in a moment, but for now just know it will end up consuming everything between the first < and the last >, even if > characters exist between the two like in our string.
Pattern: <.*?>
String: <h1>heading</h1>
Result: <h1>
In the above we apply ? after our quantifier to tell the engine to treat it as lazy. In this case it now tries to consume as few characters as possible, so it stops at the first character that matches >.
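Both results are easy to reproduce in Python. This sketch also uses re.findall to show that the lazy version picks up each tag separately when asked for every match.

```python
import re

html = "<h1>heading</h1>"

# Greedy: .* grabs as much as it can before the final >.
greedy = re.search(r"<.*>", html).group()
print(greedy)  # <h1>heading</h1>

# Lazy: .*? grabs as little as it can before the next >.
lazy = re.search(r"<.*?>", html).group()
print(lazy)  # <h1>

# Asking for every lazy match returns each tag on its own.
all_tags = re.findall(r"<.*?>", html)
print(all_tags)  # ['<h1>', '</h1>']
```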
Backtracking
A greedy quantifier works because the engine supports what is referred to as backtracking, which allows the engine to return to a previous state to continue to search for a match. With the pattern <.*> the greedy .* will first consume all the remaining characters in the input string. The engine can then backtrack from the end of the string, giving back one character at a time, until the final > in the pattern finds a match.
As you can probably tell, this is more work for a regex engine, especially if the last > is not at the end of the string, as the engine has to iterate through the entire string and then backtrack, versus just stopping at the first occurrence of > when using a lazy quantifier.
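To make the extra work a bit more concrete, here is a very rough Python simulation of the two strategies for these two specific patterns. simulate_greedy and simulate_lazy are made-up helpers, not how a real engine is written; they assume the text sits on a single line and contains a <, and they only try the first < as a starting point. The last two lines compare the simulated results with what Python's own engine returns.

```python
import re

def simulate_greedy(text):
    """Roughly mimic <.*> by taking everything, then giving characters back."""
    start = text.index("<")      # assumes the text contains a <
    pos = len(text)              # .* has greedily consumed up to the end
    given_back = 0
    # The pattern now needs a > at pos. While there isn't one, backtrack.
    while pos > start and (pos >= len(text) or text[pos] != ">"):
        pos -= 1
        given_back += 1
    if pos == start:
        return None, given_back  # nothing left to give back, so no match
    return text[start:pos + 1], given_back

def simulate_lazy(text):
    """Roughly mimic <.*?> by taking nothing, then growing one character at a time."""
    start = text.index("<")
    pos = start + 1              # .*? starts by consuming nothing
    expanded = 0
    while pos < len(text) and text[pos] != ">":
        pos += 1
        expanded += 1
    if pos == len(text):
        return None, expanded    # reached the end without finding a >
    return text[start:pos + 1], expanded

# A made-up input where the last > is not at the end of the string.
sample = "<h1>heading</h1> trailing text"
greedy_match, given_back = simulate_greedy(sample)
lazy_match, expanded = simulate_lazy(sample)

print(greedy_match, given_back)  # <h1>heading</h1> 15
print(lazy_match, expanded)      # <h1> 2

# Cross-check the simulated matches against the real engine.
print(greedy_match == re.search(r"<.*>", sample).group())
print(lazy_match == re.search(r"<.*?>", sample).group())
```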
Groups
Another powerful feature of regular expression is the idea of grouping. We can use parentheses () to define subexpressions within a pattern and apply a bunch of the other language features we have talked about, such as quantifiers and anchors, to a subexpression instead of the whole pattern.
One use of grouping is to allow us to match a repeated pattern. If we had the string abc abc abc we could use the pattern abc\sabc\sabc. However, if we didn't know how many times it would repeat, or even if we did, we could simplify the pattern by using groups.
Pattern: (abc\s){3}
String: abc abc abc bdc
Match Result: abc abc abc (including the trailing space)
You might be wondering why we don't use a character class such as [abc\s]{3} or [abc\s]+. This would also match, but it would match any run of the characters defined in the class, such as aaa, abc or a b, since the engine would be looking for a repetition of any one of the characters in the class and not for repetitions of the sequence abc\s.
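The difference between a group and a character class shows up quickly in Python. This is a small sketch with made-up sample strings; note that the group needs whitespace after every abc, including the last one.

```python
import re

# A made-up sample string with three repetitions followed by something else.
text = "abc abc abc bdc"

# The group repeats the whole sequence "abc" plus one whitespace character.
grouped = re.search(r"(abc\s){3}", text)
print(repr(grouped.group()))   # 'abc abc abc '
print(repr(grouped.group(1)))  # 'abc ', the last repetition the group captured

# With only two repetitions available there is no match at all.
print(re.search(r"(abc\s){3}", "abc abc bdc"))  # None

# The character class repeats single characters taken from the set.
print(re.fullmatch(r"[abc\s]{3}", "aaa") is not None)  # True
print(re.fullmatch(r"[abc\s]{3}", "a b") is not None)  # True
print(re.fullmatch(r"(abc\s){3}", "aaa") is not None)  # False
```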
Back References
A back reference can be used in tandem with a group, allowing us to use the result "captured" by the group further on in the expression. So, if we had the pattern (\w+), which would capture a word in a string, then we could use the back reference \1 to reuse the text captured by that subexpression. The number refers to the capture group, since an expression can have more than one.
One example of when this could be used would be to capture when duplicate words appear after one another in some string.
Pattern: (\w+)\s\1
String: Hello Hello World
Match Result: Hello Hello
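The duplicate word example looks like this in Python. As an extra, this sketch uses re.sub to collapse the duplicate by replacing the whole match with the captured word; that step is my own addition rather than something the pattern requires.

```python
import re

text = "Hello Hello World"

# (\w+) captures a word, \s matches the space, \1 must repeat the captured word.
duplicate = re.search(r"(\w+)\s\1", text)
print(duplicate.group())   # Hello Hello
print(duplicate.group(1))  # Hello

# Replace each duplicated pair with a single copy of the captured word.
deduped = re.sub(r"(\w+)\s\1", r"\1", text)
print(deduped)  # Hello World
```

One thing to watch out for is that the pattern as written also matches when the second word merely starts with the first, for example the then.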
Summary
I hope this article gave you somewhat of an insight into what regular expression is, how it works and what it can be used to do. The language features touched upon should hopefully give you enough of an understanding to help you start writing your own expressions, but do know that this article is by no means exhaustive of all the features available.
One thing I found super helpful when learning regular expression was to practice solving problems with regex101 and RegexOne. Both are definitely worth doing if you are interested in learning more and furthering your understanding of what has been written in this article.