Regular Expression (Regex) Syntax
A regex pattern must follow specific syntax rules. The syntax combines literal characters, special characters, and metacharacters. Together, these elements form a pattern that can be used to match specific sequences of characters in text.
/regex-pattern/flag
By convention, a regex pattern is surrounded by forward slashes (/). The first / marks the start of the regex pattern and the second / marks its end. After the closing /, one or more flags can be added to specify how the regex engine performs the search. For example, the g flag finds all occurrences of the pattern instead of stopping at the first one.
Regex Flags/Modifiers
Flags/Modifiers are optional parameters that can be added to a regex pattern to modify its behavior. For example, the "g" modifier makes the pattern global, meaning it will match all occurrences of the pattern in the text.
A flag is specified after the closing / of the regex pattern. E.g. in /hello/g, hello is the regex pattern and g is the flag.
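To make the effect of a flag concrete, here is a minimal sketch in JavaScript, one of the languages that uses the /pattern/flag form. The sample text is made up for illustration, and the i flag is shown only as a second example of a modifier.

```javascript
const text = "hello world, hello regex";

// Without a flag, match() stops at the first occurrence.
const first = text.match(/hello/);
console.log(first[0], first.index); // hello 0

// With the g flag, match() returns every occurrence.
const all = text.match(/hello/g);
console.log(all); // [ 'hello', 'hello' ]

// The i flag makes the match case-insensitive.
const ignoreCase = /HELLO/i.test(text);
console.log(ignoreCase); // true
```

Other languages expose the same ideas through different syntax, so treat the exact calls above as JavaScript-specific.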
Regex Engine
The regex engine does all the heavy lifting, from parsing a pattern to searching the text for strings that match it. Although different programming languages may use slightly different regex engines, the underlying principles are generally the same.
Understanding how the regex engine works lets you use its full power and helps you avoid common mistakes. The regex engine performs the following overall steps:
Compilation: The regex engine first compiles the regex pattern, transforming it into an internal representation that the engine can process efficiently. This internal representation is often referred to as a regex object or a compiled pattern, and it is used to search for matches within a given string.
Pattern matching: The regex engine starts at the beginning of the string (the leftmost position) and attempts to match the pattern at each character position in turn. If a match is found, the engine returns the match result, including the matched text, any captured groups, and possibly other details such as the position of the match.
A regex engine always returns the leftmost match. At the first character, it tries every possible way the pattern could match. Only after all of those attempts have failed does the engine move on to the second character in the text.
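The two steps above can be sketched in JavaScript roughly as follows. The pattern o and the variable names are chosen only for illustration: the input contains two o characters, so the result shows which one the engine reports.

```javascript
// Compilation: the RegExp constructor turns the pattern text into a regex object.
const pattern = new RegExp("o");

// Pattern matching: exec() scans from the left and returns the first match it finds.
const result = pattern.exec("Hello World!");

console.log(result[0]);    // o
console.log(result.index); // 4 (zero-based position of the leftmost "o")
```

The second o in World is never reported here, because the engine stops at the leftmost match unless a flag such as g asks it to continue.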
Let's see how the regex engine matches the pattern lo against the input string Hello World!
The regex engine compiles the pattern lo into its internal representation. Because the pattern is a simple sequence of literal characters, the compiled form is effectively the same. The engine then starts matching lo against the input string Hello World! by trying to match the first token in the regex, l, with the first character, H. This fails. There are no other possible permutations of this regex, because it merely consists of a sequence of literal characters. So the regex engine tries to match l with the second character, e. This fails too. Arriving at the 3rd character in the string, l matches l. The engine then tries to match the second token, o, with the 4th character, l, which fails because it needs o. At this point, the engine knows the regex cannot be matched starting at the 3rd character in the string. So it starts again at the 4th character, where l matches l. The engine then attempts to match the remainder of the regex at the 5th character and finds that o matches o. Thus, it has found the leftmost match, lo, in the input string, and it returns the result, including the matched text and its position.
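The walkthrough above can be imitated with a small JavaScript function. This is only a sketch for literal patterns, and findLeftmost is a hypothetical helper written for this illustration, not how a real regex engine is implemented. Note that the code uses zero-based indexes, so the 4th character of the string is index 3.

```javascript
// Hypothetical helper: tries a literal pattern at each start position, left to right.
function findLeftmost(pattern, text) {
  for (let start = 0; start + pattern.length <= text.length; start++) {
    let matched = 0;
    // Compare token by token until one fails or the whole pattern is consumed.
    while (matched < pattern.length && text[start + matched] === pattern[matched]) {
      matched++;
    }
    if (matched === pattern.length) {
      return { match: pattern, index: start };
    }
    // Otherwise give up on this start position and move one character to the right.
  }
  return null;
}

const found = findLeftmost("lo", "Hello World!");
console.log(found); // { match: 'lo', index: 3 }

// The built-in engine reports the same position.
const builtIn = /lo/.exec("Hello World!");
console.log(builtIn.index); // 3
```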
Regex Syntax Components
Regex supports many components that have a special meaning for defining the regex pattern. Let's have an overview of each component of the regex pattern:
Characters
The simplest type of regex component contains exact characters that must appear in the pattern.
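As a quick illustration in JavaScript, a pattern made only of literal characters matches exactly that sequence, including letter case. The sample sentence is made up for this sketch.

```javascript
const sentence = "The cat sat on the mat.";

// The literal pattern cat matches the same three characters in the text.
const hasCat = /cat/.test(sentence);
console.log(hasCat); // true

// Literal characters are case-sensitive unless a flag says otherwise.
const hasUpperCat = /Cat/.test(sentence);
console.log(hasUpperCat); // false

// search() returns the zero-based position of the first match.
const satIndex = sentence.search(/sat/);
console.log(satIndex); // 8
```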
Metacharacters
Metacharacters are special characters that have special meanings that help in defining more complex patterns. For example, the dot (.) matches any character except a newline.
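The dot can be tried out with a short JavaScript sketch. The sample strings are invented for illustration. The last line also shows that a backslash turns the dot back into a literal period.

```javascript
// The last "word" contains a newline between c and t.
const words = "cat cot cut c\nt";

// The dot stands for any single character except a newline.
const dotMatches = words.match(/c.t/g);
console.log(dotMatches); // [ 'cat', 'cot', 'cut' ]

// Escaping the dot with a backslash makes it match only a literal period.
const literalDot = "a.b axb".match(/a\.b/g);
console.log(literalDot); // [ 'a.b' ]
```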
Anchors
Regex anchors specify a position in a string where a match should occur. For example, the caret (^) matches the beginning of a line, and the dollar sign ($) matches the end of a line.
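Here is a small JavaScript sketch of both anchors. One JavaScript-specific detail to be aware of: without the m flag, the anchors apply to the start and end of the whole string, and with the m flag they apply to each line. The sample strings are made up.

```javascript
const line = "hello world";

const startsWithHello = /^hello/.test(line);
const startsWithWorld = /^world/.test(line);
const endsWithWorld = /world$/.test(line);
console.log(startsWithHello, startsWithWorld, endsWithWorld); // true false true

// With the m flag, ^ and $ match at the start and end of every line.
const lines = "one\ntwo\nthree";
const tLines = lines.match(/^t\w+$/gm);
console.log(tLines); // [ 'two', 'three' ]
```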
Character classes
These define sets of characters using square brackets [ ] and follow specific rules. For example, [a-z] matches any lowercase letter from a to z.
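A few character classes in action, as a JavaScript sketch with a made-up input string. The negated form [^...] is included because it follows the same bracket rules.

```javascript
const code = "aB3-xY9";

// [a-z] matches any single lowercase letter.
const lower = code.match(/[a-z]/g);
console.log(lower); // [ 'a', 'x' ]

// Ranges can be combined inside one pair of brackets.
const upperOrDigit = code.match(/[A-Z0-9]/g);
console.log(upperOrDigit); // [ 'B', '3', 'Y', '9' ]

// A caret right after [ negates the class: anything except letters and digits.
const others = code.match(/[^a-zA-Z0-9]/g);
console.log(others); // [ '-' ]
```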
Quantifiers
Quantifiers specify how many times the specified pattern should be repeated, such as the asterisk (*), which matches zero or more occurrences of the preceding character or group.
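The sketch below compares the asterisk with two other common quantifiers, + (one or more) and ? (zero or one), using JavaScript. The sample words are invented for illustration.

```javascript
const samples = ["ct", "cat", "caat"];

// a* allows zero or more "a" characters.
const star = samples.filter((s) => /^ca*t$/.test(s));
console.log(star); // [ 'ct', 'cat', 'caat' ]

// a+ requires at least one "a".
const plus = samples.filter((s) => /^ca+t$/.test(s));
console.log(plus); // [ 'cat', 'caat' ]

// a? allows at most one "a".
const optional = samples.filter((s) => /^ca?t$/.test(s));
console.log(optional); // [ 'ct', 'cat' ]
```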
Lookarounds
Lookarounds let you match a pattern only if another pattern comes immediately before or after it, without including that other pattern in the match. For example, (?<=Hello)\w+ matches the word characters that directly follow "Hello": the lookbehind (?<=Hello) checks for "Hello", and \w+ is the part that is returned.
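The following JavaScript sketch tries a lookbehind and a lookahead on made-up inputs. It assumes an engine with lookbehind support, which is the case in current JavaScript engines but not in every regex flavor. Note that the lookbehind must match the text directly before the word, so a space between the two words has to be part of it.

```javascript
// Lookbehind: match word characters only when "Hello" comes directly before them.
const joined = "HelloWorld".match(/(?<=Hello)\w+/);
console.log(joined[0]); // World

// With a space after "Hello", the same pattern finds nothing to match.
const spaced = "Hello World".match(/(?<=Hello)\w+/);
console.log(spaced); // null

// Including the space in the lookbehind fixes that.
const spacedFixed = "Hello World".match(/(?<=Hello )\w+/);
console.log(spacedFixed[0]); // World

// Lookahead: match a word only when "!" comes directly after it.
const beforeBang = "Hello World!".match(/\w+(?=!)/);
console.log(beforeBang[0]); // World
```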
Grouping and Capturing
You can group parts of your regex pattern using parentheses ( ) and refer back to them later.
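As a brief JavaScript sketch, the example below captures parts of a match, refers back to a group inside the pattern with \1, and reuses groups in a replacement with $1 and $2. The sample strings are made up for illustration.

```javascript
// Each pair of parentheses captures one part of the date.
const parts = "2024-05-17".match(/(\d{4})-(\d{2})-(\d{2})/);
console.log(parts[1], parts[2], parts[3]); // 2024 05 17

// \1 refers back to whatever the first group matched: here, a doubled character.
const hasDouble = /(\w)\1/.test("hello");
const noDouble = /(\w)\1/.test("world");
console.log(hasDouble, noDouble); // true false

// In a replacement string, $1 and $2 insert the captured groups.
const swapped = "John Smith".replace(/(\w+) (\w+)/, "$2, $1");
console.log(swapped); // Smith, John
```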
Learn about each of these components in detail in the next chapters.