A Developer's Guide to Understanding Regular Expressions

Saleem Latif
Posted on October 18, 2024

9-10 Min Read Time

A regular expression (regex for short) is a sequence of characters that specifies a pattern for matching text. It is used for data extraction, such as pulling email addresses, IP addresses, or URLs out of text (a principle that underpins many data scraping services), and for validation of user-provided inputs, query string parameters, and UUIDs in a URL.
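
As a small illustration of these two uses, here is a minimal sketch with Python's built-in re module. The sample log line and both patterns are assumptions made up for this example, and the patterns are deliberately simplified rather than production-grade validators.

```python
import re

# Simplified patterns, assumed for illustration only.
IPV4_LIKE = re.compile(r'\b\d{1,3}(?:\.\d{1,3}){3}\b')
UUID = re.compile(
    r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}',
    re.IGNORECASE,
)

def extract_ips(text):
    """Extraction: return every IPv4-looking substring found in the text."""
    return IPV4_LIKE.findall(text)

def is_uuid(value):
    """Validation: the whole value must match, not just a part of it."""
    return UUID.fullmatch(value) is not None

log_line = 'Request from 192.168.0.1 forwarded to 10.0.0.42'
print(extract_ips(log_line))
print(is_uuid('123e4567-e89b-12d3-a456-426614174000'))
print(is_uuid('not-a-uuid'))
```

Extraction uses findall because the pattern may occur anywhere in the text, while validation uses fullmatch because a partial match should not count as valid input.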

In this article, I will focus on how to read a previously written regular expression. I will assume you already have a basic understanding of regular expressions and are familiar with the basic building blocks and their meaning, including greedy quantifiers like * , + , {1,3} and ? and their lazy counterparts *? , +? , {1,3}? and ?? .

Implementation Engines for Regular Expressions

There are 2 main types of implementations for processing regular expressions.

DFA (Deterministic Finite Automaton)
NFA (Non-Deterministic Finite Automaton)

Most modern languages such as Python, JavaScript, Java, Ruby, etc. use an NFA engine, and I will focus on this engine for the remainder of this article. If you are curious about DFA and how it differs from NFA, I would suggest reading “Mastering Regular Expressions”. It has a detailed, example-led explanation of both implementation engines and an in-depth comparison of the two in different scenarios.

Components of a Regular Expression

For this article, I will assume you already know about the following components of a regular expression and the meaning of each including special characters like * , ? , . , *? , +? , ?? .

Literal text: Like a , b , G , \* , 2 , \[ etc.
Character classes: Like [a-z] , [A-M] , \w , \W , \d , \D , \s , \S etc.
Quantifiers: * , + , ? , {1,99} and their lazy forms *? , +? , ?? , {1,99}?
Grouping parentheses: capturing (...) , non-capturing (?:...) , named (?P<name>...) etc.
Anchors: ^ , $ , \Z
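
To see these components working together, here is a minimal sketch using Python's re module. The pattern and the input string are hypothetical and chosen only to show each kind of component once.

```python
import re

# Hypothetical pattern combining the components listed above:
#   ^ and $        anchors
#   id-            literal text
#   (?P<num>\d+)   named capturing group around a character class
#   (?:-[a-z]+)?   optional non-capturing group
pattern = re.compile(r'^id-(?P<num>\d+)(?:-[a-z]+)?$')

match = pattern.search('id-42-beta')
print(match.group(0))      # the whole match
print(match.group('num'))  # the text captured by the named group
print(match.groups())      # only capturing groups are listed here
```

The non-capturing group still takes part in matching (it consumes -beta), but it does not show up in groups().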

How Does Matching Work in Regular Expressions?

The regex library processes both the string and the expression from left to right, attempting the match one character at a time. The first match that completes successfully wins and is returned by the library. To sum it up:

Matching starts from the left
The first match wins
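
These two rules can be observed directly with re.search, which returns only the first successful match. The sample sentence below is made up for this illustration.

```python
import re

# Hypothetical input with two runs of digits.
text = 'order 17 shipped, order 42 pending'

match = re.search(r'\d+', text)
print(match.group())  # the leftmost run of digits
print(match.start())  # the index where that match begins
```

The later run of digits is never reported by re.search, because the engine stops as soon as the leftmost match completes.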

A Simple Example of the Execution of Regular Expression

Let’s consider the example of a regex [0-9]+ against the string abc123efg56 . The execution of regex starts from the left and the match is carried out one character at a time. I will list down the steps taken by the execution engine for improved readability.

1. [0-9] means match any single digit from 0 to 9 . The execution starts at index 0 of the string with the letter a . This is not a match, so execution moves forward to the next character.

2. Then the execution tries the match with b at index 1 which is not a match either.

3. Next, execution tries another failing match with c at index 2.

4. Next, the execution finally succeeds at matching 1 at index 3 and adds this character to its state of successful matches. Since the expression ends with + (which means “one or more matches”), the engine keeps going and tries to match more digits.

5. Execution moves ahead in the string to the character 2 at the index 4 which is a successful match. It updates its state to show 12 as the matching sub-string.

6. Next character 3 at index 5 is also a match, so the state now holds 123 .

7. The next character, e at index 6, is not a match. This completes the [0-9]+ part of the regex, and since there is nothing left in the expression to match, the engine returns the result 123 .

8. As the steps above show, the match starts from the left and the first match wins: even though the string contains another run of digits (i.e. 56), the execution engine never tries it.

In essence, the above regex [0-9]+ matches the first set of numeric characters and returns the result.
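
The steps above can be mimicked with a small hand-written walk-through. This is a teaching sketch, not how the re module is implemented internally; the helper names is_ascii_digit and trace_digits are hypothetical. The last line compares the result against the real engine.

```python
import re

def is_ascii_digit(ch):
    """Equivalent of the character class [0-9] for a single character."""
    return '0' <= ch <= '9'

def trace_digits(text):
    """Walk through text the way the steps above describe for [0-9]+."""
    for start in range(len(text)):
        if not is_ascii_digit(text[start]):
            print(f'index {start}: {text[start]!r} is not a digit, move on')
            continue
        # First digit found: the + keeps consuming digits while it can.
        end = start
        while end < len(text) and is_ascii_digit(text[end]):
            end += 1
            print(f'index {end - 1}: matched, state is {text[start:end]!r}')
        # The first completed match wins; later digits are never tried.
        return text[start:end]
    return None

result = trace_digits('abc123efg56')
print(result)
print(re.search(r'[0-9]+', 'abc123efg56').group())
```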

Any regular expression, no matter how complex, can be broken down this way for better understanding. To understand a regular expression, apply it to a simple example and see how it executes the match. This process helps demystify the complex nature of some regular expressions. Let’s consider one more example to understand this further.

Another Example of Understanding Relatively Complex Regular Expression

Regex: https?://(?:(?:[^/\s]+)/?)+

String: “A regular expression, sometimes referred to as rational expression, is a sequence of characters more at https://en.wikipedia.org/wiki/Character_(computing) that specifies a match pattern in the text.”

This is a simple regex to extract a URL from a string. Let’s divide it into parts and try to understand what it is doing.

1. https? : Match literal text https or http. The ? at the end means s is optional, this allows the matching of http .

2. :// : This is another set of literal characters to match the corresponding part of the URL.

3. Next, first consider the nested parentheses (?:[^/\s]+) : The characters ?: right after the opening parenthesis make this a non-capturing group. It groups the pattern but does not store the matched text, so the ?: can be ignored when reading what the pattern matches. [^/\s] matches any character that is not / or white space (the white space ends the match where the URL ends and the normal text begins), and the + at the end means it matches all the characters between https:// and the next forward slash / , which for the above example is en.wikipedia.org

4. (?:(?:[^/\s]+)/?)+ : We know the regex in the nested parentheses (?:[^/\s]+) matches the part of the URL between forward slashes. Let’s replace that with X to get (?:X/?)+ . The ?: can be ignored here as well, since it only means do not capture the match. /? matches a forward slash / , with ? making it optional because some URLs do not end with / . Finally, + means one or more repetitions of the group, which enables matching until the end of the URL is reached.

The match in the above example would be https://en.wikipedia.org/wiki/Character_(computing) . The regex above may not work in certain scenarios and would need to be updated, but it can be good enough for a wide range of cases where you know the URLs in the text are space delimited. This regex would also be useful if you want to validate user-provided URLs; you might need to update it to ^https?://(?:(?:[^/\s]+)/?)+$ to make sure the whole input is a URL. The design of any regular expression is application-dependent, and based on the use case you can often simplify it greatly, which sometimes also means better performance.
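
The extraction described above can be checked with Python's re module. The validation part of this sketch uses fullmatch, which has a similar effect to anchoring the pattern with ^ and $ ; the two short URLs passed to it are made-up sample inputs.

```python
import re

URL = re.compile(r'https?://(?:(?:[^/\s]+)/?)+')

text = ('A regular expression, sometimes referred to as rational expression, '
        'is a sequence of characters more at '
        'https://en.wikipedia.org/wiki/Character_(computing) '
        'that specifies a match pattern in the text.')

# Extraction: search finds the URL wherever it sits in the text.
found = URL.search(text)
print(found.group())

# Validation: fullmatch requires the entire input to match the pattern.
print(URL.fullmatch('http://example.com/docs/') is not None)
print(URL.fullmatch('ftp://example.com') is not None)
```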

Greedy and Lazy Matching

Greedy quantifiers try to match as many characters of the given text as possible, and because of this eager matching, the regex engine sometimes has to backtrack to complete the match. Let’s consider the following example to explain this a bit further.

import re
re.search(r'.+b', 'aaabbcdef')
# This will match "aaabb"

In the above example, .+ first matches one character at a time. Since . can match any character, it goes through all the characters one by one, storing each into its state, including f, until the end of the string is reached. Execution then tries to match the next part of the regex, the literal b, but there are no more characters in the string, so it backtracks: .+ gives up f, which does not match b either. It then gives up e, which is not a match either. It continues to backtrack through d and c and then reaches b at index 4, which is a match. (Note: the regex state at this point is aaabb.) The execution then ends because there is nothing left in the regex to match.

But what would happen if you added ? to make the dot matching lazy instead of greedy?

re.search(r'.+?b', 'aaabbcdef')
# This will match "aaab"

In the above example, we are using ? specifier to trigger lazy matching. In greedy matching, execution tried to match as much as possible and then deferred to backtracking when it could not complete the match of the entire regex. In lazy matching, the execution engine tries to do as little work as possible and tries to complete the match as early as possible. Let’s explain this further step by step using the above example.

The match, as usual, starts from the left. The . matches a at index 0 first, but since +? is lazy, the engine then tries to move on and match the next part of the regex, b . It cannot match b with the character a at index 1, so it goes back and lets . match a at index 1 instead. It again tries to match b from the regex with the next character, a at index 2, fails again, and goes back to let . match a at index 2. The next time it moves on, it tries to match b with the character b at index 3 and succeeds. Since nothing is left in the regex, it returns the match aaab .
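
The two walk-throughs above can be sketched as a pair of hand-written simulations. These are simplified teaching aids rather than the engine's actual implementation: the function names are hypothetical, only a match attempt starting at index 0 is simulated, and the input is assumed to contain no newlines (which . would not match by default).

```python
import re

def greedy_dot_plus_b(text):
    """Simulate .+b from index 0: grab everything, then give characters back."""
    end = len(text)              # .+ first consumes the whole string
    while end >= 1:              # .+ must keep at least one character
        if end < len(text) and text[end] == 'b':
            return text[:end + 1]
        end -= 1                 # backtrack: .+ gives up one character
    return None

def lazy_dot_plus_b(text):
    """Simulate .+?b from index 0: take one character, then grow as needed."""
    end = 1                      # .+? starts with the minimum, one character
    while end < len(text):
        if text[end] == 'b':
            return text[:end + 1]
        end += 1                 # extend .+? by one more character
    return None

sample = 'aaabbcdef'
print(greedy_dot_plus_b(sample), re.search(r'.+b', sample).group())
print(lazy_dot_plus_b(sample), re.search(r'.+?b', sample).group())
```

The greedy version works backwards from the end of the string, while the lazy version works forwards from the start, which is why they stop at different b characters.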

Practical Usage of Greedy Vs Lazy Matching

Which variant of matching should you use? That depends on your use case. When both greedy and lazy matching return correct results, greedy matching is often the better choice, because a lazy quantifier has to check after every character whether the rest of the regex can match. For example, the regex Subject: .*$ for extracting the subject of an email would typically be faster than its lazy counterpart.
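
Here is a minimal sketch of the subject-line case in Python. The raw message is made up for this example, and the ^ anchor, the capturing group, and the re.MULTILINE flag are additions assumed here so that the pattern works on a multi-line header block.

```python
import re

# Hypothetical raw message, made up for this example.
raw_email = (
    'From: alice@example.com\n'
    'Subject: Quarterly report is ready\n'
    'To: bob@example.com\n'
)

# re.MULTILINE lets ^ and $ match at the start and end of each line.
# By default . does not match a newline, so .* stops at the line end.
greedy = re.search(r'^Subject: (.*)$', raw_email, re.MULTILINE)
lazy = re.search(r'^Subject: (.*?)$', raw_email, re.MULTILINE)

print(greedy.group(1))
print(greedy.group(1) == lazy.group(1))
```

Both variants return the same subject here. The greedy one reaches the end of the line in a single sweep, whereas the lazy one tests $ after every character.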

Sometimes, greedy matching produces incorrect results. For example, if you want to extract the contents of a tag from HTML as shown below, the greedy match returns the wrong data because it greedily matches everything and then backtracks only as far as needed to complete the match.

re.search(r'<B>.*</B>', '<B>Billions</B> and <B>Zillion</B> of suns')
# This will return '<B>Billions</B> and <B>Zillion</B>'

In the above example, backtracking completes the match, but the result spans multiple tags and everything in between, whereas we only want the contents of one tag. Using lazy matching returns the correct result.

re.search(r'<B>.*?</B>', '<B>Billions</B> and <B>Zillion</B> of suns')
# This will return '<B>Billions</B>'
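
To pull out the contents of every tag rather than just the first, the same lazy pattern can be used with re.findall. This is a minimal sketch; the capturing group around .*? is an addition assumed here so that only the text between the tags is returned.

```python
import re

html = '<B>Billions</B> and <B>Zillion</B> of suns'

# Lazy: each match stops at the nearest closing tag.
lazy_contents = re.findall(r'<B>(.*?)</B>', html)
# Greedy: a single match runs to the last closing tag.
greedy_contents = re.findall(r'<B>(.*)</B>', html)

print(lazy_contents)
print(greedy_contents)
```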

Final Remarks

To keep the article concise, I had to skip some details like possessive quantifiers, matching bounded repetitions {min,max}, capturing named matches, etc. I also could not discuss how to go about writing regular expressions from scratch and how to improve a previously written regular expression. However, I hope this article will help you get better at understanding regular expressions. I strongly recommend reading “Mastering Regular Expressions” to get a deeper understanding of the topic and understand the nuances of regular expressions that sometimes are overlooked and later cause bugs in the code.
