    A Developer's Guide to Understanding Regular Expressions

    October 18, 2024

    A regular expression (regex for short) is a sequence of characters that specifies a pattern for matching text. It is commonly used for data extraction and validation: it can extract data such as email addresses, IP addresses, or URLs from a piece of text, and it can validate user-provided data such as input fields, query string parameters, and a UUID in a URL.
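As a minimal sketch of both uses with Python's built-in re module: the sample text, the patterns, and the names IPV4_LIKE, UUID_PATTERN and is_uuid are assumptions made up for illustration, and the IP pattern is deliberately simplified.

```python
import re

# Hypothetical sample text; the addresses are made up for illustration.
text = "Client 10.0.0.7 failed, retrying via 192.168.1.20 in 5 seconds."

# Extraction: find every dotted quad that looks like an IPv4 address.
# This simplified pattern does not check that each part is <= 255.
IPV4_LIKE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
found_ips = IPV4_LIKE.findall(text)

# Validation: the whole input must have the 8-4-4-4-12 hex layout of a UUID.
UUID_PATTERN = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def is_uuid(value: str) -> bool:
    """Return True when the entire string has the textual shape of a UUID."""
    return UUID_PATTERN.fullmatch(value) is not None


print(found_ips)
print(is_uuid("123e4567-e89b-12d3-a456-426614174000"))
print(is_uuid("not-a-uuid"))
```

Extraction uses findall because the pattern may occur anywhere in the text, while validation uses fullmatch because the entire input has to fit the pattern.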

     

    In this article, I will focus on how to read a previously written regular expression. I will assume you already have a basic understanding of regular expressions and are familiar with the basic building blocks and their meaning, including greedy quantifiers like * , + , {1,3} and ? and their lazy counterparts *? , +? , {1,3}? and ?? .

     

    Implementation Engines for Regular Expressions

    There are 2 main types of implementations for processing regular expressions.

    1. DFA (Deterministic Finite Automaton)
    2. NFA (Non-Deterministic Finite Automaton)

     

    Most modern languages such as Python, JavaScript, Java, and Ruby use an NFA-based engine, and I will focus on this engine for the remainder of this article. If you are curious to learn about DFA and how it differs from NFA, I would suggest reading “Mastering Regular Expressions”. It has a detailed, example-led explanation of both implementation engines and an in-depth comparison of the two in different scenarios.

     

    Components of a Regular Expression

    For this article, I will assume you already know about the following components of a regular expression and the meaning of each including special characters like * , ? , . , *? , +? , ?? . 

     

    1. Literal text: Like a , b , G , \* , 2 , \[ etc.
    2. Character classes: Like [a-z] , [A-M] , \w , \W , \d , \D , \s , \S etc.
    3. Quantifiers: Like * , + , ? , {1,99} etc.
    4. Grouping parentheses: Capturing (...) , non-capturing (?:...) , named (?P<name>...) etc.
    5. Anchors: ^ , $ , \Z
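The following sketch puts these components together in one pattern. The log line and the pattern are assumptions invented for illustration; the (?P<name>...) syntax for named groups is the one used by Python's re module.

```python
import re

# Hypothetical log line used only for illustration.
line = "2024-10-18 ERROR disk full"

pattern = re.compile(
    r"^"                            # anchor: start of the string
    r"(?P<date>\d{4}-\d{2}-\d{2})"  # named capturing group built from \d and literal -
    r"\s"                           # shorthand class: one whitespace character
    r"(?:INFO|ERROR)"               # non-capturing group: groups, but stores nothing
    r"\s"
    r"([a-z ]+)"                    # numbered capturing group around a character class
    r"$"                            # anchor: end of the string
)

match = pattern.search(line)
print(match.group("date"))
print(match.groups())
```

Note that the non-capturing group does not appear in match.groups(); only the named group and the last parenthesized group are stored.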

     

    How Does Matching Work in Regular Expressions?

    The regex library processes both the expression and the given string from left to right, attempting to match the expression one character at a time. The first match that completes successfully wins and is returned by the library. To sum it up:

    1. Matching starts from the left
    2. The first match wins
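Both rules can be observed directly in Python. This is a small sketch with made-up inputs, not an exhaustive description of the engine:

```python
import re

text = "a1b22222"

# The scan starts at the left, so the short run "1" is found first and
# returned, even though a longer run of digits appears later.
first = re.search(r"\d+", text)
print(first.group(), first.span())

# With alternation, Python's engine tries the alternatives in order at
# each position and keeps the first one that succeeds.
ordered = re.search(r"cat|category", "category")
print(ordered.group())
```

The second search returns cat rather than category, because the first alternative already completes a match at index 0.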

     

    A Simple Example of the Execution of Regular Expression

    Let’s consider the example of a regex [0-9]+ against the string abc123efg56 . The execution of regex starts from the left and the match is carried out one character at a time. I will list down the steps taken by the execution engine for improved readability.

    1. [0-9] means match any single digit between 0 and 9 . The execution starts at index 0 of the string with the letter a . This is not a match, so execution moves forward to the next character.

    2. Then the execution tries the match with b at index 1, which is not a match either.

    3. Next, execution tries another failing match with c at index 2.

    4. Next, the execution finally succeeds at matching 1 at index 3 and adds this character to its state of successful matches. Since there is a + at the end of the expression (which means “one or more matches”), the engine keeps trying to match further digits.

    5. Execution moves ahead in the string to the character 2 at index 4, which is a successful match. It updates its state to show 12 as the matching sub-string.

    6. The next character 3 at index 5 is also a match, so the state now holds 123 .

    7. The next character e at index 6 is not a match. With this, the [0-9]+ part of the regex completes, and since there is nothing more in the regular expression to match, the engine returns the result 123 .

    8. You can see from the above that the match starts from the left and the first match wins. Even though the string contains another run of digits (i.e. 56 ), the execution engine never even tries it.

     

    In essence, the above regex [0-9]+ matches the first set of numeric characters and returns the result.
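The steps above can be sketched as a toy function. first_digit_run is a hypothetical helper written for illustration; it mimics the walkthrough for this one pattern only and is not how the re module is implemented.

```python
import re


def first_digit_run(text: str):
    """Toy re-implementation of searching for [0-9]+ (illustration only)."""
    for start in range(len(text)):       # try each start position, left to right
        end = start
        while end < len(text) and text[end] in "0123456789":
            end += 1                     # the + keeps consuming digits
        if end > start:                  # at least one digit matched
            return text[start:end], start, end
    return None                          # no digit anywhere in the text


toy_result = first_digit_run("abc123efg56")
real_match = re.search(r"[0-9]+", "abc123efg56")
all_runs = re.findall(r"[0-9]+", "abc123efg56")

print(toy_result)
print(real_match.group(), real_match.start(), real_match.end())
print(all_runs)
```

re.search stops at the first match, so 56 is only reported when you explicitly ask for every match, for example with re.findall.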

     

    Every regular expression, no matter how complex, can be broken down this way for better understanding. Take any regular expression that you want to understand, apply it to a simple example, and see how it executes the match. This process helps demystify the complex nature of some regular expressions. Let’s consider one more example to understand this further.

     

    Another Example of Understanding Relatively Complex Regular Expression

    Regex: https?://(?:(?:[^/\s]+)/?)+ 

    String: “A regular expression, sometimes referred to as rational expression, is a sequence of characters more at https://en.wikipedia.org/wiki/Character_(computing) that specifies a match pattern in the text.”

    This is a simple regex to extract a URL from a string. Let’s divide it into parts and try to understand what it is doing. 

    1. https? : Match literal text https or http. The ? at the end means s is optional, this allows the matching of http .

    2. :// : This is another set of literal characters to match the corresponding part of the URL.

    3. Next, consider the nested parentheses (?:[^/\s]+) first: The characters ?: right after the opening parenthesis make this a non-capturing group. The group still groups its contents, but the matched text is not stored, so for the purpose of reading the match you can ignore the ?: . [^/\s] matches any character that is not / or white space (excluding white space ends the match where the URL ends and the normal text begins), and the + at the end means one or more such characters. Together they match everything between https:// and the next forward slash / , which for the above example is en.wikipedia.org .

    4. (?:(?:[^/\s]+)/?)+ : We know that the nested group (?:[^/\s]+) matches the part of the URL between forward slashes. Let’s replace it with X to get (?:X/?)+ . The ?: can be ignored here as well, since it only means the match is not captured. /? matches a forward slash / , and the ? makes it optional because some URLs do not end with / . Finally, the + repeats the whole group one or more times, which lets the match continue until the end of the URL is reached.

     

    The match in the above example would be https://en.wikipedia.org/wiki/Character_(computing) . This regex will not work in every scenario and may need to be adjusted, but it is good enough for the many cases where you know the URLs in the text are delimited by white space. It can also serve as a starting point for validating user-provided URLs: anchor it as ^https?://(?:(?:[^/\s]+)/?)+$ so that the entire input must match. Keep in mind that it only checks the general shape of a URL, and that the nested + quantifiers can cause heavy backtracking on long inputs that fail to match. The design of any regular expression is application-dependent; depending on the use case, you can often simplify it greatly, which sometimes also improves performance.
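Here is the same regex and string run through Python's re module. The helper looks_like_url and the validation inputs are assumptions added for illustration; fullmatch is used as a stand-in for anchoring the pattern, since it requires the entire string to match.

```python
import re

URL_PATTERN = re.compile(r"https?://(?:(?:[^/\s]+)/?)+")

text = (
    "A regular expression, sometimes referred to as rational expression, "
    "is a sequence of characters more at "
    "https://en.wikipedia.org/wiki/Character_(computing) "
    "that specifies a match pattern in the text."
)

# Extraction: search scans the text and returns the first URL-shaped match.
match = URL_PATTERN.search(text)
print(match.group())


def looks_like_url(value: str) -> bool:
    """Hypothetical validator: the whole input must fit the URL pattern."""
    return URL_PATTERN.fullmatch(value) is not None


print(looks_like_url("http://example.com/docs"))
print(looks_like_url("example.com/docs"))
```

The second validation fails because the input does not start with http:// or https:// , so the engine gives up at the very first character.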

     

    Greedy and Lazy Matching

    Greedy quantifiers try to match as many characters of the given text as possible. Because of this eager matching, the regex engine sometimes has to backtrack to complete the match. Let’s consider the following example to explain this a bit further.

     

    import re

    re.search(r'.+b', 'aaabbcdef')
    # This will match "aaabb"

    In the above example, .+ first matches as many characters as it can. Since . can match any character (except a newline by default), it goes through the characters one by one, adding each to its state, including f , until the end of the string is reached. The engine then tries to match the next element of the regex, the literal b , but there are no characters left in the string. So it backtracks: .+ gives up f , and the engine tries b against f , which is not a match. It backtracks again to e , which is not a match either. It continues to backtrack through d and c and then reaches b at index 4, which is a match (note: the regex state at this point is aaabb ). The execution then ends because there is nothing left in the regex to match.
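The backtracking described above can be sketched as a toy function. greedy_dot_plus_b is a hypothetical helper that handles only this one pattern, only tries a match starting at index 0, and assumes the text contains no newline; the real engine is far more general.

```python
import re


def greedy_dot_plus_b(text: str):
    """Toy simulation of matching .+b from index 0 with backtracking."""
    end = len(text)                  # .+ first consumes every character
    while end >= 1:                  # .+ must keep at least one character
        if end < len(text) and text[end] == "b":
            return text[:end + 1]    # the literal b matched right after .+
        end -= 1                     # backtrack: .+ gives up one character
    return None                      # no way to satisfy the trailing b


toy_greedy = greedy_dot_plus_b("aaabbcdef")
real_greedy = re.search(r".+b", "aaabbcdef")

print(toy_greedy)
print(real_greedy.group(), real_greedy.span())
```

Because the search for b runs from the end of the string backwards, the last b in the text ends the match, which is why both b characters are included.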

     

    But what would happen if you added ? to make the dot matching lazy instead of greedy?

    re.search(r'.+?b', 'aaabbcdef')
    # This will match "aaab"

    In the above example, we add ? after the quantifier to trigger lazy matching. In greedy matching, the engine tried to match as much as possible and then fell back on backtracking when it could not complete the match of the entire regex. In lazy matching, the engine tries to match as little as possible and to complete the match as early as possible. Let’s go through this step by step using the above example.

     

    The match, as usual, starts from the left. . matches a at index 0 first. Because +? is lazy, the engine then tries to move on and match the next element of the regex, b , but it cannot match it with a at index 1, so it goes back and matches . with a at index 1 instead. It again tries to move on and match b with the next character, a at index 2, fails again, and goes back to . to match a at index 2. The next time it moves on, it tries to match b with the character b at index 3 and succeeds. Since nothing is left in the regex, it returns the match aaab .
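The lazy walkthrough can be sketched the same way. lazy_dot_plus_b is a hypothetical helper for this one pattern, starting at index 0 and assuming no newline in the text; it is an illustration, not the engine's actual implementation.

```python
import re


def lazy_dot_plus_b(text: str):
    """Toy simulation of matching .+?b from index 0."""
    end = 1                          # .+? starts with the minimum: one character
    while end < len(text):
        if text[end] == "b":
            return text[:end + 1]    # b matched, so the engine stops right away
        end += 1                     # b failed, so .+? takes one more character
    return None                      # ran out of text before finding b


toy_lazy = lazy_dot_plus_b("aaabbcdef")
real_lazy = re.search(r".+?b", "aaabbcdef")
real_greedy = re.search(r".+b", "aaabbcdef")

print(toy_lazy)
print(real_lazy.span(), real_greedy.span())
```

The lazy match ends at the first b it can reach, while the greedy match from the earlier example ends at the last one.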

     

    Practical Usage of Greedy Vs Lazy Matching

    Which variant of matching should you use? It depends on your use case. When both greedy and lazy matching return the correct result, greedy matching is usually the better choice, because a lazy quantifier has to check the rest of the pattern after every character it consumes. For example, the regex Subject: .*$ for extracting the subject of an email would typically be faster than its lazy counterpart, although the actual difference depends on the engine and the input.
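If you want to check this on your own inputs, a rough measurement with the standard timeit module is one option. The header text and the iteration count below are arbitrary assumptions, and the timings will vary by machine and Python version, so treat the numbers as indicative only.

```python
import re
import timeit

# Hypothetical email header line used only for this measurement.
header = "Subject: Quarterly report and planning notes for the next release"

greedy = re.compile(r"Subject: .*$")
lazy = re.compile(r"Subject: .*?$")

# Both variants return the same text for this input.
same_result = greedy.search(header).group() == lazy.search(header).group()

# Time 20,000 searches with each variant.
greedy_time = timeit.timeit(lambda: greedy.search(header), number=20_000)
lazy_time = timeit.timeit(lambda: lazy.search(header), number=20_000)

print(same_result)
print(f"greedy: {greedy_time:.4f}s  lazy: {lazy_time:.4f}s")
```

Measuring with inputs that resemble your real data is more reliable than a general rule, since the gap can be small for short strings.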

     

    Sometimes, greedy matching results in incorrect results, e.g. if you want to extract the contents of a tag from HTML as shown in the example below, the greedy match will result in incorrect data because it tries to greedily match everything and then perform backtracking to complete the match. 

    re.search(r'<B>.*</B>', '<B>Billions</B> and <B>Zillion</B> of suns')
    # This will return '<B>Billions</B> and <B>Zillion</B>'

    In the above example, backtracking completes the match, but the result spans multiple tags and everything in between, whereas we only want the contents of one tag. Using lazy matching returns the correct result.

    re.search(r'<B>.*?</B>', '<B>Billions</B> and <B>Zillion</B> of suns')
    # This will return '<B>Billions</B>'
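To pull out the contents of every tag rather than just the first, one option is to combine the lazy pattern with a capturing group and re.findall. This is a sketch on the same sample string; the negated character class variant is an alternative added here for comparison and assumes the tag contents contain no < character.

```python
import re

html = "<B>Billions</B> and <B>Zillion</B> of suns"

# Lazy quantifier: each match stops at the nearest closing tag.
lazy_contents = re.findall(r"<B>(.*?)</B>", html)

# Alternative: a negated class cannot run past the next "<" in the first place.
negated_contents = re.findall(r"<B>([^<]*)</B>", html)

# Greedy quantifier for comparison: one match that spans both tags.
greedy_contents = re.findall(r"<B>(.*)</B>", html)

print(lazy_contents)
print(negated_contents)
print(greedy_contents)
```

For anything beyond simple, well-formed snippets like this one, a real HTML parser is usually a safer choice than a regex.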

     

    Final Remarks

    To keep the article concise, I had to skip some details like possessive quantifiers, matching bounded repetitions {min,max}, capturing named matches, etc. I also could not discuss how to go about writing regular expressions from scratch and how to improve a previously written regular expression. However, I hope this article will help you get better at understanding regular expressions. I strongly recommend reading “Mastering Regular Expressions” to get a deeper understanding of the topic and understand the nuances of regular expressions that sometimes are overlooked and later cause bugs in the code.

     


      Saleem Latif

      I am a full-stack software developer with expertise in the backend using Python and a keen interest in the front-end and DevOps. I am an avid science fiction reader and want to become a writer and share my ideas.
