Don't Be Afraid of Regular Expressions

Regular expressions are very powerful and intimidating at the same time. Let's see why there's no reason to fear them!

I know, I know. How on Earth can you not be afraid if something like this haunts you in your dreams:

^[\w]{1,}[\w.+-]{0,}@[a-zA-Z0-9]{1,}[\w-]{1,}([.][a-zA-Z]{2,}|[.][a-zA-Z0-9]{1,}[\w-]{1,}[.][a-zA-Z]{2,})$

Actually, it's pretty easy to use and modify regular expressions. All you need is to take time to understand the structure. I see a lot of people asking for the right pattern for this and for that instead of investing in themselves and learning the rules behind regex.

In this article, I would like to help you, so you no longer need to spend hours on Stack Overflow looking for a pattern that finally works.

I can guarantee that once you read this short article, you will understand the pattern I frightened you with at the beginning. But on top of that, you'll be able to modify it to fit your needs!

E-mail

In case you didn't recognize it, the pattern above is for matching an e-mail address. Let's take a look at how to construct it step by step.

We want to make sure that the e-mail address always starts with a word character. A word character is any English letter, digit or underscore. We use the ^ sign to mark the position at the start of the string. We use [] to define a set of individual characters, like [abcde], or a range if we add the - sign, like [a-e]. We could use the [a-zA-Z0-9_] pattern to include all lowercase and uppercase letters, all ten digits and the underscore, but there's a shortcut, \w, which gives us the same result:

^[\w]
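If you want to convince yourself that the shortcut and the explicit ranges really are the same thing, here is a minimal sketch. It assumes Python and its built-in re module (the article itself doesn't depend on any language), and it passes the re.ASCII flag because Python's \w would otherwise also accept non-English letters.

```python
import re

# Explicit ranges versus the \w shortcut (re.ASCII limits \w to English letters, digits, underscore).
explicit = re.compile(r"[a-zA-Z0-9_]")
shortcut = re.compile(r"[\w]", re.ASCII)

# A list of characters versus the equivalent range.
listed = re.compile(r"[abcde]")
ranged = re.compile(r"[a-e]")

ascii_chars = [chr(code) for code in range(128)]

explicit_set = {c for c in ascii_chars if explicit.fullmatch(c)}
shortcut_set = {c for c in ascii_chars if shortcut.fullmatch(c)}
listed_set = {c for c in ascii_chars if listed.fullmatch(c)}
ranged_set = {c for c in ascii_chars if ranged.fullmatch(c)}

print(explicit_set == shortcut_set)  # True
print(listed_set == ranged_set)      # True
print(len(shortcut_set))             # 63 (26 + 26 letters, 10 digits, 1 underscore)
```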

Next, we want to make sure that there is at least one such character:

^[\w]{1,}

Next, we want to allow any word character or one of three special characters .+- in the name, because we should allow jan.zavrel, jan-zavrel and even jan+zavrel. This way, we can be sure that the e-mail won't start with a dot, plus or hyphen, but can contain these special characters in any position other than the first:

^[\w]{1,}[\w.+-]

And of course, there doesn't have to be any such character, because an e-mail address can have only one word character in front of the @ character. In other words, we should allow it, but not force it:

^[\w]{1,}[\w.+-]{0,}
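To see what this part accepts so far, here is a small sketch, assuming Python's built-in re module. The helper name local_part is made up for this illustration, and re.ASCII keeps \w limited to English letters, digits and the underscore.

```python
import re
from typing import Optional

# Pattern built so far: one or more word characters first,
# then zero or more word characters, dots, pluses or hyphens.
LOCAL_PART = re.compile(r"^[\w]{1,}[\w.+-]{0,}", re.ASCII)

def local_part(text: str) -> Optional[str]:
    """Return the matched prefix, or None if the text starts with a disallowed character."""
    match = LOCAL_PART.search(text)
    return match.group(0) if match else None

for candidate in ["jan.zavrel", "jan-zavrel", "jan+zavrel", "j", ".jan", "+jan"]:
    print(candidate, "->", local_part(candidate))
```

The first four candidates come back unchanged, while .jan and +jan give None because of the leading special character. As a side note, {1,} and {0,} can also be written as + and *; I stick to the curly braces here because they make the minimum count explicit.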

Next comes the @ character. It is mandatory, and there can be only one in the whole e-mail address:

^[\w]{1,}[\w.+-]{0,}@

Right behind the @ character, we want to have a domain name. Here, we can define how many characters we want as a minimum and from which range of characters. I would go for all word characters including the hyphen [\w-] and I want at least two of them {2,}. If you want to allow domains like t.co, you would have to allow one character from this range {1,}:

^[\w]{1,}[\w.+-]{0,}@[\w-]{2,}

But this would allow even domain names like -_-.net, right? Yes, it would. And that's something we need to fix. Do you remember how we limited what can be at the very beginning of the e-mail address? Well, we can use a similar approach here as well.

So right behind the @ character we want just a letter or a digit: no hyphens and no underscores. Unfortunately, there's no shortcut for this, so we need to specify it with three ranges [a-zA-Z0-9], and we need at least one such character right behind the @ character {1,}:

^[\w]{1,}[\w.+-]{0,}@[a-zA-Z0-9]{1,}[\w-]{2,}

But now the pattern requires at least three characters for the domain name: one from the first set and two from the second. To return to just two, we lower the second quantifier to {1,}:

^[\w]{1,}[\w.+-]{0,}@[a-zA-Z0-9]{1,}[\w-]{1,}
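Here is a quick way to compare the loose domain pattern with the stricter one, sketched in Python with the built-in re module. The names LOOSE, STRICT and matched_prefix are just labels I picked for this example. Since the pattern has no end marker yet, the helper reports how much of the address it matched.

```python
import re
from typing import Optional

# Before the fix: any word character or hyphen may follow the @.
LOOSE = re.compile(r"^[\w]{1,}[\w.+-]{0,}@[\w-]{2,}", re.ASCII)
# After the fix: a letter or digit must follow the @, then at least one more character.
STRICT = re.compile(r"^[\w]{1,}[\w.+-]{0,}@[a-zA-Z0-9]{1,}[\w-]{1,}", re.ASCII)

def matched_prefix(pattern: "re.Pattern[str]", text: str) -> Optional[str]:
    """Return the part of text the pattern matched, or None if it did not match."""
    match = pattern.search(text)
    return match.group(0) if match else None

print(matched_prefix(LOOSE, "jan@-_-.net"))      # jan@-_-
print(matched_prefix(STRICT, "jan@-_-.net"))     # None
print(matched_prefix(STRICT, "jan@_ab.net"))     # None
print(matched_prefix(STRICT, "jan@example.net")) # jan@example
print(matched_prefix(STRICT, "jan@t.co"))        # None (only one character before the dot)
```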

Next, we need to deal with two cases. Either there's just the domain name followed by the domain extension, or there's a subdomain name followed by the domain name followed by the extension. For example, abc.com versus abc.co.uk. To make this work, we use the (a|b) token, where a stands for the first case, b stands for the second case and | stands for logical OR. The first case is just the domain extension, but since the extension is always there, we can safely add it to both cases. The domain extension always starts with a dot [.], followed by letters, and we require at least two letters {2,}. So we add the pattern [.][a-zA-Z]{2,} to both cases:

^[\w]{1,}[\w.+-]{0,}@[a-zA-Z0-9]{1,}[\w-]{1,}([.][a-zA-Z]{2,}|[.][a-zA-Z]{2,})

Now, for the second case, we add the domain name in front of the domain extension, which turns the original domain name into a subdomain. The domain name can consist of word characters and hyphens, and again we want at least two characters here, with no hyphen or underscore at the beginning of the domain name:

^[\w]{1,}[\w.+-]{0,}@[a-zA-Z0-9]{1,}[\w-]{1,}([.][a-zA-Z]{2,}|[.][a-zA-Z0-9]{1,}[\w-]{1,}[.][a-zA-Z]{2,})
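To make the two branches of the (a|b) group easier to see, this sketch tests each branch separately against just the tail of an address (everything after abc in abc.com or abc.co.uk). It assumes Python's re module, and classify_tail is a hypothetical helper, not something the pattern itself needs.

```python
import re
from typing import Optional

FIRST_CASE = r"[.][a-zA-Z]{2,}"                           # .com
SECOND_CASE = r"[.][a-zA-Z0-9]{1,}[\w-]{1,}[.][a-zA-Z]{2,}"  # .co.uk

first = re.compile(FIRST_CASE)
second = re.compile(SECOND_CASE, re.ASCII)
# The same two branches joined with | as in the article's pattern.
either = re.compile("(" + FIRST_CASE + "|" + SECOND_CASE + ")", re.ASCII)

def classify_tail(tail: str) -> Optional[str]:
    """Say which branch accepts the whole tail, or None if neither does."""
    if first.fullmatch(tail):
        return "first case"
    if second.fullmatch(tail):
        return "second case"
    return None

for tail in [".com", ".co.uk", ".c", ".x.uk"]:
    print(tail, "->", classify_tail(tail), "| group matches:", bool(either.fullmatch(tail)))
```

Here .com lands in the first case and .co.uk in the second, while .c (extension too short) and .x.uk (domain name shorter than two characters) are rejected by both branches.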

Finally, we need to mark the end of the whole pattern:

^[\w]{1,}[\w.+-]{0,}@[a-zA-Z0-9]{1,}[\w-]{1,}([.][a-zA-Z]{2,}|[.][a-zA-Z0-9]{1,}[\w-]{1,}[.][a-zA-Z]{2,})$

Go here and test if your e-mail matches the pattern: https://regex101.com/r/cUuG4K/1
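If you prefer to test locally, the finished pattern can be tried out with a few lines of Python. This is only a sketch under my own assumptions: is_email is a made-up helper name, the sample addresses are invented, and re.ASCII is there to keep \w limited to English letters, digits and the underscore.

```python
import re

EMAIL = re.compile(
    r"^[\w]{1,}[\w.+-]{0,}@[a-zA-Z0-9]{1,}[\w-]{1,}"
    r"([.][a-zA-Z]{2,}|[.][a-zA-Z0-9]{1,}[\w-]{1,}[.][a-zA-Z]{2,})$",
    re.ASCII,
)

def is_email(address: str) -> bool:
    """True if the address matches the pattern built in this article."""
    return EMAIL.search(address) is not None

samples = [
    "jan.zavrel@example.com",  # accepted
    "jan+zavrel@abc.co.uk",    # accepted (second case of the group)
    ".jan@example.com",        # rejected: starts with a dot
    "jan@-_-.net",             # rejected: domain starts with a hyphen
    "jan@@example.com",        # rejected: two @ characters
    "jan@example",             # rejected: no extension
    "jan@t.co",                # rejected: domain name shorter than two characters
]

for address in samples:
    print(address, "->", is_email(address))
```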

Now, I'm pretty sure if you try hard enough, you will find a perfectly legit e-mail address that won't match this pattern, but that's not the point. The point is that you can alter this basic structure to pinpoint exactly the pattern you need. The power is in your hands.

Password

OK, let's try something that will be a piece of cake for you now. Let's say we need to create a pattern for a password with these requirements:

  • length must be between 8 and 16 characters
  • must include at least one uppercase and one lowercase letter
  • must include at least one number and at least one special character (@, *, $ or #)

We will start with a simple dot . which matches any single character:

.

Next, we will set the range of characters between 8 and 16:

.{8,16}
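One thing worth checking when you try this yourself: the pattern has no ^ and $ around it, so a tool that only searches for a match somewhere in the text will happily find 16 characters inside a longer string. In this sketch I assume Python's re module and use fullmatch, which requires the whole string to match; length_ok is a name I made up for the example.

```python
import re

LENGTH = re.compile(r".{8,16}")

def length_ok(password: str) -> bool:
    """True only if the whole password is 8 to 16 characters long."""
    return LENGTH.fullmatch(password) is not None

too_long = "a" * 20

# search() is satisfied by the first 16 characters of the 20-character string...
print(len(LENGTH.search(too_long).group(0)))  # 16
# ...while fullmatch() insists that the entire string fits the range.
print(length_ok(too_long))   # False
print(length_ok("a" * 8))    # True
print(length_ok("a" * 7))    # False
```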

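Next, we will add a so-called positive lookahead (?=.*[a-z]) in front of our pattern. A lookahead checks what follows the current position without consuming any characters, so this one only verifies that at least one lowercase letter exists somewhere ahead.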

(?=.*[a-z]).{8,16}
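A short sketch of what the lookahead does, again assuming Python's re module and fullmatch so the whole string has to fit. The sample strings are invented for the illustration.

```python
import re

HAS_LOWER = re.compile(r"(?=.*[a-z]).{8,16}")

# One lowercase letter anywhere is enough to satisfy the lookahead.
print(HAS_LOWER.fullmatch("ABCDEFGh") is not None)  # True
print(HAS_LOWER.fullmatch("ABCDEFGH") is not None)  # False

# On its own, the lookahead matches at position 0 and consumes nothing,
# which is why .{8,16} still gets to count every character afterwards.
probe = re.match(r"(?=.*[a-z])", "ABCDEFGh")
print(probe.end())  # 0
```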

Next, we need to require an uppercase letter in a similar fashion. Again, we will use a positive lookahead, but this time it will look like this: (?=.*[A-Z]). Add this pattern right behind the first lookahead:

(?=.*[a-z])(?=.*[A-Z]).{8,16}

Next is at least one number. For the range of digits we can use either the [0-9] pattern or the \d shortcut. We will use the shortcut since it's shorter (obviously :-)), which makes the whole expression more readable. The whole lookahead looks like this: (?=.*\d). So again, add this pattern right behind the last lookahead:

(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,16}

Finally, we need to make sure that there is at least one special character from the defined group (@, *, $, #). So the last lookahead will choose from these specific characters like this: (?=.*[@*$#]). Again, add it behind the last lookahead:

(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@*$#]).{8,16}

That's it! Here's the pattern: https://regex101.com/r/xquX9X/3
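And here is the complete password pattern put to work in a small Python sketch. As before, this is my own illustration rather than the only way to do it: is_valid_password and the sample passwords are invented, and fullmatch is used so that the 8 to 16 limit applies to the whole string.

```python
import re

PASSWORD = re.compile(r"(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@*$#]).{8,16}")

def is_valid_password(candidate: str) -> bool:
    """True if the candidate meets all four requirements and the length limit."""
    return PASSWORD.fullmatch(candidate) is not None

samples = [
    "Passw0rd#",            # valid
    "passw0rd#",            # no uppercase letter
    "PASSW0RD#",            # no lowercase letter
    "Password#",            # no digit
    "Passw0rd",             # no special character
    "Passw0rd!",            # ! is not in the allowed group
    "Pw0#",                 # too short
    "Passw0rd#" + "x" * 8,  # 17 characters, too long
]

for candidate in samples:
    print(candidate, "->", is_valid_password(candidate))
```

Only the first sample passes; each of the others breaks exactly one rule, which makes it easy to see which lookahead (or the length range) rejected it.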

So, what do you think? Not that scary, right? :-)

Shameless plug

This is just a small snippet of a complete web development course. If you enjoyed this article, consider my Total Web Development Course, available at www.twdc.online