Regular expressions, often called regex, are specially encoded strings that describe patterns to match in other strings. They are particularly helpful when you need to find text that fits a certain pattern. A pattern can be simple, matching an exact string, or more complex, matching any string that follows a set of rules. One common use of regex is checking whether a user has correctly entered something into a form, such as an email address.
Regex can validate the structure of an email address. The check takes one or two lines of code and is easy to adjust for different requirements.
However, keep in mind that regex can only check the structure of an email address. Regex by itself cannot verify that an address actually exists. It cannot, for instance, look up a domain's MX records or ask an SMTP server whether the mailbox is real. In other words, anyone can construct a string of characters that fits the rules and passes this validation, whether or not the address is real.
Email addresses come in a wide range of possible forms. Until a few years ago, it was common to look only for 2- or 3-character top-level domains (TLDs). ICANN has since opened up a large number of new TLDs, many of which are much longer. It is also important to keep in mind country-specific domains, where a country code follows another domain label (e.g., example.co.uk). From a regex perspective, this means you need to allow for several periods after the “@” symbol. Overall, there is no such thing as a perfect regex that captures every legitimate email address.
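To make the point about longer TLDs and multiple periods concrete, here is a minimal Python sketch that compares two patterns for the part after the “@” symbol. Both patterns and the names `SHORT_TLD_DOMAIN`, `MULTI_LABEL_DOMAIN`, and `domain_matches` are illustrative assumptions, not expressions taken from any standard.

```python
import re

# Assumed, illustrative patterns for the domain part only (the text after "@").

# Older style: a single label followed by a 2- or 3-letter TLD.
SHORT_TLD_DOMAIN = re.compile(r"[A-Za-z0-9-]+\.[A-Za-z]{2,3}")

# More permissive: one or more dot-separated labels, then a TLD of 2+ letters.
# Note: this sketch does not handle punycode TLDs (those starting with "xn--").
MULTI_LABEL_DOMAIN = re.compile(r"(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}")


def domain_matches(pattern: re.Pattern, domain: str) -> bool:
    """Return True if the whole domain string matches the pattern."""
    return pattern.fullmatch(domain) is not None


for domain in ("example.org", "example.co.uk", "example.photography"):
    print(
        domain,
        domain_matches(SHORT_TLD_DOMAIN, domain),
        domain_matches(MULTI_LABEL_DOMAIN, domain),
    )
```

The older-style pattern accepts only the first of the three domains, while the multi-label pattern accepts all of them.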
There are two basic reasons for validating email addresses. The first is to improve usability by making sure users don’t accidentally leave off an important part of their email address. The second is to discourage people from entering dummy addresses. What you don’t want to do is reject possibly legitimate email addresses; that goes against the basic principles of usability, so you need to take a measured approach.
For this reason, we will not be showing overly restrictive regex examples here.
Email addresses generally take a fairly standard structure (e.g., firstname.lastname@example.org). There is, however, a wide range of additional rules. A basic email regex is built into HTML5, and uses the following expression:
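As a sketch, the pattern below is transcribed from the "valid email address" definition in the WHATWG HTML specification and wrapped in Python so it can be tried out. The names `HTML5_EMAIL` and `matches_html5_email` are assumptions made for this example, and the transcription should be checked against the specification before relying on it.

```python
import re

# Pattern transcribed from the WHATWG HTML specification's definition of a
# valid email address. The ^ and $ anchors are omitted because fullmatch()
# below already requires the entire string to match.
HTML5_EMAIL = re.compile(
    r"[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"                      # local part
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"        # first domain label
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*"  # further labels
)


def matches_html5_email(address: str) -> bool:
    """Hypothetical helper: True if the address fits the HTML5-style pattern."""
    return HTML5_EMAIL.fullmatch(address) is not None


print(matches_html5_email("firstname.lastname@example.org"))
print(matches_html5_email("no-at-sign.example.org"))
```

Using `fullmatch` rather than `match` avoids accepting a string that merely starts with a valid address.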
This expression looks for any combination of the letters A-Z (both upper- and lowercase) and digits, and allows a few specific special characters, including:
This is followed by the “@” symbol, and then a standard domain name and TLD. Email addresses also follow a few specific rules: a special character cannot appear as the first or last character in an email address, nor can it be repeated consecutively. Special characters not included in the above list are forbidden. For this reason, we explicitly allow only a few special characters here.
Below are methods for validating email addresses with regex in several programming languages.
Here is a straightforward method for checking whether an email is valid in Python:
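The following is a minimal sketch rather than a definitive implementation. The pattern `EMAIL_PATTERN`, the function name `is_valid_email`, and the set of allowed special characters (`.`, `_`, `%`, `+`, `-`) are assumptions chosen for illustration. The pattern follows the rules described earlier: a special character may not start or end the part before the “@” symbol, and may not appear twice in a row.

```python
import re

# Assumed pattern for illustration only.
EMAIL_PATTERN = re.compile(
    r"[A-Za-z0-9]+(?:[._%+-][A-Za-z0-9]+)*"  # local part: runs of letters/digits
                                             # joined by single special characters
    r"@"
    r"(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}"      # domain: dot-separated labels + TLD
)


def is_valid_email(address: str) -> bool:
    """Return True if the whole string has the structure of an email address."""
    return EMAIL_PATTERN.fullmatch(address) is not None


for candidate in (
    "firstname.lastname@example.org",
    "user+tag@example.co.uk",
    "first..last@example.org",
    ".first@example.org",
    "first@example",
):
    print(candidate, is_valid_email(candidate))
```

Because only a handful of special characters are allowed, this sketch rejects some addresses that are technically valid (for example, ones containing an apostrophe), so loosen the character set if that matters for your users.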
It is worth noting that PHP has a built-in method for validating email addresses. You can do this using the following:
However, if you prefer to use regex, here is one basic approach using the preg_match function:
Ruby's standard library includes email validation, so you can use the following:
However, if you want to understand the way that this regex works in Ruby, you can see it below:
In Go, you can import a few packages to handle this process correctly. Here is a variation on the regex that we have been using:
Java requires a few extra steps but offers some useful features, such as the ability to match a string using a case-insensitive pattern.