Ruby regex syntax
Let’s start with how regex works in Ruby in general. Ruby supports regular expressions out of the box, so you do not need to install anything extra to apply regex validation.
Forward slashes with equal tilde symbols
The expression returns 10, which is the index of the first occurrence that matches the regex. If the text contains nothing that matches, it returns nil.
If you want to use regex in an if statement, you can use the match method.
This prints “Matched” because the text contains the word “regex”.
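As a minimal sketch of the =~ operator, the snippet below uses a hypothetical sample text (an assumption, chosen so that the word “regex” starts at index 10):

```ruby
# Hypothetical sample text, chosen so that "regex" starts at index 10.
text = "Let's use regex in Ruby"

# =~ returns the index of the first match, or nil when nothing matches.
match_index = text =~ /regex/
no_match = text =~ /python/

puts match_index.inspect  # 10
puts no_match.inspect     # nil
```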
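A small sketch of the match approach could look like this. The sample text and the “Not matched” branch are assumptions added for illustration:

```ruby
# Hypothetical sample text that contains the word "regex".
text = "Let's use regex in Ruby"

# String#match returns a MatchData object (truthy) or nil (falsy),
# so it can be used directly as an if condition.
result = if text.match(/regex/)
  "Matched"
else
  "Not matched"
end

puts result  # Matched
```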
These two options are useful when you want to detect a word or words in a text. If you want to check characters, you can use a different regex syntax.
A set of characters can be enclosed in square brackets, ‘[]’.
The contains_abc function uses the =~ operator to return the index of the first occurrence. At least one character in the text must match a character in the set. The first function call returns index 8 and the second returns nil.
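One way such a contains_abc function might be written is sketched below. The two input strings are hypothetical, picked so that the calls return 8 and nil:

```ruby
# Returns the index of the first "a", "b", or "c" in the string, or nil.
def contains_abc(str)
  str =~ /[abc]/
end

# Hypothetical inputs: the first has a "b" at index 8, the second has no a, b, or c.
first_result = contains_abc("Why not be kind?")
second_result = contains_abc("Hello world")

puts first_result.inspect   # 8
puts second_result.inspect  # nil
```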
You can use a range in the regular expression and this can make regex very compact.
The first regex, [0-9], matches any digit between 0 and 9. In the first sentence, the number 5 is where the regex check stops, and its index is returned. The second regex, [a-z], looks for any lowercase letter between a and z, and the first one it finds in the sentence is at index 21. Note that letter ranges are case-sensitive, so [a-z] does not match uppercase letters.
A shorthand character class works like a predefined set of characters, so you do not have to list every letter or digit yourself. Among the many classes, these three are used most frequently.
- \w matches any word character: a letter, a digit, or an underscore.
- \d matches any digit.
- \s matches any whitespace character, such as a space, tab, or newline.
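The range behavior described above can be sketched as follows. Both sentences are hypothetical stand-ins, so the index of the digit is specific to this sample:

```ruby
# Hypothetical sentences used only for illustration.
number_index = "I have 5 apples" =~ /[0-9]/
letter_index = "THE QUICK BROWN FOX, jumps" =~ /[a-z]/

# Ranges are case-sensitive: [a-z] does not match uppercase letters.
upper_only = "HELLO" =~ /[a-z]/

puts number_index.inspect  # 7
puts letter_index.inspect  # 21
puts upper_only.inspect    # nil
```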
You can use the character classes below.
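As a rough illustration of the three classes, here is a sketch that runs them against a hypothetical sample text:

```ruby
# Hypothetical sample text.
text = "Order 42 shipped"

word_runs = text.scan(/\w+/)   # runs of letters, digits, or underscores
digit_runs = text.scan(/\d+/)  # runs of digits
first_space = text =~ /\s/     # index of the first whitespace character

puts word_runs.inspect    # ["Order", "42", "shipped"]
puts digit_runs.inspect   # ["42"]
puts first_space.inspect  # 5
```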
Why do we validate emails using regex?
We have learned how to use regular expressions in Ruby. Before developing email regex patterns, let’s discuss why we validate email addresses.
Validating email format
When you sign up for an online service and create a new account, you almost always have to provide an email address. When you type your email, the website checks whether it has the right format. According to the email address standard, an email pattern must meet the following conditions.
- Alphabet letters, numbers, and specific special characters including underscores, periods, and dashes
- Underscores, periods, and dashes must be followed by one or more letters or numbers.
Let's have a look at some of the valid emails.
Some of the invalid email addresses look like the following.
- sample..email@example.com (consecutive periods are not allowed)
If a company fails to validate the email format from the user, the firm loses a means to contact the person. Also, when a wrong email flows into the company system, it can cause a problem in the system. Thus, email validation is a basic but essential task.
Extracting email from unstructured text
Regex in Ruby is not only for validating email formats. When data is structured or semi-structured, it is easy to retrieve email data. However, when you have to deal with unstructured data such as plain text, Ruby regex can help you collect email addresses from it. Once you capture the email addresses from the free text, you can convert them into clean, structured data.
Replacing email from unstructured text
In addition, a company may want to replace the email address (which is considered PI, or personal information) with some random string to de-personalize the text data. To do so, you will first have to match all emails and then replace them with the string you provide. Sometimes, a company wants to perform this task following its security policy.
An email address consists of the prefix and the domain. The prefix appears to the left of the symbol, @, and the domain appears to the right. So, an email contains two types of information. Your company might want to use the prefix to check if the person tries to create duplicate accounts, for example, to take advantage of free services. Or, you may just want to use the prefix as a username for the person. All these become easy when you first can extract email addresses from plain text, for example, and then split each email by ‘@’.
When you have email addresses in a large amount of plain text and want to validate the email domains, the first step is to extract the emails. An email in the text might not have a valid domain, as some people enter a random email to bypass validation. You can use regex to match the emails and then split each one by ‘@’ to get the domain. You can then perform follow-up checks, for example, to see whether the server exists.
Using email regex in Ruby
Let’s learn how to use regex in Ruby for each of these scenarios with sample code.
Validating email format
You can use the email regular expression in the sample code to validate an email address. Using an if-else statement, you can differentiate between valid and invalid email addresses.
The sample email above has a valid email format, so it will return “This is a valid email!”. Alternatively, you could check the returned index: if it equals 0, you can consider the email valid.
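A minimal validation sketch is shown below. The pattern assigned to EMAIL_REGEX is an assumption written for illustration (it follows the conditions listed earlier and is stricter than the full email standard), and the check_email helper and the “invalid” message are hypothetical:

```ruby
# Hypothetical pattern, an assumption for illustration only:
# letters or digits, optionally joined by single ".", "_", or "-",
# then "@", a domain, and a top-level domain of two or more letters.
EMAIL_REGEX = /\A[a-zA-Z0-9]+(?:[._-][a-zA-Z0-9]+)*@[a-zA-Z0-9]+(?:[.-][a-zA-Z0-9]+)*\.[a-zA-Z]{2,}\z/

# Hypothetical helper: =~ returns 0 (truthy in Ruby) for a valid email, nil otherwise.
def check_email(email)
  if email =~ EMAIL_REGEX
    "This is a valid email!"
  else
    "This is an invalid email!"
  end
end

puts check_email("firstname.lastname@example.org")  # This is a valid email!
```

Because the pattern is anchored with \A and \z, a successful match always starts at index 0, which is why checking for 0 works as an alternative.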
How about invalid emails? Let’s test it.
Since they are all invalid emails, all of them will return nil.
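A test along these lines might look like the sketch below. The pattern is a hypothetical one written for illustration, and apart from sample..email@example.com the invalid addresses are made-up examples:

```ruby
# Hypothetical pattern, an assumption for illustration only.
EMAIL_REGEX = /\A[a-zA-Z0-9]+(?:[._-][a-zA-Z0-9]+)*@[a-zA-Z0-9]+(?:[.-][a-zA-Z0-9]+)*\.[a-zA-Z]{2,}\z/

invalid_emails = [
  "sample..email@example.com",  # consecutive periods
  ".sample@example.com",        # starts with a period
  "sample.email@",              # missing domain
  "sample@example"              # missing top-level domain
]

# Each check returns nil because none of the addresses match.
results = invalid_emails.map { |email| email =~ EMAIL_REGEX }
puts results.inspect  # [nil, nil, nil, nil]
```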
Extracting email in unstructured text
Validating and extracting are different tasks. In extracting, you may get one or more matched email addresses.
There are three emails in the text, so text.scan(EMAIL_REGEX) returns an array containing the extracted emails. You can access an element by its index as follows.
This gives you firstname.lastname@example.org.
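The extraction steps can be sketched as follows. The text and the unanchored pattern are assumptions for illustration. The pattern has no \A and \z anchors, so it can match emails anywhere in the text:

```ruby
# Hypothetical unanchored pattern, so it can match inside longer text.
EMAIL_REGEX = /[a-zA-Z0-9]+(?:[._-][a-zA-Z0-9]+)*@[a-zA-Z0-9]+(?:[.-][a-zA-Z0-9]+)*\.[a-zA-Z]{2,}/

# Hypothetical free text containing three email addresses.
text = "Contact firstname.lastname@example.org or support@example.com. " \
       "For billing, write to jane_doe@example.net instead."

# scan returns an array with every non-overlapping match.
emails = text.scan(EMAIL_REGEX)

puts emails.length  # 3
puts emails[0]      # firstname.lastname@example.org
```

The pattern uses only non-capturing groups, (?:...), because scan returns arrays of captured groups instead of whole matches when the regex contains capturing groups.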
Replacing email in unstructured text
Imagine that you have to de-identify personal information in documents by removing or replacing email addresses. To find the email addresses, you can use a regular expression. To replace the matched strings, you can use the gsub method.
The matched email addresses will be replaced with DEIDENTIFIED@EMAIL.COM. The outcome from the above code is the full text with the email addresses changed to the deidentified email address.
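A replacement sketch with gsub is shown below. The text and the pattern are hypothetical, chosen only to illustrate the step:

```ruby
# Hypothetical unanchored pattern, an assumption for illustration only.
EMAIL_REGEX = /[a-zA-Z0-9]+(?:[._-][a-zA-Z0-9]+)*@[a-zA-Z0-9]+(?:[.-][a-zA-Z0-9]+)*\.[a-zA-Z]{2,}/

# Hypothetical free text containing three email addresses.
text = "Contact firstname.lastname@example.org or support@example.com. " \
       "For billing, write to jane_doe@example.net instead."

# gsub replaces every match and returns a new string; text itself is unchanged.
deidentified = text.gsub(EMAIL_REGEX, "DEIDENTIFIED@EMAIL.COM")

puts deidentified
```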
Let’s say that you have extracted email addresses from documents and want to use the prefix as the username.
Using the same text, we will assume that the scan returns the three emails. The loop visits each email, and splitting it by ‘@’ returns an array of two elements. The first element is the prefix and the second is the domain.
Similar to checking the prefix, we can extract the domain from each email address.
In each array, the second element (index 1) holds the domain.
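As a sketch of the prefix step, the array below stands in for the assumed result of the scan. The addresses other than the first are hypothetical, and map is used here in place of an explicit for loop:

```ruby
# Assumed result of scanning the text; the addresses are illustrative.
emails = [
  "firstname.lastname@example.org",
  "support@example.com",
  "jane_doe@example.net"
]

# split("@") returns [prefix, domain]; index 0 is the prefix.
prefixes = emails.map { |email| email.split("@")[0] }

puts prefixes.inspect  # ["firstname.lastname", "support", "jane_doe"]
```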
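The domain step can be sketched the same way, again with an assumed scan result made of illustrative addresses:

```ruby
# Assumed result of scanning the text; the addresses are illustrative.
emails = [
  "firstname.lastname@example.org",
  "support@example.com",
  "jane_doe@example.net"
]

# split("@") returns [prefix, domain]; index 1 is the domain.
domains = emails.map { |email| email.split("@")[1] }

puts domains.inspect       # ["example.org", "example.com", "example.net"]
puts domains.uniq.length   # 3
```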
Useful regex resources
Using regex in Ruby requires a combination of regex knowledge and Ruby coding skills. There are great online resources where you can learn and test your regular expressions. Also, you can download text editing tools that support regex. Some suggested tools include:
- Regex101: This website provides a user interface where you can put text and regex. As you type your regex, it will match the text in real-time and run regex validation to check your regex syntax. It explains the syntax used and the steps on how it detects the patterns. You can try regex in different programming languages too.
- Sublime Text: This text editor is a flexible and versatile IDE for many languages. You can use it not only for coding but also for testing regular expressions with large text. You can replace, extract and find all that is matched by your regex.
- Notepad++: Notepad++ has been around for many years and remains one of the most popular text editing tools. It is compact and free. You can use it as an IDE and run your regex against text.
- HackerRank: Try to solve the regex quiz and get some credits for your profile. This provides simple regex problems and challenging ones. Using the website, you can develop patterns and apply them to solve quizzes. The problems are categorized by the difficulty level and success rate of other users.
We learned various email validation techniques in Ruby and the reasons why we validate email addresses. Using regex, we validated, extracted, and replaced email addresses, all with what Ruby provides out of the box. Learning regex to capture emails is an extremely handy skill for Ruby programming, and being able to develop a regex pattern is a sought-after programming skill.
To become familiar with regex in Ruby, try the regex resources and hone your skills!
To boost your regex knowledge, we prepared three frequently asked questions regarding regex.
How can we express date format in regex?
Along with the email validation regex, the regex to validate or capture a date format is one of the most frequently used patterns. If the date format that you want to detect is as below:
You can use the following regex to capture that pattern:
This validates not only the date format but also the values themselves: it captures birthdates between the years 1900 and 2020.
What is the regex pattern that can select the text between two strings?
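As a sketch under an assumed format of YYYY-MM-DD (the format and the DATE_REGEX name are assumptions, not taken from the original example), a pattern that limits the year to 1900 through 2020 might look like this:

```ruby
# Hypothetical pattern for an assumed YYYY-MM-DD format.
# Year: 1900-1999, 2000-2019, or 2020. Month: 01-12. Day: 01-31.
DATE_REGEX = /\A(19\d{2}|20[01]\d|2020)-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])\z/

puts DATE_REGEX.match?("1995-07-24")  # true
puts DATE_REGEX.match?("2020-12-31")  # true
puts DATE_REGEX.match?("2021-01-01")  # false (year out of range)
puts DATE_REGEX.match?("1995-13-01")  # false (month out of range)
```

This sketch does not check the number of days in each month, so a date such as 1995-02-30 still matches. Parsing with Ruby's Date class would be needed for that.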
Let's say you want to capture what is between '<' and '>'.
You can use the regex below to select "123-456-789".
If you have different anchor strings, you can change '<' and '>' to whatever strings you have.
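One possible way to do this is sketched below. The sample text and the text_between helper are hypothetical. Regexp.escape is used so that anchor strings containing special regex characters are treated literally:

```ruby
# Hypothetical sample text.
text = "Call me at <123-456-789> tomorrow"

# A capture group selects what sits between "<" and ">";
# passing 1 to String#[] returns the first captured group.
selected = text[/<(.*?)>/, 1]
puts selected  # 123-456-789

# Hypothetical helper for arbitrary anchor strings.
def text_between(text, left, right)
  pattern = /#{Regexp.escape(left)}(.*?)#{Regexp.escape(right)}/
  text[pattern, 1]
end

puts text_between("id=[42]", "[", "]")  # 42
```

The lazy quantifier .*? stops at the first closing anchor, so only the nearest pair is selected when the text contains several.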
What other languages support regex?