Regular Expressions Tutorial: From Beginner to Advanced

A hands-on guide to mastering regex patterns, syntax, and real-world applications across every major programming language.

By Michael Lip / Updated March 20, 2026 / 25 min read

Languages with regex: 97%
Year invented: 1951
Stack Overflow questions: 4.2M+
Core metacharacters: 12

Wikipedia Definition

A regular expression (shortened as regex or regexp), sometimes referred to as rational expression, is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation.

Source: Wikipedia - Regular expression

Content verified March 20, 2026

1. What Regular Expressions Are and Why They Matter

Regular expressions are pattern-matching tools that let you describe text structures with a compact syntax. Instead of writing dozens of lines of conditional string logic, a single regex pattern can validate an email address, extract phone numbers from a document, or reformat dates across thousands of files.

The concept originated with mathematician Stephen Kleene in 1951, who described "regular sets" using his mathematical notation. Ken Thompson implemented regex in the QED text editor in 1968, and the tool became a core part of Unix through grep in 1973. The name grep comes from the ed command g/re/p: globally search for a regular expression and print. Today, regex is embedded into virtually every programming language, database system, and text editor.

Consider a practical scenario. You have a log file with 500,000 lines and need to find every line where a response time reached 2,000 milliseconds or more. Without regex, you would need to parse each line, extract the numeric value, and compare it. With regex, a pattern like response_time=([2-9]\d{3}|\d{5,})ms handles this in a single pass: the first alternative covers 2000 through 9999, and the second covers any value with five or more digits.
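A minimal Python sketch of that scenario follows. It assumes each log line carries a field shaped like response_time=2450ms; the sample lines and the helper name slow_lines are invented for illustration.

```python
import re

# 2000-9999 (four digits starting with 2-9) or any value with five or more digits.
# Note that a value of exactly 2000 is included.
SLOW = re.compile(r"response_time=([2-9]\d{3}|\d{5,})ms")

def slow_lines(lines):
    """Yield (line, milliseconds) for every line at or above 2000 ms."""
    for line in lines:
        m = SLOW.search(line)
        if m:
            yield line, int(m.group(1))

# Hypothetical log lines, not taken from a real system.
sample = [
    "GET /home response_time=180ms",
    "GET /report response_time=2450ms",
    "POST /export response_time=12003ms",
]

for line, ms in slow_lines(sample):
    print(ms, line)
```

Because slow_lines is a generator, it can be fed an open file object directly and never holds the whole log in memory.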

Regex proficiency directly impacts developer productivity. A 2024 JetBrains developer survey found that 78% of professional developers use regex at least weekly, with search-and-replace and data validation being the top use cases. Yet the same survey found that only 34% considered themselves "comfortable" with regex syntax beyond basics.

This guide addresses that gap. We start with fundamental building blocks, advance through intermediate concepts like groups and quantifiers, and finish with advanced patterns including lookaheads, backreferences, and performance optimization.

2. Regex Fundamentals: Literal Characters and Metacharacters

Every regex pattern consists of two types of characters: literal characters that match themselves, and metacharacters that carry special meaning.

Literal Matching

The simplest regex is a plain string. The pattern hello matches the exact sequence h-e-l-l-o within any larger string. This match is case-sensitive by default, so hello does not match "Hello" unless you enable case-insensitive mode with the i flag.

The 12 Core Metacharacters

Metacharacter  Meaning                         Example   Matches
.              Any character (except newline)  c.t       cat, cot, c9t
^              Start of string/line            ^Hello    "Hello world" (start)
$              End of string/line              end$      "the end" (end)
*              Zero or more                    ab*c      ac, abc, abbc
+              One or more                     ab+c      abc, abbc (not ac)
?              Zero or one                     colou?r   color, colour
{}             Exact count                     \d{3}     123, 456, 789
[]             Character class                 [aeiou]   Any vowel
()             Grouping/capture                (ab)+     ab, abab
|              Alternation (OR)                cat|dog   cat or dog
\              Escape character                \.        Literal dot
-              Range (inside [])               [a-z]     Any lowercase letter

When you need to match a metacharacter literally, escape it with a backslash. To match a literal period, use \.. To match a literal backslash, use \\.
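When the text to match comes from a variable rather than a hand-written pattern, escaping by hand is error-prone. The sketch below uses Python's re.escape for this; the sample string "3.50" is just an illustration.

```python
import re

literal = "3.50"

unescaped = re.compile(literal)           # "." still means "any character"
escaped = re.compile(re.escape(literal))  # re.escape backslash-escapes the dot

wildcard_hit = bool(unescaped.fullmatch("3x50"))  # the dot matched "x"
escaped_miss = bool(escaped.fullmatch("3x50"))    # the dot is now literal
escaped_hit = bool(escaped.fullmatch("3.50"))

print(wildcard_hit, escaped_miss, escaped_hit)
```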

Shorthand Character Classes

Regex engines provide shorthand for common character sets:

Shorthand  Equivalent        Description
\d         [0-9]             Any digit
\D         [^0-9]            Any non-digit
\w         [a-zA-Z0-9_]      Word character
\W         [^a-zA-Z0-9_]     Non-word character
\s         [\t\n\r\f\v ]     Whitespace
\S         [^\t\n\r\f\v ]    Non-whitespace
\b         (none)            Word boundary

3. Character Classes and Ranges

Character classes, written inside square brackets, match any single character from a defined set. The pattern [aeiou] matches any one lowercase vowel. You can negate a class with a caret: [^aeiou] matches any character that is not a lowercase vowel.

Ranges simplify long lists. Instead of [abcdefghijklmnopqrstuvwxyz], write [a-z]. Multiple ranges combine freely: [a-zA-Z0-9] matches any alphanumeric character. The hyphen only indicates a range when placed between two characters inside brackets; at the start or end, it matches literally.

Practical character class patterns you will use regularly:

[a-zA-Z]       -- Any letter (English alphabet)
[0-9a-fA-F]    -- Any hexadecimal digit
[^\s]          -- Any non-whitespace character (same as \S)
[-+]?\d+       -- Integer with optional sign
[A-Z][a-z]+    -- Capitalized word

POSIX character classes are available in some engines (notably grep and sed on Unix systems). These include [:alpha:] for letters, [:digit:] for numbers, and [:alnum:] for alphanumeric characters. They must be used inside an additional set of brackets: [[:alpha:]].

4. Quantifiers: Controlling Repetition

Quantifiers specify how many times the preceding element must occur for a match.

Basic Quantifiers

Quantifier  Meaning          Example Pattern  Matches
*           0 or more        bo*k             bk, bok, book, boook
+           1 or more        bo+k             bok, book, boook (not bk)
?           0 or 1           https?           http, https
{n}         Exactly n        \d{4}            1234, 5678
{n,}        n or more        \w{3,}           Words with 3+ characters
{n,m}       Between n and m  \d{2,4}          12, 123, 1234

Greedy vs. Lazy Quantifiers

By default, quantifiers are greedy: they match as many characters as possible. Appending ? makes them lazy, matching as few characters as possible.

This distinction matters most when patterns contain a wildcard followed by a specific delimiter. Given the HTML string <span>text</span><span>more</span>:

Greedy:  <span>.*</span>   matches "<span>text</span><span>more</span>"
Lazy:    <span>.*?</span>  matches "<span>text</span>"

Lazy quantifiers are essential when you want to match the shortest possible substring between delimiters.
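The same comparison can be reproduced in Python as a quick check. This is only a sketch using the HTML string from the example above; the variable names are arbitrary.

```python
import re

html = "<span>text</span><span>more</span>"

# Greedy: .* runs to the end of the string, then backs up to the last </span>.
greedy = re.search(r"<span>.*</span>", html).group()

# Lazy: .*? stops at the first </span> it can reach.
lazy = re.search(r"<span>.*?</span>", html).group()

# With a capture group, the lazy form pulls out each span's content in turn.
contents = re.findall(r"<span>(.*?)</span>", html)

print(greedy)
print(lazy)
print(contents)
```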

Possessive Quantifiers

Some engines (Java, PCRE, and Python 3.11+) support possessive quantifiers with ++, *+, and ?+. These behave like greedy quantifiers but never backtrack. The pattern \d++\d can never match, because \d++ consumes every available digit and refuses to give one back for the trailing \d. Possessive quantifiers improve performance when you know backtracking is unnecessary.

5. Groups, Backreferences, and Alternation

Capturing Groups

Parentheses serve two purposes: grouping elements for quantifiers and capturing matched text for later use. The pattern (\d{3})-(\d{4}) matches "555-1234" and captures "555" in group 1 and "1234" in group 2.

In replacement operations, captured groups are referenced with $1, $2, etc. (or \1, \2 depending on the engine). This enables powerful text transformations:

Pattern:     (\w+), (\w+)
Replacement: $2 $1
Input:       "Doe, John"
Output:      "John Doe"
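In Python, the replacement string refers to groups with \1 and \2 rather than $1 and $2. A small sketch of the same transformation, where flip_name is a hypothetical helper name:

```python
import re

def flip_name(text):
    """Rewrite "Last, First" as "First Last" using two capture groups."""
    return re.sub(r"(\w+), (\w+)", r"\2 \1", text)

print(flip_name("Doe, John"))
```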

Named Capturing Groups

Named groups improve readability in complex patterns. The syntax varies by engine:

Python/PCRE:    (?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})
JavaScript:     (?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})
.NET:           (?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})

In JavaScript, named groups are accessible via match.groups.year, which is far more readable than match[1].

Non-Capturing Groups

When you need grouping for alternation or quantification but do not need the captured text, use (?:...). The pattern (?:https?|ftp):// groups the protocol options without creating a capture group, which is both cleaner and marginally faster.
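The choice between the two group types also changes what some APIs return. As an illustration in Python (the sample text and the added \S+ tail are assumptions made for this example), re.findall returns the group contents when a capturing group is present and the whole match when it is not:

```python
import re

text = "See https://example.com and ftp://files.example.org"

# Capturing group: findall returns only what the group captured.
capturing = re.findall(r"(https?|ftp)://\S+", text)

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r"(?:https?|ftp)://\S+", text)

print(capturing)
print(non_capturing)
```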

Backreferences

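Backreferences match the same text that was previously captured. The pattern (\w+)\s+\1 matches repeated words like "the the" or "is is". Because it is unanchored, it also matches inside "this is" (the "is" that ends "this", followed by "is"), so add word boundaries when searching for duplicate words in text: \b(\w+)\s+\1\b.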
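A short Python sketch of duplicate-word detection follows. It adds \b word boundaries around the backreference pattern so that partial-word overlaps such as "this is" are not reported; the helper name repeated_words and the sample sentences are illustrative.

```python
import re

# \b on both sides keeps the match aligned to whole words.
DUPLICATE = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)

def repeated_words(text):
    """Return each word that is immediately repeated in text."""
    return [m.group(1) for m in DUPLICATE.finditer(text)]

print(repeated_words("This is is a test of the the parser"))

# Without boundaries, "this is fine" produces a false positive ("is is").
false_positive = re.search(r"(\w+)\s+\1", "this is fine")
print(false_positive.group())
print(DUPLICATE.search("this is fine"))
```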

Alternation

The pipe character | works as a logical OR. The pattern cat|dog|bird matches any one of those three words. Alternation has the lowest precedence of all regex operators, so ^cat|dog$ means "string starting with cat OR string ending with dog". Use grouping to clarify: ^(cat|dog)$ means "string that is exactly cat or dog".
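The precedence rule is easy to confirm with a few inputs. This Python sketch compares the ungrouped and grouped forms; the test words are arbitrary.

```python
import re

loose = re.compile(r"^cat|dog$")       # (starts with cat) OR (ends with dog)
strict = re.compile(r"^(?:cat|dog)$")  # exactly "cat" or exactly "dog"

words = ["cat", "catalog", "hotdog", "dog"]
loose_hits = [w for w in words if loose.search(w)]
strict_hits = [w for w in words if strict.search(w)]

print(loose_hits)
print(strict_hits)
```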

6. Anchors and Boundaries

Anchors match positions rather than characters. They are zero-width assertions: they do not consume input.

The caret ^ matches the start of a string (or line, in multiline mode). The dollar sign $ matches the end. These are critical for validation patterns. The regex \d+ matches digits anywhere in a string, but ^\d+$ only matches a string consisting entirely of digits.

Word Boundaries

The \b boundary matches the position between a word character and a non-word character. Searching for \bcat\b matches "cat" in "the cat sat" but not in "concatenate" or "caterpillar". This is one of the most underused regex features; many false matches in text processing are eliminated by adding word boundaries.
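To see how many false matches the boundaries remove, the following Python sketch counts matches with and without \b on an invented sample sentence:

```python
import re

text = "the cat sat; concatenate the caterpillar's cat"

# Plain substring-style match: also hits "concatenate" and "caterpillar".
anywhere = re.findall(r"cat", text)

# Whole-word match only.
whole_word = re.findall(r"\bcat\b", text)

print(len(anywhere), len(whole_word))
```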

String vs. Line Anchors

The behavior of ^ and $ changes in multiline mode (the m flag). Without multiline mode, ^ matches only the very start of the entire string. With it, ^ also matches the start of each line after a newline character.

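Some engines provide additional anchors that ignore the multiline flag: \A always matches the start of the string, and \z always matches the very end. \Z also anchors to the end, but in most engines that support it (PCRE, Java, .NET) it additionally allows a single trailing newline; in Python, \Z means the absolute end.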
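The effect of multiline mode on ^, and the fact that \A ignores it, can be checked with a three-line string. This is a Python sketch in which re.MULTILINE plays the role of the m flag:

```python
import re

text = "alpha\nbeta\ngamma"

default_mode = re.findall(r"^\w+", text)                 # start of string only
multiline = re.findall(r"^\w+", text, re.MULTILINE)      # start of every line
string_start = re.findall(r"\A\w+", text, re.MULTILINE)  # \A ignores the flag

print(default_mode)
print(multiline)
print(string_start)
```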

7. Lookaheads and Lookbehinds

Lookaround assertions check for a pattern without including it in the match. They are powerful tools for expressing conditions about surrounding text.

Type                 Syntax    Description      Example
Positive Lookahead   (?=...)   Followed by      \w+(?=@) matches username before @
Negative Lookahead   (?!...)   Not followed by  \d+(?!\d|px) matches whole numbers not before "px" (plain \d+(?!px) still matches "10" in "100px")
Positive Lookbehind  (?<=...)  Preceded by      (?<=\$)\d+ matches digits after $
Negative Lookbehind  (?<!...)  Not preceded by  (?<!un)happy matches "happy" not preceded by "un"
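Negative lookahead placed after a quantifier has a subtlety worth testing. With \d+(?!px), the engine can backtrack to a shorter run of digits and still succeed. The Python sketch below shows the effect and one possible guard, which is to also forbid a following digit; the sample text is invented.

```python
import re

text = "100px 250em"

# On "100px", \d+ first takes "100", fails the lookahead, then gives back a
# digit and succeeds with "10" because "0p" is not "px".
naive = re.findall(r"\d+(?!px)", text)

# Also rejecting a following digit blocks that partial match.
guarded = re.findall(r"\d+(?!\d|px)", text)

print(naive)
print(guarded)
```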

Practical Lookaround Examples

Password validation is a classic use case. To require at least one uppercase letter, one lowercase letter, one digit, and one special character:

^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[!@#$%^&*]).{8,}$

Each lookahead checks a condition independently, and all must be true for the overall pattern to match. The .{8,} at the end enforces the minimum length.
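Here is a minimal sketch of that check in Python. The function name is_strong and the sample passwords are illustrative only, and the accepted special characters are limited to the set inside the pattern.

```python
import re

PASSWORD = re.compile(
    r"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d)(?=.*[!@#$%^&*]).{8,}$"
)

def is_strong(password):
    """True if all four lookahead conditions and the length rule hold."""
    return PASSWORD.match(password) is not None

for candidate in ["Passw0rd!", "password1!", "Ab1!"]:
    print(candidate, is_strong(candidate))
```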

Another example: matching a number only when it appears inside parentheses but without including the parentheses in the match:

(?<=\()\d+(?=\))

Applied to "Call (555) 1234", this matches "555" alone.
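The same example can be run in Python, which supports lookbehind as long as the lookbehind has a fixed width (a single escaped parenthesis qualifies). The constant name INSIDE_PARENS is arbitrary.

```python
import re

INSIDE_PARENS = re.compile(r"(?<=\()\d+(?=\))")

sample = "Call (555) 1234"
found = INSIDE_PARENS.findall(sample)
span = INSIDE_PARENS.search(sample).span()

print(found)  # the digits only
print(span)   # the match covers the digits, not the parentheses
```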

8. Regex Across Programming Languages

While core regex syntax is consistent across languages, each language has its own API for working with patterns.

JavaScript

const pattern = /(\d{3})-(\d{4})/g;
const str = "Call 555-1234 or 666-5678";
const matches = [...str.matchAll(pattern)];
// matches[0][1] = "555", matches[0][2] = "1234"

// Named groups (ES2018+)
const datePattern = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const { year, month, day } = "2026-03-20".match(datePattern).groups;

Python

import re

pattern = re.compile(r'(\d{3})-(\d{4})')
matches = pattern.findall("Call 555-1234 or 666-5678")
# [('555', '1234'), ('666', '5678')]

# Named groups
date_pattern = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')
m = date_pattern.match("2026-03-20")
m.group('year')  # '2026'

Java

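import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhoneDemo {
    public static void main(String[] args) {
        // Backslashes are doubled because the pattern is a Java string literal.
        Pattern p = Pattern.compile("(\\d{3})-(\\d{4})");
        Matcher m = p.matcher("Call 555-1234");
        while (m.find()) {
            System.out.println(m.group(1) + " " + m.group(2)); // 555 1234
        }
    }
}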

Go

import "regexp"

re := regexp.MustCompile(`(\d{3})-(\d{4})`)
matches := re.FindAllStringSubmatch("Call 555-1234 or 666-5678", -1)
// matches[0][1] = "555", matches[0][2] = "1234"

Note that Go's regexp package, which follows the RE2 design, deliberately omits backreferences and lookarounds in exchange for guaranteed linear-time execution. If you need those features in Go, you would have to use a third-party library such as a PCRE binding.

9. Real-World Regex Patterns

Here are practical patterns for common tasks, each with a note on what it does and does not cover.

Email Validation

^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

Covers standard email formats. For production systems, consider supplementing regex with DNS MX record lookup, as the full email RFC is notoriously difficult to express in regex.

URL Extraction

https?://[^\s/$.?#].[^\s]*

Matches HTTP and HTTPS URLs within text. This intentionally avoids over-matching by stopping at whitespace.
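Stopping at whitespace means that sentence punctuation directly after a URL is included in the match. The Python sketch below applies the pattern and then trims a small set of trailing punctuation characters. That trimming step is an assumption of this example, not part of the pattern, and it would also clip a URL that legitimately ends in one of those characters.

```python
import re

URL = re.compile(r"https?://[^\s/$.?#].[^\s]*")

def extract_urls(text):
    """Return URLs found in text, trimming trailing sentence punctuation."""
    return [m.group().rstrip(".,;:!?)") for m in URL.finditer(text)]

text = "Docs live at https://example.com/docs, mirror at http://test.org."
raw = URL.findall(text)   # untrimmed matches keep the trailing "," and "."
print(raw)
print(extract_urls(text))
```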

IPv4 Address

^(?:(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)$

Validates each octet is between 0 and 255. Many simpler patterns incorrectly accept values like 999.999.999.999.
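A quick way to check the octet ranges is to run the pattern over a few boundary cases. In this Python sketch, re.ASCII restricts \d to 0-9 and fullmatch avoids accepting a trailing newline; both are choices made for this example, and is_ipv4 is a hypothetical helper name.

```python
import re

OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
IPV4 = re.compile(rf"^(?:{OCTET}\.){{3}}{OCTET}$", re.ASCII)

def is_ipv4(text):
    """True if text is a dotted-quad address with every octet in 0-255."""
    return IPV4.fullmatch(text) is not None

for candidate in ["192.168.1.1", "255.255.255.255", "256.1.1.1",
                  "999.999.999.999", "01.2.3.4"]:
    print(candidate, is_ipv4(candidate))
```

Leading zeros such as "01" are rejected because the single-digit and two-digit branch only allows an optional first digit from 1 to 9.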

Date (YYYY-MM-DD)

^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$

Validates the ISO 8601 format with basic range checking on month and day values.
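The pattern checks ranges but not the calendar, so a string like 2026-02-31 passes. One way to close that gap, sketched here in Python, is to hand the captured parts to datetime.date. The capture group around the year and the helper name parse_iso_date are additions for this example.

```python
import re
from datetime import date

ISO_DATE = re.compile(r"^(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$")

def parse_iso_date(text):
    """Return a date for a valid YYYY-MM-DD string, otherwise None."""
    m = ISO_DATE.fullmatch(text)
    if m is None:
        return None
    year, month, day = (int(part) for part in m.groups())
    try:
        return date(year, month, day)  # rejects dates like February 31
    except ValueError:
        return None

print(parse_iso_date("2026-03-20"))
print(parse_iso_date("2026-02-31"))
print(parse_iso_date("2026-13-01"))
```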

Credit Card Number (Luhn-compatible format)

^(?:4\d{15}|5[1-5]\d{14}|3[47]\d{13}|6(?:011|5\d{2})\d{12})$

Matches Visa (4), Mastercard (51-55), Amex (34/37), and Discover (6011/65) prefixes. Always supplement with Luhn algorithm validation on the server.
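The Luhn check itself is short. The sketch below pairs it with the prefix pattern in Python; the function names are illustrative, and 4111111111111111 is a widely used Visa test number rather than a real card.

```python
import re

CARD = re.compile(
    r"^(?:4\d{15}|5[1-5]\d{14}|3[47]\d{13}|6(?:011|5\d{2})\d{12})$"
)

def luhn_valid(number):
    """Luhn checksum: double every second digit from the right."""
    if not (number.isascii() and number.isdigit()):
        return False
    total = 0
    for index, char in enumerate(reversed(number)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

def card_looks_valid(number):
    """Format check first, then the checksum."""
    return CARD.fullmatch(number) is not None and luhn_valid(number)

print(card_looks_valid("4111111111111111"))
print(card_looks_valid("4111111111111112"))
```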

Remove HTML Tags

<[^>]+>

Strips HTML tags from text. This is suitable for simple sanitization but should not be relied upon for security-critical HTML parsing.

Try It Yourself

Test any of these patterns live with the Zovo Regex Tester. Paste your pattern and sample text, and see matches highlighted in real time.

10. Performance Optimization and Catastrophic Backtracking

Regex performance is usually not a concern for short strings, but poorly constructed patterns can cause exponential execution times on certain inputs. This is known as catastrophic backtracking.

How Backtracking Works

Backtracking regex engines (used by JavaScript, Python, Java, .NET, and most languages) try each branch of a pattern and backtrack when a branch fails. For the pattern (a+)+b applied to the string "aaaaaaaaac", the engine tries increasingly absurd combinations of the nested quantifiers before concluding there is no match. With 30 'a' characters, this can take billions of steps.

Identifying Dangerous Patterns

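The main red flag for catastrophic backtracking is a nested quantifier such as (a+)+, where the inner and outer repetitions can split the same run of characters in exponentially many ways.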

Mitigation Strategies

  1. Use atomic groups (?>...) when available (PCRE, .NET, Java) to prevent backtracking into a group.
  2. Use possessive quantifiers a++ instead of a+ when backtracking is unnecessary.
  3. Rewrite nested quantifiers. Replace (a+)+ with a+.
  4. Set timeout limits. In .NET, use Regex.Match(input, pattern, options, timeout). In JavaScript, run the match inside a Web Worker and terminate the worker if it exceeds a timeout.
  5. Consider RE2-compatible syntax. Google's RE2 engine guarantees linear-time matching by disallowing backreferences and lookarounds.
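Before replacing a nested quantifier, it helps to confirm that the flattened pattern accepts the same inputs. The Python sketch below compares (a+)+b with a+b on a handful of short strings. The samples are deliberately short so the nested form stays fast, and they are illustrative rather than exhaustive.

```python
import re

nested = re.compile(r"(a+)+b")  # nested quantifier: slow on long non-matches
flat = re.compile(r"a+b")       # same language, no nesting

samples = ["ab", "aaab", "b", "aaac", "xaab", ""]

def matched_text(pattern, text):
    """Return the matched substring, or None when there is no match."""
    m = pattern.search(text)
    return m.group() if m else None

agreement = [matched_text(nested, s) == matched_text(flat, s) for s in samples]
print(agreement)
```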

In 2019, Cloudflare experienced a global outage caused by a single regex pattern with catastrophic backtracking in their WAF rules. The pattern consumed 100% CPU on their edge servers worldwide. This incident underscores why regex performance testing is not optional for production systems.

11. Recommended Tools

Put your regex knowledge into practice with these free tools:

The Zovo Regex Tester provides real-time match highlighting, capture group inspection, and support for JavaScript and PCRE2 syntax. Use the Text Case Converter for quick transformations that complement regex operations, such as converting extracted text between camelCase, snake_case, and other formats.

Stack Overflow Community Resources

The regex community on Stack Overflow has answered millions of questions.

Video Resource

If you learn better with visual walkthroughs, search YouTube for "regular expressions crash course" or "regex tutorial for beginners" for video guides covering the fundamentals discussed here. Channels like Fireship, The Coding Train, and Traversy Media offer quality regex content.

12. Browser Compatibility for JavaScript Regex Features

Feature                        Chrome  Firefox  Safari  Edge
Named Capture Groups           72+     78+      11.1+   79+
Lookbehind Assertions          62+     78+      16.4+   79+
Unicode Property Escapes       64+     78+      11.1+   79+
dotAll Flag (s)                62+     78+      11.1+   79+
matchAll() Method              73+     67+      13+     79+
Regex 'v' Flag (Set Notation)  112+    Partial  17+     112+

Frequently Asked Questions

What is a regular expression?

A regular expression (regex) is a sequence of characters that defines a search pattern. It is used for matching, searching, and manipulating text in programming languages, text editors, and command-line tools. Regex provides a concise way to describe complex text patterns that would otherwise require many lines of procedural code.

What does the dot (.) mean in regex?

The dot matches any single character except a newline. When combined with the s (dotAll) flag, it matches newlines too. For example, a.b matches "acb", "a3b", "a b", and any other three-character string where the first character is 'a' and the last is 'b'.

How do I match an email address with regex?

A practical email pattern is ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$. This covers the vast majority of valid email addresses. The full RFC 5322 specification allows for obscure formats that are rarely encountered, so this simplified version is sufficient for most validation purposes.

What is the difference between greedy and lazy quantifiers?

Greedy quantifiers (*, +, ?) match as many characters as possible, while lazy quantifiers (*?, +?, ??) match as few as possible. The difference is most visible with patterns like ".+" vs ".+?" applied to a string with multiple quoted sections. Greedy matches from the first quote to the last; lazy matches from each opening quote to the nearest closing quote.

What are lookaheads and lookbehinds?

Lookaheads and lookbehinds are zero-width assertions that check for patterns ahead of or behind the current position without consuming characters. Positive lookahead (?=...) asserts that what follows matches, while negative lookahead (?!...) asserts the opposite. Lookbehinds work the same way but check backward: (?<=...) and (?<!...).

Which programming languages support regex?

Virtually all modern programming languages support regular expressions: JavaScript, Python, Java, C#, PHP, Ruby, Go, Rust, Perl, Swift, Kotlin, TypeScript, R, and MATLAB, among others. Each has its own regex engine and API, but the core syntax (character classes, quantifiers, groups) is largely consistent across implementations.

How do I validate a phone number with regex?

Phone validation depends on the expected format. A flexible US phone number pattern is ^\+?1?[-.\s]?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$. For international numbers, consider using a dedicated library like libphonenumber, which handles the complexity of country-specific formats better than regex alone.

What are capturing groups and how do I use them?

Capturing groups are created with parentheses (). They capture the matched text, which can be referenced later in the pattern (backreferences: \1, \2) or in replacement strings ($1, $2). Named groups like (?<name>...) let you reference captures by name instead of number, improving readability in complex patterns.

Michael Lip

Building free developer and productivity tools at zovo.one. Focused on creating practical, accessible utilities that solve real problems without tracking or paywalls.

Update History

March 20, 2026 - Initial publication. Complete regex tutorial covering fundamentals through advanced topics.
