Regular Expressions Tutorial for Beginners

Q: What is a regular expression?

A regular expression (regex or regexp) is a sequence of characters that defines a search pattern. It is used for matching, searching, and replacing text. Regular expressions are supported by virtually every programming language and many text editors. A regex pattern like [a-z]+ matches one or more lowercase letters, and \d{3}-\d{4} matches a pattern like 555-1234. Regex is powerful for validating input, extracting data, and transforming text.

Q: What is the difference between greedy and lazy matching?

Greedy matching (the default) matches as much text as possible while still allowing the overall pattern to succeed. Lazy matching (also called non-greedy or reluctant) matches as little text as possible. You make a quantifier lazy by adding a ? after it. For example, .* is greedy and matches as much as it can, while .*? is lazy and matches as little as it can. Given input that contains two HTML tags, a greedy pattern matches across both tags, while a lazy pattern stops at the first closing tag.

Q: How do I match a literal special character in regex?

To match a literal special character (such as . * + ? ^ $ { } [ ] ( ) | \), precede it with a backslash. For example, \. matches a literal period, \* matches a literal asterisk, and \( matches a literal opening parenthesis. Inside a character class (square brackets), most special characters lose their special meaning, so [.] also matches a literal period. The only characters that need escaping inside character classes are ] \ ^ and -.

Q: Are regular expressions the same across programming languages?

The core syntax is largely the same, but there are differences in advanced features. Basic patterns like character classes, quantifiers, and grouping work the same way across JavaScript, Python, Java, PHP, and most other languages. Differences appear in features like lookbehind support (JavaScript added it in ES2018), named group syntax (Python uses (?P<name>...) while JavaScript uses (?<name>...)), Unicode support, and flags/modifiers. Always check the documentation for your specific language when using advanced features.

Q: Should I use regex for parsing HTML?

No. HTML is a nested, context-sensitive language that regular expressions cannot reliably parse. Regex cannot handle arbitrary nesting of tags, self-closing tags, attributes with quoted values containing angle brackets, CDATA sections, and other HTML complexities. Use a proper HTML parser instead (like DOMParser in JavaScript, BeautifulSoup in Python, or Jsoup in Java). Regex is appropriate for simple, well-defined text patterns but not for parsing structured languages like HTML or XML.

What Is a Regular Expression

A regular expression, commonly abbreviated as regex or regexp, is a sequence of characters that defines a search pattern. You can use regular expressions to search for specific text, validate that input matches an expected format, extract pieces of data from larger strings, and replace text based on patterns. Regular expressions are one of the most powerful text-processing tools available to programmers and are supported by virtually every programming language, text editor, and command-line tool.

The concept of regular expressions originated in formal language theory in the 1950s, with mathematician Stephen Kleene defining "regular sets" using his mathematical notation. Ken Thompson implemented regular expressions in the Unix text editor ed in the 1960s, and they have been a fundamental part of computing ever since. Today, most implementations use a Perl-style syntax, popularized by Perl and the Perl Compatible Regular Expressions (PCRE) library, though details vary across languages.

Regular expressions look intimidating at first. A pattern like ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$ seems like random characters, but it is a structured pattern for matching email addresses. Once you understand the building blocks, you can read and write patterns like this with confidence. This guide walks you through every concept step by step.

To test regex patterns as you learn, use our Regex Tester tool, which lets you write patterns and see matches highlighted in real time. For a visual representation of how your pattern works, our Regex Visualizer generates a diagram showing the structure of any regex.

Literal Characters

The simplest regex is a literal string. The pattern cat matches the exact sequence of characters "c", "a", "t" appearing consecutively in the text. It would match "cat" in "The cat sat on the mat" and also in "concatenate" (matching the "cat" in the middle of the word).

Literal matching is case-sensitive by default. The pattern cat does not match "Cat" or "CAT". You can make matching case-insensitive by using a flag (covered in the Flags section).

Most characters in a regex are literal and match themselves. The letters a through z, A through Z, digits 0 through 9, and many symbols match themselves directly. However, certain characters have special meaning in regex and are called metacharacters. To match these characters literally, you need to escape them with a backslash.

Metacharacters and Special Characters

Metacharacters are characters that have a special function in regex rather than matching themselves literally. The table below lists the metacharacters you need to know:

Character	Meaning
.	Matches any single character except newline
^	Matches the start of a string (or line in multiline mode)
$	Matches the end of a string (or line in multiline mode)
*	Matches the preceding element zero or more times
+	Matches the preceding element one or more times
?	Matches the preceding element zero or one time
{}	Specifies exact repetition count or range
[]	Defines a character class
()	Groups elements and captures matched text
|	Alternation (logical OR)
\	Escapes a metacharacter or introduces a shorthand class

To match any of these characters literally, precede them with a backslash. For example, \. matches a literal period, \* matches a literal asterisk, and \\ matches a literal backslash.

Escaping Metacharacters

Pattern: 3\.14

Matches: "3.14" (literal period)

Does not match: "3X14" (without the escape, . would match any character)

Pattern: \$100

Matches: "$100" (literal dollar sign)
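
When the text you want to match comes from a variable, escaping it by hand is easy to get wrong. The sketch below assumes Python's standard re module, whose re.escape function adds the backslashes for you; the helper name count_literal and the sample strings are made up for illustration.

```python
import re

def count_literal(needle: str, haystack: str) -> int:
    """Count occurrences of needle treated as literal text, not as a pattern."""
    # re.escape backslash-escapes the regex metacharacters in needle.
    return len(re.findall(re.escape(needle), haystack))

# Unescaped, the dot matches any character, so "3X14" matches too.
unescaped = re.findall(r"3.14", "3.14 and 3X14")   # ['3.14', '3X14']

# Escaped, only the literal period matches.
escaped = re.findall(r"3\.14", "3.14 and 3X14")    # ['3.14']
```

For example, count_literal("$100", "pay $100") returns 1, even though an unescaped $ would be read as an end-of-string anchor.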

Character Classes

A character class, defined with square brackets [], matches any single character from a set of characters. The pattern [abc] matches "a", "b", or "c" but only one of them at a time.

[abc]     Matches a, b, or c
[aeiou]   Matches any vowel
[0-9]     Matches any digit (range notation)
[a-z]     Matches any lowercase letter
[A-Z]     Matches any uppercase letter
[a-zA-Z]  Matches any letter
[a-zA-Z0-9]  Matches any letter or digit

The hyphen inside a character class creates a range. [a-z] matches any character from "a" to "z" based on Unicode code points. [0-9] matches any digit. [A-F0-9] matches any hexadecimal digit. To include a literal hyphen in a character class, place it at the start or end: [-abc] or [abc-].

A caret ^ at the beginning of a character class negates it. [^abc] matches any character that is not "a", "b", or "c". [^0-9] matches any character that is not a digit. [^\s] matches any character that is not whitespace.

[^aeiou]    Matches any character that is not a vowel
[^0-9]      Matches any non-digit character
[^a-zA-Z]   Matches any non-letter character

Inside a character class, most metacharacters lose their special meaning. You do not need to escape ., *, +, or ? inside square brackets. The only characters that retain special meaning inside a class are: ] (closes the class), \ (escapes), ^ (negation, only at the start), and - (range, only between characters).
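
The character class rules above can be tried out in a few lines. This is a small sketch assuming Python's standard re module; the sample strings are invented for the example.

```python
import re

text = "Order A-17 shipped 2024."

# A plain set: any single vowel.
vowels = re.findall(r"[aeiou]", "regex")            # ['e', 'e']

# A negated class: delete everything that is not a digit.
digits_only = re.sub(r"[^0-9]", "", text)           # '172024'

# Ranges combined in one class: uppercase hexadecimal digits.
hex_runs = re.findall(r"[A-F0-9]+", "FF 1A zz")     # ['FF', '1A']

# A leading hyphen is literal, and the dot needs no escape inside a class.
punctuation = re.findall(r"[-.]", text)             # ['-', '.']
```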

Shorthand Character Classes

Regex provides shorthand notations for commonly used character classes. These are more concise and easier to read than writing out the full character class.

Shorthand	Equivalent	Description
\d	[0-9]	Any digit
\D	[^0-9]	Any non-digit
\w	[a-zA-Z0-9_]	Any word character (letter, digit, underscore)
\W	[^a-zA-Z0-9_]	Any non-word character
\s	[ \t\n\r\f\v]	Any whitespace character
\S	[^ \t\n\r\f\v]	Any non-whitespace character

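The uppercase version of each shorthand is its negation. \d matches digits, \D matches non-digits. \w matches word characters, \W matches non-word characters. The equivalents in the table are the ASCII definitions. In Unicode-aware modes (the default for text patterns in Python 3, for example), \d, \w, and \s also match non-ASCII digits, letters, and whitespace.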

The dot . is a special metacharacter that matches any single character except a newline. In single-line (dotall) mode, the dot also matches newlines. Think of it as a wildcard for any character.

c.t       Matches "cat", "cot", "cut", "c9t", "c t", etc.
..        Matches any two characters
\d\d\d    Matches exactly three digits (like "123")
\w+       Matches one or more word characters

Quantifiers

Quantifiers specify how many times the preceding element must occur for a match. Without quantifiers, each element matches exactly once.

Quantifier	Meaning	Example
*	Zero or more times	a* matches "", "a", "aaa"
+	One or more times	a+ matches "a", "aaa" but not ""
?	Zero or one time (optional)	colou?r matches "color" and "colour"
{n}	Exactly n times	\d{3} matches exactly three digits
{n,}	At least n times	\d{2,} matches two or more digits
{n,m}	Between n and m times	\d{2,4} matches 2, 3, or 4 digits

Quantifiers apply to the element immediately before them. In the pattern ab+, the + applies only to b, so it matches "ab", "abb", "abbb", etc. To apply a quantifier to a group of characters, wrap them in parentheses: (ab)+ matches "ab", "abab", "ababab", etc.

Quantifier Examples

\d{3}-\d{3}-\d{4} matches US phone numbers like "555-123-4567"

https?:// matches "http://" and "https://" (the s is optional)

[A-Z][a-z]* matches a capitalized word like "Hello" or "A"

\w{8,} matches any word of 8 or more characters
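
To see these quantifiers at work, here is a short sketch that runs the patterns above against some sample text. It assumes Python's standard re module; the variable names and input strings are made up for the example.

```python
import re

phone = re.compile(r"\d{3}-\d{3}-\d{4}")
scheme = re.compile(r"https?://")
capitalized = re.compile(r"[A-Z][a-z]*")

# Only the full ten-digit form fits {3}-{3}-{4}.
phones = phone.findall("Call 555-123-4567 or 555-1234.")        # ['555-123-4567']

# The s is optional, so both schemes match; ftp:// does not.
schemes = scheme.findall("http://a.example https://b.example ftp://c.example")

# * allows zero lowercase letters, so a lone "A" also counts.
words = capitalized.findall("Hello world, A plan")               # ['Hello', 'A']

# ? allows zero or one "u", but not two.
spellings = re.findall(r"colou?r", "color colour colouur")       # ['color', 'colour']
```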

Anchors

Anchors match a position in the string rather than a character. They do not consume any characters in the input. They assert that the current position satisfies a certain condition.

^ matches the start of the string. In multiline mode, it matches the start of each line. The pattern ^Hello matches "Hello" only when it appears at the very beginning of the string.

$ matches the end of the string. In multiline mode, it matches the end of each line. The pattern world$ matches "world" only when it appears at the very end.

\b matches a word boundary, which is the position between a word character and a non-word character. The pattern \bcat\b matches the word "cat" as a whole word but does not match "cat" inside "concatenate" or "scatter".

\B matches a non-word boundary, the opposite of \b.

^start     Matches "start" only at the beginning
end$       Matches "end" only at the ending
^exact$    Matches the entire string "exact" and nothing else
\bword\b   Matches "word" as a whole word only

Tip: When validating user input, always use ^ and $ together to ensure the entire string matches your pattern. Without these anchors, the pattern \d+ would match "abc123xyz" because it finds digits somewhere in the string. The pattern ^\d+$ ensures the entire string consists of only digits.
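
The difference between finding a pattern somewhere in the input and requiring the whole input to match can be sketched as follows. This assumes Python's standard re module; is_all_digits is a hypothetical helper name. One Python-specific detail: a pattern ending in $ also accepts a single trailing newline, so the sketch uses re.fullmatch for strict validation.

```python
import re

def is_all_digits(value: str) -> bool:
    """Return True only if the entire string consists of digits."""
    # fullmatch requires the pattern to cover the string from start to end.
    return re.fullmatch(r"\d+", value) is not None

# Without anchors, the pattern finds digits anywhere in the string.
unanchored = re.search(r"\d+", "abc123xyz")        # match object for "123"

# With anchors, the same input is rejected.
anchored = re.search(r"^\d+$", "abc123xyz")        # None

# \b keeps "cat" from matching inside longer words.
whole_words = re.findall(r"\bcat\b", "cat concatenate scatter cat.")
```

Here whole_words holds two matches: the standalone "cat" at the start and the one before the final period.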

Groups and Capturing

Parentheses () serve two purposes in regex: they group elements together so quantifiers and alternation apply to the whole group, and they capture the matched text for later reference.

Grouping

Grouping lets you apply quantifiers to sequences of characters. Without grouping, the quantifier applies only to the immediately preceding element.

ab+       Matches "ab", "abb", "abbb" (+ applies to b only)
(ab)+     Matches "ab", "abab", "ababab" (+ applies to the group)
(ha){3}   Matches "hahaha" exactly
(go\s?)+  Matches "go", "go go", "gogo go", etc.

Capturing Groups

When a group matches, the text it matched is captured and stored for later reference. Captures are numbered starting from 1, based on the position of the opening parenthesis from left to right. Capture group 0 always refers to the entire match.

// Pattern: (\d{4})-(\d{2})-(\d{2})
// Input:   "2026-03-19"
// Group 0: "2026-03-19" (entire match)
// Group 1: "2026" (year)
// Group 2: "03" (month)
// Group 3: "19" (day)

You can refer to captured groups in replacement strings. In most regex implementations, $1 or \1 refers to the first captured group. This is powerful for text transformations:

// Swap first and last name
// Pattern: (\w+) (\w+)
// Replace: $2 $1
// Input:   "John Smith"
// Result:  "Smith John"
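
The two examples above might look like this in code. The sketch assumes Python's standard re module, where replacement strings refer to groups as \1 and \2 rather than $1 and $2; the input strings are samples.

```python
import re

match = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Released on 2026-03-19.")

whole = match.group(0)               # '2026-03-19' (entire match)
year, month, day = match.groups()    # ('2026', '03', '19')

# Swap first and last name by referring to the captures in the replacement.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "John Smith")   # 'Smith John'
```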

Named Groups

Named capturing groups let you assign a name to a group instead of relying on its numeric position. The syntax varies slightly by language. In JavaScript and most modern implementations, use (?<name>...), while Python writes it as (?P<name>...):

// Pattern: (?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})
// Access captures by name: match.groups.year, match.groups.month
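
For comparison, here is the same date pattern as a Python sketch, using the (?P<name>...) spelling that Python's standard re module expects. The names DATE and parse_date are made up for this example.

```python
import re
from typing import Dict, Optional

DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

def parse_date(text: str) -> Optional[Dict[str, str]]:
    """Return the named parts of the first date found in text, or None."""
    match = DATE.search(text)
    # groupdict() maps each group name to the text it captured.
    return match.groupdict() if match else None

parts = parse_date("Due 2026-03-19")
# {'year': '2026', 'month': '03', 'day': '19'}
```

Named groups are still numbered, so match.group(1) and match.group("year") return the same text.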

Non-Capturing Groups

If you need grouping but do not need to capture the matched text, use a non-capturing group: (?:...). This is slightly more efficient and keeps your capture numbering clean.

(?:https?|ftp)://   Groups the protocol alternatives without capturing
(?:ab)+             Groups "ab" for the quantifier without capturing
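
The choice between capturing and non-capturing groups can change what your code receives. The sketch below assumes Python's standard re module, where re.findall returns the captured text when the pattern contains a capturing group and the whole match otherwise; the URLs are sample input.

```python
import re

text = "http://a.example ftp://b.example"

# Capturing group: findall returns only what the group captured.
captured = re.findall(r"(https?|ftp)://\S+", text)
# ['http', 'ftp']

# Non-capturing group: findall returns each whole match.
whole = re.findall(r"(?:https?|ftp)://\S+", text)
# ['http://a.example', 'ftp://b.example']

# The quantifier still applies to the whole group.
repeated = re.fullmatch(r"(?:ab)+", "ababab")
```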

Alternation

The pipe character | acts as a logical OR, matching either the pattern on its left or the pattern on its right.

cat|dog       Matches "cat" or "dog"
red|green|blue  Matches "red", "green", or "blue"
(cat|dog)s    Matches "cats" or "dogs"

Alternation has low precedence. The pattern abc|def matches the string "abc" or the string "def", not "ab" followed by "c or d" followed by "ef". Use parentheses to limit the scope of alternation: ab(c|d)ef matches "abcef" or "abdef".

The regex engine tries alternatives from left to right and stops at the first match. In the pattern cat|catfish, the engine will match "cat" even when the input is "catfish" because "cat" succeeds first. If you need the longer match, put it first: catfish|cat.
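
The ordering rule is easy to confirm. This is a brief sketch assuming Python's standard re module, which tries alternatives from left to right as described above.

```python
import re

# The shorter alternative comes first, so it wins.
short_first = re.search(r"cat|catfish", "catfish").group()    # 'cat'

# Putting the longer alternative first yields the longer match.
long_first = re.search(r"catfish|cat", "catfish").group()     # 'catfish'

# Parentheses limit the scope of the alternation.
plurals = re.findall(r"(?:cat|dog)s", "cats dogs birds")      # ['cats', 'dogs']
scoped = re.fullmatch(r"ab(?:c|d)ef", "abdef") is not None    # True
```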

Greedy vs Lazy Matching

By default, quantifiers are greedy: they match as much text as possible while still allowing the overall pattern to succeed. This default behavior can cause unexpected results.

Greedy Matching Problem

Pattern: <.+>

Input: <div>text</div>

Expected match: <div>

Actual match: <div>text</div> (greedy .+ consumed everything it could)

Adding ? after a quantifier makes it lazy (also called non-greedy or reluctant). A lazy quantifier matches as little text as possible.

.*    Greedy: matches as much as possible
.*?   Lazy: matches as little as possible
.+    Greedy: one or more, as much as possible
.+?   Lazy: one or more, as little as possible
.{2,5}   Greedy: between 2 and 5, prefers 5
.{2,5}?  Lazy: between 2 and 5, prefers 2

Lazy Matching Solution

Pattern: <.+?>

Input: <div>text</div>

Match: <div> (lazy .+? stopped at the first >)
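
Here is a side-by-side sketch of greedy and lazy quantifiers, assuming Python's standard re module. The input string is a made-up sample.

```python
import re

html = "<b>bold</b>"   # made-up sample input

# Greedy: .+ runs to the last ">" in the string.
greedy = re.search(r"<.+>", html).group()      # '<b>bold</b>'

# Lazy: .+? stops at the first ">" that lets the pattern succeed.
lazy = re.search(r"<.+?>", html).group()       # '<b>'

# With the lazy form, each tag is matched separately.
all_tags = re.findall(r"<.+?>", html)          # ['<b>', '</b>']

# The same rule applies to counted quantifiers.
prefers_max = re.search(r".{2,5}", "abcdefg").group()    # 'abcde'
prefers_min = re.search(r".{2,5}?", "abcdefg").group()   # 'ab'
```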

Lookahead and Lookbehind

Lookahead and lookbehind (collectively called lookaround) are zero-width assertions that check whether a pattern exists ahead of or behind the current position without consuming characters. They let you match text based on its context.

Positive Lookahead (?=...)

Matches if the pattern inside the lookahead exists ahead of the current position, but does not include the lookahead text in the match.

\d+(?= dollars)   Matches digits followed by " dollars"
                  In "100 dollars", matches "100" (not " dollars")

\w+(?=\()         Matches a word followed by an opening parenthesis
                  In "func()", matches "func" (not the parenthesis)

Negative Lookahead (?!...)

Matches if the pattern inside the lookahead does not exist ahead of the current position.

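\d+(?! dollars)   Matches digits NOT followed by " dollars"
                  In "100 euros", matches "100"
                  In "100 dollars", still matches "10": the engine backtracks
                  to a shorter run of digits that is not followed by " dollars"

\d+(?!\d| dollars)  Also forbids a following digit, so the match cannot stop early
                    In "100 dollars", does not match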

foo(?!bar)        Matches "foo" not followed by "bar"
                  Matches "foo" in "foobaz" but not in "foobar"

Positive Lookbehind (?<=...)

Matches if the pattern inside the lookbehind exists behind the current position.

(?<=\$)\d+        Matches digits preceded by a dollar sign
                  In "$50", matches "50" (not the $)

(?<=@)\w+         Matches a word preceded by @
                  In "@username", matches "username"

Negative Lookbehind (?<!...)

Matches if the pattern inside the lookbehind does not exist behind the current position.

(?<!un)happy       Matches "happy" not preceded by "un"
                   Matches "happy" but not "unhappy"

Lookaround is powerful for password validation. A regex to check that a string contains at least one digit and one uppercase letter without constraining the order:

^(?=.*\d)(?=.*[A-Z]).{8,}$
// (?=.*\d)     Must contain a digit somewhere
// (?=.*[A-Z])  Must contain an uppercase letter somewhere
// .{8,}        Must be at least 8 characters
// ^...$        Must match the entire string
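
The lookaround patterns from this section can be checked with a short script. This sketch assumes Python's standard re module; PASSWORD and is_acceptable are hypothetical names, and the sample strings are invented.

```python
import re

# Positive lookahead: the digits are returned, " dollars" is not.
before_dollars = re.findall(r"\d+(?= dollars)", "100 dollars, 200 euros")  # ['100']

# Negative lookahead: "foo" counts only when "bar" does not follow.
not_foobar = re.findall(r"foo(?!bar)", "foobaz foobar")                    # ['foo']

# Positive lookbehind: the digits must come right after a dollar sign.
after_sign = re.findall(r"(?<=\$)\d+", "$50 and 60")                       # ['50']

# Negative lookbehind: "happy" counts only when "un" does not precede it.
not_unhappy = re.findall(r"(?<!un)happy", "happy unhappy")                 # ['happy']

PASSWORD = re.compile(r"^(?=.*\d)(?=.*[A-Z]).{8,}$")

def is_acceptable(password: str) -> bool:
    """Check length, one digit, and one uppercase letter, in any order."""
    return PASSWORD.search(password) is not None
```

Each lookahead in PASSWORD starts from the beginning of the string, which is why the digit and the uppercase letter may appear in either order.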

Flags and Modifiers

Flags (also called modifiers) change how the regex engine processes the pattern. The most commonly used flags are:

Flag	Name	Effect
i	Case-insensitive	Makes matching case-insensitive. /abc/i matches "abc", "ABC", "aBc"
g	Global	Find all matches, not just the first one
m	Multiline	^ and $ match the start/end of each line, not just the string
s	Dotall / Single-line	The dot . matches newline characters as well
u	Unicode	Enables full Unicode matching for characters and properties

In JavaScript, flags are placed after the closing slash: /pattern/gi. In Python, flags are passed as arguments: re.search(pattern, string, re.IGNORECASE | re.MULTILINE). In most other languages, flags are specified similarly as options or constants.
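
As a rough illustration of the Python form, the sketch below applies three of the flags from the table using constants from the standard re module. Python has no g flag: re.findall and re.finditer return every match, while re.search returns only the first. The sample text is made up.

```python
import re

text = "first line\nSecond line"

# re.IGNORECASE plays the role of the i flag.
insensitive = re.findall(r"abc", "abc ABC aBc", re.IGNORECASE)
# ['abc', 'ABC', 'aBc']

# re.MULTILINE (m): ^ matches at the start of every line.
every_line = re.findall(r"^\w+", text, re.MULTILINE)   # ['first', 'Second']
first_only = re.findall(r"^\w+", text)                 # ['first']

# re.DOTALL (s): the dot also matches the newline between the lines.
with_dotall = re.search(r"first.*Second", text, re.DOTALL)   # a match
without_dotall = re.search(r"first.*Second", text)           # None
```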

Common Regex Patterns

Here are practical regex patterns you can use as starting points for common validation and extraction tasks. Test each one in our Regex Tester to see how they work.

Email Address (Simple)

^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$

This matches common email formats. Note that fully validating email addresses according to RFC 5322 requires an extremely complex regex. For most applications, this simple pattern is sufficient, and you should verify emails by sending a confirmation message rather than relying solely on regex.

URL

https?://[^\s/$.?#].[^\s]*

Matches HTTP and HTTPS URLs. A comprehensive URL regex is very long. This simple version works for most practical cases.

IP Address (IPv4)

^(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)$

Validates IPv4 addresses with correct range checking (0-255 per octet).

Date (YYYY-MM-DD)

^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$

Matches dates in ISO 8601 format. Note that this does not validate whether a date actually exists (it would accept February 31). Use date libraries for full validation.
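
One way to combine the pattern with a date library is sketched below. It assumes Python's standard re and datetime modules; ISO_DATE and is_real_date are names invented for this example. The regex checks the shape, and datetime.date rejects combinations that do not exist on the calendar.

```python
import re
from datetime import date

ISO_DATE = re.compile(r"(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])")

def is_real_date(text: str) -> bool:
    """Return True if text is YYYY-MM-DD and names a real calendar date."""
    match = ISO_DATE.fullmatch(text)
    if match is None:
        return False          # wrong shape, such as month 13
    year, month, day = (int(part) for part in match.groups())
    try:
        date(year, month, day)   # raises ValueError for dates like February 31
    except ValueError:
        return False
    return True
```

With this helper, "2026-03-19" is accepted, while "2026-02-31" passes the regex but is rejected by the date check.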

Strong Password

^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$

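Requires at least one lowercase letter, one uppercase letter, one digit, one special character from the set @$!%*?&, and a minimum of 8 characters. Because the final character class lists only letters, digits, and those symbols, a password containing any other character (a space or #, for example) is rejected.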

Hex Color Code

^#([0-9A-Fa-f]{3}|[0-9A-Fa-f]{6})$

Matches 3-digit and 6-digit hex color codes with the # prefix.

For transforming text between different cases after extracting with regex, our Text Case Converter converts between camelCase, snake_case, kebab-case, and more. When you need to generate URL-friendly slugs from text, our Slug Generator handles the conversion. And for comparing text before and after regex replacements, our Diff Checker highlights the differences clearly.

Important: Do not use regex for parsing HTML, XML, or other structured languages. Regular expressions cannot handle the arbitrary nesting and context sensitivity of these formats. Use a proper parser instead. Regex is excellent for validating and extracting patterns from flat text, but it is the wrong tool for parsing hierarchical document structures.
