Difference between revisions of "Regular expressions"

Revision as of 05:52, 1 July 2012


Regular expressions (regex) are essentially a search engine for finding patterns in text. While the syntax is a bit tricky to learn, regex will save a great deal of time and effort in the long run. Many of you are probably already familiar with the idea, even if only through the use of wildcards. Wildcard notation, such as *.html, matches all HTML files in the given search directory. Regex takes this idea and expands on it dramatically, allowing for very complicated search patterns. A regular expression to find all HTML files in a given directory would be .*\.html$ (any run of characters, then a literal dot, then html at the end of the name).
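The comparison between a wildcard and its regex equivalent can be sketched as follows. This is a minimal illustration, assuming Python's re module as a stand-in for the Perl 5 engine this article describes; the list of filenames is invented for the example.

```python
import fnmatch
import re

# Hypothetical directory listing, invented for this example.
filenames = ["index.html", "about.html", "notes.txt", "page.html.bak"]

# Wildcard (glob) matching, the way a shell would treat *.html.
glob_matches = fnmatch.filter(filenames, "*.html")

# The regex from the text: any characters, a literal dot, then "html" at the end.
pattern = re.compile(r".*\.html$")
regex_matches = [name for name in filenames if pattern.search(name)]

print(glob_matches)
print(regex_matches)
```

Both approaches select the same two files here. The regex form is longer, but each piece of it (the dot, the star, the escaped dot, the dollar sign) can be swapped out independently, which is where the extra expressive power comes from.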

Syntax

The following information pertains to the Perl 5 regex engine. Different engines have slightly different syntax.

Characters

Literals

The most basic regex is a literal character. The regex a matches the a in the string alex. In a string such as adam, however, it will only match the first a, the one before the d, unless you tell the regex engine to keep searching. This is why most text editors that have a 'find' function also have a 'find next' function.

Similarly, a regex search for hat in the string Blackhat Academy will return 'hat' from the end of the first word. This is merely a string of literal characters, and the regex engine handles it the same way as it handles a single literal character.
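A short sketch of literal matching, again assuming Python's re module rather than Perl 5 itself. Here re.search stands in for 'find' and re.finditer for repeated 'find next'.

```python
import re

# 'find': only the first a in "adam" is reported.
first = re.search(r"a", "adam")
first_position = first.start()

# 'find next', repeated until the end of the string.
all_positions = [m.start() for m in re.finditer(r"a", "adam")]

# A string of literals behaves the same way as a single literal.
hat = re.search(r"hat", "Blackhat Academy")
hat_span = hat.span()

print(first_position, all_positions, hat_span)
```

The span for hat covers the last three characters of the first word, which is the match described above.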

Metacharacters

Regex wouldn't be very useful if all we wanted was to match literal strings of text. That's where special characters, or metacharacters, come into play. Each metacharacter has its own use, and in order to use any of them as a literal character, you need to escape it with a \ (backslash).

There are 11 metacharacters we'll use:

Opening square bracket [
Backslash \
Caret ^
Dollar Sign $
Dot .
Pipe |
Question mark ?
Asterisk *
Plus +
Open round bracket (
Close round bracket )

Take 1+1=2 for example. The correct regex to match this string is 1\+1=2. It's important to note that 1+1=2 is still a valid regex, but it wouldn't match the string 1+1=2. Instead, it would match 111=2 in the string 111=25*4+11. This is a great example of why you need to escape special characters when you want their literal counterpart, and of how forgetting to do so will not always throw an error.
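The 1+1=2 example can be checked directly. The sketch below assumes Python's re module; re.escape is a Python helper for building the escaped form and is not part of Perl 5.

```python
import re

text = "111=25*4+11"

# Unescaped: + means "one or more of the preceding 1", so this finds 111=2.
unescaped = re.search(r"1+1=2", text)

# Unescaped pattern against the string it was meant for: no match, and no error.
missed = re.search(r"1+1=2", "1+1=2")

# Escaped: \+ is a literal plus sign.
escaped = re.search(r"1\+1=2", "1+1=2")

# Python helper that escapes metacharacters for you.
auto_pattern = re.escape("1+1=2")
auto = re.fullmatch(auto_pattern, "1+1=2")

print(unescaped.group(), missed, escaped.group())
```

Note that the unescaped pattern fails silently on the intended string, which is exactly the kind of mistake that is easy to overlook.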

Non-Printable

Non-printable characters are ASCII or ANSI characters that have no visible glyph, such as control characters. Popular non-printable characters include the tab character, represented by \t or \x09, the carriage return (\r or \x0D), and the new line (\n or \x0A). These are by no means the only non-printable characters you can use in regular expressions. In fact, you can use the hexadecimal code of any character in the character set you are working with.

Being able to use the ASCII character codes means that even if you're trying to match a character that isn't on your keyboard, you can still easily match it.

Protip: Windows uses \r\n to terminate lines, while UNIX derivatives terminate lines with just \n
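A minimal sketch of matching non-printable characters, assuming Python's re module, which accepts the same \t, \r, \n and \xHH escapes for these cases. The sample string is invented for the example.

```python
import re

# Invented sample: one Windows-style line ending, one UNIX-style.
sample = "name\tvalue\r\nnext\tline\n"

# The named escape and the hexadecimal escape find the same tabs.
tabs_named = re.findall(r"\t", sample)
tabs_hex = re.findall(r"\x09", sample)

# An optional \r followed by \n covers both line-ending conventions.
line_endings = re.findall(r"\r?\n", sample)

# A character that may not be on your keyboard, matched by its code.
accented = re.search(r"\xe9", "caf\u00e9")

print(len(tabs_named), len(tabs_hex), line_endings)
```

The \r?\n pattern is a common way to handle text that may come from either Windows or a UNIX derivative.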

Character Classes (Sets)

Character classes, put simply, allow regex engines to match only one out of several provided characters. A character class is denoted by square brackets around the character variations that you're looking for. For instance, if you want to find all instances of the word grey, both American and British, in a document, you could search for gr[ae]y. This would match either gray or grey. It will not, however, match graay or graey.

You can also use hyphens inside a character class to represent a range of characters, such as [0-9] to represent any digit. Similarly, [a-z] would match any single lowercase letter, and [A-Z] would match any single uppercase letter. More importantly, perhaps, you can chain these together. For instance, to match any single alphanumeric character, you could search for [a-zA-Z0-9]. To find any single hexadecimal digit, you could search for [a-fA-F0-9].
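The character classes above can be tried out as follows. This is a sketch assuming Python's re module; the sample strings are invented.

```python
import re

# One character out of the set: matches gray and grey, not graay or graey.
spellings = re.findall(r"gr[ae]y", "gray grey graay graey")

# A range: any single digit.
digits = re.findall(r"[0-9]", "room 42b")

# Chained ranges: any single hexadecimal digit.
hex_digits = re.findall(r"[a-fA-F0-9]", "0xBEEF zz 7g")

print(spellings, digits, hex_digits)
```

Each class consumes exactly one character, which is why x, z and g are skipped individually rather than stopping the search.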

Negated Character Classes

Negating a character class is done by typing a caret ^ immediately after the opening bracket. A negated character class matches any character that is not in the defined class. It's important to note that even a negated character class has to match some character. The regular expression foo[^b] does not mean "match any instance of foo not followed by b," but rather "match any instance of foo followed by any character that isn't a b."
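The point that a negated class must still consume a character is easiest to see with a few test strings. A minimal sketch, assuming Python's re module:

```python
import re

pattern = re.compile(r"foo[^b]")

# Nothing follows foo, so there is no character for [^b] to match.
at_end = pattern.search("foo")

# foo followed by b: the negated class rejects it.
followed_by_b = pattern.search("foob")

# foo followed by something else: the extra character is part of the match.
followed_by_d = pattern.search("food")
followed_by_space = pattern.search("foo bar")

print(at_end, followed_by_b, followed_by_d.group(), repr(followed_by_space.group()))
```

The first case is the one that surprises people: the bare string foo is not followed by b, yet the pattern does not match it.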

Metacharacters

The only metacharacters inside character classes are the closing bracket ], the backslash \, the caret ^ and the hyphen -. All the other metacharacters are treated as literals inside a character class and do not need to be escaped with a backslash. For instance, if you wanted to find a + or a *, you could use regex that looks like [+*].

If you want to include a backslash in your character class, you need to escape it with another backslash: [\\+] matches a backslash or a plus sign. The other metacharacters that character classes use can be included as literals inside the class as long as they are in a position where they cannot serve their special purpose. For instance, to search for a closing bracket, a hyphen, or a caret, you could write a regular expression that looks like []^-]. This is a little confusing for some, and if you prefer, you can instead escape the metacharacters that you want to search for. This decreases readability for those less skilled, but increases consistency. To search for a closing bracket, a hyphen, or a caret using escapes, you could write [\^\-\]]. Both methods are entirely valid, and it's up to the user to decide which to use.

The exception to using the backslash to escape metacharacters inside character classes occurs when using POSIX regular expressions. POSIX treats the backslash as a literal inside character classes, so to include the other metacharacters as literals you have to rely on their position in the class, as in []^-] above.
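The positional and escaped forms can be compared side by side. This sketch assumes Python's re module, which follows the Perl-style rules described here; it does not demonstrate the POSIX behaviour. The sample string is invented.

```python
import re

# Invented sample containing each character of interest once.
sample = r"a+b*c\d]e^f-g"

# + and * are plain literals inside a class.
plus_or_star = re.findall(r"[+*]", sample)

# A backslash has to be escaped with another backslash.
backslash_or_plus = re.findall(r"[\\+]", sample)

# Positional form: ] first, ^ not first, - last.
positional = re.findall(r"[]^-]", sample)

# Escaped form of the same class.
escaped = re.findall(r"[\^\-\]]", sample)

print(plus_or_star, backslash_or_plus, positional, escaped)
```

Both spellings of the last class return the same three characters, so the choice between them comes down to readability.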

Shorthand

Negated Shorthand

Repeating Character Classes

Dot

Anchors

Word Boundaries

Alternation

Quantifiers

Tools

Utilities

Programming Languages

Gnulib
Java
JavaScript
.NET
Perl
PHP
PowerShell
Python
Ruby

Databases

