
Difference between revisions of "Regular expressions"

From NetSec
(Specials)
(Repetition)
 
(27 intermediate revisions by 4 users not shown)
Line 1: Line 1:
 +
'''Regular Expressions''' (regex) are essentially a search engine for finding patterns in a text, useful in [[perl]], and many other [[programming]] lanaguages, it is even possible to perform [[sql injection with regular expressions]]. While the syntax is a bit tricky to learn, regex will save tons of time and effort in the long run. Many of you are probably familiar with regex, even if only through the use of wildcards. Wildcard notation, such as <code>*.html</code>, matches to all html files in the given search directory. Regex takes this idea and expands on it dramatically, allowing for very complicated search patterns. A regular expression to find all html files in a given directory would be <code>.*\.html$</code>
 +
 +
It's important to note that regular expressions rely very heavily on context. Each character has the potential to mean something entirely different depending on it's position in the expression. This is a relatively hard concept for beginners to understand. Just keep at it and, with practice, you'll be a keyboard cowboy in no time.
 +
{{wrongPerson}}{{cleanup}}
 
<center>{{Expand}}</center>
 
<center>{{Expand}}</center>
Regular Expressions (regex) are essentially a search engine for finding patterns in a text. While the syntax is a bit tricky to learn, regex will save tons of time and effort in the long run. Many of you are probably familiar with regex, even if only through the use of wildcards. Wildcard notation, such as <code>*.html</code>, matches to all html files in the given search directory. Regex takes this idea and expands on it dramatically, allowing for very complicated search patterns. A regular expression to find all html files in a given directory would be <code>.*\.html$</code>
 
 
==Syntax==
 
==Syntax==
 +
{{info|The following information pertains to the Perl 5 Regex engine. Different engines have slightly different syntax}}
 
===Characters===
 
===Characters===
 
====Literals====
 
====Literals====
 
The most basic regex is a literal character. A literal character, such as <code>a</code> matches <code>a</code> in the string <code>alex</code>. However, in a string such as <code>adam</code>, it will only match the first <code>a</code>, before the 'd', unless you tell the regex engine otherwise. Most text editors that have a 'find' function, also have a 'find next' function.
 
The most basic regex is a literal character. A literal character, such as <code>a</code> matches <code>a</code> in the string <code>alex</code>. However, in a string such as <code>adam</code>, it will only match the first <code>a</code>, before the 'd', unless you tell the regex engine otherwise. Most text editors that have a 'find' function, also have a 'find next' function.
  
Similarly, a regex search for <code>hat</code> in the string <code>Blackhat Academy</code> will return 'hat' from the end of the first word. This is merely a string of literal characters, and the regex engine handles it the same way as it handles a single literal character.
+
Similarly, a regex search for <code>Sec</code> in the string <code>NetSec</code> will return 'Sec' from the end of the first word. This is merely a string of literal characters, and the regex engine handles it the same way as it handles a single literal character.
  
 
====Metacharacters====
 
====Metacharacters====
Line 27: Line 31:
  
 
====Non-Printable====
 
====Non-Printable====
 +
Non-printable characters are any [[ASCII]] or ANSI character not represented in the standard A-z0-9 character set. Popular non-printable characters include the tab character, represented by <code>\t</code> or <code>\x09</code>, the carriage return (<code>\r</code> or <code>\x0D</code>), and the new line (<code>\n</code> or <code>\0A</code>). These are by no means the only non printable characters you can use in regular expressions. In fact, you can use any [[ASCII]] or ANSI character code for the character set you are working with.
 +
 +
Being able to use the ASCII character codes means that even if you're trying to match a character that isn't on your keyboard, you can still easily match it.
 +
 +
{{Protip|Windows uses <code>\r\n</code> to terminate lines, while UNIX derivatives terminate lines with just <code>\n</code>}}
 +
 
===Character Classes (Sets)===
 
===Character Classes (Sets)===
 +
Character classes, put simply, allow regex engines to match only one out of several provided characters. A character class is denoted by square brackets around the character variations that you're looking for. For instance, if you want to find all instances of the word grey, both American and British, in a document, you could search for <code>gr[ae]y</code>. This would match either <code>gray</code> or <code>grey</code>. It will not, however, match <code>graay</code> or <code>graey</code>.
 +
 +
You can also use hyphens inside a character class to represent a range of characters, such as <code>[0-9]</code> to represent any digit. Similarly, <code>[a-z]</code> would match any single, lowercase letter, and <code>[A-Z]</code> would match any single, uppercase letter. More importantly, perhaps, is that you can chain these together. For instance, if I wanted to match any AlphaNumeric character, I could search for <code>[a-zA-Z]</code>. If I wanted to find any single hexadecimal digit, I could search for <code>[a-fA-F0-9]</code>.
 
====Negated Character Classes====
 
====Negated Character Classes====
 +
Negating a character class is done by typing a caret <code>^</code> after the opening bracket. Negating a character class means it will match any character that's ''not'' in the defined character class. It's important to note that even negated character classes have to match some character. A regular expression <code>foo[^b]</code> does not mean "match any instance of <code>foo</code> not followed by <code>b</code>," but rather "match any instance of <code>foo</code> followed by any character that isn't a <code>b</code>.
 +
 
====Metacharacters====
 
====Metacharacters====
 +
The only metacharacters inside character classes are the closing bracket <code>]</code>, the backslash <code>\</code>, the caret <code>^</code> and the hyphen <code>-</code>. All the other metacharacters are treated as literals inside a character class and do not need to be escaped with a backslash. For instance, if you wanted to find a <code>+</code> or a <code>*</code>, you could use regex that looks like <code>[+*]</code>.
 +
 +
If you want to include a backslash in your character class, you need to escape it with another backslash, <code>[\\+]</code> matches a backslash or a plus sign. The other metacharacters that character classes use can be included as a literal inside the class as long as they are in a position that doesn't match it's purpose. For instance, to search for a closing bracket, a hyphen, or a caret, you could write a regular expression that looks like <code>[]^-]</code>. This is a little confusing for some, and if you'd prefer, you can write it by escaping the metacharacters that you want to search for. This decreases readability for those less skilled, but increases consistency. To search for a closing bracket, a hyphen, or a caret, using escapes, you could write <code>[\^\-\]]</code>. Both methods are entirely valid, and it's entirely up to the user to decide how to write it.
 +
 +
The exception to using the backslash to escape metacharacters inside character classes occurs when using [[POSIX]] regular expressions. POSIX treats backslashes as a literal inside character classes, so to use the other metacharacters, you would want to use the [[ASCII]] method as mentioned earlier.
 +
 
====Shorthand====
 
====Shorthand====
 +
Character classes are made easier by the shorthand that has been developed for them. Since many character classes use used frequently, such as <code>[0-9]</code>, a shorter notation is used like <code>[\d]</code>. The <code>\d</code> shorthand stands for the digits, 0-9. The <code>\w</code> shorthand stands for word, and contains the character class <code>[a-zA-Z0-9_]</code>. The shorthand <code>\s</code> contains the whitespace characters. This shorthand varies from flavor to flavor, but in the [[Perl]] engine, it matches <code>[ \t\r\n]</code>.
 +
 +
Shorthand can be used both inside and outside of character classes. <code>\s\d</code> matches a space followed by a digit, while <code>[\s\d]</code> will match either a whitespace character or a digit.
 +
 
====Negated Shorthand====
 
====Negated Shorthand====
 +
Shorthand also has negated versions. Just like you can negate a character class by using <code>^</code>, as in <code>[^a-z]</code>, you can negate the shorthand outside of character classes using <code>\D</code> (negates <code>\d</code>), <code>\W</code> (negates <code>\w</code>), and <code>\S</code> (negates <code>\s</code>). Be careful not to mix up <code>[^\s\d]</code> and <code>[\S\D]</code>. These are not the same things. <code>[^\s\d]</code> will match any character that is not whitespace and not a digit. But <code>[\S\D]</code> will match any character that's not whitespace or not a digit. Since whitespace is not a digit, and digits are not whitespace, this regex will match any character, whether it's digit, whitespace, or otherwise.
 +
 
====Repeating Character Classes====
 
====Repeating Character Classes====
 +
Repeating character classes can be useful depending on the situation. You can repeat a character class by using <code>?</code>, <code>*</code> or <code>+</code>. Say I wanted to isolate a set of numbers that are not 0 or 1 inside a larger file, without caring how long the string of numbers may be. The best way to do this is to repeat a character class, <code>[2-9]+</code>.
 +
 
===Dot===
 
===Dot===
 +
The dot (<code>.</code>) is the "Almost Anything" character of regular expressions. The dot matches a single character, including anything other than newlines. By default, the dot is essentially shorthand for the negated character class <code>[^\n]</code>. It is easily one of the most used metacharacters in regular expressions.
 +
 +
Since the dot doesn't normally match the new line character, modern tools and languages have an option to allow the dot to match new lines. The reason that the dot doesn't include new line characters by default is because in the original regex engines, files were read line-by-line to find your search, causing the engine to never encounter a newline character, as it only applied the regex to each individual line.
 +
 +
In [[Perl]], you can match new lines with the dot using a feature called single-line mode. This is not to be confused with multi-line mode, which only effects anchors.
  
 
===Anchors===
 
===Anchors===
 +
Anchors are a whole new concept in regular expressions. Instead of matching characters, they match positions. <code>^</code> matches the position before the first character in the line, while <code>$</code> matches the position after the last character in the line. For instance, searching for <code>^f</code> in the string <code>foobar</code> will match "f", but not any further. Applying <code>^b</code> to the same string would return no results, since b is not the first character in the string. Similarly, <code>r$</code> matches "r", but <code>o$</code> applied to the same string doesn't match anything.
 +
 +
====Permanent Anchors====
 +
While <code>^</code> and <code>$</code> match to the start and end of lines, respectively. <code>\A</code> matches the very start of the string, and <code>\Z</code> matches the very end of the string.
 +
 +
====Zero-Length Matches====
 +
Because anchors match a position rather than a character, it is possible to have a zero-length match by using only anchors in your expression. This can be very useful if you want to add something to the start or the end of any line.
  
 
===Word Boundaries===
 
===Word Boundaries===
 +
The metacharacter <code>\b</code> is an anchor, similar to <code>^</code> and <code>$</code>. It matches at the "word boundary."
 +
 +
The word boundary position can be one of three things:
 +
*Before the first character in a string, if the first character is a word character.
 +
*After the last character in the string, if the last character is a word character.
 +
*Between two characters in a string, where one is a word character and the other is not.
 +
 +
This allows for "whole words only" searches, by simply including a <code>\b</code> in front of and after your search term. To clarify, a word character is generally defined as <code>[a-zA-Z0-9_]</code>, represented by the shorthand <code>\w</code>
 +
 +
====Negated Word Boundaries====
 +
<code>\B</code> is the negated <code>\b</code>, meaning that it will match at every position where <code>\b</code> doesn't. This means that it effectively matches any position between two word characters, as well as any position between two non-word characters.
  
 
===Alternation===
 
===Alternation===
 +
Alternation is sort of like character classes on steroids. Where character classes match a single character out of several possible characters, alternation can match a single regular expression out of several possible expressions. If you wanted to find the literal string foo or the literal string bar, you can just separate them with a pipe (<code>|</code>): <code>foo|bar</code>
 +
 +
It's important to note that the alternation metacharacter has the lowest precedence of all regex operators. This means that it's either going to match everything on the left of the vertical bar, or everything on the right. This means that it will match everything after the first instance of what you search for, unless you specify a boundary first. For instance, if you want to search for either <code>white hat</code> or <code>black hat</code>, you could write an expression: <code>\b(white|black)\b hat</code>
 +
 +
===Repetition===
 +
There are three repetition operators, or quantifiers in regular expressions, each having it's own properties.
 +
 +
The question mark, <code>?</code> makes the preceding token in your expression optional. This means that <code>(Net)?Secy</code> will match both <code>NetSec</code> and <code>Sec</code>.
 +
 +
The asterisk, <code>*</code> tells the regex engine to attempt to match the previous token zero or more times. This means that it's okay if that token matches nothing.
 +
 +
The plus, <code>+</code> tells the engine to attempt to match the previous token one or more times. This means that your expression will fail if the token before the plus doesn't match anything.
 +
 +
====Limiting Repetition====
 +
Several regex engines have an additional quantifier that allows you to specify the number of times your token can be repeated. The syntax is <code>{min,max}</code>. The limits on this are <code>0 <= min <= max</code>, meaning min has to be at least 0, and less than max, while max has to be at least min. You can also omit the max argument, while leaving the comma, in order to repeat up to an infinite number of times.
 +
*<code>{0,}</code> = <code>*</code>
 +
*<code>{1,}</code> = <code>+</code>
  
===Quantifiers===
+
If you omit both the comma and the max, <code>{10}</code> for example, then you can match the exact number inside the curly braces.
  
 
==Tools==
 
==Tools==
Line 51: Line 120:
 
*Gnulib
 
*Gnulib
 
*Java
 
*Java
*JavaScript
+
*[[JavaScript]]
 
*.NET
 
*.NET
*Perl
+
*[[C]]
*PHP
+
*[[Perl]]
 +
*[[PHP]]
 
*PowerShell
 
*PowerShell
*Python
+
*[[Python]]
*Ruby
+
*[[Ruby]]
 +
 
 
===Databases===
 
===Databases===
 
*[http://dev.mysql.com/doc/refman/5.1/en/regexp.html MySQL]
 
*[http://dev.mysql.com/doc/refman/5.1/en/regexp.html MySQL]
 
*[http://docs.oracle.com/cd/B19306_01/appdev.102/b14251/adfns_regexp.htm Oracle]
 
*[http://docs.oracle.com/cd/B19306_01/appdev.102/b14251/adfns_regexp.htm Oracle]
 
*[http://www.postgresql.org/docs/9.0/static/functions-matching.html PostgreSQL]
 
*[http://www.postgresql.org/docs/9.0/static/functions-matching.html PostgreSQL]
 +
 +
== chart ==
 +
 +
there's a little chart on the [[sqli]] page of regex stuff at [[Sqli#Using_Regular_Expressions_for_Boolean_enumeration]]

Latest revision as of 05:02, 29 May 2015

Regular expressions (regex) are essentially a search engine for finding patterns in text. They are useful in Perl and many other programming languages, and it is even possible to perform SQL injection with regular expressions. While the syntax is a bit tricky to learn, regex will save tons of time and effort in the long run. Many of you are probably familiar with the idea already, even if only through the use of wildcards. Wildcard notation, such as *.html, matches all HTML files in the given search directory. Regex takes this idea and expands on it dramatically, allowing for very complicated search patterns. A regular expression that matches the names of all HTML files in a given directory would be .*\.html$

It's important to note that regular expressions rely very heavily on context. Each character has the potential to mean something entirely different depending on its position in the expression. This is a relatively hard concept for beginners to understand. Just keep at it and, with practice, you'll be a keyboard cowboy in no time.


Syntax

Note: The following information pertains to the Perl 5 regex engine. Different engines have slightly different syntax.

Characters

Literals

The most basic regex is a literal character. A literal character such as a matches the a in the string alex. In a string such as adam, however, it matches only the first a (the one before the d) unless you tell the regex engine to keep searching. This is the same distinction most text editors make between their 'find' and 'find next' functions.

Similarly, a regex search for Sec in the string NetSec will return 'Sec' from the end of the first word. This is merely a string of literal characters, and the regex engine handles it the same way as it handles a single literal character.
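As a rough illustration of "first match" versus "find next", here is a minimal sketch. It uses Python's re module rather than the Perl 5 engine this article describes (an assumption made for the example); for plain literals the two should behave the same way.

```python
import re

# re.search stops at the first match, like a single "find".
first = re.search(r"a", "adam")
first_index = first.start()

# re.findall keeps going, like pressing "find next" until the end.
all_matches = re.findall(r"a", "adam")

# A run of literal characters is handled just like a single literal.
sec_span = re.search(r"Sec", "NetSec").span()

print(first_index, all_matches, sec_span)
```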

Metacharacters

Regex wouldn't really be that useful if we just wanted to match literal strings of text. That's where special characters come into play. Each metacharacter has its own use, and in order to use any of them as a literal character, you need to escape it with a \ (backslash).

There are 11 metacharacters we'll use:

  • Opening square bracket [
  • Backslash \
  • Caret ^
  • Dollar Sign $
  • Dot .
  • Pipe |
  • Question mark ?
  • Asterisk *
  • Plus +
  • Open round bracket (
  • Close round bracket )

Take 1+1=2 for example. The correct regex to match this string is 1\+1=2. It's important to note that 1+1=2 is still a valid regex, but it wouldn't match the string 1+1=2. Instead, it would match 111=2 in the string 111=25*4+11. This is a great example of why you need to escape special characters when you want their literal counterparts, and of how the engine will not necessarily throw an error when you forget to.
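The escaping example above can be checked with a short sketch. Python's re module is used here as a stand-in for the Perl engine (an assumption for illustration); the + quantifier and the \+ escape work the same way in both.

```python
import re

# Unescaped: "1+" means "one or more 1s", so this finds 111=2.
unescaped = re.search(r"1+1=2", "111=25*4+11").group()

# Escaped: "\+" is a literal plus sign.
escaped = re.search(r"1\+1=2", "1+1=2").group()

# The unescaped pattern is valid, but it does not match the literal text.
no_match = re.search(r"1+1=2", "1+1=2")

print(unescaped, escaped, no_match)
```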

Non-Printable

Non-printable characters are ASCII or ANSI characters that have no visible glyph, such as control characters, as opposed to the printable letters, digits and punctuation. Popular non-printable characters include the tab character, represented by \t or \x09, the carriage return (\r or \x0D), and the new line (\n or \x0A). These are by no means the only non-printable characters you can use in regular expressions. In fact, you can use any ASCII or ANSI character code for the character set you are working with.

Being able to use the ASCII character codes means that even if you're trying to match a character that isn't on your keyboard, you can still easily match it.


Protip: Windows uses \r\n to terminate lines, while UNIX derivatives terminate lines with just \n
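A small sketch of the escapes above, written with Python's re module as an assumed stand-in for Perl. The sample strings are made up for the example.

```python
import re

windows_line = "name\tvalue\r\n"

# \t and \x09 are two spellings of the same tab character.
same_tab = (re.search(r"\t", windows_line).start()
            == re.search(r"\x09", windows_line).start())

# \r?\n accepts both Windows (\r\n) and Unix (\n) line endings.
parts = re.split(r"\r?\n", "one\r\ntwo\nthree")

print(same_tab, parts)
```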


Character Classes (Sets)

Character classes, put simply, allow regex engines to match only one out of several provided characters. A character class is denoted by square brackets around the character variations that you're looking for. For instance, if you want to find all instances of the word grey, both American and British, in a document, you could search for gr[ae]y. This would match either gray or grey. It will not, however, match graay or graey.

You can also use hyphens inside a character class to represent a range of characters, such as [0-9] to represent any digit. Similarly, [a-z] would match any single lowercase letter, and [A-Z] would match any single uppercase letter. More importantly, perhaps, you can chain these together. For instance, if I wanted to match any alphanumeric character, I could search for [a-zA-Z0-9]. If I wanted to find any single hexadecimal digit, I could search for [a-fA-F0-9].
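To make the class and range behaviour concrete, here is a minimal sketch in Python's re module (used as an assumed stand-in for Perl; the sample strings are invented). The alphanumeric class here includes the digit range.

```python
import re

# A class matches exactly one character from the set.
greys = re.findall(r"gr[ae]y", "gray grey graay graey")

# Ranges can be chained inside one class.
is_hex = re.fullmatch(r"[a-fA-F0-9]+", "DeadBeef01") is not None
not_hex = re.fullmatch(r"[a-fA-F0-9]+", "xyz") is None
alnum = re.findall(r"[a-zA-Z0-9]", "a-1_B")

print(greys, is_hex, not_hex, alnum)
```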

Negated Character Classes

Negating a character class is done by typing a caret ^ immediately after the opening bracket. Negating a character class means it will match any character that's not in the defined character class. It's important to note that even negated character classes have to match some character. A regular expression foo[^b] does not mean "match any instance of foo not followed by b," but rather "match any instance of foo followed by any character that isn't a b."
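The point that a negated class must still consume a character can be seen in this short sketch (Python's re module, assumed here to behave like Perl for this pattern):

```python
import re

pattern = re.compile(r"foo[^b]")

# "foo" alone has no character after it for [^b] to consume.
at_end = pattern.search("foo")

# "foob" has a following character, but it is a b.
followed_by_b = pattern.search("foob")

# "food" has a following character that is not a b, so it matches,
# and the d is part of the match.
followed_by_d = pattern.search("food").group()

print(at_end, followed_by_b, followed_by_d)
```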

Metacharacters

The only metacharacters inside character classes are the closing bracket ], the backslash \, the caret ^ and the hyphen -. All the other metacharacters are treated as literals inside a character class and do not need to be escaped with a backslash. For instance, if you wanted to find a + or a *, you could use regex that looks like [+*].

If you want to include a backslash in your character class, you need to escape it with another backslash: [\\+] matches a backslash or a plus sign. The other metacharacters that character classes use can be included as literals inside the class as long as they are in a position where they cannot serve their special purpose. For instance, to search for a closing bracket, a hyphen, or a caret, you could write a regular expression that looks like []^-]. This is a little confusing for some, and if you'd prefer, you can instead escape the metacharacters that you want to search for. This decreases readability for those less skilled, but increases consistency. To search for a closing bracket, a hyphen, or a caret using escapes, you could write [\^\-\]]. Both methods are entirely valid, and it's entirely up to the user to decide how to write it.
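Both spellings can be compared side by side. This is a sketch in Python's re module, which (as far as this example goes) accepts the same positional and escaped forms as Perl; the sample strings are invented.

```python
import re

sample = "a]b^c-d"

# Positional form: ] first, ^ not first, - last.
positional = re.findall(r"[]^-]", sample)

# Escaped form: every class metacharacter is backslash-escaped.
escaped = re.findall(r"[\^\-\]]", sample)

# A backslash inside a class must itself be escaped.
slash_or_plus = re.findall(r"[\\+]", r"a\b+c")

print(positional, escaped, slash_or_plus)
```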

The exception to using the backslash to escape metacharacters inside character classes occurs when using POSIX regular expressions. POSIX bracket expressions treat the backslash as a literal, so to include the other metacharacters you have to rely on the positional method described above.

Shorthand

Character classes are made easier by the shorthand that has been developed for them. Since many character classes are used frequently, such as [0-9], a shorter notation such as \d is available. The \d shorthand stands for the digits 0-9. The \w shorthand stands for word characters and is equivalent to the character class [a-zA-Z0-9_]. The shorthand \s covers the whitespace characters. Its exact contents vary from flavor to flavor, but in the Perl engine it matches at least [ \t\r\n\f].

Shorthand can be used both inside and outside of character classes. \s\d matches a whitespace character followed by a digit, while [\s\d] will match a single character that is either whitespace or a digit.
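The difference between a sequence of shorthands and a class containing them is sketched below with Python's re module (an assumed stand-in for Perl; sample strings are invented, and only ASCII input is used so the two engines should agree).

```python
import re

sample = "a 1 b2"

# \s\d is a two-character sequence: whitespace, then a digit.
sequence = re.findall(r"\s\d", sample)

# [\s\d] is one character that is either whitespace or a digit.
either = re.findall(r"[\s\d]", sample)

# \w+ grabs runs of word characters; the hyphen is not one of them.
words = re.findall(r"\w+", "net_sec-01")

print(sequence, either, words)
```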

Negated Shorthand

Shorthand also has negated versions. Just like you can negate a character class by using ^, as in [^a-z], you can negate the shorthand outside of character classes using \D (negates \d), \W (negates \w), and \S (negates \s). Be careful not to mix up [^\s\d] and [\S\D]. These are not the same things. [^\s\d] will match any character that is not whitespace and not a digit. But [\S\D] will match any character that's not whitespace or not a digit. Since whitespace is not a digit, and digits are not whitespace, this regex will match any character, whether it's digit, whitespace, or otherwise.
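The [^\s\d] versus [\S\D] trap is easy to demonstrate. A minimal sketch using Python's re module (assumed to match Perl here), on a made-up four-character sample:

```python
import re

sample = "a 1_"

# Not whitespace AND not a digit: only the letter and the underscore.
neither = re.findall(r"[^\s\d]", sample)

# Not whitespace OR not a digit: every character satisfies one of the two.
anything = re.findall(r"[\S\D]", sample)

print(neither, anything)
```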

Repeating Character Classes

Repeating character classes can be useful depending on the situation. You can repeat a character class by using ?, * or +. Say I wanted to isolate a set of numbers that are not 0 or 1 inside a larger file, without caring how long the string of numbers may be. The best way to do this is to repeat a character class, [2-9]+.

Dot

The dot (.) is the "Almost Anything" character of regular expressions. It matches any single character except a newline. By default, the dot is essentially shorthand for the negated character class [^\n]. It is easily one of the most used metacharacters in regular expressions.

Since the dot doesn't normally match the new line character, modern tools and languages have an option to allow the dot to match new lines. The reason that the dot doesn't include new line characters by default is because in the original regex engines, files were read line-by-line to find your search, causing the engine to never encounter a newline character, as it only applied the regex to each individual line.

In Perl, you can make the dot match newlines by using a feature called single-line mode (the /s modifier). This is not to be confused with multi-line mode (the /m modifier), which only affects anchors.
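A short sketch of the dot with and without newline matching. It is written in Python, where the re.DOTALL flag plays the role of Perl's single-line mode (this mapping is the example's assumption, not something stated in the article).

```python
import re

text = "first\nsecond"

# By default the dot refuses to cross the newline.
default_mode = re.search(r"first.second", text)

# With DOTALL the dot matches the newline as well.
dotall_mode = re.search(r"first.second", text, re.DOTALL)

# .+ therefore stops at each line end in default mode.
per_line = re.findall(r".+", text)

print(default_mode, dotall_mode is not None, per_line)
```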

Anchors

Anchors are a whole new concept in regular expressions. Instead of matching characters, they match positions. ^ matches the position before the first character in the line, while $ matches the position after the last character in the line. For instance, searching for ^f in the string foobar will match "f", but not any further. Applying ^b to the same string would return no results, since b is not the first character in the string. Similarly, r$ matches "r", but o$ applied to the same string doesn't match anything.

Permanent Anchors

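While ^ and $ can match at the start and end of each line when multi-line mode is enabled, \A always matches only at the very start of the string, and \Z only at the very end of the string. In Perl, \Z also matches just before a final trailing newline, while \z matches only at the absolute end.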
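The following sketch contrasts the line anchors with the permanent anchors. It uses Python's re module, with re.MULTILINE standing in for Perl's multi-line mode. One assumption to be aware of: Python's \Z matches only at the absolute end of the string, which differs slightly from Perl's \Z.

```python
import re

text = "foo\nbar"

# Without multi-line mode, ^ only matches at the start of the string.
start_default = re.findall(r"^\w+", text)

# With multi-line mode, ^ and $ match at every line start and end.
start_multi = re.findall(r"^\w+", text, re.MULTILINE)
end_multi = re.findall(r"\w+$", text, re.MULTILINE)

# \A and \Z ignore multi-line mode entirely.
start_perm = re.findall(r"\A\w+", text, re.MULTILINE)
end_perm = re.findall(r"\w+\Z", text, re.MULTILINE)

print(start_default, start_multi, end_multi, start_perm, end_perm)
```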

Zero-Length Matches

Because anchors match a position rather than a character, it is possible to have a zero-length match by using only anchors in your expression. This can be very useful if you want to add something to the start or the end of any line.
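As a sketch of how a zero-length match can be used to add text to every line, the example below substitutes at the anchor positions. It uses Python's re.sub with re.MULTILINE (an assumed equivalent of a Perl s///mg substitution); the prefix and suffix strings are arbitrary.

```python
import re

text = "line one\nline two"

# ^ matches a position, so the replacement is inserted, not swapped in.
quoted = re.sub(r"^", "> ", text, flags=re.MULTILINE)

# The same trick works at the end of each line with $.
terminated = re.sub(r"$", ";", text, flags=re.MULTILINE)

print(quoted)
print(terminated)
```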

Word Boundaries

The metacharacter \b is an anchor, similar to ^ and $. It matches at the "word boundary."

The word boundary position can be one of three things:

  • Before the first character in a string, if the first character is a word character.
  • After the last character in the string, if the last character is a word character.
  • Between two characters in a string, where one is a word character and the other is not.

This allows for "whole words only" searches, by simply including a \b in front of and after your search term. To clarify, a word character is generally defined as [a-zA-Z0-9_], represented by the shorthand \w.

Negated Word Boundaries

\B is the negated \b, meaning that it will match at every position where \b doesn't. This means that it effectively matches any position between two word characters, as well as any position between two non-word characters.
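A "whole words only" search can be sketched as follows, using Python's re module as an assumed stand-in for Perl. The sample sentence is invented to show each boundary case, including the underscore counting as a word character.

```python
import re

text = "cat concat catalog cat_1 cat."

# \bcat\b only matches cat when it is not glued to other word characters.
whole_words = [m.start() for m in re.finditer(r"\bcat\b", text)]

# \Bcat requires a word character right before the c.
inside_word = [m.start() for m in re.finditer(r"\Bcat", text)]

print(whole_words, inside_word)
```

The first list holds the positions of the standalone cat at the start and the one before the full stop; the second holds the cat at the end of concat.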

Alternation

Alternation is sort of like character classes on steroids. Where character classes match a single character out of several possible characters, alternation can match a single regular expression out of several possible expressions. If you wanted to find the literal string foo or the literal string bar, you can just separate them with a pipe (|): foo|bar

It's important to note that the alternation metacharacter has the lowest precedence of all regex operators. This means that the engine will try to match either everything to the left of the vertical bar or everything to the right of it. An expression such as white|black hat therefore matches either white or black hat, not white hat. To limit how far the alternation reaches, group the alternatives with parentheses. For instance, if you want to search for either white hat or black hat, you could write an expression: \b(white|black)\b hat
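The precedence rule is easiest to see by running both forms. This sketch uses Python's re module (assumed equivalent to Perl here) and a non-capturing group (?:...) purely so that findall returns the whole match; that choice belongs to the example, not to the article.

```python
import re

text = "white hat, black hat, white rabbit"

# Ungrouped: the alternatives are "white" and "black hat".
ungrouped = re.findall(r"white|black hat", text)

# Grouped: the alternation is limited to the colour.
grouped = re.findall(r"\b(?:white|black)\b hat", text)

print(ungrouped, grouped)
```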

Repetition

There are three repetition operators, or quantifiers, in regular expressions, each having its own properties.

The question mark, ?, makes the preceding token in your expression optional. This means that (Net)?Sec will match both NetSec and Sec.

The asterisk, * tells the regex engine to attempt to match the previous token zero or more times. This means that it's okay if that token matches nothing.

The plus, + tells the engine to attempt to match the previous token one or more times. This means that your expression will fail if the token before the plus doesn't match anything.
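The three quantifiers can be compared on the same input. A minimal sketch in Python's re module (an assumed stand-in for Perl), with made-up sample strings:

```python
import re

# ? makes the group optional, so both spellings match in full.
optional = [bool(re.fullmatch(r"(Net)?Sec", s)) for s in ("NetSec", "Sec")]

sample = "a ab abbb"

# * allows zero b's, so the lone "a" still matches.
zero_or_more = re.findall(r"ab*", sample)

# + demands at least one b, so the lone "a" is skipped.
one_or_more = re.findall(r"ab+", sample)

print(optional, zero_or_more, one_or_more)
```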

Limiting Repetition

Several regex engines have an additional quantifier that allows you to specify the number of times your token can be repeated. The syntax is {min,max}. The limits on this are 0 <= min <= max, meaning min has to be at least 0 and no greater than max. You can also omit the max argument while leaving the comma, which allows the token to repeat an unlimited number of times.

  • {0,} = *
  • {1,} = +

If you omit both the comma and the max, {10} for example, then you can match the exact number inside the curly braces.
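The brace quantifier can be sketched like this, again using Python's re module as an assumed stand-in for Perl, on an invented run of numbers:

```python
import re

numbers = "1 22 333 4444 55555"

# {2,4}: between two and four digits; it is greedy, so 55555 yields 5555.
bounded = re.findall(r"\d{2,4}", numbers)

# {3}: exactly three digits, held in place by word boundaries.
exact = re.findall(r"\b\d{3}\b", numbers)

# {0,} and {1,} behave like * and +.
same_as_star = re.findall(r"ab{0,}", "a ab abbb") == re.findall(r"ab*", "a ab abbb")
same_as_plus = re.findall(r"ab{1,}", "a ab abbb") == re.findall(r"ab+", "a ab abbb")

print(bounded, exact, same_as_star, same_as_plus)
```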

Tools

Utilities

Programming Languages

Databases

chart

There's a little chart of regex material on the sqli page, at Sqli#Using_Regular_Expressions_for_Boolean_enumeration