Difference between revisions of "Regular expressions"
AlizaLorenzo (Talk | contribs) (→Non-Printable) |
Revision as of 01:13, 1 July 2012
This article contains too little information; it should be expanded or updated.
Regular expressions (regex) are essentially a way to search for patterns in text. While the syntax is a bit tricky to learn, regex will save a great deal of time and effort in the long run. Many of you are probably already familiar with the idea, even if only through the use of wildcards. Wildcard notation, such as <code>*.html</code>, matches all HTML files in the given search directory. Regex takes this idea and expands on it dramatically, allowing for very complicated search patterns. A regular expression to find all HTML files in a given directory would be <code>.*\.html$</code>.
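As a rough illustration of the pattern above, the following sketch filters a list of file names with Python's <code>re</code> module. The file names are made up for the example, and a real script would read them from a directory instead.

```python
import re

# Hypothetical directory listing, used only for illustration.
filenames = ["index.html", "about.html", "notes.txt", "index.html.bak"]

# The pattern from the text: any characters, a literal dot, "html", end of string.
html_pattern = re.compile(r".*\.html$")

# Keep only the names the pattern matches.
html_files = [name for name in filenames if html_pattern.match(name)]
print(html_files)
```

Note that <code>index.html.bak</code> is rejected because <code>$</code> requires the name to end right after <code>html</code>.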
Syntax
Characters
Literals
The most basic regex is a literal character. A literal character, such as <code>a</code>, matches the <code>a</code> in the string <code>alex</code>. However, in a string such as <code>adam</code>, it will match only the first <code>a</code>, the one before the <code>d</code>, unless you tell the regex engine otherwise. This works much like a text editor, where a 'find' function is usually paired with a 'find next' function.
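One way to see the difference between 'find' and 'find next' is the sketch below, which assumes Python's <code>re</code> module as the regex engine. Other engines expose the same idea through different function names.

```python
import re

text = "adam"

# re.search stops at the first match, like an editor's "find".
first = re.search(r"a", text)
first_position = first.start()

# re.finditer keeps scanning after each match, like "find next".
all_positions = [m.start() for m in re.finditer(r"a", text)]

print(first_position)  # 0
print(all_positions)   # [0, 2]
```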
Similarly, a regex search for <code>hat</code> in the string <code>Blackhat Academy</code> will return the <code>hat</code> at the end of the first word. The pattern is merely a string of literal characters, and the regex engine handles it the same way it handles a single literal character.
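A short sketch of the same search, again assuming Python's <code>re</code> module, shows where in the string the literal sequence is found.

```python
import re

# Search for the literal sequence "hat" inside the example string.
match = re.search(r"hat", "Blackhat Academy")

matched_text = match.group()  # the text that was matched
matched_span = match.span()   # (start, end) offsets within the string

print(matched_text)  # hat
print(matched_span)  # (5, 8)
```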
Metacharacters
Regex wouldn't be that useful if all we wanted was to match literal strings of text. That's where special characters come into play. Each metacharacter has its own use, and to use any of them as a literal character you need to escape it with a <code>\</code> (backslash).
There are 11 metacharacters we'll use:
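- Opening square bracket <code>[</code>
- Backslash <code>\</code>
- Caret <code>^</code>
- Dollar sign <code>$</code>
- Dot <code>.</code>
- Pipe <code>|</code>
- Question mark <code>?</code>
- Asterisk <code>*</code>
- Plus <code>+</code>
- Open round bracket <code>(</code>
- Close round bracket <code>)</code>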
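The escaping rule for the metacharacters listed above can be checked with a small sketch. It assumes Python's <code>re</code> module; the sample text <code>2*(3+4)=14?</code> is invented for the example. <code>re.escape</code> is a standard helper that adds the backslashes for you.

```python
import re

# The 11 metacharacters from the list above.
metacharacters = ["[", "\\", "^", "$", ".", "|", "?", "*", "+", "(", ")"]

# Prefixing each one with a backslash turns it into a pattern
# that matches exactly that literal character.
escaped_matches = {
    char: re.fullmatch("\\" + char, char) is not None
    for char in metacharacters
}
print(all(escaped_matches.values()))  # True

# re.escape applies the same idea to a whole string at once.
literal_text = "2*(3+4)=14?"
safe_pattern = re.escape(literal_text)
found = re.search(safe_pattern, "Is 2*(3+4)=14? Yes.")
print(found.group() == literal_text)  # True
```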
Take <code>1+1=2</code>, for example. The correct regex to match this string is <code>1\+1=2</code>. It's important to note that <code>1+1=2</code> is still a valid regex, but it wouldn't match <code>1+1=2</code>. Instead, it would match <code>111=2</code> in the string <code>111=25*4+11</code>, because an unescaped <code>+</code> means "one or more of the preceding character". This is a good example of why you must escape special characters when you want their literal counterparts: an unescaped metacharacter won't always throw an error, so the mistake can go unnoticed.
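The behaviour described here is easy to reproduce. The sketch below assumes Python's <code>re</code> module and compares the escaped and unescaped patterns on the two strings from the text.

```python
import re

escaped = re.compile(r"1\+1=2")   # "+" is a literal plus sign
unescaped = re.compile(r"1+1=2")  # "+" means "one or more of the preceding 1"

# The escaped pattern finds the literal text.
escaped_hit = escaped.search("1+1=2").group()

# The unescaped pattern is valid, but does not match the literal text...
unescaped_on_literal = unescaped.search("1+1=2")

# ...and silently matches something else instead.
unescaped_hit = unescaped.search("111=25*4+11").group()

print(escaped_hit)           # 1+1=2
print(unescaped_on_literal)  # None
print(unescaped_hit)         # 111=2
```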
Non-Printable
Non-printable characters are ASCII or ANSI characters that have no visible representation, so they fall outside the standard A-Za-z0-9 character set. Popular non-printable characters include the tab character, represented by <code>\t</code> or <code>\x09</code>, the carriage return (<code>\r</code> or <code>\x0D</code>), and the new line (<code>\n</code> or <code>\x0A</code>). These are by no means the only non-printable characters you can use in regular expressions. In fact, you can use any ASCII or ANSI character code for the character set you are working with.
Being able to use the ASCII character codes means that even if you're trying to match a character that isn't on your keyboard, you can still easily match it.
Protip: Windows uses <code>\r\n</code> to terminate lines, while UNIX derivatives terminate lines with just <code>\n</code>.
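The sketch below, which assumes Python's <code>re</code> module, shows that the named escape and the hexadecimal code refer to the same character, and that an optional <code>\r</code> lets one pattern handle both Windows and UNIX line endings. The sample strings are invented for the example.

```python
import re

sample = "name\tvalue"

# \t and \x09 are two spellings of the same tab character.
tab_by_name = re.search(r"\t", sample).start()
tab_by_code = re.search(r"\x09", sample).start()
print(tab_by_name == tab_by_code)  # True

# A character code also reaches characters that are not on the keyboard,
# such as the copyright sign (code A9).
copyright_match = re.search(r"\xA9", "\u00a9 2012")
print(copyright_match is not None)  # True

# "\r?\n" matches both the Windows (\r\n) and UNIX (\n) line endings.
mixed = "first\r\nsecond\nthird"
lines = re.split(r"\r?\n", mixed)
print(lines)  # ['first', 'second', 'third']
```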
Character Classes (Sets)
Negated Character Classes
Metacharacters
Shorthand
Negated Shorthand
Repeating Character Classes
Dot
Anchors
Word Boundaries
Alternation
Quantifiers
Tools
Utilities
Programming Languages
- Gnulib
- Java
- JavaScript
- .NET
- Perl
- PHP
- PowerShell
- Python
- Ruby