Difference between revisions of "Perl regex"

Revision as of 22:30, 29 March 2006

This article explains regular expressions in terms understandable to mere mortals, and also how to use them in Perl.

Special characters in regex:

. = any character
* = 0 or more of previous character
^ = following string begins the line (except [^...] means "not these characters")
$ = preceding string ends the line
[] = list of characters which can satisfy the match at this position
{} = # of repetitions of previous character:
- {x} -> exactly x repetitions
- {x,} -> minimum of x repetitions, no maximum
- {x,y} -> minimum of x repetitions, maximum of y repetitions
| = alternatives
+ = 1 or more of previous character
? after +, *, or {} indicates non-greedy behavior, i.e. match the fewest characters, not the most
a-b = range of characters from a to b (only inside []), e.g. "[t-w]" means any of t,u,v,w in that position
(?=...) = lookahead: a(?=b) matches "a, but only if it's followed by a b"; the a becomes part of the matched sequence, but the b does not
(?<=...) = lookbehind: (?<=b)a matches "a, but only if it's preceded by a b"; again only the a becomes part of the matched sequence
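
The greedy/non-greedy and lookaround entries above are easier to see on concrete strings. The following is a minimal sketch, not a complete tour; the sample strings and variable names ($html, $greedy, $lazy and so on) are made up for illustration.

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <b>more</b>";

# Greedy: .* takes as much as it can, so the capture runs up to the LAST </b>.
my ($greedy) = $html =~ m|<b>(.*)</b>|;

# Non-greedy: .*? takes as little as it can, so the capture stops at the FIRST </b>.
my ($lazy) = $html =~ m|<b>(.*?)</b>|;

# {x,y}: two or three digits; ^ and $ force the whole string to fit.
my $fits    = "123"  =~ /^[0-9]{2,3}$/ ? "yes" : "no";
my $too_big = "1234" =~ /^[0-9]{2,3}$/ ? "yes" : "no";

# Lookahead: replace "foo" only where "baz" follows; "baz" itself is untouched.
(my $ahead = "foobar foobaz") =~ s/foo(?=baz)/FOO/;

# Lookbehind: replace "a" only where "y" precedes it; "y" itself is untouched.
(my $behind = "xa ya") =~ s/(?<=y)a/A/;

print "greedy:  $greedy\n";
print "lazy:    $lazy\n";
print "123:     $fits\n";
print "1234:    $too_big\n";
print "ahead:   $ahead\n";
print "behind:  $behind\n";
```

In the two lookaround substitutions only the text outside the parentheses is replaced, which shows that the lookaround part is checked but never consumed.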

Operators used to invoke regex:

=~ returns TRUE if pattern matches
!~ returns TRUE if pattern does not match (the negation of =~)
s/pattern/replacement/gi; replaces the text matched by pattern with replacement
- g (global) means replace every match in the string, not just the first
- i (insensitive) means alphabetic matches are checked case-insensitively
y/searchlist/replacelist/d; replaces each character found in searchlist with the corresponding character in replacelist
- d deletes characters found in searchlist that have no corresponding character in replacelist
tr/ is the same as y/
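
A short sketch of the operators above in use. The sample strings and variable names are invented for illustration; copying into a new variable with (my $x = ...) before applying s/// or tr/// is just one way to keep the original string unchanged.

```perl
use strict;
use warnings;

my $string = "Hello World, hello Perl";

# =~ is true when the pattern matches; !~ is true when it does not.
my $has_perl = ($string =~ /Perl/) ? 1 : 0;
my $no_java  = ($string !~ /Java/) ? 1 : 0;

# s///gi: replace every "hello", ignoring case.
(my $replaced = $string) =~ s/hello/Bye/gi;

# tr/// (same as y///): map each character in the search list to the
# character at the same position in the replacement list.
(my $upper = "abc-abd") =~ tr/a-c/A-C/;

# /d with an empty replacement list: delete every listed character.
(my $digits = "555-123-4567") =~ tr/-//d;

print "has_perl: $has_perl\n";
print "no_java:  $no_java\n";
print "replaced: $replaced\n";
print "upper:    $upper\n";
print "digits:   $digits\n";
```

Note that tr/// works on single characters, not patterns, so none of the special regex characters listed earlier apply inside it (apart from the a-b range notation).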

Replace "thingy" with "stuffs" in $string:
- $string =~ s/thingy/stuffs/;
Keep only the part of $string before the final "/" (using "|" as the delimiter instead of "/"):
- $string =~ s|(.*)/[^/]*|$1|;
...before the final "-":
- $string =~ s|(.*)-[^-]*|$1|;
...before the final ".":
- $string =~ s|(.*)\.[^.]*|$1|;
...after the final ".":
- $string =~ s|^.+\.(.+$)|$1|;
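
To see what the substitutions above leave behind, here they are applied to one sample string. This is only a sketch: the path in $path is a made-up example, and each substitution is run on a fresh copy so the results are independent of each other.

```perl
use strict;
use warnings;

# Hypothetical sample value, chosen so that every pattern has something to match.
my $path = '/usr/local/lib/archive-1.2.tar.gz';

# Keep the part before the final "/". The greedy (.*) backs off only
# as far as the LAST "/", which is why the final one is chosen.
(my $before_slash = $path) =~ s|(.*)/[^/]*|$1|;

# Keep the part before the final "-".
(my $before_dash = $path) =~ s|(.*)-[^-]*|$1|;

# Keep the part before the final ".".
(my $before_dot = $path) =~ s|(.*)\.[^\.]*|$1|;

# Keep the part after the final ".".
(my $after_dot = $path) =~ s|^.+\.(.+$)|$1|;

print "before /: $before_slash\n";
print "before -: $before_dash\n";
print "before .: $before_dot\n";
print "after  .: $after_dot\n";
```

If the string does not contain the separator at all, the pattern fails to match and the string is left as it was.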
