Repetitions * + ?

Repetition operators let you match something that occurs several times in a row.

The star * matches zero or more occurrences of the preceding element, + matches one or more occurrences, and ? matches zero or one occurrence:

=* Nothing, or one or more equals signs
.* The whole file
.+ The whole file (excluding empty files)
\w+ A word (one or more “word” characters)
0?2 Either 2 or 02
0*2 Two with optional leading zeros (e.g., 2, 02, or 00002)
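
The patterns above can be tried in most regex engines. Here is a minimal sketch using Python's re module, which is an assumption made for illustration; Aba's own engine may differ in details. The helper matches_whole is a hypothetical name, not part of Aba, and it checks whether a pattern matches an entire string:

```python
import re

def matches_whole(pattern: str, text: str) -> bool:
    """Return True if the pattern matches the entire text (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

print(matches_whole(r"=*", ""))        # True: * accepts zero occurrences
print(matches_whole(r"=*", "==="))     # True
print(matches_whole(r".+", ""))        # False: + needs at least one character
print(matches_whole(r"\w+", "word"))   # True
print(matches_whole(r"0?2", "02"))     # True
print(matches_whole(r"0?2", "002"))    # False: ? allows at most one zero
print(matches_whole(r"0*2", "00002"))  # True
```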

The difference between * and + is that the first can match an empty string, and the second cannot. For example, if you search for the pattern "\d*", Aba may find the empty quotes "", while searching for "\d+" will not find them.
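
The quoted-digits example can be reproduced as a small sketch in Python's re module. Python is used here as an assumption (the behavior of * and + shown is standard, but Aba's engine is not being run), and the sample text is made up for illustration:

```python
import re

# Sample text containing one pair of empty quotes and one quoted number.
text = 'id = "" and count = "42"'

star = re.findall(r'"\d*"', text)  # zero or more digits between quotes
plus = re.findall(r'"\d+"', text)  # one or more digits between quotes

print(star)  # ['""', '"42"']
print(plus)  # ['"42"']
```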

The question mark ? is useful for making some part of your regular expression optional (e.g., skip leading zeros or “www” before a URL).
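
As a rough illustration of optional parts, the sketch below uses Python's re module (an assumption; it is not Aba itself). To make the whole of “www.” optional it wraps it in a group, (?:www\.), which is syntax not covered in this topic, and example.com is only a placeholder domain:

```python
import re

# "s" is optional, and so is the whole "www." prefix (grouped so that ? applies to all of it).
optional_www = re.compile(r"https?://(?:www\.)?example\.com")

urls = ["http://example.com", "https://www.example.com", "ftp://example.com"]
results = [bool(optional_www.search(u)) for u in urls]
print(results)  # [True, True, False]

# An optional leading zero: matches both "7" and "07".
leading_zero = re.compile(r"0?7")
print(bool(leading_zero.fullmatch("7")), bool(leading_zero.fullmatch("07")))  # True True
```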

Here are more complex examples:

\w+ \d+ A word followed by a number (e.g., step 11)
http://[a-z0-9./_&=%?~#-]+ HTTP links (will not match international characters)
<a href="https?://[^"]+"> A link to an external site (HTTP or HTTPS protocol)

https? means “http or https” (the letter “s” is optional).
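
To see what these three patterns pick up, here is a sketch in Python's re module (an assumption for illustration; results in Aba should be similar but are not verified here). The sample text and its URLs are invented:

```python
import re

sample = ('See step 11, then visit http://example.com/a?b=1 or '
          '<a href="https://example.org/x">this page</a>.')

word_number = re.findall(r"\w+ \d+", sample)
http_links = re.findall(r"http://[a-z0-9./_&=%?~#-]+", sample)
anchors = re.findall(r'<a href="https?://[^"]+">', sample)

print(word_number)  # ['step 11']
print(http_links)   # ['http://example.com/a?b=1'] (the https link is not matched)
print(anchors)      # ['<a href="https://example.org/x">']
```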

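The last two examples show two possible approaches to using repetitions: listing the characters that are allowed to repeat (as in [a-z0-9./_&=%?~#-]+), or matching any character except the one that ends the text you are looking for (as in [^"]+).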

When using the first approach, you need to specify every allowable character. There may be a lot of them; for example, URLs may include not only English letters, but also letters with diacritic marks, Cyrillic and Greek letters, Chinese ideographs, etc. If you don't include them, your regular expression will not work for international URLs.

When using the second approach, you sometimes have a different problem: the regular expression may capture more characters than needed, because it stops only at the terminating character. Choose one of these approaches depending on the kind of text that you are working with.
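
Both problems can be demonstrated with a short sketch in Python's re module (an assumption for illustration, not Aba itself). The names ALLOWED and UNTIL_QUOTE and the sample strings are hypothetical; \u00e9 is the letter é:

```python
import re

ALLOWED = r'http://[a-z0-9./_&=%?~#-]+'  # first approach: list the allowed characters
UNTIL_QUOTE = r'http://[^"]+'            # second approach: everything up to a quote

html = '<a href="http://example.com/caf\u00e9">menu</a>'
plain = 'Visit http://example.com/caf\u00e9, then read the "notes" page.'

# First approach: the match is cut short at the unlisted letter.
truncated = re.search(ALLOWED, html).group()
print(truncated)  # http://example.com/caf

# Second approach: the whole URL is found inside the quoted attribute...
complete = re.search(UNTIL_QUOTE, html).group()
print(complete)

# ...but in plain text it runs on until the next quote character.
overrun = re.search(UNTIL_QUOTE, plain).group()
print(overrun)
```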

All repetition operators are greedy; that is, they capture as many repetitions as possible.
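
A small sketch of greedy matching, written with Python's re module as an assumption (its * and + are also greedy by default). The sample text is invented; note how .+ runs to the last closing quote on the line, while [^"]+ stops at the first one:

```python
import re

text = '<a href="http://example.com/one"> and <a href="http://example.com/two">'

# Greedy .+ takes as much as it can, so a single match spans both links.
greedy = re.search(r'<a href=".+">', text).group()
print(greedy == text)  # True

# Excluding the quote from the repeated characters keeps each match inside one link.
bounded = re.findall(r'<a href="[^"]+">', text)
print(len(bounded))  # 2
```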

