Searching with Regex
If you have a very specific or complex search to perform, and complex searches don’t give the right results, then you can try using Wildcards or Regex. These tools are much more complex, and the learning curve is steep, but they allow you to perform extremely precise searches. For example, you can search for word repetitions, word chunks or even character sequences of a certain size.
Wildcard Expressions
Wildcard expressions allow you to mask individual characters or sequences of characters inside words. Say, to switch examples, you wish to find all occurrences of “test” and of “text”. You could search for text test (or text OR test), and this would give you what you wanted, but you can also use the wildcard ? and search for te?t instead. This will find all words that consist of the two characters “te”, plus one character which can be anything, followed by a final “t”. It would therefore also find “teat”.
In addition to ?, which requires exactly one character, you can also use *, which stands for any sequence of zero or more characters. If you search for te*t, you are therefore likely to find many more words, among them “test” and “text”, but also “tempest”, “testament” and “tent”.
You do not have to signal in any way that you are performing a wildcard search: just including ? or * is enough.
A wildcard character cannot occur first in a search expression, so you cannot find “Hamlet” with ?amlet or *let. If you wish to perform searches of this kind, you should use a regular expression.
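If you want to get a feel for the two wildcards before trying them in the search box, the following Python sketch imitates them. It is only an approximation: it uses Python's fnmatch module as a stand-in for the search engine, and the word list and the helper name wildcard_hits are made up for this illustration.

```python
from fnmatch import fnmatchcase

# Made-up word list standing in for the indexed text.
words = ["test", "text", "teat", "tent", "tempest", "testament", "tea"]

def wildcard_hits(pattern, terms):
    """Return the terms matched by a ?/* wildcard pattern (hypothetical helper)."""
    # fnmatchcase gives ? and * the same meaning as described above:
    # ? is exactly one character, * is zero or more characters.
    return [t for t in terms if fnmatchcase(t, pattern)]

print(wildcard_hits("te?t", words))  # ['test', 'text', 'teat', 'tent']
print(wildcard_hits("te*t", words))  # everything except 'tea'
```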
Regular Expressions
Regular Expressions are also known as “regex” or “regexp”. They are a very powerful tool for searching text (and for replacing text, but this is not relevant here). Lucene only supports a restricted range of regex operators, but they should be sufficient for most uses.
You put the search engine into regex mode by enclosing your search term in slashes. A regular expression has to match a whole word, not just part of it. So you would search, for example, for /.{3}let/ to find “Hamlet” (but also, e.g., “fillet”).
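The example above can be tried out in Python. This is a sketch under assumptions, not the search engine itself: it assumes that the expression has to cover the whole word, which re.fullmatch imitates, and it uses Python's regex dialect as a stand-in for Lucene's. The helper name regex_hits and the word list are invented for the illustration.

```python
import re

def regex_hits(pattern, terms):
    """Return the terms that the pattern matches in full (hypothetical helper)."""
    # re.fullmatch emulates whole-word matching; the slashes used in the
    # search box are not part of the pattern itself.
    return [t for t in terms if re.fullmatch(pattern, t)]

words = ["Hamlet", "fillet", "leaflet", "let"]
print(regex_hits(r".{3}let", words))  # ['Hamlet', 'fillet']
```

“leaflet” and “let” are not returned, because exactly three characters must precede “let”.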
Match any character
The period . can be used to represent any single character (this is the same as the ? wildcard).
In order to retrieve the string “snake”, the following expressions can be used:
/s.ake/
/.nak./
One-or-more
The plus sign + can be used to repeat the preceding shortest pattern one or more times.
In order to retrieve the string “deer”, the following expression can be used:
/de+r/
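The period and the plus sign can be checked with a few lines of Python. This is a hedged sketch: Python's re module stands in for Lucene, re.fullmatch is used on the assumption that the expression must cover the whole word, and the word lists are made up.

```python
import re

def matches(pattern, terms):
    """Return the terms fully matched by the pattern (hypothetical helper)."""
    return [t for t in terms if re.fullmatch(pattern, t)]

# The period stands for exactly one character.
print(matches(r"s.ake", ["snake", "shake", "stake", "sake"]))  # ['snake', 'shake', 'stake']

# The plus sign repeats the preceding "e" one or more times.
print(matches(r"de+r", ["dr", "der", "deer", "deeer", "dear"]))  # ['der', 'deer', 'deeer']
```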
Zero-or-more
The asterisk * can be used to match the preceding shortest pattern zero or more times. Note that this applies to what comes before the asterisk, whereas the wildcard * stands for characters in itself (a wildcard * amounts to a regex .*). Because the regex asterisk only repeats the preceding “e”, words such as “welcomed” and “westward” are not matched by the expression below (that would require /we.*d/).
In order to retrieve both the strings “weed” and “wed”, the following expression can be used:
/we*d/
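The difference between the regex asterisk and the wildcard asterisk is easy to get wrong, so here is a small Python comparison. It is a sketch only: re.fullmatch and fnmatchcase are stand-ins for the regex and wildcard modes of the search engine, and the word list is invented.

```python
import re
from fnmatch import fnmatchcase

words = ["wd", "wed", "weed", "welcomed", "westward"]

# Regex asterisk: repeats only the preceding "e".
regex_star = [w for w in words if re.fullmatch(r"we*d", w)]

# Regex period plus asterisk: any characters between "we" and "d".
regex_dot_star = [w for w in words if re.fullmatch(r"we.*d", w)]

# Wildcard asterisk: behaves like the regex ".*".
wildcard_star = [w for w in words if fnmatchcase(w, "we*d")]

print(regex_star)      # ['wd', 'wed', 'weed']
print(regex_dot_star)  # ['wed', 'weed', 'welcomed', 'westward']
print(wildcard_star)   # ['wed', 'weed', 'welcomed', 'westward']
```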
Zero-or-one
The question mark ? makes the preceding shortest pattern optional. It matches zero or one times. Note that in Lucene wildcard searches, ? stands for a character in itself; in regex searches the question mark quantifies the immediately preceding character (or pattern).
In order to retrieve the strings “weed” and “wed”, the following expression can be used:
/wee?d/
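The two meanings of the question mark can be put side by side in Python. As before, this is an illustrative sketch: re.fullmatch imitates regex mode, fnmatchcase imitates wildcard mode, and the word list is made up.

```python
import re
from fnmatch import fnmatchcase

words = ["wd", "wed", "weed", "weeed"]

# Regex mode: the second "e" is optional.
regex_hits = [w for w in words if re.fullmatch(r"wee?d", w)]

# Wildcard mode: ? is a character of its own, so five characters are required.
wildcard_hits = [w for w in words if fnmatchcase(w, "wee?d")]

print(regex_hits)     # ['wed', 'weed']
print(wildcard_hits)  # ['weeed']
```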
Min-to-max
Curly brackets {} can be used to specify a minimum and (optionally) a maximum number of times the preceding shortest pattern can repeat. The allowed forms are:
- {5} repeat exactly 5 times
- {2,5} repeat at least twice and at most 5 times
- {2,} repeat at least twice
In order to retrieve the string “weed”, any of the following expressions can be used:
/we{2}d/
/we{2,}d/
/we{2,5}d/
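All three expressions find “weed”, but they differ in what else they accept. The following Python sketch makes that visible by counting how many times “e” may occur between “w” and “d”. It uses Python's re module as a stand-in for Lucene, and the helper name e_counts is hypothetical.

```python
import re

def e_counts(pattern, limit=8):
    """Return the numbers of 'e' (below limit) for which w + e...e + d matches."""
    return [n for n in range(limit) if re.fullmatch(pattern, "w" + "e" * n + "d")]

print(e_counts(r"we{2}d"))    # [2]
print(e_counts(r"we{2,}d"))   # [2, 3, 4, 5, 6, 7]
print(e_counts(r"we{2,5}d"))  # [2, 3, 4, 5]
```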
Grouping
Parentheses () can be used to form sub-patterns. The quantity operators listed above operate on the shortest previous pattern, which can be a group.
In order to retrieve the string “weed”, any of the following expressions can be used:
/w(..)+d/
/w(ee)*d/
/w(ee)?d/
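To see that the quantifier applies to the whole group rather than to a single character, you can run the three expressions against a few made-up strings in Python. This is a sketch with Python's re module standing in for Lucene; the helper name matches is hypothetical.

```python
import re

words = ["wd", "wed", "weed", "weeed", "weeeed"]

def matches(pattern):
    """Return the words fully matched by the pattern (hypothetical helper)."""
    return [w for w in words if re.fullmatch(pattern, w)]

print(matches(r"w(..)+d"))  # ['weed', 'weeeed']: one or more pairs of characters
print(matches(r"w(ee)*d"))  # ['wd', 'weed', 'weeeed']: zero or more "ee"
print(matches(r"w(ee)?d"))  # ['wd', 'weed']: "ee" is optional as a unit
```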
Alternation
The pipe symbol | acts as an OR operator. The match will succeed if the pattern on either the left-hand side or the right-hand side matches. This is of course equivalent to the OR operator in standard Lucene syntax.
In order to retrieve the strings “proportions” and “preparations”, the following expression can be used:
/(prepara|propor)tions/
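The parentheses matter here, because they limit how far the alternation reaches. The Python sketch below shows the expression with and without them. It relies on Python's re module as a stand-in for Lucene and on a made-up word list, so treat it as an illustration rather than a description of the search engine.

```python
import re

words = ["preparations", "proportions", "portions", "prepara"]

# With the group, the two alternatives share the ending "tions".
grouped = [w for w in words if re.fullmatch(r"(prepara|propor)tions", w)]

# Without the group, the alternatives are "prepara" and "proportions".
ungrouped = [w for w in words if re.fullmatch(r"prepara|proportions", w)]

print(grouped)    # ['preparations', 'proportions']
print(ungrouped)  # ['proportions', 'prepara']
```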
Character classes
Character classes are very important, since they allow you to mask variation with more control than that offered by wildcards. You can thus use them to find words even though they are spelled differently, e.g. have either “e” or “o” in a certain position, or either “a” or “e” in a certain position.
Ranges of potential characters may be represented as character classes by enclosing them in square brackets []. A leading caret ^ negates the character class, that is, it matches any character other than the ones listed.
The allowed forms are:
- [abc] ‘a’ or ‘b’ or ‘c’
- [a-c] ‘a’ to ‘c’, i.e. ‘a’ or ‘b’ or ‘c’
- [-abc] ‘-’ or ‘a’ or ‘b’ or ‘c’
- [abc-] ‘a’ or ‘b’ or ‘c’ or ‘-’
- [^abc] any character except ‘a’ or ‘b’ or ‘c’
- [^a-c] any character except ‘a’ or ‘b’ or ‘c’
- [^-abc] any character except ‘-’ or ‘a’ or ‘b’ or ‘c’
Note that the dash - indicates a range of characters, unless it is the first or last character in the class or is escaped with a backslash. The caret ^ negates the class only when it is the first character inside the brackets.
In order to retrieve the string “weed”, any of the following expressions could be used:
/w[uiaeo]+d/
/w[uiaeo]*d/
/we[uiaeo]?d/
/w[a-u]*ed/
/we[^uiao]d/
The possibilities are endless.
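If you are unsure what a character class accepts, you can test it against single characters. The Python sketch below does this for the forms listed above and then checks the five “weed” expressions. Python's re module is used as a stand-in for Lucene, and the helper name class_hits is hypothetical.

```python
import re

chars = ["a", "b", "c", "d", "-"]

def class_hits(char_class):
    """Return the single characters accepted by a character class."""
    return [ch for ch in chars if re.fullmatch(char_class, ch)]

print(class_hits(r"[a-c]"))    # ['a', 'b', 'c']
print(class_hits(r"[-abc]"))   # ['a', 'b', 'c', '-']: a leading dash is literal
print(class_hits(r"[^abc]"))   # ['d', '-']
print(class_hits(r"[^-abc]"))  # ['d']

# Each of the five expressions from the text should find "weed".
weed_patterns = [r"w[uiaeo]+d", r"w[uiaeo]*d", r"we[uiaeo]?d", r"w[a-u]*ed", r"we[^uiao]d"]
print(all(re.fullmatch(p, "weed") for p in weed_patterns))  # True
```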
There are plenty of regex tutorials. A good one can be found at regexlearn.com.
The exact definition of the regex possibilities in Lucene can be found at lucene.apache.org.