Find & Replace It!
version 0.6.0

User Manual
Last update: August 2009
Table of Contents
1 About Find & Replace It!
1.1 Summary
1.2 Main features
1.3 Supported platforms
1.4 Support and services
1.5 Known issues and limitations
2 Product activation
3 Getting started
3.1 Selecting files to search/replace in
3.2 Doing a backup of files before modification
3.3 Detecting and selecting files encoding
3.4 Converting files encoding
3.5 Searching for an expression
3.6 Replacing found expressions with fixed string
3.7 Using the 'Find and Replace Preview'
3.8 Using the 'Regular Expression Editor'
3.9 Advanced replacements
3.9.1 Using captured texts within replacement string
3.9.2 Processing captured texts
3.9.3 Processing replacement pattern on the fly by script
3.10 Using the output
4 Tips and tricks
4.1 Saving your work
4.2 Multi-document tabs
4.3 Working with text areas
4.3.1 Editing text
4.3.2 Undo/redo changes
4.3.3 Changing display
4.3.4 Searching for text
4.4 Multi file selection in the found files list
4.5 Get examples
4.6 Debugging script
5 Regular expressions
5.1 Introduction
5.2 Characters and abbreviations for sets of characters
5.3 Sets of characters
5.4 Quantifiers
5.5 Capturing text
5.6 Assertions
5.7 Wildcard matching
5.8 Notes for Perl users
5.9 Examples
Find & Replace It! is a powerful, cross-platform text file processor. It allows you to perform very complex batch replacements inside text files of any size. It supports regular expression syntax and dozens of encodings. It can also batch-convert the encoding of files, as well as their end-of-line style.
The list below describes some of the most important characteristics of Find & Replace It!:
Can find fixed string expressions as well as wildcard and regular expressions
Allows multi-line matching
Handles text encoding while displaying, reading and writing file contents, including multi-byte encodings like Unicode
Preserves line endings while processing files
Preserves BOM while processing Unicode files
Allows you to perform dynamic replacements based on found expression captures
Provides built-in processing functions for dynamic replacements (e.g. convert captured expressions to lower case, Base64 encoding, Hex encoding, UTF-8 encoding, etc.)
Provides a JavaScript-like interface to customize replacements on the fly by script processing
Displays matched expressions reports for file search/replace operations
Full featured dynamic preview of matched expressions and replacements
Provides tools for converting text encoding
Provides tools for converting line endings (Windows, Unix, Macintosh, Unicode)
Detects text encoding and line endings of files
Provides advanced filtering options for selecting files that need to be processed, including file name filters and file path exclusion filters
Allows you to load and save expressions to find, replacement definitions and file filters
Handles huge files (> 10 GB)
Regular expression editor
Fully multithreaded for fast processing and responsiveness
Allows you to cancel long operations
GUI is totally modular
Cross-platform: Windows, Mac OS X and Linux
Here is a list of supported platforms:
MS Windows:
Windows 98 [not tested]
Windows NT 4.0 [not tested]
Windows 2000 [not tested]
Windows XP
Windows Vista
Mac OS X Intel:
Mac OS X 10.3.9 and up [not tested]
Mac OS X 10.4.x
Mac OS X 10.5.x
Mac OS X PPC:
Mac OS X 10.3.9 and up [not tested]
Mac OS X 10.4.x [not tested]
Mac OS X 10.5.x
Linux:
Ubuntu 8.04 (32 bits)
Ubuntu 9.04 (32 bits)
Fedora 11 (32 bits)
openSUSE 11.0 (32 bits)
Important notes about platform support:
The platforms marked [not tested] in the list are platforms on which the software should work, but without warranty. All development tools used for this software are certified on these platforms, but we did not test them ourselves. Most of these platforms are fairly old, which is why they are not actively supported.
The Linux installer requires the following commands: bzip2, lzma, xdg-mime, xdg-desktop-menu, xdg-icon-resource. If one of them is missing, the installation will stop and you might have to manually remove the installation folder.
For general information, please visit our website at: http://www.dprog.ch
If you have any questions about pricing and/or license terms, don't hesitate to write to us at: order@dprog.ch
All support requests regarding software usage as well as general questions about demo version must be addressed to: support@dprog.ch
Please note that support is only available to registered customers who have a valid license for their software copies. Moreover we kindly request our customers to use the online form for posting support requests. This form is accessible through: http://www.dprog.ch/home/index.php/support
Finally, you may want to read the online Terms of Use statement for details about services provided by dProg – Philippe Docourt.
The text-encoding detection uses a simple heuristic. It is reliable for Unicode encodings (i.e.: UTF-8, UTF-16, UTF-32) but does not always give accurate results for other encodings.
For performance reasons, the Find and Replace Preview has a content limit of 5*1024*1024 (5,242,880) characters.
For performance reasons, it is not possible to search for an expression longer than 5*1024*1024 (5,242,880) characters.
The HTML Viewer component might crash under Mac OS X when unloading Flash content, if the Enable plugins option is marked. This problem does not occur under Windows and Linux.
The product activation requires an Internet connection.
Every time you start the application a dialog window will ask you to activate your copy of Find & Replace It!. This activation is necessary to access all features of the software and requires an Internet connection.
The activation window shows you an expiry date for activating the software. After this date, the software will no longer start without being activated first. Until this date, you can simply refuse the activation by pressing Cancel and use the software in demo mode. Once the product has been activated, the activation dialog no longer shows up at startup.

When buying your copy of Find & Replace It! from http://www.dprog.ch/home/index.php/buy?product_id=1 you should receive an activation key by e-mail. To activate the product, follow these steps:
Enter your login for www.dprog.ch website;
Enter your e-mail (the one you used to register on www.dprog.ch and where you received the activation key);
Type or paste your activation key in the appropriate field;
Press OK and wait for the answer. If the activation succeeds, the activation window closes automatically. Otherwise, try again later. If the problem persists, contact our support.
Note: You cannot close the activation window with OK unless you successfully activate your copy of the software. To close the window without activation, press Cancel.
You can activate or check your activation key at any time through the Help/Activate product... menu as shown below:

There are two sections that work together for searching and selecting files. They are Search scope and Found files.

Select a folder to search in. This is the root path where you want to scan for files. You can type a path directly in the Search in text field or use the button on the right.
Choose to search files recursively into sub folders or not by toggling the Search in sub-directories check mark.
Enter zero, one or many file name filters within the File name filter field. These filters interpret wildcard characters like '*'. They must be comma separated.
Optionally add one or more expressions to exclude some file paths when searching for files. This can be achieved with the button. These filters can use wildcard or regular expression syntax. Note that all file paths are described with the '/' separator, whatever the platform or system locale is.
Select files that you want to process in the found files list. Unmarked files are not going to be read or touched. The content of this list is updated whenever you change search options. You can filter the content of this list through the file path filter above the list view. Click the column header to sort that column.
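The way a comma-separated list of wildcard filters selects file names can be sketched in JavaScript. This is only an illustration, not the application's implementation: the function names are hypothetical, only '*' and '?' are assumed to be wildcards, and an empty filter list is assumed to accept every file.

```javascript
// Convert one wildcard filter (e.g. "*.txt") into an anchored RegExp.
// Assumption: only '*' (any run of characters) and '?' (any single
// character) are wildcards; every other character is literal.
function wildcardToRegExp(filter) {
  const escaped = filter.replace(/[.+^${}()|[\]\\]/g, "\\$&");
  const pattern = escaped.replace(/\*/g, ".*").replace(/\?/g, ".");
  return new RegExp("^" + pattern + "$");
}

// Split a comma-separated filter list and test a file name against it.
// Assumption: an empty filter list accepts every file name.
function matchesAnyFilter(fileName, filterList) {
  const filters = filterList
    .split(",")
    .map((f) => f.trim())
    .filter((f) => f.length > 0);
  if (filters.length === 0) return true;
  return filters.some((f) => wildcardToRegExp(f).test(fileName));
}

console.log(matchesAnyFilter("notes.txt", "*.txt, *.csv")); // true
console.log(matchesAnyFilter("image.png", "*.txt, *.csv")); // false
```

Anchoring each pattern with ^ and $ makes a filter apply to the whole file name rather than to a fragment of it.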
The File backup section allows you to create a backup of each modified file.
Mark the backup check box as shown above;
Type a suffix for your backup files.
The Found files section shows the files found according to your current file search options. The Encoding column is the only one that is editable by the user. You can choose the appropriate codec for each file by hand with a double-click on the appropriate line. This displays a drop-down list of available codecs:

The “best codec” for processing the text encoding of each file is detected using a simple heuristic and is selected by default. Please note that this heuristic is only reliable for detecting Unicode encodings. For other encodings it only gives you some suggestions. There may be many “acceptable codecs”; they are all marked with a blue light on the left side of the drop-down list. This is shown below:

If the codec name is set to unknown at a given row, that means that no codec seems to match the associated file. When one or more codecs are detected as acceptable, the preferred text encoding is selected by default when it is available. The acceptable codec list is automatically determined whenever a new file is displayed in the list, but not (yet) when files are changed. To refresh the encoding detection, click on the corresponding button. This will detect the preferred codecs for all files.
If you need to look at the decoded content of a file, right-click on a file entry, then click on Open file in test preview. This will allow you to play with the codec used to decode the file.
Note: There is a special codec named System. This encoding varies with the current locale of your system. When not working with Unicode, this codec often appears to be a good default value for unknown encodings.
For converting the text encoding of a given set of files, follow these steps:
Select the files you want to convert with a check mark in the File path column, within the Found files section;
Select the current text encoding of these files if the auto-detected encodings are not accurate;
Select the target encoding for your set of files;
Optionally, you can select the Generate Byte Order Mark (BOM) check box. This inserts the BOM at the beginning of the file when it is written. This option only applies to Unicode text encodings: UTF-8, UTF-16 and UTF-32. Note that this option may interfere with the target encoding. For instance, if you choose a Unicode encoding that does not allow the BOM, the target encoding is changed to the closest Unicode encoding that allows it.
Optionally you might schedule a backup of modified files;
Click on the button to start the encoding conversion. If you want to stop the conversion process, click on the button.
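To make the Byte Order Mark option more concrete, here is a minimal JavaScript sketch of the byte sequences that form the BOM for each Unicode encoding. The byte values are the standard ones; the names BOMS, detectBom and withBom are hypothetical and do not describe how Find & Replace It! works internally.

```javascript
// Byte Order Marks for the Unicode encodings.
// Longer marks come first: the UTF-32LE mark starts with the UTF-16LE one.
const BOMS = [
  { encoding: "UTF-32BE", bytes: [0x00, 0x00, 0xfe, 0xff] },
  { encoding: "UTF-32LE", bytes: [0xff, 0xfe, 0x00, 0x00] },
  { encoding: "UTF-8", bytes: [0xef, 0xbb, 0xbf] },
  { encoding: "UTF-16BE", bytes: [0xfe, 0xff] },
  { encoding: "UTF-16LE", bytes: [0xff, 0xfe] },
];

// Return the encoding announced by a leading BOM, or null if there is none.
function detectBom(bytes) {
  for (const bom of BOMS) {
    if (bom.bytes.every((b, i) => bytes[i] === b)) return bom.encoding;
  }
  return null;
}

// Prepend the BOM of the given encoding to already-encoded content.
function withBom(encoding, contentBytes) {
  const bom = BOMS.find((b) => b.encoding === encoding);
  if (!bom) throw new Error("No BOM defined for " + encoding);
  return [...bom.bytes, ...contentBytes];
}

console.log(detectBom([0xef, 0xbb, 0xbf, 0x41])); // UTF-8
console.log(withBom("UTF-16LE", [0x41, 0x00])); // [ 255, 254, 65, 0 ]
```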
The Expression to find section allows you to set up an expression to search for:

Enter the expression to search for in the Find text field. Alternatively, you can use the button to edit your expression with the Regular Expression Editor.
You can choose the way your expression must be interpreted through the Syntax drop down list:
Simple text or fixed string: means that the pattern to be matched is interpreted as a plain string
Wildcard: is similar to the functionality found in command shells
Regular expression: is a pattern for matching substrings in a text
Select the options that apply when matching against your expression:
The 'Minimal match' option is only available when the 'Regular expression' syntax is set. It switches the quantifiers to non-greedy mode.
Select the files you want to scan with a check mark in the File path column;
Select the current text encoding of these files if the auto-detected encodings are not accurate;
Optionally you can test your expression with the Find and replace preview;
Click on the button to start searching for your expression in the selected files. If you need to stop the search, click again on the same button, which changes its icon once the search has been started.
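The effect of non-greedy matching can be illustrated with plain JavaScript regular expressions. Note the assumption: in JavaScript, laziness is requested per quantifier with a trailing '?', whereas the 'Minimal match' option is described above as applying to the quantifiers of the whole expression. The sketch only shows the difference in what gets matched.

```javascript
const text = "<b>one</b> and <b>two</b>";

// Greedy quantifier: ".+" grabs as much text as possible.
const greedy = text.match(/<b>.+<\/b>/)[0];

// Non-greedy (minimal) quantifier: ".+?" stops at the first possible end.
const minimal = text.match(/<b>.+?<\/b>/)[0];

console.log(greedy);  // <b>one</b> and <b>two</b>
console.log(minimal); // <b>one</b>
```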
Set up the expression you want to search for;
Type your replacement pattern for matched occurrences of your expression in the Replace with text field;
Optionally you might schedule a backup of modified files;
Click on the button to start replacing your expression in the selected files. If you need to stop the replacement, click again on the same button, which changes its icon once the replacement has been started.
When editing an expression to search for, it is convenient to match it against real data. In order to achieve this, let's take a tour of the Find and Replace Preview window. This is a multi-document editor with powerful features that allows you to:
check the impact of a given text encoding when applied to a file content;
edit a text sample against which you want to match an expression to find;
preview matched occurrences of an expression to find inside a given text sample;
preview resulting content of a text sample after the replacement of all occurrences of your expression with your replacement pattern;
preview both found expressions and processed replacements inside a text sample;
navigate through found occurrences of a specific expression or replacement pattern;
preview HTML documents.
The following screenshots illustrate some of the capabilities described before:

Highlight of matched expressions within a file. The tool-tip shows information about occurrence location.

Highlight of replaced expressions within a file.

Highlight of both found and replaced expressions within a file. The tool-tip shows information about replacement location.
Note that all line breaks in the preview are internally represented by a Line Feed character (LF, U+000A). This is always true, whatever the original end-of-line style used in the displayed file. If you want to search for a multi-line expression with another style of line break, we strongly advise you to use a regular expression with appropriate \s+ sequences in order to match any kind of end-of-line.
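The advice about \s+ can be checked with a short JavaScript sketch. The sample strings are made up for illustration; the point is that \s+ matches a line break in any of the three usual styles, while a literal Line Feed only matches one of them.

```javascript
// The same two-line sample with three different end-of-line styles.
const samples = {
  unix: "first line\nsecond line",
  windows: "first line\r\nsecond line",
  oldMac: "first line\rsecond line",
};

// "\s+" matches any run of whitespace, including CR, LF and CR+LF.
const flexible = /first line\s+second line/;

// A literal "\n" only matches the Line Feed style.
const lfOnly = /first line\nsecond line/;

const results = {};
for (const [name, text] of Object.entries(samples)) {
  results[name] = { flexible: flexible.test(text), lfOnly: lfOnly.test(text) };
}
console.log(results);
```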
In addition to the plain text preview, you can activate the HTML Viewer through the button. That will enable you to preview HTML documents, with either their original or altered content, without having to save them.

The HTML preview in action with our own website.
The viewer is directly synchronized with the current content of Find and Replace Preview. That means the Display mode will also apply on the HTML content. Because the viewer cannot resolve relative links in the HTML document from the content of the preview, you might need to enter an appropriate URL. This URL is used to process all resources referenced by relative links within the document (i.e.: CSS, images, scripts, etc.). This is not required when all resources are given with absolute path.
The HTML Viewer provides rendering of HyperText Markup Language (HTML), Extensible HyperText Markup Language (XHTML) and Scalable Vector Graphics (SVG) documents, styled using Cascading Style Sheets (CSS) and scripted with JavaScript. Some common plugins are also supported through the Netscape Plugin API, provided you have appropriate binary files for those plugins installed on your computer. The following locations are searched for plugins:
| Platform |
|---|
| Linux/Unix |
| Windows |
| Mac OS X |
When you need to write a multi-line expression to search for, the Regular Expression Editor is your best friend. Multi-line editing is shown below with a simple text expression:

Note that all line breaks are internally represented by a Line Feed character (LF, U+000A). If you want to search for a multi-line expression with another style of line break, we strongly advise you to use a regular expression with appropriate '\s+' sequences in order to match any kind of end-of-line.
In addition, this editor simplifies the setup of regular expressions. It provides tools to manage regular expression entities. On the right side of the text editor, there is a list of available regular expression entities (e.g.: special characters, grouping expressions, etc.). If you leave the mouse over an item of this list, an explanatory tooltip appears. This is shown in the figure below:

The Regular Expression Editor has some nice features like syntax highlighting and scope matching (e.g.: matching scope for '()', '[]', '{}').
The Regular Expression Editor offers an automatic syntax check for wildcard and regular expressions:

As soon as your wildcard or regular expression pattern becomes invalid, it is underlined. A tooltip provides a brief description of the syntax error detected.
Advanced replacement covers three main features that make Find & Replace It! really powerful:
Injecting a fragment of the matched expression into the replacement text;
Transforming a fragment of the matched expression before injecting it into the replacement text;
Interacting with the replacement text through a JavaScript interface.
Each of these points is described in the following chapters.
This feature requires regular expression syntax for the expression to find; furthermore, you should be familiar with captures within regular expressions. To learn more about these notions we recommend reading the Regular expressions chapter. If you are familiar with regexps, read on for the following example.
Whenever you capture some text fragments with an expression to find, you can inject these captured fragments into your replacement pattern. This is done with a %1, %2, …, %9 pattern, where the number that follows the percent sign is the capture index. %0 is a special, implicit capture that includes the full matched expression.
Let's imagine we have a CSV file containing contacts like in the following snippet:
First Name: John; Family Name: Smith; Phone: ...
First Name: Mike; Family Name: Dupont; Phone: ...
We want to swap the first two columns. Here we have to capture two variable expressions (first name and family name) and move them around. Here is an easy way to do it.
Find:
(First Name: [^;]+); (Family Name: [^;]+)
The parentheses in the expression above capture two fragments of every matched occurrence in the CSV file.
Replace with:
%2; %1
The replacement pattern above is a dynamic text that varies for every matched occurrence. In fact, %1 is replaced by the content matched by the first pair of parentheses. The same applies to %2 and the second pair of parentheses.
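The placeholder mechanism can be emulated in a few lines of JavaScript. This is a sketch under stated assumptions, not the application's code: expandPlaceholders and replaceAll are hypothetical helpers, a placeholder with no matching capture is assumed to be left untouched, and the find expression uses a single space between the two fields, as in the sample data.

```javascript
// Expand %0..%9 placeholders in a replacement pattern.
// capturedTexts[0] is the full match, capturedTexts[1] the first capture, etc.
function expandPlaceholders(pattern, capturedTexts) {
  return pattern.replace(/%(\d)/g, (placeholder, digit) => {
    const captured = capturedTexts[Number(digit)];
    // Assumption: a placeholder without a matching capture is left as-is.
    return captured === undefined ? placeholder : captured;
  });
}

// Apply a find expression and a %N replacement pattern to a whole text.
function replaceAll(text, findExpression, pattern) {
  const globalExpression = new RegExp(findExpression.source, "g");
  return text.replace(globalExpression, (...args) => {
    // args = [match, capture1, ..., offset, string]; keep match and captures.
    const capturedTexts = args.slice(0, args.length - 2);
    return expandPlaceholders(pattern, capturedTexts);
  });
}

const line = "First Name: John; Family Name: Smith; Phone: ...";
const swapped = replaceAll(
  line,
  /(First Name: [^;]+); (Family Name: [^;]+)/,
  "%2; %1"
);
console.log(swapped); // Family Name: Smith; First Name: John; Phone: ...
```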
This feature requires regular expression syntax for the expression to find; furthermore, you should be familiar with captures within regular expressions. To learn more about these notions we recommend reading the Regular expressions chapter. If you are familiar with regexps, read on for the following example.
As shown in the previous chapter, it is possible to inject captured texts into your final replacement string through a special syntax. It is also possible to apply additional processing to captured strings before injecting them into the replacement expression. This can be handled with the Capture processing section:

Capture #1 and #2 are available but not capture #3. A distinct process has been attached to each capture, however it is not compulsory.
The left column is not editable. The check mark is toggled depending on the presence of captures within the expression to find and of placeholders within the replacement pattern. The right column lets you choose a transformation to apply to the capture before injecting it into the replacement text at its placeholder location.
Let's imagine that we want to upper case the first letter that follows a ':' sign inside a file. A tedious solution might be to replace all : a with : A, : b with : B and so on. This will take some time. And then, what happens with accented letters or oriental characters? What happens if a tab sometimes replaces the whitespace after the ':' sign? What if there is no whitespace at all, or many whitespaces due to a typing mistake? This solution is definitely inappropriate.
Here is a better way to handle this task:
Find:
:\s*(\w)
The expression above will match every ':' followed by any number of whitespace characters (including tabs and line breaks) and then a word character.
In the Capture processing section, select To upper case as process for the first capture. This will transform to upper case the content matched by the first pair of parentheses, before injecting it as %1 in the replacement pattern.
Finally we replace with:
: %1
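For comparison, the same task can be sketched in plain JavaScript. This is not how the application processes captures; it only mirrors the find expression, the To upper case process and the ': %1' replacement. One assumption is labeled in the code: JavaScript's \w only covers ASCII, so a Unicode-aware character class stands in for it.

```javascript
const input = "name:john\ncity:   \u00e9vian\ncode:\t42";

// The manual's expression is ":\s*(\w)". Assumption: [\p{L}\p{N}_] with the
// "u" flag stands in for \w here, because JavaScript's \w only covers ASCII
// letters, digits and "_".
const findExpression = /:\s*([\p{L}\p{N}_])/gu;

// Equivalent of replacing with ": %1" after a "To upper case" process.
const output = input.replace(
  findExpression,
  (match, firstLetter) => ": " + firstLetter.toUpperCase()
);

console.log(output);
```

Whatever the amount or kind of whitespace after the ':' sign, every line ends up with exactly one space followed by an upper-cased first character.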
If the built-in capture processes are not sufficient, you might try the scripting interface.
Find & Replace It! provides a JavaScript-like interface to customize replacements on the fly by script. This is especially useful when you need some logic to interpret an expression and generate a replacement pattern based on it (e.g. find all numbers in a text and divide them by a given factor to convert units). To achieve this you only have to type your JavaScript in the script editor.
Your script will be called on three different occasions. All of them can be combined as desired:
Once with the original replacement pattern and the full matched expression;
Once for each captured text in your expression if the capture process has been set to Apply script;
Once at the very end of the process, after all capture processing, with the resulting replacement pattern.

A dummy script outputting invariable properties from the scripting context.
The scripting interface provides a simple way to access the current matched context as well as the replacement pattern through the global replaceCommand object. The table below summarizes the context information made available to the script:
| Properties of replaceCommand | Object or Data Type | Description |
|---|---|---|
| Invariable properties: these variables do not change during the processing of a given matched expression | | |
| findExpression | RegExp | Matched regular expression. |
| startOffset | Number | Starting offset of matched expression in the full text. |
| endOffset | Number | Ending offset of matched expression in the full text. |
| captureCount | Number | Number of captures contained in the expression to find. |
| capturedTexts | Array | Array of captured strings within the full matched expression. |
| Variable properties: these variables vary depending on the capture for which the script is called | | |
| captureIndex | Number | Index of the current capture for which the script has been called. This index lies between 0 and 'captureCount + 1', inclusive. |
| captureText | String | Captured text at current capture index. This property returns the full matched expression when captureIndex is zero and returns an empty string when captureIndex equals 'captureCount + 1'. Otherwise it returns the captured string at the given index starting from 1. |
| replacementLength | Number | Length of the current replacement text. This length takes into account already replaced placeholders (i.e. %i). |
| replacementText | String | Current replacement text. This text contains already replaced placeholders (i.e. %i). |
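To experiment with the call sequence outside the application, a small stand-in for the scripting context can be written in JavaScript. Everything here is an assumption made for illustration: runScript is a hypothetical harness, the script is passed as a function instead of top-level code, endOffset is taken as the offset just after the match, and the value returned for a capture is assumed to replace its %i placeholder. The real behavior of Find & Replace It! may differ.

```javascript
// Hypothetical harness: calls `script` the way the three occasions are
// described, for the first match of findExpression in text.
// scriptedCaptures lists the capture indexes set to "Apply script".
function runScript(script, findExpression, pattern, text, scriptedCaptures) {
  const match = findExpression.exec(text);
  if (match === null) return null;
  const captureCount = match.length - 1;
  const command = {
    findExpression,
    startOffset: match.index,
    endOffset: match.index + match[0].length, // assumption: exclusive end
    captureCount,
    capturedTexts: Array.from(match, (t) => t ?? ""),
    captureIndex: 0,
    captureText: match[0],
    replacementText: pattern,
    get replacementLength() {
      return this.replacementText.length;
    },
  };
  // Occasion 1: original replacement pattern and full matched expression.
  command.replacementText = String(script(command));
  // Occasion 2: once per capture whose process is "Apply script".
  for (let i = 1; i <= captureCount; i++) {
    command.captureIndex = i;
    command.captureText = command.capturedTexts[i];
    const processed = scriptedCaptures.includes(i)
      ? String(script(command))
      : command.captureText;
    command.replacementText = command.replacementText
      .split("%" + i)
      .join(processed);
  }
  // Occasion 3: final call with captureIndex equal to captureCount + 1.
  command.captureIndex = captureCount + 1;
  command.captureText = "";
  return String(script(command));
}

// Example script: convert capture 1 from millimeters to inches.
const toInches = (cmd) =>
  cmd.captureIndex === 1
    ? (Number(cmd.captureText) / 25.4).toFixed(2)
    : cmd.replacementText;

const result = runScript(toInches, /(-?(\d+)\.?(\d*))/, "%1", "width: 254 mm", [1]);
console.log(result); // 10.00
```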
Let's imagine we have a text file containing numerous numerical values. All of these values represent a length given in millimeters. We would like to convert all these distances from millimeters to inches. Sounds tricky to you? Script processing enables you to handle that task like any other replacement task. First we set up an expression that matches all numbers (integers and floating point):
(-?(\d+)\.?(\d*))
As a replacement we simply inject the full captured number:
%1
In order to apply a script to the replacement pattern, activate capture processing (i.e. the check mark), set the process to Apply script, then activate the script processing editor option as shown below:

Finally copy and paste the following script within the script editor:
// Function to convert millimeters to inches
function convertMillimetersToInches(valueInMillimeters) {
    return valueInMillimeters / 25.4;
}
// We want to output a custom replacement for the capture at index 1
if (replaceCommand.captureIndex == 1) {
    // Convert the numerical value captured at the current
    // index (i.e. 1) from millimeters to inches
    var convertedValue =
        convertMillimetersToInches(parseFloat(replaceCommand.captureText));
    // Get the number of integer digits to format the output
    var digitCount =
        replaceCommand.capturedTexts[2].toString().length;
    // Take into account that the converted value is more than ten times
    // smaller, therefore we subtract one digit to represent it
    // with the same accuracy
    digitCount -= 1;
    // Add the number of captured decimals (original decimal count)
    digitCount += replaceCommand.capturedTexts[3].toString().length;
    // toPrecision() needs at least one significant digit
    // (e.g. for a single-digit value such as "5")
    digitCount = Math.max(digitCount, 1);
    // Output the formatted value with an equivalent number of digits
    convertedValue.toPrecision(digitCount);
} else {
    // For the other indexes (i.e. 0, 2, 3, 4) we simply output
    // the current replacement text
    replaceCommand.replacementText;
}
That's it! As an alternative we might remove the outer parentheses, then use %0 as replacement pattern instead of %1 and adjust the script accordingly.
The Output window is a multi-document preview for find and replace reports. Every time you search for an expression within files, a report is written to the active console tab. The report includes links to files that contain searched/replaced expressions. A single click on such a link opens the file within Find and Replace Preview. A double-click opens the file with a suitable application.

The console showing a report for found expressions. The report gives you some statistics.
All sections containing this button support persistent serialization. Therefore, you can save an expression along with its replacement pattern and script, as well as all components for searching and filtering files. Of course the twin button enables you to load a previously saved file.
The File menu provides a way of saving and loading the full content of the interface. When saving, this is almost equivalent to concatenating all files generated by all buttons of all sub-windows.
Finally, if you open a fri file from a sub-window button (e.g.: Find and Replace Control Panel), only content related to this sub-window will be loaded. That means you can easily load a part of a file saved through the File/Save menu.
All saved files are automatically suffixed with fri. They are called UI files (i.e.: User Interface File) for Find & Replace It!.
Some components use tabs to display multiple documents. To manually open a new document/tab, use this button. To close a document/tab, click on the button located on the corresponding tab.
This is the list of key bindings which are implemented for editing:
| Keypresses | Action |
|---|---|
| Backspace | Deletes the character to the left of the cursor. |
| Delete | Deletes the character to the right of the cursor. |
| Ctrl+C | Copy the selected text to the clipboard. |
| Ctrl+Insert | Copy the selected text to the clipboard. |
| Ctrl+K | Deletes to the end of the line. |
| Ctrl+V | Pastes the clipboard text into text edit. |
| Shift+Insert | Pastes the clipboard text into text edit. |
| Ctrl+X | Deletes the selected text and copies it to the clipboard. |
| Shift+Delete | Deletes the selected text and copies it to the clipboard. |
| Ctrl+Z | Undoes the last operation. |
| Ctrl+Y | Redoes the last operation. |
| Left | Moves the cursor one character to the left. |
| Ctrl+Left | Moves the cursor one word to the left. |
| Right | Moves the cursor one character to the right. |
| Ctrl+Right | Moves the cursor one word to the right. |
| Up | Moves the cursor one line up. |
| Down | Moves the cursor one line down. |
| PageUp | Moves the cursor one page up. |
| PageDown | Moves the cursor one page down. |
| Home | Moves the cursor to the beginning of the line. |
| Ctrl+Home | Moves the cursor to the beginning of the text. |
| End | Moves the cursor to the end of the line. |
| Ctrl+End | Moves the cursor to the end of the text. |
| Alt+Wheel | Scrolls the page horizontally (the Wheel is the mouse wheel). |
To select (mark) text, hold down the Shift key whilst pressing one of the movement keystrokes. For example, Shift+Right will select the character to the right, and Shift+Ctrl+Right will select the word to the right, etc.
It is possible to undo and redo any change made in a text area when this area is editable. On the Edit tool bar or in the Edit menu, simply click on:
Undo (Ctrl+Z)
Redo (Ctrl+Y)
The commands located in the Text display menu let you change the appearance of text zone content:
Zoom out (Ctrl++)
Zoom in (Ctrl+-)
Select a font
All these commands are available for all text areas in the software, but they only apply to the last area that has been activated. Therefore you might have to click somewhere inside a text area to get a result.
It is possible to search for text within any text area of the graphical interface. Simply click on the corresponding button in the Edit tool bar or on the corresponding entry in the Edit menu. This displays the Find Text window:
The search will occur in the last area that has been activated. The background turns green when there is a match and red when there is no match.

To operate on many files at once, it is possible to use the context menu on the Found files list. This menu lets you act on the current selection. As shown below, possible actions on selected files are: toggling check marks, selecting encoding, loading files in the preview.
Applying an action on a selection of files.
Note that 'Open file in test preview' will open all selected files as distinct documents within Find and Replace Preview.
There are some sample files shipped with Find & Replace It!. Under Windows and Linux these files are located within the following directory:
On Mac OS X, these files are located in:
The files suffixed with fri are UI files (i.e.: User Interface File) for Find & Replace It!. Such files contain stored user interface data and can be opened with buttons.
The files suffixed with .txt are sample data provided for convenience, in order to easily test the capabilities of Find & Replace It!.
When scripting replacement texts, it is convenient to be able to debug the script. Find & Replace It! comes with an integrated debugger. To start the debugger, click on the Execute in debugger button located under the script editor. The debugger will show up:

Script debugger in action. On the right we can see the 'Locals' dock window which displays the current context provided by the 'replaceCommand' object.
A user manual for script debugger is available at http://doc.trolltech.com/4.5/qtscriptdebugger-manual.html
A regular expression, or "regexp", is a pattern for matching substrings in a text. This is useful in many contexts, e.g.:
| Searching | A regexp provides more powerful pattern matching than simple substring matching, e.g., match one of the words mail, letter or correspondence, but none of the words email, mailman, mailer, letterbox, etc. |
| Search and Replace | A regexp can replace all occurrences of a substring with a different substring, e.g., replace all occurrences of & with &amp; except where the & is already followed by an amp;. |
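Both uses can be tried out with JavaScript regular expressions, which are also modeled on Perl's syntax. The expressions and sample sentences below are only one possible way to write these examples; they are not taken from the application.

```javascript
// Searching: match the whole words only, not "email" or "mailman".
const wholeWords = /\b(mail|letter|correspondence)\b/;
console.log(wholeWords.test("please send a letter")); // true
console.log(wholeWords.test("the mailman has an email")); // false

// Search and replace: escape "&" unless it is already followed by "amp;".
const bareAmpersand = /&(?!amp;)/g;
const escaped = "Tom & Jerry &amp; friends".replace(bareAmpersand, "&amp;");
console.log(escaped); // Tom &amp; Jerry &amp; friends
```

The first expression relies on the \b word boundary assertion, the second on a negative lookahead; both kinds of assertion match a location rather than characters.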
The Find & Replace It! regexp is modeled on Perl's regexp language. It fully supports Unicode. The regexp can also be used in a simpler, Wildcard mode that is similar to the functionality found in command shells. The syntax rules used by regexp can be changed through the Syntax combo box. In particular, the pattern syntax can be set to Simple text, which means the pattern to be matched is interpreted as a plain string, i.e., special characters (e.g., backslash) are not escaped.
A good text on regexps is Mastering Regular Expressions (Third Edition) by Jeffrey E. F. Friedl, ISBN 0-596-52812-4.
Regexps are built up from expressions, quantifiers, and assertions. The simplest expression is a character, e.g. x or 5. An expression can also be a set of characters enclosed in square brackets. [ABCD] will match an A or a B or a C or a D. We can write this same expression as [A-D], and an expression to match any capital letter in the English alphabet is written as [A-Z].
A quantifier specifies the number of occurrences of an expression that must be matched. x{1,1} means match one and only one x. x{1,5} means match a sequence of x characters that contains at least one x but no more than five.
Note that in general regexps cannot be used to check for balanced brackets or tags. For example, a regexp can be written to match an opening HTML <b> and its closing </b> if the <b> tags are not nested, but if the <b> tags are nested, that same regexp will match an opening <b> tag with the wrong closing </b>. For the fragment <b>bold <b>bolder</b></b>, the first <b> would be matched with the first </b>, which is not correct. It is possible to write a regexp that matches nested brackets or tags correctly, but only if the number of nesting levels is fixed and known. If the number of nesting levels is not fixed and known, it is impossible to write a regexp that will not fail.
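The nesting problem can be reproduced with a small sketch. It uses Python's re module purely as a stand-in engine (an assumption: Find & Replace It! has its own regexp engine), and the pattern <b>.*?</b> is a hypothetical illustration that does not come from this manual.

```python
import re

# Hypothetical pattern: an opening <b>, as little text as possible, a closing </b>.
pattern = re.compile(r"<b>.*?</b>")

flat = "<b>bold</b> and <b>bolder</b>"
nested = "<b>bold <b>bolder</b></b>"

# Without nesting, every opening tag is paired with its own closing tag.
flat_matches = pattern.findall(flat)

# With nesting, the first <b> is paired with the first </b>, which is the wrong one.
nested_match = pattern.search(nested).group(0)

print(flat_matches)
print(nested_match)
```

The second result stops at the inner closing tag, which is exactly the mismatch described above.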
Suppose we want a regexp to match integers in the range 0 to 99. At least one digit is required, so we start with the expression [0-9]{1,1}, which matches a single digit exactly once. This regexp matches integers in the range 0 to 9. To match integers up to 99, increase the maximum number of occurrences to 2, so the regexp becomes [0-9]{1,2}. This regexp satisfies the original requirement to match integers from 0 to 99, but it will also match integers that occur in the middle of strings. If we want the matched integer to be the whole string, we must use the anchor assertions, ^ (caret) and $ (dollar). When ^ is the first character in a regexp, it means the regexp must match from the beginning of the string. When $ is the last character of the regexp, it means the regexp must match to the end of the string. The regexp becomes ^[0-9]{1,2}$. Note that assertions, e.g. ^ and $, do not match characters but locations in the string.
If you have seen regexps described elsewhere, they may have looked different from the ones shown here. This is because some sets of characters and some quantifiers are so common that they have been given special symbols to represent them. [0-9] can be replaced with the symbol \d. The quantifier to match exactly one occurrence, {1,1}, can be replaced with the expression itself, i.e. x{1,1} is the same as x. So our 0 to 99 matcher could be written as ^\d{1,2}$. It can also be written ^\d\d{0,1}$, i.e. from the start of the string, match a digit, followed immediately by 0 or 1 digits. In practice, it would be written as ^\d\d?$. The ? is shorthand for the quantifier {0,1}, i.e. 0 or 1 occurrences. ? makes an expression optional. The regexp ^\d\d?$ means: from the beginning of the string, match one digit, followed immediately by 0 or 1 more digits, followed immediately by the end of the string.
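As a quick check of the equivalences above, the following sketch runs the three spellings of the 0 to 99 matcher against a few sample strings. Python's re module is used only as a stand-in (an assumption; it is not the engine of Find & Replace It!), and the sample strings are made up for illustration.

```python
import re

# Three equivalent spellings of the 0-to-99 matcher discussed above.
patterns = [r"^[0-9]{1,2}$", r"^\d{1,2}$", r"^\d\d?$"]
samples = ["6", "42", "123", "-6", "a7"]

# For each pattern, keep the samples that the pattern accepts.
results = {p: [s for s in samples if re.search(p, s)] for p in patterns}

# Without the anchors, the pattern also matches inside a longer string.
unanchored = re.search(r"[0-9]{1,2}", "item 123").group(0)

for p, matched in results.items():
    print(p, "->", matched)
print("unanchored:", unanchored)
```

All three anchored patterns accept the same samples, while the unanchored one picks two digits out of the middle of a longer string.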
To write a regexp that matches one of the words mail or letter or correspondence but does not match words that contain these words, e.g., email, mailman, mailer, and letterbox, start with a regexp that matches mail. Expressed fully, the regexp is m{1,1}a{1,1}i{1,1}l{1,1}, but because a character expression is automatically quantified by {1,1}, we can simplify the regexp to mail, i.e., an m followed by an a followed by an i followed by an l. Now we can use the vertical bar |, which means "or", to include the other two words, so our regexp for matching any of the three words becomes mail|letter|correspondence, i.e. match mail or letter or correspondence. While this regexp will match one of the three words we want to match, it will also match words we don't want to match, e.g., email.
To prevent the regexp from matching unwanted words, we must tell it to begin and end the match at word boundaries. First we enclose our regexp in parentheses, (mail|letter|correspondence). Parentheses group expressions together, and they identify a part of the regexp that we wish to capture. Enclosing the expression in parentheses allows us to use it as a component in more complex regexps. It also allows us to examine which of the three words was actually matched. To force the match to begin and end on word boundaries, we enclose the regexp in \b word boundary assertions: \b(mail|letter|correspondence)\b. Now the regexp means: match a word boundary, followed by the regexp in parentheses, followed by a word boundary. The \b assertion matches a position in the string, not a character. A word boundary is the position between a word character and a non-word character (e.g., a space or newline), or the beginning or end of a string.
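The effect of the word boundary assertions can be seen in a short sketch. It relies on Python's re module as a stand-in engine (an assumption), and the sample sentence is invented for the purpose of the example.

```python
import re

text = "I sent an email, then a letter, then more mail via the mailman."

# Without word boundaries, the pattern also fires inside email and mailman.
unbounded = re.findall(r"mail|letter|correspondence", text)

# With \b on both sides, only whole words are matched; findall returns
# the text captured by the parentheses.
bounded = re.findall(r"\b(mail|letter|correspondence)\b", text)

print(unbounded)
print(bounded)
```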
If we want to replace ampersand characters with the HTML entity &amp;, the regexp to match is simply &. But this regexp will also match ampersands that have already been converted to HTML entities. We want to replace only ampersands that are not already followed by amp;. For this, we need the negative lookahead assertion, (?!__). The regexp can then be written as &(?!amp;), i.e. match an ampersand that is not followed by amp;.
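A minimal sketch of this replacement, assuming Python's re module as a stand-in engine; the helper name escape_ampersands is hypothetical and exists only for this example.

```python
import re


def escape_ampersands(text: str) -> str:
    """Replace every & that is not already the start of the entity &amp;."""
    return re.sub(r"&(?!amp;)", "&amp;", text)


first = escape_ampersands("This & that")
second = escape_ampersands("His &amp; hers & theirs")
# Running the replacement a second time changes nothing.
again = escape_ampersands(second)

print(first)
print(second)
```

Because of the lookahead, already converted entities are left alone, so the replacement can safely be applied more than once.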
If we want to count all the occurrences of Eric and Eirik in a string, two valid solutions are \b(Eric|Eirik)\b and \bEi?ri[ck]\b. The word boundary assertion \b is required to avoid matching words that contain either name, e.g. Ericsson. Note that the second regexp matches more spellings than we want: Eric, Erik, Eiric and Eirik.
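The difference between the two solutions shows up when both are run on a string that contains all four spellings. This is a sketch under the assumption that Python's re module behaves like the tool's engine for these simple patterns; the sample sentence is invented.

```python
import re

text = "Eric, Erik, Eiric and Eirik all work at Ericsson."

# Explicit alternation: only the two spellings we asked for.
strict = re.findall(r"\b(Eric|Eirik)\b", text)

# Compact form: the optional i and the [ck] set accept two extra spellings.
loose = re.findall(r"\bEi?ri[ck]\b", text)

print(len(strict), strict)
print(len(loose), loose)
```

Neither pattern matches inside Ericsson, thanks to the trailing \b.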
Some of the examples discussed above are implemented in the examples section.
Regexps can match case insensitively when the Case sensitive check box is unchecked, and can use non-greedy matching when the Minimal match check box is checked.
| Element | Meaning |
|---|---|
| c | A character represents itself unless it has a special regexp meaning, e.g. c matches the character c. |
| \c | A character that follows a backslash matches the character itself, except as specified below, e.g. to match a literal caret at the beginning of a string, write \^. |
| \a | Matches the ASCII bell (BEL, 0x07). |
| \f | Matches the ASCII form feed (FF, 0x0C). |
| \n | Matches the ASCII line feed (LF, 0x0A, Unix newline). |
| \r | Matches the ASCII carriage return (CR, 0x0D). |
| \t | Matches the ASCII horizontal tab (HT, 0x09). |
| \v | Matches the ASCII vertical tab (VT, 0x0B). |
| \xhhhh | Matches the Unicode character corresponding to the hexadecimal number hhhh (between 0x0000 and 0xFFFF). |
| \0ooo (i.e., \zero ooo) | Matches the ASCII/Latin1 character for the octal number ooo (between 0 and 0377). |
| . (dot) | Matches any character (including newline). |
| \d | Matches a digit. |
| \D | Matches a non-digit. |
| \s | Matches a whitespace character including line separators. |
| \S | Matches a non-whitespace character. |
| \w | Matches a word character (letters, numbers, marks and '_'). |
| \W | Matches a non-word character. |
| \n (where n is a number) | The n-th backreference, e.g. \1, \2, etc. |
Square brackets mean match any character contained in the square brackets. The character set abbreviations described above can appear in a character set in square brackets. Except for the character set abbreviations and the following two exceptions, characters do not have special meanings in square brackets.
| Element | Meaning |
|---|---|
| ^ | The caret negates the character set if it occurs as the first character (i.e. immediately after the opening square bracket). [abc] matches a or b or c, but [^abc] matches anything but a or b or c. |
| - | The dash indicates a range of characters. [W-Z] matches W or X or Y or Z. |
Using the predefined character set abbreviations is more portable than using character ranges across platforms and languages. For example, [0-9] matches a digit in Western alphabets but \d matches a digit in any alphabet.
Note: In other regexp documentation, sets of characters are often called "character classes".
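The set rules above can be tried out with the following sketch. It assumes Python's re module as a stand-in engine; like the engine described here, Python lets \d match digits of any script when matching text strings. The Arabic-Indic sample digits are chosen for illustration only.

```python
import re

# A plain set, a negated set and a range.
in_set = re.findall(r"[abc]", "cabbage")
not_in_set = re.findall(r"[^abc]", "cabbage")
in_range = re.findall(r"[W-Z]", "VWXYZ")

# ARABIC-INDIC DIGIT THREE and FOUR: digits, but not in the range 0-9.
arabic_digits = "\u0663\u0664"
abbreviation_matches = re.fullmatch(r"\d+", arabic_digits) is not None
range_matches = re.fullmatch(r"[0-9]+", arabic_digits) is not None

print(in_set, not_in_set, in_range)
print(abbreviation_matches, range_matches)
```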
By default, an expression is automatically quantified by {1,1}, i.e. it should occur exactly once. In the following list, E stands for expression. An expression is a character, or an abbreviation for a set of characters, or a set of characters in square brackets, or an expression in parentheses.
| Quantifier | Meaning |
|---|---|
| E? | Matches zero or one occurrence of E. This quantifier means the previous expression is optional, because it will match whether or not the expression is found. E? is the same as E{0,1}, e.g., dents? matches dent or dents. |
| E+ | Matches one or more occurrences of E. E+ is the same as E{1,}, e.g., 0+ matches 0, 00, 000, etc. |
| E* | Matches zero or more occurrences of E. It is the same as E{0,}. The * quantifier is often used in error where + should be used. For example, if \s*$ is used in an expression to match strings that end in whitespace, it will match every string because \s*$ means match zero or more whitespace characters followed by end of string. The correct regexp to match strings that have at least one trailing whitespace character is \s+$. |
| E{n} | Matches exactly n occurrences of E. E{n} is the same as repeating E n times. For example, x{5} is the same as xxxxx. It is also the same as E{n,n}, e.g. x{5,5}. |
| E{n,} | Matches at least n occurrences of E. |
| E{,m} | Matches at most m occurrences of E. E{,m} is the same as E{0,m}. |
| E{n,m} | Matches at least n and at most m occurrences of E. |
To apply a quantifier to more than just the preceding character, use parentheses to group characters together in an expression. For example, tag+ matches a t followed by an a followed by at least one g, whereas (tag)+ matches at least one occurrence of tag.
Note: Quantifiers are normally "greedy". They always match as much text as they can. For example, 0+ matches the first zero it finds and all the consecutive zeros after it. Applied to 20005, it matches the three zeros in the middle (000). Quantifiers can be made non-greedy through the Minimal match check box.
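The quantifier behaviour described in this section can be sketched as follows, with Python's re module as a stand-in engine (an assumption). One difference to keep in mind: Python marks an individual quantifier as non-greedy with a trailing ?, whereas this manual describes a single Minimal match check box for the whole pattern.

```python
import re

# Greedy by default: 0+ takes every consecutive zero it can.
greedy = re.search(r"0+", "20005").group(0)
# Python-specific spelling of a minimal (non-greedy) quantifier.
minimal = re.search(r"0+?", "20005").group(0)

# tag+ repeats only the g; (tag)+ repeats the whole group.
g_only = re.fullmatch(r"tag+", "taggg") is not None
group_repeat = re.fullmatch(r"(tag)+", "tagtag") is not None
g_only_on_tagtag = re.fullmatch(r"tag+", "tagtag") is not None

# \s*$ succeeds on any string; \s+$ needs real trailing whitespace.
star_matches = re.search(r"\s*$", "no trailing space") is not None
plus_matches = re.search(r"\s+$", "no trailing space") is not None

print(greedy, minimal)
print(g_only, group_repeat, g_only_on_tagtag)
print(star_matches, plus_matches)
```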
Parentheses allow us to group elements together so that we can quantify and capture them. For example, if the expression mail|letter|correspondence matches a string, we know that one of the words matched but not which one. Parentheses allow us to "capture" whatever is matched within their bounds, so if we use (mail|letter|correspondence) and match this regexp against the string I sent you some email, we can use the %x replacement pattern (here %1) to extract the matched characters, in this case mail.
We can use captured text within the regexp itself. To refer to the captured text we use “backreferences” which are indexed from 1, the same as for %x. For example we could search for duplicate words in a string using \b(\w+)\W+\1\b which means match a word boundary followed by one or more word characters followed by one or more non-word characters followed by the same text as the first parenthesized expression followed by a word boundary.
If we want to use parentheses purely for grouping and not for capturing we can use the non-capturing syntax, e.g. (?:green|blue). Non-capturing parentheses begin (?: and end ). In this example we match either green or blue but we do not capture the match so we only know whether or not we matched but not which color we actually found. Using non-capturing parentheses is more efficient than using capturing parentheses since the regexp engine has to do less book-keeping.
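Backreferences and the two kinds of parentheses can be illustrated with a short sketch. Python's re module serves as a stand-in engine (an assumption), and the sample strings are invented for the example.

```python
import re

# Duplicate-word finder from the text: \1 must repeat what group 1 captured.
duplicate = re.compile(r"\b(\w+)\W+\1\b")
match = duplicate.search("Paris in the the spring")
whole_match = match.group(0)
repeated_word = match.group(1)

# Capturing parentheses record which alternative was found.
capturing = re.search(r"(green|blue)", "a blue door").groups()
# Non-capturing parentheses group without recording anything.
non_capturing = re.search(r"(?:green|blue)", "a blue door").groups()

print(whole_match, "/", repeated_word)
print(capturing, non_capturing)
```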
Captured text can be accessed in replacement pattern using %0 which returns the full matched expression, or using %i (with 1
Both capturing and non-capturing parentheses may be nested.
Assertions make some statement about the text at the point where they occur in the regexp but they do not match any characters. In the following list E stands for any expression.
| Assertion | Meaning |
|---|---|
| ^ | The caret signifies the beginning of the string. If you wish to match a literal ^ you must escape it by writing \^. For example, ^#include will only match strings which begin with the characters #include. (When the caret is the first character of a character set it has a special meaning, see Sets of Characters.) |
| $ | The dollar signifies the end of the string. For example \d\s*$ will match strings which end with a digit optionally followed by whitespace. If you wish to match a literal $ you must escape it by writing \$. |
| \b | A word boundary. For example the regexp \bOK\b means match immediately after a word boundary (e.g. start of string or whitespace) the letter O then the letter K immediately before another word boundary (e.g. end of string or whitespace). But note that the assertion does not actually match any whitespace, so if we write (\bOK\b) and we have a match it will only contain OK even if the string is It's OK now. |
| \B | A non-word boundary. This assertion is true wherever \b is false. For example if we searched for \Bon\B in "Left on" the match would fail (space and end of string aren't non-word boundaries), but it would match in tonne. |
| (?=E) | Positive lookahead. This assertion is true if the expression matches at this point in the regexp. For example, const(?=\s+char) matches const whenever it is followed by char, as in static const char *; only const is part of the match. (Compare with const\s+char, which matches const char in static const char *.) |
| (?!E) | Negative lookahead. This assertion is true if the expression does not match at this point in the regexp. For example, const(?!\s+char) matches const except when it is followed by char. |
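The following sketch exercises the assertions listed above. It assumes Python's re module as a stand-in engine, and the sample line of C code is made up for illustration. Offsets reported by Python start at 0.

```python
import re

line = "static const char *name; const int size;"

# Start offsets of const followed, or not followed, by char.
before_char = [m.start() for m in re.finditer(r"const(?=\s+char)", line)]
not_before_char = [m.start() for m in re.finditer(r"const(?!\s+char)", line)]

# A lookahead checks the following text without making it part of the match.
lookahead_text = re.search(r"const(?=\s+char)", line).group(0)
consuming_text = re.search(r"const\s+char", line).group(0)

# Word boundary and non-word boundary examples from the list above.
ok = re.search(r"\bOK\b", "It's OK now").group(0)
on_in_left_on = re.search(r"\Bon\B", "Left on")
on_in_tonne = re.search(r"\Bon\B", "tonne")

print(before_char, not_before_char)
print(repr(lookahead_text), repr(consuming_text))
print(ok, on_in_left_on is None, on_in_tonne is not None)
```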
Most command shells such as bash or cmd.exe support "file globbing", the ability to identify a group of files by using wildcards. The Syntax combo box is used to switch between regexp and wildcard mode. Wildcard matching is much simpler than full regexps and has only four features:
| Element | Meaning |
|---|---|
| c | Any character represents itself apart from those mentioned below. Thus c matches the character c. |
| ? | Matches any single character. It is the same as . in full regexps. |
| * | Matches zero or more of any characters. It is the same as .* in full regexps. |
| [...] | Sets of characters can be represented in square brackets, similar to full regexps. Within the character class, like outside, backslash has no special meaning. |
For example if we are in wildcard mode and have strings which contain filenames we could identify HTML files with *.html. This will match zero or more characters followed by a dot followed by h, t, m and l.
Wildcard matching can be convenient because of its simplicity, but any wildcard pattern can be expressed as a full regexp, e.g. *.html becomes .*\.html$. Notice that we can't match both .html and .htm files with a wildcard unless we use *.htm*, which will also match test.html.bak. A full regexp gives us the precision we need: .*\.html?$.
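The precision gap between the two syntaxes can be sketched with Python's standard library, used here only as a stand-in: fnmatch.fnmatchcase provides shell-style wildcard matching and re provides full regexps. This is an assumption, since the wildcard mode of Find & Replace It! may differ in details, and the file names are invented.

```python
import fnmatch
import re

names = ["index.html", "page.htm", "test.html.bak", "notes.txt"]

# The wildcard *.htm* is the closest we get, but it is too permissive.
by_wildcard = [n for n in names if fnmatch.fnmatchcase(n, "*.htm*")]

# The full regexp accepts .htm and .html and nothing after the extension.
by_regexp = [n for n in names if re.search(r".*\.html?$", n)]

print(by_wildcard)
print(by_regexp)
```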
Most of the character class abbreviations supported by Perl are supported by Find & Replace It! regexps; see Characters and abbreviations for sets of characters.
In Find & Replace It! regexps, apart from within character classes, ^ always signifies the start of the string, so carets must always be escaped unless used for that purpose. In Perl the meaning of the caret varies automagically depending on where it occurs, so escaping it is rarely necessary. The same applies to $, which in Find & Replace It! regexps always signifies the end of the string.
The quantifiers are the same as Perl's greedy quantifiers. Non-greedy matching cannot be applied to individual quantifiers, but it can be applied to all the quantifiers in the pattern. For example, to obtain the equivalent of the Perl regexp ro+?m, use ro+m with the Minimal match check box checked.
The equivalent of Perl's /i option is to leave the Case sensitive check box unchecked.
In Find & Replace It! regexps, . matches any character, so all regexps have the equivalent of Perl's /s option. There is no equivalent to Perl's /m option, but it can be emulated in various ways, for example by splitting the input into lines or by looping with a regexp that searches for newlines.
Because regexps are string oriented, there are no \A, \Z, or \z assertions. The \G assertion is not supported but can be emulated in a loop.
Perl's $& is %0. There are no equivalents for $`, $' or $+. Perl's capturing variables $1, $2, ... correspond to \1, \2, ... inside the search pattern and to %1, %2, ... inside the replacement pattern.
Perl's extended /x syntax is not supported, nor are directives, e.g. (?i), or regexp comments, e.g. (?#comment).
Both zero-width positive and zero-width negative look-ahead assertions (?=pattern) and (?!pattern) are supported with the same syntax as Perl. Perl's look-behind assertions, "independent" sub-expressions and conditional expressions are not supported.
Non-capturing parentheses are also supported, with the same (?:pattern) syntax.
Regexp: ^\d\d?$ (match integers from 0 to 99)
| Text | Result |
|---|---|
| 123 | Do not match |
| -6 | Do not match |
| 6 | Match |
The third string, 6, matches. This is a simple validation regexp for integers in the range 0 to 99.
Regexp: ^\S+$ (match strings without whitespace)
| Text | Result |
|---|---|
| Hello world | Do not match |
| This_is-OK | Match |
The second string, This_is-OK, matches. We've used the character set abbreviation \S (non-whitespace) and the anchors to match strings which contain no whitespace.
In the following example we match strings containing mail or letter or correspondence but only match whole words i.e. not email.
Regexp: \b(mail|letter|correspondence)\b (match the words mail, letter and correspondence)
| Text | Result |
|---|---|
| I sent you an email | Do not match |
| Please write the letter | Match |
In the second string, the word letter matches and is also captured (because of the parentheses). The captured text is available as %1 in the replacement pattern and as \1 inside the regexp: %1 = \1 = letter
This will capture the text from the first set of capturing parentheses (counting capturing left parentheses from left to right). The parentheses are counted from 1 since %0 (\0) is the whole matched regexp (equivalent to & in most regexp engines).
Regexp: &(?!amp;) (match ampersands but not &amp;)
| Text | Result |
|---|---|
| This & that | Match one occurrence at index 6 |
| His &amp; hers & theirs | Match one occurrence at index 16 |
1 This chapter is taken from the Qt ® documentation from Nokia ®, available under LGPL. It has been adapted to fit the purpose of this manual.