Sortix 1.1dev ports manual
PCREAPI(3) | Library Functions Manual | PCREAPI(3) |
NAME
PCRE - Perl-compatible regular expressions
PCRE NATIVE API BASIC FUNCTIONS
pcre *pcre_compile(const char *pattern, int options, const char **errptr, int *erroffset, const unsigned char *tableptr);
pcre *pcre_compile2(const char *pattern, int options, int *errorcodeptr, const char **errptr, int *erroffset, const unsigned char *tableptr);
pcre_extra *pcre_study(const pcre *code, int options, const char **errptr);
void pcre_free_study(pcre_extra *extra);
int pcre_exec(const pcre *code, const pcre_extra *extra, const char *subject, int length, int startoffset, int options, int *ovector, int ovecsize);
int pcre_dfa_exec(const pcre *code, const pcre_extra *extra, const char *subject, int length, int startoffset, int options, int *ovector, int ovecsize, int *workspace, int wscount);
PCRE NATIVE API STRING EXTRACTION FUNCTIONS
int pcre_copy_named_substring(const pcre *code, const char *subject, int *ovector, int stringcount, const char *stringname, char *buffer, int buffersize);
int pcre_copy_substring(const char *subject, int *ovector, int stringcount, int stringnumber, char *buffer, int buffersize);
int pcre_get_named_substring(const pcre *code, const char *subject, int *ovector, int stringcount, const char *stringname, const char **stringptr);
int pcre_get_stringnumber(const pcre *code, const char *name);
int pcre_get_stringtable_entries(const pcre *code, const char *name, char **first, char **last);
int pcre_get_substring(const char *subject, int *ovector, int stringcount, int stringnumber, const char **stringptr);
int pcre_get_substring_list(const char *subject, int *ovector, int stringcount, const char ***listptr);
void pcre_free_substring(const char *stringptr);
void pcre_free_substring_list(const char **stringptr);
PCRE NATIVE API AUXILIARY FUNCTIONS
int pcre_jit_exec(const pcre *code, const pcre_extra *extra, const char *subject, int length, int startoffset, int options, int *ovector, int ovecsize, pcre_jit_stack *jstack);
pcre_jit_stack *pcre_jit_stack_alloc(int startsize, int maxsize);
void pcre_jit_stack_free(pcre_jit_stack *stack);
void pcre_assign_jit_stack(pcre_extra *extra, pcre_jit_callback callback, void *data);
const unsigned char *pcre_maketables(void);
int pcre_fullinfo(const pcre *code, const pcre_extra *extra, int what, void *where);
int pcre_refcount(pcre *code, int adjust);
int pcre_config(int what, void *where);
const char *pcre_version(void);
int pcre_pattern_to_host_byte_order(pcre *code, pcre_extra *extra, const unsigned char *tables);
PCRE NATIVE API INDIRECTED FUNCTIONS
void *(*pcre_malloc)(size_t);
void (*pcre_free)(void *);
void *(*pcre_stack_malloc)(size_t);
void (*pcre_stack_free)(void *);
int (*pcre_callout)(pcre_callout_block *);
int (*pcre_stack_guard)(void);
PCRE 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
As well as support for 8-bit character strings, PCRE also supports 16-bit strings (from release 8.30) and 32-bit strings (from release 8.32), by means of two additional libraries. They can be built as well as, or instead of, the 8-bit library. To avoid too much complication, this document describes the 8-bit versions of the functions, with only occasional references to the 16-bit and 32-bit libraries.
The 16-bit and 32-bit functions operate in the same way as their 8-bit counterparts; they just use different data types for their arguments and results, and their names start with pcre16_ or pcre32_ instead of pcre_. For every option that has UTF8 in its name (for example, PCRE_UTF8), there are corresponding 16-bit and 32-bit names with UTF8 replaced by UTF16 or UTF32, respectively. This facility is in fact just cosmetic; the 16-bit and 32-bit option names define the same bit values.
References to bytes and UTF-8 in this document should be read as references to 16-bit data units and UTF-16 when using the 16-bit library, or 32-bit data units and UTF-32 when using the 32-bit library, unless specified otherwise. More details of the specific differences for the 16-bit and 32-bit libraries are given in the pcre16 and pcre32 pages.
PCRE API OVERVIEW
PCRE has its own native API, which is described in this document. There are also some wrapper functions (for the 8-bit library only) that correspond to the POSIX regular expression API, but they do not give access to all the functionality. They are described in the pcreposix documentation. Both of these APIs define a set of C function calls. A C++ wrapper (again for the 8-bit library only) is also distributed with PCRE. It is documented in the pcrecpp page. The native API C function prototypes are defined in the header file pcre.h, and on Unix-like systems the (8-bit) library itself is called libpcre. It can normally be accessed by adding -lpcre to the command for linking an application that uses PCRE. The header file defines the macros PCRE_MAJOR and PCRE_MINOR to contain the major and minor release numbers for the library. Applications can use these to include support for different releases of PCRE. In a Windows environment, if you want to statically link an application program against a non-dll pcre.a file, you must define PCRE_STATIC before including pcre.h or pcrecpp.h, because otherwise the pcre_malloc() and pcre_free() exported functions will be declared __declspec(dllimport), with unwanted results. The functions pcre_compile(), pcre_compile2(), pcre_study(), and pcre_exec() are used for compiling and matching regular expressions in a Perl-compatible manner. A sample program that demonstrates the simplest way of using them is provided in the file called pcredemo.c in the PCRE source distribution. A listing of this program is given in the pcredemo documentation, and the pcresample documentation describes how to compile and run it. Just-in-time compiler support is an optional feature of PCRE that can be built in appropriate hardware environments. It greatly speeds up the matching performance of many patterns. Simple programs can easily request that it be used if available, by setting an option that is ignored when it is not relevant. 
More complicated programs might need to make use of the functions pcre_jit_stack_alloc(), pcre_jit_stack_free(), and pcre_assign_jit_stack() in order to control the JIT code's memory usage. From release 8.32 there is also a direct interface for JIT execution, which gives improved performance. The JIT-specific functions are discussed in the pcrejit documentation.
A second matching function, pcre_dfa_exec(), which is not Perl-compatible, is also provided. This uses a different algorithm for the matching. The alternative algorithm finds all possible matches (at a given point in the subject), and scans the subject just once (unless there are lookbehind assertions). However, this algorithm does not return captured substrings. A description of the two matching algorithms and their advantages and disadvantages is given in the pcrematching documentation.
In addition to the main compiling and matching functions, there are convenience functions for extracting captured substrings from a subject string that is matched by pcre_exec(). They are:
pcre_copy_substring()
pcre_copy_named_substring()
pcre_get_substring()
pcre_get_named_substring()
pcre_get_substring_list()
pcre_get_stringnumber()
pcre_get_stringtable_entries()
NEWLINES
PCRE supports five different conventions for indicating line breaks in strings: a single CR (carriage return) character, a single LF (linefeed) character, the two-character sequence CRLF, any of the three preceding, or any Unicode newline sequence. The Unicode newline sequences are the three just mentioned, plus the single characters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS (paragraph separator, U+2029).
Each of the first three conventions is used by at least one operating system as its standard newline sequence. When PCRE is built, a default can be specified. If none is specified, the default is LF, which is the Unix standard. When PCRE is run, the default can be overridden, either when a pattern is compiled, or when it is matched.
At compile time, the newline convention can be specified by the options argument of pcre_compile(), or it can be specified by special text at the start of the pattern itself; this overrides any other settings. See the pcrepattern page for details of the special character sequences.
In the PCRE documentation the word "newline" is used to mean "the character or pair of characters that indicate a line break". The choice of newline convention affects the handling of the dot, circumflex, and dollar metacharacters, the handling of #-comments in /x mode, and, when CRLF is a recognized line ending sequence, the match position advancement for a non-anchored pattern. There is more detail about this in the section on pcre_exec() options below.
The choice of newline convention does not affect the interpretation of the \n or \r escape sequences, nor does it affect what \R matches, which is controlled in a similar way, but by separate options.
MULTITHREADING
The PCRE functions can be used in multi-threading applications, with the proviso that the memory management functions pointed to by pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the callout and stack-checking functions pointed to by pcre_callout and pcre_stack_guard, are shared by all threads.
The compiled form of a regular expression is not altered during matching, so the same compiled pattern can safely be used by several threads at once.
If the just-in-time optimization feature is being used, it needs separate memory stack areas for each thread. See the pcrejit documentation for more details.
SAVING PRECOMPILED PATTERNS FOR LATER USE
The compiled form of a regular expression can be saved and re-used at a later time, possibly by a different program, and even on a host other than the one on which it was compiled. Details are given in the pcreprecompile documentation, which includes a description of the pcre_pattern_to_host_byte_order() function. However, compiling a regular expression with one version of PCRE for use with a different version is not guaranteed to work and may cause crashes.
CHECKING BUILD-TIME OPTIONS
int pcre_config(int what, void *where);
The function pcre_config() makes it possible for a PCRE client to discover which optional features have been compiled into the PCRE library. The pcrebuild documentation has more details about these optional features.
The first argument for pcre_config() is an integer, specifying which information is required; the second argument is a pointer to a variable into which the information is placed. The returned value is zero on success, or the negative error code PCRE_ERROR_BADOPTION if the value in the first argument is not recognized. The following information is available:
PCRE_CONFIG_UTF8
PCRE_CONFIG_UTF16
PCRE_CONFIG_UTF32
PCRE_CONFIG_UNICODE_PROPERTIES
PCRE_CONFIG_JIT
PCRE_CONFIG_JITTARGET
PCRE_CONFIG_NEWLINE
PCRE_CONFIG_BSR
PCRE_CONFIG_LINK_SIZE
PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
PCRE_CONFIG_PARENS_LIMIT
PCRE_CONFIG_MATCH_LIMIT
PCRE_CONFIG_MATCH_LIMIT_RECURSION
PCRE_CONFIG_STACKRECURSE
COMPILING A PATTERN
pcre *pcre_compile(const char *pattern, int options, const char **errptr, int *erroffset, const unsigned char *tableptr);
pcre *pcre_compile2(const char *pattern, int options, int *errorcodeptr, const char **errptr, int *erroffset, const unsigned char *tableptr);
Either of the functions pcre_compile() or pcre_compile2() can be called to compile a pattern into an internal form. The only difference between the two interfaces is that pcre_compile2() has an additional argument, errorcodeptr, via which a numerical error code can be returned. To avoid too much repetition, we refer just to pcre_compile() below, but the information applies equally to pcre_compile2().
The pattern is a C string terminated by a binary zero, and is passed in the pattern argument. A pointer to a single block of memory that is obtained via pcre_malloc is returned. This contains the compiled code and related data. The pcre type is defined for the returned block; this is a typedef for a structure whose contents are not externally defined. It is up to the caller to free the memory (via pcre_free) when it is no longer required.
Although the compiled code of a PCRE regex is relocatable, that is, it does not depend on memory location, the complete pcre data block is not fully relocatable, because it may contain a copy of the tableptr argument, which is an address (see below).
The options argument contains various bit settings that affect the compilation. It should be zero if no options are required. The available options are described below. Some of them (in particular, those that are compatible with Perl, but some others as well) can also be set and unset from within the pattern (see the detailed description in the pcrepattern documentation). For those options that can be different in different parts of the pattern, the contents of the options argument specify their settings at the start of compilation and execution. The PCRE_ANCHORED, PCRE_BSR_xxx, PCRE_NEWLINE_xxx, PCRE_NO_UTF8_CHECK, and PCRE_NO_START_OPTIMIZE options can be set at the time of matching as well as at compile time.
If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise, if compilation of a pattern fails, pcre_compile() returns NULL, and sets the variable pointed to by errptr to point to a textual error message. This is a static string that is part of the library. You must not try to free it. Normally, the offset from the start of the pattern to the data unit that was being processed when the error was discovered is placed in the variable pointed to by erroffset, which must not be NULL (if it is, an immediate error is given). However, for an invalid UTF-8 or UTF-16 string, the offset is that of the first data unit of the failing character.
Some errors are not detected until the whole pattern has been scanned; in these cases, the offset passed back is the length of the pattern. Note that the offset is in data units, not characters, even in a UTF mode. It may sometimes point into the middle of a UTF-8 or UTF-16 character.
If pcre_compile2() is used instead of pcre_compile(), and the errorcodeptr argument is not NULL, a non-zero error code number is returned via this argument in the event of an error. This is in addition to the textual error message. Error codes and messages are listed below.
If the final argument, tableptr, is NULL, PCRE uses a default set of character tables that are built when PCRE is compiled, using the default C locale. Otherwise, tableptr must be an address that is the result of a call to pcre_maketables(). This value is stored with the compiled pattern, and used again by pcre_exec() and pcre_dfa_exec() when the pattern is matched. For more discussion, see the section on locale support below.
This code fragment shows a typical straightforward call to pcre_compile():
pcre *re;
const char *error;
int erroffset;
re = pcre_compile(
"^A.*Z", /* the pattern */
0, /* default options */
&error, /* for error message */
&erroffset, /* for error offset */
NULL); /* use default character tables */
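
When compilation fails, the value stored in erroffset can be used to point at the offending position in the pattern. The following is a minimal sketch in plain C and is not part of PCRE: format_error_marker() is a hypothetical helper that builds a line of spaces ending in a caret, intended to be printed under the pattern. It assumes one display column per data unit, which holds only for ASCII patterns.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a PCRE function): fills buf with `erroffset`
   spaces followed by '^' and a terminating zero. Returns 0 on success,
   or -1 if the offset lies outside the pattern or buf is too small. */
int format_error_marker(const char *pattern, int erroffset,
                        char *buf, size_t bufsize)
{
    size_t len;

    assert(pattern != NULL && buf != NULL);
    len = strlen(pattern);

    /* An offset equal to the pattern length is legal: it is what the
       text above describes for errors found after a full scan. */
    if (erroffset < 0 || (size_t)erroffset > len)
        return -1;
    if ((size_t)erroffset + 2 > bufsize)
        return -1;

    memset(buf, ' ', (size_t)erroffset);
    buf[erroffset] = '^';
    buf[erroffset + 1] = '\0';
    return 0;
}
```

Printing the pattern on one line and the marker on the next places the caret under the data unit at erroffset. In a UTF mode the offset may fall inside a multi-unit character, so the caret position is only approximate there.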
PCRE_ANCHORED
PCRE_AUTO_CALLOUT
PCRE_BSR_ANYCRLF
PCRE_BSR_UNICODE
PCRE_CASELESS
PCRE_DOLLAR_ENDONLY
PCRE_DOTALL
PCRE_DUPNAMES
PCRE_EXTENDED
PCRE_EXTRA
PCRE_FIRSTLINE
PCRE_JAVASCRIPT_COMPAT
PCRE_MULTILINE
PCRE_NEVER_UTF
PCRE_NEWLINE_CR
PCRE_NEWLINE_LF
PCRE_NEWLINE_CRLF
PCRE_NEWLINE_ANYCRLF
PCRE_NEWLINE_ANY
PCRE_NO_AUTO_CAPTURE
PCRE_NO_AUTO_POSSESS
PCRE_NO_START_OPTIMIZE
PCRE_UCP
PCRE_UNGREEDY
PCRE_UTF8
PCRE_NO_UTF8_CHECK
COMPILATION ERROR CODES
The following table lists the error codes that may be returned by pcre_compile2(), along with the error messages that may be returned by both compiling functions. Note that error messages are always 8-bit ASCII strings, even in 16-bit or 32-bit mode. As PCRE has developed, some error codes have fallen out of use. To avoid confusion, they have not been re-used.
0 no error
1 \ at end of pattern
2 \c at end of pattern
3 unrecognized character follows \
4 numbers out of order in {} quantifier
5 number too big in {} quantifier
6 missing terminating ] for character class
7 invalid escape sequence in character class
8 range out of order in character class
9 nothing to repeat
10 [this code is not in use]
11 internal error: unexpected repeat
12 unrecognized character after (? or (?-
13 POSIX named classes are supported only within a class
14 missing )
15 reference to non-existent subpattern
16 erroffset passed as NULL
17 unknown option bit(s) set
18 missing ) after comment
19 [this code is not in use]
20 regular expression is too large
21 failed to get memory
22 unmatched parentheses
23 internal error: code overflow
24 unrecognized character after (?<
25 lookbehind assertion is not fixed length
26 malformed number or name after (?(
27 conditional group contains more than two branches
28 assertion expected after (?(
29 (?R or (?[+-]digits must be followed by )
30 unknown POSIX class name
31 POSIX collating elements are not supported
32 this version of PCRE is compiled without UTF support
33 [this code is not in use]
34 character value in \x{} or \o{} is too large
35 invalid condition (?(0)
36 \C not allowed in lookbehind assertion
37 PCRE does not support \L, \l, \N{name}, \U, or \u
38 number after (?C is > 255
39 closing ) for (?C expected
40 recursive call could loop indefinitely
41 unrecognized character after (?P
42 syntax error in subpattern name (missing terminator)
43 two named subpatterns have the same name
44 invalid UTF-8 string (specifically UTF-8)
45 support for \P, \p, and \X has not been compiled
46 malformed \P or \p sequence
47 unknown property name after \P or \p
48 subpattern name is too long (maximum 32 characters)
49 too many named subpatterns (maximum 10000)
50 [this code is not in use]
51 octal value is greater than \377 in 8-bit non-UTF-8 mode
52 internal error: overran compiling workspace
53 internal error: previously-checked referenced subpattern not found
54 DEFINE group contains more than one branch
55 repeating a DEFINE group is not allowed
56 inconsistent NEWLINE options
57 \g is not followed by a braced, angle-bracketed, or quoted name/number or by a plain number
58 a numbered reference must not be zero
59 an argument is not allowed for (*ACCEPT), (*FAIL), or (*COMMIT)
60 (*VERB) not recognized or malformed
61 number is too big
62 subpattern name expected
63 digit expected after (?+
64 ] is an invalid data character in JavaScript compatibility mode
65 different names for subpatterns of the same number are not allowed
66 (*MARK) must have an argument
67 this version of PCRE is not compiled with Unicode property support
68 \c must be followed by an ASCII character
69 \k is not followed by a braced, angle-bracketed, or quoted name
70 internal error: unknown opcode in find_fixedlength()
71 \N is not supported in a class
72 too many forward references
73 disallowed Unicode code point (>= 0xd800 && <= 0xdfff)
74 invalid UTF-16 string (specifically UTF-16)
75 name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)
76 character value in \u.... sequence is too large
77 invalid UTF-32 string (specifically UTF-32)
78 setting UTF is disabled by the application
79 non-hex character in \x{} (closing brace missing?)
80 non-octal character in \o{} (closing brace missing?)
81 missing opening brace after \o
82 parentheses are too deeply nested
83 invalid range in character class
84 group name must start with a non-digit
85 parentheses are too deeply nested (stack check)
STUDYING A PATTERN
pcre_extra *pcre_study(const pcre *code, int options, const char **errptr);
If a compiled pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up the time taken for matching. The function pcre_study() takes a pointer to a compiled pattern as its first argument. If studying the pattern produces additional information that will help speed up matching, pcre_study() returns a pointer to a pcre_extra block, in which the study_data field points to the results of the study.
The returned value from pcre_study() can be passed directly to pcre_exec() or pcre_dfa_exec(). However, a pcre_extra block also contains other fields that can be set by the caller before the block is passed; these are described below in the section on matching a pattern.
If studying the pattern does not produce any useful information, pcre_study() returns NULL by default. In that circumstance, if the calling program wants to pass any of the other fields to pcre_exec() or pcre_dfa_exec(), it must set up its own pcre_extra block. However, if pcre_study() is called with the PCRE_STUDY_EXTRA_NEEDED option, it returns a pcre_extra block even if studying did not find any additional information. It may still return NULL, however, if an error occurs in pcre_study().
The second argument of pcre_study() contains option bits. There are three further options in addition to PCRE_STUDY_EXTRA_NEEDED:
PCRE_STUDY_JIT_COMPILE
PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
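int rc;
pcre *re;
pcre_extra *sd;
const char *error; /* receives a message from pcre_compile() or pcre_study() */
int erroffset; /* receives the error offset from pcre_compile() */
int ovector[30]; /* offsets vector for pcre_exec() */
re = pcre_compile("pattern", 0, &error, &erroffset, NULL);
sd = pcre_study(
  re, /* result of pcre_compile() */
  0, /* no options */
  &error); /* set to NULL or points to a message */
rc = pcre_exec( /* see below for details of pcre_exec() options */
  re, sd, "subject", 7, 0, 0, ovector, 30);
...
pcre_free_study(sd);
pcre_free(re);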
LOCALE SUPPORT
PCRE handles caseless matching, and determines whether characters are letters, digits, or whatever, by reference to a set of tables, indexed by character code point. When running in UTF-8 mode, or in the 16- or 32-bit libraries, this applies only to characters with code points less than 256. By default, higher-valued code points never match escapes such as \w or \d. However, if PCRE is built with Unicode property support, all characters can be tested with \p and \P, or, alternatively, the PCRE_UCP option can be set when a pattern is compiled; this causes \w and friends to use Unicode property support instead of the built-in tables.
The use of locales with Unicode is discouraged. If you are handling characters with code points greater than 127, you should either use Unicode support, or use locales, but not try to mix the two.
PCRE contains an internal set of tables that are used when the final argument of pcre_compile() is NULL. These are sufficient for many applications. Normally, the internal tables recognize only ASCII characters. However, when PCRE is built, it is possible to cause the internal tables to be rebuilt in the default "C" locale of the local system, which may cause them to be different.
The internal tables can always be overridden by tables supplied by the application that calls PCRE. These may be created in a different locale from the default. As more and more applications change to using Unicode, the need for this locale support is expected to die away.
External tables are built by calling the pcre_maketables() function, which has no arguments, in the relevant locale. The result can then be passed to pcre_compile() as often as necessary. For example, to build and use tables that are appropriate for the French locale (where accented characters with values greater than 127 are treated as letters), the following code could be used:
setlocale(LC_CTYPE, "fr_FR");
tables = pcre_maketables();
re = pcre_compile(..., tables);
INFORMATION ABOUT A PATTERN
int pcre_fullinfo(const pcre *code, const pcre_extra *extra, int what, void *where);
The pcre_fullinfo() function returns information about a compiled pattern. It replaces the pcre_info() function, which was removed from the library at version 8.30, after more than 10 years of obsolescence.
The first argument for pcre_fullinfo() is a pointer to the compiled pattern. The second argument is the result of pcre_study(), or NULL if the pattern was not studied. The third argument specifies which piece of information is required, and the fourth argument is a pointer to a variable to receive the data. The yield of the function is zero for success, or one of the following negative numbers:
PCRE_ERROR_NULL            the argument code was NULL, or the argument where was NULL
PCRE_ERROR_BADMAGIC        the "magic number" was not found
PCRE_ERROR_BADENDIANNESS   the pattern was compiled with different endianness
PCRE_ERROR_BADOPTION       the value of what was invalid
PCRE_ERROR_UNSET           the requested field is not set
int rc;
size_t length;
rc = pcre_fullinfo(
re, /* result of pcre_compile() */
sd, /* result of pcre_study(), or NULL */
PCRE_INFO_SIZE, /* what is required */
&length); /* where to put the data */
PCRE_INFO_BACKREFMAX
PCRE_INFO_CAPTURECOUNT
PCRE_INFO_DEFAULT_TABLES
PCRE_INFO_FIRSTBYTE (deprecated)
PCRE_INFO_FIRSTCHARACTER
PCRE_INFO_FIRSTCHARACTERFLAGS
PCRE_INFO_FIRSTTABLE
PCRE_INFO_HASCRORLF
PCRE_INFO_JCHANGED
PCRE_INFO_JIT
PCRE_INFO_JITSIZE
PCRE_INFO_LASTLITERAL
PCRE_INFO_MATCH_EMPTY
PCRE_INFO_MATCHLIMIT
PCRE_INFO_MAXLOOKBEHIND
PCRE_INFO_MINLENGTH
PCRE_INFO_NAMECOUNT
PCRE_INFO_NAMEENTRYSIZE
PCRE_INFO_NAMETABLE
(?<date> (?<year>(\d\d)?\d\d) -
(?<month>\d\d) - (?<day>\d\d) )
00 01 d a t e 00 ??
00 05 d a y 00 ?? ??
00 04 m o n t h 00
00 02 y e a r 00 ??
PCRE_INFO_OKPARTIAL
PCRE_INFO_OPTIONS
^ unless PCRE_MULTILINE is set
\A always
\G always
.* if PCRE_DOTALL is set and there are no back references to the subpattern in which .* appears
PCRE_INFO_RECURSIONLIMIT
PCRE_INFO_SIZE
PCRE_INFO_STUDYSIZE
PCRE_INFO_REQUIREDCHARFLAGS
PCRE_INFO_REQUIREDCHAR
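
The byte dump shown under PCRE_INFO_NAMETABLE can be read with a short loop. The sketch below is plain C and not a PCRE function: lookup_group_number() and sample_table are hypothetical names. It assumes the 8-bit library layout that the dump suggests, namely fixed-size entries whose first two bytes hold the group number (most significant byte first), followed by the zero-terminated name. The padding bytes shown as ?? in the dump are undefined; they are set to zero here for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: scans a name table with `name_count` entries of
   `entry_size` bytes each and returns the group number stored with
   `name`, or -1 if the name is not present. */
int lookup_group_number(const unsigned char *table, int name_count,
                        int entry_size, const char *name)
{
    int i;

    assert(table != NULL && name != NULL && entry_size > 2);
    for (i = 0; i < name_count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;

        /* Bytes 0-1: group number, big-endian. Bytes 2..: the name. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* The four entries from the dump above, each padded to 8 bytes. */
const unsigned char sample_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0, 0,
    0, 5, 'd', 'a', 'y', 0, 0, 0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0, 0,
};
```

In a real program the table pointer, entry count, and entry size would come from pcre_fullinfo() with PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT, and PCRE_INFO_NAMEENTRYSIZE, and pcre_get_stringnumber() performs this lookup directly.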
REFERENCE COUNTS
int pcre_refcount(pcre *code, int adjust);
The pcre_refcount() function is used to maintain a reference count in the data block that contains a compiled pattern. It is provided for the benefit of applications that operate in an object-oriented manner, where different parts of the application may be using the same compiled pattern, but you want to free the block when they are all done.
When a pattern is compiled, the reference count field is initialized to zero. It is changed only by calling this function, whose action is to add the adjust value (which may be positive or negative) to it. The yield of the function is the new value. However, the value of the count is constrained to lie between 0 and 65535, inclusive. If the new value is outside these limits, it is forced to the appropriate limit value.
Except when it is zero, the reference count is not correctly preserved if a pattern is compiled on one host and then transferred to a host whose byte-order is different. (This seems a highly unlikely scenario.)
MATCHING A PATTERN: THE TRADITIONAL FUNCTION
int pcre_exec(const pcre *code, const pcre_extra *extra, const char *subject, int length, int startoffset, int options, int *ovector, int ovecsize);
The function pcre_exec() is called to match a subject string against a compiled pattern, which is passed in the code argument. If the pattern was studied, the result of the study should be passed in the extra argument. You can call pcre_exec() with the same code and extra arguments as many times as you like, in order to match different subject strings with the same pattern.
This function is the main matching facility of the library, and it operates in a Perl-like manner. For specialist use there is also an alternative matching function, which is described below in the section about the pcre_dfa_exec() function.
In most applications, the pattern will have been compiled (and optionally studied) in the same process that calls pcre_exec(). However, it is possible to save compiled patterns and study data, and then use them later in different processes, possibly even on different hosts. For a discussion about this, see the pcreprecompile documentation.
Here is an example of a simple call to pcre_exec():
int rc;
int ovector[30];
rc = pcre_exec(
re, /* result of pcre_compile() */
NULL, /* we didn't study the pattern */
"some string", /* the subject string */
11, /* the length of the subject string */
0, /* start at offset 0 in the subject */
0, /* default options */
ovector, /* vector of integers for substring information */
30); /* number of elements (NOT size in bytes) */
Extra data for pcre_exec()
If the extra argument is not NULL, it must point to a pcre_extra data block. The pcre_study() function returns such a block (when it doesn't return NULL), but you can also create one for yourself, and pass additional information in it. The pcre_extra block contains the following fields (not necessarily in this order):
unsigned long int flags;
void * study_data;
void * executable_jit;
unsigned long int match_limit;
unsigned long int match_limit_recursion;
void * callout_data;
const unsigned char * tables;
unsigned char ** mark;
PCRE_EXTRA_CALLOUT_DATA
PCRE_EXTRA_EXECUTABLE_JIT
PCRE_EXTRA_MARK
PCRE_EXTRA_MATCH_LIMIT
PCRE_EXTRA_MATCH_LIMIT_RECURSION
PCRE_EXTRA_STUDY_DATA
PCRE_EXTRA_TABLES
(*LIMIT_MATCH=d)
(*LIMIT_RECURSION=d)
Option bits for pcre_exec()
The unused bits of the options argument for pcre_exec() must be zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART, PCRE_NO_START_OPTIMIZE, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_HARD, and PCRE_PARTIAL_SOFT.
If the pattern was successfully studied with one of the just-in-time (JIT) compile options, the only supported options for JIT execution are PCRE_NO_UTF8_CHECK, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART, PCRE_PARTIAL_HARD, and PCRE_PARTIAL_SOFT. If an unsupported option is used, JIT execution is disabled and the normal interpretive code in pcre_exec() is run.
PCRE_ANCHORED
PCRE_BSR_ANYCRLF
PCRE_BSR_UNICODE
PCRE_NEWLINE_CR
PCRE_NEWLINE_LF
PCRE_NEWLINE_CRLF
PCRE_NEWLINE_ANYCRLF
PCRE_NEWLINE_ANY
PCRE_NOTBOL
PCRE_NOTEOL
PCRE_NOTEMPTY
a?b?
PCRE_NOTEMPTY_ATSTART
PCRE_NO_START_OPTIMIZE
(*COMMIT)ABC
(*MARK:A)(X|Y)
PCRE_NO_UTF8_CHECK
PCRE_PARTIAL_HARD
PCRE_PARTIAL_SOFT
The string to be matched by pcre_exec()
The subject string is passed to pcre_exec() as a pointer in subject, a length in length, and a starting offset in startoffset. The units for length and startoffset are bytes for the 8-bit library, 16-bit data items for the 16-bit library, and 32-bit data items for the 32-bit library.
If startoffset is negative or greater than the length of the subject, pcre_exec() returns PCRE_ERROR_BADOFFSET. When the starting offset is zero, the search for a match starts at the beginning of the subject, and this is by far the most common case. In UTF-8 or UTF-16 mode, the offset must point to the start of a character, or the end of the subject (in UTF-32 mode, one data unit equals one character, so all offsets are valid). Unlike the pattern string, the subject may contain binary zeroes.
A non-zero starting offset is useful when searching for another match in the same subject by calling pcre_exec() again after a previous success. Setting startoffset differs from just passing over a shortened string and setting PCRE_NOTBOL in the case of a pattern that begins with any kind of lookbehind. For example, consider the pattern
\Biss\B
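
The rules for startoffset stated at the start of this section can be checked before calling pcre_exec(). The sketch below is plain C for the 8-bit library only; is_valid_start_offset() is a hypothetical helper, not a PCRE function. It treats any byte of the form 10xxxxxx as a UTF-8 continuation byte, which is an assumption about well-formed UTF-8 input rather than a full validity check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if startoffset would be acceptable for
   a subject of `length` bytes, 0 otherwise. Set `utf8` to non-zero when
   the pattern was compiled in UTF-8 mode. */
int is_valid_start_offset(const char *subject, int length,
                          int startoffset, int utf8)
{
    assert(subject != NULL && length >= 0);

    /* Negative or past-the-end offsets are rejected in every mode. */
    if (startoffset < 0 || startoffset > length)
        return 0;

    /* Outside UTF-8 mode, and at the end of the subject, any offset
       within range is fine. */
    if (!utf8 || startoffset == length)
        return 1;

    /* In UTF-8 mode the offset must not land on a continuation byte. */
    return ((unsigned char)subject[startoffset] & 0xC0) != 0x80;
}
```

For the four-byte subject "a\xC3\xA9z" (the letter a, a two-byte character, and z), offsets 0, 1, 3, and 4 pass in UTF-8 mode, while offset 2 points into the middle of the two-byte character and fails.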
How pcre_exec() returns captured substrings
In general, a pattern matches a certain portion of the subject, and in addition, further substrings from the subject may be picked out by parts of the pattern. Following the usage in Jeffrey Friedl's book, this is called "capturing" in what follows, and the phrase "capturing subpattern" is used for a fragment of a pattern that picks out a substring. PCRE supports several other kinds of parenthesized subpattern that do not cause substrings to be captured. Captured substrings are returned to the caller via a vector of integers whose address is passed in ovector. The number of elements in the vector is passed in ovecsize, which must be a non-negative number. Note: this argument is NOT the size of ovector in bytes. The first two-thirds of the vector is used to pass back captured substrings, each substring using a pair of integers. The remaining third of the vector is used as workspace by pcre_exec() while matching capturing subpatterns, and is not available for passing back information. The number passed in ovecsize should always be a multiple of three. If it is not, it is rounded down. When a match is successful, information about captured substrings is returned in pairs of integers, starting at the beginning of ovector, and continuing up to two-thirds of its length at the most. The first element of each pair is set to the offset of the first character in a substring, and the second is set to the offset of the first character after the end of a substring. These values are always data unit offsets, even in UTF mode. They are byte offsets in the 8-bit library, 16-bit data item offsets in the 16-bit library, and 32-bit data item offsets in the 32-bit library. Note: they are not character counts. The first pair of integers, ovector[0] and ovector[1], identify the portion of the subject string matched by the entire pattern. The next pair is used for the first capturing subpattern, and so on. 
The value returned by pcre_exec() is one more than the highest numbered pair that has been set. For example, if two substrings have been captured, the returned value is 3. If there are no capturing subpatterns, the return value from a successful match is 1, indicating that just the first pair of offsets has been set. If a capturing subpattern is matched repeatedly, it is the last portion of the string that it matched that is returned.
If the vector is too small to hold all the captured substring offsets, it is used as far as possible (up to two-thirds of its length), and the function returns a value of zero. If neither the actual string matched nor any captured substrings are of interest, pcre_exec() may be called with ovector passed as NULL and ovecsize as zero. However, if the pattern contains back references and the ovector is not big enough to remember the related substrings, PCRE has to get additional memory for use during matching. Thus it is usually advisable to supply an ovector of reasonable size.
There are some cases where zero is returned (indicating vector overflow) when in fact the vector is exactly the right size for the final match. For example, consider the pattern
(a)(?:(b)c|bd)
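The rules above for sizing ovector and interpreting the return value can be sketched as a few small helpers. This is only an illustration: the function names usable_pairs(), pairs_set() and capture_length() are hypothetical and are not part of the PCRE API, and the sketch assumes that PCRE sets both offsets of an unset capturing subpattern to -1.

```c
/* Number of offset pairs pcre_exec() can report for a vector of ovecsize
   ints. Only the first two-thirds of the vector carries results, and
   ovecsize is rounded down to a multiple of three. */
int usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}

/* Number of valid pairs after a call that returned rc:
   rc > 0  : rc pairs were set (pair 0 is the whole match);
   rc == 0 : the vector overflowed, so every usable pair was filled;
   rc < 0  : the call failed and no pair is meaningful. */
int pairs_set(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;
    if (rc == 0)
        return usable_pairs(ovecsize);
    return 0;
}

/* Length, in data units (not characters), of captured substring n.
   Returns -1 if n is out of range or the subpattern did not take part
   in the match (assumption: unset pairs hold -1). */
int capture_length(const int *ovector, int npairs, int n)
{
    if (n < 0 || n >= npairs)
        return -1;
    if (ovector[2 * n] < 0)
        return -1;
    return ovector[2 * n + 1] - ovector[2 * n];
}
```

A caller would typically pass pairs_set(rc, ovecsize) as the npairs argument, so that a zero return from pcre_exec() is handled in the same way as a positive one.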
Error return values from pcre_exec()
If pcre_exec() fails, it returns a negative number. The following are defined in the header file:
PCRE_ERROR_NOMATCH (-1)
PCRE_ERROR_NULL (-2)
PCRE_ERROR_BADOPTION (-3)
PCRE_ERROR_BADMAGIC (-4)
PCRE_ERROR_UNKNOWN_OPCODE (-5)
PCRE_ERROR_NOMEMORY (-6)
PCRE_ERROR_NOSUBSTRING (-7)
PCRE_ERROR_MATCHLIMIT (-8)
PCRE_ERROR_CALLOUT (-9)
PCRE_ERROR_BADUTF8 (-10)
PCRE_ERROR_BADUTF8_OFFSET (-11)
PCRE_ERROR_PARTIAL (-12)
PCRE_ERROR_BADPARTIAL (-13)
PCRE_ERROR_INTERNAL (-14)
PCRE_ERROR_BADCOUNT (-15)
PCRE_ERROR_RECURSIONLIMIT (-21)
PCRE_ERROR_BADNEWLINE (-23)
PCRE_ERROR_BADOFFSET (-24)
PCRE_ERROR_SHORTUTF8 (-25)
PCRE_ERROR_RECURSELOOP (-26)
PCRE_ERROR_JIT_STACKLIMIT (-27)
PCRE_ERROR_BADMODE (-28)
PCRE_ERROR_BADENDIANNESS (-29)
PCRE_ERROR_JIT_BADOPTION (-31)
PCRE_ERROR_BADLENGTH (-32)
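When logging failures it can help to turn the numeric return value back into its symbolic name. The following is a minimal sketch of such a lookup, using the values listed above (and -31 for PCRE_ERROR_JIT_BADOPTION, as defined in pcre.h). The function name exec_error_name() is hypothetical and not part of PCRE. Numeric literals are used so that the fragment compiles without pcre.h; real code would include the header and use the macros.

```c
/* Map a pcre_exec() return value to the name of the error code.
   Hypothetical helper for diagnostics, not provided by PCRE itself. */
const char *exec_error_name(int rc)
{
    switch (rc) {
    case -1:  return "PCRE_ERROR_NOMATCH";
    case -2:  return "PCRE_ERROR_NULL";
    case -3:  return "PCRE_ERROR_BADOPTION";
    case -4:  return "PCRE_ERROR_BADMAGIC";
    case -5:  return "PCRE_ERROR_UNKNOWN_OPCODE";
    case -6:  return "PCRE_ERROR_NOMEMORY";
    case -7:  return "PCRE_ERROR_NOSUBSTRING";
    case -8:  return "PCRE_ERROR_MATCHLIMIT";
    case -9:  return "PCRE_ERROR_CALLOUT";
    case -10: return "PCRE_ERROR_BADUTF8";
    case -11: return "PCRE_ERROR_BADUTF8_OFFSET";
    case -12: return "PCRE_ERROR_PARTIAL";
    case -13: return "PCRE_ERROR_BADPARTIAL";
    case -14: return "PCRE_ERROR_INTERNAL";
    case -15: return "PCRE_ERROR_BADCOUNT";
    case -21: return "PCRE_ERROR_RECURSIONLIMIT";
    case -23: return "PCRE_ERROR_BADNEWLINE";
    case -24: return "PCRE_ERROR_BADOFFSET";
    case -25: return "PCRE_ERROR_SHORTUTF8";
    case -26: return "PCRE_ERROR_RECURSELOOP";
    case -27: return "PCRE_ERROR_JIT_STACKLIMIT";
    case -28: return "PCRE_ERROR_BADMODE";
    case -29: return "PCRE_ERROR_BADENDIANNESS";
    case -31: return "PCRE_ERROR_JIT_BADOPTION";
    case -32: return "PCRE_ERROR_BADLENGTH";
    default:
        /* Non-negative values are successful returns, not errors. */
        return rc >= 0 ? "(not an error)" : "(unlisted error)";
    }
}
```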
Reason codes for invalid UTF-8 strings
This section applies only to the 8-bit library. The corresponding information for the 16-bit and 32-bit libraries is given in the pcre16 and pcre32 pages. When pcre_exec() returns either PCRE_ERROR_BADUTF8 or PCRE_ERROR_SHORTUTF8, and the size of the output vector (ovecsize) is at least 2, the offset of the start of the invalid UTF-8 character is placed in the first output vector element (ovector[0]) and a reason code is placed in the second element (ovector[1]). The reason codes are given names in the pcre.h header file:
PCRE_UTF8_ERR1
PCRE_UTF8_ERR2
PCRE_UTF8_ERR3
PCRE_UTF8_ERR4
PCRE_UTF8_ERR5
PCRE_UTF8_ERR6
PCRE_UTF8_ERR7
PCRE_UTF8_ERR8
PCRE_UTF8_ERR9
PCRE_UTF8_ERR10
PCRE_UTF8_ERR11
PCRE_UTF8_ERR12
PCRE_UTF8_ERR13
PCRE_UTF8_ERR14
PCRE_UTF8_ERR15
PCRE_UTF8_ERR16
PCRE_UTF8_ERR17
PCRE_UTF8_ERR18
PCRE_UTF8_ERR19
PCRE_UTF8_ERR20
PCRE_UTF8_ERR21
PCRE_UTF8_ERR22
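A caller can use the two vector elements described above to report where and why UTF-8 validation failed. The sketch below is an assumption-laden illustration: describe_utf8_failure() is a hypothetical name, the literals -10 and -25 stand for PCRE_ERROR_BADUTF8 and PCRE_ERROR_SHORTUTF8, and it assumes that the numeric value of PCRE_UTF8_ERRn is n, as in the PCRE 8.x header.

```c
#include <stdio.h>

/* Format a diagnostic for a UTF-8 validity failure from pcre_exec().
   Returns 1 and fills buf if rc is one of the two UTF-8 errors and the
   vector is large enough to carry the details; returns 0 otherwise. */
int describe_utf8_failure(int rc, const int *ovector, int ovecsize,
                          char *buf, size_t bufsize)
{
    if (rc != -10 && rc != -25)           /* BADUTF8 or SHORTUTF8 only */
        return 0;
    if (ovector == NULL || ovecsize < 2)  /* details were not recorded */
        return 0;
    snprintf(buf, bufsize,
             "invalid UTF-8 at byte offset %d (PCRE_UTF8_ERR%d)",
             ovector[0],                  /* start of the bad character */
             ovector[1]);                 /* reason code */
    return 1;
}
```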
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
int pcre_copy_substring(const char *subject, int *ovector, int stringcount, int stringnumber, char *buffer, int buffersize);
int pcre_get_substring(const char *subject, int *ovector, int stringcount, int stringnumber, const char **stringptr);
int pcre_get_substring_list(const char *subject, int *ovector, int stringcount, const char ***listptr);
Captured substrings can be accessed directly by using the offsets returned by pcre_exec() in ovector. For convenience, the functions pcre_copy_substring(), pcre_get_substring(), and pcre_get_substring_list() are provided for extracting captured substrings as new, separate, zero-terminated strings. These functions identify substrings by number. The next section describes functions for extracting named substrings.
A substring that contains a binary zero is correctly extracted and has a further zero added on the end, but the result is not, of course, a C string. However, you can process such a string by referring to the length that is returned by pcre_copy_substring() and pcre_get_substring(). Unfortunately, the interface to pcre_get_substring_list() is not adequate for handling strings containing binary zeros, because the end of the final string is not independently indicated.
The first three arguments are the same for all three of these functions: subject is the subject string that has just been successfully matched, ovector is a pointer to the vector of integer offsets that was passed to pcre_exec(), and stringcount is the number of substrings that were captured by the match, including the substring that matched the entire regular expression. This is the value returned by pcre_exec() if it is greater than zero. If pcre_exec() returned zero, indicating that it ran out of space in ovector, the value passed as stringcount should be the number of elements in the vector divided by three.
The functions pcre_copy_substring() and pcre_get_substring() extract a single substring, whose number is given as stringnumber. A value of zero extracts the substring that matched the entire pattern, whereas higher values extract the captured substrings. For pcre_copy_substring(), the string is placed in buffer, whose length is given by buffersize, while for pcre_get_substring() a new block of memory is obtained via pcre_malloc, and its address is returned via stringptr. The yield of the function is the length of the string, not including the terminating zero, or one of these error codes:
PCRE_ERROR_NOMEMORY (-6)
PCRE_ERROR_NOSUBSTRING (-7)
PCRE_ERROR_NOMEMORY (-6)
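To make the contract of pcre_copy_substring() concrete, here is a hand-written equivalent that works directly on the offsets in ovector. It is a sketch of the documented behaviour, not the library's source: the name copy_substring_sketch() and the SKETCH_ macros are hypothetical, and the macro values mirror the error codes listed above.

```c
#include <string.h>

#define SKETCH_ERROR_NOMEMORY    (-6)  /* mirrors PCRE_ERROR_NOMEMORY */
#define SKETCH_ERROR_NOSUBSTRING (-7)  /* mirrors PCRE_ERROR_NOSUBSTRING */

/* Copy captured substring number stringnumber into buffer and add a
   terminating zero. Returns the length of the substring, excluding the
   terminator, or one of the two negative codes above. */
int copy_substring_sketch(const char *subject, const int *ovector,
                          int stringcount, int stringnumber,
                          char *buffer, int buffersize)
{
    int start, length;

    if (stringnumber < 0 || stringnumber >= stringcount)
        return SKETCH_ERROR_NOSUBSTRING;

    start = ovector[2 * stringnumber];
    length = ovector[2 * stringnumber + 1] - start;

    if (buffersize < length + 1)       /* no room for text plus the zero */
        return SKETCH_ERROR_NOMEMORY;

    if (length > 0)
        memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';
    return length;
}
```

Because the length is returned separately from the data, a caller can still handle a substring that contains binary zeros, as noted above.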
EXTRACTING CAPTURED SUBSTRINGS BY NAME
int pcre_get_stringnumber(const pcre *code, const char *name);
int pcre_copy_named_substring(const pcre *code, const char *subject, int *ovector, int stringcount, const char *stringname, char *buffer, int buffersize);
int pcre_get_named_substring(const pcre *code, const char *subject, int *ovector, int stringcount, const char *stringname, const char **stringptr);
To extract a substring by name, you first have to find the associated number. For example, for this pattern
(a+)b(?<xxx>\d+)...
DUPLICATE SUBPATTERN NAMES
int pcre_get_stringtable_entries(const pcre *code, const char *name, char **first, char **last);
When a pattern is compiled with the PCRE_DUPNAMES option, names for subpatterns are not required to be unique. (Duplicate names are always allowed for subpatterns with the same number, created by using the (?| feature. Indeed, if such subpatterns are named, they are required to use the same names.)
Normally, patterns with duplicate names are such that in any one match, only one of the named subpatterns participates. An example is shown in the pcrepattern documentation.
When duplicates are present, pcre_copy_named_substring() and pcre_get_named_substring() return the first substring corresponding to the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING (-7) is returned; no data is returned. The pcre_get_stringnumber() function returns one of the numbers that are associated with the name, but it is not defined which it is.
If you want to get full details of all captured substrings for a given name, you must use the pcre_get_stringtable_entries() function. The first argument is the compiled pattern, and the second is the name. The third and fourth are pointers to variables which are updated by the function. After it has run, they point to the first and last entries in the name-to-number table for the given name. The function itself returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if there are none. The format of the table is described above in the section entitled Information about a pattern. Given all the relevant entries for the name, you can extract each of their numbers, and hence the captured data, if any.
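The last step, turning the table entries into subpattern numbers, might look like the sketch below. It assumes the 8-bit name table layout given in the section entitled Information about a pattern: each entry starts with the subpattern number in two bytes, most significant byte first, followed by the zero-terminated name. The function name group_numbers_for_name() is hypothetical. Its first three arguments correspond to the first and last pointers and the return value of pcre_get_stringtable_entries().

```c
/* Walk the name table entries from first to last (inclusive), each
   entrysize bytes long, and store the subpattern number of each entry
   in numbers[]. At most maxnumbers values are stored. Returns the
   total number of entries visited. */
int group_numbers_for_name(const char *first, const char *last,
                           int entrysize, int *numbers, int maxnumbers)
{
    const char *entry;
    int count = 0;

    if (entrysize <= 0)                /* error code or nothing to walk */
        return 0;

    for (entry = first; entry <= last; entry += entrysize) {
        if (count < maxnumbers) {
            /* Two-byte big-endian subpattern number. */
            numbers[count] = ((unsigned char)entry[0] << 8)
                           | (unsigned char)entry[1];
        }
        count++;
    }
    return count;
}
```

Each number obtained this way can then be used with the by-number extraction functions, or as an index into ovector, to find which of the duplicate subpatterns actually captured something.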
FINDING ALL POSSIBLE MATCHES
The traditional matching function uses a similar algorithm to Perl, which stops when it finds the first match, starting at a given point in the subject. If you want to find all possible matches, or the longest possible match, consider using the alternative matching function (see below) instead. If you cannot use the alternative function, but still need to find all possible matches, you can kludge it up by making use of the callout facility, which is described in the pcrecallout documentation. What you have to do is to insert a callout right at the end of the pattern. When your callout function is called, extract and save the current matched substring. Then return 1, which forces pcre_exec() to backtrack and try other alternatives. Ultimately, when it runs out of matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
OBTAINING AN ESTIMATE OF STACK USAGE
Matching certain patterns using pcre_exec() can use a lot of process stack, which in certain environments can be rather limited in size. Some users find it helpful to have an estimate of the amount of stack that is used by pcre_exec(), to help them set recursion limits, as described in the pcrestack documentation. The estimate that is output by pcretest when called with the -m and -C options is obtained by calling pcre_exec with the values NULL, NULL, NULL, -999, and -999 for its first five arguments. Normally, if its first argument is NULL, pcre_exec() immediately returns the negative error code PCRE_ERROR_NULL, but with this special combination of arguments, it returns instead a negative number whose absolute value is the approximate stack frame size in bytes. (A negative number is used so that it is clear that no match has happened.) The value is approximate because in some cases, recursive calls to pcre_exec() occur when there are one or two additional variables on the stack. If PCRE has been compiled to use the heap instead of the stack for recursion, the value returned is the size of each block that is obtained from the heap.
MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
int pcre_dfa_exec(const pcre *code, const pcre_extra *extra, const char *subject, int length, int startoffset, int options, int *ovector, int ovecsize, int *workspace, int wscount);
The function pcre_dfa_exec() is called to match a subject string against a compiled pattern, using a matching algorithm that scans the subject string just once, and does not backtrack. This has different characteristics to the normal algorithm, and is not compatible with Perl. Some of the features of PCRE patterns are not supported. Nevertheless, there are times when this kind of matching can be useful. For a discussion of the two matching algorithms, and a list of features that pcre_dfa_exec() does not support, see the pcrematching documentation.
The arguments for the pcre_dfa_exec() function are the same as for pcre_exec(), plus two extras. The ovector argument is used in a different way, and this is described below. The other common arguments are used in the same way as for pcre_exec(), so their description is not repeated here.
The two additional arguments provide workspace for the function. The workspace vector should contain at least 20 elements. It is used for keeping track of multiple paths through the pattern tree. More workspace will be needed for patterns and subjects where there are a lot of potential matches.
Here is an example of a simple call to pcre_dfa_exec():
int rc;
int ovector[10];
int wspace[20];
rc = pcre_dfa_exec(
  re,             /* result of pcre_compile() */
  NULL,           /* we didn't study the pattern */
  "some string",  /* the subject string */
  11,             /* the length of the subject string */
  0,              /* start at offset 0 in the subject */
  0,              /* default options */
  ovector,        /* vector of integers for substring information */
  10,             /* number of elements (NOT size in bytes) */
  wspace,         /* working space vector */
  20);            /* number of elements (NOT size in bytes) */
Option bits for pcre_dfa_exec()
The unused bits of the options argument for pcre_dfa_exec() must be zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_BSR_ANYCRLF, PCRE_BSR_UNICODE, PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD, PCRE_PARTIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last four of these are exactly the same as for pcre_exec(), so their description is not repeated here.
PCRE_PARTIAL_HARD
PCRE_PARTIAL_SOFT
PCRE_DFA_SHORTEST
PCRE_DFA_RESTART
Successful returns from pcre_dfa_exec()
When pcre_dfa_exec() succeeds, it may have matched more than one substring in the subject. Note, however, that all the matches from one run of the function start at the same point in the subject. The shorter matches are all initial substrings of the longer matches. For example, if the pattern
<.*>
is matched against the string
This is <something> <something else> <something further> no more
the three matched strings are
<something>
<something> <something else>
<something> <something else> <something further>
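The property that every match from one run starts at the same point can be checked, and the shortest match picked out, with a helper along the following lines. This is a sketch under stated assumptions: dfa_shortest_match() is a hypothetical name, rc is taken to be the number of matches reported, and each match is assumed to occupy one pair of offsets in ovector. The function does not rely on the order in which the matches are stored.

```c
/* Find the shortest of the rc matches described by ovector.
   On success, stores its start offset and length and returns the index
   of its pair. Returns -1 if rc is not positive or if the pairs do not
   all begin at the same offset. */
int dfa_shortest_match(const int *ovector, int rc, int *start, int *length)
{
    int i;
    int best = -1;

    if (rc <= 0)
        return -1;

    for (i = 0; i < rc; i++) {
        int len = ovector[2 * i + 1] - ovector[2 * i];

        if (ovector[2 * i] != ovector[0])  /* must share one start point */
            return -1;
        if (best < 0 || len < *length) {
            best = i;
            *length = len;
        }
    }
    *start = ovector[0];
    return best;
}
```

For the example subject above, the three matches all begin at offset 8 and end at offsets 19, 36 and 56, so the shortest match has length 11 and is the text <something>.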
Error returns from pcre_dfa_exec()
The pcre_dfa_exec() function returns a negative number when it fails. Many of the errors are the same as for pcre_exec(), and these are described above. There are in addition the following errors that are specific to pcre_dfa_exec():
PCRE_ERROR_DFA_UITEM (-16)
PCRE_ERROR_DFA_UCOND (-17)
PCRE_ERROR_DFA_UMLIMIT (-18)
PCRE_ERROR_DFA_WSSIZE (-19)
PCRE_ERROR_DFA_RECURSE (-20)
PCRE_ERROR_DFA_BADRESTART (-30)
SEE ALSO
pcre16(3), pcre32(3), pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3), pcrepartial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3).
AUTHOR
Philip Hazel University Computing Service Cambridge CB2 3QH, England.
REVISION
Last updated: 18 December 2015 Copyright (c) 1997-2015 University of Cambridge.
18 December 2015 | PCRE 8.39 |