lexdoc - documentation for lex, fast lexical analyzer generator
lex [-78BbcdFfhIiLlnpsTtVvw] [-C[aefFmr]] [-Pprefix]
[-Sskeleton] [filename ...]
The lex(1) utility is a tool for generating scanners, which are programs that recognize lexical patterns in text. The lex(1) utility reads the given input files, or its standard input if no file names are given, for a description of a scanner to generate. The description is in the form of pairs of regular expressions and C code, called rules. The lex(1) utility generates as output a C source file, lex.yy.c, which defines the routine yylex(). This file is compiled and linked with the -ll library to produce an executable. When the executable is run, it analyzes its input for occurrences of the regular expressions. Whenever it finds one, it executes the corresponding C code.
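For example, if the scanner description is in a file named scanner.l (the file name here is only illustrative), the following commands build the executable:
lex scanner.l
cc lex.yy.c -o scanner -ll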
These simple examples illustrate how to use lex(1).
The following lex(1) input specifies a scanner that, whenever it encounters the string "username", will replace it with the user's login name:
%%
username printf( "%s", getlogin() );
By default, any text not matched by a lex(1) scanner is copied to the output. The effect of this scanner is to copy its input file to its output with each occurrence of "username" expanded. In this input, there is one rule. The element "username" is the pattern, and "printf" is the action. The "%%" marks the beginning of the rules.
Another simple example follows:
    int num_lines = 0, num_chars = 0;
%%
\n ++num_lines; ++num_chars;
. ++num_chars;
%%
main()
{
yylex();
printf( "# of lines = %d, # of chars = %d\n",
num_lines, num_chars );
}
This scanner counts the number of characters and the number of lines in its input. It produces no output other than the final report on the counts. The first line declares two globals, "num_lines" and "num_chars", which are accessible both inside yylex() and in the main() routine declared after the second "%%". There are two rules, one that matches a newline ("\n") and increments both the line count and the character count, and one that matches any character other than a newline (indicated by the "." regular expression).
The next example is slightly more complicated:
/* scanner for a toy Pascal-like language */
%{
/* need this for the call to atof() below */
#include <math.h>
%}
DIGIT [0-9]
ID [a-z][a-z0-9]*
%%
{DIGIT}+ {
printf( "An integer: %s (%d)\n", yytext,
atoi( yytext ) );
}
{DIGIT}+"."{DIGIT}* {
printf( "A float: %s (%g)\n", yytext,
atof( yytext ) );
}
if|then|begin|end|procedure|function {
printf( "A keyword: %s\n", yytext );
}
{ID} printf( "An identifier: %s\n", yytext );
"+"|"-"|"*"|"/" printf( "An operator: %s\n", yytext );
"{"[^}\n]*"}" /* consume one-line comments */
[ \t\n]+ /* consume white space */
%%
main( argc, argv )
int argc;
char **argv;
{
++argv, --argc; /* skip over program name */
if ( argc > 0 )
yyin = fopen( argv[0], "r" );
else
yyin = stdin;
yylex();
}
This is the beginning of a simple scanner for a language like Pascal. It identifies different types of tokens and reports what it has seen.
The details of this example will be explained in the following sections.
The lex(1) input file consists of three sections, separated by a line with just %% in it:
definitions
%%
rules
%%
user code
The definitions section contains declarations of simple name definitions to simplify the scanner specification, and declarations of start conditions, which are explained in a later section.
Name definitions have the form:
name definition
The name is a word beginning with a letter or an underscore ('_'), followed by zero or more letters, digits, underscores ('_'), or dashes ('-'). The definition is taken to begin at the first non-white-space character following the name and to continue to the end of the line. The definition can subsequently be referred to using "{name}", which will expand to "(definition)". For example,
DIGIT [0-9]
ID [a-z][a-z0-9]*
defines "DIGIT" to be a regular expression that matches a single digit, and "ID" to be a regular expression that matches a letter followed by zero or more letters or digits. A subsequent reference to:
{DIGIT}+"."{DIGIT}*
is identical to:
([0-9])+"."([0-9])*
and matches one or more digits followed by a '.' that is followed
by zero or more digits.
The rules section of the lex(1) input contains a series of rules of the form:
pattern action
where the pattern must be unindented and the action must begin on
the same line.
Patterns and actions are described in more detail later in this topic.
Finally, the user code section is simply copied to lex.yy.c verbatim. It is used for companion routines that call or are called by the scanner. The presence of this section is optional; if it is missing, the second %% in the input file can also be skipped.
In the definitions and rules sections, any indented text or text enclosed by %{ and %} is copied verbatim to the output (with instances of %{} removed). The instances of %{} must appear unindented on lines by themselves.
In the rules section, any indented or %{} text appearing before the first rule can be used to declare variables that are local to the scanning routine and (after the declarations) code that will be executed whenever the scanning routine is entered. Other indented or %{} text in the rule section is still copied to the output, but its meaning is not well defined, and it might cause compile-time errors (this feature is present for POSIX compliance).
In the definitions section (but not in the rules section), an unindented comment (that is, a line beginning with /*) is also copied verbatim to the output up to the next */.
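For example, the following definitions-section fragment (a minimal sketch) begins with an unindented comment that is copied verbatim to lex.yy.c:
/* definitions for a toy scanner */
DIGIT [0-9]
ID [a-z][a-z0-9]*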
Patterns in the input are written using an extended set of regular expressions. These are described in the following table:
Pattern | Matches |
---|---|
x | Match the character 'x'. |
. | Any character except newline. |
[xyz] | A "character class"; in this case, the pattern matches either an 'x', 'y', or 'z'. |
[abj-oZ] | A "character class" that contains a range; matches an 'a', 'b', any letter from 'j' through 'o', or a 'Z'. |
[^A-Z] | A "negated character class" (that is, any character except those in the class. In this case, any character except an uppercase letter). |
[^A-Z\n] | Any character except an uppercase letter or a newline. |
r* | Zero or more instances of r, where r is any regular expression. |
r+ | One or more instances of r. |
r? | Zero or one instance of r (that is, an optional r). |
r{2,5} | Two to five instances of r. |
r{2,} | Two or more instances of r. |
r{4} | Exactly four instances of r. |
{name} | The expansion of the name definition. |
"[xyz]\"star" | The literal string: [xyz]"star. |
\X | If X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', the ANSI C interpretation of \X. Otherwise, a literal 'X' (used to escape operators such as '*'). |
\123 | The character with octal value 123. |
\x2a | The character with hexadecimal value 2a. |
(r) | Match an r; parentheses are used to override precedence (discussed later in this topic). |
rs | "Concatenation:" the regular expression r followed by the regular expression s. |
r|s | Either an r or an s. |
r/s | An r, but only if it is followed by an s. The s is not part of the matched text. This type of pattern is called "trailing context". |
^r | An r, but only at the beginning of a line. |
r$ | An r, but only at the end of a line. Equivalent to "r/\n". |
<s>r | An r, but only in start condition s. (See the discussion of start conditions later in this topic.) |
<s1,s2,s3>r | Same, but in any of start conditions s1, s2, or s3. |
<*>r | An r in any start condition, even an exclusive one. |
<<EOF>> | An end-of-file. |
<s1,s2><<EOF>> | An end-of-file when in start condition s1 or s2. |
Note that inside a character class, all regular-expression operators lose their special meaning except escape ('\') and the character class operators, '-', ']', and, at the beginning of the class, '^'.
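As a small illustration (the rule itself is hypothetical, not drawn from the examples above), the following pattern matches runs of literal '.', '*', '+', and '?' characters, since those operators lose their special meaning inside the class:
%%
[.*+?]+ printf( "punctuation run: %s\n", yytext );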
The regular expressions listed above are grouped according to precedence, from highest precedence at the top to lowest at the bottom. Those grouped together have equal precedence. For example,
cat|dog*
is the same as
(cat)|(do(g*))
because the '*' operator has higher precedence than concatenation,
and concatenation higher than alternation ('|'). Therefore, this
pattern matches either the string "cat" or the string "do"
followed by zero or more instances of g. To match "cat" or zero or
more instances of "dog", use:
cat|(dog)*
and to match zero or more instances of "cat" or "dog" use:
(cat|dog)*
Some notes on patterns: a rule can have at most one instance of trailing context (the '/' operator or the '$' operator), and the '^', '<s>', and <<EOF>> patterns can occur only at the beginning of a pattern; none of these can be grouped inside parentheses. The following are therefore illegal:
cat/dog$
<sc1>cat<sc2>dog
You can, however, write the first of these as:
cat/dog\n
In the following, the '$' and '^' are treated as normal characters rather than anchors, because a '^' that does not occur at the beginning of a rule and a '$' that does not occur at the end of a rule lose their special properties:
cat|(dog$)
cat|^dog
If you want a "cat", or you want a dog followed by a newline, the following could be used (the special '|' action is explained below):
cat |
dog$ /* action goes here */
A similar trick will work for matching a cat or matching a dog at the beginning of a line.
When the generated scanner is run, it analyzes its input looking for strings that match any of its patterns. If it finds more than one match, it takes the one matching the most text (for trailing-context rules, this includes the length of the trailing part, even though it will then be returned to the input). If it finds two or more matches of the same length, the rule listed first in the lex(1) input file is chosen.
Once the match is determined, the text corresponding to the match (called the token) is made available in the global character pointer yytext, and its length in the global integer yyleng. The action corresponding to the matched pattern is then executed (a more detailed description of actions follows), and then the remaining input is scanned for another match.
If no match is found, the default rule is executed: the next character in the input is considered matched and copied to the standard output. Thus, the simplest correct lex(1) input is:
%%
which generates a scanner that simply copies its input (one
character at a time) to its output.
Note that yytext can be defined in two ways: either as a character pointer or as a character array. You can control which definition lex(1) uses by including one of the special directives %pointer or %array in the first (definitions) section of your lex input. The default is %pointer, unless you use the -l AT&T lex compatibility option, in which case yytext will be an array. The advantage of using %pointer is that it provides substantially faster scanning and no buffer overflow when matching very large tokens (unless you run out of dynamic memory). The disadvantage is that you are restricted in how your actions can modify yytext (see the next section), and calls to the input() and unput() functions destroy the present contents of yytext. This can make moving between different lex(1) versions difficult.
The advantage of %array is that you can modify yytext as much as you want, and calls to input() and unput() do not destroy yytext. Furthermore, existing AT&T lex(1) programs sometimes access yytext externally using declarations of the form:
extern char yytext[];
This definition is erroneous when used with %pointer, but
correct for %array.
%array defines yytext to be an array of YYLMAX characters, which defaults to a fairly large value. You can change the size by simply defining YYLMAX to a different value in the first section of your lex(1) input. As mentioned above, with %pointer yytext grows dynamically to accommodate large tokens. Although your %pointer scanner can accommodate very large tokens (such as matching entire blocks of comments), each time the scanner must resize yytext, it also must rescan the entire token from the beginning, so matching such tokens can prove slow. At present, yytext does not dynamically grow if a call to unput() results in too much text being pushed back. Instead, a run-time error results.
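For example, the following definitions-section fragment (the size shown is an arbitrary choice for illustration) selects the array definition and enlarges it:
%array
%{
#define YYLMAX 16384 /* room for tokens up to 16383 characters plus the terminating NUL */
%}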
Each pattern in a rule has a corresponding action, which can be any arbitrary C statement. The pattern ends at the first non-escaped white-space character. The remainder of the line is its action. If the action is empty, when the pattern is matched, the input token is simply discarded. For example, here is the specification for a program that deletes all occurrences of "delete this string" from its input:
%%
"delete this string"
(It will copy all other characters in the input to the output since
they will be matched by the default rule.)
The following program compresses multiple blanks and tabs down to a single blank, and throws away white space found at the end of a line:
%%
[ \t]+ putchar( ' ' );
[ \t]+$ /* ignore this token */
If the action contains a '{', the action spans until the balancing '}' is found; the action can cross multiple lines. The lex(1) utility knows about C strings and comments and will not be misled by braces found within them. It also allows actions to begin with %{ and will consider the action to be all the text up to the next %} (regardless of ordinary braces inside the action).
An action consisting solely of a vertical bar ('|') means "same as the action for the next rule."
Actions can include arbitrary C code, including return statements to return a value to the routine that called yylex(). Each time yylex() is called it continues processing tokens from the point at which it last left off until it either reaches the end of the file or executes a return.
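A minimal sketch of this usage (the token codes are invented for illustration): the scanner below returns a code for each word or number it finds, and the caller loops until yylex() returns 0 at end-of-file:
%{
#define TOK_WORD 1 /* hypothetical token codes */
#define TOK_NUMBER 2
%}
%%
[a-z]+ return TOK_WORD;
[0-9]+ return TOK_NUMBER;
.|\n /* discard anything else */
%%
main()
{
    int tok;

    while ( (tok = yylex()) != 0 )
        printf( "token type %d: \"%s\"\n", tok, yytext );
}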
Actions can modify yytext, except for lengthening it (adding characters to its end--these will overwrite later characters in the input stream). Modifying the final character of yytext might alter whether rules anchored with '^' are active when scanning resumes. Specifically, changing the final character of yytext to a newline will activate such rules on the next scan, and changing it to anything else will deactivate the rules. Users should not rely on this behavior being present in future releases. Finally, note that none of this paragraph applies when using %array.
Actions are free to modify yyleng, except that they should not do so if the action also includes use of yymore().
There are a number of special directives that can be included within an action. ECHO copies yytext to the scanner's output. BEGIN, followed by the name of a start condition, places the scanner in that start condition (start conditions are discussed later in this topic). REJECT directs the scanner to proceed on to the "second-best" rule that matched the input (or a prefix of the input). For example, the following will both count the words in the input and call the routine special() whenever "frob" is seen:
    int word_count = 0;
%%
frob special(); REJECT;
[^ \t\n]+ ++word_count;
Without the REJECT, any instances of "frob" in the input
would not be counted as words, since the scanner normally executes
only one action per token. Multiple occurrences of REJECT
are allowed, each one finding the next best choice to the currently
active rule. For example, when the following scanner scans the
token "abcd", it will write "abcdabcaba" to the output:
%%
a |
ab |
abc |
abcd ECHO; REJECT;
.|\n /* consume any unmatched character */
(The first three rules share the action of the fourth because they
use the special '|' action.) REJECT is a particularly
expensive feature in terms of scanner performance; if it is used in
any of the scanner's actions, it will slow down all of the scanner's
matching. Furthermore, REJECT cannot be used with the
-Cf or -CF options (discussed later in this topic).
Note also that unlike the other special actions, REJECT is a branch; code immediately following it in the action will not be executed.
yymore() tells the scanner that the next time it matches a rule, the corresponding token should be appended onto the current value of yytext rather than replacing it. For example, given the input "hot-sun", the following will write "hot-hot-sun" to the output:
%%
hot- ECHO; yymore();
sun ECHO;
First "hot-" is matched and echoed to the output. Then "sun" is
matched, but the previous "hot-" is still hanging around at the
beginning of yytext so the ECHO for the "sun" rule
will actually write "hot-sun". The presence of yymore() in the scanner's actions entails a minor performance penalty in the scanner's matching speed.
yyless(n) returns all but the first n characters of the current token back to the input stream, where they will be rescanned when the scanner looks for the next match. yytext and yyleng are adjusted appropriately (for example, yyleng will now be equal to n). On the input "catdog", the following will write out "catdogdog":
%%
catdog ECHO; yyless(3);
[a-z]+ ECHO;
An argument of 0 to yyless() will cause the entire current
input string to be scanned again. Unless you have changed how the
scanner will subsequently process its input (using BEGIN,
for example), this will result in an endless loop.
Note that yyless is a macro and can only be used in the lex input file, not from other source files.
unput(c) puts the character c back onto the input stream; it will be the next character scanned. For example, the following action takes the current token and causes it to be rescanned enclosed in parentheses:
{
int i;
unput( ')' );
for ( i = yyleng - 1; i >= 0; --i )
unput( yytext[i] );
unput( '(' );
}
Note that since each unput() puts the given character back
at the beginning of the input stream, pushing back strings must be
done back-to-front. Also note that you cannot put back EOF
to attempt to mark the input stream with an end-of-file.
input() reads the next character from the input stream. For example, the following is one way to consume C comments:
%%
"/*" {
register int c;
for ( ; ; )
{
while ( (c = input()) != '*' &&
c != EOF )
; /* consume text of comment */
if ( c == '*' )
{
while ( (c = input()) == '*' )
;
if ( c == '/' )
break; /* found the end */
}
if ( c == EOF )
{
error( "EOF in comment" );
break;
}
}
}
(Note that if the scanner is compiled using C++, then input() is instead referred to as yyinput(), in order to avoid a name clash with the C++ stream named input.) Finally, yyterminate() can be used in lieu of a return statement in an action; it terminates the scanner and returns a 0 to the scanner's caller, indicating "all done".
The output of lex(1) is the file lex.yy.c, which contains the scanning routine yylex(), a number of tables used by it for matching tokens, and a number of auxiliary routines and macros. By default, yylex() is declared as follows:
int yylex()
{
... various definitions and the actions are placed here ...
}
(If your environment supports function prototypes, it will be "int yylex( void )".) This definition can be changed by defining the "YY_DECL" macro. For example, you could use:
#define YY_DECL float lexscan( a, b ) float a, b;
to give the scanning routine the name lexscan, returning
a float, and taking two floats as arguments. Note that if you give
arguments to the scanning routine using a K&R-style,
non-prototyped function declaration, you must terminate the
definition with a semicolon (;).
Whenever yylex() is called, it scans tokens from the global input file yyin (which defaults to stdin). It continues until it either reaches an end-of-file (at which point it returns the value 0) or one of its actions executes a return statement.
If the scanner reaches an end-of-file, subsequent calls are undefined unless either yyin is pointed at a new input file (in which case scanning continues from that file), or yyrestart() is called. yyrestart() takes one argument, a FILE * pointer, and initializes yyin for scanning from that file. Essentially there is no difference between just assigning yyin to a new input file or using yyrestart() to do so; the latter is available for compatibility with previous versions of lex(1), and because it can be used to switch input files in the middle of scanning. It can also be used to throw away the current input buffer, by calling it with an argument of yyin.
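For example, the following fragment (new_source is a hypothetical FILE pointer) switches the scanner to a different file, discarding any input buffered from the old one:
FILE *new_source = fopen( "next.input", "r" ); /* hypothetical file name */

if ( new_source )
    yyrestart( new_source ); /* scanning now continues from new_source */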
If yylex() stops scanning due to executing a return statement in one of the actions, the scanner can be called again, and it will resume scanning where it left off.
By default (and for purposes of efficiency), the scanner uses block-reads rather than simple getc(3) calls to read characters from yyin. The nature of how it gets its input can be controlled by defining the YY_INPUT macro. YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its action is to place up to max_size characters in the character array buf and return in the integer variable result either the number of characters read or the constant YY_NULL (traditionally 0) to indicate EOF. The default YY_INPUT reads from the global file pointer "yyin".
A sample definition of YY_INPUT (in the definitions section of the input file):
%{
#define YY_INPUT(buf,result,max_size) \
{ \
int c = getchar(); \
result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
}
%}
This definition will change the input processing to occur one character at a time.
You can also use this approach to add functionality, such as keeping track of the input line number. If you do so, however, your scanner might not run very fast.
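A sketch of such a definition, extending the one-character-at-a-time YY_INPUT above with a line count (my_line_num is a hypothetical global):
%{
int my_line_num = 1; /* hypothetical line counter */

#define YY_INPUT(buf,result,max_size) \
    { \
    int c = getchar(); \
    if ( c == '\n' ) \
        ++my_line_num; \
    result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
    }
%}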
When the scanner receives an end-of-file indication from YY_INPUT, it checks the yywrap() function. If yywrap() returns false (zero), it is assumed that the function has set up yyin to point to another input file, and scanning continues. If it returns true (non-zero), the scanner terminates, returning 0 to its caller.
The default yywrap() always returns 1.
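As an illustration, a replacement yywrap() such as the following sketch, placed in the user code section, could advance through a list of input files (file_list is an assumed global, not something lex(1) provides):
int yywrap()
{
    extern char **file_list; /* hypothetical NULL-terminated list of file names */

    if ( *++file_list )
        {
        yyin = fopen( *file_list, "r" );
        return yyin ? 0 : 1; /* 0 means scanning continues in the new file */
        }

    return 1; /* no more input; yylex() will return 0 */
}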
The scanner writes its ECHO output to the yyout global (default, stdout), which may be redefined by the user simply by assigning it to some other FILE pointer.
The lex(1) utility provides a mechanism for conditionally activating rules. Any rule whose pattern is prefixed with "<sc>" will be active only when the scanner is in the start condition named "sc". For example,
<STRING>[^"]* { /* consume the string body ... */
...
}
will be active only when the scanner is in the "STRING" start
condition, and
<INITIAL,STRING,QUOTE>\. { /* handle an escape ... */
...
}
will be active only when the current start condition is either
"INITIAL", "STRING", or "QUOTE".
Start conditions are declared in the definitions (first) section of the input using unindented lines beginning with either %s or %x followed by a list of names. The former declares inclusive start conditions, the latter exclusive start conditions. A start condition is activated using the BEGIN action. Until the next BEGIN action is executed, rules with the given start condition will be active, and rules with other start conditions will be inactive. If the start condition is inclusive, rules with no start conditions will also be active. If it is exclusive, only rules qualified with the start condition will be active. A set of rules contingent on the same exclusive start condition describe a scanner that is independent of any of the other rules in the lex(1) input. Because of this, exclusive start conditions make it easy to specify "mini-scanners" which scan portions of the input that are syntactically different from the rest (such as comments).
The following example illustrates the connection between inclusive and exclusive start conditions. The set of rules:
%s example
%%
<example>cat /* do something */
is equivalent to
%x example
%%
<INITIAL,example>cat /* do something */
Also note that the special start-condition specifier <*> matches every start condition. Thus, the above example could also have been written;
%x example
%%
<*>cat /* do something */
The default rule (to ECHO any unmatched character) remains active in start conditions.
BEGIN(0) returns to the original state where only rules with no start conditions are active. This state can also be referred to as the start-condition "INITIAL", so BEGIN(INITIAL) is equivalent to BEGIN(0). (The parentheses around the start condition name are not required but are considered good style.)
BEGIN actions can also be given as indented code at the beginning of the rules section. For example, the following will cause the scanner to enter the "SPECIAL" start condition whenever yylex() is called and the global variable
enter_special
is true:
    int enter_special;
%x SPECIAL
%%
    if ( enter_special )
        BEGIN(SPECIAL);
<SPECIAL>something here
...more rules follow...
The following scanner illustrates the uses of start conditions. It gives two different interpretations of a string like "123.456". By default, it will treat it as three tokens: the integer "123", a dot ('.'), and the integer "456". But if the string is preceded earlier in the line by the string "expect-floats", it will treat it as a single token, the floating-point number 123.456:
%{
#include <math.h>
%}
%s expect
%%
expect-floats BEGIN(expect);
<expect>[0-9]+"."[0-9]+ {
printf( "found a float, = %f\n",
atof( yytext ) );
}
<expect>\n {
/* that's the end of the line, so
* we need another "expect-floats"
* before we'll recognize any more
* numbers
*/
BEGIN(INITIAL);
}
[0-9]+ {
printf( "found an integer, = %d\n",
atoi( yytext ) );
}
"." printf( "found a dot\n" );
The following scanner recognizes (and discards) C comments while maintaining a count of the current input line:
%x comment
%%
    int line_num = 1;
"/*" BEGIN(comment);
<comment>[^*\n]* /* consume anything that is not a '*' */
<comment>"*"+[^*/\n]* /* consume occurrences of '*' not followed by a '/' */
<comment>\n ++line_num;
<comment>"*"+"/" BEGIN(INITIAL);
This scanner will try to match as much text as possible with each rule. In general, when attempting to write a high-speed scanner, it is advisable to try to match as much as possible in each rule.
Note that start-condition names are really integer values and can be stored as such. Thus, the above could be extended in the following fashion:
%x comment cat
%%
    int line_num = 1;
    int comment_caller;
"/*" {
comment_caller = INITIAL;
BEGIN(comment);
}
...
<cat>"/*" {
comment_caller = cat;
BEGIN(comment);
}
<comment>[^*\n]* /* consume anything that is not a '*' */
<comment>"*"+[^*/\n]* /* consume '*'s not followed by '/'s */
<comment>\n ++line_num;
<comment>"*"+"/" BEGIN(comment_caller);
You can also access the current start condition using the
integer-valued YY_START macro. For example, the above assignments
to comment_caller
could instead be written
comment_caller = YY_START;
Note that start conditions do not have their own name space; %s and %x declare names the same as #define does.
The following example illustrates how to match C-style quoted strings using exclusive start conditions, including expanded escape sequences (but not including checking for a string that is too long):
%x str
%%
    char string_buf[MAX_STR_CONST];
    char *string_buf_ptr;
\" string_buf_ptr = string_buf; BEGIN(str);
<str>\" { /* saw closing quote - all done */
BEGIN(INITIAL);
*string_buf_ptr = '\0';
/* return string constant token type and
* value to parser
*/
}
<str>\n {
/* error - unterminated string constant */
/* generate error message */
}
<str>\\[0-7]{1,3} {
/* octal escape sequence */
int result;
(void) sscanf( yytext + 1, "%o", &result );
if ( result > 0xff )
    ; /* error, constant is out-of-bounds; handle it here */
else
    *string_buf_ptr++ = result;
}
<str>\\[0-9]+ {
/* generate error - bad escape sequence; something
* like '\48' or '\0777777'
*/
}
<str>\\n *string_buf_ptr++ = '\n';
<str>\\t *string_buf_ptr++ = '\t';
<str>\\r *string_buf_ptr++ = '\r';
<str>\\b *string_buf_ptr++ = '\b';
<str>\\f *string_buf_ptr++ = '\f';
<str>\\(.|\n) *string_buf_ptr++ = yytext[1];
<str>[^\\\n\"]+ {
char *yytext_ptr = yytext;
while ( *yytext_ptr )
*string_buf_ptr++ = *yytext_ptr++;
}
Some scanners (such as those that support include files) require reading from several input streams. Because lex(1) scanners do a large amount of buffering, one cannot control where the next input will be read from by simply writing a YY_INPUT that is sensitive to the scanning context. YY_INPUT is only called when the scanner reaches the end of its buffer. This can be a long time after scanning a statement such as an "include" that requires switching the input source.
To alleviate these types of problems, lex(1) provides a mechanism for creating and switching between multiple input buffers. An input buffer is created by using:
YY_BUFFER_STATE yy_create_buffer( FILE *file, int size )
This takes a FILE pointer and a size and creates a buffer
associated with the given file that is large enough to hold
size characters (when in doubt, use YY_BUF_SIZE for the
size). It returns a YY_BUFFER_STATE handle, which can be passed to
other routines:
void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer )
switches the scanner's input buffer so subsequent tokens will come
from new_buffer. Note that yy_switch_to_buffer() can
be used by yywrap() to set things up for continued scanning,
instead of opening a new file and pointing yyin at it.
void yy_delete_buffer( YY_BUFFER_STATE buffer )
is used to reclaim the storage associated with a buffer.
yy_new_buffer() is an alias for yy_create_buffer(), and is provided for compatibility with the C++ use of new and delete for creating and destroying dynamic objects.
Finally, the YY_CURRENT_BUFFER macro returns a YY_BUFFER_STATE handle to the current buffer.
The following example illustrates how to use these features to write a scanner that expands include files (the <<EOF>> feature is discussed later in this topic):
/* the "incl" state is used for picking up the name
* of an include file
*/
%x incl
%{
#define MAX_INCLUDE_DEPTH 10
YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
int include_stack_ptr = 0;
%}
%%
include BEGIN(incl);
[a-z]+ ECHO;
[^a-z\n]*\n? ECHO;
<incl>[ \t]* /* consume the white space */
<incl>[^ \t\n]+ { /* got the include file name */
if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
{
fprintf( stderr, "Includes nested too deeply" );
exit( 1 );
}
include_stack[include_stack_ptr++] =
YY_CURRENT_BUFFER;
yyin = fopen( yytext, "r" );
if ( ! yyin )
error( ... );
yy_switch_to_buffer(
yy_create_buffer( yyin, YY_BUF_SIZE ) );
BEGIN(INITIAL);
}
<<EOF>> {
if ( --include_stack_ptr < 0 )
{
yyterminate();
}
else
{
yy_delete_buffer( YY_CURRENT_BUFFER );
yy_switch_to_buffer(
include_stack[include_stack_ptr] );
}
}
The special rule "<<EOF>>" indicates actions to be taken when an end-of-file is encountered and yywrap() returns non-zero (that is, indicates no further files to process). The action must finish by doing one of four things:
<<EOF>> rules cannot be used with other patterns; they can only be qualified with a list of start conditions. If an unqualified <<EOF>> rule is given, it applies to all start conditions that do not already have <<EOF>> actions. To specify an <<EOF>> rule for only the initial start condition, use
<INITIAL><<EOF>>
These rules are useful for catching things like unclosed comments, as in the following example:
%x quote
%%
...other rules for dealing with quotes...
<quote><<EOF>> {
error( "unterminated quote" );
yyterminate();
}
<<EOF>> {
if ( *++filelist )
yyin = fopen( *filelist, "r" );
else
yyterminate();
}
The macro YY_USER_ACTION can be defined to provide an action that is always executed prior to the action of the matched rule. For example, it could be defined (using #define) to call a routine to convert yytext to lowercase.
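A sketch of such a definition (the helper routine is hypothetical):
%{
#include <ctype.h>

extern char *yytext; /* the default %pointer definition */

/* hypothetical helper: convert the current token to lowercase */
static void lowercase_yytext()
{
    char *p;

    for ( p = yytext; *p != '\0'; ++p )
        *p = tolower( (unsigned char) *p );
}

#define YY_USER_ACTION lowercase_yytext();
%}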
The macro YY_USER_INIT can be defined to provide an action that is always executed before the first scan (and before the scanner's internal initializations are done). For example, it could be used to call a routine to read in a data table or open a logging file.
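A sketch of such a definition, here opening a hypothetical log file before the first scan:
%{
#include <stdio.h>

FILE *my_log; /* hypothetical log-file pointer */

#define YY_USER_INIT my_log = fopen( "scan.log", "w" );
%}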
In the generated scanner, the actions are all gathered in one large switch statement and separated using YY_BREAK, which may be redefined. By default, it is simply a "break", to separate each rule's action from the action of the following rule. Redefining YY_BREAK allows, for example, C++ users to #define YY_BREAK to do nothing (while being very careful that every rule ends with a "break" or a "return") to avoid getting unreachable-statement warnings where, because a rule's action ends with "return", the YY_BREAK would be unreachable.
One of the ways to use lex(1) is as a companion to the yacc(1) parser generator. The yacc(1) parsers expect to call a routine named yylex() to find the next input token. The routine is supposed to return the type of the next token as well as putting any associated value in the global yylval. To use lex(1) with yacc(1), specify the -d option to yacc(1). This instructs yacc(1) to generate the file y.tab.h, which contains definitions of all the %tokens appearing in the yacc(1) input. This file is then included in the lex(1) scanner. For example, if one of the tokens is "TOK_NUMBER", part of the scanner might look like:
%{
#include "y.tab.h"
%}
%%
[0-9]+ yylval = atoi( yytext ); return TOK_NUMBER;
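For reference, the yacc(1) side of this example might declare the token as follows (a minimal, hypothetical grammar fragment; running yacc with -d on it produces the y.tab.h included above):
%token TOK_NUMBER
%%
input : /* empty */
    | input TOK_NUMBER
    ;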
The lex(1) utility has the following options:
The options -Cf or -CF and -Cm do not make sense together; there is no opportunity for meta-equivalence classes if the table is not being compressed. Otherwise, the options can be freely mixed.
The default setting is -Cem, which specifies that lex(1) should generate equivalence classes and meta-equivalence classes. This setting provides the highest degree of table compression. You can trade off faster-executing scanners at the cost of larger tables with the following generally being true:
Slowest and smallest:
    -Cem
    -Cm
    -Ce
    -C
    -C{f,F}e
    -C{f,F}
    -C{f,F}a
Fastest and largest.
-C options are cumulative.
For example, with the -d option the generated scanner runs in debug mode, writing to stderr a line of the following form whenever a rule is matched:
--accepting rule at line 53 ("the matched text")
The line number refers to the location of the rule in the file
defining the scanner (that is, the file that was provided to
lex). Messages are also generated when the scanner backs up,
accepts the default rule, reaches the end of its input buffer (or
encounters a NUL; as far as the scanner is concerned, the two look
identical), or reaches an end-of-file. The -F option is equivalent to -CFr (discussed later in this topic).
Note that -I cannot be used in conjunction with full or fast tables; that is, the -f, -F, -Cf, or -CF flags. For other table-compression options, -I is the default.
The -Pprefix option changes the default yy prefix used by lex(1) to prefix instead. See lexdoc(1) for a description of all the global variables and file names that this affects.
The main design goal of lex(1) is to generate high-performance scanners. It has been optimized for dealing well with large sets of rules. Aside from the effects on scanner speed of the table-compression -C options outlined above, there are a number of options and actions that degrade performance. These are listed below, from most to least expensive, with the first three having considerable impact on performance and the last two having little:
REJECT
pattern sets that require backing up
arbitrary trailing context
the '^' beginning-of-line operator
yymore()
Note also that unput() is implemented as a routine call that potentially does quite a bit of work, while the yyless() macro has little impact on performance; so if you are simply putting back some excess text you scanned, use yyless().
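For example, the following sketch (a hypothetical rule) deliberately scans past a trailing '.' and then uses yyless() to return it to the input stream:
%%
[0-9]+"." {
    yyless( yyleng - 1 ); /* put the '.' back; yytext now holds only the digits */
    printf( "integer: %s\n", yytext );
    }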
Avoid REJECT at all costs when performance is important. It is a particularly expensive option.
Eliminating backing up is difficult and can be an enormous amount of work for a complicated scanner. In principle, one begins by using the -b flag to generate a lex.backup file. For example, on the input
%%
cat return TOK_KEYWORD;
catdog return TOK_KEYWORD;
the file appears as follows:
State #6 is non-accepting -
associated rule line numbers:
2 3
out-transitions: [ o ]
jam-transitions: EOF [ \001-n p-\177 ]
State #8 is non-accepting -
associated rule line numbers:
3
out-transitions: [ a ]
jam-transitions: EOF [ \001-' b-\177 ]
State #9 is non-accepting -
associated rule line numbers:
3
out-transitions: [ r ]
jam-transitions: EOF [ \001-q s-\177 ]
Compressed tables always back up.
The first few lines indicate that a scanner state exists in which it can make a transition on an 'a' but not on any other character. These lines also indicate that, in that state, the currently scanned text does not match any rule. The state occurs when trying to match the rules found at lines 2 and 3 in the input file. If the scanner is in that state and then reads something other than an 'a', it will have to back up to find a rule that is matched. It becomes apparent that this must be the state the scanner is in when it has seen "ca". When this has happened, if anything other than another 'a' is seen, the scanner will have to back up to simply match the 'c' (by the default rule).
The comment regarding State #8 indicates that there is a problem when "catd" has been scanned. Indeed, on any character other than an 'o', the scanner will have to back up to accept "cat". Similarly, the comment for State #9 concerns when "catdo" has been scanned and a 'g' does not follow.
The final comment is a reminder that it is not useful to remove backing up from the rules unless -Cf or -CF is being used, since there is no gain in performance doing so with compressed scanners.
The way to remove the backing up is to add "error" rules:
%%
cat return TOK_KEYWORD;
catdog return TOK_KEYWORD;
catdo |
catd |
ca {
/* false alarm, not really a keyword */
return TOK_ID;
}
Eliminating backing up among a list of keywords can also be done using a "catch-all" rule:
%%
cat return TOK_KEYWORD;
catdog return TOK_KEYWORD;
[a-z]+ return TOK_ID;
This is usually the best solution when appropriate.
Backing up messages tend to cascade. With a complicated set of rules, it is not uncommon to get hundreds of messages. If one can decipher them, however, it often only takes about a dozen rules to eliminate the backing up. (It is easy, however, to make a mistake and have an error rule accidentally match a valid token. A possible future lex(1) feature might automatically add rules to eliminate backing up).
Variable trailing context (where both the leading and trailing parts do not have a fixed length) entails almost the same performance loss as REJECT (that is, substantial). So when possible a rule like:
%%
mouse|rat/(cat|dog) run();
is better written:
%%
mouse/cat|dog run();
rat/cat|dog run();
or as
%%
mouse|rat/cat run();
mouse|rat/dog run();
Note that here the special '|' action does not have a positive impact on performance, and could even have a negative impact.
A final note regarding performance: dynamically resizing yytext to accommodate huge tokens is a slow process, because it presently requires that the (huge) token be rescanned from the beginning. If performance is vital, you should attempt to match "large" quantities of text but not "huge" quantities, where the cutoff between the two is at about 8K characters per token.
Another area where you can increase a scanner's performance (and one that is easier to implement) arises from the fact that the longer the tokens that are matched, the faster the scanner will run. This is because with long tokens the processing of most input characters takes place in the (short) inner scanning loop, and does not often have to go through the additional work of setting up the scanning environment (such as yytext) for the action. Recall the scanner for C comments:
%x comment
%%
    int line_num = 1;
"/*" BEGIN(comment);
<comment>[^*\n]*
<comment>"*"+[^*/\n]*
<comment>\n ++line_num;
<comment>"*"+"/" BEGIN(INITIAL);
This could be sped up by writing it as:
%x comment
%%
    int line_num = 1;
"/*" BEGIN(comment);
<comment>[^*\n]*
<comment>[^*\n]*\n ++line_num;
<comment>"*"+[^*/\n]*
<comment>"*"+[^*/\n]*\n ++line_num;
<comment>"*"+"/" BEGIN(INITIAL);
Instead of each newline requiring the processing of another action, the recognition of the newlines is "distributed" over the other rules to keep the matched text as long as possible. Note that adding rules does not slow down the scanner. The speed of the scanner is independent of the number of rules or (modulo the considerations given at the beginning of this section) how complicated the rules are with regard to operators such as '*' and '|'.
The following is a final example for speeding up a scanner. It presumes that you want to scan through a file that contains identifiers and keywords, one per line and with no other extraneous characters, recognizing all the keywords. The example provides a logical first approach:
%%
asm |
auto |
break |
... etc ...
volatile |
while /* it is a keyword */
.|\n /* it is not a keyword */
To eliminate the back-tracking, introduce a catch-all rule:
%%
asm |
auto |
break |
... etc ...
volatile |
while /* it is a keyword */
[a-z]+ |
.|\n /* it is not a keyword */
If it is certain that there is exactly one word per line, the total number of matches can be reduced by half by merging in the recognition of newlines with that of the other tokens:
%%
asm\n |
auto\n |
break\n |
... etc ...
volatile\n |
while\n /* it is a keyword */
[a-z]+\n |
.|\n /* it is not a keyword */
Caution should be used, as backing up has been reintroduced into the scanner. In particular, while you might know that there will never be any characters in the input stream other than letters or newlines, lex(1) cannot know this; it will plan for possibly needing to back up when it has scanned a token like "auto" and the next character is something other than a newline or a letter. Previously, it would then just match the "auto" rule and be done, but now it has no "auto" rule, only an "auto\n" rule. To eliminate the possibility of backing up, you could either duplicate all rules without final newlines or, because you never expect to encounter such input and therefore do not know how it is classified, you could introduce one more catch-all rule, this one not including a newline:
%%
asm\n |
auto\n |
break\n |
... etc ...
volatile\n |
while\n /* it is a keyword */
[a-z]+\n |
[a-z]+ |
.|\n /* it is not a keyword */
Compiled with -Cf, this is about as fast as one can get a
lex(1) scanner to go for this particular problem.
A final note: lex(1) is slow when matching NULs, particularly when a token contains multiple NULs. It is best to write rules that match short amounts of text if you anticipate that the text will often include NULs.
This lex(1) is a rewrite of the AT&T lex(1) tool (the two implementations do not share any code, however), with some extensions and incompatibilities, both of which are of concern to those who want to write scanners acceptable to either implementation. The POSIX lex(1) specification is closer to the behavior of this implementation of lex(1) than to the original AT&T lex(1) implementation. Some incompatibilities remain, however, between this lex(1) and POSIX. The intent is that, ultimately, this lex(1) will be fully POSIX-conformant. This section discusses all of the known areas of incompatibility.
The lex(1) -l option turns on maximum compatibility with the original AT&T lex(1) implementation, at the cost of a major loss in the generated scanner's performance. The incompatibilities that can be overcome using the -l option are discussed later in this section.
The lex(1) utility is fully compatible with AT&T lex(1) with the following exceptions:
If you have an interactive scanner and an interrupt handler that long-jumps out of the scanner, and the scanner is subsequently called again, you might get the following message:
fatal lex scanner internal error--end of buffer missed
To reenter the scanner, first use:
yyrestart( yyin );
Note that this call will discard any buffered input; usually this is not a problem with an interactive scanner.
Another difference: this lex(1) expands name definitions with enclosing parentheses, but AT&T lex(1) does not. With AT&T lex(1), the following:
NAME [A-Z][A-Z0-9]*
%%
cat{NAME}? printf( "Found it\n" );
%%
will not match the string "cat" because when the macro is expanded
the rule is equivalent to "cat[A-Z][A-Z0-9]*?" and the precedence
is such that the '?' is associated with "[A-Z0-9]*". With this
lex(1), the rule will be expanded to "cat([A-Z][A-Z0-9]*)?",
and so the string "cat" will match. The -l option eliminates this incompatibility.
The following lex(1) features are not included in AT&T lex(1) or the POSIX specification:
For example, in AT&T lex(1), the rule:
cat handle_cat(); ++num_cats_seen;
is (rather surprisingly) truncated to
cat handle_cat();
The lex(1) utility does not truncate the action. Actions that are not enclosed in braces are simply terminated at the end of the line.
The lex(1) utility produces the following diagnostic messages:
The warning "rule cannot be matched" indicates that the given rule cannot be matched because it follows other rules that will always match the same text. For example, in the following, "dog" cannot be matched because it comes after an identifier catch-all rule:
[a-z]+ got_identifier();
dog got_dog();
Using REJECT in a scanner suppresses this warning.
Vern Paxson, with the help of many ideas and much inspiration from Van Jacobson. Original version by Jef Poskanzer. The fast table representation is a partial implementation of a design done by Van Jacobson. The implementation was done by Kevin Gong and Vern Paxson.
Thanks to the many lex(1) beta-testers, feedbackers, and contributors, especially Francois Pinard, Casey Leedom, Nelson H.F. Beebe, benson@odi.com, Peter A. Bigot, Keith Bostic, Frederic Brehm, Nick Christopher, Jason Coughlin, Bill Cox, Dave Curtis, Scott David Daniels, Chris G. Demetriou, Mike Donahue, Chuck Doucette, Tom Epperly, Leo Eskin, Chris Faylor, Jon Forrest, Kaveh R. Ghazi, Eric Goldman, Ulrich Grepel, Jan Hajic, Jarkko Hietaniemi, Eric Hughes, John Interrante, Ceriel Jacobs, Jeffrey R. Jones, Henry Juengst, Amir Katz, ken@ken.hilco.com, Kevin B. Kenny, Marq Kole, Ronald Lamprecht, Greg Lee, Craig Leres, John Levine, Steve Liddle, Mohamed el Lozy, Brian Madsen, Chris Metcalf, Luke Mewburn, Jim Meyering, G.T. Nicol, Landon Noll, Marc Nozell, Richard Ohnemus, Sven Panne, Roland Pesch, Walter Pelissero, Gaumond Pierre, Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Frederic Raimbault, Rick Richardson, Kevin Rodgers, Jim Roskind, Doug Schmidt, Philippe Schnoebelen, Andreas Schwab, Alex Siegel, Mike Stump, Paul Stuart, Dave Tallman, Chris Thewalt, Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams, Ken Yap, Nathan Zelle, David Zuhn, and those whose names have slipped my marginal mail-archiving skills but whose contributions are appreciated all the same.
Thanks to Keith Bostic, Jon Forrest, Noah Friedman, John Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T. Nicol, Francois Pinard, Rich Salz, and Richard Stallman for help with various distribution headaches.
Thanks to Esmond Pitt and Earle Horton for 8-bit character support; to Benson Margulies and Fred Burke for C++ support; to Kent Williams and Tom Epperly for C++ class support; to Ove Ewerlid for support of NUL's; and to Eric Hughes for support of multiple buffers.
This work was primarily done when I was with the Real Time Systems Group at the Lawrence Berkeley Laboratory in Berkeley, CA. Many thanks to all there for the support I received.
Send comments to:
Vern Paxson
Systems Engineering
Bldg. 46A, Room 1123
Lawrence Berkeley Laboratory
University of California
Berkeley, CA 94720
vern@ee.lbl.gov
lex(1), yacc(1), sed(1), awk(1)
M. E. Lesk and E. Schmidt, LEX - Lexical Analyzer Generator