lexdoc

NAME

lexdoc - documentation for lex, a fast lexical analyzer generator

SYNOPSIS

lex [-78BbcdFfhIiLlnpsTtVvw] [-C[aefFmr]] [-Pprefix]
	[-Sskeleton] [filename ...]

DESCRIPTION

The lex(1) utility is a tool for generating scanners, which are programs that recognize lexical patterns in text. The lex(1) utility reads the given input files, or its standard input if no file names are given, for a description of a scanner to generate. The description is in the form of pairs of regular expressions and C code, called rules. The lex(1) utility generates as output a C source file, lex.yy.c, which defines the routine yylex(). This file is compiled and linked with the -ll library to produce an executable. When the executable is run, it analyzes its input for occurrences of the regular expressions. Whenever it finds one, it executes the corresponding C code.

SIMPLE EXAMPLES

These simple examples illustrate how to use lex(1).

The following lex(1) input specifies a scanner that, whenever it encounters the string "username", will replace it with the user's login name:

%%
username	printf( "%s", getlogin() );

By default, any text not matched by a lex(1) scanner is copied to the output. The effect of this scanner is to copy its input file to its output with each occurrence of "username" expanded. In this input, there is one rule. The element "username" is the pattern, and "printf" is the action. The "%%" marks the beginning of the rules.

Another simple example follows:

	int num_lines = 0, num_chars = 0;
%%
\n	++num_lines; ++num_chars;
.	++num_chars;
%%
main()
		{
		yylex();
		printf( "# of lines = %d, # of chars = %d\n",
				num_lines, num_chars );
	}

This scanner counts the number of characters and the number of lines in its input. It produces no output other than the final report on the counts. The first line declares two globals, "num_lines" and "num_chars", which are accessible both inside yylex() and in the main() routine declared after the second "%%". There are two rules, one that matches a newline ("\n") and increments both the line count and the character count, and one that matches any character other than a newline (indicated by the "." regular expression).

The next example is slightly more complicated:

/* scanner for a toy Pascal-like language */
%{
/* need this for the call to atof() below */
#include <math.h>
%}
DIGIT	[0-9]
ID	 [a-z][a-z0-9]*
%%
{DIGIT}+	{
			printf( "An integer: %s (%d)\n", yytext,
					atoi( yytext ) );
		}
{DIGIT}+"."{DIGIT}*		{
			printf( "A float: %s (%g)\n", yytext,
					atof( yytext ) );
		}
if|then|begin|end|procedure|function		{
			printf( "A keyword: %s\n", yytext );
		}
{ID}		printf( "An identifier: %s\n", yytext );
"+"|"-"|"*"|"/"   printf( "An operator: %s\n", yytext );
"{"[^}\n]*"}"	 /* consume one-line comments */
[ \t\n]+		/* consume white space */
%%
main( argc, argv )
int argc;
char **argv;
	{
	++argv, --argc;  /* skip over program name */
	if ( argc > 0 )
			yyin = fopen( argv[0], "r" );
	else
			yyin = stdin;
	yylex();
}

This is the beginning of a simple scanner for a language like Pascal. It identifies different types of tokens and reports what it has seen.

The details of this example will be explained in the following sections.

FORMAT OF THE INPUT FILE

The lex(1) input file consists of three sections, separated by a line with just %% in it:

definitions
%%
rules
%%
user code

The definitions section contains declarations of simple name definitions to simplify the scanner specification, and declarations of start conditions, which are explained in a later section.

Name definitions have the form:

name definition

The name is a word beginning with a letter or an underscore ('_'), followed by zero or more letters, digits, underscores ('_'), or dashes ('-'). The definition is taken to begin at the first non-white-space character following the name and to continue to the end of the line. The definition can subsequently be referred to using "{name}", which will expand to "(definition)". For example,

DIGIT	[0-9]
ID	 [a-z][a-z0-9]*

defines "DIGIT" to be a regular expression that matches a single digit, and "ID" to be a regular expression that matches a letter followed by zero or more letters or digits. A subsequent reference to:

{DIGIT}+"."{DIGIT}*
is identical to:
([0-9])+"."([0-9])*
and matches one or more digits followed by a '.' that is followed by zero or more digits.

The rules section of the lex(1) input contains a series of rules of the form:

pattern   action
where the pattern must be unindented and the action must begin on the same line.

Patterns and actions are described in more detail later in this topic.

Finally, the user code section is simply copied to lex.yy.c verbatim. It is used for companion routines that call or are called by the scanner. The presence of this section is optional; if it is missing, the second %% in the input file can also be skipped.

In the definitions and rules sections, any indented text or text enclosed by %{ and %} is copied verbatim to the output (with instances of %{} removed). The instances of %{} must appear unindented on lines by themselves.

In the rules section, any indented or %{} text appearing before the first rule can be used to declare variables that are local to the scanning routine and (after the declarations) code that will be executed whenever the scanning routine is entered. Other indented or %{} text in the rule section is still copied to the output, but its meaning is not well defined, and it might cause compile-time errors (this feature is present for POSIX compliance).

In the definitions section (but not in the rules section), an unindented comment (that is, a line beginning with /*) is also copied verbatim to the output up to the next */.

PATTERNS

Patterns in the input are written using an extended set of regular expressions. These are described in the following table:

Pattern Matches
x Match the character 'x'.
. Any character except newline.
[xyz] A "character class"; in this case, the pattern matches either an 'x', 'y', or 'z'.
[abj-oZ] A "character class" that contains a range; matches an 'a', 'b', any letter from 'j' through 'o', or a 'Z'.
[^A-Z] A "negated character class" (that is, any character except those in the class. In this case, any character except an uppercase letter).
[^A-Z\n] Any character except an uppercase letter or a newline.
r* Zero or more instances of r, where r is any regular expression.
r+ One or more instances of r.
r? Zero or one instance of r (that is, an optional r).
r{2,5} Two to five instances of r.
r{2,} Two or more instances of r.
r{4} Exactly four instances of r.
{name} The expansion of the name definition.
"[xyz]\"star" The literal string: [xyz]"star.
\X If X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', the ANSI C interpretation of \X. Otherwise, a literal 'X' (used to escape operators such as '*').
\123 The character with octal value 123.
\x2a The character with hexadecimal value 2a.
(r) Match an r; parentheses are used to override precedence (discussed later in this topic).
rs "Concatenation:" the regular expression r followed by the regular expression s.
r|s Either an r or an s.
r/s An r, but only if it is followed by an s. The s is not part of the matched text. This type of pattern is called "trailing context".
^r An r, but only at the beginning of a line.
r$ An r, but only at the end of a line. Equivalent to "r/\n".
<s>r An r, but only in start condition s. (See the discussion of start conditions later in this topic.)
<s1,s2,s3>r Same, but in any of start conditions s1, s2, or s3.
<*>r An r in any start condition, even an exclusive one.
<<EOF>> An end-of-file.
<s1,s2><<EOF>> An end-of-file when in start condition s1 or s2.

Note that inside a character class, all regular-expression operators lose their special meaning except escape ('\') and the character class operators, '-', ']', and, at the beginning of the class, '^'.

The regular expressions listed above are grouped according to precedence, from highest precedence at the top to lowest at the bottom. Those grouped together have equal precedence. For example,

cat|dog*
is the same as
(cat)|(do(g*))
because the '*' operator has higher precedence than concatenation, and concatenation higher than alternation ('|'). Therefore, this pattern matches either the string "cat" or the string "do" followed by zero or more instances of g. To match "cat" or zero or more instances of "dog", use:
cat|(dog)*
and to match zero or more instances of "cat" or "dog" use:
(cat|dog)*

HOW INPUT IS MATCHED

When the generated scanner is run, it analyzes its input looking for strings that match any of its patterns. If it finds more than one match, it takes the one matching the most text (for trailing-context rules, this includes the length of the trailing part, even though it will then be returned to the input). If it finds two or more matches of the same length, the rule listed first in the lex(1) input file is chosen.
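
For example, in the following sketch (a hypothetical pair of rules, not one of the examples above), the input "ifs" is matched by the identifier rule because that match is longer, while the input "if" alone is matched by the keyword rule because, of two matches of the same length, the earlier rule is chosen:

%%
if		printf( "keyword: if\n" );
[a-z]+	printf( "identifier: %s\n", yytext );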

Once the match is determined, the text corresponding to the match (called the token) is made available in the global character pointer yytext, and its length in the global integer yyleng. The action corresponding to the matched pattern is then executed (a more detailed description of actions follows), and then the remaining input is scanned for another match.

If no match is found, the default rule is executed: the next character in the input is considered matched and copied to the standard output. Thus, the simplest correct lex(1) input is:

%%
which generates a scanner that simply copies its input (one character at a time) to its output.

Note that yytext can be defined in two ways: either as a character pointer or as a character array. You can control which definition lex(1) uses by including one of the special directives %pointer or %array in the first (definitions) section of your lex input. The default is %pointer, unless you use the -l AT&T lex compatibility option, in which case yytext will be an array. The advantage of using %pointer is that it provides substantially faster scanning and no buffer overflow when matching very large tokens (unless you run out of dynamic memory). The disadvantage is that you are restricted in how your actions can modify yytext (see the next section), and calls to the input() and unput() functions destroy the present contents of yytext. This can make moving between different lex(1) versions difficult.

The advantage of %array is that you can modify yytext as much as you want, and calls to input() and unput() do not destroy yytext. Furthermore, existing AT&T lex(1) programs sometimes access yytext externally using declarations of the form:

extern char yytext[];
This definition is erroneous when used with %pointer, but correct for %array.

%array defines yytext to be an array of YYLMAX characters, which defaults to a fairly large value. You can change the size by simply defining YYLMAX to a different value in the first section of your lex(1) input. As mentioned above, with %pointer yytext grows dynamically to accommodate large tokens. Although your %pointer scanner can accommodate very large tokens (such as matching entire blocks of comments), each time the scanner must resize yytext, it also must rescan the entire token from the beginning, so matching such tokens can prove slow. At present, yytext does not dynamically grow if a call to unput() results in too much text being pushed back. Instead, a run-time error results.
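
As an illustration, the following minimal sketch selects %array and enlarges YYLMAX (the value 8192 is an arbitrary assumption) so that the action can safely rewrite yytext in place:

%array
%{
/* need this for the call to toupper() below */
#include <ctype.h>
/* assumed size; overrides the default yytext array size */
#define YYLMAX 8192
%}
%%
[a-z]+	{ yytext[0] = toupper( yytext[0] ); ECHO; }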

ACTIONS

Each pattern in a rule has a corresponding action, which can be any arbitrary C statement. The pattern ends at the first non-escaped white-space character. The remainder of the line is its action. If the action is empty, when the pattern is matched, the input token is simply discarded. For example, here is the specification for a program that deletes all occurrences of "delete this string" from its input:

%%
"delete this string"
(It will copy all other characters in the input to the output since they will be matched by the default rule.)

The following program compresses multiple blanks and tabs down to a single blank, and throws away white space found at the end of a line:

%%
[ \t]+		putchar( ' ' );
[ \t]+$	 /* ignore this token */

If the action contains a '{', the action spans the text up to the point at which the balancing '}' is found; the action can cross multiple lines. The lex(1) utility knows about C strings and comments and will not be misled by braces found within them. It also allows actions to begin with %{ and will consider the action to be all the text up to the next %} (regardless of ordinary braces inside the action).

An action consisting solely of a vertical bar ('|') means "same as the action for the next rule."

Actions can include arbitrary C code, including return statements to return a value to the routine that called yylex(). Each time yylex() is called it continues processing tokens from the point at which it last left off until it either reaches the end of the file or executes a return.

Actions can modify yytext, except for lengthening it (adding characters to its end--these will overwrite later characters in the input stream). Modifying the final character of yytext might alter whether rules anchored with '^' are active when scanning resumes. Specifically, changing the final character of yytext to a newline will activate such rules on the next scan, and changing it to anything else will deactivate the rules. Users should not rely on this behavior being present in future releases. Finally, note that none of this paragraph applies when using %array.

Actions can modify yyleng, except that they should not do so if the action also includes use of yymore().

There are a number of special directives that can be included within an action:

ECHO
Copies yytext to the scanner's output.
BEGIN
Followed by the name of a start condition, places the scanner in the corresponding start condition (see later in this topic).
REJECT
Directs the scanner to proceed on to the "second best" rule that matched the input (or a prefix of the input). The rule is chosen as described above in "How Input is Matched"; yytext and yyleng are set up appropriately. It can either be one that matched as much text as the originally chosen rule but came later in the lex(1) input file, or one that matched less text. For example, the following will count the words in the input and call the routine special() whenever "frob" is seen:
	int word_count = 0;
%%
frob		special(); REJECT;
[^ \t\n]+   ++word_count;
Without the REJECT, any instances of "frob" in the input would not be counted as words, since the scanner normally executes only one action per token. Multiple occurrences of REJECT are allowed, each one finding the next best choice to the currently active rule. For example, when the following scanner scans the token "abcd", it will write "abcdabcaba" to the output:
%%
a		|
ab	 |
abc	|
abcd	 ECHO; REJECT;
.|\n	 /* consume any unmatched character */
(The first three rules share the action of the fourth because they use the special '|' action.) REJECT is a particularly expensive feature in terms of scanner performance; if it is used in any of the scanner's actions, it will slow down all of the scanner's matching. Furthermore, REJECT cannot be used with the -Cf or -CF options (discussed later in this topic).

Note also that unlike the other special actions, REJECT is a branch; code immediately following it in the action will not be executed.

yymore()
Tells the scanner that the next time it matches a rule, the corresponding token should be appended onto the current value of yytext rather than replacing it. For example, given the input "hot-sun" the following will write "hot-hot-sun" to the output:
%%
hot-   ECHO; yymore();
sun   ECHO;
First "hot-" is matched and echoed to the output. Then "sun" is matched, but the previous "hot-" is still hanging around at the beginning of yytext so the ECHO for the "sun" rule will actually write "hot-sun". The presence of yymore() in the scanner's action entails a minor performance penalty in the scanner's matching speed.
yyless(n)
Returns all but the first n characters of the current token back to the input stream, where they will be rescanned when the scanner looks for the next match. yytext and yyleng are adjusted appropriately (for example, yyleng will now be equal to n). On the input "catdog" the following will write out "catdogdog":
%%
catdog	ECHO; yyless(3);
[a-z]+	ECHO;
An argument of 0 to yyless() will cause the entire current input string to be scanned again. Unless you have changed how the scanner will subsequently process its input (using BEGIN, for example), this will result in an endless loop.

Note that yyless is a macro and can only be used in the lex input file, not from other source files.

unput(c)
Puts the character c back onto the input stream. It will be the next character scanned. The following action will take the current token and cause it to be rescanned enclosed in parentheses.
{
int i;
unput( ')' );
for ( i = yyleng - 1; i >= 0; --i )
	unput( yytext[i] );
unput( '(' );
}
Note that since each unput() puts the given character back at the beginning of the input stream, pushing back strings must be done back-to-front. Also note that you cannot put back EOF to attempt to mark the input stream with an end-of-file.
input()
Reads the next character from the input stream. For example, the following is one way to consume C comments:
%%
"/*"		{
			register int c;
			for ( ; ; )
				{
				while ( (c = input()) != '*' &&
						c != EOF )
					; /* consume text of comment */
			if ( c == '*' )
					{
					while ( (c = input()) == '*' )
						;
					if ( c == '/' )
						break; /* found the end */
				}
			if ( c == EOF )
					{
					error( "EOF in comment" );
					break;
				}
			}
		}
(Note that if the scanner is compiled using C++, then input() is instead referred to as yyinput(), in order to avoid a name clash with the C++ stream by the name of input.)
yyterminate()
Can be used instead of a return statement in an action. It terminates the scanner and returns a 0 to the scanner's caller, indicating "all done". By default, yyterminate() is also called when an end-of-file is encountered. It is a macro and can be redefined.

THE GENERATED SCANNER

The output of lex(1) is the file lex.yy.c, which contains the scanning routine yylex(), a number of tables used by it for matching tokens, and a number of auxiliary routines and macros. By default, yylex() is declared as follows:

int yylex()
	{
	... various definitions and the actions are placed here ...
}

(If your environment supports function prototypes, it will be "int yylex( void )".) This definition can be changed by defining the "YY_DECL" macro. For example, you could use:

#define YY_DECL float lexscan( a, b ) float a, b;
to give the scanning routine the name lexscan(), returning a float, and taking two floats as arguments. Note that if you give arguments to the scanning routine using a K&R-style, non-prototyped function declaration, you must terminate the definition with a semicolon (;).
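
If your environment supports function prototypes, no trailing semicolon is needed; for example, a sketch of a prototyped redefinition equivalent to the above (lexscan and its parameters are illustrative names, not part of lex):

#define YY_DECL float lexscan( float a, float b )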

Whenever yylex() is called, it scans tokens from the global input file yyin (which defaults to stdin). It continues until it either reaches an end-of-file (at which point it returns the value 0) or one of its actions executes a return statement.

If the scanner reaches an end-of-file, subsequent calls are undefined unless either yyin is pointed at a new input file (in which case scanning continues from that file), or yyrestart() is called. yyrestart() takes one argument, a FILE * pointer, and initializes yyin for scanning from that file. Essentially there is no difference between just assigning yyin to a new input file or using yyrestart() to do so; the latter is available for compatibility with previous versions of lex(1), and because it can be used to switch input files in the middle of scanning. It can also be used to throw away the current input buffer, by calling it with an argument of yyin.

If yylex() stops scanning due to executing a return statement in one of the actions, the scanner can be called again, and it will resume scanning where it left off.
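
For example, a minimal driver along these lines (a sketch; the file name is illustrative and error checking is omitted) scans standard input to end-of-file and then scans a second file:

main()
	{
	yylex();			/* scan stdin until EOF or a return */
	yyin = fopen( "second.input", "r" );
	yyrestart( yyin );	/* reinitialize the scanner's buffer for the new file */
	yylex();
	}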

By default (and for purposes of efficiency), the scanner uses block-reads rather than simple getc(3) calls to read characters from yyin. The nature of how it gets its input can be controlled by defining the YY_INPUT macro. YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its action is to place up to max_size characters in the character array buf and return in the integer variable result either the number of characters read or the constant YY_NULL (traditionally 0) to indicate EOF. The default YY_INPUT reads from the global file pointer "yyin".

A sample definition of YY_INPUT (in the definitions section of the input file):

%{
#define YY_INPUT(buf,result,max_size) \
	{ \
	int c = getchar(); \
	result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
}
%}

This definition will change the input processing to occur one character at a time.

You can also use this approach to add extra processing, such as keeping track of the input line number, as sketched below. If you do so, however, your scanner might not go very fast.
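
The following sketch counts lines while reading one character at a time (mylineno is a hypothetical global maintained here by YY_INPUT, not something lex provides):

%{
int mylineno = 1;	/* hypothetical line counter, maintained by YY_INPUT */
#define YY_INPUT(buf,result,max_size) \
	{ \
	int c = getchar(); \
	if ( c == '\n' ) \
		++mylineno; \
	result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
	}
%}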

When the scanner receives an end-of-file indication from YY_INPUT, it checks the yywrap() function. If yywrap() returns false (zero), it is assumed that the function has set up yyin to point to another input file, and scanning continues. If it returns true (non-zero), the scanner terminates, returning 0 to its caller.

The default yywrap() always returns 1.
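
As a sketch, a user-supplied yywrap() that steps through a list of file names (filelist is an assumed global set up by main(), in the style of the end-of-file example later in this topic; error checking is omitted) might look like:

char **filelist;	/* assumed global; points at a NULL-terminated list of names */

int yywrap()
	{
	if ( *++filelist )
		{
		yyin = fopen( *filelist, "r" );
		return 0;	/* scanning continues from the new file */
		}
	return 1;		/* no more files; the scanner terminates */
	}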

The scanner writes its ECHO output to the yyout global (default, stdout), which the user may redefine simply by assigning some other FILE pointer to it.
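
For example, ECHO output can be sent to a log file instead (the file name is illustrative):

yyout = fopen( "scanner.log", "w" );
yylex();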

START CONDITIONS

The lex(1) utility provides a mechanism for conditionally activating rules. Any rule whose pattern is prefixed with "<sc>" will be active only when the scanner is in the start condition named "sc". For example,

<STRING>[^"]*		{ /* consume the string body ... */
			...
		}
will be active only when the scanner is in the "STRING" start condition, and
<INITIAL,STRING,QUOTE>\.		{ /* handle an escape ... */
			...
		}
will be active only when the current start condition is either "INITIAL", "STRING", or "QUOTE".

Start conditions are declared in the definitions (first) section of the input using unindented lines beginning with either %s or %x followed by a list of names. The former declares inclusive start conditions, the latter exclusive start conditions. A start condition is activated using the BEGIN action. Until the next BEGIN action is executed, rules with the given start condition will be active, and rules with other start conditions will be inactive. If the start condition is inclusive, rules with no start conditions will also be active. If it is exclusive, only rules qualified with the start condition will be active. A set of rules contingent on the same exclusive start condition describe a scanner that is independent of any of the other rules in the lex(1) input. Because of this, exclusive start conditions make it easy to specify "mini-scanners" which scan portions of the input that are syntactically different from the rest (such as comments).

The following example illustrates the connection between inclusive and exclusive start conditions. The set of rules:

%s example
%%
<example>cat		 /* do something */
is equivalent to
%x example
%%
<INITIAL,example>cat   /* do something */

Also note that the special start-condition specifier <*> matches every start condition. Thus, the above example could also have been written:

%x example
%%
<*>cat   /* do something */

The default rule (to ECHO any unmatched character) remains active in start conditions.

BEGIN(0) returns to the original state where only rules with no start conditions are active. This state can also be referred to as the start-condition "INITIAL", so BEGIN(INITIAL) is equivalent to BEGIN(0). (The parentheses around the start condition name are not required but are considered good style.)

BEGIN actions can also be given as indented code at the beginning of the rules section. For example, the following will cause the scanner to enter the "SPECIAL" start condition whenever yylex() is called and the global variable enter_special is true:
	int enter_special;
%x SPECIAL
%%
		if ( enter_special )
			BEGIN(SPECIAL);
<SPECIAL>something here
...more rules follow...

The following scanner illustrates the uses of start conditions. It gives two different interpretations of a string like "123.456". By default, it will treat it as three tokens: the integer "123", a dot ('.'), and the integer "456". But if the string is preceded earlier in the line by the string "expect-floats", it will treat it as a single token, the floating-point number 123.456:

%{
#include <math.h>
%}
%s expect
%%
expect-floats		BEGIN(expect);
<expect>[0-9]+"."[0-9]+	{
			printf( "found a float, = %f\n",
					atof( yytext ) );
		}
<expect>\n		 {
			/* that's the end of the line, so
			 * we need another "expect-floats"
			 * before we'll recognize any more
			 * numbers
			 */
			BEGIN(INITIAL);
		}
[0-9]+	{
			printf( "found an integer, = %d\n",
					atoi( yytext ) );
		}
"."		 printf( "found a dot\n" );

The following scanner recognizes (and discards) C comments while maintaining a count of the current input line:

%x comment
%%
		int line_num = 1;
"/*"		 BEGIN(comment);
<comment>[^*\n]*		/* consume anything that is not a '*' */
<comment>"*"+[^*/\n]*   /* consume occurrences of '*' not followed by a '/' */
<comment>\n			 ++line_num;
<comment>"*"+"/"		BEGIN(INITIAL);

This scanner will try to match as much text as possible with each rule. In general, when attempting to write a high-speed scanner, it is advisable to try to match as much text as possible in each rule.

Note that start-condition names are really integer values and can be stored as such. Thus, the above could be extended in the following fashion:

%x comment cat
%%
		int line_num = 1;
		int comment_caller;
"/*"		 {
			 comment_caller = INITIAL;
			 BEGIN(comment);
			 }
...
<cat>"/*"	{
			 comment_caller = cat;
			 BEGIN(comment);
			 }
<comment>[^*\n]*		/* consume anything that is not a '*' */
<comment>"*"+[^*/\n]*   /* consume '*'s not followed by '/'s */
<comment>\n			 ++line_num;
<comment>"*"+"/"		BEGIN(comment_caller);

You can also access the current start condition using the integer-valued YY_START macro. For example, the above assignments to comment_caller could instead be written

comment_caller = YY_START;

Note that start conditions do not have their own name space; %s and %x declare names the same as #define does.

The following example illustrates how to match C-style quoted strings using exclusive start conditions, including expanded escape sequences (but not including checking for a string that is too long):

%x str
%%
		char string_buf[MAX_STR_CONST];
		char *string_buf_ptr;
\"	string_buf_ptr = string_buf; BEGIN(str);
<str>\"		{ /* saw closing quote - all done */
		BEGIN(INITIAL);
		*string_buf_ptr = '\0';
		/* return string constant token type and
		 * value to parser
		 */
	}
<str>\n		{
		/* error - unterminated string constant */
		/* generate error message */
	}
<str>\\[0-7]{1,3} {
		/* octal escape sequence */
		int result;
		(void) sscanf( yytext + 1, "%o", &result );
		if ( result > 0xff )
			{
			/* error, constant is out-of-bounds */
			}
		*string_buf_ptr++ = result;
	}
<str>\\[0-9]+ {
		/* generate error - bad escape sequence; something
		 * like '\48' or '\0777777'
		 */
	}
<str>\\n  *string_buf_ptr++ = '\n';
<str>\\t  *string_buf_ptr++ = '\t';
<str>\\r  *string_buf_ptr++ = '\r';
<str>\\b  *string_buf_ptr++ = '\b';
<str>\\f  *string_buf_ptr++ = '\f';
<str>\\(.|\n)  *string_buf_ptr++ = yytext[1];
<str>[^\\\n\"]+		{
		char *yytext_ptr = yytext;
		while ( *yytext_ptr )
				*string_buf_ptr++ = *yytext_ptr++;
	}

MULTIPLE INPUT BUFFERS

Some scanners (such as those that support include files) require reading from several input streams. Because lex(1) scanners do a large amount of buffering, one cannot control where the next input will be read from by simply writing a YY_INPUT that is sensitive to the scanning context. YY_INPUT is only called when the scanner reaches the end of its buffer. This can be a long time after scanning a statement such as an "include" that requires switching the input source.

To alleviate these types of problems, lex(1) provides a mechanism for creating and switching between multiple input buffers. An input buffer is created by using:

YY_BUFFER_STATE yy_create_buffer( FILE *file, int size )
This takes a FILE pointer and a size and creates a buffer associated with the given file that is large enough to hold size characters (when in doubt, use YY_BUF_SIZE for the size). It returns a YY_BUFFER_STATE handle, which can be passed to other routines:
void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer )
switches the scanner's input buffer so subsequent tokens will come from new_buffer. Note that yy_switch_to_buffer() can be used by yywrap() to set things up for continued scanning, instead of opening a new file and pointing yyin at it.
void yy_delete_buffer( YY_BUFFER_STATE buffer )
is used to reclaim the storage associated with a buffer.

yy_new_buffer() is an alias for yy_create_buffer(), and is provided for compatibility with the C++ use of new and delete for creating and destroying dynamic objects.

Finally, the YY_CURRENT_BUFFER macro returns a YY_BUFFER_STATE handle to the current buffer.

The following example illustrates how to use these features to write a scanner that expands include files (the <<EOF>> feature is discussed later in this topic):

/* the "incl" state is used for picking up the name
 * of an include file
 */
%x incl
%{
#define MAX_INCLUDE_DEPTH 10
YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
int include_stack_ptr = 0;
%}
%%
include			 BEGIN(incl);
[a-z]+			ECHO;
[^a-z\n]*\n?		ECHO;
<incl>[ \t]*	/* consume the white space */
<incl>[^ \t\n]+   { /* got the include file name */
		if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
			{
			fprintf( stderr, "Includes nested too deeply" );
			exit( 1 );
		}
		include_stack[include_stack_ptr++] =
			YY_CURRENT_BUFFER;
		yyin = fopen( yytext, "r" );
	if ( ! yyin )
			error( ... );
		yy_switch_to_buffer(
			yy_create_buffer( yyin, YY_BUF_SIZE ) );
		BEGIN(INITIAL);
	}
<<EOF>> {
		if ( --include_stack_ptr < 0 )
			{
			yyterminate();
		}
		else
			{
			yy_delete_buffer( YY_CURRENT_BUFFER );
			yy_switch_to_buffer(
				 include_stack[include_stack_ptr] );
		}
	}

END-OF-FILE RULES

The special rule "<<EOF>>" indicates actions to be taken when an end-of-file is encountered and yywrap() returns non-zero (that is, indicates no further files to process). The action must finish by doing one of four things: assigning yyin to a new input file; executing a return statement; executing the special yyterminate() action; or switching to a new buffer using yy_switch_to_buffer(), as shown in the example in the preceding section.

<<EOF>> rules cannot be used with other patterns; they can only be qualified with a list of start conditions. If an unqualified <<EOF>> rule is given, it applies to all start conditions that do not already have <<EOF>> actions. To specify an <<EOF>> rule for only the initial start condition, use

<INITIAL><<EOF>>

These rules are useful for catching things like unclosed comments, as in the following example:

%x quote
%%
...other rules for dealing with quotes...
<quote><<EOF>>   {
		 error( "unterminated quote" );
		 yyterminate();
		 }
<<EOF>>  {
		 if ( *++filelist )
			 yyin = fopen( *filelist, "r" );
		 else
			yyterminate();
		 }

MISCELLANEOUS MACROS

The macro YY_USER_ACTION can be defined to provide an action that is always executed prior to the action of the matched rule. For example, it could be defined (using #define) to call a routine to convert yytext to lowercase.

The macro YY_USER_INIT can be defined to provide an action that is always executed before the first scan (and before the scanner's internal initializations are done). For example, it could be used to call a routine to read in a data table or open a logging file.
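
A minimal sketch of both macros, placed in the definitions section (the routines fold_token_case() and read_data_table() are hypothetical user-supplied helpers, not part of lex):

%{
#define YY_USER_ACTION fold_token_case();	/* hypothetical; runs before every matched rule's action */
#define YY_USER_INIT read_data_table();		/* hypothetical; runs once before the first scan */
%}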

In the generated scanner, the actions are all gathered in one large switch statement and separated using YY_BREAK, which may be redefined. By default, it is simply a "break", to separate each rule's action from the action of the following rule. Redefining YY_BREAK allows, for example, C++ users to #define YY_BREAK to do nothing (while being very careful that every rule ends with a "break" or a "return") to avoid getting unreachable-statement warnings where, because a rule's action ends with "return", the YY_BREAK is inaccessible.
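
For example, a user who makes certain that every rule's action ends with an explicit "break" or "return" could use a sketch like:

%{
#define YY_BREAK	/* empty; every action below supplies its own break or return */
%}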

INTERFACING WITH YACC

One of the ways to use lex(1) is as a companion to the yacc(1) parser-generator. The yacc(1) parsers expect to call a routine named yylex() to find the next input token. The routine is supposed to return the type of the next token, as well as putting any associated value in the global yylval. To use lex(1) with yacc(1), one specifies the -d option to yacc(1). This instructs yacc(1) to generate the file y.tab.h, which contains definitions of all the %tokens appearing in the yacc(1) input. This file is then included in the lex(1) scanner. For example, if one of the tokens is "TOK_NUMBER", part of the scanner might look like:

%{
#include "y.tab.h"
%}
%%
[0-9]+		yylval = atoi( yytext ); return TOK_NUMBER;

LEX OPTIONS

The lex(1) utility has the following options:

-7
Generate a seven-bit scanner, which can save considerable table space, especially when using -Cf or -CF (and, at most sites, -7 is on by default for these options. To see if this is the case, use the -v verbose flag and check the flag summary it reports).
-8
Generate an eight-bit scanner. This is the default, except for the -Cf and -CF compression options, for which the default is site-dependent, and can be checked by inspecting the flag summary generated by the -v option.
-B
Generate a batch scanner instead of an interactive scanner (see -I later in this discussion). Scanners using the -Cf or -CF compression options also specify this option automatically.
-b
Generate backing-up information to lex.backup. This provides a list of scanner states that require backing up, and the input characters on which they do so. By adding rules, one can remove backing-up states. If all backing-up states are eliminated and either -Cf or -CF is used, the generated scanner will run faster.
-C[aefFmr]
Specify the degree of table compression and scanner optimization.
-Ca
Trade off larger tables in the generated scanner for faster performance because the elements of the tables are better aligned for memory access and computation. This option can double the size of the tables used by your scanner.
-Ce
Construct equivalence classes; that is, sets of characters that have identical lexical properties. Equivalence classes usually give dramatic reductions in the final table/object file sizes (typically a factor of 2-5) and have little impact on performance (one array look-up per character scanned).
-Cf
Generate full scanner tables; lex(1) does not compress the tables by taking advantage of similar transition functions for different states.
-CF
Use the alternate fast-scanner representation (described in lexdoc(1)).
-Cm
Construct meta-equivalence classes, which are sets of equivalence classes (or characters, if equivalence classes are not being used) that are commonly used together. Meta-equivalence classes are often effective when using compressed tables, but they have a moderate impact on performance (one or two "if" tests and one array look-up per character scanned).
-Cr
Bypass using stdio for input in the generated scanner. In general, this option results in a minor performance gain that is worthwhile only in conjunction with -Cf or -CF. It can cause surprising behavior if you use stdio yourself to read from yyin prior to calling the scanner.
-C
Alone, compress scanner tables, but use neither equivalence classes nor meta-equivalence classes.

The options -Cf or -CF and -Cm do not make sense together; there is no opportunity for meta-equivalence classes if the table is not being compressed. Otherwise, the options can be freely mixed.

The default setting is -Cem, which specifies that lex(1) should generate equivalence classes and meta-equivalence classes. This setting provides the highest degree of table compression. You can trade off faster-executing scanners against larger tables, with the following ordering generally holding:

Slowest and smallest
	-Cem
	-Cm
	-Ce
	-C
	-C{f,F}e
	-C{f,F}
	-C{f,F}a
Fastest and largest

-C options are cumulative.

-c
Does nothing; a deprecated option included for POSIX compliance.
NOTE:
In previous releases of lex(1) -c specified table-compression options. This functionality is now given by the -C flag. To ease the impact of this change, when lex(1) encounters -c, it currently issues a warning message and assumes that you wanted -C instead. In the future this "promotion" of -c to -C will be eliminated in the name of full POSIX compliance (unless the POSIX meaning is removed first).
-d
Run the generated scanner in debug mode. Whenever a pattern is recognized and the global yy_flex_debug is non-zero (which is the default), the scanner will write to stderr a line of the form:
--accepting rule at line 53 ("the matched text")
The line number refers to the location of the rule in the file defining the scanner (that is, the file that was provided to lex). Messages are also generated when the scanner backs up, accepts the default rule, reaches the end of its input buffer (or encounters a NUL; as far as the scanner is concerned, the two look identical), or reaches an end-of-file.
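Because yy_flex_debug is an ordinary global, tracing in a scanner built with -d can be switched off (or back on) at run time; for example:
extern int yy_flex_debug;
yy_flex_debug = 0;	/* suppress the --accepting messages before calling yylex() */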
-F
Use fast scanner table representation (and bypass stdio). This representation is about as fast as the full-table representation (-f), and for some sets of patterns will be considerably smaller (and for others, larger). See lexdoc(1) for more details.

This option is equivalent to -CFr (see the -C options described earlier in this topic).

-f
Use the fast scanner representation. No table compression is done and stdio is bypassed. The result is large but fast. This option is equivalent to -Cfr (see the -C options described earlier in this topic).
-h
Generate a "help" summary of lex's(1) options to stderr and then exit.
-I
Generate an interactive scanner, that is, a scanner that stops immediately rather than looking ahead if it knows that the currently scanned text cannot be part of a longer rule's match. This is the opposite of batch scanners (see -B above). See lexdoc(1) for details.

Note that -I cannot be used in conjunction with full or fast tables; that is, the -f, -F, -Cf, or -CF flags. For other table-compression options, -I is the default.

-i
Generate a case-insensitive scanner. The case of letters given in the lex(1) input patterns will be ignored, and tokens in the input will be matched regardless of case. The matched text given in yytext will have the preserved case (that is, it will not be folded).
-L
Do not generate #line directives in lex.yy.c. The default is to generate such directives so error messages in the actions will be correctly located with respect to the original lex(1) input file, and not to the relatively meaningless line numbers of lex.yy.c.
-l
Turn on maximum compatibility with the original AT&T lex implementation, at a considerable performance cost. This option is incompatible with -f, -F, -Cf, and -CF. See lexdoc(1) for details.
-n
Does nothing. Another deprecated option included only for POSIX compliance.
-Pprefix
Change the default yy prefix used by lex(1) to be prefix instead. See lexdoc(1) for a description of all the global variables and file names that this affects.
-p
Generate a performance report to stderr. The report consists of comments regarding features of the lex(1) input file that will cause a loss of performance in the resulting scanner. If you give the flag twice, you will also get comments regarding features that lead to minor performance losses.
-Sskeleton_file
Use skeleton_file to construct the scanner instead of the default file. You will never need this option unless you are performing lex(1) maintenance or development.
-s
Suppress the default rule (that unmatched scanner input is echoed to stdout). If the scanner encounters input that does not match any of its rules, it aborts with an error.
-T
Run in trace mode. It will generate many messages to stderr concerning the form of the input and the resultant non-deterministic and deterministic finite automata. This option is mostly for use in maintaining lex(1).
-t
Write the scanner it generates to standard output instead of lex.yy.c.
-V
Print the version number to stderr and exit.
-v
Write to stderr a summary of statistics regarding the scanner it generates.
-w
Suppress warning messages.

PERFORMANCE CONSIDERATIONS

The main design goal of lex(1) is to generate high-performance scanners. It has been optimized for dealing well with large sets of rules. Aside from the effects on scanner speed of the table-compression -C options outlined above, there are a number of options and actions that degrade performance. Ordered from most to least expensive, they are: REJECT; pattern sets that require backing up; arbitrary trailing context; the '^' beginning-of-line operator; and yymore(). The first three have considerable impact on performance, and the last two have little impact.

Note also that unput() is implemented as a routine call that potentially does quite a bit of work, while the yyless() macro has little impact on performance; so if you are simply putting back some excess text you scanned, use yyless().

Avoid REJECT at all costs when performance is important. It is a particularly expensive option.

Eliminating backing up is difficult and can be an enormous amount of work for a complicated scanner. In principle, one begins by using the -b flag to generate a lex.backup file. For example, on the input

%%
cat		return TOK_KEYWORD;
catdog	 return TOK_KEYWORD;
the file appears as follows:
State #6 is non-accepting -
 associated rule line numbers:
	 2	 3
 out-transitions: [ t ]
 jam-transitions: EOF [ \001-s  u-\177 ]
State #8 is non-accepting -
 associated rule line numbers:
	 3
 out-transitions: [ o ]
 jam-transitions: EOF [ \001-n  p-\177 ]
State #9 is non-accepting -
 associated rule line numbers:
	 3
 out-transitions: [ g ]
 jam-transitions: EOF [ \001-f  h-\177 ]
Compressed tables always back up.

The first few lines indicate that a scanner state exists in which it can make a transition on a 't' but not on any other character. These lines also indicate that, in that state, the currently scanned text does not match any rule. The state occurs when trying to match the rules found at lines 2 and 3 in the input file. If the scanner is in that state and then reads something other than a 't', it will have to back up to find a rule that is matched. It becomes apparent that this must be the state the scanner is in when it has seen "ca". When this has happened, if anything other than a 't' is seen, the scanner will have to back up to simply match the 'c' (by the default rule).

The comment regarding State #8 indicates that there is a problem when "catd" has been scanned. Indeed, on any character other than an 'o', the scanner will have to back up to accept "cat". Similarly, the comment for State #9 concerns when "catdo" has been scanned and a 'g' does not follow.

The final comment is a reminder that it is not useful to remove backing up from the rules unless -Cf or -CF is being used, since there is no gain in performance doing so with compressed scanners.

The way to remove the backing up is to add "error" rules:

%%
cat		 return TOK_KEYWORD;
catdog	return TOK_KEYWORD;
catdo	 |
catd		|
ca		 {
			/* false alarm, not really a keyword */
			return TOK_ID;
		}

Eliminating backing up among a list of keywords can also be done using a "catch-all" rule:

%%
cat		 return TOK_KEYWORD;
catdog	 return TOK_KEYWORD;
[a-z]+	return TOK_ID;
This is usually the best solution when appropriate.

Backing up messages tend to cascade. With a complicated set of rules, it is not uncommon to get hundreds of messages. If one can decipher them, however, it often only takes about a dozen rules to eliminate the backing up. (It is easy, however, to make a mistake and have an error rule accidentally match a valid token. A possible future lex(1) feature might automatically add rules to eliminate backing up).

Variable trailing context (where both the leading and trailing parts do not have a fixed length) entails almost the same performance loss as REJECT (that is, substantial). So when possible a rule like:

%%
mouse|rat/(cat|dog)   run();
is better written:
%%
mouse/cat|dog		 run();
rat/cat|dog		 run();
or as
%%
mouse|rat/cat		 run();
mouse|rat/dog		 run();

Note that here the special '|' action does not have a positive impact on performance, and could even have a negative impact.

A final note regarding performance: dynamically resizing yytext to accommodate huge tokens is a slow process, because it presently requires that the (huge) token be rescanned from the beginning. If performance is vital, you should attempt to match "large" quantities of text but not "huge" quantities, where the cutoff between the two is at about 8K characters per token.

Another area where you can increase a scanner's performance (and one that is easier to implement) arises from the fact that the longer the tokens matched, the faster the scanner will run. This is because with long tokens the processing of most input characters takes place in the (short) inner scanning loop, and does not often have to go through the additional work of setting up the scanning environment (such as yytext) for the action. Recall the scanner for C comments:

%x comment
%%
		int line_num = 1;
"/*"		 BEGIN(comment);
<comment>[^*\n]*
<comment>"*"+[^*/\n]*
<comment>\n			 ++line_num;
<comment>"*"+"/"		BEGIN(INITIAL);

This could be sped up by writing it as:

%x comment
%%
		int line_num = 1;
"/*"		 BEGIN(comment);
<comment>[^*\n]*
<comment>[^*\n]*\n	++line_num;
<comment>"*"+[^*/\n]*
<comment>"*"+[^*/\n]*\n ++line_num;
<comment>"*"+"/"		BEGIN(INITIAL);

Instead of each newline requiring the processing of another action, the recognition of the newlines is "distributed" over the other rules to keep the matched text as long as possible. Note that adding rules does not slow down the scanner. The speed of the scanner is independent of the number of rules or (modulo the considerations given at the beginning of this section) how complicated the rules are with regard to operators such as '*' and '|'.

The following is a final example for speeding up a scanner. It presumes that you want to scan through a file that contains identifiers and keywords, one per line and with no other extraneous characters, recognizing all the keywords. The example provides a logical first approach:

%%
asm	|
auto	 |
break	|
... etc ...
volatile |
while	/* it is a keyword */
.|\n	 /* it is not a keyword */

To eliminate the back-tracking, introduce a catch-all rule:

%%
asm	|
auto	 |
break	|
... etc ...
volatile |
while	/* it is a keyword */
[a-z]+   |
.|\n	 /* it is not a keyword */

If it is certain that there is exactly one word per line, the total number of matches can be reduced by half by merging in the recognition of newlines with that of the other tokens:

%%
asm\n	|
auto\n   |
break\n  |
... etc ...
volatile\n |
while\n  /* it is a keyword */
[a-z]+\n |
.|\n	 /* it is not a keyword */

Caution should be used, as backing up has been reintroduced into the scanner. In particular, while you might know that there will never be any characters in the input stream other than letters or newlines, lex(1) cannot know this; it will plan for possibly needing to back up when it has scanned a token like "auto" and the next character is something other than a newline or a letter. Previously, it would then just match the "auto" rule and be done, but now it has no "auto" rule, only an "auto\n" rule. To eliminate the possibility of backing up, you could either duplicate all rules without final newlines or, because you never expect to encounter such input and therefore do not know how it is classified, you could introduce one more catch-all rule, this one not including a newline:

%%
asm\n	|
auto\n   |
break\n  |
... etc ...
volatile\n |
while\n  /* it is a keyword */
[a-z]+\n |
[a-z]+   |
.|\n	 /* it is not a keyword */
Compiled with -Cf, this is about as fast as one can get a lex(1) scanner to go for this particular problem.

A final note: lex(1) is slow when matching NULs, particularly when a token contains multiple NULs. It is best to write rules that match short amounts of text if you anticipate that the text will often include NULs.

INCOMPATIBILITIES WITH AT&T LEX AND POSIX

This lex(1) is a rewrite of the AT&T lex(1) tool (the two implementations do not share any code, however), with some extensions and incompatibilities, both of which are of concern to those who want to write scanners acceptable to either implementation. The POSIX lex(1) specification is closer to the behavior of this implementation of lex(1) than to the original AT&T lex(1) implementation. Some incompatibilities remain, however, between this lex(1) and POSIX. The intent is that, ultimately, this lex(1) will be fully POSIX-conformant. This section discusses all of the known areas of incompatibility.

The lex(1) -l option turns on maximum compatibility with the original AT&T lex(1) implementation, at the cost of a major loss in the generated scanner's performance. The incompatibilities that can be overcome using the -l option are discussed later in this section.

The lex(1) utility is fully compatible with AT&T lex(1) with the following exceptions:

The following lex(1) features are not included in AT&T lex(1) or the POSIX specification:

The next-to-last feature in the list refers to the fact that, with lex(1), you can put multiple actions on the same line, separated with semicolons, while with AT&T lex(1), the following
cat	handle_cat(); ++num_cats_seen;
is (rather surprisingly) truncated to
cat   handle_cat();

The lex(1) utility does not truncate the action. Actions that are not enclosed in braces are simply terminated at the end of the line.

DIAGNOSTICS

The lex(1) utility produces the following diagnostic messages:

warning, rule cannot be matched
Indicates that the given rule cannot be matched because it follows other rules that will always match the same text as it. For example, in the following "dog" cannot be matched because it comes after an identifier "catch-all" rule:
[a-z]+	got_identifier();
dog	 got_dog();
Using REJECT in a scanner suppresses this warning.
warning, -s option given but default rule can be matched
It is possible (perhaps only in a particular start condition) that the default rule (match any single character) is the only one that will match a particular input. Since -s was given, presumably this is not intended.
reject_used_but_not_detected undefined
This is a compile-time error; the scanner uses REJECT but lex(1) did not find it in the first two sections. Usually, this happens because the reference is implicit (through an #include file, for example). Make an explicit reference to the action in your lex(1) input file. (Note that previously lex(1) supported a %used mechanism for dealing with this problem. Although this feature is still supported, it is now deprecated.)
yymore_used_but_not_detected undefined
This is a compile-time error; the scanner uses yymore() but lex(1) did not find it in the first two sections. Usually this happens because the reference is implicit (through an #include file, for example). Make an explicit reference to the action in your lex(1) input file. (Note that previously lex(1) supported a %used mechanism for handling this problem. Although this feature is still supported, it is now deprecated.)
lex scanner jammed
A scanner compiled with -s has encountered an input string which was not matched by any of its rules. This error can also occur due to internal problems.
token too large, exceeds YYLMAX
Your scanner uses %array, and one of its rules matched a string longer than the YYLMAX constant (8K bytes by default). You can increase the value by defining (using #define) YYLMAX in the definitions section of your lex(1) input.
scanner requires -8 flag to use the character 'x'
Your scanner specification includes recognizing the eight-bit character 'x', you did not specify the -8 flag, and your scanner defaulted to seven-bit because you used the -Cf or -CF table-compression options. See the discussion of the -7 flag for details.
lex scanner push-back overflow
You used unput() to push back so much text that the scanner's buffer could not hold both the pushed-back text and the current token in yytext. Ideally, the scanner should dynamically resize the buffer in this case; at present, however, it does not.
input buffer overflow, can't enlarge buffer because scanner uses REJECT
The scanner was working on matching an extremely large token and needed to expand the input buffer. This does not work with scanners that use REJECT.
fatal lex scanner internal error--end of buffer missed
This can occur in an scanner that is reentered after a long-jump has jumped out (or over) the scanner's activation frame. Before reentering the scanner, use:
yyrestart( yyin );
too many start conditions in <> construct!
You listed more start conditions in a <> construct than exist (you must have listed at least one of them twice).

FILES

See lex(1).

DEFICIENCIES / BUGS

See lex(1).

AUTHOR

Vern Paxson, with the help of many ideas and much inspiration from Van Jacobson. Original version by Jef Poskanzer. The fast table representation is a partial implementation of a design done by Van Jacobson. The implementation was done by Kevin Gong and Vern Paxson.

Thanks to the many lex(1) beta-testers, feedbackers, and contributors, especially Francois Pinard, Casey Leedom, Nelson H.F. Beebe, benson@odi.com, Peter A. Bigot, Keith Bostic, Frederic Brehm, Nick Christopher, Jason Coughlin, Bill Cox, Dave Curtis, Scott David Daniels, Chris G. Demetriou, Mike Donahue, Chuck Doucette, Tom Epperly, Leo Eskin, Chris Faylor, Jon Forrest, Kaveh R. Ghazi, Eric Goldman, Ulrich Grepel, Jan Hajic, Jarkko Hietaniemi, Eric Hughes, John Interrante, Ceriel Jacobs, Jeffrey R. Jones, Henry Juengst, Amir Katz, ken@ken.hilco.com, Kevin B. Kenny, Marq Kole, Ronald Lamprecht, Greg Lee, Craig Leres, John Levine, Steve Liddle, Mohamed el Lozy, Brian Madsen, Chris Metcalf, Luke Mewburn, Jim Meyering, G.T. Nicol, Landon Noll, Marc Nozell, Richard Ohnemus, Sven Panne, Roland Pesch, Walter Pelissero, Gaumond Pierre, Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Frederic Raimbault, Rick Richardson, Kevin Rodgers, Jim Roskind, Doug Schmidt, Philippe Schnoebelen, Andreas Schwab, Alex Siegel, Mike Stump, Paul Stuart, Dave Tallman, Chris Thewalt, Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams, Ken Yap, Nathan Zelle, David Zuhn, and those whose names have slipped my marginal mail-archiving skills but whose contributions are appreciated all the same.

Thanks to Keith Bostic, Jon Forrest, Noah Friedman, John Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T. Nicol, Francois Pinard, Rich Salz, and Richard Stallman for help with various distribution headaches.

Thanks to Esmond Pitt and Earle Horton for 8-bit character support; to Benson Margulies and Fred Burke for C++ support; to Kent Williams and Tom Epperly for C++ class support; to Ove Ewerlid for support of NUL's; and to Eric Hughes for support of multiple buffers.

This work was primarily done when I was with the Real Time Systems Group at the Lawrence Berkeley Laboratory in Berkeley, CA. Many thanks to all there for the support I received.

Send comments to:

Vern Paxson
Systems Engineering
Bldg. 46A, Room 1123
Lawrence Berkeley Laboratory
University of California
Berkeley, CA 94720

vern@ee.lbl.gov

SEE ALSO

lex(1)

yacc(1)

sed(1)

awk(1)

M. E. Lesk and E. Schmidt, LEX - Lexical Analyzer Generator