Method of Creating Efficient Lexical Analyzers in C

IP.com Disclosure Number: IPCOM000117341D
Original Publication Date: 1996-Feb-01
Included in the Prior Art Database: 2005-Mar-31
Document File: 4 page(s) / 82K

Publishing Venue

IBM

Related People

Feriozi, DT: AUTHOR

Abstract

Disclosed is a software implementation that provides a new algorithm for lexical analysis that appears to be significantly more efficient than the prior art. The efficiency is achieved primarily through a reduction in generality of previous methods, coupled with an exploitation of features that are specific to the C programming language.

The traditional state machine model of lexical analysis works as
follows (a sketch in C appears after the list):
  1.  Use the current state and the current input to look up the next
       state in the state table.
  2.  Set the current state to the next state.
  3.  Get the next input character.
  4.  Go to (1)
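
      As a rough illustration, the loop below sketches this
table-driven model in C.  The table contents, the state count, and
the name run_state_machine are hypothetical, not part of the
disclosure; a real analyzer would also track accepting states and
token boundaries.

  #define NUM_STATES 4      /* hypothetical; real analyzers need more */
  #define ALPHABET   128    /* base ASCII; input assumed 7-bit */

  /* next[s][c] holds the successor of state s on input character c.
   * Every input character costs one table lookup plus one store into
   * the state variable; these are the overheads discussed below. */
  static int next[NUM_STATES][ALPHABET];

  int run_state_machine(const char *input)
  {
      int state = 0;                                  /* start state */
      while (*input != '\0') {
          state = next[state][(unsigned char)*input]; /* steps 1, 2 */
          input++;                                    /* steps 3, 4 */
      }
      return state;
  }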

      The primary bottleneck here is that a table lookup must be
performed for every input character.  Changing the state variable is
a minor overhead, but one that can be avoided.  This method also
incurs a size penalty: depending on how many states are required, the
table space can be very large (a hypothetical analyzer with 500
states over a 128-character alphabet, for instance, needs 500 x 128 =
64,000 table entries).  The advantage of the state machine model is
that it is quite general.

      A more efficient lexical analyzer can be constructed if some of
the unnecessary generality of the state machine model is removed.
The first thing to recognize is that in most cases the first
character of an input token narrows the choice of lexemes from all
possibilities down to just a few.  This means that a model of lexical
analysis that centers on identifying the first character will tend to
be more efficient, because it prunes the search tree considerably
with a single comparison.  Subsequent analysis can then be tuned to
distinguish only the few remaining possibilities.

      The significance of the first character of the token can be
exploited by using a large case statement whose elements consist of
every character in the input alphabet.  As a concrete example,
consider the base ASCII character set used on personal computers,
which consists of 128 characters numbered from 0 to 127.  In C, a
switch statement containing all cases from 0 to 127 can be used, and
will be optimized into a single jump statement by the compiler.  This
is similar to ad hoc hand coding methods...
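
      A minimal sketch of this first-character dispatch is given
below, assuming a small illustrative token set.  The token names and
the helpers scan_identifier and scan_number are hypothetical, and
only a few representative case labels are written out; the full
switch described above would enumerate all 128 characters.

  #include <ctype.h>

  enum token { TOK_IDENT, TOK_NUMBER, TOK_PLUS, TOK_ERROR, TOK_EOF };

  /* Hypothetical helpers: the first character has already fixed the
   * token class, so each scanner distinguishes only the few
   * remaining possibilities within that class. */
  static enum token scan_identifier(const char **p)
  {
      while (isalnum((unsigned char)**p) || **p == '_')
          (*p)++;
      return TOK_IDENT;
  }

  static enum token scan_number(const char **p)
  {
      while (isdigit((unsigned char)**p))
          (*p)++;
      return TOK_NUMBER;
  }

  enum token next_token(const char **p)
  {
      /* Dispatch on the first character; a switch covering all 128
       * cases lets the compiler emit a single jump. */
      switch (**p) {
      case '\0':
          return TOK_EOF;
      case '_':
      case 'a': case 'b': case 'c':   /* ... remaining letters ... */
          return scan_identifier(p);
      case '0': case '1': case '2':   /* ... through '9' ... */
          return scan_number(p);
      case '+':
          (*p)++;
          return TOK_PLUS;
      default:
          (*p)++;
          return TOK_ERROR;
      }
  }

Note that no per-character table lookup or state assignment remains:
the single dispatch fixes the token class, and each class-specific
loop touches every subsequent character exactly once.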