Browse Prior Art Database

Smalltalk Tokenizer and Backus-Naur Form Parser

IP.com Disclosure Number: IPCOM000114377D
Original Publication Date: 1994-Dec-01
Included in the Prior Art Database: 2005-Mar-28
Document File: 4 page(s) / 156K

Publishing Venue

IBM

Related People

Gover, PM: AUTHOR

Abstract

Many programming problems involve reading data streams with a fixed formal syntax: for example, compiling programming languages, evaluating expressions typed by the user, processing control cards and configuration data. Programmers can write such programs from first principles, or they may use "program generators" to write them. Some systems provide functions for this (for example, UNIX provides "lex" and "yacc") but others do not.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 31% of the total text.

Smalltalk Tokenizer and Backus-Naur Form Parser

      Many programming problems involve reading data streams with a
fixed formal syntax: for example, compiling programming languages,
evaluating expressions typed by the user, processing control cards
and configuration data.  Programmers can write such programs from
first principles, or they may use "program generators" to write them.
Some systems provide functions for this (for example, UNIX provides
"lex" and "yacc") but others do not.

      The programming tool described below provides these functions
for any Smalltalk environment.  It is a combination of a text string
tokenizer ("TokenStream") and a Backus-Naur Form (BNF) syntax parser
(called "BNF") which can process a token stream, for use in Smalltalk
systems.  Both the tokenizer and the parser are single Smalltalk
class definitions, with all their function specified as class
methods.

      The user specifies the syntax for the parse, and the processing
for the parse results, in Smalltalk code, integrating it in the
Smalltalk environment in the usual way, and avoiding any separate
compilation.  Systems such as lex and yacc require the user to
specify the syntax using an external file containing statements
written in a special programming language, and require a separate
compilation step.

      lex is much more powerful than TokenStream, but TokenStream is
correspondingly simpler and easier to use.  BNF is simpler to use but
less powerful than yacc.  Lo, TokenStream:  The tokenizer reads a
character stream and creates an equivalent stream of tokens.  Tokens
are identifiers, numbers, character strings, individual special
characters or sequences of special characters chosen by the
programmer.  For example:
    If (1>=month or 12<=month) then
      Call error;
  with >= and <= as special sequences, produces tokens
    If, (, 1, >=, month, or, 12, <=, month, ), then, Call, error, ;.

      The programmer may override the definitions of the characters
in identifiers and numbers.  TokenStream has a similar function to
the UNIX lex command, but without separate compilation; however, it
does not have a powerful language for defining tokens, and it does
not ascribe types to the tokens produced.

      BNF: The parser processes a TokenStream of the input and a list
of reserved words.

      The programmer must define each of the syntax nodes in the
grammar as a subclass of BNF.  (The hierarchy of these subclasses
does not need to copy the hierarchy in the grammar.)  The programmer
defines a class method which returns a Smalltalk array containing the
syntax, and the name of a class method to handle the result of a
successful parse.  The default method creates a new node in the
syntax tree, as in the following example.
  The BNF syntax of:
     TelephoneNumber ::= Name ':' Number '-' Number Number
  parses a TokenStream from
     King: 071-123 4567
  producing an ordered collection:
...