November 2002

YAY (YAY's another YACC)

Introduction

    YAY is an incremental parser specification language. It is much like Lex and Yacc, but simpler in terms of semantics. The simplicity comes from the inclusion of "special forms" that better match human-understandable language patterns. YAY is a little more verbose than Yacc due to it's incremental nature, I am sure you will find the semantics much easier to understand.

    Documentation is not complete yet, please bear with me.

Terminology

    It is assumed that some formal language theory is known. I define the terminology I use in the rest of the YAY pages here to ensure clarity.

    YAY documentation is usually phrased in terms of a parsing engine acting on some input and producing an output.

    • Tokens - are the objects that are manipulated by machines. Only context can determine if the term "token" refers to an input object or an output object, Symbols and variables are specific types of tokens.
    • Symbols - are strictly input tokens to a machine, these are usually raw characters.
    • Variables - are compound tokens produced by a machine.

    • Parser - is a machine that takes symbols and produces tokens. Any parser is a machine.
    • Machine - a cascade of parsers (possibly one), resulting in another parser.

    • Pattern - a specification for matching a series of symbols.
    • Production - defines what type of token is produced for a given set of symbols. Each production is a set of patterns (grammar rules) that can be inherited and extended. Tokens are instances of productions.

Overview

    YAY can declare parsers (and lexxers). Usually these are cascaded so that the tokens produced by one are delivered as symbols to another. Having different parsers work on different parts of the parsing problem allow the reuse of parsers in other machines.

The Built-in Parser

    YAY provides a built-in parser that is used for the input of all machines. This means that no machine will ever deal with characters directly, but rather with the slightly processed tokens of this built-in parser. The built-in parser simply categorizes characters into logical groups. Here is a list of token types (productions) that are delivered by the built-in parser.

    Name  
    ASCII
    Latin-1
    Decimal
    Hex
    Decimal
    Hex
    CapitalLetter   65..90 0x41..0x5A 192..214, 216..222 0xC0..0xD6, 0xD8..0xDE
    SmallLetter   97..122 0x61..0x7A 223..246, 248..255 0xDF..0xF6, 0xF8..0xFF
    Letter   = CapitalLetter | SmallLetter = CapitalLetter | SmallLetter
    Digit   48..57 0x30..0x39 48..57 0x30..0x39
    Glyph   33..126 0x21..0x7E 33..255 0x30..0xFF
    White   9, 10, 13, 32 0x9, 0xA, 0xD, 0x20 9, 10, 13, 32 0x9, 0xA, 0xD, 0x20
    Control   0..31, 127 0x00..0x1F, 0x7F 0..31, 127, 128, 129 0x00..0x1F, 0x7F, 0x80, 0x81
    Character   0..127 0x00..0x7F 0..255 0x00..0xFF

A Simple Parser (A Lexxer)

    We will ignore the necessary prefix and suffix for a full parser specification right now, and focus on the production definitions for the parser:

      PATTERN Word <Value AS List(Type=WordChar, Min=1)>;
      PATTERN WhiteSpace List(Type=White, Min=1);

      PATTERN WordChar;
      PATTERN "_" EXTENDS WordChar;
      PATTERN Digit EXTENDS WordChar;
      PATTERN Letter EXTENDS WordChar;

    Productions are defined using the PATTERN keyword (excuse the confusion). Each production can be given a name and extended with patterns with subsequent PATTERN declarations. The most recent declaration (furthest down) is tested first against the input symbols, and a match results in the creation of a token for that production. This parser will group consecutive WordChar into a single Word, and similarly with WhiteSpace.

Patterns

    Patterns are defined within the curly braces the PATTERN keyword. A pattern matches a series of symbols, of which there are four major types:

    Literal Pattern

      A literal is a string delimited by double quotes. The literal must be matched exactly, character for character on the input stream.

        PATTERN FieldKeyword "FIELD";

      The FieldKeyword production expects a sequence of tokens that spell "FIELD".

    Possible Patterns

      If a pattern is meant to be one of many possible patterns we would usually use the EXTENDS keyword. We can shorten the pattern declaration when the constituent patterns are relatively simple. Here is how WordChar would "usually" be declared:

        PATTERN WordChar;
        PATTERN "_" EXTENDS WordChar;
        PATTERN <Digit> EXTENDS WordChar;
        PATTERN <Letter> EXTENDS WordChar;

      We can simplify this to:

        PATTERN WordChar "_" | <Digit> | <Letter>;

      Of course we loose the ability to name each constituent pattern

    In-Line Sub Pattern

      A sub pattern can be defined in-line, but it does not get a name and can make the pattern harder to read. An anonymous sub pattern can be used wherever a pattern is expected.

      Here we are explicit about the allowed name of a field; it must be a list of WordChars:

        PATTERN FieldDeclaration "FIELD" <Name AS List(Type=WordChar, Min=1)>;

      In this next example, words can start with "+", followed by a list of at least one WordChars:

        PATTERN Word "+" | WordChar List(Type=WordChar, Min=1);

    Named Subpattern

      A named subpattern is usually used when declaring compound patterns. Giving a name to each also allows reference to these subpatterns during semantic checks and higher level processing.

        PATTERN TypeDeclaration <Name AS Word> "AS" <Type AS Word> ";";

      Every TypeDeclaration production will expect two words separated by "AS". Should a match be made, the generated token will have two variables, called "Name" and "Type", that contain what was matched to each.

    Anonymous Subpattern

      Many times we do not want to distinguish compound pattern components, and we use anonymous pattern declarations instead

           PATTERN TypeDeclaration <Word> "AS" <Word>;

      Here, the TypeDeclaration is simply a string conforming to a certain pattern. It has no parameters to distinguish the prefix Word from the suffix Word.

    Variable Subpattern

      Many languages have redundant aspects to them. This means that some sequence of characters are expected more than once in the lexical stream. <example>For example, XML requires the prefix and suffix tags to be identical, except for extra slash.</example>.

        PATTERN SimpleXMLTag "<" <TagName AS Word> ">" XML "</" <TagName> ">";

      Notice that TagName was used a second time. This syntax demands that there is a named subpattern, defined leftward, that matches in name. The similarity to the anonymous subpattern is deliberate; you can understand TagName as a pattern declaration itself. During token processing, the named subpattern must match the input tokens, and any subsequent variable subpatterns with the same name must match the same literal value.

      The variable subpattern allows us to reduce our semantic checks significantly.

Special Forms

    A special form is a parametric description of a common language pattern.

    List Special Form

      List matches many similar items in a row (replaces *, +, ., ?, (n), (n,), (n,m) of YACC)

      Parameter Required Description
      Type
      Y
      The type of the elements of the list
      Separator
      N
      The expected separator
      End
      N
      What terminates the list parsing
      Min
      N
      Minimum number of matches
      Max
      N
      Maximum number of matches

      Example (Strings)

      The String example shows that a String is a list of StringChar delimited by quotes. StringChars can consist of double characters indicating special escape sequences.

        PATTERN String "\"" <Value AS List(Type=StringChar, End="\"")> "\"";

        PATTERN StringChar;
        PATTERN " " EXTENDS StringChar;
        PATTERN Glyph EXTENDS StringChar;
        PATTERN "\\" "\\" EXTENDS StringChar;
        PATTERN "\\" "\"" EXTENDS StringChar;
        PATTERN "\\" "n" EXTENDS StringChar;
        PATTERN "\\" "t" EXTENDS StringChar;
                

      Remember that the Glyph production is provided by the built-in parser, so is used here to succinctly refer to majority of valid string characters. Also notice that the End parameter allows us to forget checking that a non-escaped quote is not allowed in the StringChar production.

        "\n"    = newline, as intended in this example
        "\\n"   = "\n" (backslash, then 'n')
        "\n\"   is invalid because it has no closing quote
        "n\\"   = "n\" ('n' then backslash)
        "\q"    = "\q"
        because the backslash is only used to escape certain characters

    Operator Special Form

      The Operator special form is usually used for expression parsing, and specifies infix syntax only. It must always EXTEND an existing pattern, and it is assumed the operands are of that same pattern.

      The Operator special form provides clear precedence rules, operator symbols, and associativity. YAY's incremental nature allows new operators to be added easily. Operator replaces the need for %left, %right, %nonassoc, %prec statements in YACC

        Parameter Required Description
        Symbol
        Y
        The symbol used for this operator
        Associate
        Y
        Set the associatively of the operator
        Message
        N
        Define the KMAL message (DBOS)

      Example (Expressions)

        PATTERN Expression;
        PATTERN <FieldName AS Word> EXTENDS Expression;
        PATTERN Integer EXTENDS Expression;
        PATTERN "(" Expression ")" EXTENDS Expression;

        PATTERN Select Operator(Symbol=".", Associate=Left) EXTENDS Expression;
        PATTERN Compare Operator(Symbol=Equal, Associate=None) EXTENDS Expression;
        PATTERN Assign Operator(Symbol="=", Associate=None) EXTENDS Expression;

      "(a).(b.25.e)==h.y" is allowed by this production, only later semantic checks can distinguish errors.

My Notes:

<e-kyle> each PARSER is given an input (either text, or another parsers output) and will have a function called getToken() that will get the next token
<e-kyle> "match" refers to the productions that are allowed to be matched to the input (other productions are sub-productions).
<e-kyle> "export" refers to tokens that are exported. Tokens that are matched, but not exported means the PARSER will continue matching tokens until it finds one to export.
<e-kyle> You can see the KMALLexxer will skip WhiteSpace and Comments.