Did you know ... Search Documentation:
sql_parser.pl -- SQL Parser
PublicShow source

This module contains an SQL parser

sql_parse/4

Parsing is invoked with sql_parse(+Term, -TrailingComments, +Options, +Tokens). Notice that all terms are bound when the predicate is called: you must direct the parser where to start. For a view definition, an example invocation might be sql_tokens(Tokens, "CREATE VIEW foo AS SELECT bar FROM qux", []), sql_parse(view_definition(Definition, Types), TrailingComments, [], Tokens). ---++ Comments Because comments can appear literally anywhere in the input text, every parse node has both a syntax element (such as view_definition/2) and a list of comments which preceed the element. This means comments are pushed as far as possible down the syntax tree. Any transformations of the input with the intention that it should be printed out again need to take the comments into account. Any other uses of the parse tree may pass it to strip_sql_comments(+InTree, -OutTree) to simply remove them all, leaving the tree with just the syntactic elements.

Finally, there may be trailing comments at the end of the input which are not followed by any token. This means they're not absorbed into the parse tree - so that they're not lost, they are returned as a list from sql_parse/4. ---++ Options Current options include:

  • dbms(+DBMS): If omitted, DBMS will be set to 'Microsoft SQL Server'. For the most part only 'Microsoft SQL Server' views are parseable, but it would not be difficult to extend the parser if this was ever required (just search the source for dbms('Microsoft SQL Server') conditionals)
  • view_name(+Name): Passed to gripe/6 so that complaints about view syntax can be associated with a view name

Internally used options include (these should not be passed in under normal circumstances)

  • query_id(-Qid): Used to logically separate distinct parts of the query
  • subquery: Used to flag whether currently in the top-level query or a subquery

Parse tree

The parse tree returned can be very complicated. The best documentation for this is probably either the sql_write or the sql_check module, which take the tree as an input and do processing on it. ---++ Type inference Type inference makes the parser take almost 4 times longer, but the resulting information is very useful. It is rarely possible to tell as the input is read what the type of each element is. Where possible, the types are defined (for example, the type of count(*) is always native_type(int)) but where the type is unknown, a new variable is created and a constraint is made.

Type inference is done with CHR, and types are in one of three states: 1 Known, and bound (ie committed) 2 Unknown with one unresolved dependency 3 Unknown with two unresolved dependencies

A dependency here refers to something which would influence the eventual type. Some examples of the slightly more complicated case 2:

  • The type of a column selected from a table that we have not yet resolved
    • Consider SELECT foo FROM bar: Until we read the FROM clause we cannot begin to guess what the type of foo is, even though it has only one dependency
  • The type of a scalar subquery
    • SELECT (SELECT TOP 1 foo FROM bar) AS q: In this case, the type of q is not actually a subquery, it is the single element that subquery returns. A constraint-handling rule checks for a type constraint of type scalar(_) and replaces it with the single element it contains.
  • The type of an set operation (ie aggregation)
    • SELECT SUM(foo) AS q FROM bar: The type of q may be coerced to a decimal, depending on the eventual type of foo, which is not known until we have read the FROM clause

Some examples of case 3:

  • The type of a column which is the union of two selects
  • The type of an arithmetic expression

Internal Details

Syntax of the grammar

The grammar started out as an EBNF format, and is based roughly on http://savage.net.au/SQL/sql-92.bnf.html {}/1 are escaped Prolog, like in a DCG [...] denote optional clauses | denotes options @foo matches the token foo (case-insensitive matching is employed) #Foo matches the next token with Foo #Foo:Type matches the next token with Foo if it is a literal of type Type

Left factoring types

SQL Server has some very complicated rules for inferring the type of decimal arithmetic (see http://msdn.microsoft.com/en-us/library/ms190476). The crucial, yet sadly missing information from that page deals with overflows. This is half-explained at http://blogs.msdn.com/b/sqlprogrammability/archive/2006/03/29/564110.aspx.

Because we have truncation, the order of operations is crucial: Although (x * y) / z is mathematically equivalent to x * (y / z), the types of the two expressions in SQL Server are actually different due to truncation. The parser is LL, but this means we will always read x * y / z as x * (y / z), whereas SQL Server does the type inference in reverse. This is only a problem for division and multiplication since the handling of addition and subtraction are symmetric, but without a transformation, we will compute the wrong type. After a term/2 is parsed, left_factor_types/3 is called, which translates just the types in the term from LL into LR form.

Uses

  • Automatic determination of the view interface from the view SQL. These are prone to bit rot:
    • Someone changes the SQL but forgets to change the interface
    • Someone changes the underlying type of a table or view which is directly or indirectly referenced by the view. It's a burden to find all views which might reference the table which has been changes
  • Determination of view hierarchies
    • This can be used to determine the order in which views should be refreshed or replaced/installed from metadata
    • Can also be used with the load dependency analysis
  • Determination of indices
    • Parsing the view allows us to build a set of disjunctions used in WHERE clauses
  • Sanity checking
    • Some of the views contain some very, very weird things. Currently there is no oversight or review. If gripes are enabled at compile-time, the quality of code in views can be brought up to a higher standard
    • Some views may contain extremely inefficient logical structures. If the parse-tree can be suitably analysed, more efficient equivalent queries can be automatically generated. In any case, uses of patterns which are known to be inefficient can be reported at compile-time
    • DBMS indepdence
      • Transforming the views is relatively simple once we have the tree isolated. This can allow us to customize views to take advantage of features in later version of SQL Server while not alienating clients using older versions
      • If necessary, we can translate views to run on other DBMS, such as Oracle, 'PostgreSQL' or MySQL

Known problems

% It is not practical to determine what + means ahead of time if the source view is MS T-SQL. We would have to guess and backtrack if wrong, and that is horribly inefficient. Instead if we read + in 'Microsoft SQL Server' mode, we should delay determining whether it is really + or actually || until the types of the LHS and RHS are resolved.

Undocumented predicates

The following predicates are exported, but not or incorrectly documented.

Source sql_parse(Arg1, Arg2, Arg3, Arg4)
Source sql_gripe_level(Arg1)
Source strip_sql_comments(Arg1, Arg2)