Did you know ... | Search Documentation: |
sql_parser.pl -- SQL Parser |
This module contains an SQL parser
Parsing is invoked with sql_parse(+Term, -TrailingComments, +Options, +Tokens)
. Notice that all terms are bound when the predicate is called: you must direct the parser where to start. For a view definition, an example invocation might be
sql_tokens(Tokens, "CREATE VIEW foo AS SELECT bar FROM qux", [])
, sql_parse(view_definition(Definition, Types), TrailingComments, [], Tokens)
.
---++ Comments
Because comments can appear literally anywhere in the input text, every parse node has both a syntax element (such as view_definition/2) and a list of comments which preceed the element. This means comments are pushed as far as possible down the syntax tree. Any transformations of the input with the intention that it should be printed out again need to take the comments into account. Any other uses of the parse tree may pass it to strip_sql_comments(+InTree, -OutTree)
to simply remove them all, leaving the tree with just the syntactic elements.
Finally, there may be trailing comments at the end of the input which are not followed by any token. This means they're not absorbed into the parse tree - so that they're not lost, they are returned as a list from sql_parse/4. ---++ Options Current options include:
dbms(+DBMS)
: If omitted, DBMS will be set to 'Microsoft SQL Server'. For the most part only 'Microsoft SQL Server' views are parseable, but it would not be difficult to extend the parser if this was ever required (just search the source for dbms('Microsoft SQL Server')
conditionals)view_name(+Name)
: Passed to gripe/6 so that complaints about view syntax can be associated with a view nameInternally used options include (these should not be passed in under normal circumstances)
query_id(-Qid)
: Used to logically separate distinct parts of the query
The parse tree returned can be very complicated. The best documentation for this is probably either the sql_write or the sql_check module, which take the tree as an input and do processing on it.
---++ Type inference
Type inference makes the parser take almost 4 times longer, but the resulting information is very useful. It is rarely possible to tell as the input is read what the type of each element is. Where possible, the types are defined (for example, the type of count(*)
is always native_type(int)
) but where the type is unknown, a new variable is created and a constraint is made.
Type inference is done with CHR, and types are in one of three states: 1 Known, and bound (ie committed) 2 Unknown with one unresolved dependency 3 Unknown with two unresolved dependencies
A dependency here refers to something which would influence the eventual type. Some examples of the slightly more complicated case 2:
scalar(_)
and replaces it with the single element it contains.Some examples of case 3:
The grammar started out as an EBNF format, and is based roughly on http://savage.net.au/SQL/sql-92.bnf.html {}/1 are escaped Prolog, like in a DCG [...] denote optional clauses | denotes options @foo matches the token foo (case-insensitive matching is employed) #Foo matches the next token with Foo #Foo:Type matches the next token with Foo if it is a literal of type Type
SQL Server has some very complicated rules for inferring the type of decimal arithmetic (see http://msdn.microsoft.com/en-us/library/ms190476). The crucial, yet sadly missing information from that page deals with overflows. This is half-explained at http://blogs.msdn.com/b/sqlprogrammability/archive/2006/03/29/564110.aspx.
Because we have truncation, the order of operations is crucial: Although (x * y) / z is mathematically equivalent to x * (y / z), the types of the two expressions in SQL Server are actually different due to truncation. The parser is LL, but this means we will always read x * y / z as x * (y / z), whereas SQL Server does the type inference in reverse. This is only a problem for division and multiplication since the handling of addition and subtraction are symmetric, but without a transformation, we will compute the wrong type. After a term/2 is parsed, left_factor_types/3 is called, which translates just the types in the term from LL into LR form.
% It is not practical to determine what + means ahead of time if the source view is MS T-SQL. We would have to guess and backtrack if wrong, and that is horribly inefficient. Instead if we read + in 'Microsoft SQL Server' mode, we should delay determining whether it is really + or actually || until the types of the LHS and RHS are resolved.
The following predicates are exported, but not or incorrectly documented.