This library allows you to parse SGML, XML and HTML data into a Prolog
data structure. The library defines several families of predicates:
- High-level predicates
-
Most users will only use load_html/3, load_xml/3 or load_sgml/3 to
parse arbitrary input into a DOM structure. These predicates all
call load_structure/3, which provides more options and may be
used for processing non-standard documents.
The DOM structure can be used by library(xpath) to extract information
from the document.
- The low-level parser
-
The actual parser is written in C and consists of two parts: one for
processing DTD (Document Type Definitions) and one for parsing data.
The data can either be parsed to a Prolog (DOM) term or the parser
can perform callbacks for the DOM events.
- Utility predicates
-
Finally, this library provides prmitives for classifying characters
and strings according to the XML specification such as xml_name/1 to
verify whether an atom is a valid XML name (identifier). It also
provides primitives to quote attributes and CDATA elements.
- dtd(+Type, -DTD) is det
- DTD is a DTD object created from the file
dtd(Type)
. Loaded DTD
objects are cached. Note that DTD objects may not be shared
between threads. Therefore, dtd/2 maintains the pool of DTD
objects using a thread_local predicate. DTD objects are
destroyed if a thread terminates.
- Errors
- -
existence_error(source_sink, dtd(Type))
- load_dtd(+DTD, +DtdFile, +Options)
- Load DtdFile into a DTD. Defined options are:
- dialect(+Dialect)
- Dialect to use (xml, xmlns, sgml)
- encoding(+Encoding)
- Encoding of DTD file
- Arguments:
-
DTD | - is a fresh DTD object, normally created using
new_dtd/1. |
- load_structure(+Source, -ListOfContent, :Options) is det
- Parse Source and return the resulting structure in
ListOfContent. Source is handed to open_any/5, which allows for
processing an extensible set of input sources.
A proper XML document contains only a single toplevel element
whose name matches the document type. Nevertheless, a list is
returned for consistency with the representation of element
content.
The encoding(+Encoding)
option is treated special for
compatibility reasons:
- If Encoding is one of
iso-8859-1
, us-ascii
or utf-8
,
the stream is opened in binary mode and the option is passed
to the SGML parser.
- If Encoding is present, but not one of the above, the
stream is opened in text mode using the given encoding.
- Otherwise (no Encoding), the stream is opened in binary
mode and doing the correct decoding is left to the parser.
- load_sgml_file(+File, -DOM) is det
- Load SGML from File and unify the resulting DOM structure with
DOM.
- deprecated
- - New code should use load_sgml/3.
- load_xml_file(+File, -DOM) is det
- Load XML from File and unify the resulting DOM structure with
DOM.
- deprecated
- - New code should use load_xml/3.
- load_html_file(+File, -DOM) is det
- Load HTML from File and unify the resulting DOM structure with
DOM.
- deprecated
- - New code should use load_html/3.
- load_html(+Input, -DOM, +Options) is det
- Load HTML text from Input and unify the resulting DOM structure
with DOM. Options are passed to load_structure/3, after adding
the following default options:
- dtd(DTD)
- Pass the DTD for HTML as obtained using
dtd(html, DTD)
.
- dialect(Dialect)
- Current dialect from the Prolog flag
html_dialect
- max_errors(-1)
- syntax_errors(quiet)
- Most HTML encountered in the wild contains errors. Even in the
context of errors, the resulting DOM term is often a
reasonable guess at the intent of the author.
You may also want to use the library(http/http_open) to support
loading from HTTP and HTTPS URLs. For example:
:- use_module(library(http/http_open)).
:- use_module(library(sgml)).
load_html_url(URL, DOM) :-
load_html(URL, DOM, []).
- load_xml(+Input, -DOM, +Options) is det
- Load XML text from Input and unify the resulting DOM structure
with DOM. Options are passed to load_structure/3, after adding
the following default options:
- load_sgml(+Input, -DOM, +Options) is det
- Load SGML text from Input and unify the resulting DOM structure
with DOM. Options are passed to load_structure/3, after adding
the following default options:
- xml_quote_attribute(+In, -Quoted) is det
- xml_quote_cdata(+In, -Quoted) is det
- Backward compatibility for versions that allow to specify
encoding. All characters that cannot fit the encoding are mapped
to XML character entities (&#dd;). Using ASCII is the safest
value.
- xml_name(+Atom) is semidet
- True if Atom is a valid XML name.
- xml_basechar(+CodeOrChar) is semidet
- xml_ideographic(+CodeOrChar) is semidet
- xml_combining_char(+CodeOrChar) is semidet
- xml_digit(+CodeOrChar) is semidet
- xml_extender(+CodeOrChar) is semidet
- XML character classification predicates. Each of these
predicates accept both a character (one-character atom) and a
code (integer).
- See also
- - http://www.w3.org/TR/2006/REC-xml-20060816
- xml_is_dom(@Term) is semidet
- True if term statisfies the structure as returned by
load_structure/3 and friends.
Undocumented predicates
The following predicates are exported, but not or incorrectly documented.
- xml_quote_attribute(Arg1, Arg2, Arg3)
- set_sgml_parser(Arg1, Arg2)
- free_dtd(Arg1)
- sgml_register_catalog_file(Arg1, Arg2)
- free_sgml_parser(Arg1)
- new_dtd(Arg1, Arg2)
- load_dtd(Arg1, Arg2)
- xsd_time_string(Arg1, Arg2, Arg3)
- sgml_parse(Arg1, Arg2)
- new_sgml_parser(Arg1, Arg2)
- dtd_property(Arg1, Arg2)
- iri_xml_namespace(Arg1, Arg2, Arg3)
- xsd_number_string(Arg1, Arg2)
- xml_quote_cdata(Arg1, Arg2, Arg3)
- get_sgml_parser(Arg1, Arg2)
- open_dtd(Arg1, Arg2, Arg3)
- iri_xml_namespace(Arg1, Arg2)
- xml_name(Arg1, Arg2)