This module offers a simple tokenizer with flexible options.
- author:  Shon Feder
- license: http://unlicense.org/
Rationale:

tokenize_atom/2, in library(porter_stem), is inflexible: it does not allow
for the preservation of white space or control characters, and it only
tokenizes into a list of atoms.
The tokenize library is meant to be easy to use while allowing for
relatively flexible input and output. Features include:
- options for tokenization of spaces, numbers, strings, control characters and punctuation
- options to output packed tokens
- options to represent tokens in any of the common SWI-Prolog text formats
- option to preserve or ignore case
- a predicate to emit text given a list of tokens
E.g.,

    ?- tokenize('Tokenizes: words,"strings", 1234.5\n', Tokens, [cased(true), spaces(false)]),
    |  untokenize(Tokens, Codes).
    Tokens = [word('Tokenizes'), punct(:), word(words), punct(','), string(strings), punct(','), number(1234.5), cntrl('\n')],
    Codes = "Tokenizes:words,"strings"...34.5
    ".
tokenize is much more limited and much less performant than a lexer
generator, but it is dead simple to use and flexible enough for many
common use cases.
tokenize(+Text:text, -Tokens:list(term)) is semidet

    See also: tokenize/3, of which this is the special case called with an
    empty list of options: thus, with defaults.
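    For example, the following is a minimal sketch; the bindings shown
    assume the documented defaults (case folded, spaces kept, tokens
    represented as atoms):

        ?- tokenize('Hello, World!', Tokens).
        Tokens = [word(hello), punct(','), space(' '), word(world), punct(!)].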
tokenize(+Text:text, -Tokens:list(term), +Options:list(term)) is semidet

    True when Tokens is unified with a list of tokens representing the text
    from Text, according to the options specified in Options.
    Each token in Tokens will be one of:

    - word(W)   : where W is comprised of contiguous alphanumeric chars
    - punct(P)  : where char_type(P, punct)
    - cntrl(C)  : where char_type(C, cntrl)
    - space(S)  : where S == ' '
    - number(N) : where number(N)
    - string(S) : where S was a sequence of bytes enclosed by double
      quotation marks
    Note that the above describes the default behavior, in which each token
    is represented as an atom. This representation can be changed by using
    the to option described below.
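    For instance, passing to(strings) should yield string-valued tokens
    instead of atoms. The exact bindings below are illustrative, assuming
    the option behaves as documented:

        ?- tokenize('one two', Tokens, [to(strings)]).
        Tokens = [word("one"), space(" "), word("two")].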
    Valid Options are:

    - cased(+boolean)
      Determines whether tokens preserve the case of the source text.
      Defaults to cased(false).
    - spaces(+boolean)
      Determines whether spaces are represented as tokens or discarded.
      Defaults to spaces(true).
    - cntrl(+boolean)
      Determines whether control characters are represented as tokens or
      discarded. Defaults to cntrl(true).
    - punct(+boolean)
      Determines whether punctuation characters are represented as tokens
      or discarded. Defaults to punct(true).
    - numbers(+boolean)
      Determines whether the tokenizer represents and tags numbers.
      Defaults to numbers(true).
    - strings(+boolean)
      Determines whether the tokenizer represents and tags strings.
      Defaults to strings(true).
    - pack(+boolean)
      Determines whether tokens are packed or repeated. Defaults to
      pack(false).
    - to(+one_of([strings, atoms, chars, codes]))
      Determines the representation format used for the tokens. Defaults
      to to(atoms).
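    Options compose, so a single query can both filter token types and
    control their representation. A sketch (the bindings shown are
    illustrative, assuming the documented defaults for the remaining
    options):

        ?- tokenize('Total: 12 items', Tokens, [spaces(false), punct(false)]).
        Tokens = [word(total), number(12), word(items)].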
tokenize_file(+File:atom, -Tokens:list(term)) is semidet

    See also: tokenize_file/3, of which this is the special case called
    with an empty list of options: thus, with defaults.
tokenize_file(+File:atom, -Tokens:list(term), +Options:list(term)) is semidet

    True when Tokens is unified with a list of tokens representing the text
    of File.

    See also: tokenize/3, which has the same available options and behavior.
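    For example, assuming a file named corpus.txt exists in the working
    directory (the path is hypothetical):

        ?- tokenize_file('corpus.txt', Tokens, [spaces(false), cntrl(false)]).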
untokenize(+Tokens:list(term), -Untokens:list(codes)) is semidet

    True when Untokens is unified with a code list representation of each
    token in Tokens.
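    Since the result is a code list, a tokenize/untokenize round trip can
    be printed back out with format/2. A minimal sketch (variable bindings
    omitted):

        ?- tokenize('a b c', Tokens), untokenize(Tokens, Codes),
           format("~s~n", [Codes]).
        a b c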