clic.tokenizer: Tokenize strings to types

This module provides the core tokenisation for CLiC. It is used both when parsing incoming texts and when parsing concordance queries.

Method

To extract tokens, we use Unicode text segmentation as described in [UAX29], using the implementation in the [ICU] library and standard rules for en_GB, and then apply our own additions (see later).

Please read that document for a full description of ICU word boundaries. As a quick example, the following phrase:

The quick (“brown”) fox can’t jump 32.3 feet in-the-air, right?

…would have boundaries at every point marked with a |:

The| |quick| |(||“|brown|”|)| |fox| |can’t| |jump| |32.3| |feet| |in|-|the|-|air|,| |right|?|

We consider a boundary mark to be a word-boundary if…

  • The “Rule Status value” of the boundary according to ICU (see [ICU_RSV]) is at the end of a word (e.g. after jump) or the end of a number (e.g. after 32.3).

In addition, we introduce these rules for CLiC:

  • It is after a hyphen character surrounded by alpha-numeric characters, e.g. after all hyphens in |in|-|the|-|air|.
  • It is after an apostrophe preceded by s, e.g. after the ' in 3| |days|'| |work|.
  • It is after an apostrophe followed by one of a whitelist of words (see INITIAL_ABBREVIATIONS), e.g. after the ' in '|tis| |nothing|.

These additional rules are needed because ICU does not treat apostrophes on the outside of words, or the hyphens in hyphenated words, as part of a word.
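The three additional rules can be sketched as a predicate over a candidate boundary offset. This is only an illustration under stated assumptions, not the module's own word_boundary_type(): the helper name extra_word_boundary, the APOSTROPHES constant and the letters-only lookahead are hypothetical, and the sketch works on plain Python string offsets rather than on an ICU break iterator.

```python
import re

# Copied from the documented value of clic.tokenizer.INITIAL_ABBREVIATIONS.
INITIAL_ABBREVIATIONS = {"em", "tis", "twas", "twill", "twould"}

# Assumption: only the straight and the right-curving apostrophe qualify.
APOSTROPHES = "'\u2019"


def extra_word_boundary(s, b):
    """Return True if the boundary at offset b (just after s[b - 1])
    matches one of the three additional rules. Hypothetical helper."""
    if b < 1 or b > len(s):
        return False
    ch = s[b - 1]
    before = s[b - 2] if b >= 2 else ""
    after = s[b] if b < len(s) else ""
    if ch == "-":
        # Hyphen surrounded by alpha-numeric characters.
        return before.isalnum() and after.isalnum()
    if ch in APOSTROPHES:
        # Apostrophe preceded by s, e.g. days'
        if before.lower() == "s":
            return True
        # Apostrophe followed by a whitelisted word, e.g. 'tis
        following = re.match(r"[A-Za-z]+", s[b:])
        return bool(following) and following.group(0).lower() in INITIAL_ABBREVIATIONS
    return False


print(extra_word_boundary("in-the-air", 3))     # True
print(extra_word_boundary("day--one", 4))       # False
print(extra_word_boundary("3 days' work", 7))   # True
print(extra_word_boundary("'tis nothing", 1))   # True
```

In the real module these checks would only be consulted at positions that ICU has already reported as boundaries.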

…so if we mark word boundaries in the example above with ‖:

The‖ |quick‖ |(||“|brown‖”|)| |fox‖ |can’t‖ |jump‖ |32.3‖ |feet‖ |in‖-‖the‖-‖air‖,| |right‖?|

Tokens are then extracted by combining all text before adjacent word-boundaries, e.g.:

  • |feet‖ becomes the token feet.
  • |in‖-‖the‖-‖air‖,| becomes the token in-the-air.
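As a minimal sketch of this combining step, assume the text has already been cut at every boundary into (text, ends_at_word_boundary) pairs. That representation and the function name tokens_from_segments are hypothetical, chosen for illustration. The module itself works from ICU's break iterator.

```python
def tokens_from_segments(segments):
    """Join runs of segments that each end at a word-boundary (‖).

    segments is a list of (text, ends_at_word_boundary) pairs, one per
    piece of text between two adjacent boundary marks.
    """
    tokens = []
    current = ""
    for text, ends_at_word_boundary in segments:
        if ends_at_word_boundary:
            # Still inside a token: keep accumulating.
            current += text
        elif current:
            # A non-word segment closes the token collected so far.
            tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    return tokens


# |feet‖ |in‖-‖the‖-‖air‖,| |right‖?|
segments = [
    ("feet", True), (" ", False),
    ("in", True), ("-", True), ("the", True), ("-", True), ("air", True),
    (",", False), (" ", False),
    ("right", True), ("?", False),
]
print(tokens_from_segments(segments))  # ['feet', 'in-the-air', 'right']
```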

Tokens are then normalised into types by:

  • Lower-casing, e.g. The -> the.
  • Normalising any non-ascii characters with [UNIDECODE], e.g.
    • can’t -> can't.
    • café -> cafe.
  • Removing any surrounding underscores, e.g. _connoisseur_ -> connoisseur.
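The normalisation steps above can be approximated with the standard library alone. Note the substitution: this sketch uses unicodedata plus a small hand-written punctuation map as a stand-in for [UNIDECODE], so it only covers accented Latin letters and a few quote characters. The function name normalise_token and the _PUNCT_MAP table are hypothetical.

```python
import unicodedata

# Assumption: a tiny stand-in for Unidecode's handling of typographic quotes.
_PUNCT_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def normalise_token(token):
    """Turn a token into a type (illustrative, not CLiC's own code)."""
    # 1. Lower-case, and map curly quotes to their ASCII equivalents.
    lowered = token.lower().translate(_PUNCT_MAP)
    # 2. Decompose accented characters, then drop anything non-ASCII.
    decomposed = unicodedata.normalize("NFKD", lowered)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # 3. Remove surrounding underscores.
    return ascii_only.strip("_")


print(normalise_token("The"))            # the
print(normalise_token("can\u2019t"))     # can't
print(normalise_token("caf\u00e9"))      # cafe
print(normalise_token("_connoisseur_"))  # connoisseur
```

Unlike Unidecode, this stand-in silently drops characters it cannot decompose to ASCII instead of transliterating them.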

Queries for concordance searches are also turned into a list of types by this module. In this case we consider * as being part of a token for wildcard searches. See later examples for more information.

Examples / edge cases

A type is a lower-case, ASCII-representable term, and “fancy” apostrophes are normalised:

>>> [x[0] for x in types_from_string('''
...     I am a café cat, don’t you k'now.
... ''')]
['i', 'am', 'a', 'cafe', 'cat', "don't", 'you', "k'now"]

Numbers are types too:

>>> [x[0] for x in types_from_string('''
...     Just my $0.02, but we're 12 minutes late.
... ''')]
['just', 'my', '0.02', 'but', "we're", '12', 'minutes', 'late']

All surrounding punctuation is filtered out:

>>> [x[0] for x in types_from_string('''
...     "I am a cat", they said, "hear me **roar**!".
...
...     "...or at least mew".
... ''')]
['i', 'am', 'a', 'cat', 'they', 'said', 'hear', 'me', 'roar',
 'or', 'at', 'least', 'mew']

Unicode word-splitting doesn’t combine hyphenated words, but we do:

>>> [x[0] for x in types_from_string('''
...     It had been a close and sultry day--one of the hottest of the
...     dog-days--even out in the open country
... ''')]
['it', 'had', 'been', 'a', 'close', 'and', 'sultry', 'day',
 'one', 'of', 'the', 'hottest', 'of', 'the', 'dog-days',
 'even', 'out', 'in', 'the', 'open', 'country']

>>> [x[0] for x in types_from_string('''
...     so many out-of-the-way things had happened lately
... ''')]
['so', 'many', 'out-of-the-way', 'things',
 'had', 'happened', 'lately']

We also consider apostrophes surrounding words to be part of the word, unlike the standard. A preceding apostrophe is only kept if the word that follows is in our whitelist (see INITIAL_ABBREVIATIONS):

>>> [x[0] for x in types_from_string('''
...     'tis 3 days' work. 'twmade-up-word
... ''')]
["'tis", '3', "days'", 'work', 'twmade-up-word']

Preceding apostrophes have to curve in the correct direction (’ rather than ‘) to be included as part of the token (and thus the type), so the first ’em keeps the apostrophe and the second one doesn’t:

>>> [x[0] for x in types_from_string('''
...     Closing ’em. Opening ‘em.
... ''')]
['closing', "'em", 'opening', 'em']

Underscores are considered part of a word in the Unicode standard, but we strip them whilst generating types:

>>> [x[0] for x in types_from_string('''
... had some reputation as a _connoisseur_.
... ''')]
['had', 'some', 'reputation', 'as', 'a', 'connoisseur']

clic.tokenizer.INITIAL_ABBREVIATIONS = {'em', 'tis', 'twas', 'twill', 'twould'}

A set of words for which a preceding apostrophe is considered part of the token, rather than the start of a quote.

clic.tokenizer.types_from_string(s, offset=0, additional_word_parts={})

Extract tuples of (type, start, end) from s, optionally adding offset to the start and end values.

clic.tokenizer.word_boundary_type(s, bi, last_b, additional_word_parts={})

Add our own boundary types on top of what bi.getRuleStatus() returns.