clic.concordance: Concordance endpoint

Searches texts for given phrase(s).

  • corpora: One or more corpus names (e.g. ‘dickens’) or book names (‘AgnesG’) to search within
  • subset: subset to search through, one of shortsus/longsus/nonquote/quote/all. Default ‘all’ (i.e. all text)
  • q: One or more strings to search for. If multiple queries are provided, each is searched for in turn
  • contextsize: Size of context window around search results. Default 0.
  • metadata: Optional data to return; see book_metadata.py for all options.

Parameters should be provided in querystring format, for example:

?corpora=dickens&corpora=AgnesG&q=my+hands&q=my+feet
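
For instance, a hedged sketch of making such a request from Python with the requests library (the host name is a placeholder):

import requests

# Repeated keys are passed as a list of tuples
resp = requests.get('https://clic.example.org/api/concordance', params=[
    ('corpora', 'dickens'), ('corpora', 'AgnesG'),
    ('q', 'my hands'), ('q', 'my feet'),
    ('contextsize', '3'),
])
print(resp.json()['version'])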

Returns a data array, one entry per result. Each item is an array with the following items:

  • The left context window (if contextsize > 0, otherwise omitted)
  • The node (i.e. the text searched for)
  • The right context window (if contextsize > 0, otherwise omitted)
  • Result metadata
  • Position-in-book metadata

Each of the left/node/right context windows is an array of word/non-word tokens, the final item being a list of indices indicating which of the tokens are word tokens. For example:

[
    'while',
    ' ',
    'this',
    ' ',
    'shower',
    ' ',
    'gets',
    ' ',
    'owered',
    ",'",
    ' ',
    [0, 2, 4, 6, 8],
]
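
For instance, a minimal sketch of unpacking such an array (the helper name is hypothetical):

def split_window(window):
    # The final item lists the indices of the word tokens
    *tokens, word_indices = window
    words = [tokens[i] for i in word_indices]
    return words, ''.join(tokens)

words, text = split_window(['while', ' ', 'this', ' ', 'shower', ' ',
                            'gets', ' ', 'owered', ",'", ' ',
                            [0, 2, 4, 6, 8]])
print(words)  # ['while', 'this', 'shower', 'gets', 'owered']
print(text)   # while this shower gets owered,'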

Result metadata and Position-in-book metadata are currently subject to change.

The version object gives both the current version of CLiC and the revision of the corpora ingested in the database.

Query (q parameter) format

Queries are broken down using the tokenizer into lists of types to search for. This means any punctuation, spaces or newlines are ignored.

Queries can use the wildcard character, *, to search for 0 or more characters at this point. For example, oliver will find instances of “Oliver” and “oliver”, but oliver* will find “Oliver’s” in addition. The asterisk can be used anywhere within a type, not just at the end.

Examples

/api/concordance?corpora=AgnesG&q=his+hands&q=his+feet&contextsize=3:

{"data":[
  [
    ["to"," ","put"," ","into"," ",[0,2,4]],
    ["his"," ","hands",","," ",[0,2]],
    ["it"," ","should"," ","bring",[0,2,4]],
     . . .
  ], [
    ["the"," ","fire",","," ","with"," ",[0,2,5]],
    ["his"," ","hands"," ",[0,2]],
    ["behind"," ","his"," ","back",[0,2,4]],
     . . .
  ], [
    ["was"," ","frisking"," ","at"," ",[0,2,4]],
    ["his"," ","feet",","," ",[0,2]],
    ["and"," ","finally"," ","upon",[0,2,4]],
     . . .
  ],
], "version":{"corpora":"master:fc4de7c", "clic":"1.6:95bf699"}}

Method

The concordance search performs the following steps (a toy sketch follows the list):

  1. Resolve the corpora option to a list of book IDs, and translate the subset selection to a database region.
  2. Tokenise each provided query using the standard method in tokenizer, converting it into a list of database LIKE expressions for types. Note that the CLiC UI generally only provides one query, unless you select “Any word”, in which case it separates on whitespace and gives multiple queries. For example:
    • “He s* her foot” (whole phrase) results in one query ['he', 's%', 'her', 'foot']
    • “latter pavement” (any word) results in 2 queries, ['latter'] and ['pavement']
  3. For each query, choose an “anchor” type. The aim here is to find the least frequent term that will filter the results the fastest. See find_anchor_offset() for details.
  4. Search the database for all types that match this anchor in the given books, and within the given region. For example if our query was oliver*, this would match the types oliver, oliver's, olivers, etc.
  5. For each match, fetch the rest of the node types and context, if required.
  6. Check that any remaining types in the node also match the query and are in a relevant region.
  7. Combine the results with the text from the original book, add the chapter/paragraph/sentence statistics from the anchor, and return the result.
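
As a toy illustration of the anchor-first idea only (no database; all names are hypothetical, not the real implementation):

import re
from collections import Counter

def search_with_anchor(tokens, like_types):
    # Translate LIKE expressions into regexes for this toy example
    # ('%' = 0 or more characters, '_' = exactly one)
    regexes = [re.compile(p.replace('%', '.*').replace('_', '.') + '$')
               for p in like_types]

    # Step 3: the type matching the fewest tokens filters fastest
    freq = Counter(tokens)
    def match_count(i):
        return sum(n for t, n in freq.items() if regexes[i].match(t))
    anchor = min(range(len(regexes)), key=match_count)

    # Steps 4-6: scan anchor matches only, then check the rest of the node
    for pos, tok in enumerate(tokens):
        if not regexes[anchor].match(tok):
            continue
        start = pos - anchor
        if start < 0:
            continue
        node = tokens[start:start + len(like_types)]
        if len(node) == len(like_types) and all(
                r.match(t) for r, t in zip(regexes, node)):
            yield start, node

tokens = 'i fell off the top of the house'.split()
print(list(search_with_anchor(tokens, ['f%ll', 'off'])))
# [(1, ['fell', 'off'])]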

Examples / edge cases

These are the corpora we use for the following tests:

>>> db_cur = test_database(
... alice='''
... ‘Well!’ thought Alice to herself, ‘after such a fall as this, I shall
... think nothing of tumbling down stairs! How brave they’ll all think me at
... home! Why, I wouldn’t say anything about it, even if I fell off the top
... of the house!’ (Which was very likely true.)
...
... ‘I beg your pardon,’ said Alice very humbly: ‘you had got to the fifth
... bend, I think?’
...
... ‘I had NOT!’ cried the Mouse, sharply and very angrily.
... ''',
...
... willows='''
... ‘Get off!’ spluttered the Rat, with his mouth full.
...
... ‘Thought I should find you here all right,’ said the Otter cheerfully.
... ‘They were all in a great state of alarm along River Bank when I arrived
... this morning.
... ''')

f*ll * matches fall or fell, and also captures the word after it:

>>> format_conc(concordance(db_cur, ['alice'], q=['f*ll *']))
[['alice', 49, 'fall', 'as'],
 ['alice', 199, 'fell', 'off']]

We don’t match cheerfully in willows, though:

>>> format_conc(concordance(db_cur, ['willows'], q=['f*ll *']))
[['willows', 47, 'full', 'Thought']]

We can use ? to match exactly one character:

>>> format_conc(concordance(db_cur, ['willows'], q=['the?']))
[['willows', 126, 'They']]

…where * matches 0 or more characters:

>>> format_conc(concordance(db_cur, ['willows'], q=['the*']))
[['willows', 23, 'the'], ['willows', 103, 'the'], ['willows', 126, 'They']]

Search multiple books at the same time:

>>> format_conc(concordance(db_cur, ['alice', 'willows'], q=['f*ll *']))
[['alice', 49, 'fall', 'as'],
 ['alice', 199, 'fell', 'off'],
 ['willows', 47, 'full', 'Thought']]

Multiple queries can be given too (ordinarily used for the “Any word” option). We select the word before fall, and the word after fell:

>>> format_conc(concordance(db_cur, ['alice'], q=['* fall', 'fell *']))
[['alice', 47, 'a', 'fall'],
 ['alice', 199, 'fell', 'off']]

Since queries are tokenised first, the punctuation and case of a query are ignored. NB: I is capitalised because we return the token from the text, not the type we search for:

>>> format_conc(concordance(db_cur, ['alice'], q=['"i--FELL--off!"']))
[['alice', 197, 'I', 'fell', 'off']]
>>> format_conc(concordance(db_cur, ['alice'], q=['i fell off']))
[['alice', 197, 'I', 'fell', 'off']]

Similarly, apostrophes are normalised (don't vs don’t), so it doesn’t matter which type of apostrophe is searched for (note the output always matches the original text):

>>> format_conc(concordance(db_cur, ['alice'], q=["wouldn't"], contextsize=[1]))
[['alice', 157, 'I', '**', 'wouldn’t', '**', 'say']]
>>> format_conc(concordance(db_cur, ['alice'], q=["wouldn’t"], contextsize=[1]))
[['alice', 157, 'I', '**', 'wouldn’t', '**', 'say']]

Examples: subset selection

Results can be limited to regions. We can get quote concordances:

>>> format_conc(concordance(db_cur, ['alice', 'willows'], q=["thought"], subset=["quote"], contextsize=[1]))
[['willows', 55, 'full', '**', 'Thought', '**', 'I']]

…nonquote concordances:

>>> format_conc(concordance(db_cur, ['alice', 'willows'], q=["thought"], subset=["nonquote"], contextsize=[1]))
[['alice', 9, 'Well', '**', 'thought', '**', 'Alice']]

…or all (the default):

>>> format_conc(concordance(db_cur, ['alice', 'willows'], q=["thought"], contextsize=[1]))
[['alice', 9, 'Well', '**', 'thought', '**', 'Alice'],
 ['willows', 55, 'full', '**', 'Thought', '**', 'I']]

When searching in subsets, we do not consider boundaries: searching for “think I” finds a match that straddles two quotes:

>>> format_conc(concordance(db_cur, ['alice', 'willows'], q=["think I"], subset=["quote"], contextsize=[5]))
[['alice', 341, 'to', 'the', 'fifth', 'bend', 'I',
  '**', 'think', 'I', '**',
  'had', 'NOT', 'cried', 'the', 'Mouse']]

Query parsing

Query parsing is done by tokenising the string into a list of types using the tokenizer module <../clic.tokenizer/>; see there for more information.

When we parse concordance search queries, we preserve asterisks and convert them into percent signs, which the database uses to mean “0 or more characters” in LIKE expressions (see concordance):

>>> parse_query('''
... We have *books everywhere*!
...
... Moo* * oi*-nk
... ''')
['we', 'have', '%books', 'everywhere%',
 'moo%', '%', 'oi%-nk']

If the same phrase were in a book, we would throw away the asterisks when converting to types:

>>> [x[0] for x in types_from_string('''
... We have *books everywhere*!
...
... Moo* * oi*-nk
... ''')]
['we', 'have', 'books', 'everywhere',
 'moo', 'oi', 'nk']

We also support ? for single characters, which is turned into the LIKE expression _:

>>> [x for x in parse_query('''To the ?th degree''')]
['to', 'the', '_th', 'degree']
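
The wildcard translation itself is a simple per-type substitution; a minimal sketch (the real parse_query runs the tokenizer first, which this omits):

def wildcards_to_like(query_type):
    # '*' (0 or more characters) -> '%', '?' (exactly one) -> '_'
    return query_type.replace('*', '%').replace('?', '_')

print(wildcards_to_like('everywhere*'))  # everywhere%
print(wildcards_to_like('?th'))          # _th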
clic.concordance.STOPWORDS = {'a', 'and', 'he', 'i', 'in', 'of', 'that', 'the', 'to', 'was'}

The most frequent words across the entirety of the corpora, as of 2018-12-13

clic.concordance.concordance(cur, corpora=['dickens'], subset=['all'], q=[], contextsize=['0'], metadata=[])

Main entry function for concordance search

  • corpora: List of corpora / book names
  • subset: Subset to search within, or ‘all’
  • q: One or more queries to search for; results will match one of the given expressions
  • contextsize: Size of context window, defaults to 0 (no context)
  • metadata: Array of extra metadata to provide with the result, any of: ‘book_titles’ (return a dict of book IDs to titles at the end of the result)
clic.concordance.find_anchor_offset(*types)

Choose our anchor type in types and return its offset (i.e. 0 for the first word, 1 for the second…)

In general, the longest word is chosen:

>>> find_anchor_offset('our', 'reckoning')
1

>>> find_anchor_offset('finding', 'nemo')
0

Stopwords (see STOPWORDS) aren’t chosen, even when they are shorter:

>>> find_anchor_offset('the', 'fog')
1

>>> find_anchor_offset('joe', 'that')
0

If all types are stopwords, the longest is chosen:

>>> find_anchor_offset('he', 'was', 'that')
2

Wildcards aren’t counted when considering length:

>>> find_anchor_offset('so', 'h%ds')
1

>>> find_anchor_offset('jazz', 'h%ds')
0
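
A minimal sketch reproducing the selection rules above (not the actual implementation; see find_anchor_offset() for that):

STOPWORDS = {'a', 'and', 'he', 'i', 'in', 'of', 'that', 'the', 'to', 'was'}

def find_anchor_offset_sketch(*types):
    def length(t):
        # Wildcards don't count towards length
        return len(t.replace('%', '').replace('_', ''))
    # Prefer non-stopwords; fall back to all types if there are none
    candidates = [i for i, t in enumerate(types) if t not in STOPWORDS]
    if not candidates:
        candidates = list(range(len(types)))
    return max(candidates, key=lambda i: length(types[i]))

print(find_anchor_offset_sketch('so', 'h%ds'))  # 1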
clic.concordance.parse_query(q)

Turn a query string into a list of LIKE expressions

clic.concordance.to_conc(full_text, full_tokens, node_tokens, contextsize)

Convert full text + tokens back into wire format:

  • full_text: String covering the entire area, including the window
  • full_tokens: List of tokens, including the window
  • node_tokens: List of tokens, excluding the window
  • contextsize: Number of tokens that should be in the window; if 0, the window is not returned

A token is a NumericRange type indicating the range in full_text it corresponds to
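
As an illustration of the wire format only, here is a toy sketch that uses plain (start, stop) tuples in place of NumericRange and ignores the context window (the real to_conc also splits off the left/right windows):

def to_wire(full_text, word_ranges):
    out, word_indices, pos = [], [], 0
    for start, stop in word_ranges:
        if start > pos:
            # Emit the non-word text before this word as a single token
            out.append(full_text[pos:start])
        word_indices.append(len(out))
        out.append(full_text[start:stop])
        pos = stop
    if pos < len(full_text):
        out.append(full_text[pos:])
    out.append(word_indices)  # final item: indices of the word tokens
    return out

print(to_wire('his hands, ', [(0, 3), (4, 9)]))
# ['his', ' ', 'hands', ', ', [0, 2]]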