clic.subset: Subset endpoint¶

Returns subsets of given texts, for example quotations.

corpora: One or more corpus names (e.g. ‘dickens’) or book names (e.g. ‘AgnesG’) to search within
subset: subset to return, one of shortsus/longsus/nonquote/quote/all. Default ‘all’ (i.e. all text)
contextsize: Size of context window around subset. Default 0.
metadata: Optional data to return, see book_metadata.py for all options.

Parameters should be provided in querystring format, for example:

?corpora=dickens&corpora=AgnesG&subset=quote
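
As a minimal sketch, a querystring with a repeated corpora parameter can be built with Python's standard library. The /api/subset path is taken from the examples on this page; the host is deliberately left out, since it depends on the installation.

```python
from urllib.parse import urlencode

# Parameters for the subset endpoint; a list value is sent as a repeated key.
params = {"corpora": ["dickens", "AgnesG"], "subset": "quote"}

# doseq=True expands the list into corpora=...&corpora=...
query = urlencode(params, doseq=True)

# Relative URL only: prepend the address of your own CLiC instance.
url = "/api/subset?" + query
print(url)
```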

Returns a data array, one entry per result. The data array is sorted by the book id, then chapter number. Each item is an array with the following items:

The left context window (if contextsize > 0, otherwise omitted)
The node (i.e. the subset)
The right context window (if contextsize > 0, otherwise omitted)
Result metadata
Position-in-book metadata

Each of the left, node and right context windows is an array of word and non-word tokens. The final item is an array of indices identifying which of the tokens are word tokens. For example:

[
    'while',
    ' ',
    'this',
    ' ',
    'shower',
    ' ',
    'gets',
    ' ',
    'owered',
    ",'",
    ' ',
    [0, 2, 4, 6, 8],
]
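
The index list makes it straightforward to pull out just the words. The helper below is an illustrative sketch, not part of CLiC; the name word_tokens is hypothetical.

```python
def word_tokens(window):
    """Return only the word tokens from a context window array."""
    # The last item is the list of word-token indices; the rest are tokens.
    *tokens, word_indices = window
    return [tokens[i] for i in word_indices]


example = [
    'while', ' ', 'this', ' ', 'shower', ' ', 'gets', ' ', 'owered', ",'", ' ',
    [0, 2, 4, 6, 8],
]
words = word_tokens(example)
print(words)
```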

Result metadata and Position-in-book metadata are currently subject to change.

The version object gives both the current version of CLiC and the revision of the corpora ingested in the database.

Examples:

/api/subset?corpora=AgnesG&subset=longsus:

{"data":[
  [["observed"," ","Smith",";"," ","'","and"," ","a"," ","darksome"," ",[0,2,6,8,10]], . . .],
  [["replied"," ","she",","," ","with"," ","a"," ","short",","," ","bitter"," ","laugh",";"," ",[0,2,5,7,9,12,14]], . . .],
   . . .
], "version":{"corpora":"master:fc4de7c", "clic":"1.6:95bf699"}}

/api/subset?corpora=AgnesG&subset=longsus&contextsize=3:

{"data":[
  [
    ["you",","," ","Miss"," ","Agnes",",'"," ",[0,3,5]],
    ["observed"," ","Smith",";"," ","'","and"," ","a"," ","darksome"," ",[0,2,6,8,10]],
    ["'","un"," ","too",";"," ","but",[1,3,6]],
     . . .
  ], [
    ["shown"," ","much"," ","mercy",",'"," ",[0,2,4]],
    ["replied"," ","she",","," ","with"," ","a"," ","short",","," ","bitter"," ","laugh",";"," ",[0,2,5,7,9,12,14]],
    ["'","killing"," ","the"," ","poor",[1,3,5]],
     . . .
  ],
], "version":{"corpora":"master:fc4de7c", "clic":"1.6:95bf699"}}
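
A client might name the items of each row and rebuild the running text along these lines. This is a sketch under assumptions: split_row and window_text are hypothetical helpers, and the two metadata items are stood in for by None because their shape is subject to change.

```python
def split_row(row, contextsize):
    """Name the items of one result row, following the order listed above."""
    if contextsize > 0:
        left, node, right, result_meta, position_meta = row[:5]
    else:
        # Context windows are omitted when contextsize is 0.
        left = right = None
        node, result_meta, position_meta = row[:3]
    return {"left": left, "node": node, "right": right,
            "result_meta": result_meta, "position_meta": position_meta}


def window_text(window):
    """Join a window's tokens back into text, dropping the final index list."""
    if window is None:
        return ""
    return "".join(window[:-1])


# First row of the contextsize=3 example; None entries are metadata placeholders.
row = [
    ["you", ",", " ", "Miss", " ", "Agnes", ",'", " ", [0, 3, 5]],
    ["observed", " ", "Smith", ";", " ", "'", "and", " ", "a", " ", "darksome", " ", [0, 2, 6, 8, 10]],
    ["'", "un", " ", "too", ";", " ", "but", [1, 3, 6]],
    None,
    None,
]
parts = split_row(row, contextsize=3)
line = window_text(parts["left"]) + window_text(parts["node"]) + window_text(parts["right"])
print(line)
```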

Method¶

The subset search performs the following steps:

Resolve the corpora option to a list of book IDs, and translate the subset selection to a database region.
For each region, find all tokens within the region, plus those within (contextsize) + 10 characters either side (approximating the context by a number of characters is faster than fetching exactly (contextsize) words).
Combine the results with the text from the original book, add the chapter/paragraph/sentence statistics for the first node in the region, and return the result.
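
The toy sketch below illustrates why padding by characters gives only an approximate number of context words. It is not the CLiC implementation, which works against database regions: the function name, the pad_chars parameter and the regular-expression tokeniser are all assumptions made for illustration.

```python
import re


def approx_context(text, start, end, pad_chars):
    """Split whole words around text[start:end] into left/node/right lists.

    pad_chars is a hypothetical padding in characters; only words lying
    entirely inside a window are kept.
    """
    left, node, right = [], [], []
    for m in re.finditer(r"\w+", text):  # stand-in tokeniser, not CLiC's
        if m.start() >= start and m.end() <= end:
            node.append(m.group())
        elif m.start() >= start - pad_chars and m.end() <= start:
            left.append(m.group())
        elif m.start() >= end and m.end() <= end + pad_chars:
            right.append(m.group())
    return left, node, right


text = "one two three four five six seven"
# The region covers the word "four" (characters 14 to 18).
narrow = approx_context(text, 14, 18, pad_chars=6)
wide = approx_context(text, 14, 18, pad_chars=10)
print(narrow)
print(wide)
```

The same padding yields a different number of words depending on how long the neighbouring words are, which is the trade-off the method accepts for speed.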

Examples / edge cases¶

>>> db_cur = test_database(
... alice='''
... ‘Well!’ thought Alice to herself, ‘after such a fall as this, I shall
... think nothing of tumbling down stairs! How brave they’ll all think me at
... home! Why, I wouldn’t say anything about it, even if I fell off the top
... of the house!’ (Which was very likely true.)
... ''',
...
... willows='''
... ‘Get off!’ spluttered the Rat, with his mouth full.
...
... ‘Thought I should find you here all right,’ said the Otter cheerfully.
... ‘They were all in a great state of alarm along River Bank when I arrived
... this morning.’
... ''',
...
... mansfield='''
... over the various advertisements of “A most desirable
... Estate in South Wales”; “To Parents and Guardians”; and a “Capital
... season’d Hunter.”
... ''')

We can ask for quotes:

>>> format_conc(subset(db_cur, ['alice', 'willows'], subset=['quote']))
[['alice', 1, 'Well'],
 ['alice', 35, 'after', 'such', 'a', 'fall', 'as', 'this', 'I', 'shall',
 'think', 'nothing', 'of', 'tumbling', 'down', 'stairs', 'How', 'brave',
 'they’ll', 'all', 'think', 'me', 'at', 'home', 'Why', 'I', 'wouldn’t',
 'say', 'anything', 'about', 'it', 'even', 'if', 'I', 'fell', 'off', 'the',
 'top', 'of', 'the', 'house'],
 ['willows', 1, 'Get', 'off'],
 ['willows', 54, 'Thought', 'I', 'should', 'find', 'you', 'here', 'all', 'right'],
 ['willows', 125, 'They', 'were', 'all', 'in', 'a', 'great', 'state', 'of',
  'alarm', 'along', 'River', 'Bank', 'when', 'I', 'arrived', 'this', 'morning']]

Or nonquotes, from a single book:

>>> format_conc(subset(db_cur, ['alice'], subset=['nonquote']))
[['alice', 9, 'thought', 'Alice', 'to', 'herself'],
 ['alice', 231, 'Which', 'was', 'very', 'likely', 'true']]

Context size can also be configured, but the returned window is only approximate, because context is measured in characters rather than words:

>>> format_conc(subset(db_cur, ['alice'], subset=['nonquote'], contextsize=[3]))
[['alice', 9, 'Well', '**', 'thought', 'Alice', 'to', 'herself', '**', 'after', 'such', 'a', 'fall', 'as', 'this', 'I'],
 ['alice', 231, 'off', 'the', 'top', 'of', 'the', 'house', '**', 'Which', 'was', 'very', 'likely', 'true', '**']]
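
In the output above, the '**' entries appear to mark where the node begins and ends. Assuming that reading of the format, a row can be split into its parts as sketched here; split_conc_row is a hypothetical helper, not part of the test suite.

```python
def split_conc_row(row, marker="**"):
    """Split a formatted row into (book, position, left, node, right)."""
    book, position, *words = row
    if marker not in words:
        # No context requested: everything after the position is the node.
        return book, position, [], words, []
    first = words.index(marker)
    last = len(words) - 1 - words[::-1].index(marker)
    return book, position, words[:first], words[first + 1:last], words[last + 1:]


row = ['alice', 231, 'off', 'the', 'top', 'of', 'the', 'house', '**',
       'Which', 'was', 'very', 'likely', 'true', '**']
book, position, left, node, right = split_conc_row(row)
print(left, node, right)
```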

Suspensions without any words inside aren’t returned:

>>> format_conc(subset(db_cur, ['mansfield'], subset=['shortsus'], contextsize=[3]))
[['mansfield', 104, 'To', 'Parents', 'and', 'Guardians', '**', 'and', 'a', '**', 'Capital', 'season’d', 'Hunter']]

clic.subset.subset(cur, corpora=['dickens'], subset=['all'], contextsize=['0'], metadata=[])¶

Main entry function for subset search

corpora: List of corpora / book names
subset: Subset(s) to search for.
contextsize: Size of context window. Defaults to 0 (no context).
metadata: Array of extra metadata to provide with the result. Options include ‘book_titles’ (return a dict of book IDs to titles at the end of the result).
