clic.subset: Subset endpoint
Returns subsets of given texts, for example quotations.
- corpora: 1+ corpus name (e.g. ‘dickens’) or book name (‘AgnesG’) to search within
- subset: subset to return, one of shortsus/longsus/nonquote/quote/all. Default ‘all’ (i.e. all text)
- contextsize: Size of context window around subset. Default 0.
- metadata: Optional data to return, see book_metadata.py for all options.
Parameters should be provided in querystring format, for example:
?corpora=dickens&corpora=AgnesG&subset=quote
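As a minimal sketch of building such a querystring from Python, the standard library's urlencode repeats a key when given a list of pairs. The helper name subset_query is hypothetical and is not part of CLiC:

```python
from urllib.parse import urlencode


def subset_query(corpora, subset="all", contextsize=0, metadata=()):
    """Build a querystring for the subset endpoint (hypothetical helper)."""
    # One ("corpora", name) pair per corpus or book, so the key repeats.
    params = [("corpora", name) for name in corpora]
    params.append(("subset", subset))
    # contextsize defaults to 0, so only send it when a window is wanted.
    if contextsize:
        params.append(("contextsize", contextsize))
    params.extend(("metadata", item) for item in metadata)
    return "?" + urlencode(params)


print(subset_query(["dickens", "AgnesG"], subset="quote"))
# ?corpora=dickens&corpora=AgnesG&subset=quote
```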
Returns a data array, one entry per result. The data array is sorted by the book id, then chapter number. Each item is an array with the following items:
- The left context window (if contextsize > 0, otherwise omitted)
- The node (i.e. the subset)
- The right context window (if contextsize > 0, otherwise omitted)
- Result metadata
- Position-in-book metadata
Each of the left/node/right windows is an array of word/non-word tokens, with the final item listing the indices of the tokens that are word tokens. For example:
[
'while',
' ',
'this',
' ',
'shower',
' ',
'gets',
' ',
'owered',
",'",
' ',
[0, 2, 4, 6, 8],
]
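A window in this shape can be decoded by splitting off the final index list and using it to pick out the word tokens. The following is a small sketch; the function name word_tokens is an illustrative assumption, not a CLiC function:

```python
def word_tokens(window):
    """Return only the word tokens from a left/node/right window."""
    # The last item is the list of indices; everything before it is a token.
    *tokens, word_indices = window
    return [tokens[i] for i in word_indices]


window = [
    'while', ' ', 'this', ' ', 'shower', ' ', 'gets', ' ', 'owered', ",'", ' ',
    [0, 2, 4, 6, 8],
]

print(word_tokens(window))
# ['while', 'this', 'shower', 'gets', 'owered']

# Joining every token except the index list restores the original text.
print(repr("".join(window[:-1])))
# "while this shower gets owered,' "
```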
Result metadata and Position-in-book metadata are currently subject to change.
The version object gives both the current version of CLiC and the revision of the corpora ingested in the database.
Examples:
/api/subset?corpora=AgnesG&subset=longsus:
{"data":[
[["observed"," ","Smith",";"," ","'","and"," ","a"," ","darksome"," ",[0,2,6,8,10]], . . .],
[["replied"," ","she",","," ","with"," ","a"," ","short",","," ","bitter"," ","laugh",";"," ",[0,2,5,7,9,12,14]], . . .],
. . .
], "version":{"corpora":"master:fc4de7c", "clic":"1.6:95bf699"}}
/api/subset?corpora=AgnesG&subset=longsus&contextsize=3:
{"data":[
[
["you",","," ","Miss"," ","Agnes",",'"," ",[0,3,5]],
["observed"," ","Smith",";"," ","'","and"," ","a"," ","darksome"," ",[0,2,6,8,10]],
["'","un"," ","too",";"," ","but",[1,3,6]],
. . .
], [
["shown"," ","much"," ","mercy",",'"," ",[0,2,4]],
["replied"," ","she",","," ","with"," ","a"," ","short",","," ","bitter"," ","laugh",";"," ",[0,2,5,7,9,12,14]],
["'","killing"," ","the"," ","poor",[1,3,5]],
. . .
],
], "version":{"corpora":"master:fc4de7c", "clic":"1.6:95bf699"}}
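Because the context windows are omitted when contextsize is 0, a client has to know which request it made before unpacking a result. The sketch below shows one way to do that; split_row is a hypothetical helper, and the sample row holds only the three windows from the example above because the trailing metadata items are elided there:

```python
def split_row(row, contextsize):
    """Split one data entry into (left, node, right, remaining items)."""
    if contextsize > 0:
        # Context was requested: left, node and right windows come first.
        left, node, right, *rest = row
    else:
        # No context: the node is the first item, windows are absent.
        left, right = None, None
        node, *rest = row
    return left, node, right, rest


def words(window):
    """Pick out the word tokens using the window's final index list."""
    *tokens, indices = window
    return [tokens[i] for i in indices]


row = [
    ["you", ",", " ", "Miss", " ", "Agnes", ",'", " ", [0, 3, 5]],
    ["observed", " ", "Smith", ";", " ", "'", "and", " ", "a", " ", "darksome", " ", [0, 2, 6, 8, 10]],
    ["'", "un", " ", "too", ";", " ", "but", [1, 3, 6]],
]

left, node, right, rest = split_row(row, contextsize=3)
print(words(left))   # ['you', 'Miss', 'Agnes']
print(words(node))   # ['observed', 'Smith', 'and', 'a', 'darksome']
print(words(right))  # ['un', 'too', 'but']
```

The remaining items (result metadata and position-in-book metadata) are returned untouched, since their layout is subject to change.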
Method
The subset search performs the following steps:
- Resolve the corpora option to a list of book IDs, and translate the subset selection to a database region.
- For each region, find all tokens within the region, plus (contextsize) * 10 characters either side. Approximating the context by character count is faster than fetching exactly (contextsize) words, which is why the returned context is only approximate.
- Combine the results with the text from the original book, add the chapter/paragraph/sentence statistics for the first node in the region, and return the result.
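The effect of measuring context in characters rather than words can be illustrated with a short sketch. This is not CLiC's implementation: the function name, the regular expression, and the in-memory string are all assumptions, and the margin is passed in directly so the sketch makes no claim about how it is derived from contextsize.

```python
import re

# Assumed word pattern: letters/digits, allowing internal apostrophes.
WORD = re.compile(r"\w+(?:[’']\w+)*")


def approximate_context(text, start, end, margin_chars):
    """Return (left, node, right) word lists using a character margin."""
    # Take a fixed number of characters either side of the region. A real
    # implementation would work on whole tokens; plain slicing can cut a word.
    left = text[max(0, start - margin_chars):start]
    node = text[start:end]
    right = text[end:end + margin_chars]
    return tuple(WORD.findall(part) for part in (left, node, right))


text = "‘Well!’ thought Alice to herself, ‘after such a fall as this, I shall think"
start = text.index("thought")
end = text.index("‘after")

left, node, right = approximate_context(text, start, end, margin_chars=30)
print(left)   # ['Well']
print(node)   # ['thought', 'Alice', 'to', 'herself']
print(right)  # ['after', 'such', 'a', 'fall', 'as', 'this', 'I']
```

With a 30-character margin the left side yields one word (the text runs out) and the right side yields seven, so the number of context words varies with word length and position.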
Examples / edge cases
>>> db_cur = test_database(
... alice='''
... ‘Well!’ thought Alice to herself, ‘after such a fall as this, I shall
... think nothing of tumbling down stairs! How brave they’ll all think me at
... home! Why, I wouldn’t say anything about it, even if I fell off the top
... of the house!’ (Which was very likely true.)
... ''',
...
... willows='''
... ‘Get off!’ spluttered the Rat, with his mouth full.
...
... ‘Thought I should find you here all right,’ said the Otter cheerfully.
... ‘They were all in a great state of alarm along River Bank when I arrived
... this morning.’
... ''',
...
... mansfield='''
... over the various advertisements of “A most desirable
... Estate in South Wales”; “To Parents and Guardians”; and a “Capital
... season’d Hunter.”
... ''')
We can ask for quotes:
>>> format_conc(subset(db_cur, ['alice', 'willows'], subset=['quote']))
[['alice', 1, 'Well'],
['alice', 35, 'after', 'such', 'a', 'fall', 'as', 'this', 'I', 'shall',
'think', 'nothing', 'of', 'tumbling', 'down', 'stairs', 'How', 'brave',
'they’ll', 'all', 'think', 'me', 'at', 'home', 'Why', 'I', 'wouldn’t',
'say', 'anything', 'about', 'it', 'even', 'if', 'I', 'fell', 'off', 'the',
'top', 'of', 'the', 'house'],
['willows', 1, 'Get', 'off'],
['willows', 54, 'Thought', 'I', 'should', 'find', 'you', 'here', 'all', 'right'],
['willows', 125, 'They', 'were', 'all', 'in', 'a', 'great', 'state', 'of',
'alarm', 'along', 'River', 'Bank', 'when', 'I', 'arrived', 'this', 'morning']]
Or nonquotes, from a single book:
>>> format_conc(subset(db_cur, ['alice'], subset=['nonquote']))
[['alice', 9, 'thought', 'Alice', 'to', 'herself'],
['alice', 231, 'Which', 'was', 'very', 'likely', 'true']]
Context size can also be configured, but the number of context words returned is only approximate:
>>> format_conc(subset(db_cur, ['alice'], subset=['nonquote'], contextsize=[3]))
[['alice', 9, 'Well', '**', 'thought', 'Alice', 'to', 'herself', '**', 'after', 'such', 'a', 'fall', 'as', 'this', 'I'],
['alice', 231, 'off', 'the', 'top', 'of', 'the', 'house', '**', 'Which', 'was', 'very', 'likely', 'true', '**']]
Suspensions without any words inside aren’t returned:
>>> format_conc(subset(db_cur, ['mansfield'], subset=['shortsus'], contextsize=[3]))
[['mansfield', 104, 'To', 'Parents', 'and', 'Guardians', '**', 'and', 'a', '**', 'Capital', 'season’d', 'Hunter']]
clic.subset.subset(cur, corpora=['dickens'], subset=['all'], contextsize=['0'], metadata=[])
Main entry function for subset search
- corpora: List of corpora / book names
- subset: Subset(s) to search for.
- contextsize: Size of context window, defaults to 0 (no context window).
- metadata: Array of extra metadata to provide with result, some of:
  - ‘book_titles’ (return dict of book IDs to titles at end of result)