CLiC API

Overview

Any data available through the CLiC web interface is also available by directly calling the CLiC API. The API returns a JSON representation of the CLiC data, so the data can be retrieved directly from any programming language. To get you started, we have written some example code for both Python and R.

The sample code covers the corpora, subset and cluster endpoints. The corpora endpoint is used to retrieve the list of resources available in CLiC. This can then be used by your code to filter or select the resources you want to fetch (see the example usage). In the example code the corpora endpoint is called using the get_lookup() function, which returns content like:

> lookup
     corpus                      author     id               title
  1: ChiLit            Agnes Strickland  rival   The Rival Crusoes
  2: ChiLit                 Andrew Lang prigio       Prince Prigio
  3: ChiLit           Ann Fraser Tytler  leila       Leila at Home
 ---
136:    ntc              Wilkie Collins   arma            Armadale
137:    ntc              Wilkie Collins wwhite  The Woman in White
138:    ntc William Makepeace Thackeray vanity         Vanity Fair
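For reference, the corpora endpoint returns a nested JSON structure: a list of corpora, each with a list of child texts. The field names below are inferred from the example code further down; this abbreviated mock is only a sketch of the shape, not a real response.

```python
# Abbreviated mock of the corpora response shape (field names inferred
# from the get_lookup() example code; a real response has many more
# corpora and books).
sample = {
    "corpora": [
        {"id": "ChiLit", "children": [
            {"id": "rival", "title": "The Rival Crusoes", "author": "Agnes Strickland"},
            {"id": "prigio", "title": "Prince Prigio", "author": "Andrew Lang"},
        ]},
    ]
}

# Flatten the nesting into (corpus, shortname, title) rows.
rows = [(corpus["id"], book["id"], book["title"])
        for corpus in sample["corpora"]
        for book in corpus["children"]]
print(rows[0])  # ('ChiLit', 'rival', 'The Rival Crusoes')
```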

The subset endpoint is used to retrieve tokenized text for all or parts of one or more of the corpora. This endpoint is called ‘subset’ because you can restrict the retrieved tokens to a specific subset of the whole text. The available subsets are: quoted text, non-quoted text, long suspensions and short suspensions. In the example code the subset endpoint is called using the get_tokens() function. The subset endpoint is documented at clic.subset.
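Each data row returned by the subset endpoint carries the tokens for one region of text. As far as can be inferred from the get_tokens() example code below, the first element of a row is a list whose final entry holds the indices of the word tokens, with the preceding entries being the word and punctuation tokens themselves. A mocked illustration, assuming that layout:

```python
# Mocked subset row: tokens followed by a list of word-token indices.
# (Layout inferred from the get_tokens() example code; a real response
# contains additional metadata per row.)
row = [["it", ",", "was", "!", [0, 2]]]

all_tokens = row[0][:-1]                             # words and punctuation
word_tokens = [row[0][:-1][k] for k in row[0][-1]]   # words only
print(all_tokens)   # ['it', ',', 'was', '!']
print(word_tokens)  # ['it', 'was']
```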

The cluster endpoint is used to retrieve n-grams and their counts. In the example code the clusters endpoint is called using the get_clusters() function. The cluster endpoint is documented at clic.cluster.
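The cluster endpoint returns its data as (cluster, count) pairs, which the example code below sorts by count into an OrderedDict. A mocked sketch of that step, using invented counts:

```python
from collections import OrderedDict
from operator import itemgetter

# Mocked cluster data: [cluster, count] pairs as they appear in rv['data'].
data = [["i do not know", 16], ["i am sure", 26], ["i do not think", 14]]

# Sort by count, highest first, as get_clusters() does.
clusters = OrderedDict(sorted(data, key=itemgetter(1), reverse=True))
print(list(clusters.items())[0])  # ('i am sure', 26)
```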

We would be interested to hear how you use the CLiC API and are always happy to consider CLiC-related guest posts for the CLiC blog. To let us know how you are using the CLiC API, to give us feedback, or if you need any help that you cannot find here or through the CLiC homepage, you can contact us at clic@contacts.bham.ac.uk.

To help us understand who is using the API, when writing code to access the API please set the “User-Agent” header to something that identifies you or your application.

The CLiC API uses the legacy names for the CLiC corpora. The following table gives the correspondence between the corpora names as seen in the CLiC web interface and those used by the CLiC API.

CLiC Web   CLiC API   Description
DNov       dickens    Dickens’s Novels
19C        ntc        19th Century Reference Corpus
ChiLit     ChiLit     19th Century Children’s Literature Corpus
ArTs       Other      Additional Requested Texts
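When scripting against the API it can help to keep this correspondence in code. A small dictionary, with the names taken directly from the table above:

```python
# Map the corpus names shown in the CLiC web interface to the legacy
# names the API expects (taken from the table above).
WEB_TO_API = {
    "DNov": "dickens",
    "19C": "ntc",
    "ChiLit": "ChiLit",
    "ArTs": "Other",
}

print(WEB_TO_API["19C"])  # ntc
```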

Example code

Python 3

import json
from operator import itemgetter
from collections import OrderedDict
import requests
import pandas as pd

UA = "CLiC API Example Python3 Code"  # user agent !! CHANGE ME !!
HOSTNAME = "clic.bham.ac.uk"

def api_request(endpoint, query=None):
    """
    Makes the API requests.
    Returns the endpoint specific data structure as a python structure.

    - endpoint: see the inline docs in /server/clic
    - query: endpoint specific parameters as a querystring
    """
    if query is None:
        uri = 'http://%s/api/%s' % (HOSTNAME, endpoint)
    else:
        uri = 'http://%s/api/%s?%s' % (HOSTNAME, endpoint, query)
    resp = requests.get(uri, headers={'User-Agent': UA, 'Accept': 'application/json'})
    try:
        rv = resp.json()
    except json.decoder.JSONDecodeError:
        raise ValueError("API request did not return valid JSON")
    if rv.get('error', False):
        raise ValueError("API returned error: " + rv['error']['message'])
    if rv.get('warn', False):
        print("API returned warning: " + rv['warn']['message'])
    if rv.get('info', False):
        print("API returned info: " + rv['info']['message'])
    return rv

def get_lookup():
    """
    Returns a pandas DataFrame listing the texts for each of the available corpora.
    """
    rv = api_request(endpoint="corpora")
    d = []
    for corpus in rv['corpora']:
        corpus_id = corpus['id']
        for book in corpus['children']:
            d.append({'corpus': corpus_id, 'author': book['author'],
                      'shortname': book['id'], 'title': book['title']})
    df = pd.DataFrame(d, columns=['corpus', 'author', 'shortname', 'title'])
    df.sort_values(['corpus', 'author', 'title'], inplace=True, ascending=True)
    df.reset_index(inplace=True, drop=True)
    return df

def get_tokens(shortname, subset=None, lowercase=True, punctuation=False):
    """
    Fetches tokens using the 'subset' endpoint.
    Returns a list of tokens.

    - shortname: any value from the 'corpus' or 'shortname' columns returned
          by get_lookup(); can be a string or a list of strings
    - subset: any one of "shortsus", "longsus", "nonquote", "quote"
    - lowercase: boolean indicating if the tokens should be transformed to lower case
    - punctuation: boolean indicating if punctuation tokens should be included
    """
    if isinstance(shortname, str):
        shortname = [shortname]
    query = '&'.join(["corpora=%s" % sn for sn in shortname])
    if subset is not None:
        if subset not in ["shortsus", "longsus", "nonquote", "quote"]:
            raise ValueError('bad subset parameter: "%s"' % subset)
        query = query + "&subset=%s" % subset
    rv = api_request(endpoint="subset", query=query)
    if punctuation:
        tokens = [j for i in rv['data'] for j in i[0][:-1]]
    else:
        tokens = [j for i in rv['data'] for j in [i[0][:-1][k] for k in i[0][-1]]]
    if lowercase:
        return [i.lower() for i in tokens]
    return tokens

def get_clusters(shortname, length, cutoff=5, subset=None):
    """
    Fetches n-grams using the 'cluster' endpoint.
    Returns an OrderedDict mapping clusters to counts.

    - shortname: any value from the 'corpus' or 'shortname' columns returned
          by get_lookup(); can be a string or a list of strings
    - length: cluster length to search for, one of 1/3/4/5 (NB: There is no 2)
    - cutoff: [default: 5] the cutoff frequency, if a cluster occurs less times
          than this it is not returned
    - subset: [optional] any one of "shortsus", "longsus", "nonquote", "quote"
    """
    if isinstance(shortname, str):
        shortname = [shortname]
    query = '&'.join(["corpora=%s" % sn for sn in shortname])
    if subset is not None:
        if subset not in ["shortsus", "longsus", "nonquote", "quote"]:
            raise ValueError('bad subset parameter: "%s"' % subset)
        query = query + "&subset=%s" % subset
    query = query + "&clusterlength=%d&cutoff=%d" % (length, cutoff)
    rv = api_request(endpoint="cluster", query=query)
    clusters = OrderedDict(sorted(rv['data'], key=itemgetter(1), reverse=True))
    return clusters

Find out what texts are available:

>>> lookup = get_lookup()
>>> lookup.head()
   corpus             author shortname                       title
0  ChiLit   Agnes Strickland     rival           The Rival Crusoes
1  ChiLit        Andrew Lang    prigio               Prince Prigio
2  ChiLit  Ann Fraser Tytler     leila               Leila at Home
3  ChiLit        Anna Sewell    beauty                Black Beauty
4  ChiLit     Beatrix Potter     bunny  The Tale Of Benjamin Bunny
>>> lookup.tail()
    corpus                       author shortname                          title
133    ntc                 Thomas Hardy    native       The Return of the Native
134    ntc               Wilkie Collins    Antoni  Antonina, or the Fall of Rome
135    ntc               Wilkie Collins      arma                       Armadale
136    ntc               Wilkie Collins    wwhite             The Woman in White
137    ntc  William Makepeace Thackeray    vanity                    Vanity Fair

Filter what is available:

>>> lookup[lookup['author'] == "Thomas Hardy"]
    corpus        author shortname                      title
131    ntc  Thomas Hardy      Jude           Jude the Obscure
132    ntc  Thomas Hardy      Tess  Tess of the D'Urbervilles
133    ntc  Thomas Hardy    native   The Return of the Native

Fetch the tokens for a specific text:

>>> tokens = get_tokens(shortname='leila')
>>> len(tokens)
63026
>>> tokens[0:9]
['it', 'was', 'the', 'intention', 'of', 'the', 'writer', 'of', 'the']

Fetch the tokens for all quoted text in novels by Jane Austen:

>>> wanted = [sn for sn in lookup[lookup['author'] == "Jane Austen"]['shortname']]
>>> wanted
['ladysusan', 'mansfield', 'northanger', 'sense', 'emma', 'persuasion', 'pride']

>>> austen_quotes = get_tokens(shortname=wanted, subset="quote")
>>> len(austen_quotes)
307445
>>> austen_quotes[0:9]
['poor', 'miss', 'taylor', 'i', 'wish', 'she', 'were', 'here', 'again']

Keep each text separate:

>>> austen_quotes = {}
>>> for sn in wanted:
...     austen_quotes[sn] = get_tokens(shortname=sn, subset="quote")
...
>>> print(json.dumps({key: len(value) for key, value in austen_quotes.items()}, indent=2))
{
  "ladysusan": 2791,
  "mansfield": 62013,
  "northanger": 28937,
  "sense": 51744,
  "emma": 80319,
  "persuasion": 28653,
  "pride": 52988
}
>>> austen_quotes['emma'][0:9]
['poor', 'miss', 'taylor', 'i', 'wish', 'she', 'were', 'here', 'again']

And now let’s get some clusters for the Jane Austen novels:

>>> austen_clusters = get_clusters(shortname=wanted, length=5, cutoff=5, subset="quote")
>>> print(json.dumps(austen_clusters, indent=2))
{
  "i do not know what": 26,
  "i am sure you will": 16,
  "i do not know that": 16,
  "i do not mean to": 16,
  "and i am sure i": 16,
  "i have no doubt of": 14,
  "i do not think i": 14,
  "i am sure i should": 13,
  "i am sure i do": 11,
  "i do not pretend to": 11,
  ...

R

Functions to access the CLiC API from R are available in an R package. The package contains a Getting Started vignette which includes code samples.