The CLiC corpora

An overview of the corpora available in CLiC can be seen from the Counts tab. Simply choose the corpus for which you want to see the list of books and the Counts tab will provide a table listing the names and word counts for each book, per subset and in total.

For a full list of titles in CLiC please refer to Appendix 1. The procedure followed for retrieving, cleaning and importing the most recent texts is described in detail in our [GitHub_corpora] repository. Every change (“commit”) to the repository is marked with a “commit” number (actually a sequence of characters and numbers) in GitHub, circled in green in Fig. 2. From CLiC 2.0.0 onwards, this “commit” number is also displayed on the CLiC interface – circled in green in Fig. 3– to indicate which version of the corpora is used. We recommend that users record both the version number of CLiC itself (displayed directly under the CLiC logo, circled in red in Fig. 3, i.e. in this case “2.0.0”) and the version of the corpora (i.e. in this case “db61de3”) when saving results. This ensure that if at a later stage there should be differences in search results it is possible to investigate which changes to the interface or the corpora caused this discrepancy.

_images/figure-version-numbers-github.png

Fig. 2 The “commit” version number of the Github corpora repository

_images/figure-version-numbers.png

Fig. 3 The CLiC release and corpora version numbers in the CLiC interface

The Corpora repository also contains the full text of the corpus files after any manual cleaning changes have been implemented. The history of the Corpora repository lists the changes and you can refer to the original versions, as downloaded from the Project Gutenberg page for:

The texts can be selected individually and combined freely for analysis in any of the CLiC tools. You can also choose from one of our four pre-selected corpora: DNov – Dickens’s Novels (15 texts), 19C – 19th Century Reference Corpus (29 texts), ChiLit – 19th Century Children’s Literature Corpus (71 texts) and ArTs – Additional Requested Texts (31 texts). ArTs includes additional GCSE and A-Level titles (please see Appendix 2 for an overview of all CLiC texts listed in the AQA, OCR and Edexcel GCSE and A-Level English specifications). The ArTs corpus was not designed to be analysed as a whole, but rather to add individual requests to CLiC.

In order to select texts in any of the CLiC analysis tabs, go to controlbar on the right-hand side. You can select any or all of the texts by picking the corpora from a drop-down list or typing their names into a textbox. For example, in the Concordance tool, once you have clicked on the Concordance tab, a textbox labeled ‘Search the corpora’ will appear (see the Concordance section), as illustrated in Fig. 4 and Fig. 5.

_images/figure-corpora-select.png

Fig. 4 Selecting corpora in the Concordance tab (same procedure in Subsets and Clusters; see the Keywords section on how to select target and reference corpora)

_images/figure-corpora-select_2.png

Fig. 5 The dropdown menu for selecting corpora

You can search the pre-selected corpora in their entirety or you can pick individual books from them, effectively creating your own subcorpus. For example, you could select several books from Dickens, several books from the 19th Century Reference Corpus (19C) and several books from the 19th Century Children’s Literature Corpus (ChiLit).

You can also select an author-based corpus from the drop-down. For example, typing austen into the textbox (which is not case-sensitive) gives you the option of selecting all books by Jane Austen at once, as illustrated in Fig. 6.

_images/figure-corpora-authorbased.png

Fig. 6 Example of creating an author-based corpus: selecting all of Jane Austen’s novels

The CLiC corpora have been marked up to distinguish between several textual subsets of novels. The example from Great Expectations below illustrates the subsets and Fig. 7 shows how these are marked up in the chapter views, which can be retrieved from the ‘in bk.’ (in book) button in concordances (see the Concordance section for details) and the Text tab. The “in book” view also contains a legend of the markup layers (see Fig. 8), which you can individually select and deselect.

"And on what evidence, Pip," asked Mr. Jaggers, very coolly, as he
paused with his handkerchief half way to his nose,"does Provis make
this claim?”

"He does not make it," said I, "and has never made it, and has no
knowledge or belief that his daughter is in existence.”

For once, the powerful pocket-handkerchief failed. My reply was so
unexpected that Mr. Jaggers put the handkerchief back into his pocket
without completing the usual performance, folded his arms, and looked
with stern attention at me, though with an immovable face.

[Great Expectations, Chapter 51]

  • quotes: any text listed in quotes, i.e. mostly character speech but also thoughts or songs that might appear in quotes
  • non-quotes: narration
    • and a special case of non-quotes, suspensions, which represent narratorial interruptions of character speech that do not end with sentence-final punctuation. Suspensions are further divided by length:
      • short suspensions have a length up to four words
      • long suspensions have a length of five or more words
_images/figure-corpora-markupsubsets.png

Fig. 7 Chapter view of example (1) (retrieved via the ‘in bk.’ (in book) button in a concordance of asked Mr Jaggers very coolly), exemplifying the mark-up of subsets

_images/figure-corpora-highlight-subsets.png

The rationale behind the division of the subsets can be found in the open access article by [Mahlberg_et_al_2016]. The procedure described in that article refers to the earliest CLiC corpora, DNov and 19C. The tagging procedure for the most recently added corpora – ChiLit and ArTs – differs in the technical implementation – see clic.region for details.