clic.keyword: Keyword endpoint

Module to compute keywords (words that are used significantly more frequently in one corpus than they are in a reference corpus).

The statistical measure used is Log Likelihood as explained by Rayson and Garside:

Rayson, P. and Garside, R. (2000). Comparing corpora using frequency profiling. In Proceedings of the Workshop on Comparing Corpora, held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics (ACL 2000), 1-8 October 2000, Hong Kong, pp. 1-6. Available at: http://ucrel.lancs.ac.uk/people/paul/publications/rg_acl2000.pdf
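In that formulation, a word's expected frequency in each corpus is its combined observed count distributed in proportion to the corpus sizes, and LL compares observed against expected counts. A minimal single-word sketch (the counts below are illustrative, not taken from the paper):

```python
import math

def log_likelihood_single(count_analysis, total_analysis, count_ref, total_ref):
    """Log Likelihood for one word, following Rayson and Garside (2000).

    count_*: observed frequency of the word in each corpus.
    total_*: total number of tokens in each corpus.
    """
    # Expected frequencies: the word's combined count, distributed over
    # the two corpora in proportion to their sizes.
    combined = count_analysis + count_ref
    grand_total = total_analysis + total_ref
    expected_analysis = total_analysis * combined / grand_total
    expected_ref = total_ref * combined / grand_total

    # LL = 2 * sum(observed * ln(observed / expected)); a zero observed
    # count contributes nothing to the sum.
    ll = 0.0
    for observed, expected in ((count_analysis, expected_analysis),
                               (count_ref, expected_ref)):
        if observed > 0:
            ll += observed * math.log(observed / expected)
    return 2 * ll

# A word occurring 120 times in a 100,000-token corpus, against 40 times
# in a 200,000-token reference corpus, is strongly overused.
print(round(log_likelihood_single(120, 100_000, 40, 200_000), 2))
```

A word that is distributed exactly in proportion to the corpus sizes yields an LL of 0.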

Usage as follows:

import pandas as pd

analysis = pd.DataFrame([('a', 2500), ('the', 25000)], columns=('Type', 'Count'))
reference = pd.DataFrame([('a', 10), ('the', 10)], columns=('Type', 'Count'))

result = extract_keywords(analysis,
                          reference,
                          tokencount_analysis=20,
                          tokencount_reference=100,
                          round_values=True,
                          limit_rows=3)

Or using text files as input:

from collections import Counter

import pandas as pd

with open('~/data/input/DNov/BH.txt') as inputfile:
    bh = inputfile.read().split()
bh = Counter(bh)
with open('~/data/input/DNov/OT.txt') as inputfile:
    ot = inputfile.read().split()
ot = Counter(ot)

bh_df = pd.DataFrame(bh.items(), columns=['Type', 'Count'])
ot_df = pd.DataFrame(ot.items(), columns=['Type', 'Count'])

extract_keywords(bh_df,
                 ot_df,
                 tokencount_analysis=bh_df.Count.sum(),
                 tokencount_reference=ot_df.Count.sum(),
                 round_values=True)

clic.keyword.extract_keywords(wordlist_analysis, wordlist_reference, tokencount_analysis, tokencount_reference, p_value=0.0001, exclude_underused=True, freq_cut_off=5, round_values=True, limit_rows=False)

This is the core method for keyword extraction. It provides a number of handles to select sections of the data and/or adapt the input for the formula.

Input = Two dataframes with columns 'Type' and 'Count', plus two total token counts
Output = An aligned dataframe sorted on the LL value, which may be filtered
using the following handles:
  • p_value: limits the keywords based on their converted p-value. A p_value of 0.0001
    selects keywords whose p-value is 0.0001 or less; it is a cut-off. One of four values can be chosen: 0.0001, 0.001, 0.01, or 0.05. Any other value is ignored and no filtering on p-value is done.
  • exclude_underused: if True (default) it filters the result by excluding tokens that are
    statistically underused.
  • freq_cut_off: limits the wordlist_analysis to words whose frequency is at least
    freq_cut_off (inclusive). 5 is a sane default for Log Likelihood. Set it to 0 to disable frequency-based filtering.
  • round_values: if True (default) it rounds the columns with expected frequencies and LL
    to 2 decimals.
  • limit_rows: if a number (for instance, 100), it limits the result to that number of rows.
    If False, rows are not limited.
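The four accepted p-values correspond to the conventional chi-squared critical values with one degree of freedom: a keyword passes a cut-off when its LL value reaches the matching threshold. The threshold table below is standard, but its exact use inside extract_keywords is an assumption, so this is only a sketch:

```python
# Chi-squared critical values (1 degree of freedom) for the four
# p-value cut-offs accepted by extract_keywords.
LL_THRESHOLDS = {
    0.05: 3.84,
    0.01: 6.63,
    0.001: 10.83,
    0.0001: 15.13,
}

def passes_cut_off(ll_value, p_value):
    """True if an LL value is significant at the given p-value cut-off.

    Unrecognised p-values disable filtering, mirroring the documented
    behaviour of the p_value handle.
    """
    threshold = LL_THRESHOLDS.get(p_value)
    if threshold is None:
        return True  # no filtering on p-value
    return ll_value >= threshold

print(passes_cut_off(16.2, 0.0001))  # 16.2 exceeds the 15.13 threshold
```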
The defaults are reasonably sane:
  • a token needs to occur at least 5 times in the corpus of analysis
  • a strict p-value cut-off (0.0001) is set
  • no filtering of the rows takes place
  • the underused tokens are excluded
  • rounding is active

For more information on the algorithm, cf. log_likelihood().

The first column contains the indices of the original merged dataframe. It does not represent a keyword rank and should be ignored for keyword analysis (to be precise, it reflects the frequency rank of the token in the corpus of analysis).

clic.keyword.facets_to_df(facets)

Converts the facets into a dataframe that can be manipulated more easily.

clic.keyword.keyword(cur, clusterlength, pvalue, subset=['all'], corpora=['dickens'], refsubset=['all'], refcorpora=['dickens'])

Main entry point.

clic.keyword.log_likelihood(counts)

This function uses vector calculations to compute LL values.

Input: dataframe that is formatted as follows:

Type, Count_analysis, Total_analysis, Count_ref, Total_ref

Output: dataframe that is formatted as follows:

Type, Count_analysis, Total_analysis, Count_ref, Total_ref, Expected_count_analysis, Expected_count_ref, LL

Hapax legomena in the Count_analysis are not deleted. It is expected that Count_analysis and Expected_count_analysis are not zero.
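The column layout above can be illustrated with a vectorised pandas computation. This is a minimal sketch of the formula applied column-wise, not the actual clic implementation:

```python
import numpy as np
import pandas as pd

def log_likelihood_sketch(counts):
    """Add expected-frequency and LL columns to a counts dataframe.

    Expects the input columns described above: Type, Count_analysis,
    Total_analysis, Count_ref, Total_ref.
    """
    df = counts.copy()
    combined = df['Count_analysis'] + df['Count_ref']
    grand_total = df['Total_analysis'] + df['Total_ref']
    df['Expected_count_analysis'] = df['Total_analysis'] * combined / grand_total
    df['Expected_count_ref'] = df['Total_ref'] * combined / grand_total

    # Count_analysis is assumed non-zero (see above); Count_ref may be
    # zero, in which case its term contributes nothing to the sum.
    term_analysis = df['Count_analysis'] * np.log(
        df['Count_analysis'] / df['Expected_count_analysis'])
    ratio_ref = (df['Count_ref'] / df['Expected_count_ref']).where(
        df['Count_ref'] > 0, 1.0)  # log(1) == 0 drops the term
    term_ref = df['Count_ref'] * np.log(ratio_ref)
    df['LL'] = 2 * (term_analysis + term_ref)
    return df
```

The whole-dataframe operations avoid a Python-level loop over types, which is what "vector calculations" refers to.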