clic.region.chapter: Tag chapter.* regions

Add chapter.* tags to regions.

Chapter tagging depends on metadata region tags:

>>> from .metadata import tagger_metadata

chapter.title / chapter.text regions

Any chapter heading that conforms to the definition in corpora is detected and tagged as chapter.title, and the text that follows it is marked as chapter.text. Text before the first heading is treated as chapter 0:

>>> run_tagger('''
... Initial text is the zero'th chapter.
...
... INTRODUCTION.
...
... The introduction has some chapter text.
... It's not very exciting.
...
... CHAPTER I. The first chapter
...
... The first chapter has some text.
...
... CHAPTER II. The second, empty, chapter
...
... CHAPTER III. The third, chapter
...
... ...has some text too, which goes to the very end.
... '''.strip(), tagger_metadata, tagger_chapter_title, tagger_chapter_text)
[('chapter.text', 0, 36, 0, "Initial text is the zero'th chapter."),
 ('chapter.title', 38, 51, 1, 'INTRODUCTION.'),
 ('chapter.text', 53, 116, 1, 'The introduction has...s not very exciting.'),
 ('chapter.title', 118, 146, 2, 'CHAPTER I. The first chapter'),
 ('chapter.text', 148, 180, 2, 'The first chapter has some text.'),
 ('chapter.title', 182, 220, 3, 'CHAPTER II. The second, empty, chapter'),
 ('chapter.title', 222, 253, 4, 'CHAPTER III. The third, chapter'),
 ('chapter.text', 255, 304, 4, '...has some text too...oes to the very end.')]
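
The detection described above can be sketched as follows. This is a minimal illustration, not the tagger's implementation: TITLE_RE is a hypothetical stand-in for the heading definition in corpora, and tag_chapters is a name invented for this example.

```python
import re

# Hypothetical heading pattern, for illustration only. The real definition
# lives in corpora and is not reproduced here.
TITLE_RE = re.compile(r"(?:CHAPTER [IVXLCDM]+\.|INTRODUCTION\.)")
# A block is a run of non-empty lines, so blank lines separate blocks.
BLOCK_RE = re.compile(r"[^\n]+(?:\n[^\n]+)*")


def tag_chapters(text):
    """Return (tag, start, end, chapter_number) tuples in document order."""
    regions = []
    chapter = 0      # text before the first heading belongs to chapter 0
    pending = None   # [start, end] of the chapter.text being accumulated
    for block in BLOCK_RE.finditer(text):
        if TITLE_RE.match(text, block.start(), block.end()):
            if pending:
                regions.append(("chapter.text", pending[0], pending[1], chapter))
                pending = None
            chapter += 1
            regions.append(("chapter.title", block.start(), block.end(), chapter))
        elif pending:
            pending[1] = block.end()   # extend the text across the blank line
        else:
            pending = [block.start(), block.end()]
    if pending:
        regions.append(("chapter.text", pending[0], pending[1], chapter))
    return regions
```

A heading followed directly by another heading produces a chapter.title with no chapter.text, as with the empty second chapter in the example above.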

Any regions already tagged as metadata are left out of chapter.text:

>>> run_tagger('''
... Fly Fishing
... J R Hartley
...
... Initial text is the zero'th chapter, but not including title.
...
... INTRODUCTION.
...
... The introduction has some chapter text.
... It's not very exciting.
... '''.strip(), tagger_metadata, tagger_chapter_title, tagger_chapter_text)
[('metadata.title', 0, 11, None, 'Fly Fishing'),
 ('metadata.author', 12, 23, None, 'J R Hartley'),
 ('chapter.text', 25, 86, 0, 'Initial text is the ...not including title.'),
 ('chapter.title', 88, 101, 1, 'INTRODUCTION.'),
 ('chapter.text', 103, 166, 1, 'The introduction has...s not very exciting.')]

A text without any chapter headings is tagged as a single chapter.text region:

>>> run_tagger('''
... Here is some text, without any preamble
... It's not very exciting.
... '''.strip(), tagger_metadata, tagger_chapter_title, tagger_chapter_text)
[('chapter.text', 0, 63, 0, 'Here is some text, w...s not very exciting.')]

Paragraph and sentence counts reset at the start of each new chapter. Within a chapter, sentence counts carry on from one paragraph to the next:

>>> [x for x in run_tagger('''
... Initial text is the zero'th chapter. Second sentence.
...
... INTRODUCTION.
...
... First chapter, first sentence. Second sentence. Third sentence.
...
... Second paragraph, fourth sentence. Fifth!
...
... CHAPTER I. The first chapter
...
... First chapter, first sentence. Second sentence. Third.
...
... Second paragraph, fourth sentence. Fifth!
... '''.strip(), tagger_metadata, tagger_chapter) if x[0] in ['chapter.paragraph', 'chapter.sentence']]
[('chapter.paragraph', 0, 53, 1, 'Initial text is the ...er. Second sentence.'),
 ('chapter.sentence', 0, 36, 1, "Initial text is the zero'th chapter."),
 ('chapter.sentence', 37, 53, 2, 'Second sentence.'),
 ('chapter.paragraph', 70, 133, 1, 'First chapter, first...nce. Third sentence.'),
 ('chapter.sentence', 70, 100, 1, 'First chapter, first sentence.'),
 ('chapter.sentence', 101, 117, 2, 'Second sentence.'),
 ('chapter.sentence', 118, 133, 3, 'Third sentence.'),
 ('chapter.paragraph', 135, 176, 2, 'Second paragraph, fo...rth sentence. Fifth!'),
 ('chapter.sentence', 135, 169, 4, 'Second paragraph, fourth sentence.'),
 ('chapter.sentence', 170, 176, 5, 'Fifth!'),
 ('chapter.paragraph', 208, 262, 1, 'First chapter, first...ond sentence. Third.'),
 ('chapter.sentence', 208, 238, 1, 'First chapter, first sentence.'),
 ('chapter.sentence', 239, 255, 2, 'Second sentence.'),
 ('chapter.sentence', 256, 262, 3, 'Third.'),
 ('chapter.paragraph', 264, 305, 2, 'Second paragraph, fo...rth sentence. Fifth!'),
 ('chapter.sentence', 264, 298, 4, 'Second paragraph, fourth sentence.'),
 ('chapter.sentence', 299, 305, 5, 'Fifth!')]
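
The numbering seen in this output can be sketched with two counters. This is an illustrative sketch only: number_regions is a hypothetical helper that takes region tags in document order, and it does not reflect how the tagger stores its state.

```python
def number_regions(tags):
    """Number paragraphs and sentences, resetting both at each chapter title."""
    paragraph = sentence = 0
    numbered = []
    for tag in tags:
        if tag == "chapter.title":
            # A new chapter starts: both counters go back to zero.
            paragraph = sentence = 0
        elif tag == "chapter.paragraph":
            paragraph += 1
            numbered.append((tag, paragraph))
        elif tag == "chapter.sentence":
            # Not reset per paragraph, so sentences keep counting up
            # until the next chapter title.
            sentence += 1
            numbered.append((tag, sentence))
    return numbered
```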

chapter.part regions

Chapters can be interleaved with “PART x.” or “BOOK x.” headings, which are marked as chapter.part. These headings do not influence chapter counts, and they are not part of chapter.text. For example:

>>> [x for x in run_tagger('''
... Initial text is the zero'th chapter. Second sentence.
...
... BOOK 1.
...
... CHAPTER I. The first chapter in Book 1
...
... The text in chapter 1.
...
... CHAPTER II. The second chapter
...
... The text in chapter 2.
...
... BOOK 2.
...
... Some introductory text at start of the book.
...
... CHAPTER I. The first chapter in Book 2
...
... First chapter. Note that the chapter numbers carry on from previous book
... '''.strip(), tagger_metadata, tagger_chapter) if x[0] in ['chapter.part', 'chapter.title', 'chapter.text']]
[('chapter.text', 0, 53, 0, 'Initial text is the ...er. Second sentence.'),
 ('chapter.part', 55, 62, 1, 'BOOK 1.'),
 ('chapter.title', 64, 102, 1, 'CHAPTER I. The first chapter in Book 1'),
 ('chapter.text', 104, 126, 1, 'The text in chapter 1.'),
 ('chapter.title', 128, 158, 2, 'CHAPTER II. The second chapter'),
 ('chapter.text', 160, 182, 2, 'The text in chapter 2.'),
 ('chapter.part', 184, 191, 2, 'BOOK 2.'),
 ('chapter.text', 193, 237, 2, 'Some introductory te...t start of the book.'),
 ('chapter.title', 239, 277, 3, 'CHAPTER I. The first chapter in Book 2'),
 ('chapter.text', 279, 351, 3, 'First chapter. Note ...n from previous book')]
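
The way part headings leave the chapter count alone can be sketched as below. PART_RE, CHAPTER_RE and classify_blocks are hypothetical names for this example, the patterns are guesses rather than the definitions in corpora, and the value the real tagger stores on a chapter.part region is not modelled (None is used instead).

```python
import re

# Hypothetical patterns, for illustration only.
PART_RE = re.compile(r"(?:PART|BOOK) \w+\.")
CHAPTER_RE = re.compile(r"CHAPTER \w+\.")


def classify_blocks(blocks):
    """Return (tag, chapter_number, block) for each blank-line-separated block."""
    chapter = 0
    classified = []
    for block in blocks:
        if PART_RE.match(block):
            # Part headings are tagged but never advance the chapter count.
            classified.append(("chapter.part", None, block))
        elif CHAPTER_RE.match(block):
            chapter += 1
            classified.append(("chapter.title", chapter, block))
        else:
            classified.append(("chapter.text", chapter, block))
    return classified
```

Because the counter is never reset by a part heading, the first chapter of the second book continues from the last chapter of the first book.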

chapter.paragraph / chapter.sentence regions

All chapter.text is broken up into chapter.paragraph regions. A paragraph boundary is a sequence of 2 newlines within the chapter text (i.e. a blank line in the text).
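
A minimal sketch of this split, assuming the chapter.text region is given as start and end offsets into the book text. split_paragraphs is a hypothetical helper, not part of clic.

```python
def split_paragraphs(text, start, end):
    """Return (start, end) offsets of the paragraphs inside text[start:end]."""
    spans = []
    pos = start
    while pos < end:
        # A paragraph ends at the next pair of newlines, or at the region end.
        brk = text.find("\n\n", pos, end)
        if brk == -1:
            brk = end
        if brk > pos:
            spans.append((pos, brk))
        pos = brk + 2   # skip over the two newlines
    return spans
```

The offsets stay relative to the whole text, so the paragraph regions can be compared directly with the chapter.text region that contains them. Runs of three or more newlines are not given special treatment in this sketch.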

Each chapter.paragraph is then broken up into chapter.sentence regions, following the Unicode sentence segmentation rules in [UAX29] as implemented by the [ICU] library.

  • We use the en_GB@ss=standard locale (ss=standard tells ICU not to treat abbreviations like “Mr. Jones” as a sentence break).
  • Before applying the algorithm, we remove newlines from the chapter text, since we do not want to treat them as sentence breaks (which ICU does by default).
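
The newline handling can be sketched as below. Two assumptions are made: newlines are replaced with spaces (rather than deleted) so that character offsets stay valid, and a deliberately naive regular expression stands in for ICU. BOUNDARY_RE and split_sentences are hypothetical names, and unlike ICU with ss=standard this stand-in would break after “Mr.”.

```python
import re

# Naive stand-in for ICU sentence segmentation: a sentence ends at ., ! or ?
# (plus any closing quotes) when followed by whitespace.
BOUNDARY_RE = re.compile(r"([.!?][\"'\u201d\u2019]*)(\s+)")


def split_sentences(paragraph, offset=0):
    """Return (start, end) offsets of sentences, shifted by offset."""
    # Swap each newline for a space so a line ending alone is never a break.
    # The replacement has the same length, so offsets still match the original.
    flat = paragraph.replace("\n", " ")
    spans = []
    start = 0
    for match in BOUNDARY_RE.finditer(flat):
        spans.append((offset + start, offset + match.end(1)))
        start = match.end()   # next sentence begins after the whitespace
    if start < len(flat):
        spans.append((offset + start, offset + len(flat)))
    return spans
```

Slicing the original paragraph with the returned offsets gives back sentences that still contain their line breaks, which is why the doctest output in this section shows sentences with an embedded \n.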

The following shows the splitting of both paragraphs and sentences:

>>> run_tagger('''
... “Thou find’st it out, child?  Ay, ’tis worth all the feather-beds and
... pouncet-boxes in Ulm; is it not?  That accursed Italian fever never left
... me till I came up here.  A man can scarce draw breath in your foggy
... meadows below there.  Now then, here is the view open.  What think you of
... the Eagle’s Nest?”
...
... “And this is Schloss Adlerstein?” she exclaimed.
...
... “That is Schloss Adlerstein; and there shalt thou be in two hours’ time,
... unless the devil be more than usually busy, or thou mak’st a fool of
... thyself.  If so, not Satan himself could save thee.”
...
... '''.strip(), tagger_metadata, tagger_chapter)
[('chapter.text', 0, 549, 0, '“Thou find’st it out...lf could save thee.”'),
 ('chapter.paragraph', 0, 303, 1, '“Thou find’st it out...f\nthe Eagle’s Nest?”'),
 ('chapter.sentence', 0, 28, 1, '“Thou find’st it out, child?'),
 ('chapter.sentence', 30, 102, 2, 'Ay, ’tis worth all t...s in Ulm; is it not?'),
 ('chapter.sentence', 104, 166, 3, 'That accursed Italia...till I came up here.'),
 ('chapter.sentence', 168, 231, 4, 'A man can scarce dra...meadows below there.'),
 ('chapter.sentence', 233, 265, 5, 'Now then, here is the view open.'),
 ('chapter.sentence', 267, 303, 6, 'What think you of\nthe Eagle’s Nest?”'),
 ('chapter.paragraph', 305, 353, 2, '“And this is Schloss...ein?” she exclaimed.'),
 ('chapter.sentence', 305, 338, 7, '“And this is Schloss Adlerstein?”'),
 ('chapter.sentence', 339, 353, 8, 'she exclaimed.'),
 ('chapter.paragraph', 355, 549, 3, '“That is Schloss Adl...lf could save thee.”'),
 ('chapter.sentence', 355, 505, 9, '“That is Schloss Adl...t a fool of\nthyself.'),
 ('chapter.sentence', 507, 549, 10, 'If so, not Satan him...lf could save thee.”')]

Left to its defaults, ICU would place a sentence break at the end of a line even where there is no punctuation. Instead, we ignore line endings unless they would be a sentence break anyway. We also don’t break after the abbreviation in “Mr. Oliver”:

>>> [x for x in run_tagger('''
... modest-looking little shop-window, containing a few newspapers, some
... Rather yellow packets of stationery, and two or three books of ballads.
... Above the door was painted in very small, dingy letters, the words,
... "James Oliver, News Agent."
...
... So if you wish to stay here with my brother, Mr. Oliver, and this little
... girl, Miss Dorothy Raleigh, as I suppose her name is, you must get all these things.
... '''.strip(), tagger_metadata, tagger_chapter) if x[0] in ('chapter.paragraph', 'chapter.sentence')]
[('chapter.paragraph', 0, 236, 1, 'modest-looking littl...Oliver, News Agent."'),
 ('chapter.sentence', 0, 140, 1, 'modest-looking littl...ee books of ballads.'),
 ('chapter.sentence', 141, 236, 2, 'Above the door was p...Oliver, News Agent."'),
 ('chapter.paragraph', 238, 395, 2, 'So if you wish to st...et all these things.'),
 ('chapter.sentence', 238, 395, 3, 'So if you wish to st...et all these things.')]

clic.region.chapter.tagger_chapter(book)

Add chapter.* tags to (book)

clic.region.chapter.tagger_chapter_paragraph(book)

Add chapter.paragraph tags to (book)

clic.region.chapter.tagger_chapter_part(book)

Add chapter.part tags to (book)

clic.region.chapter.tagger_chapter_sentence(book)

Add chapter.sentence tags to (book)

clic.region.chapter.tagger_chapter_text(book)

Add chapter.text tags to (book)

clic.region.chapter.tagger_chapter_title(book)

Add chapter.title tags to (book)