abbyy_to_epub3 package

Submodules

abbyy_to_epub3.constants module

abbyy_to_epub3.create_epub module

class abbyy_to_epub3.create_epub.ArchiveBookItem(item_dir, item_identifier, item_bookpath)[source]

Bases: object

Archive.org is a website which contains an archive of items composed of archived digital content. Archive.org items are distributed across a cluster of machines called datanodes. In order to access the files of an item, you need to know 4 things:

  1. The Archive.org item_identifier (the unique ID of this item) e.g. https://archive.org/details/{item_identifier}

  2. the datanode server address which hosts this item

  3. the item_dir which is the file path on this datanode where this item’s files are kept

  4. the name of the files within this item_dir

Certain archive.org items are specifically structured (file organizations, contents, names) to store and play Books. Every Archive Book Item contains the following files:

  • a jp2.zip containing all the scanned images of the book

  • an abbyy file containing the OCR’d plaintext of these scans

  • scandata.xml whose metadata describes the structure of the book (metadata, page numbers)

  • meta.xml which describes the entire archive.org item

A complication is that Archive.org Book Items may contain 1 or more books. In order to accommodate this subtlety and delineate between books, an item_dir and item_identifier are not sufficient to isolate a specific book. To circumvent this limitation, we require another identifier called the item_bookpath which acts as a prefix to the files of a specific book. Given a datanode and an item_dir of an Archive Book Item, all the constituent files for a book can be constructed using item_identifier and item_bookpath in the following ways:

  • There is a single global metadata manifest file for the entire Archive Item named {item_identifier}_meta.xml.

  • All of the other book specific files follow the form {item_bookpath}_{file}. e.g. {item_bookpath}_abbyy.gz
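The naming rules above can be sketched as a small helper. This is only an illustration: the function name book_file_paths and the example identifiers are hypothetical, and the jp2.zip and scandata.xml suffixes are assumed to follow the same {item_bookpath}_{file} pattern as abbyy.gz.

```python
import os


def book_file_paths(item_dir, item_identifier, item_bookpath):
    """Build the expected file paths for one book inside an item directory."""
    def book_file(suffix):
        # Book-specific files share the item_bookpath prefix.
        return os.path.join(item_dir, "{}_{}".format(item_bookpath, suffix))

    return {
        # One item-wide manifest, named after the item identifier.
        "meta": os.path.join(item_dir, "{}_meta.xml".format(item_identifier)),
        "abbyy": book_file("abbyy.gz"),
        "scans": book_file("jp2.zip"),          # assumed suffix
        "scandata": book_file("scandata.xml"),  # assumed suffix
    }


# Hypothetical item directory and identifiers, for illustration only.
paths = book_file_paths("/items/example", "exampleitem", "examplebook")
```

For an item that holds a single book, item_identifier and item_bookpath may well be the same string; the helper does not depend on that.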

class abbyy_to_epub3.create_epub.Ebook(item_dir, item_identifier, item_bookpath, debug=False, epubcheck=None, ace=None)[source]

Bases: abbyy_to_epub3.create_epub.ArchiveBookItem

Ebook is a utility for generating epub3 files based on Archive.org items. Holds extracted information about a book & the ebooklib EPUB object.

DEFAULT_ACE_LEVEL = 'minor'
DEFAULT_EPUBCHECK_LEVEL = 'warning'
craft_epub(epub_outfile='out.epub', tmpdir=None)[source]

Assemble the extracted metadata & text into an EPUB
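A minimal usage sketch, based only on the signatures documented here. The wrapper name build_epub is hypothetical, and the sketch assumes that constructing an Ebook and calling craft_epub is enough to produce the output file; any further setup the class requires is not covered.

```python
def build_epub(item_dir, item_identifier, item_bookpath, outfile="out.epub"):
    """Create an EPUB for one book of an Archive.org item (hypothetical helper)."""
    # Imported lazily so this sketch can be loaded without the package installed.
    from abbyy_to_epub3.create_epub import Ebook

    book = Ebook(item_dir, item_identifier, item_bookpath)
    # epub_outfile is one of the documented keyword arguments of craft_epub.
    book.craft_epub(epub_outfile=outfile)
    return outfile
```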

craft_html()[source]

Assembles the XHTML content.

Create some minimal navigation:

  • Break sections at text elements marked role: heading

  • Break files at any headings with roleLevel: 1

Imperfect, but better than having no navigation or monster files.

Images will get alternative text of “Picture #” followed by an index number for this image. Barring real alternative text for true accessibility, this at least adds some identifying information.
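The file-breaking rule and the placeholder alternative text can be sketched as follows. The block dictionaries and their keys ("role", "roleLevel", "text") are assumptions made for illustration, as are the function names; the real block structure may differ.

```python
def split_into_files(blocks):
    """Group blocks into files, starting a new file at each level-1 heading."""
    files = [[]]
    for block in blocks:
        is_heading = block.get("role") == "heading"
        if is_heading and block.get("roleLevel") == "1" and files[-1]:
            files.append([])  # a level-1 heading opens a new file
        files[-1].append(block)
    return files


def placeholder_alt_text(index):
    """Alternative text of the form 'Picture #' plus the image index."""
    return "Picture #{}".format(index)


blocks = [
    {"role": "heading", "roleLevel": "1", "text": "Chapter 1"},
    {"role": "text", "text": "First paragraph."},
    {"role": "heading", "roleLevel": "2", "text": "A subsection"},
    {"role": "text", "text": "More text."},
    {"role": "heading", "roleLevel": "1", "text": "Chapter 2"},
    {"role": "text", "text": "Second chapter text."},
]
files = split_into_files(blocks)
```

Lower-level headings stay inside the current file, where they would only start a new section.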

create_accessibility_metadata()[source]

Set up accessibility metadata

extract_cover()[source]

http://web.archive.org/web/20180416230000/https://www.safaribooksonline.com/blog/2009/11/20/best-practices-in-epub-cover-images/

extract_images()[source]

Extracts all of the images for the text.

For efficiency’s sake, do these all at once. Memory & CPU will be at a higher premium than disk space, so unzip the entire scan file into temp directory, instead of extracting only the needed images.

get_cover_leaf()[source]

Try to find a cover image. If nothing is tagged as ‘Cover’, use the first page tagged ‘Title’. If nothing is tagged as ‘Title’ either, use the first page tagged ‘Normal’. self.pages is an OrderedDict, so break as soon as you find something useful, and don’t search the whole set of pages.
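The fallback order can be sketched as below. This assumes, for illustration only, that the pages mapping goes from leaf number to a page-type string; the function name find_cover_leaf is hypothetical.

```python
from collections import OrderedDict


def find_cover_leaf(pages):
    """Return the leaf number of the best cover candidate, or None."""
    for wanted in ("Cover", "Title", "Normal"):
        for leaf, page_type in pages.items():
            if page_type == wanted:
                return leaf  # stop at the first match for this type
    return None


pages = OrderedDict([(0, "Color Card"), (1, "Title"), (2, "Normal"), (3, "Normal")])
cover_leaf = find_cover_leaf(pages)
```

Because the mapping preserves page order, returning on the first match yields the earliest page of the preferred type.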

identify_headers_footers_pagenos(placement)[source]

Attempts to identify the presence of headers, footers, or page numbers

  1. Build a dict of first & last lines, indexed by page number.

  2. Try to identify headers and footers.

Headers and footers can appear on every page, or on alternating pages (for example if one page has header of the title, the facing page might have the header of the chapter name).

They may include a page number, or the page number might be a standalone header or footer.

The presence of headers and footers in the document does not mean they appear on every page (for example, chapter openings or illustrated pages sometimes don’t contain the header/footer, or contain a modified version, such as a standalone page number).

Page numbers may be in Arabic or Roman numerals.

This method does not attempt to look for all edge cases. For example, it will not find:

  • constantly varied headers, as in a dictionary

  • page numbers that don’t steadily increase

  • page numbers misidentified in the OCR process, e.g. IO2 for 102

  • page numbers with characters around them, e.g. ‘~ 45 ~’
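The page-number part of this heuristic might look like the sketch below. It is a simplification, not the method’s actual logic: it requires every candidate line to parse, whereas the description above allows pages without a number. The names parse_page_number and looks_like_page_numbers are hypothetical.

```python
import re

ARABIC = re.compile(r"^[0-9]+$")
ROMAN = re.compile(
    r"^(?=[ivxlcdm])m*(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$"
)
VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def parse_page_number(line):
    """Return the integer value of an Arabic or Roman numeral line, else None."""
    text = line.strip().lower()
    if ARABIC.match(text):
        return int(text)
    if ROMAN.match(text):
        total = 0
        for char, following in zip(text, text[1:] + " "):
            value = VALUES[char]
            # A smaller numeral before a larger one is subtracted (iv = 4).
            total += -value if VALUES.get(following, 0) > value else value
        return total
    return None


def looks_like_page_numbers(lines_by_page):
    """True if every candidate line parses and the values steadily increase."""
    numbers = [parse_page_number(line) for line in lines_by_page.values()]
    if not numbers or None in numbers:
        return False
    return all(a < b for a, b in zip(numbers, numbers[1:]))
```

As in the list above, an OCR misreading such as IO2 or a decorated number such as ‘~ 45 ~’ does not parse and is therefore not recognized.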

image_dim(block)[source]

Given a dict object containing the block info for an image, generate a tuple of its dimensions: (left, top, right, bottom)
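A sketch of the conversion, assuming the block dict carries the coordinates under the keys "l", "t", "r" and "b" as string values. These key names are an assumption for illustration, as is the function name block_dimensions.

```python
def block_dimensions(block):
    """Return (left, top, right, bottom) as integers from a block dict."""
    # Assumed keys: "l", "t", "r", "b" holding coordinate strings.
    return tuple(int(block[key]) for key in ("l", "t", "r", "b"))


dims = block_dimensions({"l": "10", "t": "20", "r": "310", "b": "420"})
```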

images_are_extracted()[source]

Given a block and our identified text structure, return True if this block’s text is a header, footer, or page number to be ignored, False otherwise.

load_scandata_pages()[source]

Parse the page-by-page scandata file. This stores page size, right or left leaf, and page type (e.g. copyright, color card, etc.).

make_chapter(heading)[source]

Create a chapter section in an ebooklib.epub.

make_image(block)[source]

Given a dict object containing the block info for an image, generate the image HTML

set_metadata()[source]

Set the metadata on the epub object

validate_a11y(epub_file, level=None)[source]

Individual test failures are logged in EARL syntax https://daisy.github.io/ace/docs/report-json/

Structurally:

"assertions": [
    {
        "@type": "earl:assertion",
        "earl:result": {
            "earl:outcome": "fail"
        },
        "assertions": [
            {
                "@type": "earl:assertion",
                "earl:result": {
                    "earl:outcome": "fail",
                    "html": "[The invalid HTML]"
                },
                "earl:test": {
                    "earl:impact": "serious",
                    "help": {
                        "dct:description": "[Plain language error]"
                    },
                }
            }
        ],
        "earl:testSubject": {
            "url": "cover.xhtml",
        },
    },
]
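A report with this structure could be walked as in the sketch below, which collects failed checks at or above a chosen impact level. The function name collect_failures is hypothetical, and the ordering of impact levels (minor, moderate, serious, critical) is an assumption rather than something stated here.

```python
import json

IMPACT_ORDER = ["minor", "moderate", "serious", "critical"]  # assumed ordering


def collect_failures(report, level="minor"):
    """Return (url, impact, description) for failures at or above level."""
    threshold = IMPACT_ORDER.index(level)
    failures = []
    for page in report.get("assertions", []):
        url = page.get("earl:testSubject", {}).get("url")
        for check in page.get("assertions", []):
            if check.get("earl:result", {}).get("earl:outcome") != "fail":
                continue
            test = check.get("earl:test", {})
            impact = test.get("earl:impact", "minor")
            if impact in IMPACT_ORDER and IMPACT_ORDER.index(impact) >= threshold:
                description = test.get("help", {}).get("dct:description")
                failures.append((url, impact, description))
    return failures


report = json.loads("""
{
  "assertions": [
    {
      "@type": "earl:assertion",
      "earl:result": {"earl:outcome": "fail"},
      "assertions": [
        {
          "@type": "earl:assertion",
          "earl:result": {"earl:outcome": "fail", "html": "[The invalid HTML]"},
          "earl:test": {
            "earl:impact": "serious",
            "help": {"dct:description": "[Plain language error]"}
          }
        }
      ],
      "earl:testSubject": {"url": "cover.xhtml"}
    }
  ]
}
""")
failures = collect_failures(report)
```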

validate_epub(epub_file, level=None)[source]

abbyy_to_epub3.parse_abbyy module

class abbyy_to_epub3.parse_abbyy.AbbyyParser(document, metadata_file, metadata, paragraphs, blocks, debug=False)[source]

Bases: object

The ABBYY parser object. Parses ABBYY metadata in preparation for import into an EPUB 3 document.

An ABBYY document begins with font and style information:

<documentData>
  <paragraphStyles>
    <paragraphStyle
      id="{idnum}" name="stylename"
      mainFontStyleId="{idnum}" [style info]>
      <fontStyle id="{idnum}" [style info]>
    </paragraphStyle>
    [more styles]
  </paragraphStyles>
</documentData>

This is followed by the data for the pages.

<page>
    <block></block>
    [more blocks]
</page>

Blocks have types. We process types Text, Picture, and Table.

Text:

<page>
        <region>
        <text> contains a '\n' as a text element
        <par> The paragraph, repeatable
            <line> The line, repeatable
                <formatting>
                <charParams>: The individual character

Picture: we know the corresponding scan (page) number, & coordinates.

Table:

<row>
  <cell>
    <text>
      <par>

Each <par> has an identifier, which has a unique style, including the paragraph’s role, eg:

<par align="Right" lineSpacing="1790"
    style="{000000DD-016F-0A36-032F-EEBBD9B8571E}">

This corresponds to a paragraphStyle from the <documentData> element:

<paragraphStyle
    id="{000000DD-016F-0A36-032F-EEBBD9B8571E}"
    name="Heading #1|1"
    mainFontStyleId="{000000DE-016F-0A37-032F-176E5F6405F5}"
    role="heading" roleLevel="1"
    [style information]>

The roles map as follows:

Role name           role
Body text           text
Footnote            footnote
Header or footer    rt
Heading             heading
Other               other
Table caption       tableCaption
Table of contents   contents
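The lookup from a paragraph’s style attribute to its role can be sketched with the standard library. The sample document below is a heavily simplified, namespace-free stand-in for a real ABBYY file, and falling back to the role other for unknown styles is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free sample; real ABBYY files carry far more detail.
SAMPLE = """
<document>
  <documentData>
    <paragraphStyles>
      <paragraphStyle id="{000000DD-016F-0A36-032F-EEBBD9B8571E}"
          name="Heading #1|1" role="heading" roleLevel="1"/>
    </paragraphStyles>
  </documentData>
  <page>
    <block blockType="Text">
      <text>
        <par align="Right"
            style="{000000DD-016F-0A36-032F-EEBBD9B8571E}"/>
      </text>
    </block>
  </page>
</document>
"""

root = ET.fromstring(SAMPLE)

# Index every paragraphStyle by id so each <par> can be resolved in one step.
styles = {
    style.get("id"): (style.get("role"), style.get("roleLevel"))
    for style in root.iter("paragraphStyle")
}

# Resolve each paragraph's role; unknown styles fall back to "other".
roles = [styles.get(par.get("style"), ("other", None)) for par in root.iter("par")]
```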

etree = ''
find_namespace()[source]

Find the namespace of an XML document. Assumes that the namespace of the first element in the context is the namespace we need. This is more memory-efficient than parsing the entire tree to get the root node.
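The idea of reading only the first element can be sketched with the standard library’s iterparse, used here in place of lxml. The function name first_namespace and the example namespace URI are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def first_namespace(source):
    """Return the namespace URI of the first element, or '' if it has none."""
    for _event, elem in ET.iterparse(source, events=("start",)):
        tag = elem.tag
        # ElementTree spells namespaced tags as "{uri}localname".
        return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""
    return ""


xml_bytes = b'<document xmlns="http://example.com/abbyy-schema"><page/></document>'
namespace = first_namespace(io.BytesIO(xml_bytes))
```

Returning from inside the loop means parsing stops as soon as the first start tag has been seen.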

is_block_type(blockattr, blocktype)[source]

Identifies if a block has the given type.

ns = ''
nsm = ''
parse_abbyy()[source]

Parse the ABBYY into a format useful for create_epub. Process the elements we will need to construct the EPUB: paragraphStyle, fontStyle, and page. We traverse the entire tree twice with iterparse, because lxml builds the whole node tree in memory even for tag-selective iterparse, and if we don’t traverse the whole tree, we can’t delete the unowned nodes. fast_iter makes the process speedy, and the dual processing saves on memory. Because of the layout of the elements in the ABBYY file, it’s too complex to do this in a single iterative pass.

parse_block(block)[source]

Parse a single block on the page.

parse_metadata()[source]

Parse out the metadata from the _meta.xml file

process_pages(elem)[source]

Iteratively process pages from the ABBYY file. We have to process now rather than copying the pages for later processing, because deepcopying an lxml element replicates the entire tree. The ABBYY seems to be sometimes inconsistent about whether these elements have a namespace, so be forgiving.

process_styles(elem)[source]

Iteratively parse styles from the ABBYY file into data structures. The ABBYY seems to be sometimes inconsistent about whether these elements have a namespace, so be forgiving.

version = ''
abbyy_to_epub3.parse_abbyy.add_last_text(blocks, page)[source]

Given a list of blocks and the page number of the last page in the list, mark up the last text block for that page in the list, if it exists.

abbyy_to_epub3.utils module

abbyy_to_epub3.utils.dirtify_xml(text)[source]

Re-adds forbidden entities to any XML string. Could cause problems in the unlikely event the string literally should be ‘&amp’

abbyy_to_epub3.utils.fast_iter(context, func)[source]

Garbage collect as you iterate to save memory. Based on a StackOverflow modification of Liza Daly’s fast_iter.
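The general technique, freeing each element once it has been handled, can be illustrated with the standard library. This is a stdlib analogue, not the lxml-based function documented here; the name iterate_and_free and its tag parameter are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iterate_and_free(source, tag, func):
    """Call func on each element named tag, discarding processed nodes."""
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # the first start event is the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            func(elem)
            root.clear()  # drop processed children so the tree does not grow


seen = []
data = b"<document><page id='1'/><page id='2'/><page id='3'/></document>"
iterate_and_free(io.BytesIO(data), "page", lambda page: seen.append(page.get("id")))
```

Anything the callback needs from an element has to be copied out before the call returns, since the element is discarded immediately afterwards.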

abbyy_to_epub3.utils.gettext(elem)[source]

Given an element, get all text from within element and its children. Strips out file artifact whitespace (unlike etree.itertext).
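One possible reading of “file artifact whitespace” is the indentation between tags, which itertext returns verbatim. The sketch below drops only whitespace-only pieces that contain a newline; this interpretation, the function name element_text, and the sample markup are all assumptions.

```python
import xml.etree.ElementTree as ET


def element_text(elem):
    """Join the text of elem and its children, skipping indentation whitespace."""
    return "".join(
        piece
        for piece in elem.itertext()
        # Indentation between tags appears as whitespace containing a newline.
        if not (piece.isspace() and "\n" in piece)
    )


line = ET.fromstring(
    "<line>\n"
    "  <charParams>H</charParams>\n"
    "  <charParams>i</charParams>\n"
    "  <charParams> </charParams>\n"
    "  <charParams>!</charParams>\n"
    "</line>"
)
text = element_text(line)
raw_text = "".join(line.itertext())
```

A space that is real content, such as the third character here, is kept because it contains no newline.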

abbyy_to_epub3.utils.is_increasing(l)[source]

Given a list, return True if the list elements are monotonically increasing, and False otherwise.
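A one-line sketch of such a check, assuming the strict reading in which equal neighbors do not count as increasing. The name strictly_increasing is hypothetical.

```python
def strictly_increasing(values):
    """True if each element is greater than the one before it."""
    # Pair each element with its successor and compare (strict reading assumed).
    return all(a < b for a, b in zip(values, values[1:]))
```

Under this reading, empty and single-element lists count as increasing because there are no pairs to compare.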

abbyy_to_epub3.utils.sanitize_xml(text)[source]

Removes forbidden entities from any XML string
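The round trip between sanitize_xml and dirtify_xml resembles escaping and unescaping with the standard library, shown below as a stand-in. Whether the package handles exactly the same set of characters is an assumption.

```python
from xml.sax.saxutils import escape, unescape

raw = "Smith & Sons <est. 1890>"

# Escaping replaces &, < and > with entities so the text is safe inside XML.
clean = escape(raw)

# Unescaping restores the original characters.
restored = unescape(clean)

# The caveat: text that was meant to contain a literal entity is changed too.
ambiguous = unescape("AT&amp;T")
```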

Module contents