abbyy_to_epub3 package

Submodules

abbyy_to_epub3.constants module

abbyy_to_epub3.create_epub module

abbyy_to_epub3.parse_abbyy module

class abbyy_to_epub3.parse_abbyy.AbbyyParser(document, metadata_file, metadata, paragraphs, blocks, debug=False)

Bases: object

The ABBYY parser object. Parses ABBYY metadata in preparation for import into an EPUB 3 document.

Here are the components of the ABBYY schema we use:

<page>
    <block>types Picture, Separator, Table, or Text</block>

Text:

<page>
        <region>
        <text> contains a '\n' as a text element
        <par> The paragraph, repeatable
            <line> The line, repeatable
                <formatting>
                <charParams>: The individual character

Image:

Separator:

Table:

<row>
  <cell>
    <text>
      <par>
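The hierarchy above can be walked with any XML tree API. The following is a minimal sketch, not the package's own code: it uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml, a hypothetical, un-namespaced sample document, and it assumes the block type is carried in a `blockType` attribute. The helper name `paragraphs_by_page` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed sample. Real ABBYY files are namespaced
# and much larger; the blockType attribute name is an assumption.
SAMPLE = """
<document>
  <page>
    <block blockType="Text">
      <text>
        <par>
          <line><formatting>Hello</formatting></line>
          <line><formatting>world</formatting></line>
        </par>
      </text>
    </block>
    <block blockType="Picture"/>
  </page>
</document>
"""


def paragraphs_by_page(xml_text):
    """Return one list per page holding the text of each <par> in Text blocks."""
    root = ET.fromstring(xml_text)
    pages = []
    for page in root.iter("page"):
        pars = []
        for block in page.iter("block"):
            if block.get("blockType") != "Text":
                continue  # skip Picture, Separator, and Table blocks
            for par in block.iter("par"):
                # Join the text of every <line> in the paragraph.
                lines = ["".join(line.itertext()).strip()
                         for line in par.iter("line")]
                pars.append(" ".join(lines))
        pages.append(pars)
    return pages


result = paragraphs_by_page(SAMPLE)
```

With a real, namespaced file, each tag name would need the namespace prefix in `{uri}tag` form.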

Each paragraph carries a style identifier that points to a unique paragraph style, which includes the paragraph’s role, e.g.:

 <paragraphStyle
     id="{000000DD-016F-0A36-032F-EEBBD9B8571E}"
     name="Heading #1|1"
     mainFontStyleId="{000000DE-016F-0A37-032F-176E5F6405F5}"
     role="heading"
     roleLevel="1"
     align="Right"
     startIndent="0" leftIndent="0"
     rightIndent="0" lineSpacing="1790" fixedLineSpacing="1">
<par align="Right" lineSpacing="1790"
     style="{000000DD-016F-0A36-032F-EEBBD9B8571E}">
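A paragraph's role can be resolved by indexing the styles by `id` and looking up the `style` attribute of each `<par>`. This is a minimal sketch under stated assumptions: it uses the standard library's `xml.etree.ElementTree` rather than lxml, the `<paragraphStyles>` wrapper element and the un-namespaced sample are hypothetical, and `build_style_index` is an illustrative name, not part of the package.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample built from the attributes shown above.
SAMPLE = """
<document>
  <paragraphStyles>
    <paragraphStyle id="{000000DD-016F-0A36-032F-EEBBD9B8571E}"
                    name="Heading #1|1" role="heading" roleLevel="1"/>
  </paragraphStyles>
  <par align="Right" style="{000000DD-016F-0A36-032F-EEBBD9B8571E}"/>
</document>
"""


def build_style_index(root):
    """Map each paragraphStyle id to its role and role level."""
    return {
        style.get("id"): {"role": style.get("role"),
                          "level": style.get("roleLevel")}
        for style in root.iter("paragraphStyle")
    }


root = ET.fromstring(SAMPLE)
styles = build_style_index(root)
par = root.find("par")
par_role = styles[par.get("style")]  # look up the role via the style id
```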

The roles map as follows:

Role name            role
Body text            text
Footnote             footnote
Header or footer     rt
Heading              heading
Other                other
Table caption        tableCaption
Table of contents    contents
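For quick lookups, the role table can be expressed as a dictionary keyed by the `role` attribute value. This is only an illustrative sketch; the names `ROLE_NAMES` and `describe_role` are hypothetical and do not come from the package.

```python
# Role attribute value -> human-readable role name, copied from the table.
ROLE_NAMES = {
    "text": "Body text",
    "footnote": "Footnote",
    "rt": "Header or footer",
    "heading": "Heading",
    "other": "Other",
    "tableCaption": "Table caption",
    "contents": "Table of contents",
}


def describe_role(role):
    """Return the readable name for a role value, or a fallback label."""
    return ROLE_NAMES.get(role, "Unknown role")
```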
etree = ''
is_block_type(elem, blocktype)

Identifies whether an XML element is a block of the given type, such as a text block.
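A check of this kind might look like the sketch below. It is not the method's actual body: it is written as a free function, uses the standard library's `xml.etree.ElementTree` instead of lxml, and assumes a placeholder namespace URI and a `blockType` attribute.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace; a real ABBYY file declares its own URI.
NS = "http://example.com/abbyy"


def is_block_type(elem, blocktype, ns=NS):
    """Return True if elem is a <block> whose blockType matches blocktype."""
    return elem.tag == "{%s}block" % ns and elem.get("blockType") == blocktype


text_block = ET.fromstring('<block xmlns="%s" blockType="Text"/>' % NS)
region = ET.fromstring('<region xmlns="%s"/>' % NS)
```

Here `is_block_type(text_block, "Text")` is true, while the `<region>` element fails the tag check regardless of the requested type.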

ns = ''
nsm = ''
parse_abbyy()

Read the ABBYY file into an lxml etree.
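Reading the file and recording its namespace could be sketched as follows. This is an assumption-laden illustration: it uses the standard library's `xml.etree.ElementTree` rather than lxml, the sample document and namespace URI are placeholders, and the guess that the class attributes `ns` and `version` hold the root namespace and the root's `version` attribute is not confirmed by the documentation above.

```python
import io
import xml.etree.ElementTree as ET


def read_abbyy(source):
    """Parse an ABBYY XML source; return (tree, namespace, version)."""
    tree = ET.parse(source)
    root = tree.getroot()
    # Namespaced tags look like "{uri}document"; pull out the uri part.
    if root.tag.startswith("{"):
        ns = root.tag[1:].split("}", 1)[0]
    else:
        ns = ""
    return tree, ns, root.get("version", "")


# Placeholder document standing in for a real ABBYY file on disk.
sample = io.BytesIO(
    b'<document xmlns="http://example.com/abbyy" version="1.0"><page/></document>'
)
tree, ns, version = read_abbyy(sample)
```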

parse_content()

Parse each page of the book.

parse_metadata()

Parse the metadata out of the _meta.xml file.
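As a rough illustration, metadata of this kind can be collected into a dictionary of lists so that repeated fields are preserved. The layout of the sample below (a `<metadata>` root with flat child elements) is an assumption, as is the helper name `read_metadata`; the sketch uses the standard library's `xml.etree.ElementTree`, not the package's own parser.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical metadata file content; the real layout may differ.
SAMPLE_META = """
<metadata>
  <title>Example Book</title>
  <creator>A. Author</creator>
  <creator>B. Author</creator>
  <language>eng</language>
</metadata>
"""


def read_metadata(xml_text):
    """Collect each child element's text under its tag name."""
    metadata = defaultdict(list)
    for child in ET.fromstring(xml_text):
        if child.text and child.text.strip():
            metadata[child.tag].append(child.text.strip())
    return dict(metadata)


meta = read_metadata(SAMPLE_META)
```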

parse_paragraph_styles()

Paragraph styles are defined on their own at the start of the ABBYY file.

version = ''
abbyy_to_epub3.parse_abbyy.add_last_text(blocks, page)

Given a list of blocks and the page number of the last page in the list, mark up the last text block for that page in the list, if it exists.
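One way to picture this behavior is the sketch below. The block representation is an assumption: each block is taken to be a dictionary with hypothetical `type` and `page_no` keys, and the marker is a hypothetical `last` key. The function name `mark_last_text` is also illustrative, so treat this as a reading of the description rather than the package's implementation.

```python
def mark_last_text(blocks, page):
    """Flag the final Text block belonging to `page` at the end of `blocks`."""
    for block in reversed(blocks):
        if block.get("page_no") != page:
            break  # walked back past the last page; nothing to mark
        if block.get("type") == "Text":
            block["last"] = True
            break
    return blocks


blocks = [
    {"type": "Text", "page_no": 1},
    {"type": "Text", "page_no": 2},
    {"type": "Picture", "page_no": 2},
]
mark_last_text(blocks, 2)
```

After the call, only the second block carries the `last` flag: the trailing Picture block is skipped, and the page 1 block is never reached.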

abbyy_to_epub3.parse_abbyy.gettext(elem)

abbyy_to_epub3.utils module

abbyy_to_epub3.utils.dirtify_xml(text)

Re-adds forbidden entities to any XML string. This could cause problems in the unlikely event that the string should literally contain ‘&amp’.
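The caveat can be demonstrated with the standard library's `xml.sax.saxutils.unescape`, used here as a stand-in; whether `dirtify_xml` handles the same set of entities is an assumption, and `dirtify_sketch` is a hypothetical name.

```python
from xml.sax.saxutils import unescape


def dirtify_sketch(text):
    """Turn &amp;, &lt; and &gt; back into the raw characters."""
    return unescape(text)


# Normal case: escaped text is restored to its original form.
restored = dirtify_sketch("Tom &amp; Jerry &lt;3")

# Caveat: text that was meant to show a literal entity loses it.
collapsed = dirtify_sketch("Type &amp; to get an ampersand")
```

In the second call the literal entity the author wanted to display is reduced to a bare `&`, which is the problem the description warns about.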

abbyy_to_epub3.utils.is_increasing(l)

Given a list, return True if the list elements are monotonically increasing, and False otherwise.
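A check like this can be written by comparing each element with its successor. The sketch below assumes a strict reading, where equal neighbors do not count as increasing; the description does not say which reading the package uses, so swap `<` for `<=` if repeated values should be accepted.

```python
def is_increasing(values):
    """Return True if every element is strictly greater than the one before."""
    return all(a < b for a, b in zip(values, values[1:]))
```

Under this reading, empty and single-element lists are trivially increasing because there are no adjacent pairs to compare.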

abbyy_to_epub3.utils.sanitize_xml(text)

Removes forbidden entities from any XML string.
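If "removing" is read as escaping the characters that XML reserves, the behavior resembles the standard library's `xml.sax.saxutils.escape`. That reading is an assumption, and `sanitize_sketch` is a hypothetical name rather than the package's function.

```python
from xml.sax.saxutils import escape


def sanitize_sketch(text):
    """Escape &, < and > so the text is safe inside an XML document."""
    return escape(text)


safe = sanitize_sketch("Tom & Jerry <3")
```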

Module contents