abbyy_to_epub3 package
Submodules
abbyy_to_epub3.constants module
abbyy_to_epub3.create_epub module
class abbyy_to_epub3.create_epub.ArchiveBookItem(item_dir, item_identifier, item_bookpath)
Bases: object
Archive.org is a website which contains an archive of items composed of archived digital content. Archive.org items are distributed across a cluster of machines called datanodes. In order to access the files of an item, you need to know 4 things:
- the Archive.org item_identifier (the unique ID of this item), e.g. https://archive.org/details/{item_identifier}
- the datanode server address which hosts this item
- the item_dir, which is the file path on this datanode where this item's files are kept
- the names of the files within this item_dir
Certain archive.org items are specifically structured (file organization, contents, names) to store and play books. Every Archive Book Item contains the following files:
- a jp2.zip containing all the scanned images of the book
- an abbyy file containing the OCR'd plaintext of these scans
- scandata.xml, whose metadata describes the structure of the book (metadata, page numbers)
- meta.xml, which describes the entire archive.org item
A complication is that Archive.org Book Items may contain 1 or more books. In order to accommodate this subtlety and delineate between books, an item_dir and item_identifier are not sufficient to isolate a specific book. To circumvent this limitation, we require another identifier called the item_bookpath which acts as a prefix to the files of a specific book. Given a datanode and an item_dir of an Archive Book Item, all the constituent files for a book can be constructed using item_identifier and item_bookpath in the following ways:
There is a single global metadata manifest file for the entire Archive Item named {item_identifier}_meta.xml.
All of the other book specific files follow the form {item_bookpath}_{file}. e.g. {item_bookpath}_abbyy.gz
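The naming rules above can be sketched as a small helper. This is an illustration, not the package's code: the function name `book_file_paths` is hypothetical, and the suffixes `jp2.zip` and `scandata.xml` are assumptions extrapolated from the `{item_bookpath}_{file}` pattern. Only `{item_identifier}_meta.xml` and `{item_bookpath}_abbyy.gz` are stated in the text.

```python
import os


def book_file_paths(item_dir, item_identifier, item_bookpath):
    """Build the expected file paths for one book inside an item directory.

    Hypothetical helper for illustration only.
    """
    def book_file(suffix):
        # Book-specific files share the item_bookpath prefix.
        return os.path.join(item_dir, "{}_{}".format(item_bookpath, suffix))

    return {
        # One global manifest per item, named after the item identifier.
        "meta": os.path.join(item_dir, "{}_meta.xml".format(item_identifier)),
        "abbyy": book_file("abbyy.gz"),
        "jp2": book_file("jp2.zip"),            # assumed suffix
        "scandata": book_file("scandata.xml"),  # assumed suffix
    }
```

For an item holding several books, calling the helper once per item_bookpath would give each book its own set of paths while the meta.xml path stays the same.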
class abbyy_to_epub3.create_epub.Ebook(item_dir, item_identifier, item_bookpath, debug=False, epubcheck=None, ace=None)
Bases: abbyy_to_epub3.create_epub.ArchiveBookItem
Ebook is a utility for generating epub3 files based on Archive.org items. Holds extracted information about a book & the ebooklib EPUB object.
DEFAULT_ACE_LEVEL = 'minor'
DEFAULT_EPUBCHECK_LEVEL = 'warning'
craft_epub(epub_outfile='out.epub', tmpdir=None)
Assemble the extracted metadata & text into an EPUB.
craft_html()
Assembles the XHTML content.
Create some minimal navigation:
* Break sections at text elements marked role: heading
* Break files at any headings with roleLevel: 1
Imperfect, but better than having no navigation or monster files.
Images will get alternative text of “Picture #” followed by an index number for this image. Barring real alternative text for true accessibility, this at least adds some identifying information.
extract_images()
Extracts all of the images for the text.
For efficiency’s sake, do these all at once. Memory & CPU will be at a higher premium than disk space, so unzip the entire scan file into a temp directory, instead of extracting only the needed images.
get_cover_leaf()
Try to find a cover image. If nothing is tagged as ‘Cover’, use the first page tagged ‘Title’. If nothing is tagged as ‘Title’ either, use the first page tagged ‘Normal’. self.pages is an OrderedDict, so break as soon as you find something useful, and don’t search the whole set of pages.
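The fallback order described for the cover search can be sketched as follows. The function name `find_cover_leaf` and the shape of `pages` (an OrderedDict mapping a leaf number to its page type string) are assumptions for illustration; the real self.pages may store richer records.

```python
from collections import OrderedDict


def find_cover_leaf(pages):
    """Return the leaf number of the best cover candidate, or None.

    `pages` is assumed to map leaf number -> page type, in page order.
    """
    for wanted in ("Cover", "Title", "Normal"):
        for leaf, page_type in pages.items():
            if page_type == wanted:
                # Pages are ordered, so the first match is the earliest page.
                return leaf
    return None
```

Each inner loop stops at the first match, so a book with a page tagged ‘Cover’ never falls through to the ‘Title’ or ‘Normal’ passes.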
Attempts to identify the presence of headers, footers, or page numbers
Build a dict of first & last lines, indexed by page number.
Try to identify headers and footers.
Headers and footers can appear on every page, or on alternating pages (for example if one page has header of the title, the facing page might have the header of the chapter name).
They may include a page number, or the page number might be a standalone header or footer.
The presence of headers and footers in the document does not mean they appear on every page (for example, chapter openings or illustrated pages sometimes don’t contain the header/footer, or contain a modified version, such as a standalone page number).
Page numbers may be in Arabic or Roman numerals.
This method does not attempt to look for all edge cases. For example, it will not find:
- constantly varied headers, as in a dictionary
- page numbers that don’t steadily increase
- page numbers misidentified in the OCR process, e.g. IO2 for 102
- page numbers with characters around them, e.g. ‘~ 45 ~’
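To make the page number rules concrete, here is a minimal sketch of recognizing a standalone page number in Arabic or Roman numerals. It is not the package's detection code; `page_number_value` and the regular expression are illustrative assumptions. It deliberately shares the limitations listed above: OCR errors such as IO2 and decorated numbers such as ‘~ 45 ~’ are not recognized.

```python
import re

# Strict Roman numeral pattern (1 to 3999), case-insensitive.
ROMAN_RE = re.compile(
    r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$",
    re.IGNORECASE,
)
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def page_number_value(line):
    """Return the integer value of a standalone page number, else None."""
    text = line.strip()
    if re.fullmatch(r"[0-9]+", text):
        return int(text)
    if text and ROMAN_RE.match(text):
        upper = text.upper()
        total = 0
        for i, char in enumerate(upper):
            value = ROMAN_VALUES[char]
            # Subtractive notation: a smaller numeral before a larger one.
            if i + 1 < len(upper) and value < ROMAN_VALUES[upper[i + 1]]:
                total -= value
            else:
                total += value
        return total
    return None
```

A caller could apply this to the first and last line of each page and then check whether the resulting values increase steadily from page to page.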
image_dim(block)
Given a dict object containing the block info for an image, generate a tuple of its dimensions: (left, top, right, bottom)
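A minimal sketch of this kind of conversion, assuming the block dict stores its coordinates as strings under the keys `l`, `t`, `r`, and `b`. Those key names and the function name `image_dim_sketch` are assumptions made for this example; the real block layout may differ.

```python
def image_dim_sketch(block):
    """Return (left, top, right, bottom) as integers from a block dict.

    The keys 'l', 't', 'r', 'b' are assumed for illustration.
    """
    return tuple(int(block[key]) for key in ("l", "t", "r", "b"))


def box_size(dim):
    """Return (width, height) of a (left, top, right, bottom) box."""
    left, top, right, bottom = dim
    return (right - left, bottom - top)
```

With the corners in a fixed order, the width and height needed to crop the image out of the page scan follow by subtraction.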
Given a block and our identified text structure, return True if this block’s text is a header, footer, or page number to be ignored, False otherwise.
load_scandata_pages()
Parse the page-by-page scandata file. This stores page size, right or left leaf, and page type (e.g. copyright, color card, etc.).
make_image(block)
Given a dict object containing the block info for an image, generate the image HTML.
validate_a11y(epub_file, level=None)
Individual test failures are logged in EARL syntax https://daisy.github.io/ace/docs/report-json/
Structurally:
    "assertions": [
        {
            "@type": "earl:assertion",
            "earl:result": {
                "earl:outcome": "fail"
            },
            "assertions": [
                {
                    "@type": "earl:assertion",
                    "earl:result": {
                        "earl:outcome": "fail",
                        "html": "[The invalid HTML]"
                    },
                    "earl:test": {
                        "earl:impact": "serious",
                        "help": {
                            "dct:description": "[Plain language error]"
                        },
                    }
                }
            ],
            "earl:testSubject": {
                "url": "cover.xhtml",
            },
        },
    ]
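The nested report structure above can be walked with two loops: one over the per-file assertions and one over the individual test assertions inside each. The sketch below assumes the report has already been loaded into a dict (for example with json.load) and uses only the keys shown above; `collect_failures` is a hypothetical name, not a method of this class.

```python
def collect_failures(report):
    """Flatten failed assertions from an Ace-style report dict.

    Returns one record per failed test, tagged with the file it came from.
    """
    failures = []
    for file_assertion in report.get("assertions", []):
        url = file_assertion.get("earl:testSubject", {}).get("url")
        for assertion in file_assertion.get("assertions", []):
            result = assertion.get("earl:result", {})
            if result.get("earl:outcome") != "fail":
                continue
            test = assertion.get("earl:test", {})
            failures.append({
                "file": url,
                "impact": test.get("earl:impact"),
                "description": test.get("help", {}).get("dct:description"),
                "html": result.get("html"),
            })
    return failures
```

Filtering the returned records by their "impact" value is one way a severity threshold such as the level argument could be applied.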
abbyy_to_epub3.parse_abbyy module
class abbyy_to_epub3.parse_abbyy.AbbyyParser(document, metadata_file, metadata, paragraphs, blocks, debug=False)
Bases: object
The ABBYY parser object. Parses ABBYY metadata in preparation for import into an EPUB 3 document.
An ABBYY document begins with font and style information:
<documentData>
  <paragraphStyles>
    <paragraphStyle id="{idnum}" name="stylename" mainFontStyleId="{idnum}" [style info]>
      <fontStyle id="{idnum}" [style info]>
    </paragraphStyle>
    [more styles]
</documentData>
This is followed by the data for the pages.
<page> <block></block> [more blocks] </page>
Blocks have types. We process types Text, Picture, and Table.
Text:
<page>
<region>
<text> contains a '\n' as a text element
<par> The paragraph, repeatable
<line> The line, repeatable
<formatting>
<charParams>: The individual character
Picture: we know the corresponding scan (page) number, & coordinates.
Table:
<row> <cell> <text> <par>
Each <par> has a style identifier that points to a unique style, which includes the paragraph’s role, e.g.:
<par align="Right" lineSpacing="1790" style="{000000DD-016F-0A36-032F-EEBBD9B8571E}">
This corresponds to a paragraphStyle from the <documentData> element:
<paragraphStyle id="{000000DD-016F-0A36-032F-EEBBD9B8571E}" name="Heading #1|1" mainFontStyleId="{000000DE-016F-0A37-032F-176E5F6405F5}" role="heading" roleLevel="1" [style information]>
The roles map as follows:
Role name          role
Body text          text
Footnote           footnote
Header or footer   rt
Heading            heading
Other              other
Table caption      tableCaption
Table of contents  contents
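The link between a <par> and its role can be sketched as a two-step lookup: index the paragraphStyle elements by id, then resolve each paragraph's style attribute against that index. This example uses the standard library's xml.etree.ElementTree on a small hand-written, namespace-free sample; the sample document, the `build_style_roles` name, and the absence of a namespace are all assumptions for illustration, and the parser itself works on real ABBYY files with lxml.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, trimmed to the attributes used here.
SAMPLE = """
<document>
  <documentData>
    <paragraphStyles>
      <paragraphStyle id="{000000DD-016F-0A36-032F-EEBBD9B8571E}"
                      name="Heading #1|1" role="heading" roleLevel="1"/>
    </paragraphStyles>
  </documentData>
  <page>
    <block blockType="Text">
      <text>
        <par align="Right" style="{000000DD-016F-0A36-032F-EEBBD9B8571E}"/>
      </text>
    </block>
  </page>
</document>
"""


def build_style_roles(root):
    """Map each paragraphStyle id to its (role, roleLevel) pair."""
    roles = {}
    for style in root.iter("paragraphStyle"):
        roles[style.get("id")] = (style.get("role"), style.get("roleLevel"))
    return roles


root = ET.fromstring(SAMPLE)
style_roles = build_style_roles(root)
par = root.find(".//par")
role, level = style_roles[par.get("style")]
print(role, level)
```

A paragraph resolved to role heading with roleLevel 1 is the kind of element the navigation logic would treat as the start of a new file.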
etree = ''
find_namespace()
Find the namespace of an XML document. Assumes that the namespace of the first element in the context is the namespace we need. This is more memory-efficient than parsing the entire tree to get the root node.
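A minimal sketch of reading the namespace from the first element only, using the standard library's xml.etree.ElementTree.iterparse rather than lxml. The function name `find_namespace_sketch` and the placeholder namespace URI in the usage example are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET


def find_namespace_sketch(source):
    """Return the namespace URI of the first element, or '' if none."""
    for _event, elem in ET.iterparse(source, events=("start",)):
        # ElementTree writes namespaced tags as '{uri}localname'.
        if elem.tag.startswith("{"):
            return elem.tag[1:].split("}", 1)[0]
        return ""
    return ""


# Placeholder namespace, not the real ABBYY schema URI.
doc = b'<document xmlns="http://example.com/ns"><page/></document>'
print(find_namespace_sketch(io.BytesIO(doc)))
```

Returning on the first start event means the rest of the document is never turned into a tree, which is the memory saving the description refers to.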
ns = ''
nsm = ''
parse_abbyy()
Parse the ABBYY into a format useful for create_epub. Process the elements we will need to construct the EPUB: paragraphStyle, fontStyle, and page. We traverse the entire tree twice with iterparse, because lxml builds the whole node tree in memory even for tag-selective iterparse, and if we don’t traverse the whole tree, we can’t delete the unowned nodes. fast_iter makes the process speedy, and the dual processing saves on memory. Because of the layout of the elements in the ABBYY file, it’s too complex to do this in a single iterative pass.
process_pages(elem)
Iteratively process pages from the ABBYY file. We have to process now rather than copying the pages for later processing, because deepcopying an lxml element replicates the entire tree. The ABBYY seems to be sometimes inconsistent about whether these elements have a namespace, so be forgiving.
process_styles(elem)
Iteratively parse styles from the ABBYY file into data structures. The ABBYY seems to be sometimes inconsistent about whether these elements have a namespace, so be forgiving.
version = ''
abbyy_to_epub3.utils module
abbyy_to_epub3.utils.dirtify_xml(text)
Re-adds forbidden entities to any XML string. Could cause problems in the unlikely event the string literally should be ‘&’.
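One plausible reading of this description is that escaped entity references are turned back into the raw characters that XML forbids in text. The sketch below follows that reading. Both the direction of the conversion and the set of entities handled (&lt;, &gt;, &amp;) are assumptions, and `dirtify_xml_sketch` is not the package's function.

```python
def dirtify_xml_sketch(text):
    """Turn escaped XML entities back into raw characters (assumed behavior)."""
    # Replace &amp; last, so that '&amp;lt;' becomes '&lt;' and not '<'.
    for entity, char in (("&lt;", "<"), ("&gt;", ">"), ("&amp;", "&")):
        text = text.replace(entity, char)
    return text


print(dirtify_xml_sketch("Tom &amp; Jerry &lt;3"))
```

The caveat in the description applies here too: text that was meant to contain a literal entity reference is rewritten along with everything else.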
abbyy_to_epub3.utils.fast_iter(context, func)
Garbage collect as you iterate to save memory. Based on a StackOverflow modification of Liza Daly’s fast_iter.
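Liza Daly's fast_iter is written for lxml: it clears each element after processing and deletes the already-processed siblings that precede it. The sketch below shows the same process-then-release idea using only the standard library's xml.etree.ElementTree, so it is an analogue and not this package's implementation. The name `fast_iter_sketch`, its `tag` parameter, and the assumption that the target elements are direct children of the root are all specific to this example.

```python
import io
import xml.etree.ElementTree as ET


def fast_iter_sketch(source, func, tag):
    """Call func on each completed `tag` element, then release it.

    Assumes the `tag` elements are direct children of the root element.
    Returns the number of elements processed.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # the first event is the root's start
    count = 0
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            func(elem)
            count += 1
            # Drop processed children so the tree does not keep growing.
            root.clear()
    return count


doc = b"<document><page>a</page><page>b</page><page>c</page></document>"
seen = []
total = fast_iter_sketch(io.BytesIO(doc), lambda e: seen.append(e.text), "page")
print(total, seen)
```

Because the callback runs before the element is released, any data needed later has to be copied out inside func, which matches the note on process_pages about processing pages immediately instead of copying them.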
abbyy_to_epub3.utils.gettext(elem)
Given an element, get all text from within the element and its children. Strips out file artifact whitespace (unlike etree.itertext).
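To show the difference from itertext, here is a sketch that collects text recursively while skipping whitespace that only exists because the file was pretty-printed. The heuristic used (a whitespace-only string containing a newline is an artifact) is an assumption chosen for this example, and `gettext_sketch` is not the package's function.

```python
import xml.etree.ElementTree as ET


def _is_artifact(value):
    """True for whitespace-only strings that contain a line break."""
    return value.strip() == "" and "\n" in value


def gettext_sketch(elem):
    """Collect text from elem and its children, skipping layout whitespace."""
    parts = []
    if elem.text and not _is_artifact(elem.text):
        parts.append(elem.text)
    for child in elem:
        parts.append(gettext_sketch(child))
        if child.tail and not _is_artifact(child.tail):
            parts.append(child.tail)
    return "".join(parts)


line = ET.fromstring("<line>\n  <c>H</c>\n  <c>i</c>\n  <c> </c>\n  <c>!</c>\n</line>")
print(repr(gettext_sketch(line)))
print(repr("".join(line.itertext())))
```

The single space inside its own element is kept as real content, while the indentation between elements, which itertext would return, is dropped.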