Welcome to abbyy_to_epub3’s documentation!

ABBYY XML to EPUB3

Introduction

This module transforms ABBYY XML documents, generated by ABBYY FineReader 10, into EPUB 3 with basic accessibility support. The code is optimized for ABBYY XML documents created by the Internet Archive, though it may work for other ABBYY XML as well.

Features

  1. Unicode-compliant.
  2. Handles left-to-right and right-to-left text.
  3. Attempts to recognize running headers, footers, and decimal or page numbers. The confidence level for fuzzy matching can be fine-tuned in config.ini. The matching errs on the side of minimizing false positives.
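To illustrate what a tunable fuzzy-matching threshold means in practice, here is a minimal sketch using Python's standard difflib. It is not the module's actual detection code: the function name, the sample strings, and the 0.9 default are assumptions made for illustration, and the real setting lives in config.ini.

```python
from difflib import SequenceMatcher


def is_running_header(candidate, previous_headers, threshold=0.9):
    """Return True if candidate closely matches any previously seen line.

    threshold is a hypothetical stand-in for the confidence setting in
    config.ini. Raising it reduces false positives at the cost of
    missing headers that OCR has mangled badly.
    """
    for text in previous_headers:
        ratio = SequenceMatcher(None, candidate.lower(), text.lower()).ratio()
        if ratio >= threshold:
            return True
    return False


# First lines seen at the top of earlier pages (made-up sample data).
seen = ["THE HISTORY OF ROME", "THE HISTORY OF ROME"]

# A single OCR error ("0" for "O") should still count as a match.
print(is_running_header("THE HIST0RY OF ROME", seen))
# An unrelated line should not.
print(is_running_header("Chapter the Second", seen))
```

A strict threshold like this one prefers leaving a real header in the text over deleting body text that merely resembles one.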

Limitations

  1. Accessibility is inherently limited by the input ABBYY FineReader documents. If they are marked up with headings and other semantic markup, that structure will be incorporated into the ePub.
  2. There is currently no functionality for image description.
  3. The module can also transform ABBYY XML documents generated by ABBYY FineReader 6. However, those documents are not marked up with headings, so there is no structural navigation for accessibility.

Requirements

  • Python 3
  • If running epubcheck, a Java Runtime Environment
  • If running DAISY Ace, Node.js

Usage

From within a Python program:

from abbyy_to_epub3 import create_epub
book = create_epub.Ebook('docname')  # See *Assumptions* below.
book.craft_epub()

From the shell:

abbyy2epub docname     # See *Assumptions* below.

The available command line arguments are:

usage: abbyy2epub [-h] [-d] [--epubcheck] [--ace] docname

Process an ABBYY file into an EPUB

positional arguments:
  docname      A directory containing all the necessary files. See the README
               for details.

optional arguments:
  -h, --help   show this help message and exit
  -d, --debug  Show debugging information
  --epubcheck  Run EpubCheck on the newly created EPUB
  --ace        Run DAISY Ace on the newly created EPUB
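If you want to drive the command-line tool from a script, the documented flags can be assembled into an argument list. The sketch below is only an illustration: build_command is a hypothetical helper, not part of the package, and it uses only the flags listed in the usage text above.

```python
import subprocess


def build_command(docname, debug=False, epubcheck=False, ace=False):
    """Assemble an abbyy2epub argument list from the documented flags.

    Hypothetical helper for illustration; not part of abbyy_to_epub3.
    """
    cmd = ["abbyy2epub"]
    if debug:
        cmd.append("--debug")
    if epubcheck:
        cmd.append("--epubcheck")
    if ace:
        cmd.append("--ace")
    cmd.append(docname)
    return cmd


cmd = build_command("docname", epubcheck=True, ace=True)
print(" ".join(cmd))

# To actually run it (requires abbyy2epub to be installed and on the PATH):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list, rather than as a single shell string, avoids quoting problems when the document name contains spaces.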

System dependencies

If you’d like to run epubcheck, a few system dependencies are required. Depending on your environment, these may need to be installed manually. On Ubuntu, I installed them with:

sudo apt-get install default-jre libpython3-dev

If you’d like to run the DAISY Ace accessibility checker, you’ll also need Node.js and Ace. On Ubuntu, I installed these with:

sudo apt-get install nodejs
sudo npm install ace-core -g

If Ace installed successfully, you should be able to run:

ace --help

at the command line. This should display usage information. For more information, see the Ace Getting Started Guide: http://inclusivepublishing.org/toolbox/accessibility-checker/getting-started/
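Before enabling the optional checks, it can help to confirm that the external tools are reachable. The following is a minimal sketch using the standard library's shutil.which. The executable names java, node, and ace are assumptions about a typical installation (some Ubuntu releases install Node.js as nodejs), and find_optional_tools is a hypothetical helper, not part of this package.

```python
import shutil


def find_optional_tools(names=("java", "node", "ace")):
    """Map each executable name to its full path, or None if not on the PATH.

    The default names are assumptions; adjust them for your system.
    """
    return {name: shutil.which(name) for name in names}


for name, path in find_optional_tools().items():
    print(f"{name}: {path or 'not found'}")
```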

Installation

This package can be installed on your local system. From the directory containing setup.py:

pip install -r requirements.txt
python setup.py develop
pip install .

You can rebuild the documentation, which is generated with Sphinx:

cd docs
make html

Testing

Run py.test from the top-level app directory. Create new tests in the tests subdirectory.

Assumptions

This application assumes that your working directory contains a subdirectory named for the document, and that this subdirectory holds a specific set of files. If the document is named docname, the directory structure assumed is:

docname/
    docname_abbyy.gz
    docname_meta.xml
    docname_jp2.zip

  • docname_abbyy.gz unzips to docname_abbyy, an XML file generated by ABBYY.
  • docname_jp2.zip unzips to a directory called docname_jp2, which contains a number of images named in the format docname_####.jp2.
    • docname_0000.jp2 is scanner calibration.
    • docname_0001.jp2 is the cover image and the first image reference in the ABBYY.
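Because the conversion depends on this layout, a quick pre-flight check can surface a missing file before processing starts. This is a minimal sketch based only on the three file names listed above; missing_inputs is a hypothetical helper, not a function provided by the package.

```python
from pathlib import Path


def missing_inputs(docname, base_dir="."):
    """Return the expected input files that are absent from base_dir/docname.

    Hypothetical helper; the file names follow the layout described above.
    """
    doc_dir = Path(base_dir) / docname
    expected = [
        f"{docname}_abbyy.gz",
        f"{docname}_meta.xml",
        f"{docname}_jp2.zip",
    ]
    return [name for name in expected if not (doc_dir / name).is_file()]


missing = missing_inputs("docname")
if missing:
    print("Missing input files:", ", ".join(missing))
else:
    print("All expected input files are present.")
```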
