Wednesday, November 7, 2012

Indexing epubs

I have undertaken the task of converting my library of printed books by my favorite philosopher into epub and mobi formats so I can read them on my various e-reader devices and make it easier to search them. Also, some of the printed editions are nearly 80 years old and I worry how much longer they will be readable.

I've got 21 of his books, which is most of them. The task has been to scan the pages of each book, OCR it into HTML and text versions, convert the text version into Pandoc's modified Markdown format, proofread and correct typos, and create links for the index. Then I use Pandoc to create an epub version and Calibre's ebook-convert to convert the epub to mobi for my Kindle.

The scanning was tedious but thankfully a one-off. All of the books are small enough to allow two pages per scan. I wrote a small Perl script which uses the ImageMagick module to rotate and split the image in two and clean off the black borders. Then I tried various OCR programs including ReadIris, Abbyy Reader and Tesseract (free). All of them worked well.
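The author's script used Perl and ImageMagick; here is a minimal Python sketch of the same split-and-clean idea, illustrated on a plain 2-D list of grayscale pixels rather than a real image file (function names and the pure-Python representation are my own, not the original script):

```python
# Sketch of the split-and-clean step the Perl/ImageMagick script performs,
# shown on a 2-D list of grayscale pixels (0 = black, 255 = white).
# The real script operates on scanned image files via ImageMagick.

def split_pages(scan):
    """Split a two-page scan down the middle into left and right pages."""
    mid = len(scan[0]) // 2
    left = [row[:mid] for row in scan]
    right = [row[mid:] for row in scan]
    return left, right

def trim_black_border(page, threshold=64):
    """Drop rows and columns that are entirely dark (scanner border)."""
    rows = [r for r in page if any(px > threshold for px in r)]
    if not rows:
        return []
    keep = [c for c in range(len(rows[0]))
            if any(r[c] > threshold for r in rows)]
    return [[r[c] for c in keep] for r in rows]

if __name__ == "__main__":
    # A tiny fake scan: a black border around two white "pages".
    scan = [
        [0,   0,   0,   0,   0,   0],
        [0, 255, 255, 255, 255,   0],
        [0, 255, 255, 255, 255,   0],
        [0,   0,   0,   0,   0,   0],
    ]
    left, right = split_pages(scan)
    print(trim_black_border(left))    # [[255, 255], [255, 255]]
    print(trim_black_border(right))   # [[255, 255], [255, 255]]
```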

Then life intervened and I've only just picked up the project in the last week or so. I got as far as converting a pamphlet and one book to epub and mobi. Big learning experience using Pandoc. The main problem is indexing. Most of the books have a hand-built index which references the printed pages by number. E-books of course don't have fixed page numbers, so I needed to add HTML links around the parts of the text corresponding to the index entry.

The first book I worked on took a couple of days to proofread then about eight days to index. It has 400 index entries. The next book I selected has nearly 800 index entries. And the task of finding the word(s) which correspond to the index entry is both repetitive and boring and thus very prone to errors from lack of attention on my part. Time to get some automated help.

I used Vim to wrap every index entry with an href tag and a unique ID. (A simple macro which incremented a register, inserted it into the href and went looking for the next entry.) Then I wrote a macro which loads the ID of the index entry into a register. Then I wrote a keymap macro which allows me to visually select the text I want to jump to for the link, then wraps it in a <span> tag and inserts the ID from the register. This took a lot of the repetition out of the task.

But still there was a lot of boring repetition in finding the indexed words/phrases. And, foolishly perhaps, I had eliminated the page numbers from the Markdown text, thinking they were no longer of relevance. I had only finished 100 of the 800 entries after a week. And I hate repetition. That's what computers are for. And of course there are still the other books (and this current one isn't the longest). So I wrote another Perl script which uses the HTML file from the OCRing to extract the text of each page into a lookup table indexed by page number. (The HTML conversion wasn't as accurate as the text conversion, so I hadn't attempted to correct all its typos and formatting errors.) The script then creates a table of the start and end line numbers in the Markdown text which correspond to each page extracted from the HTML. I had to manually assist a couple of entries but mostly it worked. Finally, the script reads each entry in the index, searches for the keyword in the Markdown between the start and end line numbers given for the page of the index entry, and prints the line of text containing the keyword along with its line number in the Markdown file.
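The final lookup step can be sketched in a few lines of Python (the names and data shapes here are my own illustration; the author's script is Perl):

```python
# Sketch of the index-lookup step: given (a) a map from printed page
# number to the Markdown line range covering that page, and (b) index
# entries of the form (keyword, page), report each Markdown line that
# contains the keyword within its page's range -- a keyword-in-context
# (kwic) listing.

def kwic_lines(md_lines, page_ranges, index_entries):
    """Yield (keyword, lineno, text) matches for each (keyword, page) entry."""
    for keyword, page in index_entries:
        start, end = page_ranges[page]          # 1-based, inclusive
        for lineno in range(start, end + 1):
            line = md_lines[lineno - 1]
            if keyword.lower() in line.lower():
                yield keyword, lineno, line

if __name__ == "__main__":
    md = [
        "# Chapter One",
        "Dialectics is the study of contradiction.",
        "Nothing exists in isolation.",
        "# Chapter Two",
        "Quantity transforms into quality.",
    ]
    ranges = {12: (1, 3), 13: (4, 5)}           # printed page -> line span
    entries = [("dialectics", 12), ("quality", 13)]
    for kw, n, text in kwic_lines(md, ranges, entries):
        print(f"{kw!r}: line {n}: {text}")
```

With the real Markdown file and the page table extracted from the OCR HTML, the printout gives the line number to jump to in Vim for each index entry.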

This final step has taken most of the repetition out of the indexing task. So with the indexing printout on one side of my display and Markdown file open in Vim on the other side I can zoom through the entries. Now I jump to the line of text which looks most likely to contain the keyword I'm trying to index, visually highlight the word(s), press F6 and F7 and the word is linked and so on. I was able to input the second hundred entries in about two hours. And the best part is that I can replicate this process for the rest of the books.

Update 1: Vim key mappings for F6 and F7:

" Tag the visual selection as a destination
vnoremap <F6> :s/\(\%V.*\%V.\)/ \
<span id="ix1">\1<\/span>/<CR>

" Increment the tag id (uses register y)
noremap <F7> :s/ix\d\+/ \

Update 2: A (meaningless to anyone but me) workflow.
cd NextBook
cp ../PreviousBook/metadata.xml .
cp ../PreviousBook/title.txt .
Edit these files to reflect new book title and date.

Make a Markdown file from the OCR HTML file.
pandoc -S -s -f html -t markdown NextBook.html
Stack space overflow: current size 8388608 bytes.
Use '+RTS -Ksize -RTS' to increase it.

Not sure why it overflows, but I googled and found this works:
pandoc +RTS -K10000000 -RTS \
-S -s -f html -t markdown NextBook.html

Initialise dir as a git repo
git init
git add .
git commit -m "Initial commit"

And copy an old .gitignore and edit it to suit.
cp ../PreviousBook/.gitignore .

Can now edit/proofread MD file. Remove UTF chars. They usually upset Pandoc. Wrap page numbers in HTML comment tags. Use Vim spellchecker to find obvious errors. Clean up Index entries.

In Vim wrap all index page numbers in href tag:
:s/\(\d\+[fn]*\.*\)/<a href="#ix1">\1<\/a>/

In Vim, set register y to 2 (because we want to keep ix1 as is) e.g.
:let @y=2

then replace every other #ix1 with the incrementing register value by running the macro behind <F7> once and then repeating the substitution over the rest.
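The same wrap-and-renumber pass can be sketched outside Vim with a counter (my illustration in Python; the author did this with the Vim substitute and register y above):

```python
# Sketch of the index-numbering pass: wrap each page number in an href
# with a placeholder id (as the Vim substitute does), then renumber the
# repeated ix1 placeholders into ix1, ix2, ix3, ... -- the same job
# register y does for the <F7> mapping. Illustrative only.
import re
from itertools import count

def wrap_page_numbers(index_line):
    """Wrap every page number (with optional f/n suffixes) in an href."""
    return re.sub(r'(\d+[fn]*\.*)', r'<a href="#ix1">\1</a>', index_line)

def renumber_ids(text):
    """Replace successive ix<digits> ids with an incrementing sequence."""
    counter = count(1)
    return re.sub(r'ix\d+', lambda m: f'ix{next(counter)}', text)

if __name__ == "__main__":
    line = wrap_page_numbers("dialectics, 12, 15f., 98n")
    print(renumber_ids(line))
```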

Can now build the kwiclist (the keyword-in-context printout from the Perl script above) and use it to link index entries to where they occur in the text. The process consists of finding from the kwiclist the line number of the next link word(s), going to that line in Vim, visually highlighting the word(s), and pressing F6 to wrap them in a span tag and F7 to change the id to that of the next link. F7 also auto-increments the id.

Use pandoc to create an HTML version to check links:
pandoc -S -s --epub-metadata=metadata.xml -f markdown -t html \
--toc -o NB.html title.txt

When the HTML version looks right, add chapters to the index links. (This is a one-way process; the MD file cannot be used to create a local HTML version after this, because pandoc splits epubs into chapters based on H1 headings, and the chapter-prefixed links no longer work in a single local HTML file):
../bin/ >
Check that the index entries look OK. The chapters might need to be made into 3-digit, leading-zero entries. When tmp1 looks OK:

Convert MD file to epub (Note: option --toc is not needed for epub):

pandoc -S -s --epub-metadata=metadata.xml -f markdown \
-t epub -o NB.epub title.txt

Check epub with Calibre reader, confirm format, TOC and Index then convert to mobi:
ebook-convert NB.epub NB.mobi

Copy .mobi to Kindle and confirm format and links. Optionally use pandoc to create a PDF version.
