35 Thousand recognised books… and counting

How we do it.

 
Our OCR Farm runs 24 hours a day, and we like to keep it busy. Before we add a book to the OCR Queue we ‘Top and Tail’ it, removing blank and extraneous pages so that the Title Page of the book becomes the new first page. We then remove any other blank or unrecognisable pages from the book as well. We use Adobe Acrobat Professional for this stage of the editing process.

The OCR Farm can be configured to recognise books in many languages, but for now most of the books in the search engine are English. We therefore have a backlog of about 15 thousand non-English books that we have recognised but not yet indexed. We are experimenting with several other search engines to deliver these non-English books. If you are interested in following our progress, please visit the PROJECT TOOLS section in the Ultrapedia forums.
 
Lately, most of the books we have processed have come via our partner Google Book Search. As we continue to expand the Ultrapedia Library we will also be adding books from two other Secret Studio projects: The Pointmore Library and The Philatelink Library, both of which are scanned and recognised in-house.
 
Once a book has been ‘topped and tailed’, a ‘V1’ suffix is appended to the original filename. Our OCR Engine is designed to detect these ‘V1’ files and add them to the OCR Queue, where they wait until a Recognition Server is available to perform Character Recognition on the book.
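
As a rough illustration of that detection step, here is a minimal sketch in Python. It is not our production OCR Engine: the function names are hypothetical, and it simply assumes that a book is a PDF whose filename (before the extension) ends in ‘V1’.

```python
from collections import deque
from pathlib import Path


def is_v1(path: Path) -> bool:
    """True for a PDF whose name, before the extension, ends in 'V1' (assumed convention)."""
    return path.suffix.lower() == ".pdf" and path.stem.endswith("V1")


def build_ocr_queue(folder: Path) -> deque:
    """Collect the V1 files in a folder into a first-in, first-out queue, sorted by name."""
    return deque(sorted(p for p in folder.iterdir() if p.is_file() and is_v1(p)))
```

A Recognition Server that becomes free would then take the next book with queue.popleft().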
 
As the OCR Farm outputs recognised books, the suffix on the book’s filename is changed from ‘V1’ to ‘V2’. When a book reaches the ‘V2’ stage, the newly recognised book enters a ‘Workflow’ where several other enhancements are performed.
 
The first stage in the Workflow is a continuous ‘Batch’ Operation that monitors for new ‘V2’ stage files, embeds the page thumbnails and sets the ‘Open View’ options of the PDF to aid verification. The next part of the Workflow adds ‘Headers’ and ‘Footers’ to the file. Each book is then ‘Page Checked’ to ensure there are no unrecognised pages or ‘Gross Errors’. We also ensure the book contains valid ‘Metadata’ for ‘Book Title’ and ‘Author’. OCR can be a tricky process – see my previous article Recognising a Problem.
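
The metadata part of that check can be pictured with a small Python sketch. It assumes the metadata has already been read out of the PDF into a plain dictionary, and the field names ‘Title’ and ‘Author’ and the function name are illustrative rather than the ones our tools use.

```python
REQUIRED_FIELDS = ("Title", "Author")  # assumed field names


def missing_metadata(metadata: dict) -> list:
    """Return the required fields that are absent, empty or only whitespace."""
    return [
        field
        for field in REQUIRED_FIELDS
        if not str(metadata.get(field) or "").strip()
    ]
```

A book would only move on when the returned list is empty.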
 
Once a book has been recognised and verified, its filename suffix is changed from ‘V2’ to ‘V3’, and the book is then split into individual pages for indexing into the Ultrapedia Library Search.
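
The suffix changes described so far (‘V1’ to ‘V2’ to ‘V3’) amount to a simple rename rule. The Python sketch below shows one way to express it; the function name is hypothetical, and it only assumes that the stage marker is the last thing in the filename before the extension.

```python
from pathlib import PurePath

STAGES = ("V1", "V2", "V3")  # topped and tailed, recognised, verified


def next_stage_name(filename: str) -> str:
    """Return the filename with its stage suffix advanced by one step."""
    path = PurePath(filename)
    stem = path.stem
    for current, following in zip(STAGES, STAGES[1:]):
        if stem.endswith(current):
            return stem[: -len(current)] + following + path.suffix
    raise ValueError(f"{filename!r} has no stage suffix that can be advanced")
```

A file already at ‘V3’ raises an error here, because there is no later stage to move it to.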
 
That then is pretty much it for the Search Engine – Raw PDF images of books go in one end… and single recognised pages come out the other end, get indexed and published to the search engine. This is an ongoing project which currently yields about 400 thousand new pages monthly.
 
We haven’t quite finished with Single Page V3s yet, however. To keep track of which V3s have been indexed, we store a complete copy of the recognised book in a huge database. We use this database to compare newer revisions of the books, as we sometimes discover better original copies.
 
In common with all other computer systems, Ultrapedia is not immune to GIGO (Garbage In = Garbage Out), a very succinct acronym for an OCR Farm. We keep our eyes open for newer, better, higher-resolution scans all the time. As more and more libraries join the Google Book Search program, new and better scanning techniques are often employed, which can result in our discovery of what we call a ‘Replacement Candidate’ for a book currently in the live search engine.
 
We keep track of all the books we have recognised in another, smaller database. When we come across another copy of a book we have already recognised, we add the new book to a ‘Reprocess Queue’ for recognition and exhaustive cross-comparison of the older (live) version and the newer ‘Replacement Candidate’. If you are interested in following the ‘life cycle’ of a replacement candidate V3, please visit the PROJECT TOOLS section in the Ultrapedia forums.
 
Various checks are done on the two files:
 
Word Count and Comparison
Spell Check and Comparison
Image Check and Comparison
 
If the ‘Replacement Candidate’ proves to have fewer errors than the ‘live’ version, the new version is indexed into the search engine and the old version is removed. For an up-to-date list of ‘Replacement Candidates’, please visit the PROJECT TOOLS section in the Ultrapedia forums.
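
To give a flavour of the word count and spell check comparisons, here is a simplified Python sketch. It is an illustration under stated assumptions, not our actual comparison code: it treats any word missing from a supplied word list as an error, the function names are hypothetical, and the image check is left out entirely.

```python
import re

WORD_RE = re.compile(r"[A-Za-z']+")


def words(text: str) -> list:
    """Split recognised text into lower-case words."""
    return [w.lower() for w in WORD_RE.findall(text)]


def error_count(text: str, dictionary: set) -> int:
    """Count words not found in the word list (a crude stand-in for a spell check)."""
    return sum(1 for w in words(text) if w not in dictionary)


def compare(live_text: str, candidate_text: str, dictionary: set) -> dict:
    """Collect word counts and error counts for the live and candidate versions."""
    return {
        "live_words": len(words(live_text)),
        "candidate_words": len(words(candidate_text)),
        "live_errors": error_count(live_text, dictionary),
        "candidate_errors": error_count(candidate_text, dictionary),
    }


def candidate_wins(report: dict) -> bool:
    """The candidate replaces the live version only if it has strictly fewer errors."""
    return report["candidate_errors"] < report["live_errors"]
```

For example, a live page recognised as "Tbe quick brovvn fox" would lose to a candidate recognised as "The quick brown fox" when checked against a word list containing those four words.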
