35 Thousand recognised books… and counting


10% more books added to the Ultrapedia Library

As promised in my December posting, the first roll-out of V4s is now available for full-text search and retrieval via our Google Mini search interface. This first batch adds over 7400 V4s to the library.

Also new to the library are foreign language books: 1332 French language books and 431 Spanish language books. This brings the total number of recognised books in our library to 20866, an increase of almost 10% since 1st January 2008, when our site went live.

Here’s a breakdown:

7468 – English Language Recognised Books (V4s)

A V4 is derived from a V3: it is an optimized, slimmed-down version. During this transition stage the V3s are still in the library, which means there are two copies of the same book. V4s aren’t included in the total number of books, as only unique titles are counted.

19103 – English Language Recognised Books (V3s)

431 – Spanish Language Recognised Books (V3s) ***NEW***

1332 – French Language Recognised Books (V3s) ***NEW***
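As a quick check of the figures above, here is a small Python sketch that adds up the V3 counts and works out the percentage increase. It assumes the 1st January baseline was the 19103 English V3s alone, which the post implies but does not state outright; the variable names are just for illustration.

```python
# Counts taken from the breakdown above. V4s are left out because
# they duplicate V3 titles and only unique titles are counted.
english_v3 = 19103
spanish_v3 = 431
french_v3 = 1332

# Total unique recognised books across all three languages.
total = english_v3 + spanish_v3 + french_v3

# Assumption: the library held only the English V3s at launch,
# so the new French and Spanish titles are the whole increase.
new_titles = spanish_v3 + french_v3
increase_pct = 100 * new_titles / english_v3

print(total)                   # 20866
print(round(increase_pct, 1))  # 9.2
```

On that assumption the increase comes out at roughly 9.2%, which matches the "almost 10%" quoted above.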



Other foreign language titles we hope to release in February include Danish, Dutch, German, Italian, Norwegian and Swedish.

The browsable library has 21298 English Language books for browsing and downloading. The French and Spanish language titles will be added soon.

Creating V4s

V4 is the version number we give to a recognised book to indicate its recognition stage. V4 is appended to the filename; it’s a quick and easy method of keeping track.

V4s begin as V3s. V3s are first page checked for recognition accuracy. We then extract and remove the Tables of Contents, Indexes and Advertisements from the books, as these represent ‘dead-end’ searches. For each extracted section we create a new file with the same filename, prefixed with TOC for Table of Contents, INDEX for Indexes, etc. These new files are saved for rebuilding into the workflow later. Bibliographies, Chronologies and any Plates remain in the book, as they are generally unique content, but are also extracted as separate files. The V3 file then evolves into a V4.
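The naming convention described above could be sketched in Python roughly as follows. This is only an illustration: the underscore separator, the `.pdf` extension, the `_V3` suffix on the input and the function name `derived_names` are all assumptions, as the post only says that the version is appended and that TOC or INDEX is prefixed.

```python
from pathlib import PurePath


def derived_names(v3_filename: str) -> dict[str, str]:
    """Return hypothetical filenames derived from one V3 file."""
    path = PurePath(v3_filename)
    stem = path.stem
    # Assumption: the V3 marker is a trailing "_V3" on the stem.
    base = stem[:-3] if stem.endswith("_V3") else stem
    return {
        # Extracted sections keep the same filename, with a prefix.
        "toc": f"TOC_{path.name}",
        "index": f"INDEX_{path.name}",
        # The slimmed-down book carries V4 instead of V3.
        "v4": f"{base}_V4{path.suffix}",
    }


names = derived_names("annals_of_caesar_V3.pdf")
print(names["toc"])    # TOC_annals_of_caesar_V3.pdf
print(names["index"])  # INDEX_annals_of_caesar_V3.pdf
print(names["v4"])     # annals_of_caesar_V4.pdf
```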

Here are some examples of V4s recently released:

Anaesthetics, their uses and administration by Dudley Wilmot Buxton

Ancient Armour and Weapons in Europe by John Hewitt

Annals of Caesar by E. G. [Ernest Gottlieb] Sihler

The other 7465 V4 books can be found by searching via our Google Mini search interface. You can also search for French and Spanish language books, and across our entire collection of English language books.

Remember that to download books you should first Login or Register. Registering is free and only requires your email address.

I think now would be a good time to recap on the V-numbers we have used so far – so here goes.


V0 – The book is not suitable for OCR

V1 – The book is a good candidate for OCR

V2 – Only used in-house

V3 – The book has been OCR’d and published on the website for browse, search and download

V4 – The book has been OCR’d and published on the website for search and download **NEW**
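For anyone who prefers to see the recap as data, here is a minimal Python sketch of the V-numbers as a lookup table. The names `V_STAGES`, `PUBLISHED_FEATURES` and `can` are made up for this example and are not part of our actual workflow.

```python
# Stage descriptions, as listed in the recap above.
V_STAGES = {
    "V0": "not suitable for OCR",
    "V1": "good candidate for OCR",
    "V2": "only used in-house",
    "V3": "OCR'd and published for browse, search and download",
    "V4": "OCR'd and published for search and download",
}

# What a visitor can do with a book at each published stage.
PUBLISHED_FEATURES = {
    "V3": {"browse", "search", "download"},
    "V4": {"search", "download"},
}


def can(stage: str, feature: str) -> bool:
    """True if books at this stage are offered for the given feature."""
    return feature in PUBLISHED_FEATURES.get(stage, set())


print(can("V3", "browse"))  # True
print(can("V4", "browse"))  # False
```

The only difference between the two published stages is browsing: V4s are reachable through search and download, but not through the browsable library.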


Other files that emerge from the V3s are V35s and V53s.

A V35 contains the pages from a V3 that mix pictures and/or graphics with text. To create a V35 we take a V3 and extract all pages of mixed graphics and text, appending V35 to the filename.

A V53 contains only the pages of Plates from a V3. A typical V53 page consists of two discrete parts: the plate image itself, and the textual ‘Legend’ or description of the plate (the bit of text under the picture).
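The split between the two file types can be pictured with a short Python sketch. It assumes each page already carries flags saying whether it has text, graphics, or is a plate; how those flags would be set is not covered here, and the `Page` class and `split_pages` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Page:
    number: int
    has_text: bool
    has_graphics: bool
    is_plate: bool = False  # assumption: plates are flagged separately


def split_pages(pages: list[Page]) -> tuple[list[int], list[int]]:
    """Return (V35 page numbers, V53 page numbers) for one V3 book."""
    # V35: mixed text and graphics, excluding plates.
    v35 = [p.number for p in pages
           if p.has_text and p.has_graphics and not p.is_plate]
    # V53: plate pages only (image plus its legend).
    v53 = [p.number for p in pages if p.is_plate]
    return v35, v53


book = [
    Page(1, has_text=True, has_graphics=False),
    Page(2, has_text=True, has_graphics=True),
    Page(3, has_text=True, has_graphics=True, is_plate=True),
]
print(split_pages(book))  # ([2], [3])
```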

V35s and V53s will form part of a slideshow collection on our Image Server which we hope to release soon, so watch this blog for future postings.

Highlighting Footnotes

The majority of books in our library are reference and historical works, many of which have footnotes. Some footnotes are so detailed there isn’t enough room on one page, so they span multiple pages. As our recognition process captures the original format and layout of the book, the context of the footnotes is retained, even when footnotes span multiple pages; this wouldn’t be so for plain text OCR.




