10% more books added to the Ultrapedia Library
As promised in my December posting, the first roll-out of V4s is now available for full-text search and retrieval via our Google Mini search interface. In this first batch we’ve added over 7400 V4s to the library.
Also new to the library are foreign language books: there are 1332 French language books and 431 Spanish language books. This brings the total number of recognised books in our library to 20866, an increase of almost 10% since 1st January 2008, when our site went live.
Here’s a breakdown:
7468 – English Language Recognised Books (V4s)
V4s are derived from V3s: each one is an optimized, slimmed-down version of its V3. During this transition stage the V3s are still in the library, which means there are two copies of the same book. V4s aren’t included in the total number of books, as only unique titles are counted.
19103 – English Language Recognised Books (V3s)
431 – Spanish Language Recognised Books (V3s) ***NEW***
1332 – French Language Recognised Books (V3s) ***NEW***
TOTAL UNIQUE BOOKS: 20,866
TOTAL PAGES: APPROX 6 MILLION
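For anyone who wants to check the sums, here is a quick sketch in Python using the counts from the breakdown above. The percentage assumes the baseline on 1st January 2008 was the English language V3 count alone; that baseline is my assumption for illustration, not a figure stated in this post.

```python
# Counts taken from the breakdown above (V3s are the unique titles).
v3_counts = {"English": 19103, "Spanish": 431, "French": 1332}

# V4s duplicate existing V3 titles, so they are left out of the total.
v4_count = 7468

total_unique = sum(v3_counts.values())
new_foreign = v3_counts["Spanish"] + v3_counts["French"]

# Assumption: the launch baseline was the English V3 count only.
increase_pct = 100 * new_foreign / v3_counts["English"]

print(total_unique)            # 20866
print(round(increase_pct, 1))  # 9.2
```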
Other foreign language titles we hope to release in February include Danish, Dutch, German, Italian, Norwegian, and Swedish.
The browsable library has 21298 English Language books for browsing and downloading. The French and Spanish language titles will be added soon.
V4 is the version number we give to a recognised book to indicate its recognition stage. V4 is appended to the filename; it’s a quick and easy method of keeping track.
V4s begin as V3s. V3s are first page-checked for recognition accuracy. We then extract and remove the Tables of Contents, Indexes and Advertisements from the books, as these represent ‘dead-end’ searches. For each extracted section we create a new file with the same filename, prefixed with TOC for Table of Contents, INDEX for Indexes, and so on; these new files are saved for rebuilding into the workflow later. Bibliographies, Chronologies and any Plates remain in the book, as they are generally unique content, but are also extracted as separate files. The V3 file then evolves into a V4.
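To make the naming steps described above concrete, here is a minimal sketch in Python. The underscore separator, the `.pdf` extension, the example filename and the helper names `v4_name` and `extracted_name` are all hypothetical; the post only says that V4 is appended to the filename and that extracted sections are prefixed with TOC, INDEX and so on.

```python
from pathlib import PurePath


def v4_name(v3_filename: str) -> str:
    """Return a V4 filename for a V3 file (hypothetical '_V3'/'_V4' suffixes)."""
    path = PurePath(v3_filename)
    stem = path.stem
    # Drop an existing '_V3' marker before appending '_V4'.
    if stem.endswith("_V3"):
        stem = stem[: -len("_V3")]
    return f"{stem}_V4{path.suffix}"


def extracted_name(v3_filename: str, section: str) -> str:
    """Prefix the original filename with the section label, e.g. TOC or INDEX."""
    return f"{section}_{v3_filename}"


# Hypothetical example filename, for illustration only.
source = "history_of_rome_V3.pdf"
print(v4_name(source))
print(extracted_name(source, "TOC"))
print(extracted_name(source, "INDEX"))
```

Keeping the original filename inside each extracted file’s name means the pieces can be matched back to their book when they are rebuilt into the workflow later.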
Here are some examples of V4s recently released:
The other 7465 V4 books can be found by searching for them via our Google Mini Search interface. You can also search for French and Spanish language books, and our entire collection of English language books.
I think now would be a good time to recap on the V-numbers we have used so far – so here goes.
V0 – The book is not suitable for OCR
V1 – The book is a good candidate for OCR
V2 – Only used in-house
V4 – The book has been OCR’d and published on the website for search and download **NEW**
Other files that emerge from the V3s are V35s and V53s.
A V35 contains the pages from a V3 that mix pictures or graphics with text on the same page. To create a V35 we take a V3 and extract all pages of mixed graphics and text, appending V35 to the filename.
A V53 contains only the pages of Plates from a V3. A typical V53 page consists of two discrete parts: the plate image itself, and the textual ‘Legend’ or description of the plate, which is the bit of text under the picture.
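As a rough illustration of how pages from a V3 might be sorted into V35 and V53 files, here is a small Python sketch. The `Page` record, its flags and the `split_pages` helper are hypothetical; this post doesn’t describe how pages are actually tagged, so treat it as one possible way to express the rule rather than our real workflow.

```python
from dataclasses import dataclass


@dataclass
class Page:
    number: int
    has_text: bool
    has_graphics: bool
    is_plate: bool = False  # assumed flag: a plate image with its legend


def split_pages(pages):
    """Return (v35_page_numbers, v53_page_numbers) for one V3 book."""
    # V35: pages of mixed graphics and text that are not plates.
    v35 = [p.number for p in pages
           if p.has_text and p.has_graphics and not p.is_plate]
    # V53: plate pages only.
    v53 = [p.number for p in pages if p.is_plate]
    return v35, v53


pages = [
    Page(1, has_text=True, has_graphics=False),
    Page(2, has_text=True, has_graphics=True),
    Page(3, has_text=True, has_graphics=True, is_plate=True),
]
print(split_pages(pages))  # ([2], [3])
```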
V35s and V53s will form part of a slideshow collection on our Image Server which we hope to release soon, so watch this blog for future postings.
The majority of books in our library are reference and historical works, many of which have footnotes. Some footnotes are so detailed there isn’t enough room on one page, so they span multiple pages. As our recognition process captures the original format and layout of the book, the context of the footnotes is retained, even when footnotes span multiple pages; this wouldn’t be so for plain text OCR.