35 Thousand recognised books… and counting

Turning the Tables

The single worst class of books for recognition errors is the one containing a lot of tables. Although some smaller tables are recognised correctly, most tables in the Ultrapedia Library are not. Unless the original scanned document is in pristine condition and clearly printed, with plenty of space between the table elements, the accuracy of the recognised table will be low. Almanacs, books on statistics, and anything with the word ‘table’ in the title will consequently be poorly recognised.

Other classes of books that are recognised poorly are those with lots of mathematical formulae, algebra books being the worst. Books with tables of logarithms, tangents, etc. are likewise not to be trusted, as well as being entirely superfluous. Maps are another example of poor recognition candidates, as are books that contain musical scores.

The OCR engine can get so confused that it sometimes reorients the page from portrait to landscape before doing the recognition, with disastrous results. Fortunately, you will normally not see examples of these bad pages unless you deliberately search for them.

