Digital Heritage Seminar: Layout Analysis and OCR with Deep Learning and Heuristics
Clemens Neudecker, Staatsbibliothek zu Berlin
“New Tools for Old Documents – Layout Analysis and OCR with Deep Learning and Heuristics”
This talk will discuss the main achievements and experiences of the QURATOR project at the Berlin State Library (SBB) for document layout analysis. Historical documents that are being digitized in large quantities by libraries and archives frequently exhibit a wide array of features that disturb layout analysis, such as complex layouts with multiple columns, drop capitals and illustrations, skewed or curved text lines, noise, and annotations.
To deal with these challenges and defects, a robust document layout analysis was developed, based on pixel-wise segmentation using convolutional neural networks. In addition, heuristic methods are applied to detect columns or marginalia, and to determine the reading order of text regions. A key objective is to feed the resulting outputs into subsequent processes such as a text recognition (OCR) engine or an image similarity search.
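To illustrate what a reading-order heuristic can look like, here is a minimal Python sketch. It is not the method used in the QURATOR project. It assumes that text regions are already available as axis-aligned bounding boxes (x0, y0, x1, y1), groups them into columns by horizontal overlap, and reads the columns left to right and top to bottom. The function name reading_order and the 50% overlap threshold are hypothetical choices made for this example.

```python
from typing import List, Tuple

# A text region as an axis-aligned bounding box: (x0, y0, x1, y1).
Box = Tuple[int, int, int, int]


def reading_order(boxes: List[Box], min_overlap: float = 0.5) -> List[Box]:
    """Order regions column by column (left to right), then top to bottom.

    Illustrative heuristic only: a box joins an existing column when more
    than `min_overlap` of its width overlaps that column's horizontal extent.
    """
    columns: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b[0]):
        width = box[2] - box[0]
        for col in columns:
            col_x0 = min(b[0] for b in col)
            col_x1 = max(b[2] for b in col)
            overlap = min(box[2], col_x1) - max(box[0], col_x0)
            if overlap > min_overlap * width:
                col.append(box)
                break
        else:
            # No sufficiently overlapping column found: start a new one.
            columns.append([box])

    # Columns left to right, regions within each column top to bottom.
    columns.sort(key=lambda col: min(b[0] for b in col))
    return [b for col in columns for b in sorted(col, key=lambda b: b[1])]


if __name__ == "__main__":
    # Hypothetical two-column page with two regions per column.
    page = [
        (520, 50, 980, 400),
        (20, 420, 480, 900),
        (20, 50, 480, 400),
        (520, 420, 980, 900),
    ]
    for region in reading_order(page):
        print(region)
```

A sketch like this ignores headings that span several columns, marginalia and skewed pages, which are among the cases the talk's heuristics are meant to address.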