OCR Reads Old Newspapers So We Don’t Have To

Plenty of people don’t bother to read the current newspaper, let alone editions that were published over 100 years ago. But there’s a wealth of important historical information buried in these dusty old publications, assuming you can find a way to reliably digitize and index it all. You might think the solution is as simple as running images of the paper through optical character recognition (OCR) software, but as [John Scancella] explains, the problem is a bit more complicated than that.

Stretching the text vertically highlights the columns.

Ultimately, the issue largely comes down to formatting. The OCR software reasonably assumes all the text is in orderly horizontal lines, because in the vast majority of cases, it would be. That’s how you’re reading these words now. But as anyone who’s seen an old-time newspaper knows, that’s not necessarily how things were laid out back then. Pages consisted of multiple narrow columns of stories separated by vertical lines; if the OCR tries to read straight across the page from left to right, the resulting text is a mishmash of several unrelated topics.

The answer is to break all those articles into their own images, but doing that manually at any sort of scale simply isn’t an option. So [John] has been working on a system that uses OpenCV to identify the columns of text and isolate them. He details the multi-step process in his write-up, and even provides the Python code should you want to give it a spin. But the short version is that the image is converted to grayscale and the OpenCV dilate() function is used to stretch the text in the Y dimension. This produces big blobs of white that can easily be picked out with findContours() and snipped into individual images.
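
To make the stretching idea concrete, here is a rough, dependency-free sketch. It is not [John]’s code, which uses OpenCV; it works on a tiny binary grid where 1 marks ink, smears the ink along Y only, and then reads off the horizontal runs that have become solid. The names `dilate_vertical` and `column_spans`, the `reach` parameter, and the toy page are all assumptions invented for this illustration. In OpenCV the corresponding steps would be `cv2.dilate` with a tall, narrow kernel followed by `cv2.findContours`.

```python
def dilate_vertical(grid, reach):
    """Smear every ink pixel (1) up and down by `reach` rows (Y only)."""
    height, width = len(grid), len(grid[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if grid[y][x]:
                lo = max(0, y - reach)
                hi = min(height, y + reach + 1)
                for yy in range(lo, hi):
                    out[yy][x] = 1
    return out


def column_spans(grid):
    """Return (start, end) x-ranges, end exclusive, that are solid top to bottom."""
    height, width = len(grid), len(grid[0])
    spans, start = [], None
    for x in range(width):
        filled = all(grid[y][x] for y in range(height))
        if filled and start is None:
            start = x                      # a solid run begins here
        elif not filled and start is not None:
            spans.append((start, x))       # the run ended at a gutter
            start = None
    if start is not None:
        spans.append((start, width))
    return spans


# Hypothetical toy page: two 4-pixel text columns with a 1-pixel gutter at x = 4,
# and blank rows standing in for the space between lines of text.
page_rows = [
    "110101011",
    "000000000",
    "101101101",
    "000000000",
    "011000110",
    "000000000",
]
page = [[int(c) for c in row] for row in page_rows]

spans = column_spans(dilate_vertical(page, reach=3))
print(spans)  # [(0, 4), (5, 9)]
```

Without the dilation step, `column_spans(page)` finds nothing, because the blank rows between lines of text break every column. The vertical smear is what fuses each column into one blob while leaving the gutter empty.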

It’s not a perfect solution, and there are still a few pitfalls. For one, the name of the paper needs to be removed from the front page before the stretching operation happens. But it’s clearly a step in the right direction, and the results certainly look very promising. Anything that makes OCR more accurate or easier to implement is a win in our book, so we’re excited to see where [John] takes this concept.

18 thoughts on “OCR Reads Old Newspapers So We Don’t Have To”

  1. there may be enough frequency information in the picture to do it without the scaling, i.e.

    for (x = 0; x < width; x++) {
      column_total[x] = 0;
      for (y = 0; y < height; y++) {
        if (pixel(x, y) == black) {
          column_total[x]++;
        } else {
          column_total[x]--;
        }
      }
    }

    then run a DFT on column_total[], then pick largest coefficient with suitable sanity checking to get the column spacing?

      1. for (x = 0; x < width; x++) {
           column_total[x] = 0;
           for (y = 0; y < height; y++) {
             column_total[x] += (127 - greyscale(x, y));
           }
           column_total[x] *= column_total[x];
         }

        Then a DFT will capture the white adjacent to the black lines too. Assumes 256 shades of grey. Should give a stronger signal above the noise.

  2. I have an OCR reading machine for the visually impaired. Self-contained scanner and TTS appliance. Had several versions and brands going back at least 20 years. They all scanned and read books, newspapers, multi-column text, headings, bills, etc and spoke the text with minimal confusion. They had no problem parsing formats or fonts. This is nothing new at all.

  3. I’ve no experience with OCR but have studied Python, so some things were easy to figure out. As for embedded use, sorry, I’ve got to learn more about the amount of resources used. Can OCR be taught to read ancient languages?

  4. John, when contour gives you a trace in OpenCV, get the bounding rect of each glyph and then “bubble” it out by a certain percentage that works for the publication. For a lot of 8½″ x 11″ pages, what I’ve used is around 2% to 5% larger than the original rect. Then any rects that intersect are combined. This can give you words, lines and blocks depending on your percentage. This percentage could be defined per publication format. This could help you identify those pesky header rows if they are within a certain aspect ratio or % of overall page size maybe? Enjoy!

  5. Reading a 100+ year old newspaper is hard enough (I’ve got a few lying around): the text is fuzzy and the paper has discolored. Getting OCR to read it all with any success is a great achievement.
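
The column-spacing suggestion from the first comment thread can be sketched roughly as follows. This is a plain-Python illustration under stated assumptions, not anyone’s tested implementation: `column_profile`, `dominant_period`, and the synthetic page are hypothetical names and data made up here, the naive DFT is O(n²), and the “sanity checking” the comment mentions is reduced to skipping the DC term.

```python
import cmath


def column_profile(grid):
    """Per pixel column, add 1 for each black pixel (1) and subtract 1 for each white (0)."""
    height, width = len(grid), len(grid[0])
    return [sum(1 if grid[y][x] else -1 for y in range(height))
            for x in range(width)]


def dominant_period(profile):
    """Return the period in pixels of the strongest non-DC DFT bin.

    Returns None if every bin is exactly zero.
    """
    n = len(profile)
    best_k, best_mag = 0, 0.0
    for k in range(1, n // 2 + 1):         # skip k = 0, the DC term
        coeff = sum(v * cmath.exp(-2j * cmath.pi * k * x / n)
                    for x, v in enumerate(profile))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    if best_k == 0:
        return None
    return n / best_k


# Hypothetical synthetic page: four columns, each 8 pixels of ink plus a
# 2-pixel gutter, so the true column pitch is 10 pixels.
row = ([1] * 8 + [0] * 2) * 4
page = [row[:] for _ in range(4)]

period = dominant_period(column_profile(page))
print(period)  # 10.0
```

A real scan would be far noisier than this synthetic page, so the strongest bin may land on a harmonic or on page-level structure rather than the column pitch. That is presumably where the extra sanity checking would come in.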
