OCR Reads Old Newspapers So We Don’t Have To

August 30, 2020

Plenty of people don’t bother to read the current newspaper, let alone editions that were published over 100 years ago. But there’s a wealth of important historical information buried in these dusty old publications, assuming you can find a way to reliably digitize and index it all. You might think the solution is as simple as running images of the paper through optical character recognition (OCR) software, but as [John Scancella] explains, the problem is a bit more complicated than that.

Stretching the text vertically highlights the columns.

Ultimately, the issue largely comes down to formatting. The OCR software reasonably assumes all the text is in orderly horizontal lines, because in the vast majority of cases, it would be. That’s how you’re reading these words now. But as anyone who’s seen an old time newspaper knows, that’s not how things were necessarily written back then. Pages consisted of multiple narrow columns of stories separated by vertical lines; if the OCR tries to read the page from left to right, the resulting text is a mishmash of several unrelated topics.

The answer is to break all those articles into their own images, but doing that manually at any sort of scale simply isn’t an option. So [John] has been working on a system that uses OpenCV to identify the columns of text and isolate them. He details the multi-step process down in his write-up, and even provides the Python code should you want to give it a spin. But the short version is that the image is converted to grayscale and the OpenCV dilate function is used to stretch the text in the Y dimension. This produces big blobs of white that can easily be picked out with findContours() and snipped into individual images.

It’s not a perfect solution, and there are still a few pitfalls. For one, the name of the paper needs to be removed from the front page before the stretching operation happens. But it’s clearly a step in the right direction, and the results certainly look very promising. Anything that makes OCR more accurate or easier to implement is a win in our book, so we’re excited to see where [John] takes this concept.

18 thoughts on “OCR Reads Old Newspapers So We Don’t Have To”

e says:

August 30, 2020 at 7:36 am

there may be enough frequency information in the picture to do it without the scaling, i.e.

for (x = 0; x < width; x++) {
column_total[x] = 0;
for (y = 0; y < height; y++) {
if (pixel(x,y) == black) {
column_total[x]++;
} else {
column_total[x]–;
}
}
}

then run a DFT on column_total[], then pick largest coefficient with suitable sanity checking to get the column spacing?

Report comment

Reply
1. john says:
  
  August 30, 2020 at 2:11 pm
  
  If you look at version two (https://github.com/jscancella/NYTribuneOCRExperiments/blob/master/findText_usingSums.py) you will find I did almost exactly that and it indeed works better.
  
  Report comment
  
  Reply
  1. e says:
    
    August 30, 2020 at 3:20 pm
    
    for (x = 0; x < width; x++) {
    column_total[x] = 0;
    for (y = 0; y < height; y++) {
    column_total[x] += (127 – greyscale(x,y);
    }
    column_total[x] *= column_total[x];
    }
    
    Then a DFT will capture the white adjacent to the black lines too. Assumes 256 shades of grey. Should give a stronger signal above the noise.
    
    Report comment
    
    Reply
    1. John says:
      
      August 30, 2020 at 5:36 pm
      
      Pull requests are welcome!
      
      Report comment
      
      Reply
    2. Luke Brown says:
      
      August 31, 2020 at 6:05 am
      
      No experience with CV, but a lot with photoshop. Would something like a Gaussian blur followed by bumping up the highlight end (darkening) work?
      
      Report comment
      
      Reply
Rex says:

August 30, 2020 at 7:36 am

Where’s AI when you need it?

Report comment

Reply
1. limroh says:
  
  August 30, 2020 at 8:17 am
  
  Could be turned into a new e.g. Google chapta to train it with human input.
  
  Report comment
  
  Reply
2. BotherSaidPooh says:
  
  August 31, 2020 at 12:02 am
  
  Actually you can get real time ray tracing support in some of Nvidia’s latest offerings.
  
  Report comment
  
  Reply
Bob says:

August 30, 2020 at 9:37 am

I have an OCR reading machine for the visually impaired. Self-contained scanner and TTS appliance. Had several versions and brands going back at least 20 years. They all scanned and read books, newspapers, multi-column text, headings, bills, etc and spoke the text with minimal confusion. They had no problem parsing formats or fonts. This is nothing new at all.

Report comment

Reply
1. pelrun says:
  
  August 30, 2020 at 10:49 pm
  
  So what?
  
  Report comment
  
  Reply
robert says:

August 30, 2020 at 2:14 pm

Layout analysis is a well studied topic and a feature of most good OCR software.

Report comment

Reply
1. John says:
  
  August 30, 2020 at 5:42 pm
  
  To which OCR programs are you referring? Have you tried running them on these old newspaper pages?
  
  Report comment
  
  Reply
William Gallant says:

August 30, 2020 at 2:38 pm

I’ve no experience of OCR but have studied Python so some things were easy to figure out. Embedded, sorry I’ve got to learn more about as to the amount of used resources. Can OCR be taught to learn ancient languages?

Report comment

Reply
Kevin Lewis says:

August 30, 2020 at 6:43 pm

John, when contour gives you a trace in OpenCV, get the bounding rect of each glyph and then “bubble” it out by a certain percentage that works for the publication. For a lot of 8½” x 11″ pages I’ve used is around 2% to 5% more of the original rect. Then any rects that intersect are combined. This can give you words, lines and blocks depending on your percentage. This percentage could be defined per publication format. This could help you identify those pesky header rows if they are within a certain aspect ratio or % of overall page size maybe? Enjoy!

Report comment

Reply
Saabman says:

August 30, 2020 at 9:32 pm

Reading a 100 +year old news paper is hard enough, ( Ive got a few lying around) the text is fuzzy the paper has colored – getting an OCR to read it all with any success is a great achievement

Report comment

Reply
Doug says:

August 30, 2020 at 11:51 pm

I used to work part time at our rural county museum. Those old newspaper are difficult for a human to follow. Iterating to see how OCR pulls it off.

Report comment

Reply
Col. Panek says:

September 6, 2020 at 1:50 pm

That’s great.

I have a bunch of old birth records handwritten in old Flemish and Latin, in case you want something a little more challenging.

Report comment

Reply
1. john says:
  
  September 6, 2020 at 2:05 pm
  
  Give it a try with tesseract. It has really improved over the years, and you can manually select what the language is when doing ocr
  
  Report comment
  
  Reply