Automated Scanning For A Pile Of Documents

June 22, 2011

The Gado project is part of the Johns Hopkins University Center for African Studies. It has been tasked with archiving documents related to the East Baltimore Oral Histories Project. In short, they’ve got a pile of old pictures and documents that they want digitized, but which can’t easily be run through a page-fed scanner because they are fragile and come in non-standard sizes. The rig seen above is an automated scanner which picks up a document from the black bin on the left, places it on the flat-bed scanner seen in the middle, and moves it to the black bin on the right once it has been scanned. It’s not fast, but it’s a cheap build (great if you’ve got a tight budget) and it seems to work.

The machine is basically a three-axis CNC assembly. Above you can see one motor which lifts the lid of the scanner. You can’t see the document gripper in this image, but check the video after the break which shows the machine in action. A vacuum-powered suction cup moves along a gantry (y-axis) and can also adjust its height (z-axis) and its distance perpendicular to the gantry (x-axis) in order to grab one page at a time.

The pictures on the build log have captions to give you an idea of how this was built. We didn’t see any info about post-processing but let’s hope they have an auto-crop and auto-deskew filter in place to really make this automatic.

[youtube=http://www.youtube.com/watch?v=-QbE3UPDm-w&w=470]

18 thoughts on “Automated Scanning For A Pile Of Documents”

deltron says:

June 22, 2011 at 11:15 am

you could hire an illegal immigrant to do it twice as fast for half the cost.

Tom says:

June 22, 2011 at 11:26 am

Thanks for featuring our project!

Yes, speed is definitely an issue; the threaded rod I chose for the x axis is built more for precision than quickness, so even when it’s being turned by a pretty hefty stepper, it takes a while to cover the full distance. When you’re dealing with seventy-year-old photos, though, speed can sometimes be a bad thing; problems happen faster :) So far the machine has not damaged a single photo, which is worth a lot when working with these kinds of materials.

That said, faster speeds = more photos digitized, so for version 2, we’re switching to 100% rotational motion, and using servos to cut down on cost. We’ll lose precision, but especially moving from side to side, it’s not too vital, and would be way faster.

RE post processing, we have some neat Python stuff, but it’s still a work in progress. I end up having to run a lot of images through Photoshop, because their crop/deskew functions are pretty decent. If anyone knows of a good open source alternative for doing auto crop/deskew (or wants to help write one), I would love to hear about it.
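For anyone curious what the auto-crop half of that could look like, here is a minimal pure-Python sketch. It is not the Gado project's code: it assumes the scan is already loaded as a grid of grayscale values (0 to 255), that the scanner lid shows up as a near-white background, and the names `autocrop_box`, `crop`, `background` and `tolerance` are made up for illustration. Deskewing is not covered.

```python
def autocrop_box(pixels, background=255, tolerance=10):
    """Return (left, top, right, bottom) around the non-background pixels.

    `pixels` is a list of rows of grayscale values (0-255). A pixel counts
    as content if it differs from `background` by more than `tolerance`.
    Returns None when the whole image looks like background.
    """
    def is_content(value):
        return abs(value - background) > tolerance

    # Rows and columns that contain at least one content pixel.
    rows = [y for y, row in enumerate(pixels) if any(is_content(p) for p in row)]
    if not rows:
        return None
    width = len(pixels[0])
    cols = [x for x in range(width) if any(is_content(row[x]) for row in pixels)]

    # Right and bottom are exclusive, matching Python slice conventions.
    return (min(cols), min(rows), max(cols) + 1, max(rows) + 1)


def crop(pixels, box):
    """Cut the (left, top, right, bottom) box out of the pixel grid."""
    left, top, right, bottom = box
    return [row[left:right] for row in pixels[top:bottom]]


if __name__ == "__main__":
    # A 6x5 "scan": white lid with a dark 3x2 photo sitting on it.
    scan = [[255] * 6 for _ in range(5)]
    for y in (1, 2):
        for x in (2, 3, 4):
            scan[y][x] = 40
    box = autocrop_box(scan)
    print(box, crop(scan, box))
```

A real scan would need some noise handling (dust specks count as content here), which is part of why a dedicated tool is probably the better route.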

One neat thing post-processing wise that’s not mentioned here is some OCR stuff we’re working on. The backs of most images at our test site, the Afro American Newspaper, have the newsprint article associated with each image taped or stapled to the back (one of the reasons an ADF is no good).

Part of the rig you see above is a 720P webcam that snaps a picture of these associated pieces of text. We have a nice custom piece of C code, courtesy of a computer engineer friend, which finds the text, separates it out, and runs it through Tesseract OCR to get instant metadata, which gets stored along with the image itself.
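The actual tool described here is custom C code, so the following is only a rough Python sketch of the last step: handing an already-cropped clipping image to the Tesseract command-line program and reading the text back. It assumes a `tesseract` binary on the PATH that accepts `stdout` as its output target, and it skips the text-finding and separation stage entirely. The function names are made up for illustration.

```python
import shutil
import subprocess


def tesseract_command(image_path, lang="eng"):
    """Build the argument list: tesseract <image> stdout -l <lang>."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_clipping(image_path, lang="eng"):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.strip()
```

The returned string could then be stored next to the image record as its metadata.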

Tom says:

June 22, 2011 at 11:28 am

@ deltron

The initial method was interns :) But free labor is intermittent, and keeping things like naming schemes, locations in the archive, etc. straight demands automation anyway.

Helder says:

June 22, 2011 at 11:43 am

You could mount two suction cups a fixed distance apart, so that one picks up a new document at the same time as the other picks up the one on the scanner. The pair then moves to the side and releases both: one onto the scanner and one into the done bin. If the bins and the scanner are spaced at that same distance, only one axis is needed.

Aaron says:

June 22, 2011 at 11:55 am

Awesome hack, Tom!

— Aaron in Baltimore

daniel reetz says:

June 22, 2011 at 12:00 pm

Tom, you should definitely check out Book Scan Wizard or Scan Tailor (there’s a command line version) from the DIY Book Scanner forums… also the Internet Archive has an “autocrop” utility out there.

Also, if you have time, I’d love to see your build posted to the forums…

Karl says:

June 22, 2011 at 12:20 pm

If you tilt the scanner [so it slopes towards the hinge side], the photos might self-align better when they are dropped on the scanner, reducing the need for deskewing.

MatthewB says:

June 22, 2011 at 12:26 pm

I’m concerned that the clear tape near the scanner lid hinge will eventually shift or even let go.
Could you tell us more about that?

abobymouse says:

June 22, 2011 at 12:58 pm

A very cool project.

I’m curious about whether you tried to take photos of the documents?

Get a big flat table and some kind of ‘tripod’/’monopod’ for a camera, and a huge sheet of glass.

Lay a bunch of documents out, put the glass on top, then take photos of them.

Cricri says:

June 22, 2011 at 2:44 pm

Nice! I don’t know about speed and whatnot, but after seeing a helluva lot of pointless projects featured here, it’s good to see a project that is actually useful!

Tom says:

June 22, 2011 at 6:25 pm

Meh… Cool hack, but there are Kodak i1210 document scanners going for $160 on eBay right now. They will do 30-40 pages per minute over a USB TWAIN or ISIS interface. At that price/performance it’s not even worth thinking about a home-built system.

BobidiBOB says:

June 22, 2011 at 11:20 pm

Nice idea, but way too slow. When it does 100 pages a minute (better than the $160 Kodak) and is cheaper, it’ll be a great machine.

svofski says:

June 23, 2011 at 3:13 am

Kodak schmodak. This project is awesome. I’m especially impressed by the vacuum thingy.

henry says:

June 23, 2011 at 4:57 am

The system seems to work fine; it can’t be fast because of the risk of damaging the files.

I’d be inclined to use two scanners (with one “picker”) and double the speed.

Dex says:

June 23, 2011 at 8:18 am

A couple years back, I was a cheap student who couldn’t afford certain textbooks. Anyways, I found that borrowing a DSLR, a tripod, some sort of stand to hold the material being photographed, and a computer with software that let me control the camera (like taking pictures) worked quite well. Initial setup usually took almost as long as it took me to go through a book. To flatten things out a bit you could use a piece of plexiglass (setup time includes getting rid of reflections) and I’d do 2 pages at a time. It would also include a high-contrast, non-reflective background behind the material (I used black construction paper). For taking pictures, I’d align my mouse to the snapshot button, tape it to the floor, and use the mouse as a foot pedal. The computer acted like a much, much larger viewfinder, which prevented me from having blurry images.

After doing a quick scan of page numbers to make sure I didn’t miss anything, I would put it through ImageMagick, an open source and very powerful command-line image editor, which allowed me to do things like convert to B/W, autocrop out the background image + a bit, adjust brightness and contrast, and split pages. If I was feeling like I wanted perfection, I could have gone even further by trimming to the edge of the text, allowing me to add perfect borders to each page later. After that was done I would put it through Omnipage and I’d have a fully OCR’d (searchable!) and printable PDF file. For shrinking down the file to a reasonable size you’d need Adobe Acrobat.

Anyways, taking photographs is much faster than traditional scans; you just need the initial setup and a nice camera to borrow. Add a bit of automated post processing (using a batch script with ImageMagick) and you should be able to get pretty good results.

However for the material that’s being scanned, this just might be the best way. It’s hard to beat the color reproduction from a scanner. If you don’t need speed but you need high quality without much post-processing, traditional scanning takes care of all of the defects that taking pictures adds in. You wouldn’t have to control lighting (which is a major pain with reflections). I’d just worry about reliability and when that’s taken care of you should be able to run it overnight.

Tom (project creator) says:

June 23, 2011 at 8:41 am

@daniel reetz

Thanks for the tips on deskewing, etc! Hadn’t seen those before. Scan Tailor looks especially promising; would be pretty easy to call a command line program from within Python, or do direct bindings if I didn’t want to be lazy :)

I would be happy to do a writeup; which forum should I be looking at?

@MatthewB

Not to worry! The clear tape is actually a remnant of my original prototype, and it did in fact give way after a few hundred actuations. I ended up drilling through the L bracket and scanner lid, and adding some bolts, which is what’s holding the lid in the picture. If you looked below the tape, you’d see a bunch of hot glue, which was my second attempt at holding the thing in place :)

@abobymouse

We actually do use this kind of system for some of the scans. We have a Canon DSLR above a white table, triggered by a Python script. You place the photo on the table, hit a “Go” button in the script, and it takes a photo, names it according to the same scheme used by the archive, and enters it into the same database as the Gado 1. We use this for images which are too damaged or deteriorated to be scanned by the machine; it’s way faster than manual scanning, and requires less contact with the materials.
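A rough sketch of how such a "Go" button handler could hang together, using only the Python standard library. Everything specific here is an assumption for illustration: the naming scheme (`collection_boxNNN_NNNNN.jpg`), the SQLite table layout, and the `capture` callback that stands in for whatever actually triggers the camera.

```python
import sqlite3


def make_name(collection, box, sequence):
    """Hypothetical naming scheme; the archive's real scheme is not shown."""
    return f"{collection}_box{box:03d}_{sequence:05d}.jpg"


def open_catalog(path=":memory:"):
    """Open (or create) the shared image catalog."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS images ("
        " collection TEXT NOT NULL,"
        " box INTEGER NOT NULL,"
        " sequence INTEGER NOT NULL,"
        " filename TEXT NOT NULL UNIQUE,"
        " source TEXT NOT NULL)"  # e.g. 'gado' or 'dslr'
    )
    return db


def record_capture(db, collection, box, source, capture):
    """Name the next image in a box, capture it, and log it in the catalog.

    `capture` is any callable taking the target filename; in a real rig it
    would trigger the camera or scanner and save the file.
    """
    (last,) = db.execute(
        "SELECT COALESCE(MAX(sequence), 0) FROM images"
        " WHERE collection = ? AND box = ?",
        (collection, box),
    ).fetchone()
    sequence = last + 1
    filename = make_name(collection, box, sequence)
    capture(filename)
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT INTO images VALUES (?, ?, ?, ?, ?)",
            (collection, box, sequence, filename, source),
        )
    return filename


if __name__ == "__main__":
    catalog = open_catalog()
    print(record_capture(catalog, "afro", 1, "dslr", lambda name: None))
```

Because both the scanner and the camera station would go through the same function, file names and database rows stay consistent no matter which one produced the image.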

@ other Tom

I wish we could just use an ADF scanner. The issue is that these photos are very old (up to 115 years) and extremely sensitive. Lots are frayed around the edges, and most have pieces of newsprint haphazardly taped or stapled to the back. This makes running them through an automatic scanner impossible, because those scanners tend to rip the materials off or otherwise ruin the photos. I included a little blurb about this on our Kickstarter page if you’re interested in reading more.

@svofski

Thanks! The vacuum system uses a suction lifter cup from McMaster, which is connected to a normal household vacuum cleaner, triggered by the Arduino using a PowerSwitch Tail. Version 2 will use a real 12 V vacuum pump driven by PWM using a motor controller, which will allow for variable vacuum strength to lower the risk of damaging images even more.
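To illustrate the variable-strength idea, here is a small Python sketch that maps a requested vacuum strength (0.0 to 1.0) onto an 8-bit PWM duty value of the kind an Arduino-style controller accepts. The function name, the `min_duty` floor (a guess at the point below which a pump would not spin at all) and the example values are all assumptions, not details of the Version 2 design.

```python
def vacuum_duty(strength, min_duty=0, max_duty=255):
    """Map a strength in [0.0, 1.0] to an integer PWM duty value.

    The mapping is linear between `min_duty` and `max_duty`, so 0.0 gives
    the gentlest usable suction and 1.0 gives full power.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    return round(min_duty + strength * (max_duty - min_duty))


if __name__ == "__main__":
    # Gentle suction for a fragile print, full power for heavy card stock.
    print(vacuum_duty(0.2), vacuum_duty(1.0))
```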

@henry

For Version 2, I’m thinking of switching to rotational motion, so the arm would basically be mounted to a pole rotated by a servo, and the scanner and bins would be arranged in a circle around it. This would allow for two scanners if I used a servo with 360 degree rotation, and either way it’s going to be a lot faster.

Sigg3 says:

June 23, 2011 at 10:29 am

People where I work use each other’s children for these jobs, copying entire books including the copyright notices:)

anon says:

June 24, 2011 at 1:24 pm

Another vote for Scan Tailor; it works wonders and makes the text look almost like a true font. It prints copies that are clear and readable as well.
