DIY Book Scanner Processes 600 Pages/hour

July 18, 2011

Like any learned individual, [Justin] has a whole mess of books. Not being tied to the dead-tree format of bound paper, and with e-readers popping up everywhere, he decided to build a low-cost book scanner so an entire library can be carried in his pocket. If that’s not enough, there’s also a complementary book image processor to assemble the individual pictures into a paginated tome.

The build is pretty simple – just a little bit of black craft board for the camera mount and adjustable book cradle. [Justin] ended up using the CHDK software for the Canon PowerShot camera to hack in a remote trigger. The scanner can manage to photograph 600 pages an hour, although that would increase considerably if he ever moved up to a 2-camera setup.

We’re wondering if OCR could be applied to this build – it’s nice to have an image of a page on your computer, but searchable text would be amazing. If you have experience or a story about a massive OCR job, be sure to leave a note in the comments. Check out the videos below for a walk-through of the build and a demo of the operations.

http://www.youtube.com/watch?v=Z-wJs3Xg4Y4

http://www.youtube.com/watch?v=6C_yJ7eMs24

39 thoughts on “DIY Book Scanner Processes 600 Pages/hour”

svofski says:

July 18, 2011 at 11:10 am

I found the title to be a little bit misleading. Although it’s a great and simple setup, the speed of scanning isn’t one of its highlights. If it had an automatic page flipper, the title would have been justified.

Rob says:

July 18, 2011 at 11:12 am

This is exactly what my next project was going to be, book scanner. This build looks great! Going to go for a two camera one, though.

Definitely not going to use it for college textbooks. No way.

Spork says:

July 18, 2011 at 11:30 am

No automatic page turning = no use.
I appreciate the effort put in, and it looks like a great start to a project, but this is not what I hoped for when I saw the article.

Chalkbot says:

July 18, 2011 at 11:34 am

Hmmm, is there a way to easily automate page flipping, or are you better off just separating all the pages from the binding and doing a double-sided scan of the whole pile? I’m guessing you could get a whole book done pretty quickly (and automatically) that way, but the book would be mostly destroyed in the process.

Carry The what says:

July 18, 2011 at 11:36 am

@Spork

my thoughts exactly…

John Avitable says:

July 18, 2011 at 11:45 am

I agree with the above, there’s limited use in a book scanner that doesn’t automatically flip the pages. Might as well read the book while you’re at it :)

austin says:

July 18, 2011 at 11:48 am

page turner would be doomed. it’s sometimes tricky for a human to turn just ONE page. you usually turn 2-3 then rub your fingers together til the pages separate.

but if you could get every page to be charged say a positive charge they would repel each other and turning would be much simpler. as it stands they tend to be attracted to each other.

SuperNuRd says:

July 18, 2011 at 11:50 am

Ha all I thought about was Harry Potter book torrents.

Ike says:

July 18, 2011 at 11:50 am

Nice little build but really of no use to busy people. Automatic page turning is a must or just take the book apart and put it through a document scanner.
Some will cry sacrilege at cutting a book up but show a little pragmatism. It’s the information, not the format that is important.

reipoom says:

July 18, 2011 at 11:57 am

@Rob

Not use it for text books? If there was a cheap automatic scanner on the market that’s all I’d use it for.

Sorry, I don’t care what the publishers say. $300+ for engineering books is ridiculous, and they’re changed nearly every semester.

I think my intro to chemistry book was ~$350. How do you explain some of the crazy math book prices when there really isn’t anything different?

tzangcr says:

July 18, 2011 at 12:06 pm

Can’t mention book scanning without these:

http://www.diybookscanner.org/
http://bookliberator.com/doku.php

I wonder if they’ve already been featured on HAD.

Ross says:

July 18, 2011 at 12:15 pm

For more information on these types of builds

http://www.diybookscanner.org/

Ross

Olivier says:

July 18, 2011 at 12:19 pm

@Brian Benchoff: Canon, not Cannon.

Chris says:

July 18, 2011 at 12:40 pm

Needs an automatic page turner, badly; even if it’s only 99% reliable. Missed pages can be detected by OCRing page numbers. Manually capturing a few pages is better than having to manually capture them all.
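
As a rough illustration of the missed-page check described here, the sketch below walks a list of OCR'd page numbers and reports where the sequence breaks. It assumes one page per image and that unreadable numbers arrive as `None`; the function name `find_page_gaps` and the tuple layout are made up for this example and do not come from any existing tool.

```python
from typing import List, Optional, Tuple


def find_page_gaps(page_numbers: List[Optional[int]]) -> List[Tuple[int, int, int]]:
    """Return (image_index, expected, found) for each break in the sequence.

    page_numbers[i] is the number OCR read from image i, or None when OCR
    could not read one. Unreadable entries are skipped rather than reported,
    since they say nothing about whether a page was missed.
    """
    problems = []
    last_index = None
    last_number = None
    for index, number in enumerate(page_numbers):
        if number is None:
            continue
        if last_number is not None:
            # Each image should advance the page number by exactly one.
            expected = last_number + (index - last_index)
            if number != expected:
                problems.append((index, expected, number))
        last_index, last_number = index, number
    return problems
```

A skipped page shows up as a found number higher than expected, and a page captured twice as one that is lower. Misread digits would be flagged too, so every hit still needs a human look.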

And if you do run all this through an OCR for use on an e-reader, I highly recommend archiving a copy of the original JPG images! You can OCR again at a later date as technology advances for better accuracy and formatting.

Plus, there’s the issue of the OCR output format. Chances are it’s Acrobat. Many years ago, I converted many important documents to Acrobat; naively believing Adobe’s hype about it being a good archival format. Now I regret that decision, because newer versions render some of them so poorly they’re unreadable. I have to start a virtual machine to run an older Acrobat version every time I want to view them. I’m still looking for a way to permanently update them that won’t cause conversion errors (any tips appreciated).

But JPG really is forever. It’s relatively simple and open source. You will always be able to correctly view a JPG, no matter how old. Even 100 years from now, you will still be able to batch convert them to whatever format your OCR software then accepts.

aci says:

July 18, 2011 at 12:46 pm

For image processing, check out
http://scantailor.sourceforge.net/
http://www.diybookscanner.org/forum/viewforum.php?f=8

For OCR and djvu packing check
http://code.google.com/p/djvubind/

Alternative:
http://sourceforge.net/projects/bookscanwizard/
http://www.diybookscanner.org/forum/viewforum.php?f=9

aci says:

July 18, 2011 at 12:49 pm

Chris: feel free to join in with your own work on automatic page turner designs at diybookscanner.org. Several design paths are being explored there, but it is a really tricky engineering task to get right, especially if you are aiming for low cost. The best manual page turn designs, in comparison, are dual camera and allow much quicker page turns than the unit HAD highlights above. Besides, post processing takes more time than the initial book capture.

Alex says:

July 18, 2011 at 12:57 pm

Very good advice, Chris, and that’s easier than ever now that hard drives are so cheap.

D_ says:

July 18, 2011 at 1:08 pm

@ Ross- I understood there was a web resource for home shop built book scanners, but I couldn’t think of it right off, and I’m too lazy today to go look for it.

I suppose really busy people would build a two camera unit, rather than the single camera one. Many books are too expensive to be destroyed for speed. A college student could possibly recover a tiny portion of a textbook’s cost. I understand pragmatism, but there is a YMMV component to pragmatism too.

The man said with practice he can scan a 600 page book in an hour, and I see no reason not to believe him. I downloaded the software zip file to see if it could be of some use to me. A thank you to Justin for making it available.
Having seen the machinery for a bulk snail mailing facility in operation, I have no doubt an automatic page turner could be developed, but you may have to dedicate a bedroom in your home to the scanner alone.

cmholm says:

July 18, 2011 at 1:19 pm

I’d like to have seen a short video where @Justin is in production mode, flipping pages, to see how smooth the workflow is.

Auto page flipping? This is one of those times I wish HAD had a better search function, so that I could have found this in a few seconds, rather than a few minutes:

http://hackaday.com/2009/12/17/lego-book-scanner/

HHH says:

July 18, 2011 at 2:18 pm

I think svofski is right…

erniejunior says:

July 18, 2011 at 3:31 pm

@Chris: Normally you put the images you take into the PDF files you create. Then you put the OCR text output behind the images. This way you can read all the words no matter how bad the OCR is, and you are still able to search the book for most words (depending on the OCR).
For OCR I like to use Tesseract. With a few hacks you can make it recognise the layout of the file too.
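
The image-plus-hidden-text PDF described above can be sketched roughly as follows. This assumes an installed Tesseract build whose command line accepts the `pdf` output config (such builds postdate this thread), and it is not the "few hacks" workflow mentioned in the comment. The helper names `tesseract_pdf_command` and `ocr_folder` are invented for this example.

```python
import subprocess
from pathlib import Path
from typing import List


def tesseract_pdf_command(image: Path, language: str = "eng") -> List[str]:
    """Build the command that turns one page image into a searchable PDF."""
    # Tesseract takes an output base name and appends ".pdf" itself.
    output_base = image.with_suffix("")
    return ["tesseract", str(image), str(output_base), "-l", language, "pdf"]


def ocr_folder(folder: Path) -> None:
    """Run Tesseract over every JPG in a folder, producing one PDF per image."""
    for image in sorted(folder.glob("*.jpg")):
        subprocess.run(tesseract_pdf_command(image), check=True)
```

Each resulting PDF keeps the photographed page as the visible layer, so a bad OCR pass never hides the original. Merging the per-page PDFs into one book is left to whatever PDF tool is at hand.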

mark g says:

July 18, 2011 at 3:38 pm

@Chris

JPGs don’t play well with e-Readers, nor can you properly index them. Give me plain text any day.

grythumn says:

July 18, 2011 at 3:47 pm

@Chris: OCRing page numbers is not as easy as you think. I prep books for Distributed Proofreaders (older fonts, true) and the page headers are more likely to be missing or wrong in the OCR than the body text. Missing/damaged/duplicate page checks are still generally a human task.

daniel_reetz says:

July 18, 2011 at 5:04 pm

Thanks for the support here in the comments, and nice single-camera build. We have a few like it.

Page turning is a very hard problem. People on the DIY Book Scanner forums are working on it – here’s one video from user dtic:

http://www.youtube.com/watch?v=_SBF51g3X7I

Here’s another from jck57:
http://www.youtube.com/watch?v=LsENrb4HE0Q

Dosx says:

July 18, 2011 at 5:58 pm

Honestly, my thought on this would be to cut away the pages and digitize them. Argh, page turner… argh, page flipper… The moment you decide to scan a book or bend a page is the moment you decide to destroy it. Anyone here who works in an office, or has had to stand at a copying machine copying a book, knows the stress of flipping pages to get to the next side. And because of how some books are put together, you cannot get a proper scan (or image, in this case) of the book. The only good way, and the best way, to get the entire page and have it perfectly readable is to have loose sheets. So my advice: cut them loose. You lose a book, but you save it at the same time. NO TECHNICAL KNOWLEDGE REQUIRED FOR THIS HACK

will1384 says:

July 18, 2011 at 8:26 pm

I have made a two camera book scanner, and have also used a flatbed scanner to scan books. The book scanner is “amazingly” faster than the flatbed scanner, even when you have to turn the pages by hand when using the two camera book scanner.

I got the plans / info for the book scanner over at

http://www.diybookscanner.org

I get around 25 to 30+ pages per minute with the book scanner.

Nagel says:

July 19, 2011 at 12:10 am

I was expecting automatic page flipping for the reported scan rate, as in a fire-up-and-forget type device. I’d rather use a band saw to saw off the book’s spine and then use a copy machine to scan the pages than turn each one and reinstall the “page flattener”.

Nagel says:

July 19, 2011 at 12:16 am

Oops… Redundant posting. Read /then/ post, Nagel! Should have known with 26 prior that we’d all be “on the same page!”

Blue Footed Booby says:

July 19, 2011 at 6:09 am

Eh, auto page turning would be a big improvement, but being able to go turn-click-turn-click with little to no waiting for the scanner to finish is still a lot better than what you’d get with a normal home scanner or something else along those lines. It’s not enough if you want to scan, say, the Bible or a dictionary, but it’s still not bad for a DIY solution.

Rob says:

July 19, 2011 at 9:14 am

@reipoom

Definitely not for textbooks. Because I’m totally a law abiding citizen who wouldn’t rent them from chegg, scan them, then mail them back to get my money back.

420nmSciurus says:

July 19, 2011 at 1:49 pm

This is a really neat idea. I would probably do it if I had the time in the same place where all of my books are…

Dex says:

July 20, 2011 at 1:51 am

I’d say if you want to build/design an automatic page turner you might as well spend the time flipping the pages yourself.

I was a cheap student and experimented with different ways to digitize books from 1st year to 4th year.

First method: Chop and ADF scan. Works for more textured paper at a pretty good success rate and very good image quality, but only if the paper is “grippy” enough. This does not work for textbooks with super thin slippery pages – which are most of them. Those ones would jam up the scanner and I ended up spending more time unjamming the ADF than actual scanning. Not to mention you destroy a book in the process.

Second method: Set up a tripod to position the camera to shoot over your shoulder and start shooting, flattening the pages with a piece of plexi. Works, however setup time is almost as long as doing all the page flipping. It would take me about 3 hours to do a 1000 page textbook from beginning of setup to end of page flipping. Getting to a printable “product” (white background without all the gradients) is another story. Had to figure out ImageMagick batch scripting to do postprocessing for me. Yea it’s pretty mind numbing, but get someone to help you out and take turns.

Tips:
– do 2 pages at a time, whether you use 2 cameras or 1 camera across both pages
– use software that acts as a viewfinder so that you can see if the picture you took was too blurry, going back and correcting images is a pain
– get friends to help you when you need a break
– use a good camera, DSLRs are nice, having some sort of remote is a must (a wireless mouse taped to the floor with its cursor positioned over the shutter release button in the viewfinder software is what I used)
– use a fully black background behind your book, ImageMagick has nice cropping commands that you can use based on color
– make sure you set up right (reflections are a pain) so your pictures don’t change much from beginning to end (that includes lighting and position of the book relative to the background – lighting because your contrast & brightness changes will mess up your batch scripts and changing the background will mess up your cropping scripts.)
– Omnipage is the best for OCR (also most expensive if you buy it) BTW it took me a while to figure this one out but if Omnipage isn’t loading your pictures remove your EXIF info! You should be doing this anyways … (EXIF Stripper is handy). Acrobat actually sucks for OCR but I did use Acrobat to shrink the PDF size to about 100MB depending on things like color vs B/W and number of pages. Omnipage isn’t very good for compression but it will spit out a nice OCR’d pdf (that you can shrink nicely with Acrobat).
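
To illustrate why the black background tip matters for cropping, here is a small pure-Python sketch of color-based cropping. It is not the ImageMagick batch script mentioned above. It assumes the photo has already been loaded as rows of grayscale values (0 is black, 255 is white), and the names `page_bounds` and `crop_to_page` and the threshold of 60 are arbitrary choices for this example.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # grayscale rows, 0 = black, 255 = white


def page_bounds(image: Image, threshold: int = 60) -> Optional[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) around all pixels brighter than threshold.

    bottom and right are exclusive. Returns None if every pixel is dark.
    """
    rows = [y for y, row in enumerate(image) if any(p > threshold for p in row)]
    if not rows:
        return None
    width = len(image[0])
    cols = [x for x in range(width) if any(row[x] > threshold for row in image)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def crop_to_page(image: Image, threshold: int = 60) -> Image:
    """Cut away the dark border, keeping only the bounding box of the page."""
    bounds = page_bounds(image, threshold)
    if bounds is None:
        return []
    top, left, bottom, right = bounds
    return [row[left:right] for row in image[top:bottom]]
```

A fixed threshold only works if lighting stays constant from the first shot to the last, which is the reason the tips stress not changing the setup mid-book.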

That’s roughly 4 years of experience for you. If there are classmates and friends willing to help you in exchange for a copy of the digitized materials (provided that they own a copy of the book of course) it makes it worthwhile. Doing all this yourself and just for yourself is really a pain in the ass but if you’re dedicated enough it can be done.

Dex says:

July 20, 2011 at 2:02 am

One more tip. For doing a check on page numbers – DON’T count through all of them. Just check the page numbers after tapping the arrow key 5 times. This would be every 10 pages if you’re going 10 pages at a time so you should only see for example “10 … 11, 20 … 21, etc” and it’ll only take a couple minutes. You’ll know when you messed up if you see say “32 … 33”.
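
The same spot check can be written down as a short sketch, in case the page numbers have been typed into a list while flipping through. It assumes two pages per image, that `left_pages[i]` is the left-hand page number of image `i`, and that the first image is correct. The name `spot_check` is invented for this example.

```python
from typing import List, Tuple


def spot_check(left_pages: List[int], step: int = 5) -> List[Tuple[int, int, int]]:
    """Check every step-th image and return (index, expected, found) mismatches."""
    if not left_pages:
        return []
    first = left_pages[0]
    mismatches = []
    for index in range(0, len(left_pages), step):
        # Two pages per image, so the left page advances by 2 per image.
        expected = first + 2 * index
        if left_pages[index] != expected:
            mismatches.append((index, expected, left_pages[index]))
    return mismatches
```

With a scan starting at page 10 and one spread skipped along the way, a later checkpoint shows 32 where 30 was expected, which is the “32 … 33” case from the tip.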

maximan says:

July 20, 2011 at 2:51 am

Extending Dex’s tip, you could also do a binary search: check if the last page number is correct. If not, check if the middle page has the correct number. If so, you know the error is in the second half of the pages, and you apply this recursively.
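
A minimal sketch of that binary search, under the same assumptions as Dex’s check (two pages per image, first image correct) plus one more: once a page is skipped or duplicated, every later image stays offset. Two errors that cancel each other out would slip through. The name `first_bad_spread` is made up for this example.

```python
from typing import Optional, Sequence


def first_bad_spread(left_pages: Sequence[int]) -> Optional[int]:
    """Return the index of the first image whose page number is off, or None."""
    if not left_pages:
        return None
    first = left_pages[0]

    def on_schedule(i: int) -> bool:
        # Two pages per image, so image i should start at first + 2 * i.
        return left_pages[i] == first + 2 * i

    if on_schedule(len(left_pages) - 1):
        return None  # last page is right, so nothing is offset
    lo, hi = 0, len(left_pages) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if on_schedule(mid):
            lo = mid
        else:
            hi = mid
    return hi
```

For a 500-image book this needs about nine looks at a page number instead of fifty checkpoints, at the cost of trusting that errors do not cancel out.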

slacker247 says:

July 20, 2011 at 7:15 am

If you don’t care about the book/mag then just cut the binding with a bandsaw and use a Fujitsu ScanSnap S300. That’s my setup. I can do about 200 pages in about a half hour.

James says:

April 3, 2013 at 8:47 pm

This looks solid but huge.
I think a lot of companies are coming out with portable book scanners.
Check this out.
https://www.youtube.com/watch?v=y2p_Nt2WQE0

Rick_R says:

May 26, 2013 at 4:13 pm

Regarding Slacker247’s comment, for a lot of people it’s not easy getting access to a band saw. I just saw on some other sites that the in-store printing departments at office supply chain stores like Office Depot have page cutters that can slice the spine off a book. They’ll do it for around $1, and it’s becoming a common enough request that they’re used to doing it.

1. KJAFortWayne says:
  
  April 20, 2014 at 5:30 pm
  
  Many of the semi-skilled staff at office supply and Kinko’s stores don’t know how tightly to clamp the guillotine paper cutters. As a result, the material slips during slicing and the finished cut is not square. I recommend that you mention this to the staff if you go this route.
  
  1. KJAFortWayne says:
    
    April 20, 2014 at 5:33 pm
    
    Squareness is an issue because (a) you might want to rebind the pages (wire, comb, or another process), and (b) the side guides of your scanner’s feeder need a square original to feed properly.
    