DIY Book Scanner Processes 600 Pages/hour

Like any learned individual, [Justin] has a whole mess of books. Not being tied to the dead-tree format of bound paper, and with e-readers popping up everywhere, he decided to build a low-cost book scanner so an entire library can be carried in his pocket. If that’s not enough, there’s also a complementary book image processor to assemble the individual pictures into a paginated tome.

The build is pretty simple – just a little bit of black craft board for the camera mount and adjustable book cradle. [Justin] ended up using the CHDK software for the Canon PowerShot camera to hack in a remote trigger. The scanner can manage to photograph 600 pages an hour, although that would massively increase if he ever moves up to a 2-camera setup.

We’re wondering if OCR could be applied to this build – it’s nice to have an image of a page on your computer, but searchable text would be amazing. If you have experience or a story about a massive OCR job, be sure to leave a note in the comments. Check out the videos below for a walk-through of the build and a demo of the operations.

[youtube=http://www.youtube.com/watch?v=Z-wJs3Xg4Y4&w=470]

[youtube=http://www.youtube.com/watch?v=6C_yJ7eMs24&w=470]

39 thoughts on “DIY Book Scanner Processes 600 Pages/hour”

  1. I found the title to be a little bit misleading. Although it’s a great and simple setup, the speed of scanning isn’t one of its highlights. If it had an automatic page flipper, it would have been justified.

  2. This is exactly what my next project was going to be, book scanner. This build looks great! Going to go for a two camera one, though.

    Definitely not going to use it for college textbooks. No way.

  3. No automatic page turning = no use.
    I appreciate the effort put in, and it looks like a great start to a project, but this is not what I hoped for when I saw the article.

  4. Hmmm, is there a way to easily automate page flipping or are you better off just separating all the pages from the binding and doing a double-sided scan of the whole pile? I’m guessing you could get a whole book done pretty quickly (and automatically) that way, but the book would be mostly destroyed in the process.

  5. page turner would be doomed. its sometimes tricky for a human to turn just ONE page. you usually turn 2-3 then rub your fingers together til the pages separate.

    but if you could get every page to be charged say a positive charge they would repel each other and turning would be much simpler. as it stands they tend to be attracted to each other.

  6. Nice little build but really of no use to busy people. Automatic page turning is a must or just take the book apart and put it through a document scanner.
    Some will cry sacrilege at cutting a book up but show a little pragmatism. It’s the information, not the format that is important.

  7. @Rob

    Not use it for text books? If there was a cheap automatic scanner on the market that’s all I’d use it for.

    Sorry i don’t care what the publishers say. 300+$ for engineering books is ridiculous, and they’re changed nearly every semester.

    i think my intro to chemistry was ~350. How do you explain some of the crazy math book prices when there really isn’t anything different.

  8. Needs an automatic page turner, badly; even if it’s only 99% reliable. Missed pages can be detected by OCRing page numbers. Manually capturing a few pages is better than having to manually capture them all.

    And if you do run all this through an OCR for use on an e-reader, I highly recommend archiving a copy of the original JPG images! You can OCR again at a later date as technology advances for better accuracy and formatting.

    Plus, there’s the issue of the OCR output format. Chances are it’s Acrobat. Many years ago, I converted many important documents to Acrobat; naively believing Adobe’s hype about it being a good archival format. Now I regret that decision, because newer versions render some of them so poorly they’re unreadable. I have to start a virtual machine to run an older Acrobat version every time I want to view them. I’m still looking for a way to permanently update them that won’t cause conversion errors (any tips appreciated).

    But JPG really is forever. It’s relatively simple and open source. You will always be able to correctly view a JPG, no matter how old. Even 100 years from now, you will still be able to batch convert them to whatever format your OCR software then accepts.
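The idea of catching missed pages from OCR'd page numbers can be sketched roughly as below. This is a minimal illustration, not anyone's actual tool: the function name `find_page_gaps` is hypothetical, and it assumes one page per image, images in capture order, and `None` wherever the OCR could not read a number. A misread number will show up as drift too, so flagged spots still need a human look.

```python
from typing import List, Optional, Tuple


def find_page_gaps(ocr_numbers: List[Optional[int]]) -> List[Tuple[int, int, int]]:
    """Compare consecutive readable page numbers with the image count between them.

    ocr_numbers[i] is the page number read from image i, or None if unreadable.
    Returns (prev_index, next_index, drift) tuples: drift > 0 suggests skipped
    pages, drift < 0 suggests pages that were captured more than once.
    """
    problems = []
    prev_index = None
    for index, number in enumerate(ocr_numbers):
        if number is None:
            continue  # unreadable number: skip it and compare across the gap
        if prev_index is not None:
            expected_step = index - prev_index
            actual_step = number - ocr_numbers[prev_index]
            if actual_step != expected_step:
                problems.append((prev_index, index, actual_step - expected_step))
        prev_index = index
    return problems


if __name__ == "__main__":
    # Page 5 was never photographed; image 2 had an unreadable number.
    print(find_page_gaps([1, 2, None, 4, 6, 7]))
```

Only the flagged index pairs need to be re-shot or checked by hand, which fits the point that manually capturing a few pages beats capturing them all.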

  9. Chris: feel free to join in with your own work on automatic page turner designs at diybookscanner.org . Several design paths are explored there. But it is a really tricky engineering task to get right, especially if you have low-cost aims. The best manual page turn designs, in comparison, are dual camera and allow much quicker page turns than in the unit HD highlights above. Besides, post processing takes more time than the initial book capture.

  10. @ Ross- I understood there was a web resource for home shop built book scanners, but I couldn’t think of it right off, and I’m too lazy today to go look for it.

    I suppose real busy people would build a two camera unit, rather than the single camera one. Many books are too expensive to be destroyed for speed. A college student could possibly recover a tiny portion of a text book. I understand pragmatism, but there is a YMMV component to pragmatism too.

    The man said with practice he can scan a 600 page book in an hour, I see no reason not to believe him. I downloaded the software zip file to see if it could be of some use to me. A thank you to Justin for making it available.
    Having seen the machinery for a bulk snail mailing facility in operation, I have no doubt an automatic page turner could be developed, but you may have to dedicate a bedroom in your home for the scanner only.

  11. @Chris: Normally you put the images you take in the PDF files you create. Then you put the OCR text output behind the images. This way you can read all the words no matter how bad the OCR is and you are still able to search the book for most (depending on the OCR) words.
    As OCR I like to use Tesseract. With a few hacks you can make it recognise the layout of the file too.

  12. @Chris: OCRing page numbers is not as easy as you think. I prep books for distributed proofreaders (older fonts, true) and the page headers are more likely to be missing or wrong in the OCR than the body text. Missing/damaged/duplicate page checks are still generally a human task.

  13. Honestly, my thoughts on this would be to cut away the pages and digitize them. Argh, page turner… argh, page flipper. The moment u decide to scan a book or bend a page is the moment u decide to destroy it. Anyone here who works in an office or has had to stand at a copying machine copying a book thats over sheets-backed would know the stress of flipping pages to get to the next side, and because of how some books are put together you cannot get a proper scan (or image, in this case) of the book. The only good way to get the entire page, and perfectly readable at that, is to have loose sheets/pages. So my advice: cut umm loose. U lose a book but u save it at the same time. NO TECHNICAL KNOWLEDGE REQUIRED FOR THIS HACK

  14. I have made a two camera book scanner, and have also used a flatbed scanner to scan books. The book scanner is “amazingly” faster than the flatbed scanner, even when you have to turn the pages by hand when using the two camera book scanner.

    I got the plans / info for the book scanner over at http://www.diybookscanner.org

    I get around 25 to 30+ pages per minute with the book scanner.

  15. I was expecting automatic page flipping for the reported scan rate, as in a fire-up-and-forget type device. I’d rather use a band saw to saw off the book’s spine and then use a copy machine to scan the pages than turn each one and reinstall the “page flattener”.

  16. Eh, auto page turning would be a big improvement, but being able to go turn-click-turn-click with little to no waiting for the scanner to finish is still a lot better than what you’d get with a normal home scanner or something else along those lines. It’s not enough if you want to scan, say, the Bible or a dictionary, but it’s still not bad for a DIY solution.

  17. @reipoom

    Definitely not for textbooks. Because I’m totally a law abiding citizen who wouldn’t rent them from chegg, scan them, then mail them back to get my money back.

  18. I’d say if you want to build/design an automatic page turner you might as well spend the time flipping the pages yourself.

    I was a cheap student and experimented with different ways to digitize books from 1st year to 4th year.

    First method: Chop and ADF scan. Works for more textured paper at a pretty good success rate and very good image quality, but only if the paper is “grippy” enough. This does not work for textbooks with super thin slippery pages – which are most of them. Those ones would jam up the scanner and I ended up spending more time unjamming the ADF than actual scanning. Not to mention you destroy a book in the process.

    Second method: Set up a tripod to position the camera to shoot over your shoulder and start shooting, flattening the pages with a piece of plexi. Works, however setup time is almost as long as doing all the page flipping. It would take me about 3 hours to do a 1000 page textbook from beginning of setup to end of page flipping. Getting to a printable “product” (white background without all the gradients) is another story. Had to figure out Imagemagick batch scripting to do postprocessing for me. Yea it’s pretty mind numbing, but get someone to help you out and take turns.

    Tips:
    – do 2 pages at a time, whether you use 2 cameras or 1 camera across both pages
    – use software that acts as a viewfinder so that you can see if the picture you took was too blurry, going back and correcting images is a pain
    – get friends to help you when you need a break
    – use a good camera, DSLRs are nice, having some sort of remote is a must (a wireless mouse taped to the floor with its cursor positioned over the shutter release button in the viewfinder software is what I used)
    – use a fully black background behind your book, imagemagick has nice cropping commands that you can use based on color
    – make sure you set up right (reflections are a pain) so your pictures don’t change much from beginning to end (that includes lighting and position of the book relative to the background – lighting because your contrast & brightness changes will mess up your batch scripts and changing the background will mess up your cropping scripts.)
    – Omnipage is the best for OCR (also most expensive if you buy it) BTW it took me a while to figure this one out but if Omnipage isn’t loading your pictures remove your EXIF info! You should be doing this anyways … (EXIF Stripper is handy). Acrobat actually sucks for OCR but I did use Acrobat to shrink the PDF size to about 100MB depending on things like color vs B/W and number of pages. Omnipage isn’t very good for compression but it will spit out a nice OCR’d pdf (that you can shrink nicely with Acrobat).
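The tip about cropping against a fully black background comes down to finding the bounding box of everything that is not background. The sketch below shows that idea in plain Python on a grid of grayscale values; it is not the commenter's ImageMagick script, and the names `page_bounding_box`, `crop_to_page` and the `threshold` default are assumptions made for illustration. (ImageMagick's `-trim` option, combined with `-fuzz`, performs a comparable border trim on real image files.)

```python
from typing import List, Optional, Tuple

# Rows of grayscale pixels: 0 is black, 255 is white. Assumed rectangular.
Image = List[List[int]]


def page_bounding_box(image: Image, threshold: int = 40) -> Optional[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) around pixels brighter than threshold.

    bottom and right are exclusive. Returns None if the image is all background.
    """
    rows = [y for y, row in enumerate(image) if any(p > threshold for p in row)]
    if not rows:
        return None
    width = len(image[0])
    cols = [x for x in range(width) if any(row[x] > threshold for row in image)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def crop_to_page(image: Image, threshold: int = 40) -> Image:
    """Cut away the dark border, keeping only the bounding box of the page."""
    box = page_bounding_box(image, threshold)
    if box is None:
        return []
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


if __name__ == "__main__":
    shot = [
        [0, 0, 0, 0],
        [0, 200, 210, 0],
        [0, 190, 220, 0],
        [0, 0, 0, 0],
    ]
    print(crop_to_page(shot))
```

A fixed threshold only works if the lighting stays constant from the first shot to the last, which is the reason the tips stress a consistent setup.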

    That’s roughly 4 years of experience for you. If there are classmates and friends willing to help you in exchange for a copy of the digitized materials (provided that they own a copy of the book of course) it makes it worthwhile. Doing all this yourself and just for yourself is really a pain in the ass but if you’re dedicated enough it can be done.

  19. One more tip. For doing a check on page numbers – DON’T count through all of them. Just check the page numbers after tapping the arrow key 5 times. This would be every 10 pages if you’re going 10 pages at a time so you should only see for example “10 … 11, 20 … 21, etc” and it’ll only take a couple minutes. You’ll know when you messed up if you see say “32 … 33”.

  20. Extending the tip of Dex, you could also do binary search: check if the last page number is correct; if not, check if the middle page has the correct number. If so, you know the error is in the second half of the pages; apply this recursively.
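The binary search described above might look something like this. It is a sketch under stated assumptions: one page per image, and a single skipped or doubled page that shifts every later page number, so that "aligned" images all come before "misaligned" ones. If a skip and a duplicate cancel each other out, the last-page check passes and this approach misses both. The function name and the `page_number_at` callback (standing in for a person glancing at image `i`) are hypothetical.

```python
from typing import Callable, Optional


def first_misaligned_index(
    page_number_at: Callable[[int], int],
    count: int,
    first_page: int = 1,
) -> Optional[int]:
    """Return the index of the first image whose page number is off, else None.

    page_number_at(i) gives the page number seen on image i (0-based).
    """
    def is_aligned(i: int) -> bool:
        return page_number_at(i) == first_page + i

    # Check the last page first: if it is right, assume nothing was missed.
    if count == 0 or is_aligned(count - 1):
        return None

    lo, hi = 0, count - 1  # invariant: image hi is misaligned
    while lo < hi:
        mid = (lo + hi) // 2
        if is_aligned(mid):
            lo = mid + 1  # error lies in the second half
        else:
            hi = mid      # error is at mid or earlier
    return lo


if __name__ == "__main__":
    pages = [1, 2, 3, 5, 6, 7, 8]  # page 4 was skipped
    print(first_misaligned_index(lambda i: pages[i], len(pages)))
```

For a 1000-image book this needs about ten looks at individual images instead of a check every ten pages.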

  21. If you don’t care about the book/mag then just cut the binding with a bandsaw and use a Fujitsu ScanSnap S300. That’s my setup. I can do about 200 pages in about a half hour.

  22. Regarding Slacker247’s comment, for a lot of people it’s not easy getting access to a band saw. I just saw on some other sites that the in-store printing departments at office supply chain stores like Office Depot have page cutters that can slice the spine off a book. They’ll do it for around $1 and it’s becoming a common enough request that they’re used to doing it.

    1. Many of the semi-skilled staff at office supply and Kinko’s stores don’t know how tightly to clamp the guillotine paper cutters. Therefore, during slicing, the material slips and the finished cut product is not square. I recommend that you mention this to the staff if you go this route.

      1. The reason squareness is an issue is that (a) you might want to rebind (wire or comb or other process) the pages, and (b) the side guides of the feeder of your scanner need a square original to feed properly.
