Stepping down as church treasurer led to the eviction of several large box files full of paper from my spare bedroom / home office. Inspired by a recent article in PC Pro, I looked at the freed-up space and thought ... what if all the stacks of paper in my flat could be made to disappear?

Scan it and shred it

Keeping paper originals of most things is increasingly unnecessary in the UK, and indeed most utility companies etc. either no longer send paper bills or charge extra for doing so.

I didn't want to spend hundreds on a fancy double-sided, sheet-feeding desktop scanner (or have it occupying all the space I'd just reclaimed!), but fortunately I was able to borrow one from work for the weekend to scan in all the old paper worth keeping.

That done, the question is how best to digitise and destroy new paper as it comes in. Following the PC Pro article, I managed to find a Fujitsu ScanSnap iX100 on eBay - it was "reconditioned" but came in the original box with all the manuals and shrink wrap, so a bit of a bargain for £130. It's tiny, battery powered, and communicates over WiFi. The killer feature on top of that is that it can scan directly into various cloud services (e.g. Dropbox) without needing to be paired with a PC or phone. So I can push incoming post etc. straight through it without having to boot up a laptop first or faff with an app.

The PC Pro article didn't go into details on how all this works, and I was initially disappointed and thought I'd misunderstood. However, the thing to do is ignore the Windows software, ignore the "ScanSnap" Android app, and go directly to "ScanSnap Cloud". This is the one which you can use to configure the scanner to hook up to your WiFi, scan directly to ScanSnap's cloud service (free once you've bought the hardware), and sync from there to Dropbox/wherever, without needing an intermediate device. You know you've got it all set up right when powering on the scanner makes the scan button and WiFi light go purple.

Incidentally, I thought for a few minutes that I'd bought a dud (even after charging it up for a few hours) because I couldn't work out how to get it to power on. The answer is to open both the paper trays out (duh!).

I'm properly impressed with this now it's up and running - the OCR is very good and simply embeds the text in the PDF while leaving the original image of the page visible - so you can hit Ctrl-F and find text. I was even more impressed that, rather than simply naming files after today's date and time, it has a reasonably good go at extracting a date from the document itself and also a file name (e.g. it manages to pick out the name of the bank when fed a bank statement).

Security?

Of course, for all this to work, one has to be happy with one's potentially quite sensitive documents being fed to a cloud service.

ScanSnap Cloud keeps your scan history for two weeks. This isn't configurable (although you can purge it manually from the app). That's good enough for me - anything especially sensitive can be zapped as soon as it's scanned; most things can be cleaned up automatically. Obviously the history purging doesn't affect the copies saved to Dropbox or similar.

Update: a spot of network sniffing reveals that (apart from DNS lookups) the only communicating it does is over HTTPS to a service hosted in Microsoft Azure. All pretty sensible.

Finishing touches

So at this point I have a "ScanSnap" directory in the root of my Dropbox which is full of PDFs with reasonably helpful file names. Leaving them all in one big flat folder and using the Dropbox search function (which does search the text inside the PDFs) might be good enough, but a bit of sorting wouldn't go amiss.

Paul Ockenden mentioned in the original article that what he really wanted was for documents to be automatically sorted into folders depending on what they were. ScanSnap isn't quite that clever (but then again, a universally "right" answer to that problem would be pretty tricky). However, this is where a spot of scripting rounded it off for me.

My Dropbox folder is already synchronised onto a Linux machine, so what I wanted was a script to fire as new files came in which would spot certain patterns in the file name and move the PDF into the right place accordingly. inotify is the Linux feature of choice for this job; a bit of experimentation confirmed that Dropbox seems to buffer incoming files somewhere temporary and then move them into the right place, so listening for IN_MOVED_TO events in the ScanSnap directory allows one to apply some simple rules based on the file name. I'll post more on that (and the code) another time.
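
In the meantime, here is a minimal sketch of the idea in Python. It is an illustration under assumptions rather than the finished script: the rule patterns, the folder names and the function names (destination_for, file_away, watch) are all made up for the example, and it talks to inotify through ctypes and glibc's libc.so.6 purely to avoid needing any third-party package.

```python
import ctypes
import os
import re
import shutil
import struct
from pathlib import Path

IN_MOVED_TO = 0x00000080  # event mask value from <sys/inotify.h>

# Hypothetical rules: (pattern searched for in the file name, target subfolder).
RULES = [
    (re.compile(r"bank|statement", re.IGNORECASE), "Bank"),
    (re.compile(r"gas|electric|water", re.IGNORECASE), "Utilities"),
]


def destination_for(filename, rules=RULES):
    """Return the subfolder for a PDF file name, or None if no rule matches."""
    if not filename.lower().endswith(".pdf"):
        return None
    for pattern, folder in rules:
        if pattern.search(filename):
            return folder
    return None


def file_away(path, archive_root):
    """Move path into the matching subfolder of archive_root.

    Returns the new path, or None if the file was left where it was.
    """
    path = Path(path)
    folder = destination_for(path.name)
    if folder is None:
        return None
    target_dir = Path(archive_root) / folder
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / path.name
    if target.exists():
        return None  # never overwrite a document that is already filed
    shutil.move(str(path), str(target))
    return target


def watch(scan_dir, archive_root):
    """Block forever, filing each file that is moved into scan_dir."""
    libc = ctypes.CDLL("libc.so.6", use_errno=True)  # assumes Linux with glibc
    fd = libc.inotify_init()
    if fd < 0:
        raise OSError(ctypes.get_errno(), "inotify_init failed")
    if libc.inotify_add_watch(fd, os.fsencode(scan_dir), IN_MOVED_TO) < 0:
        raise OSError(ctypes.get_errno(), "inotify_add_watch failed")
    header = struct.Struct("iIII")  # struct inotify_event: wd, mask, cookie, len
    while True:
        data = os.read(fd, 4096)  # returns one or more whole events
        offset = 0
        while offset < len(data):
            _wd, mask, _cookie, name_len = header.unpack_from(data, offset)
            offset += header.size
            name = data[offset:offset + name_len].rstrip(b"\0")
            offset += name_len
            if mask & IN_MOVED_TO and name:
                file_away(Path(scan_dir) / os.fsdecode(name), archive_root)
```

Calling watch() with the ScanSnap directory and the folder to sort into leaves it running until interrupted. An inotify watch on a directory is not recursive, so the sorted folders can live underneath the ScanSnap directory without the script reacting to its own moves. Anything that matches no rule simply stays put for filing by hand.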