Working on in-Alaveteli blackout/redacting software

Anton Stoychev

Jan 17, 2014, 8:14:31 AM
to alavet...@googlegroups.com
Hey, I'm part of the team working on Alaveteli deployment in Bulgaria.

We are doing this in close cooperation with the main FOI NGO here. They have made it a mandatory requirement that Alaveteli admins be able to redact parts of the answers. Think PDFs and images here.

So we are going to build such functionality.

A large chunk of the information returned by the authorities will be scanned PDFs. That is based on the statistics of the FOI organisation.

Our idea for redacting a PDF is to build an HTML5 canvas editor that supports selection (a marquee tool) and can delete or black out areas, or alternatively select the areas that are useful and keep only those.
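One detail a marquee tool like this has to handle is translating the selected rectangle from canvas pixels into page coordinates. The following is only a minimal sketch: `Rect`, `canvas_to_pdf_rect` and the `scale` parameter are hypothetical names, and it assumes an unrotated page rendered at a uniform scale with no crop-box offset. It relies on the fact that a canvas has its origin at the top-left, while default PDF user space has its origin at the bottom-left.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle given by two opposite corners."""
    x0: float
    y0: float
    x1: float
    y1: float


def canvas_to_pdf_rect(sel: Rect, scale: float, page_height_pt: float) -> Rect:
    """Map a marquee selection in canvas pixels to PDF points.

    `scale` is the number of canvas pixels per PDF point used when the
    page was rendered (hypothetical parameter for this sketch).
    The y axis is flipped because canvas y grows downward and PDF y
    grows upward.
    """
    # Normalise so the result does not depend on drag direction.
    left, right = sorted((sel.x0 / scale, sel.x1 / scale))
    top, bottom = sorted((sel.y0 / scale, sel.y1 / scale))
    return Rect(left, page_height_pt - bottom, right, page_height_pt - top)
```

The same rectangle can then be sent to the server, whichever redaction backend ends up being used.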

We'll add this to the Alaveteli admin.

A few issues I can think of:

1. Once the PDF is converted to an image, we lose all the text that could have been OCR-ed or was natively text in the PDF.
We might be able to run OCR on top of the cropped image, but that might not be very effective.
The main task here is not to lose the text, because that cripples searching.

2. We don't know what happens to the "View as html" functionality.
Alaveteli has the "View as html" functionality for attachments. We haven't thought about what that would look like after redaction.
Right now we might sacrifice that functionality.

3. Isn't there something better than image editing?
An open-source PDF web editor? I couldn't find one.

I would like this topic to serve as discussion for:

 1. How to best implement this?
 2. What is the most minimal working version of such redacting that doesn't fatally cripple the website? Let's say an MVP of the redacting functionality.

Andrei Petcu

Jan 17, 2014, 8:22:15 AM
to alavet...@googlegroups.com
Check this out!
http://viewerjs.org/

It's a unified ODF/PDF viewer. It combines Mozilla's PDF.js and WebODF to view documents with JS.
Maybe you can build on top of this.

Hope this helps,
Andrei

Anton Stoychev

Jan 17, 2014, 8:29:34 AM
to alavet...@googlegroups.com

--
You received this message because you are subscribed to a topic in the Google Groups "Alaveteli Dev" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/alaveteli-dev/p16CteEZ7Ag/unsubscribe.
To unsubscribe from this group and all its topics, send an email to alaveteli-de...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Andrei Petcu

Jan 17, 2014, 8:34:24 AM
to alavet...@googlegroups.com
Maybe the viewer is useful as a replacement for the view as html part.
It is used by ownCloud.

Andrei

Anton Stoychev

Jan 17, 2014, 8:39:07 AM
to alavet...@googlegroups.com
The problem with the "view as html" part will occur after we use the HTML5 canvas to edit the PDF. After the redaction it will no longer be a PDF but just an image, and the image will not be convertible to HTML :)

James McKinney

Jan 17, 2014, 10:31:02 AM
to alavet...@googlegroups.com
It may be worthwhile to look at Froide (https://github.com/stefanw/froide), a Django FOI platform that includes PDF redaction.

Anton Stoychev

Jan 17, 2014, 4:52:13 PM
to alavet...@googlegroups.com
Okay, that I didn't know about. Quite interesting!

I am running Froide locally and have tested the redaction; it works well.

However, the problem is still there, because the PDF redaction is based on ImageMagick and generates an image-based PDF. So no text.

Does anyone know of some clever OCR libraries? It would be nice if it were Ruby-based, but that's not a requirement.

Alternatively, does anyone know of a PDF web that can be used instead of ImageMagick? I am thinking of a server-side library (like ImageMagick) that can erase the text below a certain rectangle and replace it with ***** (stars).


Anton Stoychev

Jan 17, 2014, 4:54:00 PM
to alavet...@googlegroups.com
Correction:

Alternatively, does anyone know of a PDF library that can be used instead of ImageMagick? I am thinking of a server-side library (like ImageMagick) that can erase the text below a certain rectangle and replace it with ***** (stars).

Tony Bowden

Jan 17, 2014, 6:38:49 PM
to alavet...@googlegroups.com
On 17 January 2014 23:52, Anton Stoychev <anti...@gmail.com> wrote:
> However problem is still there because the PDF redaction is based on
> ImageMagick and the redaction generates a image-based pdf. So no text.

If the initial document is already a text-based PDF, rather than just
an image, then Alaveteli's censor rules already work with it just fine
— there's no need for any new functionality here at all[1]. We do
stuff like this on WhatDoTheyKnow all the time. Or have I
misunderstood the requirement?

For PDFs that are already image-based, they're already not going
to be searchable etc. anyway, so again you're not really losing
anything by just blacking part of that out (whether 'within'
Alaveteli, or by just downloading the document, manipulating it offline,
and uploading it again over the old one).

Of course, adding functionality to Alaveteli to do some sort of OCR
against _all_ image documents so we can index the text etc would be
superb, but that's a more general issue that's not really related to
this sort of redaction issue, beyond making sure that redacting a
document retriggers that OCR.

Tony


[1] Well, there are some outstanding tickets to make this work better
/ more easily for redactions that span multiple lines, for example,
and patches for that would be very welcome indeed! :)

Stefan Wehrmeyer

Jan 18, 2014, 6:32:55 AM
to alavet...@googlegroups.com, to...@mysociety.org
Hi,

Developer of Froide here. I would definitely like to have functionality akin to Acrobat Pro redaction, but building that in the browser was out of scope back then and we wanted to move forward. Automatically regexing and removing in binary (as Alaveteli does, as far as I can tell) is not good enough when you go beyond email addresses and want to redact postal addresses, names, etc. Due to the nature of PDFs, the parts of a word could be scattered around the file.
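To illustrate the scattering problem, here is a rough sketch of regex masking applied to raw bytes. It is not Alaveteli's actual code: `EMAIL` and `mask_emails` are hypothetical names, the same-length masking is an assumption made here to keep byte offsets stable, and the two byte strings are hand-written, uncompressed content-stream fragments. The first uses a single string with the `Tj` operator; the second splits the same text into kerned pieces with `TJ`, which is enough to hide it from the pattern.

```python
import re

# Simplified email pattern, for illustration only.
EMAIL = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(data: bytes) -> bytes:
    """Replace every matched address with 'x' bytes of the same length."""
    return EMAIL.sub(lambda m: b"x" * len(m.group(0)), data)


# The address is stored as one contiguous string: the regex finds it.
contiguous = b"(Contact: jane@example.org) Tj"

# The same text split into kerned fragments: no contiguous match exists.
scattered = b"[(Contact: ja) -20 (ne@exam) 15 (ple.org)] TJ"

print(mask_emails(contiguous))
print(mask_emails(scattered) == scattered)
```

In this sketch the first fragment is masked and the second passes through unchanged, even though both would render the same address on the page.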

Froide's redaction is powered by PDF.js. I would definitely help out in building a plugin/library that offers redaction in the browser without destroying the text of the PDF. It could work something like this: users select an area, PDF.js finds out what elements are in there, removes them or blacks them out, and writes the PDF back to the server. PDF.js was not that well documented the last time I checked, and I haven't seen any modify/write features. I will look again.
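The "find out what elements are in there" step can be sketched independently of any PDF library. The data model below (`Box`, `TextItem`, `redact`) is hypothetical and is not the PDF.js API; it only assumes that each text element comes with a bounding box in the same coordinate system as the selection. Any element that overlaps the selection even partially is removed whole, which errs on the side of over-redacting.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

    def intersects(self, other: "Box") -> bool:
        # Boxes that merely share an edge do not count as overlapping.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


@dataclass(frozen=True)
class TextItem:
    text: str
    box: Box


def redact(items: List[TextItem], selection: Box) -> Tuple[List[TextItem], List[TextItem]]:
    """Split items into (kept, removed) by overlap with the selection."""
    kept, removed = [], []
    for item in items:
        (removed if item.box.intersects(selection) else kept).append(item)
    return kept, removed


items = [
    TextItem("Dear", Box(50, 700, 80, 712)),
    TextItem("John Smith", Box(85, 700, 150, 712)),
    TextItem("Regards", Box(50, 600, 100, 612)),
]
kept, removed = redact(items, Box(80, 695, 160, 715))
```

Here only "John Smith" lands in `removed`. Writing the remaining elements back into a valid PDF is the hard part and is not covered by this sketch.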

My advice: don't overthink it for now, move forward. The problem of bad PDF redaction can be solved later (keep original PDFs obviously!), the problem of how to help people make FOI requests is more important.

Feel free to use the code I wrote on top of PDF.js:

Cheers
Stefan

Henare Degan

Jan 18, 2014, 9:47:44 AM
to alavet...@googlegroups.com
On 18 January 2014 08:32, Stefan Wehrmeyer <stefanw...@gmail.com> wrote:
> My advice: don't overthink it for now, move forward. The problem of bad PDF redaction can be solved later (keep original PDFs obviously!), the problem of how to help people make FOI requests is more important.

+1. Great advice!

Cheers,

Henare
--
Henare Degan
Volunteer & Director - OpenAustralia Foundation

e » hen...@oaf.org.au
w » www.openaustraliafoundation.org.au
t » @OpenAustralia

Anton Stoychev

Jan 18, 2014, 11:45:07 AM
to alavet...@googlegroups.com
@Tony - Ah, right, I didn't know text-based PDFs were already being processed, cool.

@Stefan, @Henare

So what we'll do is:

1) Try to port Froide's PDF redaction. I'm not certain whether converting the PDF to an image and then opening a canvas editor before the redaction isn't easier than loading a PDF in PDF.js, or at least more lightweight frontend-wise.
2) Look for a good Cyrillic (or generic) OCR library and check what the implementation involves. I would imagine that if we have the PDF as an image, it is a matter of making a subprocess call and capturing the output. I already know a dev who was working with some OCR, but it was PHP-based for some reason. If the implementation is too convoluted, we'll stop at redaction.
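The "subprocess call and capture the output" idea in step 2 could look roughly like this. No OCR tool has been chosen in this thread, so using the Tesseract command-line program is purely an assumption, as are the helper names `build_ocr_command` and `ocr_image`. The sketch requires the `tesseract` binary and Bulgarian language data (`bul`) to be installed; the `run` parameter exists only so the call can be replaced in tests.

```python
import subprocess
from typing import Callable, List


def build_ocr_command(image_path: str, lang: str = "bul") -> List[str]:
    """Build the command line for one page image.

    Passing "stdout" as the output base makes tesseract print the
    recognised text instead of writing a .txt file.
    """
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path: str, lang: str = "bul",
              run: Callable[..., subprocess.CompletedProcess] = subprocess.run) -> str:
    """Run OCR on one image and return the recognised text.

    Raises subprocess.CalledProcessError if the tool exits with an error.
    """
    result = run(build_ocr_command(image_path, lang),
                 capture_output=True, text=True, check=True)
    return result.stdout
```

The returned text could then be indexed for search alongside the redacted image, so that redaction does not cost searchability.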

I know that for the sake of a working website we need only the minimum to get it working, but it would be nice if this topic existed as a discussion of what we can achieve, and how, in the best-case scenario.

Tony Bowden

Jan 18, 2014, 11:57:34 AM
to alavet...@googlegroups.com
On 18 January 2014 18:45, Anton Stoychev <anti...@gmail.com> wrote:
> 1) Try to port Froide's pdf redaction. I'm not certain whether converting
> the pdf to image and then opening canvas editor before the redaction isnt
> easier that loading a pdf in pdf.js? Or at least more lightweight
> frontend-wise.

Rather than porting what's already in Froide, and ending up with two
slightly different versions that would then have to be maintained in
parallel etc, it would be superb if you and Stefan could work together
to come up with a better, more generic, version both platforms could
use. And probably others too — this seems like something that would be
useful in lots of other contexts...

Tony

Anton Stoychev

Jan 25, 2014, 11:55:36 PM
to alavet...@googlegroups.com
@Stefan, do you want to join me for a more interactive online chat, hangout or call so that we can define what we can do?

Things like: 
 - Whether we can extract the functionality?
 - If extracted, how would other apps use it?

I am actively working on several Django projects, so I can work my way through the code, but it would be quicker and nicer with your help.

Stefan Wehrmeyer

Jan 28, 2014, 5:46:09 AM
to alavet...@googlegroups.com
@Anton: sure, I will send you details off-list