Automatic upload from certain staging folder


Joshua Jonah

Jul 30, 2014, 11:02:00 AM
to mayan...@googlegroups.com
Is there any way to set up something like a watch folder? I have a staging folder, but having to go in and upload the files by hand seems unnecessary. Has anybody done this before?

Michel Lavoie

Jul 30, 2014, 12:00:24 PM
to mayan...@googlegroups.com
It seems like a good idea, but how would you handle metadata? I thought about using a script to automate uploads for specific documents, but I'm afraid that poorly documented files would render my collection useless in the end.

Joshua Jonah

Jul 30, 2014, 12:02:58 PM
to mayan...@googlegroups.com

This would be a specific directory only containing a specific type of file.

You received this message because you are subscribed to a topic in the Google Groups "Mayan EDMS" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/mayan-edms/L2RnhallmnM/unsubscribe.
To unsubscribe from this group and all its topics, send an email to mayan-edms+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Roberto Rosario

Jul 30, 2014, 1:36:52 PM
to mayan...@googlegroups.com

This feature was actually started some time ago (https://github.com/mayan-edms/mayan-edms/blob/master/mayan/apps/sources/models.py#L194) but is not yet enabled because it depends on some scheduling updates that have not made it into the master branch.

As for metadata, I came up with some ideas but none are implemented. One was to let users set default metadata values, as well as the document type, for each watch folder. Another idea was, when a document is being imported from a watch folder, to look for a file with the same name but with the .metadata extension. No design decision has been reached yet, so any ideas are welcome.
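
To make the two ideas concrete, here is a minimal sketch of how they could be combined: per-folder defaults that an optional sidecar file overrides. Nothing here is Mayan code. The function name, the defaults dictionary, the JSON format of the sidecar, and the choice to replace the document's extension (scan.pdf -> scan.metadata) are all assumptions, since no format has been decided.

```python
import json
from pathlib import Path

# Hypothetical defaults a user might configure for one watch folder.
FOLDER_DEFAULTS = {"document_type": "invoice", "vendor": "unknown"}


def load_metadata(document_path, folder_defaults):
    """Return the folder defaults, overridden by a sidecar file if one exists."""
    document_path = Path(document_path)
    # Assumption: "same name" means the extension is replaced,
    # e.g. "scan.pdf" -> "scan.metadata".
    sidecar = document_path.with_suffix(".metadata")
    metadata = dict(folder_defaults)
    if sidecar.is_file():
        # Assumption: the sidecar holds a flat JSON object of field/value pairs.
        metadata.update(json.loads(sidecar.read_text(encoding="utf-8")))
    return metadata
```

With this layout a folder that only ever receives one kind of file needs no sidecars at all, while individual files can still carry their own values.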


Mathias Behrle

Aug 27, 2014, 5:22:32 PM
to mayan...@googlegroups.com
Both possibilities could have their individual use cases, for which they fit
best. The most flexible approach is the second.

What I found when evaluating other DMS software:

- Inclusion of some identifier on the document (could be a barcode, or some
  specially formatted string, or...). This identifier need not be fixed to
  the document itself; it could be the first page of a scan or a sheet of
  paper scanned together with the document. This method is best suited to
  scanned documents.

- Rather straightforward is a sort of recognition where templates can be
  defined containing regions formatted in an individual way. E.g. if you have
  a supplier with a custom invoice format that displays the invoice number,
  date, and amount at fixed places, those places could be defined in such a
  template and the software can check whether the document contains such a
  region.

  Perhaps this could be used in a slightly modified but simpler form by
  defining string patterns that are matched against the OCR result. In the
  end, recurring patterns could then be used to extract metadata.
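
The string-pattern variant could look roughly like the sketch below, which applies one regular expression per metadata field to the OCR text. The field names, the patterns, and the sample text are made up for illustration; real patterns would have to be defined per supplier or document type.

```python
import re

# Hypothetical patterns for one supplier's invoices: one regex per field,
# with the value to extract in the first capture group.
PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s+No\.?\s*:?\s*(\w[\w-]*)", re.IGNORECASE),
    "invoice_date": re.compile(r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
}


def extract_metadata(ocr_text, patterns):
    """Return a dict with every field whose pattern matched the OCR text."""
    found = {}
    for field, pattern in patterns.items():
        match = pattern.search(ocr_text)
        if match:
            found[field] = match.group(1)
    return found


sample = "ACME Ltd.\nInvoice No: INV-2014-0042\nDate: 2014-08-27\nTotal: 120.00 EUR"
print(extract_metadata(sample, PATTERNS))
```

Fields whose pattern does not match are simply left out, so the caller can tell which metadata still has to be entered by hand.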


In any case, I would find it useful to have a document queue containing
documents that are already processed (OCR available) but still have to be
completed with metadata. This is, so to speak, an inversion of the current
workflow (where metadata are defined first).
As already discussed in https://github.com/mayan-edms/mayan-edms/issues/9, I
think it would be best for the manual completion of metadata to have a view of
the document together with its OCR data available directly on the metadata form.

I am imagining a staging folder, from which the documents are processed
immediately. If after the initial processing no metadata are available
for the document, it is added to the postprocessing queue. When finally
(manually) processed, those documents are removed from the queue.

There should be some configuration options:
- Which metadata are required to be filled for a document to be able to leave
the queue?
- Should only documents missing the required metadata be added to the queue or
just all (if postprocessing control for all processed documents is desired)?
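
As a rough illustration of these two options, the sketch below decides whether a document enters the queue and whether it may leave it. The function and parameter names are hypothetical, and a field counts as filled only if it is present and non-empty.

```python
def needs_postprocessing(metadata, required_fields, queue_all=False):
    """Decide whether a freshly processed document is added to the queue."""
    if queue_all:
        # Option 2: every processed document is reviewed.
        return True
    # Option 1: only documents missing a required field are queued.
    return any(not metadata.get(field) for field in required_fields)


def can_leave_queue(metadata, required_fields):
    """A document may leave the queue once all required fields are filled."""
    return all(metadata.get(field) for field in required_fields)
```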

That is my brainstorming so far; comments are, as always, very welcome.

Roberto Rosario

Sep 3, 2014, 2:47:52 PM
to mayan...@googlegroups.com
On Wednesday, August 27, 2014 5:22:32 PM UTC-4, Mathias Behrle wrote:
> - Inclusion of some identifier on the document (could be a barcode, or some
>   special formatted string, or...). [...]

I like the barcode/QR code idea very much. It would allow for batch scanning: for example, several documents placed in a scanner with a document feeder, each with a printed page carrying a barcode that defines the metadata, much like fax cover pages.

> - Rather straightforward is a sort of recognition, where templates can be
>   defined containing regions formatted in an individual way. [...]

Regional OCR is a must-have feature and usually a defining feature of the commercial offerings. I don't know how accurate OCRing a rectangle of text would be, but if there is a need for the feature, let's do it. I see some requirements: we need a way to let users mark/highlight the fields they want scanned and entered as metadata. This would require some design decisions (do we store the cursor's x and y positions of the square to be scanned, or the x and y % in relation to the current zoom level?) and a rich client w/ corresponding API endpoints to talk to the backend.

> I am imagining a staging folder, from which the documents are processed
> immediately. If after the initial processing no metadata are available
> for the document, it is added to the postprocessing queue. [...]

I'm still wrapping my head around the post-processing queue. Do we create a post-processing queue attached to or detached from the watch folder?

Thanks a lot, Mathias!

Mathias Behrle

Sep 3, 2014, 6:48:15 PM
to mayan...@googlegroups.com
Yes, the comparison with the Fax cover page hits the mark.

Question:
When batch scanning, how to determine the beginning and the end of a batch?
Will each document require a 'cover page' or can such a cover page be valid for
several documents? Perhaps the number of documents could be included on the
cover page, but this would always require a new cover page per batch.

> Regional OCR is a must have feature and usually a defining feature of the
> commercial offerings, I don't know how accurate OCRing a rectangle of text
> would be but if there is a need for the feature let's do it. I see some
> requirements, we need a way to let users mark/highlight the fields they
> want scanned and entered as metadata. This would require some design
> decisions (do we store the cursor's x and y positions of the square to be
> scanned or the x and y % in relation to the current zoom level)

The more agnostic of the zoom level, the better. So I would think x and y in
relation to X and Y (where X and Y are the dimensions of the whole page).
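
A small sketch of what zoom-independent storage could look like: the marked rectangle is stored as fractions of the page width and height, and converted back to pixels for whatever resolution the page is rendered at. The class and function names are hypothetical and not part of Mayan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A rectangle stored as fractions (0..1) of the page width and height."""
    x: float
    y: float
    width: float
    height: float


def region_from_pixels(left, top, right, bottom, page_width, page_height):
    """Convert a selection made on a rendered page (at any zoom) to fractions."""
    return Region(
        x=left / page_width,
        y=top / page_height,
        width=(right - left) / page_width,
        height=(bottom - top) / page_height,
    )


def region_to_pixels(region, page_width, page_height):
    """Map a stored region onto a page rendered at the given pixel size."""
    left = round(region.x * page_width)
    top = round(region.y * page_height)
    right = round((region.x + region.width) * page_width)
    bottom = round((region.y + region.height) * page_height)
    return left, top, right, bottom
```

A rectangle marked on a 600x800 preview then maps onto the same area of a 1200x1600 rendering of that page, so the OCR step can crop the full-resolution image regardless of the zoom level used while marking.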

> and a rich client w/ corresponding API endpoints to talk to the backend.

Do you mean that a separate client is needed for that purpose?

Regarding the post-processing queue: hopefully I understand your question. My vision is:

- process all documents in the watch folder, i.e.:
  - do OCR
  - include the document in the database
  - add the document to the post processing queue
  - delete the document from the watch folder

so that all documents are already available in the system, but should/can be
post-processed to control/add metadata.

So IIUC the question, the queue is independent from the watch folder.

We should probably also add an option, when uploading via the API, to choose
whether the document should be added to this queue.

A configuration option to skip inclusion in the post-processing queue if
certain metadata are registered (e.g. specific metadata fields contain values)
would also be nice.
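
The steps above could be sketched as follows. This is only an illustration under stated assumptions: the database is a plain list, the queue is a deque that is independent of the watch folder, run_ocr is a placeholder that returns an empty string, and every function and parameter name is hypothetical rather than Mayan's API.

```python
from collections import deque
from pathlib import Path


def run_ocr(path):
    """Placeholder for a real OCR step; returns no text in this sketch."""
    return ""


def process_watch_folder(folder, database, queue, default_metadata=None, skip_queue=None):
    """Import every file in `folder`, queue it for review, then delete it."""
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        document = {
            "name": path.name,
            "content": path.read_bytes(),
            "ocr_text": run_ocr(path),                 # do OCR
            "metadata": dict(default_metadata or {}),
        }
        database.append(document)                      # include in the database
        # Optional rule: skip the queue if certain metadata are already set.
        if skip_queue is None or not skip_queue(document["metadata"]):
            queue.append(document["name"])             # add to the queue
        path.unlink()                                  # delete from the watch folder
```

Because the queue is passed in rather than owned by the folder, an upload through the API could append to the same queue.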


--

Mathias Behrle
PGP/GnuPG key available from any keyserver, ID: 0x8405BBF6

Matthias Löblich

Sep 5, 2014, 10:28:00 AM
to mayan...@googlegroups.com
Hi,
I am an IT consultant located in Austria with experience in building workflows (mostly in Java using jBPM).

Here is an idea for implementing a simple workflow engine to process newly uploaded documents:

A workflow consists of configurable actions, which are processed by a workflow handler. An action is a Python class that must have two functions, init and run. All the data generated within an action is stored in a context (a dict). This context is passed to each action (via its init function) so that the data from the previous actions can be reused. In the example below, the "fax cover sheet detection action" uses the text and the layout information (hOCR) generated by the "tesseract action" to classify which kind of document it is. The tagging action afterwards uses the classification to tag the document.

Example workflow run for TIFF (or other images):
3.1.1) run command line action: unpaper
3.1.2) run command line action: tesseract
3.1.3) run fax cover sheet detection action
3.1.4) run document tagging action
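
A minimal sketch of the idea, assuming the handler calls init(context) and then run() on each action in order. The class names are hypothetical, and the OCR action is a stand-in that writes fixed text into the context instead of calling tesseract.

```python
class Action:
    """Base class: every action receives the shared context in init()."""

    def init(self, context):
        self.context = context

    def run(self):
        raise NotImplementedError


class FakeOcrAction(Action):
    """Stand-in for the tesseract command line action."""

    def run(self):
        self.context["text"] = "FAX COVER SHEET\nTo: Accounting"


class CoverSheetDetectionAction(Action):
    """Classify the document using the text left by the OCR action."""

    def run(self):
        text = self.context.get("text", "")
        is_cover = "FAX COVER SHEET" in text.upper()
        self.context["classification"] = "fax cover sheet" if is_cover else "unknown"


class TaggingAction(Action):
    """Tag the document with the classification from the previous action."""

    def run(self):
        self.context.setdefault("tags", []).append(self.context["classification"])


def run_workflow(actions, context=None):
    """Workflow handler: run each action in order on one shared context."""
    context = {} if context is None else context
    for action in actions:
        action.init(context)
        action.run()
    return context


result = run_workflow([FakeOcrAction(), CoverSheetDetectionAction(), TaggingAction()])
print(result["tags"])
```

Since each action only reads and writes the dict, actions can be reordered or swapped in configuration without knowing about each other.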

Some ideas for useful generic actions:
- run command line program action (e.g. unpaper, tesseract, convert)
- matching layout action (hOCR parser, to check if text is in a specific region of the document)
- regular expression action (parsing text for tagging the document)
- messaging action (notify a user or another system that a specific document has been imported)

Happy to get your feedback!

Matthias

Roberto Rosario

Sep 6, 2014, 2:37:31 AM
to mayan...@googlegroups.com
I don't think we would need to specify the page count. We can come up with some base codes that are encoded into a QR code and printed as a cover page. When Mayan detects the cover page, all documents or pages detected afterwards inherit whatever metadata, document type, or other setting is specified in the cover page. If another cover page is detected, Mayan knows this is the beginning of a new document or documents. Example:

* A 'set metadata' cover page with some values encoded: vendor="vendor 1"
* A 'set metadata' cover page with some values encoded: vendor="vendor 2"
* A 'new document' cover page

The physical document paper sandwich would be:

- Set metadata cover, vendor 1
- New document cover page
- Document 1 page 1
- Document 1 page 2
- New document cover page
- Document 2 page 1
- Document 2 page 2
- Set metadata cover, vendor 2
- New document cover page
- Document 3 page 1
- Document 3 page 2

All of this is scanned in one go using a paper feeder, and we have just scanned and pushed into Mayan 3 multipage documents, 2 of them using the same metadata and one with different metadata. We can create more 'control messages' for new cover page types as we need them along the road, and so cover several user scenarios. We can create an 'End document' cover page if needed. The cover page is just a blank page with a QR code. Control cover pages can be physically reused or photocopied. Only cover pages with dynamic user data, like the set metadata cover page, would need to be printed more than once if the metadata changes; if the metadata recurs, like, say, vendor names, they can also be reused.
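
To check that the scheme works on the sandwich above, here is a small sketch that splits a scanned stream into documents. It assumes each scanned page has already been reduced to a (kind, payload) pair, where kind is "set_metadata", "new_document", or "page"; decoding the QR codes is out of scope. The names are hypothetical, and a set metadata cover is assumed to replace the previous metadata rather than merge with it.

```python
def split_batch(pages):
    """Group scanned pages into documents according to the cover pages."""
    documents = []
    current_metadata = {}
    current = None
    for kind, payload in pages:
        if kind == "set_metadata":
            # Applies to every document that follows, until the next such cover.
            current_metadata = dict(payload)
            current = None
        elif kind == "new_document":
            current = None
        elif kind == "page":
            if current is None:
                # Create documents lazily so two covers in a row leave no empty one.
                current = {"metadata": dict(current_metadata), "pages": []}
                documents.append(current)
            current["pages"].append(payload)
    return documents


batch = [
    ("set_metadata", {"vendor": "vendor 1"}),
    ("new_document", None),
    ("page", "Document 1 page 1"),
    ("page", "Document 1 page 2"),
    ("new_document", None),
    ("page", "Document 2 page 1"),
    ("page", "Document 2 page 2"),
    ("set_metadata", {"vendor": "vendor 2"}),
    ("new_document", None),
    ("page", "Document 3 page 1"),
    ("page", "Document 3 page 2"),
]
for document in split_batch(batch):
    print(document["metadata"], len(document["pages"]))
```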
 
> > and a rich client w/ corresponding API endpoints to talk to the backend.
>
> Do you mean, a separate client is needed for that purpose?

I meant that some interactive JavaScript/jQuery code will be needed in the template; sorry about the confusing wording :)

Mathias Behrle

Sep 6, 2014, 7:30:09 AM
to mayan...@googlegroups.com
Sounds good to me! It covers all scenarios I can imagine, including the one to
reset the metadata with a metadata cover containing only empty values.

Thanks, that is clear to me now.