Document storage and data mining


Gary Roach

Jul 14, 2016, 7:02:30 PM
to django-users
Hi all;

I have finished most of the official Django tutorial, have started
fooling around with my actual project and have realized that I'm not
sure how to start. My project's initial objectives are as follows:

Photos of the pages of a document are aggregated into a single PDF file
for those who wish to view the original document. An RTF transcript of
the document is included for readability, and a TXT transcript is included
for global searches. A metafile, part of which will be generated by the
program, is included to facilitate keyword searches. I wish to be
able to search documents - probably by key word - and then pull up the
document set by document key. No changes to the documents will be
allowed after they have been loaded into the database. At present, my
main objective is to get the documents into the database in retrievable
form.

While - with the exception of the metafile - these are static files, we
are talking about hundreds of documents. I do not think that storing
them as static files will work. They have to be searchable. I assume
that I need a model that will set up appropriate fields in the database
(PostgreSQL). This is where I stumble. I've looked at the Model.field
reference but can't seem to come up with what I need or don't know what
I'm looking at. The latter being the most probable.

If someone could point me in the right direction or to documentation
that would help, it would be sincerely appreciated.

Gary R.

Javier Guerra Giraldez

Jul 15, 2016, 4:53:18 AM
to django...@googlegroups.com
On 15 July 2016 at 00:02, Gary Roach <gary71...@verizon.net> wrote:
> While - with the exception of the metafile - these are static files, we are
> talking about hundreds of documents. I do not think that storing them as
> static files will work. They have to be searchable. I assume that I need a
> model that will set up appropriate fields in the database (postgresql) This
> is where I stumble. I've looked at the Model.field reference but can't seem
> to come up with what I need or don't know what I'm looking at. The latter
> being the most probable.


Do store the PDFs as static files, just not in a single directory;
instead, add one or two levels of subdirectories. Alternatively, use
an object storage service such as S3 for that (check django-storages [1]
for an easy way to do it). To make them searchable, store the plain text
of the document in a TextField in the model, and add a full-text index
(you can use Watson [2] to help with that).


[1] http://django-storages.readthedocs.io/en/latest/
[2] https://github.com/etianen/django-watson
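
A minimal sketch of the "one or two levels of subdirectories" idea, assuming the subdirectory names are taken from a hash of the file name. The function name sharded_upload_path, the "documents" prefix, and the levels/width parameters are illustrative choices, not something from Django itself. The (instance, filename) signature matches what Django passes to a callable given as upload_to on a FileField.

```python
import hashlib
import posixpath


def sharded_upload_path(instance, filename, levels=2, width=2):
    """Return a relative path like 'documents/ab/cd/<filename>'.

    Hypothetical helper: the subdirectories come from the first hex
    characters of a SHA-256 hash of the file name, so files spread
    evenly over many small directories instead of one huge one.
    """
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    # Take `levels` consecutive chunks of `width` hex characters each.
    shards = [digest[i * width:(i + 1) * width] for i in range(levels)]
    # Storage paths use forward slashes regardless of the host OS.
    return posixpath.join("documents", *shards, filename)


if __name__ == "__main__":
    # `instance` is unused here, so None stands in for a model instance.
    print(sharded_upload_path(None, "report.pdf"))
```

In a model this would probably be wired in as FileField(upload_to=sharded_upload_path), with the plain-text transcript kept in a separate TextField for searching.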


--
Javier