How to populate DB from PDF extracted data


Shazia Nusrat

Mar 9, 2018, 3:01:32 AM
to django...@googlegroups.com
Hi,

I am trying to work with PDFs: a user uploads a PDF to an ImageField or FileField, and I then need a way to extract its contents in Django and finally update a DB table based on them. The models are as follows:

class StudentFee(models.Model):
    class_name = models.CharField(choices=CLASSES, max_length=200)
    # FileField rather than ImageField: ImageField form validation
    # rejects non-image uploads such as PDFs.
    fee_deposit_slip = models.FileField(upload_to="students/")

    def __unicode__(self):
        return unicode(self.class_name)

All I need is to design a view that extracts data from the uploaded PDF and saves it to the model below:

class StudentInfo(models.Model):
    first_name = models.CharField(max_length=200)
    last_name = models.CharField(max_length=200)
    email = models.EmailField()
    phone = PhoneField()  # custom PhoneField

    def __unicode__(self):
        return unicode(self.first_name)

All the fields in the second model exist in the PDF. In my views.py:

class StudentPDFReader(FormView):
    template_name = 'pdfdata.html'
    form_class = PDFForm
    success_url = '/success/'

    def form_valid(self, form):
        # here I need to extract the data and add entries via the model form
        return super(StudentPDFReader, self).form_valid(form)
            
Looking for kind help.

Regards,
Shazia
 

m1chael

Mar 9, 2018, 4:28:07 AM
to django...@googlegroups.com
Good luck.

In my opinion, the best-case scenario is using the utility pdf2text plus regular expressions, and even that will be painful.
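As a rough illustration of the regex half of that approach, here is a minimal sketch. It assumes the PDF text has already been extracted to a plain string, and that the slip prints one "Label: value" pair per line. The labels, the FIELD_PATTERNS table, and the parse_student_fields helper are all hypothetical and would need adjusting to the real slip layout.

```python
import re

# Hypothetical slip layout: one "Label: value" pair per line.
# The labels are assumptions; adjust them to match the real PDF text.
_FLAGS = re.MULTILINE | re.IGNORECASE
FIELD_PATTERNS = {
    "first_name": re.compile(r"^First Name:[ \t]*(.+?)[ \t]*$", _FLAGS),
    "last_name": re.compile(r"^Last Name:[ \t]*(.+?)[ \t]*$", _FLAGS),
    "email": re.compile(r"^Email:[ \t]*(\S+@\S+)[ \t]*$", _FLAGS),
    "phone": re.compile(r"^Phone:[ \t]*([+\d][\d \t()-]*?)[ \t]*$", _FLAGS),
}


def parse_student_fields(text):
    """Return a dict containing only the fields found in the extracted text."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1)
    return fields
```

Inside form_valid, the resulting dict could then be checked for missing keys and, if complete, passed on as StudentInfo.objects.create(**fields). Any field the regexes fail to find is simply absent from the dict, so the view has to decide how to handle partial matches.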




Jason

Mar 9, 2018, 6:21:15 AM
to Django users
PDF processing is very difficult, because the entire standard is a dumpster fire. For example, it has no concept of structure such as headings, paragraphs, or sentences: each character is stored as just a glyph with a location coordinate, font size, and font type.

To process the document and recover some of that structure, you have to rely on heuristics. Check out https://github.com/pdfminer/pdfminer.six and, as Mike said above, good luck. It is not a simple task.
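To make the "heuristics" point concrete, here is a toy sketch of one such heuristic: rebuilding text lines from loose characters by grouping those that share roughly the same vertical position. The (char, x, y) tuple format, the group_chars_into_lines name, and the y_tolerance default are assumptions made for illustration; this is not how pdfminer.six represents its data.

```python
def group_chars_into_lines(chars, y_tolerance=2.0):
    """Rebuild text lines from (char, x, y) tuples.

    A character joins the current line when its y coordinate lies within
    y_tolerance of that line's first character; otherwise it starts a new
    line. Each line is then read left to right by x coordinate.
    """
    lines = []  # each entry: [y_of_first_char, [(x, char), ...]]
    # PDF y coordinates grow upwards, so descending y means top of page first.
    for char, x, y in sorted(chars, key=lambda c: -c[2]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, char))
        else:
            lines.append([y, [(x, char)]])
    return ["".join(c for _, c in sorted(items)) for _, items in lines]
```

pdfminer.six performs a far more thorough version of this layout analysis, and its pdfminer.high_level.extract_text function returns the reconstructed text, so in practice the library does this work for you. The sketch only shows why the result depends on tolerances and ordering rules rather than on any structure stored in the file.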

Jaap van Wingerde

Mar 9, 2018, 9:17:38 AM
to django...@googlegroups.com
Use 'pdftohtml -xml' to convert the PDF to an XML file, then apply
regular expressions to each line of the XML file to extract the data.

[pdftohtml]
https://www.sourceforge.net/projects/pdftohtml/
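A minimal sketch of the per-line regex step, assuming the XML was produced beforehand with pdftohtml -xml and that each piece of text sits on its own line in a text element with top and left attributes, in that order. The helper names and the "Label: value" layout are hypothetical.

```python
import html
import re

# One <text ...>content</text> element per line, with top before left.
# That attribute order is an assumption about the generated XML.
TEXT_LINE = re.compile(
    r'<text\b[^>]*?\btop="(\d+)"[^>]*?\bleft="(\d+)"[^>]*>(.*?)</text>'
)
INLINE_TAG = re.compile(r"</?[bi]>")  # bold/italic wrappers inside the text


def iter_text_items(xml_text):
    """Yield (top, left, content) for every <text> line in the XML."""
    for line in xml_text.splitlines():
        match = TEXT_LINE.search(line)
        if match:
            top, left, content = match.groups()
            yield int(top), int(left), html.unescape(INLINE_TAG.sub("", content))


def extract_labelled_values(xml_text, labels):
    """Map each label to the value following "Label:" on its text line."""
    values = {}
    for _, _, content in iter_text_items(xml_text):
        for label in labels:
            found = re.match(
                rf"{re.escape(label)}\s*:\s*(.+)", content, re.IGNORECASE
            )
            if found:
                values[label] = found.group(1).strip()
    return values
```

The top and left values are kept because, when a label and its value end up in separate text elements, matching them by position is usually the next step.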
