I am trying to extract plain text from PDFs using Apache Tika. I can use a Python binding, python-tika, but I am not sure it is an efficient approach, as some files can be larger than 25 MB.
What I need is to send only the extracted text to the server side, rather than the files themselves. The best scenario would be 'extract on the client using Tika and send that plain text to the server/Django'. How would I implement this?
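To make the idea concrete, here is a rough sketch of what I have in mind. It assumes the client is a machine that can run Python and Java (python-tika starts a local Tika server for the extraction). The endpoint URL, the JSON field names `filename` and `text`, and the helper names `extract_text`, `build_request` and `send_text` are all made up for illustration, not part of Tika or Django.

```python
import json
import urllib.request


def extract_text(pdf_path):
    """Extract plain text from a PDF using a locally started Tika (needs Java)."""
    # Imported lazily so the rest of the module loads without tika installed.
    from tika import parser  # pip install tika

    parsed = parser.from_file(pdf_path)
    # "content" can be None when Tika finds no text (e.g. scanned PDFs).
    return (parsed.get("content") or "").strip()


def build_request(url, filename, text):
    """Build a JSON POST request that carries only the extracted text."""
    # Hypothetical payload shape; the Django view would have to expect it.
    body = json.dumps({"filename": filename, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_text(url, pdf_path):
    """Extract on the client, then upload the text instead of the PDF."""
    request = build_request(url, pdf_path, extract_text(pdf_path))
    with urllib.request.urlopen(request) as response:
        return response.status
```

Usage would be something like `send_text("http://localhost:8000/documents/", "report.pdf")`, where the URL is a placeholder for whatever view receives the text. On the Django side the view would presumably read `json.loads(request.body)` and would need CSRF handling suited to a non-browser client.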
Thanks,
Ali