Problem processing text file after uploading


David M.

Jul 7, 2012, 11:11:46 AM
to rubyonra...@googlegroups.com
I've got a web-app currently partially working. The user uploads a .txt,
.docx or .doc file to the server.

Currently the model handles those files, saves some metadata (the
extension and orig filename), then saves the file to the hard drive. Next
it converts the doc and docx files to plain text and saves the output to
a txt file.

My problem is I want to copy the plain text contents of those txt files
to the :body field in my database, but by the time those files are
written no more changes can be sent to the database (because all the
file handling is done in after_save).

Where or how do I sanely get the contents of those TXT files into the
database?

See model attached:

Attachments:
http://www.ruby-forum.com/attachment/7574/doc_file.rb


--
Posted via http://www.ruby-forum.com/.

Walter Lee Davis

Jul 7, 2012, 11:36:09 AM
to rubyonra...@googlegroups.com

On Jul 7, 2012, at 11:11 AM, David M. wrote:

> I've got a web-app currently partially working. The user uploads a .txt,
> .docx or .doc file to the server.
>
> Currently the model handles those files, saves some metadata (the
> extension and orig filename), then saves the file to the hard drive. Next
> it converts the doc and docx files to plain text and saves the output to
> a txt file.
>
> My problem is I want to copy the plain text contents of those txt files
> to the :body field in my database, but by the time those files are
> written no more changes can be sent to the database (because all the
> file handling is done in after_save).
>
> Where or how do I sanely get the contents of those TXT files into the
> database?

I built this feature in my first commercial Rails app. I used Paperclip for my file storage, which offers its own callback called 'after_post_process' that worked out perfectly for me.

First, I created a Paperclip processor to extract the text version of the uploaded file (mine were all PDF).

# /lib/paperclip_processors/text.rb

module Paperclip
  # Handles extracting plain text from PDF file attachments
  class Text < Processor

    attr_accessor :whiny

    # Creates a Text extract from PDF
    def make
      src = @file
      dst = Tempfile.new([@basename, 'txt'].compact.join("."))
      command = <<-end_command
        "#{ File.expand_path(src.path) }"
        "#{ File.expand_path(dst.path) }"
      end_command

      begin
        success = Paperclip.run("/usr/bin/pdftotext -nopgbrk", command.gsub(/\s+/, " "))
        Rails.logger.info "Processing #{src.path} to #{dst.path} in the text processor."
      rescue PaperclipCommandLineError
        raise PaperclipError, "There was an error processing the text for #{@basename}" if @whiny
      end
      dst
    end
  end
end

Then in my document.rb (model for the file attachment), I added the following bits:

  has_attached_file :pdf, :styles => { :text => { :fake => 'variable' } }, :processors => [:text]

  after_post_process :extract_text

  private

  # Reads the text file produced by the processor into the plain_text attribute
  def extract_text
    plain_text = ""
    # Block form of File.open closes the file handle when the block ends
    File.open(pdf.queued_for_write[:text].path, "r") do |file|
      while (line = file.gets)
        plain_text << Iconv.conv('ASCII//IGNORE', 'UTF8', line)
      end
    end
    self.plain_text = plain_text
  end

And that was that.
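
A note for anyone trying this on a newer Ruby: Iconv was deprecated in Ruby 1.9.3 and removed from the standard library in 2.0. The same ASCII-only cleanup can probably be done with String#encode. The sketch below is plain Ruby and not part of Paperclip; the helper names `to_ascii` and `ascii_contents` are made up for illustration, and it assumes the extracted text is (mostly) valid UTF-8.

```ruby
# Hypothetical helper (not part of Paperclip): drop every character that
# cannot be represented in ASCII, roughly what 'ASCII//IGNORE' did in Iconv.
def to_ascii(text)
  text.encode('ASCII', 'UTF-8', invalid: :replace, undef: :replace, replace: '')
end

# Read a text file line by line and return its ASCII-only contents.
# The block form of File.open closes the file automatically.
def ascii_contents(path)
  File.open(path, 'r:UTF-8') do |file|
    file.each_line.map { |line| to_ascii(line) }.join
  end
end
```

In the extract_text method above, the loop body would then become something like `plain_text << to_ascii(line)`.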

Walter


David M.

Jul 7, 2012, 11:44:12 AM
to rubyonra...@googlegroups.com
But...paperclip is OLD and unmaintained, and this is also a learning
project.

So is there some (best practices) way to do the following things without
having to make another pass over my doc_file or using paperclip:

1. upload .doc and store metadata
2. convert to plain text and write .txt to hard drive
3. grab contents of .txt file and store in database
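
One way to read those three steps is as a single method that hands the text back, so the caller can assign it to the record before saving. A minimal sketch in plain Ruby, with no Rails and no Paperclip: `ingest_upload`, the `converter` argument and the layout of the returned hash are all hypothetical names chosen for illustration, and the real .doc/.docx conversion is left to whatever tool the app already uses.

```ruby
require 'fileutils'

# Hypothetical sketch of the three steps. `converter` is any callable that
# takes the stored file's path and returns plain text; the actual .doc/.docx
# conversion is outside the scope of this example.
def ingest_upload(original_filename, data, storage_dir, converter)
  FileUtils.mkdir_p(storage_dir)

  # 1. store the upload and collect the metadata
  extension   = File.extname(original_filename).delete('.').downcase
  stored_path = File.join(storage_dir, File.basename(original_filename))
  File.binwrite(stored_path, data)

  # 2. convert to plain text and write the .txt file next to the upload
  text     = extension == 'txt' ? File.read(stored_path) : converter.call(stored_path)
  txt_path = File.join(storage_dir, File.basename(stored_path, '.*') + '.txt')
  File.write(txt_path, text) unless txt_path == stored_path

  # 3. hand the text back so the caller can put it in the :body column
  { original_filename: original_filename, extension: extension,
    txt_path: txt_path, body: text }
end
```

Because the text comes back as a return value, nothing has to be re-read from disk after the record is saved.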

Hassan Schroeder

Jul 7, 2012, 12:06:49 PM
to rubyonra...@googlegroups.com
On Sat, Jul 7, 2012 at 8:11 AM, David M. <li...@ruby-forum.com> wrote:

> Currently the model handles those files, saves some metadata (the
> extension and orig filename), then saves the file to the hard drive. Next
> it converts the doc and docx files to plain text and saves the output to
> a txt file.
>
> My problem is I want to copy the plain text contents of those txt files
> to the :body field in my database, but by the time those files are
> written no more changes can be sent to the database (because all the
> file handling is done in after_save).

Wouldn't the obvious answer be to do the file handling in before_save?

And is there a reason to write the text to a file in the first place if you're
just going to save it in the DB?
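
The reason before_save helps: attributes assigned in before_save are part of the row that gets written, while attributes assigned in after_save only change the in-memory object unless the record is saved again. A toy illustration in plain Ruby follows. `FakeRecord` is a made-up stand-in that only imitates the order of the callbacks; it is not ActiveRecord.

```ruby
# Toy stand-in for a model (NOT ActiveRecord): `save` runs the before hook,
# copies the attributes into a fake table, then runs the after hook.
class FakeRecord
  TABLE = []  # pretend database table

  attr_accessor :body

  def initialize(hook, text)
    @hook = hook  # :before_save or :after_save
    @text = text
  end

  def save
    load_body if @hook == :before_save
    TABLE << { body: body }  # the "INSERT" happens here
    load_body if @hook == :after_save
    TABLE.last               # return the row as it was written
  end

  private

  # Stands in for "convert the upload and read the .txt file".
  def load_body
    self.body = @text
  end
end
```

With `FakeRecord.new(:before_save, 'hello').save` the stored row contains the text; with `:after_save` the row is written with a nil body and the text exists only on the object.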

--
Hassan Schroeder ------------------------ hassan.s...@gmail.com
http://about.me/hassanschroeder
twitter: @hassan

David M.

Jul 7, 2012, 12:21:30 PM
to rubyonra...@googlegroups.com
Hassan Schroeder wrote in post #1067807:
The file handling code I have doesn't seem to function unless it happens
after_save, I'm not sure why that is.

The idea about saving the txt files to disk is so that the client can
download them via ftp.

Hassan Schroeder

Jul 7, 2012, 12:58:54 PM
to rubyonra...@googlegroups.com
On Sat, Jul 7, 2012 at 9:21 AM, David M. <li...@ruby-forum.com> wrote:

> The file handling code I have doesn't seem to function unless it happens
> after_save, I'm not sure why that is.

Well, since it's a "learning project" maybe that would be a good place
to start :-)

Alternatively, you might consider pushing the doc-to-text conversion
into a background job, which adds the text to the db record once it's
finished. Or use an Observer to add the text after after_save.
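
The background-job idea might look roughly like the following. This is a toy sketch in plain Ruby with no job library: `RECORDS`, `JOBS`, `convert_to_text`, `create_record` and `start_worker` are all hypothetical names, and a hash stands in for the database.

```ruby
# Toy sketch: the request saves the record and enqueues its id; a worker
# converts the file later and fills in the body with a second update.
RECORDS = {}         # pretend database: id => attributes
JOBS    = Queue.new  # pretend job queue

# Hypothetical conversion step; a real app would call its doc-to-text tool.
def convert_to_text(path)
  "text extracted from #{path}"
end

def create_record(id, path)
  RECORDS[id] = { path: path, body: nil }  # saved without a body
  JOBS << id                               # conversion happens later
end

def start_worker
  Thread.new do
    while (id = JOBS.pop)                  # a nil entry stops the worker
      record = RECORDS[id]
      record[:body] = convert_to_text(record[:path])
    end
  end
end
```

The trade-off is that the record exists for a short while with an empty body, so anything that displays it has to cope with that state.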

Multiple possibilities...

Colin Law

Jul 7, 2012, 1:00:25 PM
to rubyonra...@googlegroups.com
With files it is often better just to store them in files and not in
the database. Certainly they should not be stored in both file and
database.

Colin

David M.

Jul 7, 2012, 1:02:41 PM
to rubyonra...@googlegroups.com
Hassan Schroeder wrote in post #1067812:
> On Sat, Jul 7, 2012 at 9:21 AM, David M. <li...@ruby-forum.com> wrote:
>
>> The file handling code I have doesn't seem to function unless it happens
>> after_save, I'm not sure why that is.
>
> Well, since it's a "learning project" maybe that would be a good place
> to start :-)

Any hints?

Hassan Schroeder

Jul 7, 2012, 1:09:30 PM
to rubyonra...@googlegroups.com
On Sat, Jul 7, 2012 at 10:02 AM, David M. <li...@ruby-forum.com> wrote:

>>> The file handling code I have doesn't seem to function unless it happens
>>> after_save, I'm not sure why that is.
>>
>> Well, since it's a "learning project" maybe that would be a good place
>> to start :-)
>
> Any hints?

Start by defining exactly what "doesn't seem to function" means :-)

Colin Law

Jul 7, 2012, 1:09:24 PM
to rubyonra...@googlegroups.com
On 7 July 2012 18:02, David M. <li...@ruby-forum.com> wrote:
> Hassan Schroeder wrote in post #1067812:
>> On Sat, Jul 7, 2012 at 9:21 AM, David M. <li...@ruby-forum.com> wrote:
>>
>>> The file handling code I have doesn't seem to function unless it happens
>>> after_save, I'm not sure why that is.
>>
>> Well, since it's a "learning project" maybe that would be a good place
>> to start :-)
>
> Any hints?

Have a look at the Rails Guide on debugging for techniques that can be
used to debug your code. If you still can't work out what is going on
then come back with the details of the section of code that is failing
to do what you expect.

Colin

David M.

Jul 7, 2012, 1:12:34 PM
to rubyonra...@googlegroups.com
Hassan Schroeder wrote in post #1067817:
When outside of after_save, a database entry gets created, but file_data
doesn't get saved to the hard drive.

Colin Law

Jul 7, 2012, 1:24:04 PM
to rubyonra...@googlegroups.com
You need to do some debugging to see what is going on. Is the save
failing or is it not getting to the save statement for some reason?
Having worked out which of those is happening then do more debugging
to find out why.

Colin

Hassan Schroeder

Jul 7, 2012, 1:56:48 PM
to rubyonra...@googlegroups.com
On Sat, Jul 7, 2012 at 10:12 AM, David M. <li...@ruby-forum.com> wrote:

> When outside of after_save, a database entry gets created, but file_data
> doesn't get saved to the hard drive.

OK, why not?

As Colin suggested, study the debugging guide (or just put logging
statements in the code to see what's happening at each step).
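
As a rough idea of what "logging at each step" could look like, here is a sketch in plain Ruby. `DocFileSketch` is a hypothetical class, not the attached model, and the logger is passed in so the example runs outside Rails; inside a Rails model the `logger` method (or `Rails.logger`) would take its place.

```ruby
require 'logger'

# Hypothetical model-like class; only the logging pattern matters here.
class DocFileSketch
  def initialize(logger)
    @logger = logger
  end

  # Writes file_data to path, logging on entry, on each branch and on
  # failure, so the log shows exactly how far the method got.
  def store_docfile(file_data, path)
    @logger.info "store_docfile: entered, path=#{path.inspect}"
    if file_data.nil?
      @logger.warn 'store_docfile: file_data is nil, nothing to write'
      return false
    end
    File.binwrite(path, file_data)
    @logger.info "store_docfile: wrote #{File.size(path)} bytes"
    true
  rescue SystemCallError => e
    @logger.error "store_docfile: write failed: #{e.class}: #{e.message}"
    false
  end
end
```

If the "entered" line never shows up in the log, the method is not being called at all, which narrows the search to the place where it is supposed to be invoked.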

David M.

Jul 7, 2012, 4:24:30 PM
to rubyonra...@googlegroups.com
I know you guys seem to be sticking to the RTFM hardline, but it seems
as though debugging in the model has very few options without importing
a bunch of gems.

Even on the page recommended there are 35 mentions of controller, and
only 4 mentions of model.

I installed debugger 'gem install debugger', but it doesn't integrate at
all with webrick ('rails s') and there apparently is no ruby-debug for
1.9.3 (ughh..)

I've put a bunch of logger.info in my model, but I now know no more than
I did before.

When store_docfile is called before after_save, it never even gets to
the first line containing the logger.info "we are now in store_docfile"
message.

I have a feeling this might be something deeper than a tiny typo *shrug*

If one of you could PLEASE just look at my model and help me figure out
what's up, it would be appreciated.

Hassan Schroeder

Jul 7, 2012, 5:18:51 PM
to rubyonra...@googlegroups.com
On Sat, Jul 7, 2012 at 1:24 PM, David M. <li...@ruby-forum.com> wrote:

> When store_docfile is called before after_save, it never even gets to
> the first line containing the logger.info "we are now in store_docfile"
> message.

I don't see any obvious problems in your original file.

If not with after_save, how are you calling store_docfile now? You
might want to post your new code for the model (and controller).

David M.

Jul 7, 2012, 6:55:21 PM
to rubyonra...@googlegroups.com
Hassan Schroeder wrote in post #1067836:
The controller is a typical unmodified scaffolded CRUD/REST.

The (non functional) model is attached.

Attachments:
http://www.ruby-forum.com/attachment/7575/doc_file.rb

Hassan Schroeder

Jul 7, 2012, 8:45:40 PM
to rubyonra...@googlegroups.com
On Sat, Jul 7, 2012 at 3:55 PM, David M. <li...@ruby-forum.com> wrote:

>>> When store_docfile is called before after_save, it never even gets to
>>> the first line containing the logger.info "we are now in store_docfile"
>>> message.

In your new example file, it's no surprise you're not seeing anything --
you're never calling `store_docfile` at all. (No, that random standalone
`:store_docfile` doesn't do what you're hoping it does.)

Either invoke it from a before_save, or make it a non-private method
(at least temporarily) and invoke it explicitly from your controller and
see what happens.
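
To make the difference concrete, here is a toy example in plain Ruby. `TinyModel` and its `before_save` macro are made-up stand-ins that only imitate how a callback gets registered; this is not ActiveRecord and not the attached model.

```ruby
# Toy class macro (NOT ActiveRecord) that imitates callback registration.
class TinyModel
  def self.before_save(method_name)
    (@before_save_hooks ||= []) << method_name
  end

  def self.before_save_hooks
    @before_save_hooks || []
  end

  def save
    # send is used because the hooks are private methods
    self.class.before_save_hooks.each { |name| send(name) }
    true
  end
end

class Unregistered < TinyModel
  attr_reader :stored

  :store_docfile  # a bare symbol: evaluated and thrown away, nothing is called

  private

  def store_docfile
    @stored = true
  end
end

class Registered < TinyModel
  attr_reader :stored

  before_save :store_docfile  # a method call that registers the callback

  private

  def store_docfile
    @stored = true
  end
end
```

After `save`, `stored` is still nil on an `Unregistered` instance, while on a `Registered` instance it is true.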

Colin Law

Jul 8, 2012, 3:47:15 AM
to rubyonra...@googlegroups.com
On 7 July 2012 21:24, David M. <li...@ruby-forum.com> wrote:
> I know you guys seem to be sticking to the RTFM hardline, but it seems
> as though debugging in the model has very few options without importing
> a bunch of gems.
>
> Even on the page recommended there are 35 mentions of controller, and
> only 4 mentions of model.
>
> I installed debugger 'gem install debugger', but it doesn't integrate at
> all with webrick ('rails s') and there apparently is no ruby-debug for
> 1.9.3 (ughh..)
>
> I've put a bunch of logger.info in my model, but I now know no more than
> I did before.
>
> When store_docfile is called before after_save, it never even gets to
> the first line containing the logger.info "we are now in store_docfile"
> message.

That is the clue then, but you are misinterpreting what you are
seeing. If it is not getting to the first line then it is not in fact
calling the method at all. Check out how you are calling it.

Colin

Matt Jones

Jul 8, 2012, 9:20:31 AM
to rubyonra...@googlegroups.com


On Saturday, 7 July 2012 11:44:12 UTC-4, Ruby-Forum.com User wrote:
> But...paperclip is OLD and unmaintained, and this is also a learning
> project.


Perhaps you could start by "learning" how to decide whether a gem is unmaintained. For instance:


doesn't exactly look like "no activity" to me...

--Matt Jones
 