The creation of the tempfile happens at utils.rb line 486:
if content_type || filename
>> body = Tempfile.new("RackMultipart")
body.binmode if body.respond_to?(:binmode)
end
The body is then written to (with the content of the request body),
rewound, put into the "data" structure, passed to normalize_params
(which inserts the "data" into the params hash), and that's it. The
params are then presumably used for the request, and nobody ever
explicitly cleans up the tempfile.
Am I looking in the wrong place?
If I'm correct, this is a bug. Tempfiles should not be relied upon to
clean themselves up in response to GC, since you don't know when GC
will fire...
- Charlie
class Tempfile < ::Tempfile
def _close
@tmpfile.close if @tmpfile
@data[1] = nil if @data
@tmpfile = nil
end
end
the original _close (in 1.8.7) does this, which should be functionally
equivalent:
def _close # :nodoc:
@tmpfile.close if @tmpfile
@tmpfile = nil
@data[1] = nil if @data
end
Is this patching a bug from 1.8.6?
FYI, in JRuby there is no _close, since it's a private impl method and
we have our own tempfile implementation.
- Charlie
Actually, I was wrong; we do have _close, but it behaves somewhat
differently than MRI.
Still seems like it's an internal method that shouldn't be monkeyed with...
- Charlie
Why not? Tempfiles are objects, too. It's perfectly reasonable
to let GC clean them up like any other object.
I don't know why RewindableInput is monkeying with Tempfile internals,
though.
--
Eric Wong
Yes. The following code doesn't work on 1.8.6:
t = Tempfile.new('foo')
t.unlink
t.close # <--- crash, because of the early unlink
Phusion Passenger terminates worker processes by calling exit!. This
does not call finalizers on MRI, causing temp files to be left behind
sometimes. There seems to be no way to call finalizers, even GC.start
doesn't work. I too would like to see Tempfiles being cleaned up
explicitly.
Other than that, I've seen system administrators who are confused by
the fact that such Tempfiles are not immediately cleaned up. Some
people who operate websites that handle a large number of concurrent
uploads are worried that they might run out of disk space because of
this.
Thanks, that at least explains one mystery.
- Charlie
It's roundly considered by every programming language community in the
world to be *really* bad form to let GC clean up IO. There's tons of
reasons for this:
* GC may not run as soon as you like, causing you to hold IO resources
too long (and under heavy load, potentially use up all descriptors for
a process)
* On some systems, old objects get promoted to heaps that are only
rarely collected, resulting in them potentially never GCing (or only
GCing extremely rarely)
* There's no reason you *shouldn't* be able to close resources when
they're no longer used. In this case, close any files opened for a
give request once the request has been processed (async libraries
using those files should expect them to be closed and deal with that
accordingly).
Bug-finding tools for several languages flag this sort of behavior as
a high-priority bug. Languages like Go, C#, and Java have or are
adding features to help guarantee you never do this. It's pretty much
universally frowned upon.
When I tweeted about Rubyists leaving IO objects to be cleaned up or
closed by GC, *everyone* agreed that it's a bug in the code, and that
nobody should ever rely on GC to release File/IO resources.
What more can I say? :)
- Charlie
We did run into some difficulties proving that MRI cleans up old files
when it GCs, mostly because of quirks in the GC (constructing the
tempfile in a method body kept it alive even when it wasn't
referenced, but constructing it in a block invocation allowed it to
GC...MRI's GC baffles me sometimes). But we did confirm it...it's just
unreliable in certain cases (and obviously if you exit! it's totally
useless, since the process just disappears and leaves files behind).
All IO-related objects should be explicitly cleaned up, without
exception...including the unlink of a tempfile once it's no longer
needed.
> Other than that, I've seen system administrators who are confused by
> the fact that such Tempfiles are not immediately cleaned up. Some
> people who operate websites that handle a large number of concurrent
> uploads are worried that they might run out of disk space because of
> this.
This is what we ran into. Because of a bug in JRuby 1.4, our new
Tempfile impl was only unlinking them on exit (or when explicitly told
to do so, of course). A user eventually filled their temp space since
various Ruby libraries did not explicitly close! Tempfile resources,
including Rack (attachment_fu was the other one...have not looked into
that yet).
Yes, JRuby was incorrect to not unlink tempfiles on GC, but Rack (and
any other Ruby library dong the same) is broken if it does not clean
up tempfiles once they're no longer needed.
I'm certainly willing to help fix Rack, but I'm not sure where would
be the best place to close! the tempfile.
- Charlie
Always? Why?
> This does not call finalizers on MRI, causing temp files to be left behind
> sometimes. There seems to be no way to call finalizers, even GC.start
> doesn't work. I too would like to see Tempfiles being cleaned up
> explicitly.
>
> Other than that, I've seen system administrators who are confused by
> the fact that such Tempfiles are not immediately cleaned up. Some
> people who operate websites that handle a large number of concurrent
> uploads are worried that they might run out of disk space because of
> this.
Doesn't Passenger have the luxury of only supporting sane platforms
that let you unlink Tempfiles and continue to use them?
--
Eric Wong
You're forgetting there are programming language communities that
consider GC to be bad form in the first place :)
Given that Ruby does have a GC, I'd prefer to be as lazy as possible and
let the runtime deal with it.
> * GC may not run as soon as you like, causing you to hold IO resources
> too long (and under heavy load, potentially use up all descriptors for
> a process)
MRI considers EMFILE/ENFILE to be analogous to ENOMEM, and would trigger
GC in that case.
> * On some systems, old objects get promoted to heaps that are only
> rarely collected, resulting in them potentially never GCing (or only
> GCing extremely rarely)
Those GCs could be made smarter about dealing with IO objects.
> * There's no reason you *shouldn't* be able to close resources when
> they're no longer used. In this case, close any files opened for a
> give request once the request has been processed (async libraries
> using those files should expect them to be closed and deal with that
> accordingly).
Of course being able to close IO objects explicitly is good. In some
cases of socket/pipe IPC, it's the _only_ way to signal end-of-input.
HTTP/1.0 and HTTP/0.9 don't work otherwise.
> Bug-finding tools for several languages flag this sort of behavior as
> a high-priority bug. Languages like Go, C#, and Java have or are
> adding features to help guarantee you never do this. It's pretty much
> universally frowned upon.
As I understand it, C# and Java have to deal with operating/file
systems that don't work with unlinked files. I guess JRuby is limited
to only being able to use things that work on those platforms, and can't
work with unlinked temporary files reliably, my condolences.
I haven't had a chance to look Go...
> When I tweeted about Rubyists leaving IO objects to be cleaned up or
> closed by GC, *everyone* agreed that it's a bug in the code, and that
> nobody should ever rely on GC to release File/IO resources.
>
> What more can I say? :)
*everyone* meaning your followers on Twitter? I suspect your followers
on Twitter are more inclined to agree with you :)
--
Eric Wong
I suspect it needs to be done a per-handler basis or inside the server.
rack/handler/fastcgi.rb and rack/handler/lsws.rb both create
RewindableInput objects and close them in the ensure block.
--
Eric Wong
I guess this deals with wonky multipart uploads that browsers still
generate these days[1]. Ugh, yeah, it's nasty...
The comment below in _call is very important.
1. Ensure all tempfiles created by Rack go into an array in env,
probably env["rack.tempfiles"]:
tempfile = Tempfile.new("foo")
(env["rack.tempfiles"] ||= []) << tempfile
2. Have a middleware wrap everything, including the response body:
class TempfileReaper < Struct.new(:app, :env, :body)
# --------------- in config.ru -----------
# use TempfileReaper
# run MyApp.new
def initialize(app)
super(app)
end
# we wrap the entire response body later on, so dup it for
# reentrancy. You can actually avoid the enforced dup as
# an optimization, probably...
def call(env)
dup._call(env)
end
# used when wrapping the response body
def each(&block)
body.each { |chunk| yield chunk }
end
# the Rack server should call this (when we're the body)
def close
tempfiles = env["rack.tempfiles"]
if tempfiles
tempfiles.each { |tmp| tmp.close! rescue nil }
end
end
# wrap the normal application call, saving env
def _call(env)
self.env = env
# XXX VERY IMPORTANT:
# you need to ensure env stays the same throughout the request,
# some middlewares overwrite/replace it instead of merge!-ing into it
status, headers, body = app.call(env)
self.body = body
[ status, headers, self ]
end
end
[1] - in my world of singing and dancing penguins, my clients use
"curl -T" to send proper PUT requests :)
--
Eric Wong
> On Mar 6, 8:55 am, Eric Wong <normalper...@yhbt.net> wrote:
>> Charles Oliver Nutter <head...@headius.com> wrote:
>>
>>> If I'm correct, this is a bug. Tempfiles should not be relied upon to
>>> clean themselves up in response to GC, since you don't know when GC
>>> will fire...
>>
>> Why not? Tempfiles are objects, too. It's perfectly reasonable
>> to let GC clean them up like any other object.
>
> Phusion Passenger terminates worker processes by calling exit!. This
> does not call finalizers on MRI, causing temp files to be left behind
> sometimes.
So use #exit. Speed isn't everything.
> There seems to be no way to call finalizers, even GC.start
> doesn't work. I too would like to see Tempfiles being cleaned up
> explicitly.
GC.start makes no promises of determinism, so expecting it to have any effect is wrong thinking, expecting calls over a longer time period to eventually do something, sure, but immediate, no.
The cleanup will have to be done by the handler stack or a wrapping middleware as Eric has pointed out.
> Other than that, I've seen system administrators who are confused by
> the fact that such Tempfiles are not immediately cleaned up. Some
> people who operate websites that handle a large number of concurrent
> uploads are worried that they might run out of disk space because of
> this.
There's no accounting for sysadmins who can't use lsof.
I agree for debugging reasons. Finding some IO from 3 requests back really raises questions I'd rather not have to raise.
To simply our code. Phusion Passenger forks-without-exec worker
processes from a parent. It does this on multiple levels. If I don't
call exit! then a SystemExit exception will propagate through the call
stack and I'd have to modify all the 'ensure' blocks in the parent to
take into account they're now being executed in a child process. It's
definitely possible to get rid of the exit! but it makes our code more
complicated.
> Doesn't Passenger have the luxury of only supporting sane platforms
> that let you unlink Tempfiles and continue to use them?
Yes but the Tempfile is created by Rack and on the Phusion Passenger
level we have no control over those tempfiles. Even if Rack
immediately unlinks those tempfiles, the actual disk space isn't
released until the file handles are closed. I believe Rack currently
does not even call #close on the Tempfiles.
James Tucker:
> So use #exit. Speed isn't everything.
It's not speed.
> There's no accounting for sysadmins who can't use lsof.
I disagree. Software is made to be used by users, whose skills might
vary significantly. Skilled sysadmins know how to use lsof and know
how to take care of problems, but not everybody is or can be a skilled
sysadmin. Server resources also vary greatly, and some people don't
and/or cannot have have a lot of disk space. I think software should
do the right thing even in the face of people who don't know how to
use lsof.
Charles's middleware looks simple enough. I approve.
Sorry. I mean Eric's. :)
Another thing that you might want to keep in mind is that users might
File.rename() a certain Tempfile, in which case you will not want to
unlink it. For example suppose one uploads a large file to the web
server, I think there are web apps out there that will simply rename()
the tempfile to the storage directory and therefore avoid an expensive
copy operation.
And this may be the best reason for cleaning up non-memory resources
explicitly (like files and file descriptors) rather than expecting GC
to do it for you: that's how the rest of the world does it. It's
unexpected (and usually frowned upon) for a program to leave dangling
non-memory resources all over the place, even if they'll eventually
(hopefully? maybe?) get cleaned up when GC fires (if it does before
something else happens, like a hard exit).
I sympathize with the desire to be "lazy" and let GC do it all, but
that's simply not how it's done. Non-memory resources should always be
explicitly cleaned up.
- Charlie
> Another thing that you might want to keep in mind is that users might
> File.rename() a certain Tempfile, in which case you will not want to
> unlink it. For example suppose one uploads a large file to the web
> server, I think there are web apps out there that will simply rename()
> the tempfile to the storage directory and therefore avoid an expensive
> copy operation.
Small nit here: if tempfile's close is unlinking the file, it doesn't matter if
something else has linked to it. It's the filesystem's responsibility, after
all, to garbage collect unlinked files.
-Randy Fischer
> On Mar 8, 12:53 am, Eric Wong <normalper...@yhbt.net> wrote:
>> Always? Why?
>
> To simply our code. Phusion Passenger forks-without-exec worker
> processes from a parent. It does this on multiple levels. If I don't
> call exit! then a SystemExit exception will propagate through the call
> stack and I'd have to modify all the 'ensure' blocks in the parent to
> take into account they're now being executed in a child process. It's
> definitely possible to get rid of the exit! but it makes our code more
> complicated.
I can see how that would be painful, totally.
>> Doesn't Passenger have the luxury of only supporting sane platforms
>> that let you unlink Tempfiles and continue to use them?
>
> Yes but the Tempfile is created by Rack and on the Phusion Passenger
> level we have no control over those tempfiles. Even if Rack
> immediately unlinks those tempfiles, the actual disk space isn't
> released until the file handles are closed. I believe Rack currently
> does not even call #close on the Tempfiles.
Correct.
Some servers do, for example, in thin: http://github.com/macournoyer/thin/blob/master/lib/thin/request.rb#L143
> I think software should
> do the right thing even in the face of people who don't know how to
> use lsof.
Just to be clear, I don't disagree at all that this needs to be fixed. I think handlers/servers should close any tempfiles they create, and the requirement for middleware should be the same.
I don't think it should be specified behavior that a tempfile *ever*
exists on disk. The mechanism of temporary storage for a large
incoming post should be a black box. Consider systems that won't
actually have a writable filesystem, like GAE; there, the "tempfile"
data would be stored in memory or in BigTable.
Rack should not require that large multipart posts or anything else
require a filesystem, nor guarantee that they'll be on any filesystem
that exists.
- Charlie
It's for this reason that I'd prefer to see this handled correctly by servers +/ handlers in general.
Middleware may use what it likes, but it takes responsibility for cleanup of whatever it creates.
>
> - Charlie
Glad we're on the same page. :) Phusion Passenger does clean up its
own tempfiles. As for the link you posted, I think that's for cleaning
up the rack.input buffer file, not multipart files.
Charles Nutter:
> I don't think it should be specified behavior that a tempfile *ever*
> exists on disk. The mechanism of temporary storage for a large
> incoming post should be a black box. Consider systems that won't
> actually have a writable filesystem, like GAE; there, the "tempfile"
> data would be stored in memory or in BigTable.
Agreed. However I think it should be *possible* to access the on-disk
file directly as an optimization for apps that need it, they just
musn't rely on it to be available. For example if your app deals with
2 GB file uploads then you really want to avoid that extra copy when
storing the file upload in a storage location on the same filesystem,
whenever possible.
Yeah, that's reasonable, so long as it's not specified that it will
always be available on disk. On systems where filesystem access *is*
allowed, we of course do basically the same thing as MRI+Rack.
- Charlie
Btw, if you guys like the middleware please don't hesitate to
make a proper patch + spec and get it into Rack.
I don't really care for it (not enough to even be credited :).
--
Eric Wong
I'm with you so far...
> 2. Have a middleware wrap everything, including the response body:
I'm a bit new to the middleware thing. Does this mean everyone would
have to configure middleware on their setups to get this
tempfile-closing behavior? And your comment below...does that mean
there are situations where this wouldn't work?
Closing and removing tempfiles when they're no longer needed should be
the default behavior, not something you have to configure. It's a bug
to not close and remove them (or at least, a bug to keep creating new
ones and expecting GC to clean everything up eventually).
> class TempfileReaper < Struct.new(:app, :env, :body)
>
> # --------------- in config.ru -----------
> # use TempfileReaper
> # run MyApp.new
> def initialize(app)
> super(app)
> end
>
> # we wrap the entire response body later on, so dup it for
> # reentrancy. You can actually avoid the enforced dup as
> # an optimization, probably...
> def call(env)
> dup._call(env)
> end
>
> # used when wrapping the response body
> def each(&block)
> body.each { |chunk| yield chunk }
> end
>
> # the Rack server should call this (when we're the body)
> def close
> tempfiles = env["rack.tempfiles"]
> if tempfiles
> tempfiles.each { |tmp| tmp.close! rescue nil }
> end
> end
By "the Rack server should call this" do you mean that if the server
doesn't call this, tempfiles Rack creates will not be cleaned up until
GC runs?
Shouldn't Rack itself be guaranteeing it doesn't leave garbage files around?
> # wrap the normal application call, saving env
> def _call(env)
> self.env = env
>
> # XXX VERY IMPORTANT:
> # you need to ensure env stays the same throughout the request,
> # some middlewares overwrite/replace it instead of merge!-ing into it
> status, headers, body = app.call(env)
> self.body = body
> [ status, headers, self ]
> end
> end
Same question as above...can badly-written middleware now cause
tempfiles to linger?
- Charlie
I would expect major frameworks to stick this into the default
middleware stack as a convenience, but some users want a more bare-bones
Rack stack can still opt out. I see it as needless overhead as the vast
majority of HTTP requests do not create tempfiles.
I don't make decisions for Rack, though.
<snip>
> > � � �# the Rack server should call this (when we're the body)
> > � � �def close
> > � � � �tempfiles = env["rack.tempfiles"]
> > � � � �if tempfiles
> > � � � � �tempfiles.each { |tmp| tmp.close! rescue nil }
> > � � � �end
> > � � �end
>
> By "the Rack server should call this" do you mean that if the server
> doesn't call this, tempfiles Rack creates will not be cleaned up until
> GC runs?
Yes, a Rack-compliant server should call body.close if it responds to
body.close at the end of the response cycle for every individual
request.
I put this in the wrapped body because body.each {} could be relying on
the tempfiles to generate the response. body.close is the absolute last
action for any Rack application.
> Shouldn't Rack itself be guaranteeing it doesn't leave garbage files around?
>
> > � � �# wrap the normal application call, saving env
> > � � �def _call(env)
> > � � � �self.env = env
> >
> > � � � �# XXX VERY IMPORTANT:
> > � � � �# you need to ensure env stays the same throughout the request,
> > � � � �# some middlewares overwrite/replace it instead of merge!-ing into it
> > � � � �status, headers, body = app.call(env)
> > � � � �self.body = body
> > � � � �[ status, headers, self ]
> > � � �end
> > � �end
>
> Same question as above...can badly-written middleware now cause
> tempfiles to linger?
Yes, it might be safer to do this before app.call above:
� env["rack.tempfiles"] ||= []
That way if env gets replaced down the stack, e.g. via:
app.call(env.merge("foo.hello" => "world"))
# env.merge! would be correct above
any use of env["rack.tempfiles"] will still point to the same array.
Well, almost...
Then again some bad middleware could do:
� env["rack.tempfiles"] += [ tmp_a, tmp_b ]
Instead of what they _should_ do:
� env["rack.tempfiles"].concat([ tmp_a, tmp_b ])
So yes, discourage future middleware authors from replacing
the rack.tempfiles array.
--
Eric Wong
Charles Oliver Nutter <hea...@headius.com> wrote:
> On Sun, Mar 7, 2010 at 7:12 PM, Eric Wong <normal...@yhbt.net> wrote:
> > I guess this deals with wonky multipart uploads that browsers still
> > generate these days[1]. �Ugh, yeah, it's nasty...
> > The comment below in _call is very important.
> >
> >
> > 1. �Ensure all tempfiles created by Rack go into an array in env,
> > � �probably env["rack.tempfiles"]:
> >
> > � �tempfile = Tempfile.new("foo")
> > � �(env["rack.tempfiles"] ||= []) << tempfile
>
> I'm with you so far...
>
> > 2. �Have a middleware wrap everything, including the response body:
>
> I'm a bit new to the middleware thing. Does this mean everyone would
> have to configure middleware on their setups to get this
> tempfile-closing behavior? And your comment below...does that mean
> there are situations where this wouldn't work?
>
> Closing and removing tempfiles when they're no longer needed should be
> the default behavior, not something you have to configure. It's a bug
> to not close and remove them (or at least, a bug to keep creating new
> ones and expecting GC to clean everything up eventually).I would expect major frameworks to stick this into the default
middleware stack as a convenience, but some users want a more bare-bones
Rack stack can still opt out. I see it as needless overhead as the vast
majority of HTTP requests do not create tempfiles.I don't make decisions for Rack, though.
<snip>
> > � � �# the Rack server should call this (when we're the body)
> > � � �def close
> > � � � �tempfiles = env["rack.tempfiles"]
> > � � � �if tempfiles
> > � � � � �tempfiles.each { |tmp| tmp.close! rescue nil }
> > � � � �end
> > � � �end
>
> By "the Rack server should call this" do you mean that if the server
> doesn't call this, tempfiles Rack creates will not be cleaned up until
> GC runs?
Yes, a Rack-compliant server should call body.close if it responds to
body.close at the end of the response cycle for every individual
request.I put this in the wrapped body because body.each {} could be relying on
the tempfiles to generate the response. body.close is the absolute last
action for any Rack application.
> Shouldn't Rack itself be guaranteeing it doesn't leave garbage files around?
>
> > � � �# wrap the normal application call, saving env
> > � � �def _call(env)
> > � � � �self.env = env
> >
> > � � � �# XXX VERY IMPORTANT:
> > � � � �# you need to ensure env stays the same throughout the request,
> > � � � �# some middlewares overwrite/replace it instead of merge!-ing into it
> > � � � �status, headers, body = app.call(env)
> > � � � �self.body = body
> > � � � �[ status, headers, self ]
> > � � �end
> > � �end
>
> Same question as above...can badly-written middleware now cause
> tempfiles to linger?
Yes, it might be safer to do this before app.call above:
� env["rack.tempfiles"] ||= []
That way if env gets replaced down the stack, e.g. via:
app.call(env.merge("foo.hello" => "world"))
# env.merge! would be correct aboveany use of env["rack.tempfiles"] will still point to the same array.
Well, almost...
Then again some bad middleware could do:
� env["rack.tempfiles"] += [ tmp_a, tmp_b ]
Instead of what they _should_ do:
� env["rack.tempfiles"].concat([ tmp_a, tmp_b ])