high 500 error rate on file metadata operations

Jason Wright

unread,

Jan 21, 2011, 7:52:57 PM1/21/11

to puppet...@googlegroups.com

Howdy,

Lately our puppetmasters have been returning 500 errors at a
reasonably high rate. On any given day, 200-300 (out of many
thousands) of our clients are reporting that they've received a 500
during a run. So far, the only firm reports I've received have been
regarding our Ubuntu clients, for which we happen to have central
logging of puppet errors. I've done some analysis of the logs that I
have and found that the majority of the 500s (~83%) received by our
Ubuntu clients are on file metadata operations:

//shell/File[/etc/csh.cshrc]: Failed to retrieve current state of
resource: Error 500 on SERVER: <!DOCTYPE HTML PUBLIC ""-//IETF//DTD
HTML 2.0//EN"">
<html><head>
<title>500 Internal Server Error</title>

[etc...]

The selection of files appears to be random and generally retrying the
puppet run succeeds.

Our puppetmasters are behind load balanced VIPs and we're running
0.25.5 on Ubuntu Lucid with the stock Lucid Apache2/Ruby/Passenger
stack. I've gone through the bug tracker and while I see some issues
that are close, they don't match our specific symptoms (most of the
bugs I saw showed the clients getting 400s, not 500s, or that affect
sites already on 2.6.x). I'd really like to say that this started
when we upgraded the puppetmasters to Lucid, but we weren't tracking
the apache error rates at the time of the upgrades, so all I have to
go on is the bug report, which was filed weeks after the Lucid
upgrades were complete.

To answer some questions I've seen while searching for previous
puppet-users threads about metadata fetch problems:

- We do use environments and every client is in a non-default
environment. That is to say that we do not have an environment named
'production' at all.

- We do use an external node classifier but I don't think that it's
issue #3910. For one thing, the clients are receiving error code 500,
not 400 as Nigel reported in #3910. Actually, I can get a 400 just by
giving '--environment production', as reported in #3910, and the
resulting stack trace is different from the ones I list below.

- We have a cronjob that runs on the puppetmasters every 10min to
update the manifests from version control which does a checkout to a
temp location and then rsyncs over to the live location. It is
possible that files are being updated during some of the 500 errors,
but our OS teams have well defined release schedules (releases every
2-4 weeks) and the 500s continue between releases.

I enabled debug logging and tracing on a few of our puppetmasters so I
could get the stack traces logged. I'm getting the stack traces in
the apache error log but they're arriving without timestamps and
without the actual exception detail message. Also, unfortunately,
we're getting a lot of exceptions right now due to an unrelated
problem with our Mac clients so the logs are very noisy and it's
difficult to tell what is what. Also, since there are no
timestamps[1], it's hard to correlate the stack traces with specific
logs from puppetmasterd. That said, here are some candidate stack
traces, which are hopefully complete:

Related to locking:
/usr/lib/ruby/1.8/puppet/external/lock.rb:36:in `flush'
/usr/lib/ruby/1.8/puppet/external/lock.rb:36:in `lock_exclusive'
/usr/lib/ruby/1.8/puppet/util/file_locking.rb:38:in `writelock'
/usr/lib/ruby/1.8/puppet/util/file_locking.rb:37:in `open'
/usr/lib/ruby/1.8/puppet/util/file_locking.rb:37:in `writelock'
/usr/lib/ruby/1.8/sync.rb:230:in `synchronize'
/usr/lib/ruby/1.8/puppet/util/file_locking.rb:36:in `writelock'
/usr/lib/ruby/1.8/puppet/indirector/yaml.rb:40:in `save'
/usr/lib/ruby/1.8/puppet/indirector/indirection.rb:262:in `save'
/usr/lib/ruby/1.8/puppet/node/facts.rb:15:in `save'
/usr/lib/ruby/1.8/puppet/indirector.rb:65:in `save'
/usr/lib/ruby/1.8/puppet/indirector/catalog/compiler.rb:26:in
`extract_facts_from_request'
/usr/lib/ruby/1.8/puppet/indirector/catalog/compiler.rb:31:in `find'

Two different paths to a "fileserver module not mounted" exception:
/usr/lib/ruby/1.8/puppet/network/handler/fileserver.rb:401:in `splitpath'
/usr/lib/ruby/1.8/puppet/network/handler/fileserver.rb:236:in `convert'
/usr/lib/ruby/1.8/puppet/network/handler/fileserver.rb:133:in `list'
/usr/lib/ruby/1.8/rubygems/custom_require.rb:31
/usr/lib/ruby/1.8/puppet/network/xmlrpc/processor.rb:52:in `call'
/usr/lib/ruby/1.8/puppet/network/xmlrpc/processor.rb:52:in `protect_service'
/usr/lib/ruby/1.8/puppet/network/xmlrpc/processor.rb:85
/usr/lib/ruby/1.8/xmlrpc/server.rb:338:in `call'
/usr/lib/ruby/1.8/xmlrpc/server.rb:338:in `dispatch'
/usr/lib/ruby/1.8/xmlrpc/server.rb:325:in `each'
/usr/lib/ruby/1.8/xmlrpc/server.rb:325:in `dispatch'
/usr/lib/ruby/1.8/xmlrpc/server.rb:368:in `call_method'
/usr/lib/ruby/1.8/xmlrpc/server.rb:380:in `handle'
/usr/lib/ruby/1.8/puppet/network/xmlrpc/processor.rb:44:in `process'
/usr/lib/ruby/1.8/puppet/network/http/rack/xmlrpc.rb:35:in `process'

/usr/lib/ruby/1.8/puppet/network/handler/fileserver.rb:401:in `splitpath'
/usr/lib/ruby/1.8/puppet/network/handler/fileserver.rb:236:in `convert'
/usr/lib/ruby/1.8/puppet/network/handler/fileserver.rb:68:in `describe'
/usr/lib/ruby/1.8/rubygems/custom_require.rb:31
/usr/lib/ruby/1.8/puppet/network/xmlrpc/processor.rb:52:in `call'
/usr/lib/ruby/1.8/puppet/network/xmlrpc/processor.rb:52:in `protect_service'
/usr/lib/ruby/1.8/puppet/network/xmlrpc/processor.rb:85
/usr/lib/ruby/1.8/xmlrpc/server.rb:338:in `call'
/usr/lib/ruby/1.8/xmlrpc/server.rb:338:in `dispatch'
/usr/lib/ruby/1.8/xmlrpc/server.rb:325:in `each'
/usr/lib/ruby/1.8/xmlrpc/server.rb:325:in `dispatch'
/usr/lib/ruby/1.8/xmlrpc/server.rb:368:in `call_method'
/usr/lib/ruby/1.8/xmlrpc/server.rb:380:in `handle'
/usr/lib/ruby/1.8/puppet/network/xmlrpc/processor.rb:44:in `process'
/usr/lib/ruby/1.8/puppet/network/http/rack/xmlrpc.rb:35:in `process'

And there are a few of these:
/usr/lib/ruby/1.8/puppet/network/xmlrpc/processor.rb:52:in `call'
/usr/lib/ruby/1.8/puppet/network/xmlrpc/processor.rb:52:in `protect_service'
/usr/lib/ruby/1.8/puppet/network/xmlrpc/processor.rb:85
/usr/lib/ruby/1.8/xmlrpc/server.rb:338:in `call'
/usr/lib/ruby/1.8/xmlrpc/server.rb:338:in `dispatch'
/usr/lib/ruby/1.8/xmlrpc/server.rb:325:in `each'
/usr/lib/ruby/1.8/xmlrpc/server.rb:325:in `dispatch'
/usr/lib/ruby/1.8/xmlrpc/server.rb:368:in `call_method'
/usr/lib/ruby/1.8/xmlrpc/server.rb:380:in `handle'
/usr/lib/ruby/1.8/puppet/network/xmlrpc/processor.rb:44:in `process'
/usr/lib/ruby/1.8/puppet/network/http/rack/xmlrpc.rb:35:in `process'
/usr/lib/ruby/1.8/puppet/network/handler/master.rb:65:in `getconfig'
/usr/lib/ruby/1.8/rubygems/custom_require.rb:31

Jason

[1] I'm not sure why there are no timestamps on the logged stack
traces. Other logs that come directly from apache (such as the server
restart message) have timestamps.

--
"His face disappeared. If someone has no face left, you know it's serious."

Daniel Pittman

unread,

Jan 23, 2011, 4:59:37 PM1/23/11

to puppet...@googlegroups.com

On Jan 22, 2011 8:46 AM, "Jason Wright" <jwr...@google.com> wrote:
>
> Lately our puppetmasters have been returning 500 errors at a
> reasonably high rate. On any given day, 200-300 (out of many
> thousands) of our clients are reporting that they've received a 500
> during a run. So far, the only firm reports I've received have been
> regarding our Ubuntu clients, for which we happen to have central
> logging of puppet errors. I've done some analysis of the logs that I
> have and found that the majority of the 500s (~83%) received by our
> Ubuntu clients are on file metadata operations:

I would love to know what the details of your configuration are, including web server version (Apache, Passenger, etc), and what timeouts you have configured at that level.

I would also be interested to know what the concurrency levels you are seeing are like on the server, and ideally what sort of I/O you have - is this a heavily contended or slow disk, and how fast are writes generally completing?

Regards,
Daniel

Jason Wright

unread,

Jan 24, 2011, 4:20:47 PM1/24/11

to puppet...@googlegroups.com

On Sun, Jan 23, 2011 at 1:59 PM, Daniel Pittman <dan...@puppetlabs.com> wrote:
> I would love to know what the details of your configuration are, including
> web server version (Apache, Passenger, etc), and what timeouts you have
> configured at that level.

Let's see here...

ii puppet 0.25.5-1gg11
centralised configuration management for networks
ii puppet-common 0.25.5-1gg11
common files for puppet and puppetmaster
ii puppetmaster 0.25.5-1gg11
centralised configuration management control daemon

(Our puppet .deb is built from the Debian 0.25.5 git tag. We have one
patch which removes the distribution config.ru from the package.)

These are all stock Ubuntu Lucid:

ii apache2-mpm-worker
2.2.14-5ubuntu8.4 Apache HTTP Server - high speed
threaded model
ii apache2-utils
2.2.14-5ubuntu8.4 utility programs for webservers
ii apache2.2-bin
2.2.14-5ubuntu8.4 Apache HTTP Server common binary
files
ii apache2.2-common
2.2.14-5ubuntu8.4 Apache HTTP Server common files
ii libapache2-mod-passenger 2.2.7debian-1
Rails and Rack support for Apache2

ii ruby1.8 1.8.7.249-2
Interpreter of object-oriented scripting language
Ruby 1
ii librack-ruby1.8 1.1.0-3
A modular Ruby webserver interface (Ruby 1.8)

Passenger config, which we've been using for awhile and I keep
forgetting to try and tune:

<IfModule mod_passenger.c>
PassengerMaxPoolSize 8
PassengerPoolIdleTime 300
PassengerStatThrottleRate 600
PassengerUseGlobalQueue on
PassengerHighPerformance on
</IfModule>

> I would also be interested to know what the concurrency levels you are
> seeing are like on the server, and ideally what sort of I/O you have - is
> this a heavily contended or slow disk, and how fast are writes generally
> completing?

Our config servers are all implemented as Xen virtual machines with 4
vCPUs and 4GB of RAM each. They are organized into VIPs with 2-4
backends per VIP. The VMs are dedicated to puppet but tend to share
physical hardware with VMs running over services. The occasional
spike aside, the typical config server VM under peak load exhibits:

- 15min load < 2.0
- CPU utilization < 20%
- No swapping
- < 15KB/s average disk throughput
- Write latency of 100-200ms or less

- 20-45 RESTful API calls per second (sorry, I don't have the detailed
breakdown of which API calls)
- < 8 catalog requests per minute across all environments

The service doesn't seem to be very loaded to me, especially compared
to the Bad Old Days where our config server VMs ran with 15min load
averages of 12-14 with the same hardware specs.

For the record, I can't actually upgrade to 2.6.x right now. My
natural inclination would have been to have done so prior to starting
this thread up in the first place but we have a defined type which
uses "stage" as a parameter. This conflicts hilariously with run
stages and I have to wait for an OS release with the fix to work its
way through the pipeline before I can resume work on a 2.6.x upgrade.

Jason

Daniel Pittman

unread,

Jan 24, 2011, 5:30:34 PM1/24/11

to puppet...@googlegroups.com

On Mon, Jan 24, 2011 at 13:20, Jason Wright <jwr...@google.com> wrote:
> On Sun, Jan 23, 2011 at 1:59 PM, Daniel Pittman <dan...@puppetlabs.com> wrote:

>> I would love to know what the details of your configuration are, including
>> web server version (Apache, Passenger, etc), and what timeouts you have
>> configured at that level.

[...]

Thanks for sharing those details. As I expected, none of that looks
like it should have any relationship with the failure, and helps me
eliminate some "is it slow I/O" from my early mental space to explore
to understand the problem.

> For the record, I can't actually upgrade to 2.6.x right now. My
> natural inclination would have been to have done so prior to starting
> this thread up in the first place but we have a defined type which
> uses "stage" as a parameter. This conflicts hilariously with run
> stages and I have to wait for an OS release with the fix to work its
> way through the pipeline before I can resume work on a 2.6.x upgrade.

By hilariously, how terrible is it? "Fails in nasty but obvious ways"
would be OK; "sets fire to your cat" would be pretty undesirable, and
would encourage me to think about how to better handle that in future.

Regards,
Daniel
--
⎋ Puppet Labs Developer – http://puppetlabs.com
✉ Daniel Pittman <dan...@puppetlabs.com>
✆ Contact me via gtalk, email, or phone: +1 (877) 575-9775
♲ Made with 100 percent post-consumer electrons

Jason Wright

unread,

Jan 24, 2011, 5:39:54 PM1/24/11

to puppet...@googlegroups.com

On Mon, Jan 24, 2011 at 2:30 PM, Daniel Pittman <dan...@puppetlabs.com> wrote:
> Thanks for sharing those details. As I expected, none of that looks
> like it should have any relationship with the failure, and helps me
> eliminate some "is it slow I/O" from my early mental space to explore
> to understand the problem.

No worries.

> By hilariously, how terrible is it? "Fails in nasty but obvious ways"
> would be OK; "sets fire to your cat" would be pretty undesirable, and
> would encourage me to think about how to better handle that in future.

If I upgrade a puppetmaster to 2.6.3, catalog compilation fails for
all environments using this particular defined type with an error to
the effect of "you're doing run stages wrong". I can't remember the
exact message, sorry.

Daniel Pittman

unread,

Jan 24, 2011, 5:42:44 PM1/24/11

to puppet...@googlegroups.com

On Mon, Jan 24, 2011 at 14:39, Jason Wright <jwr...@google.com> wrote:
> On Mon, Jan 24, 2011 at 2:30 PM, Daniel Pittman <dan...@puppetlabs.com> wrote:

[...]

>> By hilariously, how terrible is it? "Fails in nasty but obvious ways"
>> would be OK; "sets fire to your cat" would be pretty undesirable, and
>> would encourage me to think about how to better handle that in future.
>
> If I upgrade a puppetmaster to 2.6.3, catalog compilation fails for
> all environments using this particular defined type with an error to
> the effect of "you're doing run stages wrong". I can't remember the
> exact message, sorry.

That qualifies into the first class of problems. I don't like it, but
it isn't miscompiling the catalog and sending something entirely
broken to the client without warning, so that isn't so dire as it
could be.

Thanks,

Jason Wright

unread,

Jan 24, 2011, 6:09:15 PM1/24/11

to puppet...@googlegroups.com

On Mon, Jan 24, 2011 at 2:42 PM, Daniel Pittman <dan...@puppetlabs.com> wrote:
> That qualifies into the first class of problems. I don't like it, but
> it isn't miscompiling the catalog and sending something entirely
> broken to the client without warning, so that isn't so dire as it
> could be.

Puppet is doing the right thing. I was just giving some background on
why we weren't on 2.6.x yet.

Daniel Pittman

unread,

Jan 25, 2011, 12:24:32 AM1/25/11

to puppet...@googlegroups.com

On Mon, Jan 24, 2011 at 14:30, Daniel Pittman <dan...@puppetlabs.com> wrote:
> On Mon, Jan 24, 2011 at 13:20, Jason Wright <jwr...@google.com> wrote:
>> On Sun, Jan 23, 2011 at 1:59 PM, Daniel Pittman <dan...@puppetlabs.com> wrote:
>
>>> I would love to know what the details of your configuration are, including
>>> web server version (Apache, Passenger, etc), and what timeouts you have
>>> configured at that level.
>
> [...]
>
> Thanks for sharing those details. As I expected, none of that looks
> like it should have any relationship with the failure, and helps me
> eliminate some "is it slow I/O" from my early mental space to explore
> to understand the problem.

So, I was looking further at this to try and narrow down the places
where the failure originates, and I am vaguely suspicious of the disk
subsystem still, at least for the trace involving 'flush' / locking.
Do any of the exceptions you find elsewhere look like they are related
to I/O failures?

For the other two exceptions, do you have 'ArgumentError' "Could not
find hostname" raised anywhere, or FileServerError, "Fileserver module
%s is not mounted"? They also, ultimately, lead down to a place where
I/O subsystem errors could cause a false failure, and it would be
interesting to know if either of those two were thrown.

Right now my best guess would be that you are seeing some transient
disk errors on the system, which are causing File#flush (which calls
into the kernel to write data, but isn't fsync), and FileTest.exist?,
to return errors that are not otherwise handled.

Working out if those two errors are thrown or not would be a big help
in narrowing that down again.

Jason Wright

unread,

Jan 25, 2011, 12:47:57 PM1/25/11

to puppet...@googlegroups.com

On Mon, Jan 24, 2011 at 9:24 PM, Daniel Pittman <dan...@puppetlabs.com> wrote:
> For the other two exceptions, do you have 'ArgumentError' "Could not
> find hostname" raised anywhere, or FileServerError, "Fileserver module
> %s is not mounted"? They also, ultimately, lead down to a place where
> I/O subsystem errors could cause a false failure, and it would be
> interesting to know if either of those two were thrown.

These two lead to a fileserver module not mounted exception:

Jason

Brice Figureau

unread,

Jan 25, 2011, 1:25:17 PM1/25/11

to puppet...@googlegroups.com

xmlrpc?
Do you still have 0.24.x clients?

You omitted one important piece of information which is the kind of
exception along with the error message. Can you post it so that we can
understand what happens?
--
Brice Figureau
My Blog: http://www.masterzen.fr/

Nigel Kersten

unread,

Jan 25, 2011, 1:27:03 PM1/25/11

to puppet...@googlegroups.com

Brice, I'm pretty sure we still had some XMLRPC left in 0.25.x, I
don't believe we completely got rid of it until 2.6.x

> --
> Brice Figureau
> My Blog: http://www.masterzen.fr/
>

> --
> You received this message because you are subscribed to the Google Groups "Puppet Users" group.
> To post to this group, send email to puppet...@googlegroups.com.
> To unsubscribe from this group, send email to puppet-users...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/puppet-users?hl=en.
>
>

Brice Figureau

unread,

Jan 25, 2011, 1:48:14 PM1/25/11

to puppet...@googlegroups.com

Oh, I'm well aware of this. I was asking about 0.24.x clients, because I
thought the not-ported-to-REST handlers was only the filebucket.
I was pretty sure file_metadata and file_content were handled through
fullblown REST.

BTW, I really think this is a thread race, as the first trace reminds me
something I reported (and we fixed) for 2.6.

Looking to the 0.25.5 code of the xmlrpc fileserver handler, when
mounting it tries to find the current node, which might trigger a call
to the ENC, if I'm not mistaken.
If this operation lasts a long time, it is well possible that another
threads trigger the same codepath. This same codepath also uses the
environment cached_attr "module": I also discovered a thread race in
this code that was fixed in 2.6.

Would it be possible to run passenger in a mode where it won't spawn
more than one thread and see if the problem disappears?

Jason Wright

unread,

Jan 25, 2011, 2:10:12 PM1/25/11

to puppet...@googlegroups.com

On Tue, Jan 25, 2011 at 10:48 AM, Brice Figureau
<brice-...@daysofwonder.com> wrote:
>>> xmlrpc?
>>> Do you still have 0.24.x clients?

No. We're 0.25.5 across the board.

>>> You omitted one important piece of information which is the kind of
>>> exception along with the error message. Can you post it so that we can
>>> understand what happens?

No, I can't. As I originally stated, the actual exception isn't
making it into the apache error log and since the stack traces aren't
timestamped, I can't correlate to the puppetmasterd logs. I'd love to
understand why this is so I can provide better information to you.
I've received some passenger pointers from a coworker and am going to
play with the logging options to see if I can affect this.

> Would it be possible to run passenger in a mode where it won't spawn
> more than one thread and see if the problem disappears?

You mean setting PassengerMaxPoolSize to 1? I can try that on one of
the production VMs but I'll have to revert it immediately if it causes
any other problems.

Brice Figureau

unread,

Jan 25, 2011, 2:26:26 PM1/25/11

to puppet...@googlegroups.com

On 25/01/11 20:10, Jason Wright wrote:
> On Tue, Jan 25, 2011 at 10:48 AM, Brice Figureau
> <brice-...@daysofwonder.com> wrote:
>>>> xmlrpc?
>>>> Do you still have 0.24.x clients?
>
> No. We're 0.25.5 across the board.
>
>>>> You omitted one important piece of information which is the kind of
>>>> exception along with the error message. Can you post it so that we can
>>>> understand what happens?
>
> No, I can't. As I originally stated, the actual exception isn't
> making it into the apache error log and since the stack traces aren't
> timestamped, I can't correlate to the puppetmasterd logs. I'd love to
> understand why this is so I can provide better information to you.
> I've received some passenger pointers from a coworker and am going to
> play with the logging options to see if I can affect this.

OK, I missed this fact in the discussion, sorry.

>> Would it be possible to run passenger in a mode where it won't spawn
>> more than one thread and see if the problem disappears?
>
> You mean setting PassengerMaxPoolSize to 1? I can try that on one of
> the production VMs but I'll have to revert it immediately if it causes
> any other problems.

I really don't know anything about Passenger, but reading the
documentation I don't think that's the correct settings.
I couldn't find in the documentation how passenger or rack use ruby
threads, so I'm unsure about what to do (one solution would be to add
some sync on the puppet side).

Brice Figureau

unread,

Jan 26, 2011, 4:38:29 AM1/26/11

to puppet...@googlegroups.com

My analysis might be completely wrong, because reading the passenger
documentation, it looks like there should be only one thread entering a
given master process at any given time.

Still there can be some tricky multi-process issues, like the one we
fixed in 2.6:
http://projects.puppetlabs.com/issues/4923

--
Brice Figureau
Follow the latest Puppet Community evolutions on www.planetpuppet.org!

Jason Wright

unread,

Jan 26, 2011, 2:42:30 PM1/26/11

to puppet...@googlegroups.com

On Wed, Jan 26, 2011 at 1:38 AM, Brice Figureau
<brice-...@daysofwonder.com> wrote:
> Still there can be some tricky multi-process issues, like the one we
> fixed in 2.6:
> http://projects.puppetlabs.com/issues/4923

My bugfix for the manifest issue that's blocking me from upgrading to
2.6.x has progressed far enough through our internal Ubuntu release
process that I can resume testing for the upgrade, so that's good
news. I don't think I'll be able to put production traffic onto a
2.6.x server until at least next week, and that's assuming that there
are no showstoppers in the Ubuntu release.

Daniel Pittman

unread,

Jan 26, 2011, 4:17:51 PM1/26/11

to puppet...@googlegroups.com

On Wed, Jan 26, 2011 at 11:42, Jason Wright <jwr...@google.com> wrote:
> On Wed, Jan 26, 2011 at 1:38 AM, Brice Figureau
> <brice-...@daysofwonder.com> wrote:
>> Still there can be some tricky multi-process issues, like the one we
>> fixed in 2.6:
>> http://projects.puppetlabs.com/issues/4923
>
> My bugfix for the manifest issue that's blocking me from upgrading to
> 2.6.x has progressed far enough through our internal Ubuntu release
> process that I can resume testing for the upgrade, so that's good
> news. I don't think I'll be able to put production traffic onto a
> 2.6.x server until at least next week, and that's assuming that there
> are no showstoppers in the Ubuntu release.

For what it is worth I have been looking at this quietly in the
background, and come to the conclusion that to progress further I am
going to have to either reproduce this myself (failed, so far), or get
a bit of state instrumentation into that code to track down exactly
what conditions are being hit to trigger the failure.

Sadly, this is also being bitten by my need to get our current release
out the door, and other company stuff, too.

Sounds like a good next step might be for y'all to let me know when
you might look at being able to do that instrumentation, and I can try
and send you a satisfactory patch to trial?

Jason Wright

unread,

Jan 26, 2011, 4:56:16 PM1/26/11

to puppet...@googlegroups.com

On Wed, Jan 26, 2011 at 1:17 PM, Daniel Pittman <dan...@puppetlabs.com> wrote:
> For what it is worth I have been looking at this quietly in the
> background, and come to the conclusion that to progress further I am
> going to have to either reproduce this myself (failed, so far), or get
> a bit of state instrumentation into that code to track down exactly
> what conditions are being hit to trigger the failure.

I haven't been able to reproduce it either. So far, I've tried
annexing a bunch of machines and running puppetd in a tight loop
against an otherwise idle puppetmaster VM and I can get the rate of
API calls and catalog compiles up to the correct level for one of our
busy VMs, but no 500s (or even 400s) so far. If this fails, I have
some code which fetches pluginsync metadata and then proceeeds to make
fileserver calls for every .rb listed. I'll start using that generate
traffic, since these are the sorts of operations which get the most
errors.

> Sounds like a good next step might be for y'all to let me know when
> you might look at being able to do that instrumentation, and I can try
> and send you a satisfactory patch to trial?

What instrumentation would you be looking for?

Daniel Pittman

unread,

Jan 26, 2011, 5:36:54 PM1/26/11

to puppet...@googlegroups.com

Specifically, around the "not mounted" fault, in the 'splitpath'
method, identify what the value of 'mount' in the outer 'unless' is,
and what @mounts and mount_name contain. My hope would be to use that
to narrow down the possible causes, and either confirm or eliminate a
thread race or something.

I doubt that will be the complete data set, but it should help move
forward. Annoyingly, I don't have a super-solid picture of what the
problem is at this stage, because it looks like it shouldn't be
possible to hit the situation but, clearly, it is getting there...

Brice Figureau

unread,

Jan 27, 2011, 4:31:53 AM1/27/11

to puppet...@googlegroups.com

On Wed, 2011-01-26 at 13:56 -0800, Jason Wright wrote:
> On Wed, Jan 26, 2011 at 1:17 PM, Daniel Pittman <dan...@puppetlabs.com> wrote:
> > For what it is worth I have been looking at this quietly in the
> > background, and come to the conclusion that to progress further I am
> > going to have to either reproduce this myself (failed, so far), or get
> > a bit of state instrumentation into that code to track down exactly
> > what conditions are being hit to trigger the failure.
>
> I haven't been able to reproduce it either. So far, I've tried
> annexing a bunch of machines and running puppetd in a tight loop
> against an otherwise idle puppetmaster VM and I can get the rate of
> API calls and catalog compiles up to the correct level for one of our
> busy VMs, but no 500s (or even 400s) so far. If this fails, I have
> some code which fetches pluginsync metadata and then proceeeds to make
> fileserver calls for every .rb listed. I'll start using that generate
> traffic, since these are the sorts of operations which get the most
> errors.

There's no guarantee you exercise the exact same XLMRPC stuff than you
are seeing in production (especially if it's the filebucket handler).
There are great chances your test environment doesn't change, so the
filebucket is almost never used.

Brice Figureau

unread,

Jan 27, 2011, 4:42:49 AM1/27/11

to puppet...@googlegroups.com

From the stacktrace, it is highly improbable it's the filebucket. It
really looks like the XMLRPC fileserving api.

But, I did a quick check and to my knowledge pluginsync, factsync and
file serving are using the REST api which doesn't call any of the XMLRPC
handlers.

Which make me think that either you still have 0.24.x clients or I
missed some 0.25 client feature that uses XMLRPC file serving.

Brice Figureau

unread,

Jan 27, 2011, 4:49:43 AM1/27/11

to puppet...@googlegroups.com

There are some thread races in this codepath:
* we currently know that all cached_attrs (and splitpath uses one
through the module accessor of the environment) are subject to a thread
race in 0.25.
* there is another one when reading fileserver.conf (in readconfig).

But since normally passenger should make sure there is only one thread
in a given running puppet process we should be immune.

> I doubt that will be the complete data set, but it should help move
> forward. Annoyingly, I don't have a super-solid picture of what the
> problem is at this stage, because it looks like it shouldn't be
> possible to hit the situation but, clearly, it is getting there...

Yes, so we're certainly missing something, and instrumenting this
codepath will help understand the root cause.

Jason Wright

unread,

Jan 27, 2011, 2:40:25 PM1/27/11

to puppet...@googlegroups.com

On Thu, Jan 27, 2011 at 1:42 AM, Brice Figureau
<brice-...@daysofwonder.com> wrote:
> Which make me think that either you still have 0.24.x clients or I
> missed some 0.25 client feature that uses XMLRPC file serving.

All of our OS teams upgraded to 0.25.x clients months ago. I don't
even think our manifests compile for 0.24.x clients any more and we do
not support clients earlier than 0.25.x any longer. If we do have
0.24.x clients (and there probably are a few running around), they're
broken machines. In any case, the nodes which are reporting the 500s
which caused our Ubuntu team to open the bug in the first place are
definitely on 0.25.5. I don't know that I've reported the correct
stack traces, only some of the more common ones.

Brice Figureau

unread,

Jan 27, 2011, 4:46:55 PM1/27/11

to puppet...@googlegroups.com

On 27/01/11 20:40, Jason Wright wrote:
> On Thu, Jan 27, 2011 at 1:42 AM, Brice Figureau
> <brice-...@daysofwonder.com> wrote:
>> Which make me think that either you still have 0.24.x clients or I
>> missed some 0.25 client feature that uses XMLRPC file serving.
>
> All of our OS teams upgraded to 0.25.x clients months ago. I don't
> even think our manifests compile for 0.24.x clients any more and we do
> not support clients earlier than 0.25.x any longer. If we do have
> 0.24.x clients (and there probably are a few running around), they're
> broken machines. In any case, the nodes which are reporting the 500s
> which caused our Ubuntu team to open the bug in the first place are
> definitely on 0.25.5. I don't know that I've reported the correct
> stack traces, only some of the more common ones.

I believe you :)

Can you correlate the stacktrace to a given client through the access
log, and with that maybe we can find what on the client triggers this.
I did a quick grep in the 0.25.5 source-code to find what could still
use the xmlrpc fileserver, but couldn't find anything.
Adding a couple of Puppet.info() to puppet/network/handler/fileserver.rb
around line 232 to print both the client and the url would maybe help
you find who is triggering this.

Regarding the first stacktrace you posted in your first e-mail, I'm sure
this is the writelock multiprocess issue we fixed in 2.6 and that I
referred to in a previous e-mail. This is a single (ok maybe 2) line fix
that is safe for you to backport to your 0.25.5 master.

HTH,

Jason Wright

unread,

Jan 27, 2011, 6:59:48 PM1/27/11

to puppet...@googlegroups.com

On Thu, Jan 27, 2011 at 1:46 PM, Brice Figureau
<brice-...@daysofwonder.com> wrote:

> Regarding the first stacktrace you posted in your first e-mail, I'm sure
> this is the writelock multiprocess issue we fixed in 2.6 and that I
> referred to in a previous e-mail. This is a single (ok maybe 2) line fix
> that is safe for you to backport to your 0.25.5 master.

You're referring to this patch, correct?

http://projects.puppetlabs.com/projects/puppet/repository/revisions/9ba0c8a22c6f9ca856851ef6c2d38242754a7a00/diff/lib/puppet/util/file_locking.rb

I've got that patched into our package and will try to get it out to
our canary puppetmasters tomorrow.

Brice Figureau

unread,

Jan 28, 2011, 3:17:07 AM1/28/11

to puppet...@googlegroups.com

On Thu, 2011-01-27 at 15:59 -0800, Jason Wright wrote:
> On Thu, Jan 27, 2011 at 1:46 PM, Brice Figureau
> <brice-...@daysofwonder.com> wrote:
>
> > Regarding the first stacktrace you posted in your first e-mail, I'm sure
> > this is the writelock multiprocess issue we fixed in 2.6 and that I
> > referred to in a previous e-mail. This is a single (ok maybe 2) line fix
> > that is safe for you to backport to your 0.25.5 master.
>
> You're referring to this patch, correct?
>
> http://projects.puppetlabs.com/projects/puppet/repository/revisions/9ba0c8a22c6f9ca856851ef6c2d38242754a7a00/diff/lib/puppet/util/file_locking.rb
>
> I've got that patched into our package and will try to get it out to
> our canary puppetmasters tomorrow.

Yes, this is the patch I was talking about. I have great hopes it will
fix the filelock related stacktrace.

Jason Wright

unread,

Jan 31, 2011, 3:48:23 PM1/31/11

to puppet...@googlegroups.com

On Fri, Jan 28, 2011 at 12:17 AM, Brice Figureau
<brice-...@daysofwonder.com> wrote:
> Yes, this is the patch I was talking about. I have great hopes it will
> fix the filelock related stacktrace.

It's been rolled out to our canary servers. The rest of the fleet
will get it tomorrow, along with the debug patch that Daniel provided
me. I have traces enabled on one of the canaries, but it hasn't
logged anything yet from Daniel's patch. I'm also taking Daniel's
hint and adding puts() based logging on this one puppetmaster for
every unique exception we're getting, which gives me the missing
details from the raise that aren't being logged.

Jason Wright

unread,

Feb 9, 2011, 3:18:51 PM2/9/11

to puppet...@googlegroups.com

I rolled out 2.6.3 to four puppetmasters yesterday as part of a canary
and while it's been less than 24 hours, there are already encouraging
signs. The servers participating in the canary logged just short of
600 500 errors yesterday and have logged 0 so far today. For further
comparison, the remainder of our puppetmaster fleet has logged over
1000 500 errors in just twelve hours today.

Jason

Jason Wright

unread,

Feb 11, 2011, 12:06:11 PM2/11/11

to puppet...@googlegroups.com

I've completed rolling out 2.6.3 and it's completely resolved our issues.

Thanks for the help,
Jason

--

Nigel Kersten

unread,

Feb 11, 2011, 1:18:35 PM2/11/11

to puppet...@googlegroups.com

On Fri, Feb 11, 2011 at 9:06 AM, Jason Wright <jwr...@google.com> wrote:
> I've completed rolling out 2.6.3 and it's completely resolved our issues.
>
> Thanks for the help,

Wahoo! That's awesome to hear Jason.

It might be worth starting canarying 2.6.5rc2, as you're doing report
submissions, and we've got a fix in there for larger report submission
problems.

> Jason
>
> On Wed, Feb 9, 2011 at 12:18 PM, Jason Wright <jwr...@google.com> wrote:
>> I rolled out 2.6.3 to four puppetmasters yesterday as part of a canary
>> and while it's been less than 24 hours, there are already encouraging
>> signs. The servers participating in the canary logged just short of
>> 600 500 errors yesterday and have logged 0 so far today. For further
>> comparison, the remainder of our puppetmaster fleet has logged over
>> 1000 500 errors in just twelve hours today.
>>
>> Jason
>>
>
>
>
> --
> "His face disappeared. If someone has no face left, you know it's serious."
>

Brice Figureau

unread,

Feb 11, 2011, 1:37:15 PM2/11/11

to puppet...@googlegroups.com

On 11/02/11 18:06, Jason Wright wrote:
> I've completed rolling out 2.6.3 and it's completely resolved our issues.

That's terriffic!
Thanks for sharing this good news.

Reply all

Reply to author

Forward