Bugzilla 4.4.6 / CentOS 6.2 / sendmail - stops processing outbound e-mail notifications, queue grows (> 6k msgs)

Brent Rolland

Dec 23, 2015, 7:07:17 PM
to support-...@lists.mozilla.org
We just completed an upgrade to Bugzilla 4.4.6 (from 3.6.2) on CentOS 6.2.  The OS shows very light loads, and nearly 3.5 GB of free RAM (out of 4 GB on the system).

We're using sendmail to handle outbound notification e-mail - and having a problem ...
With an empty bugmail queue, email will move along reasonably well for a while, then suddenly (no obvious errors or cause) sendmail simply stops processing the queue.
If we restart sendmail, it will pick up 10 - 15 messages and process them, then stop again.
If we empty the bugmail queue (clear the ts_id table), messages will be processed for a while, and then stop.
If you simply attempt to send mail from (your favorite mail client) on the same host, that works fine.  It's just the bugzilla queue that's getting stuck.
The jobqueue.pl script is running - in fact, that's how we get the queue-size status info.

We've not been able to find any pattern yet beyond this (can't say if it's 10 messages or 15 or 20 ... it's just "some small number").
The bugmail queue then grows to 4k - 6k jobs in the table -- fairly quickly (it's a fairly active Bugzilla instance).
We're stumped.  Our IT team is stumped.
Anyone have any ideas on things to look at that might help?  I suspect we're missing something really obvious ...

Thanks!
Brent

Thorsten Schöning

Dec 24, 2015, 5:35:37 AM
to support-...@lists.mozilla.org
Hello Brent Rolland,
on Thursday, 24 December 2015 at 01:06 you wrote:

> We just completed an upgrade to Bugzilla 4.4.6 (from 3.6.2) on
> CentOS 6.2.

Any reasons for not using the latest stable version 5?

> The OS shows very light loads[...]

Is that only CPU load, or I/O as well?

> We're using sendmail to handle outbound notification e-mail - and having a problem ...

Please be more detailed about your configuration: Which mail sending
method is configured in Bugzilla and how exactly do you run
sendmail? It sounds like you use it as a processing daemon with its
own queue and not just as a command line app to feed messages to e.g.
postfix, which in my experience would be more common these days.

> With an empty bugmail queue, email will move along reasonably well
> for awhile, then suddenly (no obvious errors or cause) sendmail
> simply stops processing the queue.
> If we restart sendmail, it will pick up 10 - 15 messages and process them, then stop again.
> If we empty the bugmail queue (clear the ts_id table), messages
> will be processed for awhile, and then stop.

This sounds like there's some bottleneck in sendmail which leads to a
deadlock for some reason. In your case I would first activate/improve
logging in jobqueue.pl (-d, look at the file) and sendmail (LogLevel),
and if you already did, you could provide some more information from
those logs. Additionally, find out how many workers are configured
at most for sendmail (ForkEachJob, MaxDaemonChildren, ConnectionRateThrottle)
and for jobqueue.pl. I think the latter uses exactly 1 worker per
running instance of jobqueue.pl, because I didn't see any
configuration and JobQueue::subprocess_worker reads that way, but I
may be wrong, or you may be running multiple instances or whatever. In
debugging mode the logs should tell you how many processes were
spawned and in which order.

https://www.bugzilla.org/docs/tip/en/html/api/jobqueue.html
http://www.sendmail.org/~ca/email/man/sendmail.html

Additionally, don't just clear the Bugzilla queue, but instead look at
the table data for error codes, exit status and delays per job if your
mails got stuck. I don't see where the delays are stored, but Bugzilla
clearly provides some in case of errors. There's also a rather
long timeout of 5 minutes until a job is recognized as failed, so if
only 1 worker is used it may simply take a long time to see any
change in your jobs' status.

Bugzilla::Job::Mailer::retry_delay
http://search.cpan.org/~jfearn/TheSchwartz-1.12/lib/TheSchwartz/Job.pm#$job->failed(_$msg,_$exit_status_)
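
A minimal sketch of what such a look at the job tables might involve, written in Python for brevity. It assumes TheSchwartz's usual table names (ts_job, ts_error, ts_exitstatus) and uses an in-memory SQLite database as a stand-in for Bugzilla's real database. The reduced column set, the `queue_report` helper and all sample rows are invented for illustration and are not taken from the installation discussed here.

```python
import sqlite3

# Stand-in for Bugzilla's database: three TheSchwartz tables, reduced to the
# columns this sketch reads. Every row below is invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ts_job (jobid INTEGER PRIMARY KEY, funcid INTEGER,
                     insert_time INTEGER, run_after INTEGER,
                     grabbed_until INTEGER);
CREATE TABLE ts_error (error_time INTEGER, jobid INTEGER, message TEXT);
CREATE TABLE ts_exitstatus (jobid INTEGER, status INTEGER,
                            completion_time INTEGER);
INSERT INTO ts_job VALUES (1, 1, 1000, 1000, 1300),
                          (2, 1, 1010, 1040, 0),
                          (3, 1, 1020, 1020, 0);
INSERT INTO ts_error VALUES (1030, 2, 'hypothetical: connection timed out');
INSERT INTO ts_exitstatus VALUES (2, 1, 1030);
""")


def queue_report(conn, now):
    """Summarise queue depth, grabbed and delayed jobs, and recorded errors.

    `now` is a Unix timestamp; the time columns are compared against it.
    """
    cur = conn.cursor()
    queued = cur.execute("SELECT COUNT(*) FROM ts_job").fetchone()[0]
    # Jobs a worker has claimed and not yet released or finished.
    grabbed = cur.execute(
        "SELECT COUNT(*) FROM ts_job WHERE grabbed_until > ?", (now,)
    ).fetchone()[0]
    # Jobs that are scheduled for later, for example after a failed attempt.
    delayed = cur.execute(
        "SELECT COUNT(*) FROM ts_job WHERE run_after > ?", (now,)
    ).fetchone()[0]
    nonzero_exits = cur.execute(
        "SELECT COUNT(*) FROM ts_exitstatus WHERE status <> 0"
    ).fetchone()[0]
    errors = cur.execute(
        "SELECT jobid, message FROM ts_error ORDER BY error_time"
    ).fetchall()
    return {
        "queued": queued,
        "grabbed": grabbed,
        "delayed": delayed,
        "nonzero_exits": nonzero_exits,
        "errors": errors,
    }


report = queue_report(conn, now=1035)
for key, value in report.items():
    print(key, value)
```

Against a real installation the same queries would run through whatever database driver Bugzilla is configured for. A queue that keeps growing while the grabbed count stays at one and no errors are recorded would point at a worker that is stuck rather than failing.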

> If you simply attempt to send mail from (your favorite mail client)
> on the same host, that works fine.

But the interesting part is what happens if you feed mails to sendmail
very quickly, especially quicker than it can deliver them, which
jobqueue.pl is able to do but your manually driven client surely is not.
Also of interest is how Bugzilla and your client feed mails to sendmail,
and maybe even what happens if sendmail is already stuck and you try to
feed it some additional mails using your client.

> It's just the bugzilla queue that's getting stuck.

I don't think so. Instead I'm guessing Bugzilla is at first feeding
mails to sendmail very quickly until sendmail gets stuck for some reason.
Then comes the interesting part: Is sendmail still able to queue the
jobs of jobqueue.pl, but just not delivering them anymore? Or is
sendmail not even able to queue new mails anymore, so that jobqueue.pl
waits 5 minutes per job to get an error message?
That should be reflected in the status and exit codes and such of the
jobs in Bugzilla's database, and in whether sendmail's own queue
keeps growing or not.

Kind regards,

Thorsten Schöning

--
Thorsten Schöning E-Mail: Thorsten....@AM-SoFT.de
AM-SoFT IT-Systeme http://www.AM-SoFT.de/

Phone.............05151- 9468- 55
Fax...............05151- 9468- 88
Mobile.............0178-8 9468- 04

AM-SoFT GmbH IT-Systeme, Brandenburger Str. 7c, 31789 Hameln
AG Hannover HRB 207 694 - Managing Director: Andreas Muchow

socce...@gmail.com

Dec 24, 2015, 2:43:27 PM
Thanks for the ideas - a bit more from additional debugging this morning:

We tried manually running the "jobqueue -f -d once" command to see what happens. Also turned on SMTP debug messages. We can run the command in rapid series 6 times -- then it hangs. I include the debug output below for the 6th and 7th attempt.

=====
[root@bugsprd01 html]# ./jobqueue.pl -f -d once
PIDFILE=./data/jobqueue.pl.pid
Starting up...
TheSchwartz::work_once got job of class 'Bugzilla::Job::Mailer', priority 5
Working on Bugzilla::Job::Mailer ...
Net::SMTP>>> Net::SMTP(2.31)
Net::SMTP>>> Net::Cmd(2.29)
Net::SMTP>>> Exporter(5.72)
Net::SMTP>>> IO::Socket::INET(1.31)
Net::SMTP>>> IO::Socket(1.31)
Net::SMTP>>> IO::Handle(1.28)
Net::SMTP=GLOB(0x4a17768)<<< 220 host.domain.com Microsoft ESMTP MAIL Service ready at Thu, 24 Dec 2015 09:46:04 -0800
Net::SMTP=GLOB(0x4a17768)>>> EHLO bugz.domain.com
Net::SMTP=GLOB(0x4a17768)<<< 250-host.domain.com Hello [192.168.nnn.nn]
Net::SMTP=GLOB(0x4a17768)<<< 250-SIZE
Net::SMTP=GLOB(0x4a17768)<<< 250-PIPELINING
Net::SMTP=GLOB(0x4a17768)<<< 250-DSN
Net::SMTP=GLOB(0x4a17768)<<< 250-ENHANCEDSTATUSCODES
Net::SMTP=GLOB(0x4a17768)<<< 250-STARTTLS
Net::SMTP=GLOB(0x4a17768)<<< 250-X-ANONYMOUSTLS
Net::SMTP=GLOB(0x4a17768)<<< 250-AUTH NTLM LOGIN
Net::SMTP=GLOB(0x4a17768)<<< 250-X-EXPS GSSAPI NTLM
Net::SMTP=GLOB(0x4a17768)<<< 250-8BITMIME
Net::SMTP=GLOB(0x4a17768)<<< 250-BINARYMIME
Net::SMTP=GLOB(0x4a17768)<<< 250-CHUNKING
Net::SMTP=GLOB(0x4a17768)<<< 250-XEXCH50
Net::SMTP=GLOB(0x4a17768)<<< 250-XRDST
Net::SMTP=GLOB(0x4a17768)<<< 250 XSHADOW
Net::SMTP=GLOB(0x4a17768)>>> MAIL FROM:<sen...@domain.com>
Net::SMTP=GLOB(0x4a17768)<<< 250 2.1.0 Sender OK
Net::SMTP=GLOB(0x4a17768)>>> RCPT TO:<re...@domain.com>
Net::SMTP=GLOB(0x4a17768)<<< 250 2.1.5 Recipient OK
Net::SMTP=GLOB(0x4a17768)>>> DATA
Net::SMTP=GLOB(0x4a17768)<<< 354 Start mail input; end with <CRLF>.<CRLF>
Net::SMTP=GLOB(0x4a17768)>>> From: sen...@domain.com
Net::SMTP=GLOB(0x4a17768)>>> To: re...@domain.com
Net::SMTP=GLOB(0x4a17768)>>> Reply-To: bugmail.domain.com
Net::SMTP=GLOB(0x4a17768)>>> Subject: [Bug xxxxx] Messages: Obfuscated
Net::SMTP=GLOB(0x4a17768)>>> Limit mesesages
Net::SMTP=GLOB(0x4a17768)>>> Date: Wed, 23 Dec 2015 10:32:24 +0000

< intervening bug / message detail deleted for privacy reasons >

Net::SMTP=GLOB(0x4a17768)>>> .
Net::SMTP=GLOB(0x4a17768)<<< 250 2.6.0 <bug-421208-39...@https.bugx.domain.com/> [InternalId=204588524] Queued mail for delivery
Net::SMTP=GLOB(0x4a17768)>>> QUIT
Net::SMTP=GLOB(0x4a17768)<<< 221 2.0.0 Service closing transmission channel
job completed

Then the 7th does this:

[root@bugsprd01 html]# ./jobqueue.pl -f -d once
PIDFILE=./data/jobqueue.pl.pid
Starting up...
TheSchwartz::work_once got job of class 'Bugzilla::Job::Mailer', priority 5
Working on Bugzilla::Job::Mailer ...

< and here it hangs forever >

=====

It's as if it's not even trying to connect on attempt #7. AND - it never times out. Well, OK, I waited well beyond 5 minutes -- almost 15 minutes -- and it never un-stuck. I had to kill the script process.

Interesting to note that if I then invoke the script again, it works fine for 6 more invocations.

Could this have anything to do with the GMAIL "throttle" that was added to prevent GMAIL from shutting Bugzilla e-mail out?

Collecting other data - hope to have more for people to look at and hopefully help drive this one to ground. Trying to find where jobqueue logs its errors in the database ...

Brent

socce...@gmail.com

Dec 24, 2015, 3:03:45 PM
A related question on this point -- we're seeing multiple jobqueue / bugzilla-queue processes on the system.

Am I correct in assuming that one is the parent and the other is just a child / worker to be monitored / killed / restarted automatically as determined by the parent?

My IT guy is asking why 2 processes / 2 PID files are being created ...

Thanks!

Thorsten Schöning

Dec 26, 2015, 6:02:25 AM
to support-...@lists.mozilla.org
Hello socce...@gmail.com,
on Thursday, 24 December 2015 at 21:03 you wrote:

> Am I correct in assuming that one is the parent and the other is
> just a child / worker to be monitored / killed / restarted
> automatically as determined by the parent?

Yes, at least that's how I understand JobQueue::subprocess_worker.

Thorsten Schöning

Dec 26, 2015, 7:14:50 AM
to support-...@lists.mozilla.org
Hello socce...@gmail.com,
on Thursday, 24 December 2015 at 20:43 you wrote:

> Also turned on SMTP debug messages.
[...]
> Microsoft ESMTP MAIL Service ready

Didn't you tell us about sendmail in your first mail? This looks like
you are talking SMTP to your own Exchange relay.

> We can run the command in rapid series 6 times -- then it hangs.

So what exactly does that mean? Are you waiting each time until the
former process has exited, so that at any given time only one process is
active, or are you executing each new process in parallel? Your given
command doesn't seem to put anything in the background or such, so I
assume that there's always exactly one worker executing, speaking
SMTP to your relay and finishing, and that the 7th execution hangs even
though no former worker is active anymore.

What happens if you use the "TEST" method instead of SMTP? In that
case the mail should be queued as well, but never forwarded to your
Exchange. If that works for the same 7 rapid invocations of
jobqueue.pl as before, your Exchange may be the problem. The results
of "TEST" are written to "mailer.testfile" in Bugzilla's data dir.

> Then the 7th does this:
[...]
> Working on Bugzilla::Job::Mailer ...

> < and here it hangs forever >

Looks like you need to add some debugging manually in
Bugzilla::Mailer::MessageToMTA, which should ultimately be called by
the worker. Building, formatting and sending messages has been
reworked in Bugzilla 5, so you might want to give that a try to see if
things change.

> It's as if it's not even trying to connect on attempt #7.

Looking at Net::SMTP::new I guess it is, because it's only logging the
first debug messages AFTER the first successful attempt to connect:

> Net::SMTP::new:
>   unless ($obj->response() == CMD_OK) {
>     $obj->close();
>     return undef;
>   }

> Net::Cmd::response:
> sub response {
>   my $cmd = shift;
>   my ($code, $more) = (undef) x 2;
>
>   ${*$cmd}{'net_cmd_resp'} ||= [];
>
>   while (1) {
>     my $str = $cmd->getline();
>
>     return CMD_ERROR
>       unless defined($str);
>
>     $cmd->debug_print(0, $str)
>       if ($cmd->debug);
> [...]

No idea why it doesn't time out, whether on the connection or because of
TheSchwartz, but to me this looks like it can't establish the initial
connection properly. You could check that using Wireshark on
your Exchange to see if any traffic happens on attempt #7.
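
As a rough way to check the same thing from the Bugzilla host without touching Bugzilla, here is a minimal Python sketch that opens a TCP connection to the relay with an explicit timeout and waits for the SMTP greeting. The function name `probe_smtp_banner`, the default port and the timeout value are assumptions for illustration. It only separates "cannot connect" from "connected but no greeting"; it does not reproduce what Net::SMTP does internally.

```python
import socket


def probe_smtp_banner(host, port=25, timeout=10.0):
    """Connect to host:port and return the first line the server sends.

    On failure a short description is returned instead, so that a silent
    server can be told apart from one that refuses the connection.
    """
    try:
        # The timeout applies to the connect and to later reads on the socket.
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        return f"connect failed: {exc}"
    with sock:
        try:
            data = sock.recv(512)
        except socket.timeout:
            return "connected, but no banner within timeout"
        except OSError as exc:
            return f"read failed: {exc}"
    if not data:
        return "connected, but server closed without banner"
    return data.decode("ascii", "replace").splitlines()[0]


# Example call (the host name is a placeholder, not the real relay):
#   print(probe_smtp_banner("relay.example.com", 25, timeout=10.0))
```

Running the probe a number of times in quick succession, in the same way jobqueue.pl was invoked, would show whether the relay stops greeting after a few connections regardless of which client is talking to it.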

> Interesting to note that if I then invoke the script again, it
> works fine for 6 more invocations.

This may be pure coincidence. Try the same without executing the
command the 7th time: just wait some minutes after the 6th attempt and
see what happens.

> Could this have anything to do with the GMAIL "throttle" that was
> added to prevent GMAIL from shutting Bugzilla e-mail out?

Which bugs are you talking about? I didn't see any special GMAIL code
in Bugzilla::Mailer or Bugzilla::BugMail.

> Collecting other data - hope to have more for people to look at and
> hopefully help drive this one to ground.

Don't just focus on Bugzilla, but look at your SMTP server and its logs as
well. Additionally have a look at your database logs to see if they report
any deadlocks or such during transactions. Postgres e.g. does detect such
things, but would report an error on the deadlocked connections as
well.

> Trying to find where
> jobqueue logs it's errors in the database ...

The ts_* tables should be the place to go.

bkro...@gmail.com

Dec 26, 2015, 11:54:42 AM
Thanks - also very helpful.

My apologies on the e-mailer - we changed out sendmail for SMTP as part of our troubleshooting, to see whether it was a sendmail config problem on our end. The "good news" for us is that it's not that ...

On the GMAIL comment - there was an item on the Planet.bugzilla.org page that says:

"GMail support

To support Mozilla's transition to GMail, we added two features. First, we now limit the number of emails sent to a user per minute and per hour, since GMail will temporarily disable accounts that receive too much mail, and some BMO users receive a lot of bugmail."

Not sure whether that applies to us in 4.4.6 or not ...

As to why we didn't go to 5.x ... the team actually started this migration 9 months ago - and failed due to a performance problem that was discovered during deployment (which turned out to be a patched bugzilla bug ...). At any rate, we wanted to finish that upgrade and not introduce any other variables. Or so we hoped ...

You have correctly understood what we tried -- we simply invoke the "once" option to the script and let it run in the foreground, then run it again, and again, and again from the command prompt -- until it hangs at iteration 7.
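
The manual procedure above can be sketched as a small Python loop that runs a command repeatedly in the foreground and reports the first run that exceeds a time limit. The helper name `run_until_hang` and the default limits are invented for illustration. Note that `subprocess.run` kills the child when the timeout expires, so this variant is for counting runs, not for inspecting a hung process.

```python
import subprocess


def run_until_hang(cmd, max_runs=10, timeout=60.0):
    """Run cmd up to max_runs times, one after another.

    Returns the 1-based number of the first run that did not finish within
    `timeout` seconds, or None if every run finished in time.
    """
    for attempt in range(1, max_runs + 1):
        try:
            subprocess.run(
                cmd,
                timeout=timeout,
                check=False,  # a non-zero exit is not treated as a hang
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            return attempt
    return None


# Example call with the command quoted earlier in this thread:
#   print(run_until_hang(["./jobqueue.pl", "-f", "-d", "once"]))
```

Varying the pause between runs (for example by adding a sleep in the loop) would help tell a per-connection limit from a rate limit.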

I'll try the TEST mode -- and adding some more debug code as well.

Our database team will be providing me with that analysis -- hopefully we'll uncover a deadlock, or at least get more error data from the table(s) -- and we are also looking at the SMTP logs to see what's happening on that end. Thus far the focus has been on the app side ...

Thanks again for all the suggestions thus far - this is all good info to work from!

Brent

Thorsten Schöning

Dec 28, 2015, 5:15:23 AM
to support-...@lists.mozilla.org
Hello bkro...@gmail.com,
on Saturday, 26 December 2015 at 17:54 you wrote:

> My apologies on the e-mailer - we changed out sendmail for SMTP as
> part of our troubleshooting, to see whether it was a sendmail config
> problem on our end. The "good news" for us is that it's not that ...

And how was sendmail configured? Did it really use its own queue,
send mails directly to the recipients and such, or was it today's
default scenario of sendmail just forwarding mail to the Exchange you
mentioned in your last mail? In the latter case that might be one more
hint that the problem is with your Exchange/Windows Server and you
might run out of TCP connections or such.

> On the GMAIL comment - there was an item on the Planet.bugzilla.org page that says:
[...]
> Not sure whether that applies to us in 4.4.6 or not ...

According to the mentioned bug, it's only planned for Bugzilla 6.

https://bugzilla.mozilla.org/show_bug.cgi?id=1062739

> [...]and we are also looking at the SMTP logs to see
> what's happening on that end. Thus far the focus has been on the app side ...

And consider using Wireshark at least on your Exchange, and preferably try
to have a look at open outgoing ports/traffic on your Bugzilla machine
as well. I have the strong feeling that the problem is not with
Bugzilla directly, because you already quoted the GMAIL thing for BMO:
They are sending a lot of mails to a lot of recipients at different
providers. I guess a bug in such core functionality would have been
seen before.
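
One way to look at outgoing connections on the Bugzilla machine is to count TCP sockets per state for the SMTP port. The Python sketch below parses text in the layout printed by `ss -tan` on Linux. The sample output, the addresses and the helper name `count_states` are invented for illustration; on a real host the text would come from running the command.

```python
from collections import Counter

# Invented sample in the column layout of `ss -tan`; not real data.
SAMPLE_SS_OUTPUT = """\
State      Recv-Q Send-Q   Local Address:Port     Peer Address:Port
ESTAB      0      0        192.0.2.10:22          192.0.2.50:51234
ESTAB      0      0        192.0.2.10:40110       192.0.2.20:25
TIME-WAIT  0      0        192.0.2.10:40111       192.0.2.20:25
TIME-WAIT  0      0        192.0.2.10:40112       192.0.2.20:25
SYN-SENT   0      1        192.0.2.10:40113       192.0.2.20:25
"""


def count_states(ss_output, remote_port):
    """Count sockets per TCP state whose peer port equals remote_port."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header row and anything that is not a full socket row.
        if len(fields) < 5 or fields[0] == "State":
            continue
        state, peer = fields[0], fields[4]
        if peer.rsplit(":", 1)[-1] == str(remote_port):
            counts[state] += 1
    return counts


for state, n in sorted(count_states(SAMPLE_SS_OUTPUT, 25).items()):
    print(state, n)
```

A connection that stays in SYN-SENT while a job hangs would suggest the relay is not accepting the connection at all, whereas an ESTAB socket with no traffic would suggest the relay accepted it but never sent its greeting.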

הדס שרון-אלפרוביץ

Dec 28, 2015, 7:06:38 AM
to socce...@gmail.com, support-...@lists.mozilla.org
How do I delete my Bugzilla account?
Or - how do I change my email address?
I do not want to receive mail at this address anymore.
Thank you
Hadas

Thorsten Schöning

Dec 28, 2015, 9:07:00 AM
to support-...@lists.mozilla.org
Hello הדס שרון-אלפרוביץ,
on Monday, 28 December 2015 at 13:06 you wrote:

> I do not want to receive mail anymore to this email

Please don't reply to arbitrary threads with unrelated new questions.
To unsubscribe, you need to follow the link presented at the bottom of
each mail and unsubscribe yourself.

http://www.catb.org/esr/faqs/smart-questions.html
https://lists.mozilla.org/listinfo/support-bugzilla

bkro...@gmail.com

Jan 4, 2016, 11:06:38 PM
Eureka -- we have found it!

The underlying problem is: https://bugzilla.mozilla.org/show_bug.cgi?id=502625

We applied the use of Email::Sender and our problem is solved!

Thanks for the help on this, hopefully by having this thread here we'll save the next poor guy from finding this the hard way.

Brent