Q4M queries become non-responsive


James McGill

Nov 10, 2009, 2:46:36 PM
to q4m-g...@googlegroups.com
hi,

We're experiencing an intermittent problem with our MySQL+Q4M server. Four times in the last month, our Q4M queries have stopped responding; both selects and inserts appeared to block forever.

The MySQL server was still very responsive and not experiencing any significant load. There was nothing in any of the logs to indicate that anything was wrong, except that the queries were blocking forever. Restarting MySQL did not fully rectify the problem; at that point some inserts were proceeding, but most selects were not.

Recreating the queue table is required to get things working again. The data in the table file does not appear to be corrupt; we are able to restore the events by renaming the table and inserting the events into the new queue table.
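
Roughly, the recovery looks like this (the exact statements below are illustrative rather than copied from our runbook):

  RENAME TABLE data_to_route TO data_to_route_stuck;
  CREATE TABLE data_to_route LIKE data_to_route_stuck;
  INSERT INTO data_to_route (type, message) SELECT type, message FROM data_to_route_stuck;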

I've not yet been able to reproduce the problem.

Has anybody experienced something like this with Q4M before?

Here's our setup and usage details:

Hardware:
  Chassis: HP Proliant DL385 g1
  CPU: 2x AMD Opteron 270 2.0 GHz
  Memory: 16GB ECC
  Disk: Fibre attached via Qlogic QLA2342 HBA

Software:
  Red Hat Enterprise Linux AS 4.7
  Kernel revision: 2.6.9-78ELsmp x86_64
  MySQL version: 5.1.33
  Q4M version: 0.8.5

Schema:
  CREATE TABLE `data_to_route` (
    `type` int(10) unsigned DEFAULT NULL,
    `message` blob
  ) ENGINE=QUEUE DEFAULT CHARSET=latin1

Usage:
  About 7-10 million events per day are inserted into the queue like this:
  insert into data_to_route(type,message) values(101,"JSON encoded string");

  Events are pulled out of the queue using conditions on 'type' (the full pull cycle is sketched at the end of these usage notes):
  SELECT type,message FROM data_to_route WHERE queue_wait("data_to_route:type=10001 OR type=10001 OR type=10060 OR type=10061 OR type=10050 OR type=9999 OR type=10000 OR type=10002 OR type=10003 OR type=10004 OR type=10005 OR type=10006 OR type=10007 OR type=10008 OR type=10009 OR type=10010 OR type=10011 OR type=10012 OR type=10051")
  SELECT queue_end();

  Two servers are dedicated to pulling events out of the queue and processing the messages: about 40-50 concurrent processes in total, using variations of the above query.

  About 22 servers insert events into the queue, from Apache web servers and from long-running daemons written in PHP5.

  The MySQL Event Scheduler pushes 4-6 events per minute into the queue table.

  We do not currently use any message relays or prioritized subscriptions across tables.
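
  For reference, the long-hand form of the pull cycle mentioned above looks roughly like this (queue_wait/queue_end/queue_abort are Q4M's built-in functions; the condition, timeout, and error handling are illustrative, not our production code):

  SELECT queue_wait("data_to_route:type=9999", 60);  -- blocks up to 60s; returns 1 once a matching row is owned, 0 on timeout
  SELECT type,message FROM data_to_route;            -- in owner mode this returns only the row we now own
  -- process the message, then either:
  SELECT queue_end();                                -- success: the owned row is removed from the queue
  -- or, on failure:
  SELECT queue_abort();                              -- release the row back to the queue for another worker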
 
Any feedback or pointers would be very much appreciated!!

cheers,
james
 

Kazuho Oku

Nov 13, 2009, 3:32:13 AM
to q4m-g...@googlegroups.com
Below are the problems I know of, along with my advice. Also, if
you could dump and send backtraces of all threads of mysqld (using
gdb), I would be happy to look into the problem (to find out what is
causing the lockup).

Problems with known workarounds:

- Prior to 0.8.9, Q4M had a race condition bug that could cause table hangups

Prior to 0.8.9, Q4M had a race condition between SELECT ... FROM TABLE
queries and queue_wait() calls issued right after a client connects.
The suggested workaround for this bug is to issue a SELECT COUNT(*)
FROM TABLE query, right after your workers connect to the MySQL
server, against each of the Q4M tables you are going to use.
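
For example, something like this in each worker's connect routine
should be enough (the table name is yours, of course; the result of
the count can simply be discarded):

SELECT COUNT(*) FROM data_to_route;  -- once per new connection, before the first queue_wait()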

Also, it is a good idea to raise table_open_cache to a value larger
than max_connections. This is because queue_wait() is a blocking
call, so each waiting connection keeps its table handle in use for
the duration of the wait.
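
For example (the value below is only an illustration; pick something
comfortably above your max_connections, or set it in my.cnf and
restart):

SHOW VARIABLES LIKE 'table_open_cache';
SHOW VARIABLES LIKE 'max_connections';
SET GLOBAL table_open_cache = 2048;  -- illustrative; should exceed max_connections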

The following are the unresolved bugs, as far as I know:

- server may crash when using conditional subscription

One user reported very frequent server crashes when using conditional
subscription. Mysqld became stable when the user stopped using
conditional subscription. However, this does not seem to be your
case, since your problem is infrequent and is not a crash.

- server hangup when not using conditional subscription

I have heard from one of the heaviest users of Q4M that one of their
two Q4M installations stops occasionally, once a month or so. The
cause of the problem has not been figured out yet. IMO this might be
the same problem you are seeing.


2009/11/11 James McGill <jbmc...@gmail.com>:
--
Kazuho Oku

James McGill

Nov 13, 2009, 7:03:53 PM
to q4m-g...@googlegroups.com
Thanks for your feedback, Kazuho; I'm very appreciative.

This afternoon we observed this happen again. Unfortunately we were not able to get an 'info threads', but we did get a 'backtrace full' and an strace of the process. I've posted the dumps here:

http://www.pgpin.com/mysqld_17928_thread.dmp
http://www.pgpin.com/mysqld_17928_strace.dmp

The lack of thread info is unfortunate; I do not know whether this provides much value to you.

I've put in a request to increase table_open_cache (currently 256) so that it is greater than max_connections (currently 1500), and will upgrade to the latest Q4M in our dev environments next week. We're also planning to reduce some of the throughput next week.

I've recommended that the next time this happens we fail over to a secondary and preserve the state of the primary for further analysis. However, it might be a while before we experience this again, and there is no guarantee that the state will be preserved when we detach our workers from it.

If we can successfully take our primary offline in this state, is there any other information you would like?

I very much like the performance of Q4M and would like to grow our usage of it.

cheers,
james

Kazuho Oku

Nov 16, 2009, 5:24:18 AM
to q4m-g...@googlegroups.com
Hi,

Thank you for the traces. Unfortunately the backtrace included only
that of the main thread of mysqld, so I was unable to trace what
caused the problem (in the other threads).

I am sorry that it is a pain to get backtraces of all the other
threads by switching between them with gdb's "thread" command and
taking a "bt" for each of them (I have never tried to automate it),
but if you could provide such information, I expect it would be very
helpful.
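
(If the gdb on your box supports it, "thread apply all bt full"
should take backtraces of every thread in one step, which might save
some of the pain; for example, after attaching with gdb -p <pid of
mysqld>:)

(gdb) thread apply all bt full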

2009/11/14 James McGill <jbmc...@gmail.com>:
--
Kazuho Oku

James McGill

Dec 21, 2009, 2:45:25 PM
to q4m-g...@googlegroups.com
hi,

We have still been experiencing the intermittent queue freeze-up. We've upgraded to 0.8.9, which does not seem to have alleviated the problem.

We've noticed that during the periods when queue queries are hanging, they are blocked in a state of 'optimizing'. We're used to seeing queries block in the 'optimizing' state for up to 60 seconds when there are no messages in the queue that match the condition; as soon as a matching message appears in the queue, the query unblocks and returns properly - this is the normal and expected behavior. While we're experiencing our issue, these queries block for much longer than the expected 60 seconds; they seem to block forever, until we kill mysql and recreate the tables.
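
(Aside: if I'm reading the Q4M docs right, the 60 seconds we normally see is queue_wait's default timeout; it can also be passed explicitly as the last argument - the condition and value below are just examples:)

SELECT queue_wait("data_to_route:type=9999", 5);  -- should return 0 after about 5 seconds if nothing matching arrives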

Also of note: when we KILL a query while mysql is in this state, the query is put into the 'Killed' state but still hangs around until the server is restarted.

We've got a thread dump that seems to indicate that the threads are all blocking; I'm not familiar enough with the Q4M code yet to understand the significance of this blocking.

http://www.pgpin.com/queuedump-dec19.out

Thread 66 is interesting to me: it is a "select type,count(*) from data_to_route group by type" query that our admin ran in the middle of the incident. This thread is blocked in queue_share_t::lock_reader. Our admin reports that the query hung without returning any results.

Any pointers or comments would be very much appreciated.

cheers,
james


Kazuho Oku

Jan 6, 2010, 1:44:01 AM
to q4m-g...@googlegroups.com
Hi,

Sorry for my late response.

The bug has been identified; a fix has been committed and will be
included in the upcoming Q4M 0.9.0.
The details of the bug (and the fix) are as follows. In your case,
threads 5 and 78 were the ones causing the deadlock.

Thank you very much for your report and sorry to respond so late.

$ svn log -r 276
------------------------------------------------------------------------
r276 | kazuho | 2010-01-06 15:30:30 +0900 (Wed, 06 Jan 2010) | 13 lines

fix deadlock on listener_mutex on a1 -> b3 -> a2 (by not triggering
compaction from queue_wait)

queue_wait(...) does the following:
a1) _queue_wait_core locks listener_mutex
a2) queue_share_t::unlock_reader triggers compaction (if necessary),
and waits for completion
a3) unlock listener_mutex

writer thread does the following:
b1) commit to disk
b2) if requested, perform compaction
b3) lock(listener_mutex) -> notify waiting conns. -> unlock(listener_mutex)
b4) goto b1

------------------------------------------------------------------------


2009/12/22 James McGill <jbmc...@gmail.com>:

--
Kazuho Oku

James McGill

Jan 11, 2010, 5:16:51 PM
to q4m-g...@googlegroups.com
hi Kazuho,

I just wanted to thank you for the work that went into the 0.9 release of Q4M and let you know that we have upgraded our Q4M servers to this release. So far we have pushed about 25M messages through the 0.9 release without incident. I will let you and the list know if this issue resurfaces.

Our IT folks now have a process in place to capture thread dumps and system traces for any Q4M problems; should we find any, we will publish them for your reference.

Thanks again for looking at this and for keeping Q4M fast and stable!!

cheers,
james


2010/1/5 Kazuho Oku <kazu...@gmail.com>