Invalid_Frame_End_Marker


Tristan Hyams

Jul 30, 2018, 9:33:40 AM
to rabbitmq-users
Have 3 nodes in HA.
All 3 nodes are CentOS 6.9.
Server version 3.6.15.
Erlang version 19.3.6.8.

RabbitMQ DotNetClient 5.1
RawRabbit RC5 (latest) client.


Of the 200+ queues in use, only this one Connection:QueueName is affected.

Average payload size is around 8.4 KB (~8,400 bytes).

The nodes sit behind a load balancer (F5) and all 3 nodes are VMs on separate physical hosts, but same network partition (I think). They are essentially physically next door.

Log Subset From Master Node:
=INFO REPORT==== 22-Jul-2018::11:59:30 ===
accepting AMQP connection <0.11905.5336> (LOADBALANCER_IP:53528 -> MASTERNODE_IP:5672)
=INFO REPORT==== 22-Jul-2018::11:59:30 ===
Connection <0.11905.5336> (LOADBALANCER_IP:53528 -> MASTERNODE_IP:5672) has a client-provided name: REDACTED:THEQUEUENAME
=INFO REPORT==== 22-Jul-2018::11:59:30 ===
connection <0.11905.5336> (LOADBALANCER_IP:53528 -> MASTERNODE_IP:5672 - REDACTED:THEQUEUENAME): user 'REDACTED' authenticated and granted access to vhost 'REDACTED'
=ERROR REPORT==== 22-Jul-2018::11:59:56 ===
Error on AMQP connection <0.11905.5336> (LOADBALANCER_IP:53528 -> MASTERNODE_IP:5672 - REDACTED:THEQUEUENAME, vhost: 'REDACTED', user: 'REDACTED', state: running), channel 4:
operation none caused a connection exception frame_error: "type 3, first 16 octets = <<\"{\\\"EventID\\\":\\\"4c83\">>: {invalid_frame_end_marker,\n                                                      34}"
=ERROR REPORT==== 22-Jul-2018::11:59:59 ===
closing AMQP connection <0.11905.5336> (LOADBALANCER_IP:53528 -> MASTERNODE_IP:5672 - REDACTED:THEQUEUENAME):
fatal_frame_error
=INFO REPORT==== 22-Jul-2018::12:00:26 ===
accepting AMQP connection <0.20997.542> (LOADBALANCER_IP:54107 -> MASTERNODE_IP:5672)
=INFO REPORT==== 22-Jul-2018::12:00:26 ===
Connection <0.20997.542> (LOADBALANCER_IP:54107 -> MASTERNODE_IP:5672) has a client-provided name: REDACTED:THEQUEUENAME
=INFO REPORT==== 22-Jul-2018::12:00:26 ===
connection <0.20997.542> (LOADBALANCER_IP:54107 -> MASTERNODE_IP:5672 - REDACTED:THEQUEUENAME): user 'REDACTED' authenticated and granted access to vhost 'REDACTED'
=ERROR REPORT==== 22-Jul-2018::12:00:41 ===
Error on AMQP connection <0.20997.542> (LOADBALANCER_IP:54107 -> MASTERNODE_IP:5672 - REDACTED:THEQUEUENAME, vhost: 'REDACTED', user: 'REDACTED', state: running), channel 40:
operation none caused a connection exception frame_error: "type 3, first 16 octets = <<\"{\\\"EventID\\\":\\\"7f28\">>: {invalid_frame_end_marker,\n                                                      100}"
=ERROR REPORT==== 22-Jul-2018::12:00:44 ===
closing AMQP connection <0.20997.542> (LOADBALANCER_IP:54107 -> MASTERNODE_IP:5672 - REDACTED:THEQUEUENAME):
fatal_frame_error

Looking for probable causes, could this be due to improper use, e.g. cross-thread use of the same channel?

Just wondering why it would be this one specific queue and only one queue. 

Thanks for your time!

Tristan Hyams

Jul 30, 2018, 11:39:58 AM
to rabbitmq-users
Additional Info

Calling Stack:
at RabbitMQ.Client.Impl.SimpleBlockingRpcContinuation.GetReply(TimeSpan timeout)
at RabbitMQ.Client.Impl.ModelBase.ModelRpc(MethodBase method, ContentHeaderBase header, Byte[] body)
at RabbitMQ.Client.Framing.Impl.Model._Private_ChannelOpen(String outOfBand)
at RabbitMQ.Client.Framing.Impl.AutorecoveringConnection.CreateNonRecoveringModel()
at RabbitMQ.Client.Framing.Impl.AutorecoveringConnection.CreateModel()
Server Side (Publisher) Error Logged As:
RabbitMQ.Client.Exceptions.OperationInterruptedException: The AMQP operation was interrupted: AMQP close-reason, initiated by Peer, code=501, text="FRAME_ERROR - type 3, first 16 octets = <<"{\"EventID\":\"cc98">>: {invalid_frame_end_marker, 97}"

Michael Klishin

Jul 30, 2018, 4:43:12 PM
to rabbitm...@googlegroups.com
Every frame in AMQP 0-9-1 must contain a special byte called the frame end marker.
If it doesn't, both server and clients must consider it to be an unrecoverable error.
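The framing rule can be sketched in a few lines of C#. This is an illustrative check, not RabbitMQ source code: an AMQP 0-9-1 frame is type (1 octet), channel (2 octets), payload size (4 octets, big-endian), the payload itself, then the frame-end octet 0xCE (decimal 206). The number in the error above (`{invalid_frame_end_marker, 34}`) is the byte the broker found where 0xCE should have been.

```csharp
using System;

public static class AmqpFrame
{
    public const byte FrameEnd = 0xCE; // decimal 206

    // Returns null when the frame ends correctly, otherwise the offending
    // byte (the number RabbitMQ reports in invalid_frame_end_marker).
    public static byte? CheckEndMarker(byte[] frame)
    {
        if (frame == null || frame.Length < 8)
            throw new ArgumentException("too short to be an AMQP 0-9-1 frame");

        // header: type (1 octet) | channel (2 octets) | payload size (4 octets, big-endian)
        uint size = (uint)((frame[3] << 24) | (frame[4] << 16) | (frame[5] << 8) | frame[6]);
        if (frame.Length < 8 + size)
            throw new ArgumentException("frame truncated");

        byte last = frame[7 + size];
        return last == FrameEnd ? (byte?)null : last;
    }
}
```

When two threads interleave writes on one socket, the broker reads some other message's bytes where the end marker belongs, which is exactly the error in the logs above.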

Our team does not support RawRabbit in any capacity, but having developed and maintained RabbitMQ clients
for some 7 years now I have a decent guess: your code or RawRabbit shares a channel (IModel) or connection between threads
and data interleaving on the wire ends up being incorrect. Usually the manifestation of such issues is different
and specifically 5.1.0 introduces a number of internal changes to make this less likely but the recommendation stands:

 * Never share channels between threads
 * Avoid sharing connections if you can (in particular when you run into obscure framing issues)

A traffic capture with Wireshark [1] or tcpdump is the only way to verify that hypothesis.


--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Tristan Hyams

Aug 6, 2018, 9:59:53 AM
to rabbitmq-users
Other than a couple of TCP retransmissions and "ACKed segment that wasn't captured" notices, this is the only thing that stands out:


Tristan Hyams

Aug 6, 2018, 10:22:40 AM
to rabbitmq-users

Tristan Hyams

Aug 6, 2018, 10:46:28 AM
to rabbitmq-users
I have tracked down the usage of this channel where many messages were published in parallel.

The connection is shared, in a batch of up to 100 messages, but each message gets a unique channel that is transient.

Luke Bakken

Aug 6, 2018, 11:11:24 AM
to rabbitmq-users
Hi Tristan,

Thanks for the information. Pretty much the only way to address this is to have code that can reproduce the issue. If code can't be provided, a detailed description of the publish process that does reproduce the issue would let us write a test application. At this point the most likely explanation is a client-side threading issue that affects message framing.

You mention that this issue affects only one queue. Is the code that publishes to the problematic queue exactly the same code as what publishes to other queues that don't exhibit this issue?

Thanks,
Luke

Tristan Hyams

Aug 6, 2018, 12:24:51 PM
to rabbitmq-users
It's extremely complicated. I will see if I can create a sample project. The problem is that the environment, the queue, and the data load are all unique to Production. Even the queue that has the issue fluctuates heavily in usage and message body size.

In addition to that, it isn't a pure/raw implementation of the RabbitMQ .NET client. At work we use RawRabbit, a wrapper around the .NET client. I have traced through the entire call stack, and for all the complexity the pseudo code is this:

Function on Timer:

foreach(message in messagesQueuedForSending)
{
     actionTaskList.Add(SendMessageAsync(message))
}

await Task.WhenAll(actionTaskList)

The Chain of Events
SendMessageAsync => BusClient.InvokeAsync => Execute
GetChannelAsync => GetConnectionInBus => CreateModel

Begin Execute
var channel = await GetChannelAsync()

using (channel)
{
     channel.BasicPublish(message);
}
End Execute

That all fires up to 100 times per batch.

Michael Klishin

Aug 6, 2018, 12:32:05 PM
to rabbitm...@googlegroups.com
So there is a very good chance of concurrent publishing then. Can synchronisation for those be introduced one way or another,
at the very least to compare?
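One way to sketch that comparison, assuming the publishes can all be funnelled through a single gate: wrap the publish call in a SemaphoreSlim so only one SendMessageAsync is on the wire at a time. `publishAsync` below is a stand-in delegate for the real channel.BasicPublish call — it is not a RawRabbit or RabbitMQ.Client API.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class SerializedPublisher
{
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1);
    private readonly Func<byte[], Task> _publishAsync;

    public SerializedPublisher(Func<byte[], Task> publishAsync)
    {
        _publishAsync = publishAsync;
    }

    // Only one publish runs at a time, so frames from different
    // messages can no longer interleave on the wire.
    public async Task PublishAsync(byte[] body)
    {
        await _gate.WaitAsync();
        try
        {
            await _publishAsync(body);
        }
        finally
        {
            _gate.Release();
        }
    }
}
```

The batch loop stays the same — each SendMessageAsync just routes through PublishAsync. Throughput will drop, but if the frame errors disappear, that confirms the concurrency hypothesis.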


Tristan Hyams

Apr 5, 2019, 7:03:00 PM
to rabbitmq-users
I managed to design a ChannelPool backed by a ConnectionPool that resolved this issue.

All my work is located here.
https://github.com/houseofcat/CookedRabbit
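For readers who just want the shape of the idea, here is a generic sketch — not CookedRabbit's actual API: keep a fixed set of pre-opened channels in a thread-safe queue, rent one per publish, and return it afterwards, so no two threads ever write on the same channel concurrently.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

public sealed class ChannelPool<TChannel>
{
    private readonly ConcurrentQueue<TChannel> _channels = new ConcurrentQueue<TChannel>();

    // In the real client TChannel would be IModel, with the channels
    // created up front from one or more pooled connections.
    public ChannelPool(IEnumerable<TChannel> channels)
    {
        foreach (var channel in channels)
            _channels.Enqueue(channel);
    }

    // False means every channel is currently rented; the caller
    // waits briefly and retries instead of sharing a channel.
    public bool TryRent(out TChannel channel) => _channels.TryDequeue(out channel);

    public void Return(TChannel channel) => _channels.Enqueue(channel);
}
```

Because a rented channel is out of the queue until it is returned, exclusive use per publish is guaranteed without any locking around the publish itself.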



Luke Bakken

Apr 8, 2019, 6:20:38 PM
to rabbitmq-users
Thank you! That looks like a very interesting and useful project.