Freeswitch kazoo module timeout waiting for initial dialplan response (park)

624 views
Skip to first unread message

Dave Singer

unread,
Sep 23, 2014, 1:45:08 PM9/23/14
to 2600h...@googlegroups.com, Jed Stafford
I'm having a major problem on FreeSwitch. Out of the last 167 call attempts, 102 timed out waiting on the requested dialplan from the app server. There are also 101 "Recieved fetch reply for an unknown/expired UUID:".
I've looked at the timing of the packets (packet capture) on both the app server and FS. Processing time on app server is consistent, less than 1/2 sec. The packet shows up in pcap expected time (consistent with ping time) after sent from app server.There is a big delay between the timestamp in the pcap and the FS log indicated by the "Recieved fetch reply for an unknown/expired UUID: <request id>". Usually 5 to 7 sec after request was being sent.
Our FS server in our first zone in Seattle is not having any issues. Some differences: Sea has 1 app server, 1 rabbit server, FS sys is on bare metal with 16 cores and 12GB RAM. NY zone has 2 app servers, 2 rabbit servers (one currently shut down but hasn't helped), 2 FS servers each in cloud hosted virtual machine with  2 cores 2GB RAM and 1GB swap on SSD drive.There are not more than 3 calls at a time, CPU is 99+% idle and load average is always less than 1. We tried upping to 8 cores and 8GB Ram but that didn't help.
Is there something that kazoo_fetch_agent is waiting for before processing the response as soon as it is available?

James Aimonetti

unread,
Sep 23, 2014, 2:11:44 PM9/23/14
to 2600h...@googlegroups.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Can you supply the Kazoo versions you're running.

Can you also add links to a pastie with the logs you're looking at for
a given request? Full debug logs please.

This is definitely atypical behaviour!

Thanks!

- --
James Aimonetti
Lead Systems Architect / Impressionable Scallywag
"I thought I fixed that"

2600Hz | http://2600hz.com
sip:ja...@2600hz.com
tel:415.886.7905
irc:mc_ @ freenode
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQEcBAEBAgAGBQJUIbfbAAoJENTKa+JPXCVgGeoH/1vaL7DS3QaD+3MGRVT1UNFr
9vPDJKOMVZGkvqpEW1HDLVrA+hvROW9ZIUV64km65vN6z6Vd2byCTAmOlTf/wjbb
Kz+t6CSytZBWSqHWf5C4OoHIzCenbDVUg77W8E7K3HTA/3dMFd+4yIq6bpjvBkgr
dnvhETZ5S8V4E69T5WZyi/WrpGq1HD/DP4Z6qnPDA4CfuiNIIAy3Yl+e3kGlNYhV
w4Nn0NSqSIZ0RU9I0D9KiRCaY6cvWQwSuDfQ3Nn2sWP7tM7ExGFppy/hnfXj2FnI
NBFJBg0vtxChWksZ2bWQI+y2AU1cuNy1beLzs0BoW4hJe/7p+oQ+IgkgYkJ96ic=
=4zXy
-----END PGP SIGNATURE-----

Dave Singer

unread,
Sep 25, 2014, 12:55:14 PM9/25/14
to 2600h...@googlegroups.com, ja...@2600hz.com
In preparing the logs and looking at the pcap that I was going to include I discovered that my evaluation of the timing of the response packet was wrong. It was not quick. It was 5+ sec after the initial request.
So I did more digging in the app log.
Logs are in pastebin ny2fs2 ny2apps2 ny2apps1. There is a bunch of stuff that is probably not pertinent to the call but I left them in just in case. Note that the only servers in the same zone have names that start with ny2. So amqp connection failures to ams2... and sg1... I think should be irrelevant.

Running kazoo git version 3.08 with some minor customizations. Mainly adding extra logging to help debug call processing.

The sample call's callid is: 213055064@customerlocalIP
Here are some highlights of what I found looking mainly at logs for ny2apps2 to get you started.

It finds a routing solution in the no_match call flow and logs this line:
Sep 23 08:11:01 ny2apps2 2600hz[24027]: |213055064@customerlocalIP|cf_route_req:133 (<0.5439.222>) callflows knows how to route the call! sent park response
I guessed it was sending the message at that time to FS containing the park xml. I was wrong. :( And it made me look in the wrong place for a long time.

Three sec after FS sent the dialplan request, it times out waiting for the dialplan response. It apparently sends a message to the app server which it logs starting with:
Sep 23 08:11:03 ny2apps2 2600hz[24084]: |213055064@customerlocalIP|ecallmgr_call_events:514 (<0.30817.176>) publishing call event channel_destroy
.....
Sep 23 08:11:00 ny2apps2 2600hz[24084]: |213055064@customerlocalIP|wh_amqp_channel:143 (<0.491.0>) published to callmgr(amqp://amusr:am...@ny2apps1.mydomain.us:5672) exchange (routing key authz.authorize) via <0.13360.3>
Sep 23 08:11:00 ny2apps2 2600hz[24084]: |213055064@customerlocalIP|wh_amqp_worker:345 (<0.491.0>) published request with msg id 011ec30e2228d97c for <0.30749.176>

A little further down in the same second, wh_amqp_worker has a "badarg error when publishing:" and does a stacktrace.
I'm thinking it is trying to create a CDR for the channel destroy received from FS but it was expecting a channel create and is missing appropriate data. Might be good fix code to handle this case better for records.

5 sec after sending the response (that I originally thought was going to FS) it logs a timeout and proceeds to build and send the actual XML dialplan. Better late than never? :)
Sep 23 08:11:05 ny2apps2 2600hz[24084]: |213055064@customerlocalIP|wh_amqp_worker:545 (<0.491.0>) request timeout exceeded for msg id: 011ec30e2228d97c
Sep 23 08:11:05 ny2apps2 2600hz[24084]: |213055064@customerlocalIP|ecallmgr_fs_authz:138 (<0.30749.176>) authz request lookup failed: timeout

Things are still not all lining up. The code I followed in exallmge_fs_authz.erl seemed to indicate a 5 sec timeout that would explain the timing of the timeout 5 sec after initial dialplan req but the message id "011ec30e2228d97c" seems to be linked to the FS update from it's timeout.
code snippet:
    ReqResp = wh_amqp_worker:call(authz_req(Props)
                                  ,fun wapi_authz:publish_authz_req/1
                                  ,fun wapi_authz:authz_resp_v/1
                                  ,5000),
Some questions in my mind...
   Not sure why FS times out after 3 sec when ecallmgr processing has timeouts that are 5 sec. (assuming 5 sec timeout is correct)
   Is it trying to ny2apps1 to do the authz before continuing processing and sending the dialplan. If so why is the request failing.
   What was ecallmgr_fs_authz waiting for.
   Do I have rabbit configured incorrectly.

from config.ini:
[zone]
name = zone_ny

[whistle_apps]
zone = zone_ny
cookie = bogus_cluster_cookie

[whistle_apps]
zone = zone_ny
cookie = bogus_cluster_cookie

[ecallmgr]
zone = zone_ny
cookie = bogus_cluster_cookie

[ecallmgr]
zone = zone_ny
cookie = bogus_cluster_cookie

Dave Singer

unread,
Sep 26, 2014, 11:08:36 AM9/26/14
to 2600h...@googlegroups.com, ja...@2600hz.com
Is there anything else I can provide to help with diagnosis? Or a suggestion of where to look further?

Thanks
Dave


On Thursday, September 25, 2014 9:55:14 AM UTC-7, Dave Singer wrote:
In preparing the logs and looking at the pcap that I was going to include I discovered that my evaluation of the timing of the response packet was wrong. It was not quick. It was 5+ sec after the initial request.
So I did more digging in the app log.
Logs are in pastebin ny2fs2 ny2apps2 ny2apps1. There is a bunch of stuff that is probably not pertinent to the call but I left them in just in case. Note that the only servers in the same zone have names that start with ny2. So amqp connection failures to ams2... and sg1... I think should be irrelevant.

Running kazoo git version 3.08 with some minor customizations. Mainly adding extra logging to help debug call processing.

The sample call's callid is: 213055064@customerlocalIP
Here are some highlights of what I found looking mainly at logs for ny2apps2 to get you started.

It finds a routing solution in the no_match call flow and logs this line:
Sep 23 08:11:01 ny2apps2 2600hz[24027]: |213055064@customerlocalIP|cf_route_req:133 (<0.5439.222>) callflows knows how to route the call! sent park response
I guessed it was sending the message at that time to FS containing the park xml. I was wrong. :( And it made me look in the wrong place for a long time.

Three sec after FS sent the dialplan request, it times out waiting for the dialplan response. It apparently sends a message to the app server which it logs starting with:
Sep 23 08:11:03 ny2apps2 2600hz[24084]: |213055064@customerlocalIP|ecallmgr_call_events:514 (<0.30817.176>) publishing call event channel_destroy
.....
Sep 23 08:11:00 ny2apps2 2600hz[24084]: |213055064@customerlocalIP|wh_amqp_channel:143 (<0.491.0>) published to callmgr(amqp://amusr:ampw@ny2apps1.mydomain.us:5672) exchange (routing key authz.authorize) via <0.13360.3>
Sep 23 08:11:00 ny2apps2 2600hz[24084]: |213055064@customerlocalIP|wh_amqp_worker:345 (<0.491.0>) published request with msg id 011ec30e2228d97c for <0.30749.176>

A little further down in the same second, wh_amqp_worker has a "badarg error when publishing:" and does a stacktrace.
I'm thinking it is trying to create a CDR for the channel destroy received from FS but it was expecting a channel create and is missing appropriate data. Might be good fix code to handle this case better for records.

5 sec after sending the response (that I originally thought was going to FS) it logs a timeout and proceeds to build and send the actual XML dialplan. Better late than never? :)
Sep 23 08:11:05 ny2apps2 2600hz[24084]: |213055064@customerlocalIP|wh_amqp_worker:545 (<0.491.0>) request timeout exceeded for msg id: 011ec30e2228d97c
Sep 23 08:11:05 ny2apps2 2600hz[24084]: |213055064@customerlocalIP|ecallmgr_fs_authz:138 (<0.30749.176>) authz request lookup failed: timeout

Things are still not all lining up. The code I followed in exallmge_fs_authz.erl seemed to indicate a 5 sec timeout that would explain the timing of the timeout 5 sec after initial dialplan req but the message id "011ec30e2228d97c" seems to be linked to the FS update from it's timeout.
code snippet:
    ReqResp = wh_amqp_worker:call(authz_req(Props)
                                  ,fun wapi_authz:publish_authz_req/1
                                  ,fun wapi_authz:authz_resp_v/1
                                  ,5000),
Some questions in my mind...
   Not sure why FS times out after 3 sec when ecallmgr processing has timeouts that are 5 sec. (assuming 5 sec timeout is correct)
   Is it trying to ny2apps1 to do the authz before continuing processing and sending the dialplan. If so why is the request failing.
   What was ecallmgr_fs_authz waiting for.
   Do I have rabbit configured incorrectly.

from config.ini:
[zone]
name = zone_ny

Te Matau

unread,
Sep 26, 2014, 8:16:27 PM9/26/14
to 2600h...@googlegroups.com, ja...@2600hz.com
Based on my understanding FreeSwitch is trying to get a response from ecallmgr. That in turn will try to retrieve a document from the database unless it's cached (hazy on the caching). So what does the database say? Can you see a timely request to the database from ECM? Is it possible that you have a few DB nodes and some of the nodes are responding properly and others not?

Also I'm intrigued by your config.ini. How does this work in practice having 2 Rabbits for a zone? If a device registration is held in ny2apps1 then a call comes in where the request is sent to ny2apps2 it won't know about the registration and so the call won't go to the device. Am I misunderstanding something?

Dave Singer

unread,
Sep 29, 2014, 1:10:35 PM9/29/14
to 2600h...@googlegroups.com, ja...@2600hz.com
Thanks for the reply.
From previous post, logs are in pastebins ny2fs2ny2apps2, and ny2apps1.
From what I see, it sends the park response in the second after request was received. To where, I'm not don't know. But it seems to wait for something. 
Someone (James?) please shed some light.

The rabbit question... kazoo supports failover rabbit server. There is no way to specify which is primary or secondary so it is important to have the config.ini identical on each server using the rabbit server so erlang will choose the same primary rabbit every time. The Kamailio  kazoo module can only specify 1 rabbit. You need to figure out which rabbit the kazoo apps use as primary and setup Kamailio to connect to it.
That is a good question though. On a server in the zone of interest, do "sup whistle_amqp_maintenance broker_summary". It will show a lot of "Float Channels" Assigned to 1 server. That is the primary. It should be the same on all servers in the same zone.
Kamailio does have a db_cluster module which does support failover (and other) multi db connection middle layer. Since the kazoo module uses a "db connection" to talk to rabbit, it might be able to use the db_cluster module to specify primary and failover db connections. I haven't tried that yet. Love to hear if any one has. Plan to, just haven't had time.

Dave Singer

unread,
Sep 30, 2014, 7:50:15 PM9/30/14
to 2600h...@googlegroups.com, ja...@2600hz.com
Still hoping someone can take a look at this and help. I'm digging through code but it is very slow figuring out where each thing jumps, what it's affect might be and if that is what is causing my issue.
Please someone help.
Thanks
Dave

On Thursday, September 25, 2014 9:55:14 AM UTC-7, Dave Singer wrote:
In preparing the logs and looking at the pcap that I was going to include I discovered that my evaluation of the timing of the response packet was wrong. It was not quick. It was 5+ sec after the initial request.
So I did more digging in the app log.
Logs are in pastebin ny2fs2 ny2apps2 ny2apps1. There is a bunch of stuff that is probably not pertinent to the call but I left them in just in case. Note that the only servers in the same zone have names that start with ny2. So amqp connection failures to ams2... and sg1... I think should be irrelevant.

Running kazoo git version 3.08 with some minor customizations. Mainly adding extra logging to help debug call processing.

The sample call's callid is: 213055064@customerlocalIP
Here are some highlights of what I found looking mainly at logs for ny2apps2 to get you started.

It finds a routing solution in the no_match call flow and logs this line:
Sep 23 08:11:01 ny2apps2 2600hz[24027]: |213055064@customerlocalIP|cf_route_req:133 (<0.5439.222>) callflows knows how to route the call! sent park response
I guessed it was sending the message at that time to FS containing the park xml. I was wrong. :( And it made me look in the wrong place for a long time.

Three sec after FS sent the dialplan request, it times out waiting for the dialplan response. It apparently sends a message to the app server which it logs starting with:
Sep 23 08:11:03 ny2apps2 2600hz[24084]: |213055064@customerlocalIP|ecallmgr_call_events:514 (<0.30817.176>) publishing call event channel_destroy
.....
Sep 23 08:11:00 ny2apps2 2600hz[24084]: |213055064@customerlocalIP|wh_amqp_channel:143 (<0.491.0>) published to callmgr(amqp://amusr:ampw@ny2apps1.mydomain.us:5672) exchange (routing key authz.authorize) via <0.13360.3>
Sep 23 08:11:00 ny2apps2 2600hz[24084]: |213055064@customerlocalIP|wh_amqp_worker:345 (<0.491.0>) published request with msg id 011ec30e2228d97c for <0.30749.176>

A little further down in the same second, wh_amqp_worker has a "badarg error when publishing:" and does a stacktrace.
I'm thinking it is trying to create a CDR for the channel destroy received from FS but it was expecting a channel create and is missing appropriate data. Might be good fix code to handle this case better for records.

5 sec after sending the response (that I originally thought was going to FS) it logs a timeout and proceeds to build and send the actual XML dialplan. Better late than never? :)
Sep 23 08:11:05 ny2apps2 2600hz[24084]: |213055064@customerlocalIP|wh_amqp_worker:545 (<0.491.0>) request timeout exceeded for msg id: 011ec30e2228d97c
Sep 23 08:11:05 ny2apps2 2600hz[24084]: |213055064@customerlocalIP|ecallmgr_fs_authz:138 (<0.30749.176>) authz request lookup failed: timeout

Things are still not all lining up. The code I followed in exallmge_fs_authz.erl seemed to indicate a 5 sec timeout that would explain the timing of the timeout 5 sec after initial dialplan req but the message id "011ec30e2228d97c" seems to be linked to the FS update from it's timeout.
code snippet:
    ReqResp = wh_amqp_worker:call(authz_req(Props)
                                  ,fun wapi_authz:publish_authz_req/1
                                  ,fun wapi_authz:authz_resp_v/1
                                  ,5000),
Some questions in my mind...
   Not sure why FS times out after 3 sec when ecallmgr processing has timeouts that are 5 sec. (assuming 5 sec timeout is correct)
   Is it trying to ny2apps1 to do the authz before continuing processing and sending the dialplan. If so why is the request failing.
   What was ecallmgr_fs_authz waiting for.
   Do I have rabbit configured incorrectly.

from config.ini:
[zone]
name = zone_ny

James Aimonetti

unread,
Sep 30, 2014, 11:26:50 PM9/30/14
to Dave Singer, 2600h...@googlegroups.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Dave,

We're gearing up for KazooCon over here at 2600Hz which is why we
haven't been able to take a closer look. I've got some time in the
morning tomorrow; will take a look then and see if something jumps out
as wrong.

Thanks again for getting the logs together.

More to come!

James

Le 2014-09-30 16:50, Dave Singer a écrit :
> Still hoping someone can take a look at this and help. I'm digging
> through code but it is very slow figuring out where each thing
> jumps, what it's affect might be and if that is what is causing my
> issue. Please someone help. Thanks Dave
>
> On Thursday, September 25, 2014 9:55:14 AM UTC-7, Dave Singer
> wrote:
>>
>> In preparing the logs and looking at the pcap that I was going to
>> include I discovered that my evaluation of the timing of the
>> response packet was wrong. It was not quick. It was 5+ sec after
>> the initial request. So I did more digging in the app log. Logs
>> are in pastebin ny2fs2 <http://pastebin.com/MU1qnaqL> ny2apps2
>> <http://pastebin.com/QEep5109> ny2apps1
>> <http://pastebin.com/qK3sR5TR>. There is a bunch of stuff that is
>> probably not pertinent to the call but I left them in just in
>> case. Note that the only servers in the same zone have names that
>> start with ny2. So amqp connection failures to ams2... and sg1...
>> I think should be irrelevant.
>>
>> Running kazoo git version 3.08 with some minor customizations.
>> Mainly adding extra logging to help debug call processing.
>>
>> The sample call's callid is: 213055064@customerlocalIP Here are
>> some highlights of what I found looking mainly at logs for
>> ny2apps2 to get you started.
>>
>> It finds a routing solution in the no_match call flow and logs
>> this line: Sep 23 08:11:01 ny2apps2 2600hz[24027]:
>> |213055064@customerlocalIP|cf_route_req:133 (<0.5439.222>)
>> callflows knows how to route the call! sent park response I
>> guessed it was sending the message at that time to FS containing
>> the park xml. I was wrong. :( And it made me look in the wrong
>> place for a long time.
>>
>> Three sec after FS sent the dialplan request, it times out
>> waiting for the dialplan response. It apparently sends a message
>> to the app server which it logs starting with: Sep 23 08:11:03
>> ny2apps2 2600hz[24084]:
>> |213055064@customerlocalIP|ecallmgr_call_events:514
>> (<0.30817.176>) publishing call event channel_destroy ..... Sep
>> 23 08:11:00 ny2apps2 2600hz[24084]:
>> |213055064@customerlocalIP|wh_amqp_channel:143 (<0.491.0>)
>> published to callmgr(amqp://amusr:am...@ny2apps1.mydomain.us:5672)
>> "amqp://amusr:am...@ny2apps1.mydomain.us:5672" amqp_uri =
>> "amqp://amusr:am...@ny2apps2.mydomain.us:5672"
sip:ja...@2600hz.com
tel:415.886.7905
irc:mc_ @ freenode
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQEcBAEBAgAGBQJUK3R3AAoJENTKa+JPXCVgFusIAKS6/vZ+3VBT+XH/s9BDmB3m
iN8iKlw/X7bO2yOQa546KaXp4neHI4lkqcOPXBmsg632MKPsguB/1+RnUOawWWLc
BCVM4XyO0uEgdL7SU0gPnjMSeTaiYR3VTqYfyF07poAKL5Z0sw/EL4wFRMi5WybZ
OPb4ASoFWuJvq/4x/dl9u6uKPFUJca22oYyVyvX0kqdE5uUlqUVllToxMcSOFf/z
CLR26G5FbIu4BROVNFxevH1GpprmQmqnNmJ6KrUTQxARBghzD8nI8wxlw14u6cgl
mFabrx6Xh/r3AZboHYQ5XG/yJYmE4Vn9AYqi+i58493HslgIDhqwVw8QZidWDKc=
=/hHx
-----END PGP SIGNATURE-----

Te Matau

unread,
Sep 30, 2014, 11:33:18 PM9/30/14
to 2600h...@googlegroups.com, ja...@2600hz.com
Here's a successful call on one of our servers.  So it comes back quickly with "received affirmative route response for request"

I'm still intrigued to know what bigcouch is saying. The logs you pasted don't have anything from that.

Oct  1 02:50:34 server_10 2600hz[3156]: |7537...@192.168.20.16|ecallmgr_call_events:573 (<0.17089.3>) publishing call event channel_create
Oct  1 02:50:34 server_10 2600hz[3156]: |freeswitch@server_20.domain.io|ecallmgr_fs_route:131 (<0.849.0>) processing dialplan fetch request b7494980-4915-11e4-94e4-031562863200 (call 7537...@192.168.20.16) from freeswitch@server_20.domain.io
Oct  1 02:50:34 server_10 2600hz[3156]: |7537...@192.168.20.16|ecallmgr_fs_authz:91 (<0.17091.3>) config ecallmgr.authz is disabled
Oct  1 02:50:34 server_10 2600hz[3156]: |7537...@192.168.20.16|ecallmgr_fs_authz:190 (<0.17091.3>) channel authorization succeeded, allowing call
Oct  1 02:50:34 server_10 2600hz[3156]: |7537...@192.168.20.16|ecallmgr_fs_authz:29 (<0.17091.3>) channel is authorized
Oct  1 02:50:34 server_10 2600hz[3156]: |7537...@192.168.20.16|wh_amqp_channel:143 (<0.249.0>) published to callmgr(amqp://guest:gu...@127.0.0.1:5672) exchange (routing key route.req.realm.io.0015655219932) via <0.403.0>
Oct  1 02:50:34 server_10 2600hz[3156]: |7537...@192.168.20.16|wh_amqp_worker:345 (<0.249.0>) published request with msg id b7494980-4915-11e4-94e4-031562863200 for <0.17090.3>
Oct  1 02:50:34 server_10 2600hz[3156]: |7537...@192.168.20.16|wh_amqp_channel:143 (<0.17089.3>) published to callevt(amqp://guest:gu...@127.0.0.1:5672) exchange (routing key call.CHANNEL_CREATE.753703178%40192%2E168%2E20%2E16) via <0.812.0>
Oct  1 02:50:34 server_10 2600hz[3159]: |7537...@192.168.20.16|omnip_subscriptions:145 (<0.3877.5>) received channel create, checking for subscribers
Oct  1 02:50:34 server_10 2600hz[3159]: |7537...@192.168.20.16|omnip_subscriptions:500 (<0.3877.5>) no subscriptions for 001565...@realm.io (state early)
Oct  1 02:50:34 server_10 2600hz[3159]: |7537...@192.168.20.16|acdc_agent_handler:251 (<0.3880.5>) new channel in acct be24f63fb8d4248d2a39a5xxxxxxxxx: from 0015655219932 to 2002(2002)
Oct  1 02:50:34 server_10 2600hz[3159]: |7537...@192.168.20.16|cf_route_req:23 (<0.3882.5>) received a request asking if callflows can route this call
Oct  1 02:50:34 server_10 2600hz[3159]: |7537...@192.168.20.16|cf_util:433 (<0.3882.5>) searching for callflow in account%2Fbe%2F24%2Ff63fb8d4248d2a39a5c3c0e3f8b9 to satisfy '2002'
Oct  1 02:50:34 server_10 2600hz[3159]: |7537...@192.168.20.16|ts_route_req:24 (<0.3892.5>) received a request asking if trunkstore can route this call
Oct  1 02:50:34 server_10 2600hz[3159]: |7537...@192.168.20.16|ts_route_req:35 (<0.3892.5>) call began on the network
Oct  1 02:50:34 server_10 2600hz[3159]: |7537...@192.168.20.16|ts_callflow:48 (<0.3893.5>) request is not for a trunkstore account
Oct  1 02:50:34 server_10 2600hz[3159]: |7537...@192.168.20.16|cf_route_req:40 (<0.3882.5>) callflow 7d0aa9b373362d539024a158a14006ee in be24f63fb8d4248d2a39a5xxxxxxxxx satisfies request
Oct  1 02:50:34 server_10 2600hz[3159]: |7537...@192.168.20.16|kz_buckets:113 (<0.3882.5>) bucket be24f63fb8d4248d2a39a5xxxxxxxxx:7d0aa9b373362d539024a158a14006ee missing, starting
Oct  1 02:50:34 server_10 2600hz[3159]: |7537...@192.168.20.16|cf_route_req:141 (<0.3882.5>) callflows knows how to route the call! sent park response
Oct  1 02:50:34 server_10 2600hz[3156]: |7537...@192.168.20.16|ecallmgr_fs_route:245 (<0.17090.3>) received affirmative route response for request b7494980-4915-11e4-94e4-031562863200
Oct  1 02:50:34 server_10 2600hz[3156]: |7537...@192.168.20.16|ecallmgr_fs_route:247 (<0.17090.3>) sending XML to freeswitch@server_20.domain.io: <document type="freeswitch/xml"><section name="dialplan" description="Route Park Response"><context name="context_2"><extension name="park"><condition><action application="log" data="NOTICE log|${uuid}|ecallmgr@server_10.domain.io won call control"/><action application="export" data="ecallmgr_Ecallmgr-Node=ecallmgr@server_10.domain.io"/><condition field="variable_ecallmgr_Bridge-ID" expression="^$"><action application="export" data="ecallmgr_Bridge-ID=${UUID}"/></condition><action application="set" data="ringback=%(400,200,400,450);%(400,2000,400,450)"/><action application="set" data="transfer_ringback=%(400,200,400,450);%(400,2000,400,450)"/><action application="park"/></condition></extension></context></section></document>
Oct  1 02:50:34 server_10 2600hz[3156]: |7537...@192.168.20.16|ecallmgr_fs_route:251 (<0.17090.3>) node freeswitch@server_20.domain.io accepted route response for request b7494980-4915-11e4-94e4-031562863200
Oct  1 02:50:34 server_10 2600hz[3156]: |7537...@192.168.20.16|wh_amqp_channel:143 (<0.17094.3>) published to callevt(amqp://guest:gu...@127.0.0.1:5672) exchange (routing key call.publisher_usurp.753703178%40192%2E168%2E20%2E16) via <0.10684.3>
Oct  1 02:50:34 server_10 2600hz[3156]: |7537...@192.168.20.16|ecallmgr_call_control:244 (<0.17097.3>) starting call control listener
Oct  1 02:50:34 server_10 2600hz[3156]: |7537...@192.168.20.16|ecallmgr_call_control:1084 (<0.17097.3>) binding to call 7537...@192.168.20.16 events on node freeswitch@server_20.domain.io
Oct  1 02:50:34 server_10 2600hz[3156]: |7537...@192.168.20.16|wh_amqp_channel:240 (<0.17097.3>) declared queue ecallmgr@server_10.domain.io-<0.17097.3>-27a52582 via channel <0.10680.3>
Oct  1 02:50:34 server_10 2600hz[3156]: |7537...@192.168.20.16|wh_amqp_channel:240 (<0.17094.3>) declared queue ecallmgr@server_10.domain.io-<0.17094.3>-8ebd814e via channel <0.10684.3>
Oct  1 02:50:34 server_10 2600hz[3156]: |7537...@192.168.20.16|wh_amqp_channel:266 (<0.17094.3>) created consumer amq.ctag-YgRE24Wx0kC0TO1WTbiyFA via channel <0.10684.3>
Oct  1 02:50:34 server_10 2600hz[3156]: |7537...@192.168.20.16|gen_listener:815 (<0.17094.3>) queue started: ecallmgr@server_10.domain.io-<0.17094.3>-8ebd814e
Oct  1 02:50:34 server_10 2600hz[3156]: |7537...@192.168.20.16|wh_amqp_channel:266 (<0.17097.3>) created consumer amq.ctag-9gFthkc7qAASBEFP-2xUSg via channel <0.10680.3>
Oct  1 02:50:34 server_10 2600hz[3156]: |7537...@192.168.20.16|gen_listener:815 (<0.17097.3>) queue started: ecallmgr@server_10.domain.io-<0.17097.3>-27a52582

James Aimonetti

unread,
Oct 1, 2014, 12:45:15 PM10/1/14
to Dave Singer, 2600h...@googlegroups.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ok, as near as I can tell, here's what happened with call-id
213055064@customerlocalIP:

> 08:11:00: INVITE recv from customer

The customer is auth-by-IP and was allowed to progress by FreeSWITCH:
> IP customerExtIP.com Approved by acl "trusted[]". Access Granted

> 08:11:00.363454 [DEBUG] kazoo_fetch_agent.c:216 Sending dialplan
> XML
request (ae7e9310-431a-11e4-b389-ab14f54ca636) to
ecal...@ny2apps2.mydomain.us

So ecallmgr@ny2apps2 is tasked with processing the call, but
ultimately times out after 3099ms and FreeSWITCH hangs the call up
with NO_ROUTE_DESTINATION.

Sad story.

Let's look at ecallmgr and see what it tried:

At 8:11:00, ecallmgr_fs_authz publishes an authz.authorize message
> (<0.491.0>) published to
callmgr(amqp://amusr:am...@ny2apps1.mydomain.us:5672) exchange (routing
key authz.authorize)

At 8:11:05, that request timed out:
> 213055064@customerlocalIP|wh_amqp_worker:545 (<0.491.0>) request
timeout exceeded for msg id: 011ec30e2228d97c

Ok, now the 5 second wait is definitely something to be fixed (if
FreeSWITCH waits 3s, ecallmgr requests should probably only wait 2.8s
or so. I've filed a ticket for this sprint:
https://2600hz.atlassian.net/browse/KAZOO-2989

Let's look at the processing of that message id(011ec30e2228d97c):

Nothing on ny2apps2. Ok, how about ny2apps1. Hmm, appears nothing as
well. This makes me suspect that jonny5 is not running, or if it is
running, it does not have a connection to the same broker and
ecallmgr. For reference, ecallmgr published the request to
(amqp://amusr:am...@ny2apps1.mydomain.us:5672)

Now, because the authz request fails to get a response, and the
timeout waiting for that response is larger than FreeSWTICH waits, the
call fails. But, we see later that the ecallmgr_fs_authz process (pid
0.30749.176) actually would have authz'd the request and the call
would have processed appropriately.

You can disable ecallmgr's authz process so it bypasses the
authz.authorize message entirely, or you could try patching the
timeouts on the route/authz requests to be under 3 seconds and let us
know if that addresses the issue. Or you can start jonny5 if you are
wanting to authz.

Hope that helps you along the way of solving this issue.

James

Le 2014-09-30 16:50, Dave Singer a écrit :
> Still hoping someone can take a look at this and help. I'm digging
> through code but it is very slow figuring out where each thing
> jumps, what it's affect might be and if that is what is causing my
> issue. Please someone help. Thanks Dave
>
> On Thursday, September 25, 2014 9:55:14 AM UTC-7, Dave Singer
> wrote:
>>
>> In preparing the logs and looking at the pcap that I was going to
>> include I discovered that my evaluation of the timing of the
>> response packet was wrong. It was not quick. It was 5+ sec after
>> the initial request. So I did more digging in the app log. Logs
>> are in pastebin ny2fs2 <http://pastebin.com/MU1qnaqL> ny2apps2
>> <http://pastebin.com/QEep5109> ny2apps1
>> <http://pastebin.com/qK3sR5TR>. There is a bunch of stuff that is
>> probably not pertinent to the call but I left them in just in
>> case. Note that the only servers in the same zone have names that
>> start with ny2. So amqp connection failures to ams2... and sg1...
>> I think should be irrelevant.
>>
>> Running kazoo git version 3.08 with some minor customizations.
>> Mainly adding extra logging to help debug call processing.
>>
>> The sample call's callid is: 213055064@customerlocalIP Here are
>> some highlights of what I found looking mainly at logs for
>> ny2apps2 to get you started.
>>
>> It finds a routing solution in the no_match call flow and logs
>> this line: Sep 23 08:11:01 ny2apps2 2600hz[24027]:
>> |213055064@customerlocalIP|cf_route_req:133 (<0.5439.222>)
>> callflows knows how to route the call! sent park response I
>> guessed it was sending the message at that time to FS containing
>> the park xml. I was wrong. :( And it made me look in the wrong
>> place for a long time.
>>
>> Three sec after FS sent the dialplan request, it times out
>> waiting for the dialplan response. It apparently sends a message
>> to the app server which it logs starting with: Sep 23 08:11:03
>> ny2apps2 2600hz[24084]:
>> |213055064@customerlocalIP|ecallmgr_call_events:514
>> (<0.30817.176>) publishing call event channel_destroy ..... Sep
>> 23 08:11:00 ny2apps2 2600hz[24084]:
>> |213055064@customerlocalIP|wh_amqp_channel:143 (<0.491.0>)
>> published to callmgr(amqp://amusr:am...@ny2apps1.mydomain.us:5672)
>> "amqp://amusr:am...@ny2apps1.mydomain.us:5672" amqp_uri =
>> "amqp://amusr:am...@ny2apps2.mydomain.us:5672"
sip:ja...@2600hz.com
tel:415.886.7905
irc:mc_ @ freenode
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQEcBAEBAgAGBQJULC+YAAoJENTKa+JPXCVgSkMIAJqLjfGy9NMN1zFGD+qWOkGh
gx68KEmlhJQnAe9pYA0Uh/HZiom7DdDgys2ItWeNKrgbE1GbQE1bfgSYc3HobAkc
jt0s8eWjYLPMiSelN9gk6O8wNYSQZ4d7bNfgt8WT/2BblhTj1d03Lpj1w8LymtJX
1h+9AORr34Aj0lukeShUgjq4dW60of558Hv2p5aW9FF834xKziMMaAHANCqx+VLO
PrjS3cwvzjeOddY2lq98zs1Ql35+KFh8BmuTPJlGGSmU6srHmGu/suc74KuEnq8P
koniritsI7FRNIDwZ18OMPBPZHO2Btgp++DJuJ4ExZtmnNg4AcovlZ/+UXuyPoA=
=nlbU
-----END PGP SIGNATURE-----
Reply all
Reply to author
Forward
0 new messages