Test with APNS simulator --> process killed: out of memory / too many open files

Serafino Sorleti

Sep 11, 2014, 12:39:14 PM
to uni...@googlegroups.com
Hi,

I'm testing the uniqush-push application (great app; I'm studying the code and it has a very good design...). I made some modifications (logging features) and started running push notification tests. I wrote a little shell script that works as follows (an infinite loop):

- it randomly chooses between two push service providers, A and B
- it sends a simple "Hello World" push notification to n subscribers (n is random, with a maximum of 1000; not a wildcard, but an explicit subscriber list in the HTTP request to uniqush-push)
- it sleeps 1 second

After about 8 hours (and several million notifications) it stopped; checking the system logs, I found the process had been killed for the reasons I wrote in the subject.

So I relaunched uniqush-push and, checking with netstat, saw about 300 established connections between apns-simulator and uniqush-push.

I studied the apns.go and pool.go code and understood the following (a very short summary):

1) in main.go, the APNS handling code is launched and the pushMux code is launched, which waits for push notification requests
2) when the Push function of pushBackend.go is called, the Push function in apns.go is called; the pushMux code then creates a worker object (of type *pushWorkerInfo, if there isn't already one for that push service provider) and calls the pushWorker function, which creates an APNS connection manager object and a connection pool (so each worker has its own connection pool) and calls the singlePush code for each delivery point (see the sketch after this list)
3) a worker's pool is closed when the pushWorker function returns (when the two channels, in the select case inside the for loop, are closed)
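
To make point 2 concrete, here is a minimal sketch of the dispatch pattern as I understand it. The names are simplified from apns.go and the bodies are illustrative only, so this is my reading, not the actual uniqush-push code:

// My reading of the pushMux dispatch, heavily simplified.
type pushRequest struct {
    psp string // push service provider name
    // ... delivery points, payload, etc.
}

type pushWorkerInfo struct {
    ch chan *pushRequest // requests destined for one push service provider
}

// pushMux multiplexes incoming requests to one worker per PSP. Note that it
// exits (and the workers' pools get torn down) only when reqChan is closed.
func pushMux(reqChan <-chan *pushRequest) {
    workers := make(map[string]*pushWorkerInfo) // keyed by PSP name
    for req := range reqChan {
        worker, ok := workers[req.psp]
        if !ok {
            worker = &pushWorkerInfo{ch: make(chan *pushRequest)}
            workers[req.psp] = worker
            go pushWorker(worker.ch) // each worker builds its own connection pool
        }
        worker.ch <- req
    }
}

// pushWorker would create the APNS connection manager and pool, then call
// singlePush for each delivery point of each request (omitted here).
func pushWorker(reqs <-chan *pushRequest) {
    for range reqs {
        // ... singlePush per delivery point ...
    }
}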

So I have some questions:

1 - Can you explain the behaviour of uniqush-push during my tests (if you can identify the cause of the issue I encountered)?

2 - Is the apns.go and pool.go logic as I described above?

3 - Searching for the worker pool's closing code, it seems to me that each worker's pool is never closed while uniqush-push is running (because the pushMux for loop, which waits for push requests to deliver to workers, exits only when *apnsPushService.reqChan is closed). Is that right?

4 - Yesterday I did a test. I sent 2 push requests (some hundreds of subscribers to one service and some hundreds to another) to apns-simulator --> 26 established connections. No more push requests after that. I left uniqush-push and apns-simulator running, and today I still have the 26 connections alive (established, as reported by netstat). Is this the normal behaviour? Does the actual Apple APNS server close inactive connections to uniqush-push (and thus behave differently from apns-simulator)?

5 - Today I did another test (again with the two services and some hundreds of subscribers each). After sending the two push notifications, as described in point 4 above, I waited some minutes and then sent another push notification request to uniqush-push, for the service different from the one of the last push request. I checked with netstat again and counted another 13 connections (13 is the pool's maximum connection count I found in the code). I tried again several times and now I'm counting 118 connections. I think this could be the reason for the too many open files issue (although the script doesn't wait minutes between two requests...), but why this behaviour? Any ideas?

Thanks a lot in advance for your replies...

Bye,
Serafino.

Monnand

Sep 12, 2014, 1:32:36 PM
to uni...@googlegroups.com
Thanks for your detailed description and test! It is so helpful!

I see your problem and am trying to find the root cause now. I have received
several bug reports about the OOM problem. However, I was working on
other projects for my employer and was not able to make many
changes to uniqush due to possible copyright issues. Right now, I'm
fairly free and will dig into this problem soon.

In general, I do think the connpool package is over-designed, making it
hard to extend and/or fix. I may use a different approach in the apns
package to handle multiple connections.

Specific to your questions, please check out my answers below:

On Thu, Sep 11, 2014 at 12:39 PM, Serafino Sorleti
<serafino...@acotel.com> wrote:
> Hi,
>
> I'm testing the uniqush-push application (great app; I'm studying the code
> and it has a very good design...). I made some modifications (logging
> features) and started running push notification tests. I wrote a little
> shell script that works as follows (an infinite loop):
>
> - it randomly chooses between two push service providers, A and B
> - it sends a simple "Hello World" push notification to n subscribers (n is
> random, with a maximum of 1000; not a wildcard, but an explicit subscriber
> list in the HTTP request to uniqush-push)
> - it sleeps 1 second

This is a very nice test! It makes it easy enough for me to reproduce the
bug. Thank you very much!

<snip>
> So I have some questions:
>
> 1 - Can you explain the behaviour of uniqush-push during my tests (if you
> can identify the cause of the issue I encountered)?

As you said, it is caused by a large number of TCP connections. However,
this situation should be handled by the connpool package, which, as I
said, is over-designed, making it hard to debug.
>
> 2 - Is the apns.go and pool.go logic as I described above?

Exactly. You are correct. pool.go is supposed to limit the number
of connections so that apns can use it to communicate with the
server without creating too many TCP connections.

>
> 3 - Searching for the worker pool's closing code, it seems to me that each
> worker's pool is never closed while uniqush-push is running (because the
> pushMux for loop, which waits for push requests to deliver to workers,
> exits only when *apnsPushService.reqChan is closed). Is that right?

Yes. The pool will not be closed. Connections are handled by the pool
itself, which maintains only a specific number of connections. When
someone gets a connection from the pool, they have to Close() the
connection, which returns it to the pool. If all connections are
assigned to others and someone tries to get a connection from the
pool, the caller is blocked until a connection becomes available.
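
In code, the intended usage pattern looks roughly like this. This is a sketch of the contract only, with an assumed method name (Get in particular); connpool's actual API may differ:

// Get blocks when every connection is already handed out; Close() hands the
// connection back to the pool for reuse instead of tearing down the socket.
func sendViaPool(pool *Pool, payload []byte) error {
    conn, err := pool.Get() // blocks until a connection is available
    if err != nil {
        return err
    }
    defer conn.Close() // returns the connection to the pool, not a real close

    _, err = conn.Write(payload)
    return err
}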

>
> 4 - Yesterday I did a test. I sent 2 push requests (some hundreds of
> subscribers to one service and some hundreds to another) to apns-simulator
> --> 26 established connections. No more push requests after that. I left
> uniqush-push and apns-simulator running, and today I still have the 26
> connections alive (established, as reported by netstat). Is this the
> normal behaviour? Does the actual Apple APNS server close inactive
> connections to uniqush-push (and thus behave differently from
> apns-simulator)?
>

Well... here's a bit of a rant: Apple has some hand-waving description in
their documentation. It says that it *may* close a connection at any
point. Apple itself does not guarantee anything. In short, APNS is a
really bad service: it does not say how many connections may be
established between the app server and the APNS server, but it says it may
close any connection at any time under whatever condition. It says
that the app server *may* receive a reply for a message, but it does not
guarantee it. This means the app server, or uniqush-push in our case, may
receive nothing after sending a message to APNS. It is understandable
that APNS cannot guarantee that a message will be delivered to the
mobile side; however, it is absolutely insane that APNS may tell me
nothing when I send a message to it --- it should at least tell me
that the message was received by APNS. In Apple's dictionary, there's
no such thing as an SLA. It's the wild west, and people can shoot each
other as they see fit.

Sorry for my rant. Back to our discussion: In practice, APNS will
close connections. We just don't know when a connection will be
closed.

> 5 - Today I did another test (again with the two services and some hundreds
> of subscribers each). After sending the two push notifications, as described
> in point 4 above, I waited some minutes and then sent another push
> notification request to uniqush-push, for the service different from the one
> of the last push request. I checked with netstat again and counted another
> 13 connections (13 is the pool's maximum connection count I found in the
> code). I tried again several times and now I'm counting 118 connections. I
> think this could be the reason for the too many open files issue (although
> the script doesn't wait minutes between two requests...), but why this
> behaviour? Any ideas?

I think it is caused by the connpool.

To be honest, I used uniqush to learn Go. I was not very familiar
with the language. In fact, there were not many people familiar
with Go back in early 2010 when I started writing uniqush. I played
too much with goroutines and channels in the connpool package.
Although I have re-designed uniqush several times, I'm still not quite
happy with its current state. This made me decide to re-write it
from scratch. It's still a work in progress, but everything is public
now. The new code can be found in the v2.0 branch of the github repo.

With that being said, I will look into your problem. However, I think
a good way to solve it is to re-write apns.go without using connpool.

-Nan
>
> Thanks a lot in advance for your replies...
>
> Bye,
> Serafino.

Serafino Sorleti

Sep 12, 2014, 3:13:41 PM
to uni...@googlegroups.com
Hi,

Thanks a lot for your detailed reply...

I will send a reply in the next few days (hoping that in the meantime you will be able to replicate the bug I encountered, since you obviously understand the connection pooler better than I do), as I want to examine what you wrote. But there is one thing about the worker's pool that I don't understand (it concerns one of the tests already described in my first post):

A worker is associated (via a map structure) with the PSP name. If I send one push to service A (some hundreds of subscribers) and after that one push to service B (some hundreds of subscribers), the result (using the apns-simulator) is 13 connections opened for the first push and 13 for the second. Then, after waiting some minutes, I repeat the first push to service A (a service different from the one of the last push, and the same as the first push, which already has 13 connections open). I expected this push to reuse the 13 connections already opened by the first push, or to block (because the pool's maximum number of active connections had been reached). Instead, the third push opened another 13 connections, even though a worker pool associated with the same PSP name was already open.
Was I mistaken about the pool reuse, in thinking that there are 13 more connections than there should be?

You already told me it could be a connection pool issue; this question is only to check my understanding of your code.

Thanks,
Serafino.

Uscher

Sep 14, 2014, 12:19:29 PM
to uni...@googlegroups.com
Hi

Our public messenger app uses uniqush, and we currently have a release candidate in the "last" test round. We did not test uniqush as extensively as Serafino did. Thank you, Serafino. This bug is critical for us, so we are delaying the release... I am looking forward to any stable, but not too risky, fix.

Urs

Serafino Sorleti

Sep 16, 2014, 4:17:46 AM
to uni...@googlegroups.com
Hi Monnand,

did you have time to test the issue? I know that a fix takes time, but it would be useful for me to know whether you were able to replicate it...

Thanks a lot for your support.

Serafino.

Serafino Sorleti

Sep 16, 2014, 12:31:14 PM
to uni...@googlegroups.com
Hi Monnand,

can you take a look at the following code in pool.go:

func (self *Pool) popIdle() *pooledConn {
    for len(self.idle) > 0 {
        conn := self.idle[len(self.idle)-1]
        self.idle[len(self.idle)-1] = nil
        self.idle = self.idle[:len(self.idle)-1]
        // Ignore any connection which is in idle state more than 15
        // min.
        if time.Now().Sub(conn.timestamp) > 15*time.Minute {
            continue
        }
        self.nrActiveConn++
        return conn
    }
    return nil
}

it seems to me that these connections (idle for more than 15 minutes) are simply forgotten: the connection is popped off the idle queue, but it is not closed, and the reference to it is lost. So their number grows over uniqush-push's running life.

Is that right, or do you close them somewhere in your code (I couldn't find the code that does that...)?

Thanks in advance for your reply...

Serafino.

Monnand

Sep 16, 2014, 1:00:33 PM
to uni...@googlegroups.com
Hi Serafino,

Sorry for my late reply. I wrote some unit tests for connpool this
weekend (see my latest commits), but the bug could not be reproduced
at that level. However, I had not noticed the code you mentioned.
Let me try to confirm your assumption with a unit test.

Thank you so much for your help!

-Nan

Serafino Sorleti

Sep 16, 2014, 3:57:52 PM
to uni...@googlegroups.com
Hi Monnand,

don't worry :-). Thanks for your replies...

I hope the unit test you write for the piece of code I quoted above (the popIdle func) will let us understand the final destiny :-) of those long-idle connections.

Bye,
Serafino

Monnand

Sep 20, 2014, 12:41:35 AM
to uni...@googlegroups.com
Hi Serafino,

Sorry for my late reply. I think you are right: I did not close the
connection when it times out. I just made a simple change to connpool.
Would you please build uniqush-push against the most recent connpool
and try running your test again?

Regards,
-Nan

Serafino Sorleti

Sep 20, 2014, 3:59:05 AM
to uni...@googlegroups.com
Hi Monnand,

Thanks for paying attention to my posts. While I was waiting for your reply, I had already made a change at the same point you did, but instead of using the dropConn function I took part of its code and put it directly where you put the dropConn call. The reason (it could be wrong; please tell me if I made a mistake...) is the following:

Before popIdle is called, we have N active connections and M idle connections.

Considering popIdle in the case of a long-idle connection, we get:

N stays the same (the continue skips the code that increments N)
M - 1
The idle connection is discarded, lost in the heap.

Considering your new popIdle, we get:
N + 1 (but the continue is called anyway, so no connection is returned, and therefore the increment of N seems wrong to me)
M - 1
The idle connection is closed (right, but with a call to dropConn there is the N + 1 problem I described above)

In my fix I simply inlined the dropConn code, but without the increment of N that is inside dropConn. I put that code where you put the self.dropConn call, before the continue in popIdle.

I'm on mobile with a bad connection, so I cannot check whether you made other changes besides the popIdle one I saw, and I cannot run a proper diff. But I preferred to flag it now rather than wait, as you may have no time during the working days, when I will be back at my PC :)

Thanks a lot for your help,
Bye,
Serafino.

Serafino Sorleti

Sep 20, 2014, 5:13:36 AM
to uni...@googlegroups.com
Hi Monnand,

I made a mistake explaining my fix (I didn't remember that dropConn decrements nrActiveConn rather than incrementing it), so here is what I wrote in the post above, but without the mistake:

Considering your new popIdle, we get:
N - 1 (the idle connection is dropped and not reused; dropConn is dropping an idle connection, not an active one, so the decrement of N seems wrong to me)
M - 1
The idle connection is closed (right, but with a call to dropConn there is the N - 1 problem I described above)

In my fix I simply inlined the dropConn code, but without the decrement of N that is inside dropConn. I put that code where you put the self.dropConn call, before the continue in popIdle.
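
In other words, a minimal sketch of the whole function with the fix applied (field names as in the snippet I quoted earlier; the inner conn field of pooledConn holding the real network connection is my assumption):

func (self *Pool) popIdle() *pooledConn {
    for len(self.idle) > 0 {
        conn := self.idle[len(self.idle)-1]
        self.idle[len(self.idle)-1] = nil
        self.idle = self.idle[:len(self.idle)-1]
        // Drop any connection idle for more than 15 min: close the
        // underlying socket to release its file descriptor, but leave
        // nrActiveConn alone -- this connection was idle, not active.
        if time.Now().Sub(conn.timestamp) > 15*time.Minute {
            conn.conn.Close() // assumption: pooledConn wraps the real conn in a conn field
            continue
        }
        self.nrActiveConn++
        return conn
    }
    return nil
}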

I apologise for the mistake in my previous post...

Thanks,
Serafino

Monnand

Sep 20, 2014, 1:51:19 PM
to uni...@googlegroups.com
Ah, you are right. Thank you! I will make a change soon.

Nan 

Serafino Sorleti

Sep 24, 2014, 4:58:03 AM
to uni...@googlegroups.com
Hi Monnand,

I'm still running load tests on uniqush, and I noticed one thing (I'm looking into it in the meanwhile, but you probably know what is happening...):

- 2 uniqush machines, 1 apns-simulator, 1 script (the one I described in the first post of this thread, but now with a sleep time of 10 seconds, to see whether reducing the number of push notifications per time interval changes anything about uniqush's running lifetime). After about 36 hours (and more than 5,500,000 notifications) the logs of both uniqush instances stopped updating (but both instances were still alive). So I checked the scripts, and it seemed to me the curl had hung. I killed the script and sent just one curl (a push with some hundreds of subscribers), and it hung too. Checking the logs of the uniqush instance that received the request, I read the following lines (note that, I don't know if it is a coincidence, 13 is the pool's maximum number of allowed connections and there are exactly 13 log lines about closed connections):

[Push][Info] 2014/09/24 09:16:15 Connection closed by APNS: read tcp 10.20.50.121:8080: use of closed network connection

[Push][Info] 2014/09/24 09:16:15 Connection closed by APNS: read tcp 10.20.50.121:8080: use of closed network connection

[Push][Info] 2014/09/24 09:16:15 Connection closed by APNS: read tcp 10.20.50.121:8080: use of closed network connection

[Push][Info] 2014/09/24 09:16:15 Connection closed by APNS: read tcp 10.20.50.121:8080: use of closed network connection

[Push][Info] 2014/09/24 09:16:15 Connection closed by APNS: read tcp 10.20.50.121:8080: use of closed network connection

[Push][Info] 2014/09/24 09:16:15 Connection closed by APNS: read tcp 10.20.50.121:8080: use of closed network connection

[Push][Info] 2014/09/24 09:16:15 Connection closed by APNS: read tcp 10.20.50.121:8080: use of closed network connection

[Push][Info] 2014/09/24 09:16:15 Connection closed by APNS: read tcp 10.20.50.121:8080: use of closed network connection

[Push][Info] 2014/09/24 09:16:15 Connection closed by APNS: read tcp 10.20.50.121:8080: use of closed network connection

[Push][Info] 2014/09/24 09:16:15 Connection closed by APNS: read tcp 10.20.50.121:8080: use of closed network connection

[Push][Info] 2014/09/24 09:16:15 Connection closed by APNS: read tcp 10.20.50.121:8080: use of closed network connection

[Push][Info] 2014/09/24 09:16:15 Connection closed by APNS: read tcp 10.20.50.121:8080: use of closed network connection

[Push][Info] 2014/09/24 09:16:15 Connection closed by APNS: read tcp 10.20.50.121:8080: use of closed network connection


after that, nothing (the curl hung waiting for a reply, I think; 10.20.50.121 is the apns-simulator machine's IP).
The logs of both uniqush instances stopped updating at about 8 am and, while I was writing this post, the waiting curl completed and the logs resumed updating (not only on the instance that received the request, but on both instances). One of the uniqush instances (looking at its log) received the push request at about 9 am, and it seems the apns-simulator started returning successful push notification feedback again at 9:54 am.
Since both instances' logs resumed reporting successful push notifications at 9:54 (and the hung curl was obviously sent to only one instance), it seems to me that the problem could be the apns-simulator.

I'm starting to think that, after the idle-connection fix, the weak link in the load test is the apns-simulator (it had to handle more than 5,500,000 notifications), but it is a simulator and not the actual APNS server... What do you think?

Thanks,
Serafino.


Monnand

Sep 28, 2014, 4:52:05 PM
to uni...@googlegroups.com
Hi Serafino,

Thank you so much! It seems that the simulator is the bottleneck this time. Do you have the log messages from the simulator?

Monnand

Sep 28, 2014, 5:04:05 PM
to uni...@googlegroups.com
I grepped for the error message in the code. The error is normal: it is generated in apns.go, in the function resultCollector():

This function tries to read data from the connection; if there's an error, it returns. In our case, because we now close a connection when it has been idle for more than 15 minutes, resultCollector() gets an error from the closed connection and returns.

However, I do see two problems here: 1. this is not really an error and should not be reported as one; 2. the design over-uses goroutines and channels, making it hard to read.
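
For the first point, the reader could be told about intentional shutdowns so that only unexpected errors get reported. A rough sketch with hypothetical names (done, results); this assumes "log" and "net" imports and is not the actual apns.go code:

// Only report read errors that were NOT caused by us closing the
// connection on purpose (e.g. the 15-minute idle timeout).
func resultCollector(conn net.Conn, done <-chan struct{}, results chan<- []byte) {
    buf := make([]byte, 6) // APNS binary error responses are 6 bytes long
    for {
        n, err := conn.Read(buf)
        if err != nil {
            select {
            case <-done:
                // We closed the connection ourselves: not an error,
                // so return silently (or log at debug level).
            default:
                log.Printf("[Push][Error] APNS read failed: %v", err)
            }
            return
        }
        results <- append([]byte(nil), buf[:n]...)
    }
}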

-Nan



Serafino Sorleti

Sep 30, 2014, 6:47:58 AM
to uni...@googlegroups.com
Hi Monnand,

thanks for your very useful replies...

I'm continuing to test the memory issue. I ran the following test today:

- A few hours ago I downloaded the github code (uniqush-push and its packages), so we are aligned and you can replicate the issue more easily.
- I started my little script (pasted below; I already described how it works in this thread: it simulates a continuous flow of push notifications for 2 services and sends each push to a number <= 1000 of subscribers. I had obviously already generated the 2000 device tokens and registered the related subscribers and the two services):
--------------------------------------------------------
#!/bin/bash
# note: bash, not plain sh, since the script uses arrays and $RANDOM

# command line arguments
# uniqush-push ip
machine_ip=$1
# max range
max_range=$2

# subscribers list strings
subscribers=""
special_char="."
separator=","

# push service provider names array
ARRAY_SERVICES=(simserv simserv2)
ARRAY_SUBSCRIBERS_PREFIX_BY_SERVICE=(usr usr_psp2)
range_service_array=2
echo "Created push service providers names array: "${ARRAY_SERVICES[*]}

# subscribers array range
range_max_users_for_single_push=$max_range

# infinite loop
index=0
while [ $index -lt 1 ]
do
  # choose a push service provider name at random
  chosen_service=`expr $RANDOM % $range_service_array`
  #chosen_service=0
  echo "Chosen service: "${ARRAY_SERVICES[$chosen_service]}
  echo "and so subscriber id prefix: "${ARRAY_SUBSCRIBERS_PREFIX_BY_SERVICE[$chosen_service]}
  # choose the number of users for this single push request at random
  max_user=`expr $RANDOM % $range_max_users_for_single_push`
  echo "Number of subscribers for this push: "$max_user
  # generate the subscribers list
  subscriber_name=${ARRAY_SUBSCRIBERS_PREFIX_BY_SERVICE[$chosen_service]}
  subscribers=""
  counter=0
  while [ $counter -le $max_user ]
  do
    if [ $counter -eq 0 ]
    then
      subscribers=$subscribers$subscriber_name$special_char$counter
    else
      subscribers=$subscribers$separator$subscriber_name$special_char$counter
    fi
    counter=`expr $counter + 1`
  done
  # send the HTTP call to uniqush with the randomly chosen service name and subscriber list
  echo curl http://$machine_ip:9898/push -d service=${ARRAY_SERVICES[$chosen_service]} -d subscriber=$subscribers -d msg="Hello World"
  curl http://$machine_ip:9898/push -d service=${ARRAY_SERVICES[$chosen_service]} -d subscriber=$subscribers -d msg="Hello World"
  sleep 1
done
--------------------------------------------------------

- I left the script running for about 5-6 minutes, and I noticed the following (I collected pprof heap data about once a minute):
Total: 14.5 MB
     5.5  37.9%  37.9%      5.5  37.9% reflect.mapassign
     2.0  13.8%  51.7%      3.5  24.1% github.com/uniqush/cache.(*Cache).Set
     1.5  10.3%  62.1%      1.5  10.3% container/list.(*List).insertValue
     1.5  10.3%  72.4%      1.5  10.3% makemap_c
     1.0   6.9%  79.3%      1.0   6.9% encoding/json.(*decodeState).literalStore
     1.0   6.9%  86.2%      1.0   6.9% fmt.Sprintf
     0.5   3.4%  89.7%      0.5   3.4% allocg
     0.5   3.4%  93.1%      0.5   3.4% encoding/json.stateBeginValue
     0.5   3.4%  96.6%      0.5   3.4% github.com/uniqush/uniqush-push/push.NewEmptyDeliveryPoint
     0.5   3.4% 100.0%      0.5   3.4% syscall.anyToSockaddr
-------     
Total: 28.5 MB
    10.5  36.8%  36.8%     10.5  36.8% reflect.mapassign
     4.0  14.0%  50.9%      4.0  14.0% container/list.(*List).insertValue
     4.0  14.0%  64.9%      4.0  14.0% encoding/json.(*decodeState).literalStore
     2.5   8.8%  73.7%      6.5  22.8% github.com/uniqush/cache.(*Cache).Set
     2.0   7.0%  80.7%      2.0   7.0% fmt.Sprintf
     2.0   7.0%  87.7%      2.0   7.0% github.com/uniqush/uniqush-push/push.NewEmptyDeliveryPoint
     2.0   7.0%  94.7%      2.0   7.0% makemap_c
     0.5   1.8%  96.5%      0.5   1.8% allocg
     0.5   1.8%  98.2%      0.5   1.8% encoding/json.stateBeginValue
     0.5   1.8% 100.0%      0.5   1.8% syscall.anyToSockaddr
-------      
Total: 43.5 MB
    18.0  41.4%  41.4%     18.0  41.4% reflect.mapassign
     5.5  12.6%  54.0%      5.5  12.6% makemap_c
     4.0   9.2%  63.2%      4.0   9.2% container/list.(*List).insertValue
     4.0   9.2%  72.4%      4.0   9.2% encoding/json.(*decodeState).literalStore
     4.0   9.2%  81.6%      8.0  18.4% github.com/uniqush/cache.(*Cache).Set
     3.0   6.9%  88.5%      3.0   6.9% github.com/uniqush/uniqush-push/push.NewEmptyDeliveryPoint
     2.5   5.7%  94.3%      2.5   5.7% fmt.Sprintf
     1.5   3.4%  97.7%      1.5   3.4% encoding/json.stateBeginValue
     0.5   1.1%  98.9%      0.5   1.1% allocg
     0.5   1.1% 100.0%      0.5   1.1% syscall.anyToSockaddr
------- 
Total: 61.0 MB
    26.5  43.4%  43.4%     26.5  43.4% reflect.mapassign
     7.5  12.3%  55.7%      7.5  12.3% makemap_c
     5.5   9.0%  64.8%      5.5   9.0% container/list.(*List).insertValue
     5.0   8.2%  73.0%      5.0   8.2% encoding/json.(*decodeState).literalStore
     5.0   8.2%  81.1%     10.5  17.2% github.com/uniqush/cache.(*Cache).Set
     4.0   6.6%  87.7%      4.0   6.6% github.com/uniqush/uniqush-push/push.NewEmptyDeliveryPoint
     3.5   5.7%  93.4%      3.5   5.7% fmt.Sprintf
     3.0   4.9%  98.4%      3.0   4.9% encoding/json.stateBeginValue
     0.5   0.8%  99.2%      0.5   0.8% allocg
     0.5   0.8% 100.0%      0.5   0.8% syscall.anyToSockaddr
------- 

- I killed the script:
Top output at the time I killed the test_push_20141609.sh script:
     
PID   COMMAND      %CPU TIME     #TH  #WQ  #POR #MREG MEM   RPRVT PURG CMPR
2718  uniqush-push 0.0  01:25.90 10   0    45   1483  146M  145M  0B   0B 

- I checked uniqush-push memory usage with pprof and top after 30 minutes without any push notifications:
Total: 72.5 MB
    30.5  42.1%  42.1%     30.5  42.1% reflect.mapassign
     9.0  12.4%  54.5%      9.0  12.4% makemap_c
     8.0  11.0%  65.5%      8.0  11.0% encoding/json.(*decodeState).literalStore
     6.0   8.3%  73.8%      6.0   8.3% github.com/uniqush/uniqush-push/push.NewEmptyDeliveryPoint
     5.5   7.6%  81.4%      5.5   7.6% container/list.(*List).insertValue
     5.0   6.9%  88.3%     10.5  14.5% github.com/uniqush/cache.(*Cache).Set
     3.5   4.8%  93.1%      3.5   4.8% encoding/json.stateBeginValue
     3.5   4.8%  97.9%      3.5   4.8% fmt.Sprintf
     0.5   0.7%  98.6%      0.5   0.7% allocg
     0.5   0.7%  99.3%      0.5   0.7% github.com/uniqush/uniqush-push/srv.clearRequest
     
-------
 
Top
PID   COMMAND      %CPU TIME     #TH  #WQ  #POR #MREG MEM   RPRVT PURG CMPR
2718  uniqush-push 0.0  01:29.20 10   0    45   1483  146M  145M  0B   0B

It seems to me that memory usage grows while push notifications are being handled (which is expected), but something allocated during push handling is never released, even after the push requests are done... If it were the request queue, the memory should decrease some time after I killed the script, but as you can see, nothing...
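
(For anyone who wants to reproduce the snapshots: I collect heap data with go tool pprof. A minimal sketch of the standard way to expose the profile over HTTP, assuming a side port of 6060; this is the stock net/http/pprof mechanism, not code already in uniqush-push:)

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func init() {
    go func() {
        // Then run: go tool pprof http://localhost:6060/debug/pprof/heap
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
}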

Any suggestions? 

Thanks a lot for your support...

Bye,
Serafino.

Uscher

Oct 28, 2014, 5:09:48 AM
to uni...@googlegroups.com
Hi

What is the status of this issue? I am concerned about it, and I wonder whether I have to restart uniqush periodically in order to avoid a sudden crash.

Thank you,
Urs

Misha Nasledov

Jun 3, 2015, 1:58:05 AM
to uni...@googlegroups.com
Unfortunately, there are still dragons in this code. A colleague recently fixed the memory leak, but now it fails much faster in production under a fair load (I think APNS only gets 50 qps), seemingly with too many open files.

I'm working on figuring it out, but I'm also looking at moving my iOS clients to GCM, now that GCM supports iOS. The client-side modification is minimal, and there's a net win in being able to talk to all my mobile clients in exactly the same way.

Serafino Sorleti

Jun 3, 2015, 4:37:23 AM
to uni...@googlegroups.com
Hi (and Uscher, sorry for the delay in my reply...),

I investigated the issue, and it seems related to the goroutine launched for each device id (device token) in a push request to APNS. It seems the goroutines' memory is not released, or at least not released within a useful time. Each goroutine uses x kilobytes of stack just by being launched, so the scenario seems to be the following:

- 1000 notifications (a push notification to 1000 device tokens) are sent
- about 1000 * x kilobytes are used
- this memory is not released within a "useful time"
- another 1000 notifications are sent
- the previous 1000 * x kilobytes plus the current 1000 * x kilobytes are now in use

As I said, it "seems" to me that this is the issue, but I couldn't dig deeper into it. It could be due to unclosed channels, but as I said, I didn't investigate further. If you compile with a newer version of the Go compiler, the stack used by each goroutine is smaller (I tried moving from 1.3 to 1.4), so memory consumption is lower... but that is not a solution...
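
If this is confirmed, one classic mitigation would be a fixed-size worker pool instead of one goroutine per device token, so that at most N goroutine stacks are alive regardless of the batch size. A minimal sketch with illustrative names (not uniqush-push code); assumes a "sync" import:

// sendOne stands in for the per-device-token push work.
func sendOne(token string) { /* ... */ }

// pushAll fans tokens out to a fixed number of workers, bounding the number
// of live goroutine stacks no matter how many tokens one request contains.
func pushAll(tokens []string, nWorkers int) {
    jobs := make(chan string)
    var wg sync.WaitGroup
    wg.Add(nWorkers)
    for i := 0; i < nWorkers; i++ {
        go func() {
            defer wg.Done()
            for token := range jobs {
                sendOne(token)
            }
        }()
    }
    for _, t := range tokens {
        jobs <- t
    }
    close(jobs) // lets the workers' range loops end, so no goroutines leak
    wg.Wait()
}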

I hope Monnand can take a look at this...

Bye,
Serafino.

Misha Nasledov

Jun 3, 2015, 11:52:38 AM
to uni...@googlegroups.com
You know, I realized I may not have had enough instances running when I turned on the APNS floodgates after the latest fixes we submitted:

https://github.com/uniqush/uniqush-push/pull/60 -- missing close() on a channel in apns.go
https://github.com/uniqush/uniqush-push/pull/58 -- close() can potentially get called on a nil conn, and Bad Things(tm) happen

These have both been merged by Monnand. Please give it a whirl?

Misha Nasledov

Jun 3, 2015, 5:08:19 PM
to uni...@googlegroups.com
So, in my case, the memory usage is now stable, and it works until something goes wrong. The code that sends the pushes occasionally sends a payload that is too long, which Apple doesn't like, causing Apple to shut the connection. As soon as that happens, it doesn't seem to recover, and further APNS pushes fail. So there's an issue with re-opening the connection to Apple, but otherwise it's looking better.

Monnand

Jun 4, 2015, 11:06:17 PM
to uni...@googlegroups.com
Hi Misha,

You are amazing! I'm so sorry that I could not reply to your email sooner. I'm currently working on my thesis and graduation; lots of work.

Specific to the problem you mentioned, I think it may be caused by some bug in the connpool package. I now realize that it's a bad idea to implement a connection pool in that way. A better approach would be to use a set of goroutines, each maintaining a TCP connection to the APNS server. Doing the re-connection work within a goroutine would be much easier than with the current approach.
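
Roughly what I have in mind (an illustrative sketch only; names, backoff and error handling are placeholders, not a final design; assumes "crypto/tls", "log" and "time" imports):

// Each worker goroutine owns exactly one TLS connection to the APNS server,
// so all re-connection logic lives in a single place.
func connWorker(addr string, conf *tls.Config, requests <-chan []byte) {
    var conn *tls.Conn
    for payload := range requests {
        // (Re)dial lazily; a real version would eventually give up and
        // report the payload as failed instead of retrying forever.
        for conn == nil {
            c, err := tls.Dial("tcp", addr, conf)
            if err != nil {
                log.Printf("dial APNS failed, will retry: %v", err)
                time.Sleep(time.Second)
                continue
            }
            conn = c
        }
        if _, err := conn.Write(payload); err != nil {
            conn.Close()
            conn = nil // force a re-dial before the next request
        }
    }
}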

However, I currently do not have time to make this change. I will have a month of free time in August, so I will work on it then.

-Monnand
