We chose Zotonic a while ago as the target framework for building our
web site.
Everything was nice and smooth while developping on our pcs (both
under Windows and Debian boxes under VirtualBox).
And then... we decided it was time to deploy Zotonic on our dedicated
server.
And then... Zotonic refused to start.
In the logs, we had something like :
=INFO REPORT==== 29-Jan-2012::14:57:07 ===
DEBUG: z_install_data:196 "Inserting username for the admin"
Over several days we tried (almost) everything to get it right:
- zotonic 0.7.3
- zotonic 0.7.4
- hacking the code in all sorts of inventive ways
- playing with pgsql settings
- creating new dbs
- re-installing pgsql
- formatting the server and reinstalling the OS (i'm quite ashamed of
that one)
Nothing would do. And then I said to myself maybe i could use my
brains a little bit for once... doh!
Everything we had experienced was consistent with the z_ids gen_server
stopped:
{noproc,{gen_server,call,[z_ids,{id,10}]}}}},
What could have stopped the z_ids gen_server? Since z_ids is under a
supervision tree, it had to be another process that exited
prematurely.
I used the dbg module to trace process events:
dbg:tracer(),
dbg:p(all,[p]).
This logs all the spawn/exit process events. In the logs, I searched
the first occurence of exit that was not "exit normal". I got:
(<0.135.0>) unregister z_sites_manager
Terminating due to shutdown
(<0.135.0>) exit {timeout,{gen_server,call,[<0.136.0>,
{start_child,xxx}]}}
So something went wrong inside start_child and caused a time out.
I searched the spawn event that created process <0.135.0>. I got:
(<0.101.0>) spawn <0.135.0> as proc_lib:init_p(zotonic_sup,
[<0.79.0>],gen,init_it,[gen_server,<0.101.0>,<0.101.0>,
{local,z_sites_manager},
z_sites_manager,[],
[{timeout,3600000}]])
(<0.101.0>) link <0.135.0>
(<0.135.0>) register z_sites_manager
This confirmed the issue was inside z_sites_manager:start_child.
I followed the code:
In z_sites_manager:
handle_cast({start, Site}, State) ->
z_supervisor:start_child(State#state.sup, Site), ...
In z_supervisor:
start_child(Pid, Name) ->
gen_server:call(Pid, {start_child, Name}).
The gen_server documentation confirmed what i remembered:
gen_server:call can time out with a default timeout of 5s.
I realised that the dedicated server is not as powerful as our desktop
beasts. So it's normal it takes longer to populate zotonic's tables at
first startup. Definitely more than 5s.
So I changed the code in z_supervisor and recompiled:
Et voilà! Zotonic started for the first time on our dedicated server
and we lived happily ever after with the shame of all the dumb things
we did to get this sorted.
Note: this last piece of code in z_supervisor is different in 0.7.3
and 0.7.4 but they both suffer from the same problem and the same fix
is applicable in both versions.
I think z_sites_manager should not assume anything regarding the time
needed by a given site to start. So it's seems perfectly logical to
have no timeout and let the site itself decide if its start time is
too much.
A better fix would be:
In z_supervisor:
start_child(Pid, Name, Timeout) ->
gen_server:call(Pid, {start_child, Name}, Timeout).
thanks for your detailed report. There's nothing wrong with it, however it might have lingered in mailing list queue for a while (we review all messages first time somebody sends something, to prevent spam).
The issue that sometimes a site takes too long to startup is indeed very annoying and leads to strange behaviour.
I am tempted to commit your solution (changing z_supervisor to allow a longer startup time for childs) but I'd first like some feedback from the other developers (Marc, Andreas, Maas?)
One more thought: choosing 'infinity' means you trust all sites (and modules!) not to do anything stupid in their startup code: the whole system may fail to start up if one site decides to block.
> I've had not feedback from my post below. Is something wrong with it?
> Kind regards,
> Emmanuel
> -----Message d'origine----- > De : eboutin [mailto:emmanuel.bou...@free.fr] > Envoy : samedi 4 f vrier 2012 14:52 > : Zotonic developers > Objet : The story of an annoying bug that got nailed and fixed ;-)
> Hi,
> We chose Zotonic a while ago as the target framework for building our > web site. > Everything was nice and smooth while developping on our pcs (both > under Windows and Debian boxes under VirtualBox).
> And then... we decided it was time to deploy Zotonic on our dedicated > server. > And then... Zotonic refused to start.
> In the logs, we had something like :
> =INFO REPORT==== 29-Jan-2012::14:57:07 === > DEBUG: z_install_data:196 "Inserting username for the admin"
> Over several days we tried (almost) everything to get it right: > - zotonic 0.7.3 > - zotonic 0.7.4 > - hacking the code in all sorts of inventive ways > - playing with pgsql settings > - creating new dbs > - re-installing pgsql > - formatting the server and reinstalling the OS (i'm quite ashamed of > that one)
> Nothing would do. And then I said to myself maybe i could use my > brains a little bit for once... doh!
> Everything we had experienced was consistent with the z_ids gen_server > stopped: > {noproc,{gen_server,call,[z_ids,{id,10}]}}}},
> What could have stopped the z_ids gen_server? Since z_ids is under a > supervision tree, it had to be another process that exited > prematurely.
> I used the dbg module to trace process events: > dbg:tracer(), > dbg:p(all,[p]).
> This logs all the spawn/exit process events. In the logs, I searched > the first occurence of exit that was not "exit normal". I got: > (<0.135.0>) unregister z_sites_manager > Terminating due to shutdown > (<0.135.0>) exit {timeout,{gen_server,call,[<0.136.0>, > {start_child,xxx}]}}
> So something went wrong inside start_child and caused a time out.
> I searched the spawn event that created process <0.135.0>. I got: > (<0.101.0>) spawn <0.135.0> as proc_lib:init_p(zotonic_sup, > [<0.79.0>],gen,init_it,[gen_server,<0.101.0>,<0.101.0>, > {local,z_sites_manager}, > z_sites_manager,[], > [{timeout,3600000}]]) > (<0.101.0>) link <0.135.0> > (<0.135.0>) register z_sites_manager
> This confirmed the issue was inside z_sites_manager:start_child.
> The gen_server documentation confirmed what i remembered: > gen_server:call can time out with a default timeout of 5s.
> I realised that the dedicated server is not as powerful as our desktop > beasts. So it's normal it takes longer to populate zotonic's tables at > first startup. Definitely more than 5s.
> So I changed the code in z_supervisor and recompiled:
> Et voil ! Zotonic started for the first time on our dedicated server > and we lived happily ever after with the shame of all the dumb things > we did to get this sorted.
> Note: this last piece of code in z_supervisor is different in 0.7.3 > and 0.7.4 but they both suffer from the same problem and the same fix > is applicable in both versions.
> I think z_sites_manager should not assume anything regarding the time > needed by a given site to start. So it's seems perfectly logical to > have no timeout and let the site itself decide if its start time is > too much.
I understand your point: one site should not be allowed to affect the
others.
My point is that zotonic should not kill itself because of a timeout
at the z_sites_manager/z_supervisor level.
You can always have a situation where a "should be long enough"
timeout is not long enough.
So in my mind:
- if a site takes too long to start because of the zotonic code then
that's a zotonic bug and it should be fixed in zotonic
- if a site takes too long to start because of the way it behaves in
its private code, then zotonic should be able to protect itself from
bad site behaviour
Maybe one cheap way of doing this would be to have timeout checks in
place at the places where zotonic calls site code.
The timeout could then come from the site config so that the site has
a chance to say what "long enough" means for itself.
I'm not familiar enough with zotonic insides to know where those
transitions from zotonic code to site code are made.
A cheaper alternative would be to use the 'infinity' timeout only when
the site starts for the first time but I found no easy way to
implement this: we only know if the site starts for the first time
later, inside z_installer, where we have access to the site's db.
Regards,
Emmanuel
On 12 fév, 08:24, Arjan Scherpenisse <ar...@scherpenisse.net> wrote:
> thanks for your detailed report. There's nothing wrong with it, however
> it might have lingered in mailing list queue for a while (we review all
> messages first time somebody sends something, to prevent spam).
> The issue that sometimes a site takes too long to startup is indeed very
> annoying and leads to strange behaviour.
> I am tempted to commit your solution (changing z_supervisor to allow a
> longer startup time for childs) but I'd first like some feedback from
> the other developers (Marc, Andreas, Maas?)
> One more thought: choosing 'infinity' means you trust all sites (and
> modules!) not to do anything stupid in their startup code: the whole
> system may fail to start up if one site decides to block.
> Cheers,
> Arjan
> On 02/11/2012 09:53 PM, Emmanuel Boutin wrote:
> > Hi!
> > I've had not feedback from my post below. Is something wrong with it?
> > Kind regards,
> > Emmanuel
> > -----Message d'origine-----
> > De : eboutin [mailto:emmanuel.bou...@free.fr]
> > Envoy : samedi 4 f vrier 2012 14:52
> > : Zotonic developers
> > Objet : The story of an annoying bug that got nailed and fixed ;-)
> > Hi,
> > We chose Zotonic a while ago as the target framework for building our
> > web site.
> > Everything was nice and smooth while developping on our pcs (both
> > under Windows and Debian boxes under VirtualBox).
> > And then... we decided it was time to deploy Zotonic on our dedicated
> > server.
> > And then... Zotonic refused to start.
> > In the logs, we had something like :
> > =INFO REPORT==== 29-Jan-2012::14:57:07 ===
> > DEBUG: z_install_data:196 "Inserting username for the admin"
> > Over several days we tried (almost) everything to get it right:
> > - zotonic 0.7.3
> > - zotonic 0.7.4
> > - hacking the code in all sorts of inventive ways
> > - playing with pgsql settings
> > - creating new dbs
> > - re-installing pgsql
> > - formatting the server and reinstalling the OS (i'm quite ashamed of
> > that one)
> > Nothing would do. And then I said to myself maybe i could use my
> > brains a little bit for once... doh!
> > Everything we had experienced was consistent with the z_ids gen_server
> > stopped:
> > {noproc,{gen_server,call,[z_ids,{id,10}]}}}},
> > What could have stopped the z_ids gen_server? Since z_ids is under a
> > supervision tree, it had to be another process that exited
> > prematurely.
> > I used the dbg module to trace process events:
> > dbg:tracer(),
> > dbg:p(all,[p]).
> > This logs all the spawn/exit process events. In the logs, I searched
> > the first occurence of exit that was not "exit normal". I got:
> > (<0.135.0>) unregister z_sites_manager
> > Terminating due to shutdown
> > (<0.135.0>) exit {timeout,{gen_server,call,[<0.136.0>,
> > {start_child,xxx}]}}
> > So something went wrong inside start_child and caused a time out.
> > I searched the spawn event that created process <0.135.0>. I got:
> > (<0.101.0>) spawn <0.135.0> as proc_lib:init_p(zotonic_sup,
> > [<0.79.0>],gen,init_it,[gen_server,<0.101.0>,<0.101.0>,
> > {local,z_sites_manager},
> > z_sites_manager,[],
> > [{timeout,3600000}]])
> > (<0.101.0>) link <0.135.0>
> > (<0.135.0>) register z_sites_manager
> > This confirmed the issue was inside z_sites_manager:start_child.
> > The gen_server documentation confirmed what i remembered:
> > gen_server:call can time out with a default timeout of 5s.
> > I realised that the dedicated server is not as powerful as our desktop
> > beasts. So it's normal it takes longer to populate zotonic's tables at
> > first startup. Definitely more than 5s.
> > So I changed the code in z_supervisor and recompiled:
> > Et voil ! Zotonic started for the first time on our dedicated server
> > and we lived happily ever after with the shame of all the dumb things
> > we did to get this sorted.
> > Note: this last piece of code in z_supervisor is different in 0.7.3
> > and 0.7.4 but they both suffer from the same problem and the same fix
> > is applicable in both versions.
> > I think z_sites_manager should not assume anything regarding the time
> > needed by a given site to start. So it's seems perfectly logical to
> > have no timeout and let the site itself decide if its start time is
> > too much.
The change, as far as I can see from a quick glance, is to give sites a full minute to start; and a few more changes... Perhaps back-port this to 0.7.x?
> thanks for your detailed report. There's nothing wrong with it, however > it might have lingered in mailing list queue for a while (we review all > messages first time somebody sends something, to prevent spam).
> The issue that sometimes a site takes too long to startup is indeed very > annoying and leads to strange behaviour.
> I am tempted to commit your solution (changing z_supervisor to allow a > longer startup time for childs) but I'd first like some feedback from > the other developers (Marc, Andreas, Maas?)
> One more thought: choosing 'infinity' means you trust all sites (and > modules!) not to do anything stupid in their startup code: the whole > system may fail to start up if one site decides to block.
> Cheers, > Arjan
> On 02/11/2012 09:53 PM, Emmanuel Boutin wrote: >> Hi!
>> I've had not feedback from my post below. Is something wrong with it?
>> Kind regards,
>> Emmanuel
>> -----Message d'origine----- >> De : eboutin [mailto:emmanuel.bou...@free.fr] >> Envoyé : samedi 4 février 2012 14:52 >> À : Zotonic developers >> Objet : The story of an annoying bug that got nailed and fixed ;-)
>> Hi,
>> We chose Zotonic a while ago as the target framework for building our >> web site. >> Everything was nice and smooth while developping on our pcs (both >> under Windows and Debian boxes under VirtualBox).
>> And then... we decided it was time to deploy Zotonic on our dedicated >> server. >> And then... Zotonic refused to start.
>> In the logs, we had something like :
>> =INFO REPORT==== 29-Jan-2012::14:57:07 === >> DEBUG: z_install_data:196 "Inserting username for the admin"
>> Over several days we tried (almost) everything to get it right: >> - zotonic 0.7.3 >> - zotonic 0.7.4 >> - hacking the code in all sorts of inventive ways >> - playing with pgsql settings >> - creating new dbs >> - re-installing pgsql >> - formatting the server and reinstalling the OS (i'm quite ashamed of >> that one)
>> Nothing would do. And then I said to myself maybe i could use my >> brains a little bit for once... doh!
>> Everything we had experienced was consistent with the z_ids gen_server >> stopped: >> {noproc,{gen_server,call,[z_ids,{id,10}]}}}},
>> What could have stopped the z_ids gen_server? Since z_ids is under a >> supervision tree, it had to be another process that exited >> prematurely.
>> I used the dbg module to trace process events: >> dbg:tracer(), >> dbg:p(all,[p]).
>> This logs all the spawn/exit process events. In the logs, I searched >> the first occurence of exit that was not "exit normal". I got: >> (<0.135.0>) unregister z_sites_manager >> Terminating due to shutdown >> (<0.135.0>) exit {timeout,{gen_server,call,[<0.136.0>, >> {start_child,xxx}]}}
>> So something went wrong inside start_child and caused a time out.
>> I searched the spawn event that created process <0.135.0>. I got: >> (<0.101.0>) spawn <0.135.0> as proc_lib:init_p(zotonic_sup, >> [<0.79.0>],gen,init_it,[gen_server,<0.101.0>,<0.101.0>, >> {local,z_sites_manager}, >> z_sites_manager,[], >> [{timeout,3600000}]]) >> (<0.101.0>) link <0.135.0> >> (<0.135.0>) register z_sites_manager
>> This confirmed the issue was inside z_sites_manager:start_child.
>> The gen_server documentation confirmed what i remembered: >> gen_server:call can time out with a default timeout of 5s.
>> I realised that the dedicated server is not as powerful as our desktop >> beasts. So it's normal it takes longer to populate zotonic's tables at >> first startup. Definitely more than 5s.
>> So I changed the code in z_supervisor and recompiled:
>> Et voilà! Zotonic started for the first time on our dedicated server >> and we lived happily ever after with the shame of all the dumb things >> we did to get this sorted.
>> Note: this last piece of code in z_supervisor is different in 0.7.3 >> and 0.7.4 but they both suffer from the same problem and the same fix >> is applicable in both versions.
>> I think z_sites_manager should not assume anything regarding the time >> needed by a given site to start. So it's seems perfectly logical to >> have no timeout and let the site itself decide if its start time is >> too much.
> The change, as far as I can see from a quick glance, is to give sites > a full minute to start; and a few more changes... > Perhaps back-port this to 0.7.x?
> //Andreas
> 2012/2/12 Arjan Scherpenisse <ar...@scherpenisse.net>: >> Hi Emmanuel,
>> thanks for your detailed report. There's nothing wrong with it, however >> it might have lingered in mailing list queue for a while (we review all >> messages first time somebody sends something, to prevent spam).
>> The issue that sometimes a site takes too long to startup is indeed very >> annoying and leads to strange behaviour.
>> I am tempted to commit your solution (changing z_supervisor to allow a >> longer startup time for childs) but I'd first like some feedback from >> the other developers (Marc, Andreas, Maas?)
>> One more thought: choosing 'infinity' means you trust all sites (and >> modules!) not to do anything stupid in their startup code: the whole >> system may fail to start up if one site decides to block.
>> Cheers, >> Arjan
>> On 02/11/2012 09:53 PM, Emmanuel Boutin wrote: >>> Hi!
>>> I've had not feedback from my post below. Is something wrong with it?
>>> Kind regards,
>>> Emmanuel
>>> -----Message d'origine----- >>> De : eboutin [mailto:emmanuel.bou...@free.fr] >>> Envoy : samedi 4 f vrier 2012 14:52 >>> : Zotonic developers >>> Objet : The story of an annoying bug that got nailed and fixed ;-)
>>> Hi,
>>> We chose Zotonic a while ago as the target framework for building our >>> web site. >>> Everything was nice and smooth while developping on our pcs (both >>> under Windows and Debian boxes under VirtualBox).
>>> And then... we decided it was time to deploy Zotonic on our dedicated >>> server. >>> And then... Zotonic refused to start.
>>> In the logs, we had something like :
>>> =INFO REPORT==== 29-Jan-2012::14:57:07 === >>> DEBUG: z_install_data:196 "Inserting username for the admin"
>>> Over several days we tried (almost) everything to get it right: >>> - zotonic 0.7.3 >>> - zotonic 0.7.4 >>> - hacking the code in all sorts of inventive ways >>> - playing with pgsql settings >>> - creating new dbs >>> - re-installing pgsql >>> - formatting the server and reinstalling the OS (i'm quite ashamed of >>> that one)
>>> Nothing would do. And then I said to myself maybe i could use my >>> brains a little bit for once... doh!
>>> Everything we had experienced was consistent with the z_ids gen_server >>> stopped: >>> {noproc,{gen_server,call,[z_ids,{id,10}]}}}},
>>> What could have stopped the z_ids gen_server? Since z_ids is under a >>> supervision tree, it had to be another process that exited >>> prematurely.
>>> I used the dbg module to trace process events: >>> dbg:tracer(), >>> dbg:p(all,[p]).
>>> This logs all the spawn/exit process events. In the logs, I searched >>> the first occurence of exit that was not "exit normal". I got: >>> (<0.135.0>) unregister z_sites_manager >>> Terminating due to shutdown >>> (<0.135.0>) exit {timeout,{gen_server,call,[<0.136.0>, >>> {start_child,xxx}]}}
>>> So something went wrong inside start_child and caused a time out.
>>> I searched the spawn event that created process <0.135.0>. I got: >>> (<0.101.0>) spawn <0.135.0> as proc_lib:init_p(zotonic_sup, >>> [<0.79.0>],gen,init_it,[gen_server,<0.101.0>,<0.101.0>, >>> {local,z_sites_manager}, >>> z_sites_manager,[], >>> [{timeout,3600000}]]) >>> (<0.101.0>) link <0.135.0> >>> (<0.135.0>) register z_sites_manager
>>> This confirmed the issue was inside z_sites_manager:start_child.
>>> The gen_server documentation confirmed what i remembered: >>> gen_server:call can time out with a default timeout of 5s.
>>> I realised that the dedicated server is not as powerful as our desktop >>> beasts. So it's normal it takes longer to populate zotonic's tables at >>> first startup. Definitely more than 5s.
>>> So I changed the code in z_supervisor and recompiled:
>>> Et voil ! Zotonic started for the first time on our dedicated server >>> and we lived happily ever after with the shame of all the dumb things >>> we did to get this sorted.
>>> Note: this last piece of code in z_supervisor is different in 0.7.3 >>> and 0.7.4 but they both suffer from the same problem and the same fix >>> is applicable in both versions.
>>> I think z_sites_manager should not assume anything regarding the time >>> needed by a given site to start. So it's seems perfectly logical to >>> have no timeout and let the site itself decide if its start time is >>> too much.
OK I looked into merging Marc's "module stability fix" (769e1999afa) but it is part of a bigger change (module dependencies and changing the module startup sequence) that is too hard to merge properly. I'm afraid it cannot be backported to the 0.7 series.
Emmanuel, you are suggested to run on the git master while we are preparing the next stable 0.8 release ("real soon now!")
>> The change, as far as I can see from a quick glance, is to give sites >> a full minute to start; and a few more changes... >> Perhaps back-port this to 0.7.x?
>> //Andreas
>> 2012/2/12 Arjan Scherpenisse <ar...@scherpenisse.net>: >>> Hi Emmanuel,
>>> thanks for your detailed report. There's nothing wrong with it, however >>> it might have lingered in mailing list queue for a while (we review all >>> messages first time somebody sends something, to prevent spam).
>>> The issue that sometimes a site takes too long to startup is indeed very >>> annoying and leads to strange behaviour.
>>> I am tempted to commit your solution (changing z_supervisor to allow a >>> longer startup time for childs) but I'd first like some feedback from >>> the other developers (Marc, Andreas, Maas?)
>>> One more thought: choosing 'infinity' means you trust all sites (and >>> modules!) not to do anything stupid in their startup code: the whole >>> system may fail to start up if one site decides to block.
>>> Cheers, >>> Arjan
>>> On 02/11/2012 09:53 PM, Emmanuel Boutin wrote: >>>> Hi!
>>>> I've had not feedback from my post below. Is something wrong with it?
>>>> Kind regards,
>>>> Emmanuel
>>>> -----Message d'origine----- >>>> De : eboutin [mailto:emmanuel.bou...@free.fr] >>>> Envoy : samedi 4 f vrier 2012 14:52 >>>> : Zotonic developers >>>> Objet : The story of an annoying bug that got nailed and fixed ;-)
>>>> Hi,
>>>> We chose Zotonic a while ago as the target framework for building our >>>> web site. >>>> Everything was nice and smooth while developping on our pcs (both >>>> under Windows and Debian boxes under VirtualBox).
>>>> And then... we decided it was time to deploy Zotonic on our dedicated >>>> server. >>>> And then... Zotonic refused to start.
>>>> In the logs, we had something like :
>>>> =INFO REPORT==== 29-Jan-2012::14:57:07 === >>>> DEBUG: z_install_data:196 "Inserting username for the admin"
>>>> Over several days we tried (almost) everything to get it right: >>>> - zotonic 0.7.3 >>>> - zotonic 0.7.4 >>>> - hacking the code in all sorts of inventive ways >>>> - playing with pgsql settings >>>> - creating new dbs >>>> - re-installing pgsql >>>> - formatting the server and reinstalling the OS (i'm quite ashamed of >>>> that one)
>>>> Nothing would do. And then I said to myself maybe i could use my >>>> brains a little bit for once... doh!
>>>> Everything we had experienced was consistent with the z_ids gen_server >>>> stopped: >>>> {noproc,{gen_server,call,[z_ids,{id,10}]}}}},
>>>> What could have stopped the z_ids gen_server? Since z_ids is under a >>>> supervision tree, it had to be another process that exited >>>> prematurely.
>>>> I used the dbg module to trace process events: >>>> dbg:tracer(), >>>> dbg:p(all,[p]).
>>>> This logs all the spawn/exit process events. In the logs, I searched >>>> the first occurence of exit that was not "exit normal". I got: >>>> (<0.135.0>) unregister z_sites_manager >>>> Terminating due to shutdown >>>> (<0.135.0>) exit {timeout,{gen_server,call,[<0.136.0>, >>>> {start_child,xxx}]}}
>>>> So something went wrong inside start_child and caused a time out.
>>>> I searched the spawn event that created process <0.135.0>. I got: >>>> (<0.101.0>) spawn <0.135.0> as proc_lib:init_p(zotonic_sup, >>>> [<0.79.0>],gen,init_it,[gen_server,<0.101.0>,<0.101.0>, >>>> {local,z_sites_manager}, >>>> z_sites_manager,[], >>>> [{timeout,3600000}]]) >>>> (<0.101.0>) link <0.135.0> >>>> (<0.135.0>) register z_sites_manager
>>>> This confirmed the issue was inside z_sites_manager:start_child.
>>>> The gen_server documentation confirmed what i remembered: >>>> gen_server:call can time out with a default timeout of 5s.
>>>> I realised that the dedicated server is not as powerful as our desktop >>>> beasts. So it's normal it takes longer to populate zotonic's tables at >>>> first startup. Definitely more than 5s.
>>>> So I changed the code in z_supervisor and recompiled:
>>>> Et voil ! Zotonic started for the first time on our dedicated server >>>> and we lived happily ever after with the shame of all the dumb things >>>> we did to get this sorted.
>>>> Note: this last piece of code in z_supervisor is different in 0.7.3 >>>> and 0.7.4 but they both suffer from the same problem and the same fix >>>> is applicable in both versions.
>>>> I think z_sites_manager should not assume anything regarding the time >>>> needed by a given site to start. So it's seems perfectly logical to >>>> have no timeout and let the site itself decide if its start time is >>>> too much.
> OK I looked into merging Marc's "module stability fix" (769e1999afa) but
> it is part of a bigger change (module dependencies and changing the
> module startup sequence) that is too hard to merge properly.
> I'm afraid it cannot be backported to the 0.7 series.
> Emmanuel, you are suggested to run on the git master while we are
> preparing the next stable 0.8 release ("real soon now!")
> Arjan
> On 02/13/2012 02:27 PM, Arjan Scherpenisse wrote:
> > That sounds like a good idea..
> > I'm actually never experiencing this startup annoyance anymore, but that
> > is probably because I only run master :-)
> > If no further objections I'll backport it and churn out a release
> > somewhere this week.
> >> The change, as far as I can see from a quick glance, is to give sites
> >> a full minute to start; and a few more changes...
> >> Perhaps back-port this to 0.7.x?
> >> //Andreas
> >> 2012/2/12 Arjan Scherpenisse <ar...@scherpenisse.net>:
> >>> Hi Emmanuel,
> >>> thanks for your detailed report. There's nothing wrong with it, however
> >>> it might have lingered in mailing list queue for a while (we review all
> >>> messages first time somebody sends something, to prevent spam).
> >>> The issue that sometimes a site takes too long to startup is indeed very
> >>> annoying and leads to strange behaviour.
> >>> I am tempted to commit your solution (changing z_supervisor to allow a
> >>> longer startup time for childs) but I'd first like some feedback from
> >>> the other developers (Marc, Andreas, Maas?)
> >>> One more thought: choosing 'infinity' means you trust all sites (and
> >>> modules!) not to do anything stupid in their startup code: the whole
> >>> system may fail to start up if one site decides to block.
> >>>> I've had not feedback from my post below. Is something wrong with it?
> >>>> Kind regards,
> >>>> Emmanuel
> >>>> -----Message d'origine-----
> >>>> De : eboutin [mailto:emmanuel.bou...@free.fr]
> >>>> Envoy : samedi 4 f vrier 2012 14:52
> >>>> : Zotonic developers
> >>>> Objet : The story of an annoying bug that got nailed and fixed ;-)
> >>>> Hi,
> >>>> We chose Zotonic a while ago as the target framework for building our
> >>>> web site.
> >>>> Everything was nice and smooth while developping on our pcs (both
> >>>> under Windows and Debian boxes under VirtualBox).
> >>>> And then... we decided it was time to deploy Zotonic on our dedicated
> >>>> server.
> >>>> And then... Zotonic refused to start.
> >>>> In the logs, we had something like :
> >>>> =INFO REPORT==== 29-Jan-2012::14:57:07 ===
> >>>> DEBUG: z_install_data:196 "Inserting username for the admin"
> >>>> Over several days we tried (almost) everything to get it right:
> >>>> - zotonic 0.7.3
> >>>> - zotonic 0.7.4
> >>>> - hacking the code in all sorts of inventive ways
> >>>> - playing with pgsql settings
> >>>> - creating new dbs
> >>>> - re-installing pgsql
> >>>> - formatting the server and reinstalling the OS (i'm quite ashamed of
> >>>> that one)
> >>>> Nothing would do. And then I said to myself maybe i could use my
> >>>> brains a little bit for once... doh!
> >>>> Everything we had experienced was consistent with the z_ids gen_server
> >>>> stopped:
> >>>> {noproc,{gen_server,call,[z_ids,{id,10}]}}}},
> >>>> What could have stopped the z_ids gen_server? Since z_ids is under a
> >>>> supervision tree, it had to be another process that exited
> >>>> prematurely.
> >>>> I used the dbg module to trace process events:
> >>>> dbg:tracer(),
> >>>> dbg:p(all,[p]).
> >>>> This logs all the spawn/exit process events. In the logs, I searched
> >>>> the first occurence of exit that was not "exit normal". I got:
> >>>> (<0.135.0>) unregister z_sites_manager
> >>>> Terminating due to shutdown
> >>>> (<0.135.0>) exit {timeout,{gen_server,call,[<0.136.0>,
> >>>> {start_child,xxx}]}}
> >>>> So something went wrong inside start_child and caused a time out.
> >>>> I searched the spawn event that created process <0.135.0>. I got:
> >>>> (<0.101.0>) spawn <0.135.0> as proc_lib:init_p(zotonic_sup,
> >>>> [<0.79.0>],gen,init_it,[gen_server,<0.101.0>,<0.101.0>,
> >>>> {local,z_sites_manager},
> >>>> z_sites_manager,[],
> >>>> [{timeout,3600000}]])
> >>>> (<0.101.0>) link <0.135.0>
> >>>> (<0.135.0>) register z_sites_manager
> >>>> This confirmed the issue was inside z_sites_manager:start_child.
> >>>> The gen_server documentation confirmed what i remembered:
> >>>> gen_server:call can time out with a default timeout of 5s.
> >>>> I realised that the dedicated server is not as powerful as our desktop
> >>>> beasts. So it's normal it takes longer to populate zotonic's tables at
> >>>> first startup. Definitely more than 5s.
> >>>> So I changed the code in z_supervisor and recompiled:
> >>>> Et voil ! Zotonic started for the first time on our dedicated server
> >>>> and we lived happily ever after with the shame of all the dumb things
> >>>> we did to get this sorted.
> >>>> Note: this last piece of code in z_supervisor is different in 0.7.3
> >>>> and 0.7.4 but they both suffer from the same problem and the same fix
> >>>> is applicable in both versions.
> >>>> I think z_sites_manager should not assume anything regarding the time
> >>>> needed by a given site to start. So it's seems perfectly logical to
> >>>> have no timeout and let the site itself decide if its start time is
> >>>> too much.