"Failed to fetch ads from <my IP address:9501>"


Nate

Oct 16, 2009, 3:14:55 PM
to Archer User's Group
condor_q reports:

-- Failed to fetch ads from: <5.95.101.95:9501> : C095101095.ipop
CEDAR:6001:Failed to connect to <5.95.101.95:9501>

C095101095 is my host, and 5.95.101.95 is my IP address. I restarted
the machine, to no avail. This happened Wednesday and is still
happening today.

Any suggestions?
- Nathan

David Isaac Wolinsky

Oct 16, 2009, 4:11:59 PM
to archer-us...@googlegroups.com
I think this is your problem:
https://lists.cs.wisc.edu/archive/condor-users/pre-2004-June/msg01095.shtml

note, the file they refer to is located under
/opt/condor/var/spool

short answer: deleting /opt/condor/var/spool/job_queue.log and then running
/opt/condor/sbin/condor_master (as root) should fix the problem
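
For anyone finding this later, here is a rough sketch of that first step as a shell function. The function name is made up for illustration, and it moves the file to a .bak copy instead of deleting it, which is my own cautious variation and not what the linked post prescribes. The default path is the one given above.

```shell
# Hypothetical helper: move a (possibly corrupt) job_queue.log out of the way.
# $1 = spool directory; defaults to the path mentioned in this thread.
reset_job_queue() {
    spool_dir="${1:-/opt/condor/var/spool}"
    log="$spool_dir/job_queue.log"
    if [ ! -f "$log" ]; then
        echo "no job_queue.log in $spool_dir" >&2
        return 1
    fi
    # Assumption: keep a backup rather than deleting the file outright.
    mv "$log" "$log.bak" && echo "$log.bak"
}
```

As root, something like `reset_job_queue && /opt/condor/sbin/condor_master` would then follow the short answer. Since the file appears to hold the schedd's queue, expect any jobs that were queued to be gone afterwards.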

Sorry for the inconvenience, maybe the Condor guys can give more context.

Regards,
David

Nathan Blythe

Oct 16, 2009, 8:51:26 PM
to archer-us...@googlegroups.com
Wow, yes, that fixed it, thanks!

I'm curious how you divined that particular solution from my rather
vague description. The linked thread doesn't mention the error I
received; the only connection seems to be the "corruption" message
(which, upon inspection, I did have).

Thanks again.

David Isaac Wolinsky

Oct 17, 2009, 10:31:15 AM
to archer-us...@googlegroups.com
Nathan Blythe wrote:
> Wow, yes, that fixed it, thanks!
>
> I'm curious how you divined that particular solution from my rather
> vague description. The linked thread doesn't mention the error I
> received; the only connection seems to be the "corruption" message
> (which, upon inspection, I did have).
>
Hidden away in antiquity:
http://www.grid-appliance.org/wiki/index.php/Archer:FAQs#What_privilege_does_a_job_from_remote_user_.22X.22_have_when_running_in_my_resource.3F

We can ssh into your machine to assist in debugging. It probably would
have been more proper to iterate with you through the problem and not
log into your machine. Though this is what I did...

- ps uax | grep condor
there was no condor_schedd
- check /opt/condor/var/log/
Something was stopping the condor_schedd from starting
- tried to manually restart the entire condor stack, no luck
- googled and found a solution, but I didn't want to interfere with what
you were doing, so I left the resolution up to you.
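
The first check in that list can be sketched as a small shell function. This is only an illustration: the function name is hypothetical, and it expects a plain list of process names on stdin (one per line) rather than full `ps uax` output.

```shell
# Hypothetical helper: read process names (one per line) on stdin and report
# whether the two Condor daemons named in this thread are among them.
check_condor() {
    procs=$(cat)
    for d in condor_master condor_schedd; do
        # -x matches whole lines only, so "condor_schedd_old" would not count.
        if printf '%s\n' "$procs" | grep -qx -- "$d"; then
            echo "$d: running"
        else
            echo "$d: NOT running"
        fi
    done
}
```

It could be fed with `ps -e -o comm= | check_condor`. If condor_schedd shows as not running, the logs under /opt/condor/var/log/ are the next place to look; the schedd's log is commonly named SchedLog, though that is an assumption about this particular install.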

Cheers,
David

Nathan Blythe

Oct 17, 2009, 7:20:30 PM
to archer-us...@googlegroups.com
Thanks for the info, good to know how to approach the problem. I
didn't realize you had root access to everybody's machines in Archer;
I'm actually kind of surprised about that, but I guess it does make
debugging a lot easier as you don't have to walk every user through
the details of the VM and condor, when most people are probably just
interested in getting their jobs running.

Thanks again,
Nathan

rjo...@gmail.com

Oct 17, 2009, 8:53:06 PM
to Archer User's Group


On Oct 17, 7:20 pm, Nathan Blythe <nbly...@gmail.com> wrote:
> Thanks for the info, good to know how to approach the problem.  I
> didn't realize you had root access to everybody's machines in Archer;
> I'm actually kind of surprised about that, but I guess it does make
> debugging a lot easier as you don't have to walk every user through
> the details of the VM and condor, when most people are probably just
> interested in getting their jobs running.

If I recall correctly, we explain in the terms of use box that Archer
management has admin access to the appliances for troubleshooting. We
should double-check that this is still there. We should also update the
FAQ.

(Users can disable this by removing the Archer admin ssh key from the
appliance, but it's an important tool for us to debug problems).

--rf

