Wget request from Linux-PHP [expert] ?

Pseudonyme

Jun 17, 2009, 9:19:03 AM

to avi...@adverland.com

Hi,

I work in an administrative department where I have to retrieve
information about 1,500 companies from a state website every day.
Doing the job manually takes a couple of hours, so I am trying to
automate the process.

I am trying to wget the data from a Linux/PHP server. Do you know how
to retrieve information that needs a more complicated wget request?

E.g. the KOUGLOFF company.

The state website is: http://www.infogreffe.fr/infogreffe/reset.do
The SIREN entry is: 448676973
Doing the search manually returns the information about KOUGLOFF.

Once I have a session ID, I can copy and paste this link to retrieve
the info:
http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml?siren=448676973

I wrote a Unix application that is ready to retrieve my 1,500
companies every day using repeated wget calls.

The problem is that, without a session ID, my server cannot wget the
file. I tried adding a session ID from a parallel session to the URL,
but that does not get me to the content:
http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml?siren=448676973&bjsessionid=fzyzK4fLT8nn96Qv7L1TmLKVBKwv1Km0QnXqLQq53DgYhqGRFwjD!-1192967962!1225242595

From your knowledge and experience, do you know how I could retrieve
the information?
Adding a session ID to the URL did not work from a Linux script.

Thank you very much for any help, solution or advice.

Norman.

Erwin Moller

Jun 17, 2009, 10:20:40 AM

Pseudonyme wrote:
> Hi,
>

Hello Norman,

> I work in an administrative department where I have to retrieve
> information about 1,500 companies from a state website every day.
> Doing the job manually takes a couple of hours, so I am trying to
> automate the process.

Makes sense.

>
> I am trying to wget the data from a Linux/PHP server. Do you know how
> to retrieve information that needs a more complicated wget request?

Why use WGET if you are on PHP?
Why not simply use file() or file_get_contents() and feed it a URL?
http://nl3.php.net/manual/en/function.file.php

>
> E.g. the KOUGLOFF company.
>
> The state website is: http://www.infogreffe.fr/infogreffe/reset.do
> The SIREN entry is: 448676973
> Doing the search manually returns the information about KOUGLOFF.
>
> Once I have a session ID, I can copy and paste this link to retrieve
> the info:
> http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml?siren=448676973
>
>
> I wrote a Unix application that is ready to retrieve my 1,500
> companies every day using repeated wget calls.
>
> The problem is that, without a session ID, my server cannot wget the
> file. I tried adding a session ID from a parallel session to the URL,
> but that does not get me to the content:
> http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml?siren=448676973&bjsessionid=fzyzK4fLT8nn96Qv7L1TmLKVBKwv1Km0QnXqLQq53DgYhqGRFwjD!-1192967962!1225242595
>

OK, that makes sense.
You must first log in, so you have an active session.
After that you can get the info.

A few remarks:
1) Judging by its length, this is not a standard PHP-generated session
ID; those are shorter.
2) Maybe they only accept the session ID via a cookie instead of GET.

If I were you I would start by figuring out how the session ID is
transferred to you. Is it in a cookie? In a form? (Apparently it is not
meant to be in the URL.)

Maybe consider using CURL instead of WGET or file() as I suggested above.
http://nl3.php.net/manual/en/book.curl.php

Using CURL, you can add cookies that contain the session ID, and also
mimic POSTs reliably.
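
A minimal sketch of that idea with the curl command-line tool, for
illustration only: the helper below presents a session ID as a cookie
and sends the SIREN as POST data. The helper name fetch_with_session,
the cookie name JSESSIONID and the form field name "siren" are
assumptions; nothing in this thread confirms what the site expects.

```shell
#!/bin/sh
# Hypothetical helper: fetch a company record while presenting a session ID
# as a cookie and sending the SIREN as POST data.
# Assumptions: the cookie is named JSESSIONID and the form field is "siren".
fetch_with_session() {
    session_id=$1
    siren=$2
    # --cookie with a NAME=VALUE string sends that cookie with the request;
    # --data turns the request into a POST with the given body.
    curl --silent \
         --cookie "JSESSIONID=${session_id}" \
         --data "siren=${siren}" \
         "http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml"
}

# Usage (with a session ID copied from a browser session, for example):
#   fetch_with_session "$SESSION_ID" 448676973
```

The same options exist in PHP's cURL extension, so the approach does
not depend on running it from the shell.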

Regards,
Erwin Moller

--
"There are two ways of constructing a software design: One way is to
make it so simple that there are obviously no deficiencies, and the
other way is to make it so complicated that there are no obvious
deficiencies. The first method is far more difficult."
-- C.A.R. Hoare

Nick Birnie

Jun 17, 2009, 10:51:51 AM

On 06/17/2009 02:19 PM, Pseudonyme wrote:
> Hi,
>
> I work in an administrative department where I have to retrieve
> information about 1,500 companies from a state website every day.
> Doing the job manually takes a couple of hours, so I am trying to
> automate the process.
>
> I am trying to wget the data from a Linux/PHP server. Do you know how
> to retrieve information that needs a more complicated wget request?
>
> E.g. the KOUGLOFF company.
>
> The state website is: http://www.infogreffe.fr/infogreffe/reset.do
> The SIREN entry is: 448676973
> Doing the search manually returns the information about KOUGLOFF.
>
> Once I have a session ID, I can copy and paste this link to retrieve
> the info:
> http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml?siren=448676973
>

Looking at the HTTP headers (http://pastebin.com/f8ebfee), retrieving
the above page involves 2 redirects and 1 cookie.
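
As a rough sketch of how redirects and a cookie can be handled from the
command line: curl can follow redirects and keep cookies in a file
between requests. The helper name fetch_company and the two-step order
(entry page first, then the record) are assumptions, and this has not
been verified against the site.

```shell
#!/bin/sh
# Hypothetical helper: visit the entry page first so the server can set its
# cookie, then request the company page with the same cookie file.
#   --location    follows HTTP redirects
#   --cookie-jar  writes cookies set by the server to a file
#   --cookie      reads cookies from a file (when the value has no "=")
fetch_company() {
    siren=$1
    jar=$2   # path of a cookie file, e.g. one created with mktemp

    # Step 1: load the entry page and store whatever cookie it sets.
    curl --silent --location --cookie-jar "$jar" --output /dev/null \
         "http://www.infogreffe.fr/infogreffe/reset.do" || return 1

    # Step 2: request the record, sending the stored cookies back.
    curl --silent --location --cookie "$jar" --cookie-jar "$jar" \
         "http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml?siren=${siren}"
}

# Usage:
#   jar=$(mktemp) && fetch_company 448676973 "$jar"; rm -f "$jar"
```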

>
> I wrote a Unix application that is ready to retrieve my 1,500
> companies every day using repeated wget calls.
>
> The problem is that, without a session ID, my server cannot wget the
> file. I tried adding a session ID from a parallel session to the URL,
> but that does not get me to the content:
> http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml?siren=448676973&bjsessionid=fzyzK4fLT8nn96Qv7L1TmLKVBKwv1Km0QnXqLQq53DgYhqGRFwjD!-1192967962!1225242595

If wget is giving you problems, then you can either try and fix them,
or else use another API. Something like libcurl will handle your network
IO for you and provide an API to read and set the HTTP headers.

> From your knowledge and experience, do you know how I could retrieve
> the information? Adding a session ID to the URL did not work from a
> Linux script.

I would look into using libcurl based on what you've said and the
headers. It'd be a simple matter to parse the cookies and session ID to
use in subsequent requests.
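
One way the parsing step might look with standard Unix tools, as a
sketch: the helper below reads response headers and prints the value of
a named cookie. The helper name cookie_value and the cookie name
JSESSIONID in the usage line are assumptions, not taken from the
headers linked above.

```shell
#!/bin/sh
# Hypothetical helper: read HTTP response headers on stdin and print the
# value of one cookie, taken from the first matching Set-Cookie line.
# $1 is the cookie name (JSESSIONID in the usage below is an assumed name).
cookie_value() {
    name=$1
    # Strip carriage returns, keep only the text between "NAME=" and the
    # first ";", and stop after the first match.
    tr -d '\r' |
    sed -n "s/^[Ss]et-[Cc]ookie: *${name}=\([^;]*\).*/\1/p" |
    head -n 1
}

# Usage against a live server (headers to stdout, body discarded):
#   curl --silent --dump-header - --output /dev/null "$url" | cookie_value JSESSIONID
```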

Pseudonyme

Jun 17, 2009, 12:25:13 PM

Hi,
Thanks. So I am now carefully reading up on the curl command instead
of the wget command.

curl can be called from at least the Linux command line and PHP. My
first impression is that the Linux command works much better. The
website listed above does not seem to use cookies.
Access to the KOUGLOFF company details depends on session
recognition. The principle seems to be: no session, no access to the
company details.

The problem is that session handling with curl under Linux is not
easy to work out from the documentation:
http://linux.about.com/od/commands/l/blcmdl1_curl.htm
There is information about cookies but nothing about sessions.

curl_init(), curl_setopt(), curl_exec() and curl_close() seem to be
available only in PHP.

Thank you very much for any help, working solutions or advice.

Norman.

Erwin Moller

Jun 18, 2009, 3:13:53 AM

Pseudonyme wrote:

What part exactly of my previous answer did you not understand?

Pseudonyme

Jun 18, 2009, 4:02:56 AM

Hi,
To retrieve the content, I will not use wget, file() or
file_get_contents(). I believe curl from a Unix script is more
effective.
The point is that curl has to open a session to access the detailed
content.
Opening a session with the Unix curl command is not covered in the
documentation.
Do you know how to open a session using the Unix curl command?
Norman

Jerry Stuckle

Jun 18, 2009, 6:48:23 AM

HTTP is a stateless protocol - there is no such thing as a session in
the protocol. The only "session" is that which is defined by the server
software. Therefore, there is no special API for establishing a session.

For this session to work, the server sends a session id to the client,
and the client responds with this session id on each request. The
session id may be stored on the client in a cookie (the most common
case), or it may be a parameter in the URI (generally when the client
does not support cookies). cURL can handle cookies just fine; if
instead it's part of the URI you'll have to parse the page containing
the link to get the session id.
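
For the URI case, a minimal sketch of the parsing step: the helper
below pulls a session ID out of a link in a fetched page. It assumes
the server rewrites links in the ";jsessionid=VALUE" form used by many
Java servlet containers; whether this site does so is not established
in this thread, and the helper name session_id_from_page is made up for
the example.

```shell
#!/bin/sh
# Hypothetical helper: read HTML on stdin and print the first session ID
# found in a rewritten link of the form  ...;jsessionid=VALUE?query
# Assumption: the server embeds the ID with the ";jsessionid=" convention.
session_id_from_page() {
    # Capture everything after ";jsessionid=" up to the next quote, "?",
    # "#", "&" or space, and stop after the first matching line.
    sed -n 's/.*;jsessionid=\([^"?#& ]*\).*/\1/p' | head -n 1
}

# Usage:
#   sid=$(curl --silent "$entry_url" | session_id_from_page)
```

The extracted value would then go back into the URI of the next
request. For the cookie case no parsing is needed, since curl can store
and resend cookies itself.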

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstu...@attglobal.net
==================

Danny Wilkerson

Jun 18, 2009, 8:44:03 AM

You feel like you are beating your head against a wall? Hehe. This
group is PHP based so it's OK to assume we are talking about PHP.
Some people just don't understand what is going on and they expect you
to do their work for them. It's not hard. Your answer was perfect and
if he/she does not get it, let them go somewhere else.

Jerry Stuckle

Jun 18, 2009, 10:25:31 AM

No, I'll try to help those who are interested in learning. The OP
obviously is not familiar with how sessions work, which is quite common.
Most of the time it can be somewhat ignored because PHP does most of
the session handling behind the scenes. However, when you start trying
to do what the OP is talking about, it requires a somewhat better
understanding of how sessions work.

Pseudonyme

Jun 18, 2009, 11:59:51 AM

Hi all, about the curl command:
1) It works fine and returns the content properly for websites that
are not obscured, i.e. transparent websites like:
> curl 'http://www.abca.com' | more

2) Properly retrieving the content for the KOUGLOF company

from here: http://tinyurl.com/lyhsmh

SIREN : 453786980

That is a major problem. We tried sending the SIREN as POST data:

> curl -F "siren=453786980" http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml | more
> curl -F "siren=@453786980" http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml | more
> curl -d "siren=453786980" http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml?siren=453786980

No content at all can be retrieved! And we carefully read each of your
messages three times, as well as all the official curl documentation
for Unix.

This one suffers from the same transparency problem:
http://avis-situation-sirene.insee.fr/avisitu/IdentificationListeSiret.do?bSubmit=Avis%20de%20Situation&nic=00013&siren=512269432

The problem is that I have to present my initial findings to my
supervisor, and I do not see how to make progress and get the answer.
Thank you for any help or working solutions,
Norman

Tauno Voipio

Jun 18, 2009, 12:37:42 PM

Obviously, you are trying to do something the website
designer has decided to prevent. He wants to stop
automated vacuuming of his database without harming
genuine manual queries.

One reason may be that your curl command does not
send the proper browser headers with the request.
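
A sketch of sending browser-like headers with curl, in case that is the
cause: the User-Agent string, Referer and Accept-Language values below
are illustrative assumptions (as is the helper name), and the real ones
would have to be copied from an actual browser request. There is no
indication in this thread that headers alone are enough for this site.

```shell
#!/bin/sh
# Hypothetical helper: request a URL with headers resembling a browser's.
# All header values here are placeholders chosen for illustration.
fetch_like_browser() {
    url=$1
    curl --silent --location \
         --user-agent "Mozilla/5.0 (X11; Linux x86_64)" \
         --referer "http://www.infogreffe.fr/infogreffe/reset.do" \
         --header "Accept-Language: fr-FR,fr;q=0.8" \
         "$url"
}

# Usage:
#   fetch_like_browser "http://www.infogreffe.fr/infogreffe/reset.do"
```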

--

Tauno Voipio
tauno voipio (at) iki fi

Jerry Stuckle

Jun 18, 2009, 1:39:17 PM

That's because the site is using JavaScript on the button clicks. You
will have to emulate the JavaScript to get it to work - and this is
likely to fail if they change the page and/or JavaScript.

I'm with Tauno on this one - the site is obviously designed to prevent
what you're trying to do. I would recommend you contact the company to
see if there is another way to retrieve the data.