I work in an admin department where I have to retrieve information
about 1.500 companies from a State website on a daily basis. Doing the
job manually takes a couple of hours, so I am trying to automate the
process to save effort.
I am trying to wget the data from a Linux/PHP server. Do you know how to
retrieve information that needs a complicated wget call?
Eg : the KOUGLOFF company.
State website is : http://www.infogreffe.fr/infogreffe/reset.do
Siren entry is : 448676973
Doing the job manually provides the information regarding KOUGLOFF.
Once I get a session ID, I can copy and paste this link to retrieve
the info :
http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml?siren=448676973
I created a Unix application ready to retrieve my 1.500 companies
daily using repeated wget calls.
The problem is that, as I do not have a session ID, my server cannot wget
the file. I tried to add a parallel session ID to access the file, but
that does not lead me to the content :
http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml?siren=448676973&bjsessionid=fzyzK4fLT8nn96Qv7L1TmLKVBKwv1Km0QnXqLQq53DgYhqGRFwjD!-1192967962!1225242595
From your knowledge and experience: do you know how I could retrieve
the information?
Adding a session ID to the URL did not work from a Linux
script.
Thank you very much for any help, solution or advice.
Norman.
Hello Norman,
> I am working in an admin department where I have to retrieve the
> information from a State Website regarding 1.500 companies on a daily
> basis. Manually doing the job represents a couple of hours. I try to
> computerize that process to save energy.
Makes sense.
>
> I try to wget the data from a Linux/PHP server. Do you know how to
> retrieve information from complicated WGET ?
Why use wget if you are on PHP?
Why not simply use file() or file_get_contents() and feed it a URL?
http://nl3.php.net/manual/en/function.file.php
>
> Eg : the KOUGLOFF company.
>
> State website is : http://www.infogreffe.fr/infogreffe/reset.do
> Siren entry is : 448676973
> Manually doing the job provide the information regarding KOUGLOFF.
>
> Once, I get a session ID, I can copy and paste that link to retrieve
> the info :
> http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml?siren=448676973
>
>
> I created a unix application ready to retrieve my daily 1.500
> companies using recurring WGET.
>
> Problem is that, as I do not have a session ID, my server cannot WGET
> the file. I try to add a parallel session ID to access the file but
> that does not lead me to the content :
> http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml?siren=448676973&bjsessionid=fzyzK4fLT8nn96Qv7L1TmLKVBKwv1Km0QnXqLQq53DgYhqGRFwjD!-1192967962!1225242595
>
OK, that makes sense.
You must first log in, so you have an active session.
After that you can get the info.
A few remarks:
1) Judging by the length of the session ID, this is not a standard
PHP-generated session ID, which is shorter.
2) Maybe they only accept the session ID via a cookie instead of GET.
If I were you I would start by figuring out how the session ID is
transferred to you. Is it in a cookie? In a form? (Apparently it is not
meant to be in the URL.)
Maybe consider using CURL instead of WGET or file() as I suggested above.
http://nl3.php.net/manual/en/book.curl.php
Using CURL, you can add cookies that contain the session ID, and also
mimic POSTs reliably.
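
A minimal shell sketch of the cookie approach, assuming the site hands out
its session ID in a cookie when the entry page is visited (this is not
confirmed anywhere in the thread). The function name fetch_company and the
cookies.txt jar file are hypothetical; the two URLs are the ones quoted
above. curl's -c option saves received cookies to a file, -b sends them
back, and -L follows redirects.

```shell
#!/bin/sh
# Sketch only: keep the server's session cookie between two curl calls.
# Whether the site accepts this sequence is an assumption.
BASE="http://www.infogreffe.fr/infogreffe"

fetch_company() {
    siren="$1"
    jar="${2:-cookies.txt}"   # hypothetical cookie jar file name
    # Step 1: visit the entry page. -c writes any cookies received to the
    # jar, -L follows redirects, -o /dev/null discards the page body.
    curl -s -L -c "$jar" -o /dev/null "$BASE/reset.do" || return 1
    # Step 2: request the company page, sending the stored cookies back
    # (-b) and saving any updated ones (-c).
    curl -s -L -b "$jar" -c "$jar" \
        "$BASE/newRechercheEntreprise.xml?siren=$siren"
}

# Usage (performs real network requests):
#   fetch_company 448676973 > kougloff.xml
```

If the second call still returns no content, the session is probably being
established some other way than a plain cookie on the entry page.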
Regards,
Erwin Moller
--
"There are two ways of constructing a software design: One way is to
make it so simple that there are obviously no deficiencies, and the
other way is to make it so complicated that there are no obvious
deficiencies. The first method is far more difficult."
-- C.A.R. Hoare
Looking at the HTTP headers (http://pastebin.com/f8ebfee), retrieving
the above page involves 2 redirects and 1 cookie.
>
> I created a unix application ready to retrieve my daily 1.500
> companies using recurring WGET.
>
> Problem is that, as I do not have a session ID, my server cannot WGET
> the file. I try to add a parallel session ID to access the file but
> that does not lead me to the content :
> http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml?siren=448676973&bjsessionid=fzyzK4fLT8nn96Qv7L1TmLKVBKwv1Km0QnXqLQq53DgYhqGRFwjD!-1192967962!1225242595
If wget is giving you problems, then you can either try to fix them
or else use another API. Something like libcurl will handle your network
IO for you and provide an API to read and set the HTTP headers.
> From your knowledge and experience : Do you know how I could retrieve
> the information ? Adding a sessionid to the URL did not lead to a
> success for a Linux script.
I would look into using libcurl based on what you've said and the
headers. It'd be a simple matter to parse the cookies and session ID to
use in subsequent requests.
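
As a rough illustration of parsing the cookies, here is a small shell
helper that reads a cookie out of the file curl writes with -c. That file
uses the tab-separated Netscape cookie format (domain, flag, path, secure,
expiry, name, value). The function name cookie_value is hypothetical, and
the cookie name to look for (JSESSIONID in the usage line) is an
assumption about this site.

```shell
#!/bin/sh
# Print the value of cookie NAME from a Netscape-format cookie jar,
# such as the file written by "curl -c".
cookie_value() {
    jar="$1"
    name="$2"
    awk -F '\t' -v n="$name" '
        # curl marks HttpOnly cookies with a "#HttpOnly_" prefix on the
        # domain field; strip it so the line is parsed like any other.
        /^#HttpOnly_/ { sub(/^#HttpOnly_/, "") }
        # Skip comment lines and anything without the 7 expected fields.
        /^#/ || NF < 7 { next }
        # Field 6 is the cookie name, field 7 its value.
        $6 == n { print $7; exit }
    ' "$jar"
}

# Usage, after a request that saved cookies to cookies.txt:
#   sid=$(cookie_value cookies.txt JSESSIONID)
```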
curl can be called from both the Linux command line and PHP. In my
initial opinion, the Linux commands work much better. The website
listed above does not seem to use cookies.
Access to the KOUGLOFF company is granted through session
recognition. The principle seems to be: no session, no
access to the company details.
The problem is that session handling with curl under Linux is not
easy to work out from the documentation: http://linux.about.com/od/commands/l/blcmdl1_curl.htm
There is information about cookies but nothing about
sessions.
curl_init(), curl_setopt(), curl_exec(), curl_close() seem to be
available only in PHP.
Thank you very much for any help, operating solutions or advice.
Norman.
What part exactly of my previous answer did you not understand?
HTTP is a stateless protocol - there is no such thing as a session in
the protocol. The only "session" is that which is defined by the server
software. Therefore, there is no special API for establishing a session.
For this session to work, the server sends a session id to the client,
and the client responds with this session id on each request. The
session id may be stored on the client in a cookie (the most common
case), or it may be a parameter in the URI (generally when the client
does not support cookies). cURL can handle cookies just fine; if
instead it's part of the URI you'll have to parse the page containing
the link to get the session id.
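
For the URI case, a rough shell sketch of pulling the session ID out of a
fetched page. The helper name extract_session_id is hypothetical, and the
default parameter name jsessionid is an assumption (Java servers commonly
append it as ";jsessionid=..."); pass another name as the first argument
if the site uses a different one.

```shell
#!/bin/sh
# Read HTML (or a URL) on stdin and print the first value found for the
# given session parameter. Default parameter name is an assumption.
extract_session_id() {
    param="${1:-jsessionid}"
    # The parameter must follow ';', '?' or '&'. Its value ends at the
    # first quote, '&', '?', '#', '<', '>' or space.
    sed -n "s/.*[;?&]${param}=\([^\"'&?#<> ]*\).*/\1/p" | head -n 1
}

# Usage (performs a real network request):
#   sid=$(curl -s -L http://www.infogreffe.fr/infogreffe/reset.do | extract_session_id)
```

The extracted value would then be appended to each later request URL in
the same form in which the server produced it.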
--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstu...@attglobal.net
==================
No, I'll try to help those who are interested in learning. The op
obviously is not familiar with how sessions work, which is quite common.
Most of the time it can be somewhat ignored because PHP does most of
the session handling behind the scenes. However, when you start trying
to do the actions the op is talking about, it requires a little better
understanding about how sessions work.
2) To retrieve properly the content of the KOUGLOF company
from here: http://tinyurl.com/lyhsmh
SIREN : 453786980
That is a major problem. We tested sending the SIREN as POST data :
> curl -F "siren=453786980" http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml | more
> curl -F "siren=@453786980" http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml | more
> curl -d "siren=453786980" http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml?siren=453786980
No content at all can be retrieved! ... and we carefully read each of
your messages three times, as well as all the official curl Unix
documentation.
This one suffers from the same transparency problem :
http://avis-situation-sirene.insee.fr/avisitu/IdentificationListeSiret.do?bSubmit=Avis%20de%20Situation&nic=00013&siren=512269432
The problem is that I have to present my initial research to my
supervisor, and I do not see how to progress and get the answer.
Thank you for any help or operating solutions,
Norman
Obviously, you are trying to do something the website
designer has decided to avoid. He wants to prevent
automated vacuuming of his database without harming
genuine manual queries.
One reason may be that your curl command does not
send the proper browser headers with the request.
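
A hedged sketch of what sending browser-like headers could look like with
command-line curl. The wrapper name browser_curl and the header values are
illustrative assumptions, not values known to satisfy this site. -A sets
the User-Agent, -H adds an arbitrary header, and -e sets the Referer.

```shell
#!/bin/sh
# Illustrative User-Agent string; any real browser's string could be used.
UA="Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"

# Hypothetical wrapper: call curl with browser-like headers, passing any
# extra arguments (URL, -b/-c cookie options, -d data) straight through.
browser_curl() {
    curl -s -L \
        -A "$UA" \
        -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" \
        -H "Accept-Language: fr,en;q=0.5" \
        -e "http://www.infogreffe.fr/infogreffe/reset.do" \
        "$@"
}

# Usage (performs a real network request):
#   browser_curl -b cookies.txt -c cookies.txt \
#       "http://www.infogreffe.fr/infogreffe/newRechercheEntreprise.xml?siren=453786980"
```

Running curl with -v shows the request headers actually sent, which can be
compared with what a browser sends for the same page.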
--
Tauno Voipio
tauno voipio (at) iki fi
That's because the site is using javascript on the button clicks. You
will have to emulate the javascript to get it to work - and this is
likely to fail if they change the page and/or javascript.
I'm with Tauno on this one - the site is obviously designed to prevent
what you're trying to do. I would recommend you contact the company to
see if there is another way to retrieve the data.