file_get_contents or cURL - which one to use?

Martin Kaspar

Jun 4, 2011, 1:29:16 AM

to professi...@googlegroups.com

Good day, dear list,

I hope everything is all right.

I am trying to scrape the data from a webpage, and I need to get all the records (ten lines per entry). I need this data on my local machine, stored in a MySQL database, for better retrieval and search.

Here is what I am trying to achieve.

See the target: www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder

I need to get all the data behind this link.

include 'simple_html_dom.php';
$html1 = file_get_html('http://www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder');

// The selector is the open question: which class name belongs here?
$info1 = $html1->find('b[class=[what to enter here]]', 0);

I need to get all the data out of this site.

See an example:

Bürgerstiftung Lebensraum Aachen
    rechtsfähige Stiftung des bürgerlichen Rechts
    Ansprechpartner: Hubert Schramm
    Alexanderstr. 69/ 71
    52062 Aachen
    Telefon: 0241 - 4500130
    Telefax: 0241 - 4500131
    Email: in...@buergerstiftung-aachen.de
    www.buergerstiftung-aachen.de
    >> Weitere Details zu dieser Stiftung

Bürgerstiftung Achim
    rechtsfähige Stiftung des bürgerlichen Rechts
    Ansprechpartner: Helga Kühn
    Rotkehlchenstr. 72
    28832 Achim
    Telefon: 04202-84981
    Telefax: 04202-955210
    Email: in...@buergerstiftung-achim.de
    www.buergerstiftung-achim.de
    >> Weitere Details zu dieser Stiftung

I need the data that is "behind" the link. Is there any way to do this
with an easy and understandable parser, one that can be understood and written by a newbie?

Can you give me some hints? Should I do this with XPath, in PHP or in Perl (with Mechanize)?

I look forward to hearing from you. You are the first I want to ask here!

Do not leave me standing in the rain.

I would be more than glad if you could help me achieve this goal.

Thanks for the help.

Greetings,
martin
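
Once a record has been pulled out of the page as plain text, turning it into named fields is a small step. The following is only a sketch under assumptions: parseEntry() is a hypothetical helper (not part of any library), and it assumes each record arrives as text with one field per line, in the layout of the two examples above. How the text is extracted from the real page markup is not covered here.

```php
<?php
// Hypothetical helper: split one record (plain text, one field per line)
// into an associative array. The field labels are taken from the examples
// in this post; anything without a known label is kept under 'other'.
function parseEntry(string $text): array
{
    // Trim every line and drop empty ones.
    $lines = array_values(array_filter(array_map('trim', explode("\n", $text)), 'strlen'));

    // The first line is assumed to be the foundation's name.
    $entry = ['name' => array_shift($lines), 'other' => []];

    foreach ($lines as $line) {
        if (strpos($line, '>>') === 0) {
            continue; // skip the ">> Weitere Details ..." link text
        }
        if (preg_match('/^(Ansprechpartner|Telefon|Telefax|Email):\s*(.*)$/u', $line, $m)) {
            $entry[strtolower($m[1])] = $m[2];
        } else {
            $entry['other'][] = $line; // legal form, street, city, website
        }
    }
    return $entry;
}
```

Called on the second example above, this would give an array with the keys name, ansprechpartner, telefon, telefax and email, plus the unlabeled lines under other, which maps fairly directly onto the columns of a MySQL table.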

On Sat, Nov 20, 2010 at 8:53 AM, vishal bhandare <bhandar...@gmail.com> wrote:

SimpleXML and the DOM object are meant to process well-formed XML files.
XML and HTML look similar, but they are not the same. Some HTML
tags have no closing tag, which may throw errors when
HTML is read as XML.

I am not experienced with this, but it means you would have to remove
those tags before processing, so you might end up developing your own HTML
parser.
So why do that? Use an HTML parser created by experienced programmers.
Simple HTML DOM Parser is very useful if you want to do
screen scraping. I once hit our
server's memory limit while reading HTML pages with PHP.
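
For comparison, PHP's built-in DOMDocument also has an HTML mode (loadHTML) that tolerates tags without closing tags, and DOMXPath can then run positional XPath expressions like the ones quoted further down in this thread. A minimal sketch, assuming the dom extension is available; the markup is invented sample data, not the real page:

```php
<?php
// Invented sample markup with an unclosed <br> and an unclosed <p>.
$html = '<html><body>'
      . '<div>School</div>'
      . '<div>Altes Schulhaus Ossingen<br></div>'
      . '<div>Guntibachstrasse 10</div>'
      . '<p>paragraph without closing tag'
      . '</body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);   // HTML mode: does not require well-formed XML
libxml_clear_errors();

// Positional XPath, in the style Firebug reports.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('/html/body/div[2]');
$name  = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;

echo $name, "\n";
```

Positional paths such as /html/body/div[2] break as soon as the page layout changes, so checking the result length before using it, as done here, is worth the extra line.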

If you want to read a large number of files, that is a concern. If you want
to do the work of a crawler, then why use PHP for that? Use something else
like Perl/CGI or one of many others. They are better suited to such work.
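
That said, the loop over result pages described in the quoted message below can be sketched in PHP in a few lines. This is a sketch under assumptions, not a finished crawler: crawlIds() and fetchWithCurl() are hypothetical names, and an "empty" page is assumed to mean one with no visible text once the tags are stripped. The fetcher is passed in as a callable, so either cURL or file_get_contents can be used.

```php
<?php
// Hypothetical helper: fetch each detail page by id and keep only the
// non-empty ones. $fetch is any callable that takes a URL and returns
// the page body as a string, or false on failure.
function crawlIds(array $ids, callable $fetch): array
{
    $records = [];
    foreach ($ids as $id) {
        $html = $fetch("http://www.educa.ch/dyn/79376.asp?id={$id}");

        // Skip failed requests and pages without any visible text.
        if ($html === false || trim(strip_tags($html)) === '') {
            continue;
        }
        $records[$id] = $html;
    }
    return $records;
}

// One possible fetcher based on cURL (requires the curl extension).
// Returns the body as a string, or false on failure.
function fetchWithCurl(string $url)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 15,   // give up after 15 seconds
    ]);
    return curl_exec($ch);
}
```

Usage would be crawlIds([1568, 2149], 'fetchWithCurl') or crawlIds([1568, 2149], 'file_get_contents'). file_get_contents is the shorter option but only fetches URLs when allow_url_fopen is enabled; cURL gives explicit control over timeouts and redirects. Processing one page at a time and storing only the extracted fields, rather than whole pages, should also help with the memory limit mentioned above.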

On 11/19/10, Robert Gonzalez <robert.anth...@gmail.com> wrote:
> Can you not just use the built in DOM object? Or even SimpleXML for this?
> I'm pretty sure both support XPath.
>
> On Sun, Nov 14, 2010 at 3:40 PM, Martin Kaspar
> <martin...@campus-24.com>wrote:
>
>> Hello dear PHP friends,
>>
>>
>> does the PHP Simple HTML DOM Parser (see
>> http://simplehtmldom.sourceforge.net/)
>> support XPath? I am not sure.
>>
>> I want to parse the data structure of a fetched page.
>>
>> Some details: we have several hundred result pages
>> derived from
>> this one: http://www.educa.ch/dyn/79362.asp?action=search
>>
>> Note: I want to iterate over the result pages with a loop.
>>
>> http://www.educa.ch/dyn/79376.asp?id=1568
>> http://www.educa.ch/dyn/79376.asp?id=2149
>>
>>
>> I use this loop:
>> PHP Code:
>> // $match[1] is assumed to hold the number of result pages
>> for ($i = 1; $i <= $match[1]; $i++) {
>>     $url = "http://www.example.com/page?page={$i}";
>>     // access new sub-page, extract necessary data
>> }
>>
>>
>> As an example we can use this domain:
>> http://www.educa.ch/dyn/79362.asp?action=search
>>
>> Note that we have lots of targets:
>> http://www.educa.ch/dyn/79376.asp?id=1568
>> http://www.educa.ch/dyn/79376.asp?id=2149
>>
>> and many more.
>>
>> What do you think? What about the loop over the target URLs?
>>
>> BTW: some of the pages will be empty. The empty pages
>> should be thrown away; I do not want to store "empty" stuff.
>>
>> That is what I want to do, and now I need a good parser
>> script.
>>
>> Note: this is a three-part job:
>>
>> 1. fetching the sub-pages
>> 2. parsing them, and if all goes well we have a third
>> part:
>> 3. storing the data in a MySQL database
>>
>>
>>
>> b. The parser part:
>> The problem is that some of the pages mentioned above are empty, so I
>> need a way to leave them aside, since I do not want to
>> populate my MySQL database with too much data.
>> BTW: the parsing should be doable with DOMDocument. What
>> do you think?
>> I need to combine the first part with the second. Can you give me
>> some starting points and hints for this?
>> The fetching should be done with cURL, and the fetched data passed
>> on to a DOMDocument parser. Fetching is no problem, but how do I do
>> the DOMDocument part?
>>
>> I have installed Firebug in Firefox,
>>
>> and now I have the XPaths for the pages:
>>
>> http://www.educa.ch/dyn/79376.asp?id=1187
>> http://www.educa.ch/dyn/79376.asp?id=2939
>> http://www.educa.ch/dyn/79376.asp?id=1515
>> http://www.educa.ch/dyn/79376.asp?id=1469
>>
>>
>> see the details:
>>
>> Altes Schulhaus Ossingen :: /html/body/div[2]
>> Guntibachstrasse 10 :: /html/body/div[4]
>> 8475 Ossingen :: /html/body/div[6]
>> sekretariat...@bluewin.ch :: /html/body/div[9]/a
>> Tel:052 317 15 45 :: /html/body/div[11]
>> Fax:052 317 04 42 :: /html/body/div[12]
>>
>> Question: does the Simple HTML DOM Parser support XPath?
>>
>> --
>> This group is managed and maintained by the development staff at 360 PSG.
>> An enterprise application development company utilizing open-source
>> technologies for todays small-to-medium size businesses.
>>
>> For information or project assistance please visit :
>> http://www.360psg.com
>>
>> You received this message because you are subscribed to the Google Groups
>> "Professional PHP Developers" group.
>> To post to this group, send email to Professi...@googlegroups.com
>> To unsubscribe from this group, send email to
>> Professional-P...@googlegroups.com
>> For more options, visit this group at
>> http://groups.google.com/group/Professional-PHP


Robert Gonzalez

Jun 4, 2011, 6:28:14 PM

to professi...@googlegroups.com

So are you saying that you need not only to grab the content of a page but also to follow the links on the page and get the linked-to content as well?

That seems easy enough to do, but you will have to put some thought into it before you start developing. For example, are you planning on following relative links? If so, what happens if a link is relative but fully pathed? Make sure you think this through early on.

Robert Gonzalez

http://twitter.com/RobertGonzalez

http://www.facebook.com/robertgonzalez
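
To make the relative-link question concrete, here is a minimal sketch of turning an href found on a page into an absolute URL before fetching it. resolveUrl() is a hypothetical helper, and it assumes that "fully pathed" means a link starting with "/" (relative to the site root). It covers only the common cases and is not a complete implementation of URL resolution.

```php
<?php
// Hypothetical helper: resolve $href (as found in a link) against $base
// (the URL of the page it was found on). Simplified: a trailing slash on
// a document-relative href is not preserved, and an href consisting only
// of a query string is not handled.
function resolveUrl(string $base, string $href): string
{
    // Already absolute (http:, https:, mailto:, ...): leave untouched.
    if (parse_url($href, PHP_URL_SCHEME) !== null) {
        return $href;
    }

    $b    = parse_url($base);
    $root = $b['scheme'] . '://' . $b['host'] . (isset($b['port']) ? ':' . $b['port'] : '');

    if (strpos($href, '//') === 0) {
        return $b['scheme'] . ':' . $href;   // protocol-relative link
    }
    if (strpos($href, '/') === 0) {
        return $root . $href;                // root-relative link
    }

    // Document-relative link: set the query string aside, then resolve
    // the path against the directory of the base URL.
    $query = '';
    if (($pos = strpos($href, '?')) !== false) {
        $query = substr($href, $pos);
        $href  = substr($href, 0, $pos);
    }
    $path = $b['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);

    $out = [];
    foreach (explode('/', $dir . $href) as $seg) {
        if ($seg === '..') {
            array_pop($out);                 // go up one directory
        } elseif ($seg !== '.' && $seg !== '') {
            $out[] = $seg;
        }
    }
    return $root . '/' . implode('/', $out) . $query;
}
```

With the search page http://www.educa.ch/dyn/79362.asp?action=search as the base, both the document-relative href 79376.asp?id=1568 and the root-relative href /dyn/79376.asp?id=1568 should come out as the same absolute URL, which also makes it easy to avoid fetching the same page twice.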
