From text to CSV with Perl and WWW::Mechanize


Martin Kaspar

Oct 6, 2012, 10:01:13 AM
to www-mecha...@googlegroups.com


Hello dear folks,

On the way to digging deeper into text manipulation with Perl, I came across this tutorial:


see http://szabgab.com/talks/fundamentals_of_perl/process-csv-file.html

I need to create a CSV file out of a data chunk with more than 10000 lines.


With Mechanize I get a dataset like the following.

 See a data chunk:

 Loosdorftown Ledochowskastra�e 4 3382 Loosdorftown Telefonnummer: 02754 6257 FAX-Nummer: 02754 6257-4

 linux-wyee:/home/martin/perl #
 The script gives back a result like this one:
 Loosdorf
 Ledochowskastraße
 3382 Loostown
 Telefonnummer: 0002754 6257
 FAX-Nummer: 0002754 6257-4


Well, we have the following options here:

To print to a file instead of printing to the screen, we just have to change:

say $text;

to:

print $OUT_FILE $text;


Some explanation: $OUT_FILE is a filehandle for the output file, which we have to open before entering the "for" loop.
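
A minimal sketch of that change, assuming the records are already collected in an array. The file name `addresses.txt` and the `@records` array are placeholders for illustration and are not taken from the original script:

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the text scraped with Mechanize
# (ASCII spelling used here to keep the encoding question separate).
my @records = (
    'Loosdorf Ledochowskastrasse 4 3382 Loosdorf',
    'Mauerkirchens Pfarrhofstrasse 4 5270 Mauerkirchentown',
);

my $out_name = 'addresses.txt';    # assumed output file name

# Open the output file once, before the loop starts.
open my $OUT_FILE, '>', $out_name
    or die "Cannot open $out_name: $!";

for my $text (@records) {
    # Unlike say, print does not append a newline, so add one explicitly.
    print {$OUT_FILE} $text, "\n";
}

close $OUT_FILE or die "Cannot close $out_name: $!";
```

Note the explicit "\n": replacing `say` with a bare `print` would otherwise run all records together on one line.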

This would work for the code as it is, but it might be different if we use the Text::CSV module, which probably has dedicated functions or methods for printing CSV lines to a file. (To be frank, I don't use this module and don't know it, although I should probably change that, because I work with CSV files from time to time.)
Let me describe in more detail what we want the output file to look like: should the comma separate the fields of the addresses, or the records?


If we take this for example: katholisch.at

We have the following dataset:



I want to have each dataset separated into these bits. In other words,
if I get a dataset with delimiters that separate the fields of lines like this one

Loosdorf Ledochowskastra�e 4 3382 Loosdorf Telefonnummer: 02754 6257 FAX-Nummer: 02754 6257-4

I would be very, very happy. Note: there is also an encoding issue. See the Ledochowskastra�e: the street name contains the character "ß", so we have to take care of the
ISO 8859 encoding, don't we?
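
One possible way to deal with this, sketched with the core Encode module. It assumes that the raw bytes really are ISO-8859-1 (Latin-1); that is a guess based on the broken "ß" and has not been checked against the site:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the raw bytes from the site are ISO-8859-1 (Latin-1).
# "\xDF" is the single Latin-1 byte for the sharp s.
my $raw_bytes = "Ledochowskastra\xDFe 4";

# Turn the bytes into a Perl character string.
my $text = decode('ISO-8859-1', $raw_bytes);

# Encode explicitly when bytes are needed, here as UTF-8.
my $utf8_bytes = encode('UTF-8', $text);

# Or let an output layer do the encoding for every print.
binmode STDOUT, ':encoding(UTF-8)';
print $text, "\n";
```

The idea is to decode once on input, work with character strings inside the script, and encode once on output.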

I would love it if you could give me some hints and a helping hand. That would be very supportive.
Note: this is a great chance for me to learn a lot about Perl, and about the options and power of Mechanize.



See more results:
Marias Neustift Neustifttown 28 4443 Marias Neussstift Telefonnummer: 007250/204 FAX-Nummer: 07250/204-4 E-Mail: prre.inmar...@dioezese-linz.at
Marias Puchheim Gmundnertown Stra�e 1b 4800 Attnanger-Puchheim Telefonnummer: 007674/62334 FAX-Nummer: 07674/62334-4 E-Mail: prre.inmar...@dioezese-linz.at
Marias Scharten Schartenstown 1 4612 Schartensbook Telefonnummer: 007272/5210
Marias Schmolln Maria Schmollntown 2 5241 Maria Schmolln Telefonnummer: 007743/2209-12 FAX-Nummer: 07743/2209-17 E-Mail: prre.inmar...@dioezese-linz.at
Mattighofen R�merstra�e 12 5230 Mattighofentown Telefonnummer: 007742/2273 0676/87765221 FAX-Nummer: 07742/2273-22 E-Mail: peipfarre.i...@dioezese-linz.at
Mauerkirchens Pfarrhofstra�e 4 5270 Mauerkirchentown Telefonnummer: 007724/2262
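
Records like the ones above could be split into fields roughly as follows. This is only a sketch: the helper `parse_record` and the field names are made up for illustration, and it assumes that every record contains a 4-digit postcode and that the labels "Telefonnummer:", "FAX-Nummer:" and "E-Mail:" always appear in this order. It cannot tell the parish name from the street, so both stay in one field:

```perl
use strict;
use warnings;

# Hypothetical helper: split one record line into named fields.
sub parse_record {
    my ($line) = @_;
    my %rec;

    # Strip the labelled fields from the right-hand side first.
    $rec{email} = $1 if $line =~ s/\s*E-Mail:\s*(\S+)\s*$//;
    $rec{fax}   = $1 if $line =~ s/\s*FAX-Nummer:\s*(.+?)\s*$//;
    $rec{phone} = $1 if $line =~ s/\s*Telefonnummer:\s*(.+?)\s*$//;

    # What remains is "<name and street> <4-digit postcode> <town>".
    if ($line =~ /^\s*(.+?)\s+(\d{4})\s+(.+?)\s*$/) {
        @rec{qw(address postcode town)} = ($1, $2, $3);
    }
    return \%rec;
}

# Sample lines in ASCII spelling, to keep the encoding issue out of this sketch.
my $with_fax = parse_record(
    'Loosdorf Ledochowskastrasse 4 3382 Loosdorf Telefonnummer: 02754 6257 FAX-Nummer: 02754 6257-4'
);
my $without_fax = parse_record(
    'Mauerkirchens Pfarrhofstrasse 4 5270 Mauerkirchentown Telefonnummer: 007724/2262'
);

for my $rec ($with_fax, $without_fax) {
    # Missing optional fields are printed as empty strings.
    print join(' | ', map { $rec->{$_} // '' } qw(address postcode town phone fax email)), "\n";
}
```

Removing the labelled fields first keeps the phone and fax digits from being mistaken for the postcode.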


 It does count up, which is great!

1. What I wanted is to force the script to run from 00000 to 10000.
 Note: the results should be stored in CSV format.

For 1., I therefore changed $max_page_num to the maximum number and $page to the starting number. This only prints the data to stdout (console).
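
The counting part could look roughly like this. The names `$page` and `$max_page_num` are the ones from the script; `fetch_page` is a hypothetical stand-in for the real `$mech->get(...)` call, so that the sketch runs without network access:

```perl
use strict;
use warnings;

my $page         = 0;         # starting number
my $max_page_num = 10_000;    # last page to fetch

# Hypothetical stand-in for the real Mechanize request.
sub fetch_page {
    my ($num) = @_;
    return sprintf 'record-%05d', $num;    # zero-padded, e.g. 00042
}

my $count = 0;
for my $num ($page .. $max_page_num) {
    my $text = fetch_page($num);
    $count++;

    # Only show the first and last record to keep the console output short.
    print "$text\n" if $num == $page || $num == $max_page_num;
}
print "$count pages processed\n";
```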


Now I am trying to modify it... :-)

I have to turn the output into CSV values.

Usually this can be done with Text::CSV_XS (on which Class::CSV is based).
Note: a friend also suggested using Text::CSV, which loads Text::CSV_XS if it is installed or otherwise falls back to the pure-Perl Text::CSV_PP.

At the moment all the results are only printed to stdout (console), but I'm sure that I can modify it... :-)

I just installed Text::CSV_XS.
I took it from here: http://search.cpan.org/~hmbrand/Text-CSV_XS-0.91/CSV_XS.pm


Now I am trying to figure out which attributes I should use.


What do you suggest?
How can I force the script to give back CSV?
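
For what it is worth, here is a minimal sketch of writing rows with Text::CSV_XS, using the attributes `binary`, `auto_diag` and `eol`. The column names, the sample row and the file name `addresses.csv` are assumptions for illustration only:

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({
    binary    => 1,       # accept non-ASCII characters such as the sharp s
    auto_diag => 1,       # report CSV errors automatically
    eol       => "\n",    # line ending appended by print
});

# Hypothetical header and one sample record (ASCII spelling).
my @rows = (
    [ 'name_and_street', 'postcode', 'town', 'phone', 'fax' ],
    [ 'Loosdorf Ledochowskastrasse 4', '3382', 'Loosdorf', '02754 6257', '02754 6257-4' ],
);

my $out_name = 'addresses.csv';    # assumed output file name
open my $fh, '>:encoding(UTF-8)', $out_name
    or die "Cannot open $out_name: $!";

# print takes a filehandle and an array reference with the fields of one row.
$csv->print($fh, $_) for @rows;

close $fh or die "Cannot close $out_name: $!";
```

The module takes care of quoting, so fields that contain commas or quotes do not break the file.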




Greetings, Martin


On Sat, Aug 11, 2012 at 6:48 AM, <www-mecha...@googlegroups.com> wrote:

Group: http://groups.google.com/group/www-mechanize-users/topics

    nixbuilder <nixbu...@gmail.com> Aug 08 12:47PM -0700  

    Since going through some OS and package upgrades, I am now getting errors when
    I try to connect with a previously very stable script.
     
    When I do a $mech->get($URL), it returns "Bad hostame". However the
    hostname does resolve when using nslookup or dig.
     
    So while running my script I also ran wireshark to see what was
    happening... the only thing I noticed was that DNS queries were going
    for AAAA records (IPv6) first... getting nothing back because there is
    no IPv6 on the requested machine, and then doing a query for the "A"
    record, which returns with the IP address. But the $mech->get() does
    not seem to pick up on the "A" IP address.
     
    Any ideas?
     
    BTW... versions are Linux=3.4.6, glibc=2.16.0 perl=5.14.2 and WWW-
    Mechanize=1.72.

     

    Natxo Asenjo <natxo....@gmail.com> Aug 10 10:46PM +0200  

    > no IPv6 on the requested machine, and then doing a query for the "A"
    > record, which returns with the IP address. But the $mech->get() does
    > not seem to pick up on the "A" IP address.
     
    without seeing some code it is hard to tell.
     
    You can try adding show_progress(1) to $mech in order to get some debugging
    info about what is going on; you can also disable ipv6 (although I do not
    think that is going to make any difference).
     
    --
    natxo

     


--
You received this message because you are subscribed to the Google Groups "WWW::Mechanize users" group.
To post to this group, send email to www-mecha...@googlegroups.com.
To unsubscribe from this group, send email to www-mechanize-u...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/www-mechanize-users?hl=en.
