nearly there..


Daniel Hoolihan

May 2, 2016, 6:09:44 AM
to openaust...@googlegroups.com
Hi all.. hoping this list is still the best place to ask things..

I've spent my long weekend getting a dev environment built for openaustralia
under Windows.. I've solved all the normal "dev types forgetting Windows even
exists" kind of issues.. but I've got to the very last step and I can't get any
decent data to load into the database.

I've tried downloading a sample set of XML files from data.openaustralia.org
under both 'scrapedxml' and 'rewritexml' for random days.. but all of them seem
to have some sort of data issue and the xml2db Perl script fails to load them..
usually on the htime field, which quite often has data in it that is clearly
not a timestamp..

Am I just being very unlucky? or is there a magical set of 'clean' xml files
somewhere that should be used?

I'm sooooo close.. thoughts?

Daniel

Henare Degan

May 2, 2016, 6:40:18 AM
to openaustralia-dev
Hi Daniel,


> Hi all.. hoping this list is still the best place to ask things..

We've got Slack these days - I'll send you an invite (if anyone else would like an invite just email me off-list).


> I've spent my long weekend getting a dev environment built for openaustralia under Windows.. I've solved all the normal "dev types forgetting Windows even exists" kind of issues.. but I've got to the very last step and I can't get any decent data to load into the database.

You're very brave getting this running at all let alone on Windows, well done! :)

I normally load stuff into the database locally via the parser so I had to pick this apart. I downloaded this:


Put it in a directory under the configured "RAWDATA" like so:

RAWDATA/scrapedxml/representatives_debates/2016-04-19.xml

Ran this from "twfy/scripts":

./xml2db.pl --debates --date=2016-04-19

That output this and the speeches were visible in the web app:

db loading  2016-04-19

What's happening when you run xml2db?

Cheers,

Henare



Daniel Hoolihan

May 2, 2016, 7:15:36 AM
to openaust...@googlegroups.com

I haven't used Slack yet, but I'll check it out..

The file you mention below loads beautifully.. so does April 18th.. however..

scrapedxml/representatives_debates/2007-09-20.xml (the file used as the example on the install instructions) gives me DBD::mysql::st execute failed: Incorrect time value: 'unknown' for column 'htime'

and

scrapedxml/representatives_debates/2016-02-02.xml gives me DBD::mysql::st execute failed: Incorrect time value: ' (Maribyrnong—Leader of the Opposition) (14:43):' for column 'htime'

When you check that XML, on line 781, the htime attribute of the speech tag does indeed have that value.. however, on the main openaustralia site I can search for and find that snippet, so it must have loaded somehow..

I'm assuming someone either spends way too much time hand-cleaning XML, or the parser (which I haven't got up and running yet, I'm not a Ruby guy) is able to silently discard bad data and keep working..

Not critical now that I have at least *some* data loaded so I can keep playing, but a little frustrating to say the least.. :)

cheers
Daniel

grep'ing through those files confirms that the htime attribute on some of the speeches definitely has bad data in it.. which I would have thought would have caused a problem
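
The grep check described here can be made repeatable with a short script. This is only a sketch in Python, not part of the OpenAustralia tooling: the `htime` attribute on `speech` tags comes from the thread, while the `id` attribute, the `debates` root element, the sample data and the `find_bad_htimes` name are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Accepts HH:MM or HH:MM:SS on a 24-hour clock; anything else is "bad".
TIME_RE = re.compile(r"([01]\d|2[0-3]):[0-5]\d(:[0-5]\d)?")


def find_bad_htimes(xml_text):
    """Return (speech id, htime) pairs whose htime is not a clock time."""
    root = ET.fromstring(xml_text)
    bad = []
    for speech in root.iter("speech"):
        htime = speech.get("htime")
        # Speeches without an htime attribute are skipped, not flagged.
        if htime is not None and not TIME_RE.fullmatch(htime):
            bad.append((speech.get("id"), htime))
    return bad


# Hypothetical sample modelled on the values quoted in this thread.
# \u2014 is the dash character that appears in the 2016-02-02 file.
SAMPLE = """<debates>
  <speech id="a.1" htime="14:43">fine</speech>
  <speech id="a.2" htime="unknown">bad</speech>
  <speech id="a.3" htime=" (Maribyrnong\u2014Leader of the Opposition) (14:43):">bad</speech>
</debates>"""

if __name__ == "__main__":
    for speech_id, htime in find_bad_htimes(SAMPLE):
        print(speech_id, repr(htime))
```

For a real file, read it in binary mode and pass the bytes to `find_bad_htimes`, so the XML parser can honour any encoding declared in the file.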

Henare Degan

May 2, 2016, 7:29:37 AM
to openaustralia-dev
You're right - there's definitely incorrect data there.


> I'm assuming someone either spends way too much time hand-cleaning XML, or the parser (which I haven't got up and running yet, I'm not a Ruby guy) is able to silently discard bad data and keep working..

Nope, everything you see in the repo and data store is the same as what we use.

My only guess is that MySQL on [Li]nix is the one silently throwing away the bad data. I was shocked to learn a while back that MySQL does this quite often - not exactly an admirable quality in a database.
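
One way to test that guess is to run `SELECT @@SESSION.sql_mode;` on both machines and compare the results. The sketch below assumes the difference is MySQL's strict SQL mode, which is not confirmed anywhere in this thread. With strict mode off, MySQL stores an invalid TIME value as 00:00:00 and only raises a warning; with it on, the insert fails with an "Incorrect time value" error. The `is_strict` helper is a hypothetical name and only interprets the mode string; it does not connect to a database.

```python
# Mode names that turn on strict behaviour. TRADITIONAL is a
# combination mode that includes the two STRICT_* modes.
STRICT_FLAGS = {"STRICT_TRANS_TABLES", "STRICT_ALL_TABLES", "TRADITIONAL"}


def is_strict(sql_mode):
    """Return True if a MySQL sql_mode string enables strict mode."""
    modes = {m.strip().upper() for m in sql_mode.split(",") if m.strip()}
    return bool(modes & STRICT_FLAGS)


if __name__ == "__main__":
    # Example mode strings, not taken from either machine in this thread.
    print(is_strict("STRICT_TRANS_TABLES,NO_ENGINE_SUBSTITUTION"))
    print(is_strict("NO_ENGINE_SUBSTITUTION"))
```

If the two machines do differ, that would explain why the same file loads on one and fails on the other, but it would not make the data any cleaner.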

When I rerun the importer it does this on my machine:

$ ./xml2db.pl --debates --date=2016-02-02
db loading  2016-02-02
Wide character in print at ./xml2db.pl line 750.
updating hansard object uk.org.publicwhip/debate/2016-02-02.69.2, changing: at 9 value #00:00:00# to # (Maribyrnong—Leader of the Opposition) (14:43):#.
updating hansard object uk.org.publicwhip/debate/2016-02-02.125.3, changing: at 9 value #00:00:00# to ##.
Wide character in print at ./xml2db.pl line 750.
updating hansard object uk.org.publicwhip/debate/2016-02-02.126.1, changing: at 9 value #00:00:00# to # (Mitchell—Assistant Minister to the Treasurer) (18:38):#.
$

We should definitely log a bug for this and fix the parser so it doesn't put this junk in time fields.
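
As a rough illustration of what such a fix might do: the sketch below is in Python, whereas the real parser is Ruby, so it shows the idea only and is not a patch. The `clean_htime` name and the choice to salvage an embedded clock time (rather than always dropping the value) are assumptions.

```python
import re

# A clock time such as 14:43 or 9:05, bounded so that digits from a
# longer number are not picked up by accident.
CLOCK_RE = re.compile(r"\b([01]?\d|2[0-3]):([0-5]\d)\b")


def clean_htime(raw):
    """Return 'HH:MM' if raw contains a clock time, otherwise None."""
    if raw is None:
        return None
    match = CLOCK_RE.search(raw)
    if match is None:
        return None
    return "{:02d}:{}".format(int(match.group(1)), match.group(2))


if __name__ == "__main__":
    # Values modelled on the ones quoted in this thread.
    for value in [" (Mitchell, Assistant Minister) (18:38):", "unknown", ""]:
        print(repr(value), "->", clean_htime(value))
```

Returning None for values with no recoverable time lets the loader store NULL, or skip the field, instead of passing junk through to MySQL.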

Cheers,

Henare