Urgent: GC.com changes of 2010-10-05 break parser

Steve8x8

Oct 7, 2010, 3:41:22 AM
to geotoad
The redesign of GC.com this time introduced changes in the format of
the "print view" (cdpf.aspx), in a way that breaks the parser. In
particular, fields have been broken up into multiple lines (as in
LatLong), and there have been field name changes (e.g. for the
Additional Hints).

I tried to cope with line fragmentation using the following patch:

--- details.rb.save 2010-10-04 15:42:10.000000000 +0200
+++ details.rb 2010-10-07 09:04:07.000000000 +0200
@@ -113,6 +113,9 @@
wid = nil
cache = nil

+ # condense <p class>
+ data.gsub!(/(\<p class=[^>]*\>)\s*/m, "&\\1");
+ # now analyze
data.split("\n").each { |line|
# <title id="pageTitle">(GC1145) Lake Crabtree computer software store by darylb</title>
if line =~ /\<title.*\((GC\w+)\) (.*?) by (.*?)\</

This does allow the location to be read, but geotoad then throws an
error somewhere else
(relevant command line parameters: -y 1 "N53 39.177 E011 21.837"):

D: ====+ Fetch URL: http://www.geocaching.com/seek/cdpf.aspx?lc=10&guid=7b986011-d12a-4076-823c-1cb9fb43646f
D: ====+ Fetch File: ~/.geotoad/cache/www.geocaching.com/seek/cdpf.aspx_lc_10_guid_7b986011-d12a-4076-823c-1cb9fb43646f
D: local cache is only 43048 old (518400), using local file.
D: 36222 bytes retrieved from local cache
D: wid = GC2FFCZ name=L&#252;tt Schwerin creator=TeamStralenwurm
D: stype=multicache full_type=
D: parsing date: [10/06/2010]
D: Looks like a date: year=2010 month=10, date=06
D: Timestamp parsed as Wed Oct 06 00:00:00 +0200 2010
D: ctime=Wed Oct 06 00:00:00 +0200 2010 cdays=1
D: got written lat/lon
D: found size: Micro
/home/steffen/src/geotoad/lib/details.rb:269:in `parseCache': undefined method `[]' for nil:NilClass (NoMethodError)
from ~/src/geotoad/lib/details.rb:65:in `fetch'
from ~/src/geotoad/geotoad.rb:405:in `fetchGeocaches'
from ~/src/geotoad/geotoad.rb:401:in `each_key'
from ~/src/geotoad/geotoad.rb:401:in `fetchGeocaches'
from ~/src/geotoad/geotoad.rb:612
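
An error of this shape typically appears when a regexp fails to match and the result is indexed anyway. The following is only a guess at the pattern, not the actual code at details.rb:269 (which is not shown here); parse_difficulty and its regexp are made up for illustration:

```ruby
# Hypothetical helper; the real code at details.rb:269 is not shown in this thread.
def parse_difficulty(line)
  m = line.match(/Difficulty:\s*([\d.]+)/)
  # Without this guard, m[1] on a failed match raises NoMethodError on nil.
  m ? m[1].to_f : nil
end

p parse_difficulty('Difficulty: 3.5')    # value on the same line as its label
p parse_difficulty('<p class="Meta">')   # value has moved to a later line
```

With a guard like this the parser would skip the field instead of aborting, although the value would of course still be missing.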

Another file already in the local cache yields

D: ====+ Fetch File: ~/.geotoad/cache/www.geocaching.com/seek/cdpf.aspx_lc_10_guid_72651f68-6d46-4e1b-8982-c615615dc78a
D: local cache is only 422873 old (518400), using local file.
D: 51246 bytes retrieved from local cache
D: wid = GC1WAN3 name=Thorge - Mecklenburgische Seenplatte/
lakedistrict creator=jeromedax
D: stype=earthcache full_type=
D: parsing date: [07/24/2009]
D: Looks like a date: year=2009 month=07, date=24
D: Timestamp parsed as Fri Jul 24 00:00:00 +0200 2009
D: ctime=Fri Jul 24 00:00:00 +0200 2009 cdays=440
D: got written lat/lon
D: found size: Not chosen
D: difficulty: 3.5
D: terrain: 1.5
D: found short desc: [Ni ju san
[...]

Comparing the two, terrain, difficulty, and descriptions are missing
for GC2FFCZ (note: GC2FFCZ has no short desc anyway).
The reason for the missing terrain and difficulty values is obviously
that they are now split over _three_ lines; my patch only removes the
single line feed after the <p> tag.
It might help to condense _everything_ between <p class...> and </p>
onto a single line before running the data.split.each loop, but I
don't know how to do that :(
And issue 152 points towards the logs, which are still different...
Of course another approach would be to abandon line parsing
completely, as is already being done for hints.

Any ideas?
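
One possible way to condense everything between <p class...> and </p> is a gsub with a block, sketched below. This is a minimal sketch and not tested against real cdpf.aspx pages; condense_paragraphs and the sample markup are made up for illustration. It replaces line breaks with a single space rather than nothing, so that words on adjacent lines do not run together:

```ruby
# Hypothetical helper (not part of geotoad): collapse each <p class=...>...</p>
# block onto one line so that a line-based parser sees the whole field.
def condense_paragraphs(html)
  # /m lets "." match newlines; ".*?" stops at the first closing </p>.
  html.gsub(/<p class=[^>]*>.*?<\/p>/m) do |block|
    # Replace every whitespace run containing a line break with one space.
    block.gsub(/\s*[\r\n]+\s*/, ' ')
  end
end

# Made-up sample; the real page layout may differ.
sample = "<p class=\"Meta\">\n  Difficulty:\n  3.5\n</p>\n<div>\nuntouched\n</div>\n"
puts condense_paragraphs(sample)
```

Text outside the matched blocks is left alone, so the existing data.split("\n").each loop could run unchanged on the result.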

Thomas Strömberg

Oct 7, 2010, 8:57:39 AM
to geo...@googlegroups.com
Thanks for the heads up. I haven't been able to focus on geotoad for a
few weeks now, but will get a chance to look at it on Saturday
morning. If you would like, I can give you access to submit any
changes to the trunk (after a code review).

2010/10/7 Steve8x8 <steffen....@gmx.net>:

--
// thomas

Steve8x8

Oct 7, 2010, 9:51:40 AM
to geotoad
> Thanks for the heads up. I haven't been able to focus on geotoad for a
> few weeks now, but will get a chance to look at it on Saturday
> morning.

Thanks in return for your heads-up ;) (As I've got some RL as well,
I understand your situation.)

There's quite a lot of intermixed changes applied to my geotoad copy
right now; about 20 issues have left their marks in the code :)

I suppose I'd have to separate them again; some are purely cosmetic
(spelling/typo fixes), others have almost no effect (only on the
output file name), and some have proven to be *essential* (event cache
parsing, and the current changes triggered by gc's redesign).
I'm not sure though whether I can make it before leaving for the
holiday break (geocaching, what else? :))
The current unified patch set consists of 281 '+' lines, so I hope the
critical pain level won't be reached too soon.
Theoretically it should be possible to cut it into pieces and let
patch correct the line counts.
Which order would you prefer?

> If you would like, I can give you access to submit any
> changes to the trunk (after a code review).

I suppose you should have a look first (at my coding style which is
almost non-existent).

Also, I need a helping hand with "combining all lines belonging to
<p class=...> ... </p>", as that's something I haven't found a recipe
for yet. IMHO it reduces to
1. finding all occurrences of <p class=...> and
2. appending subsequent lines while there's no </p> in the (cumulative) line
Unfortunately, we have to stay backwards-compatible for a while.
Something along the lines of (pseudo-code, not exactly Ruby!) comes to
mind:
for all matches (data =~ m/\<p class[^\>]*\>(.*?)\<\/p\>/) {
$1.gsub(/\s*[\n\r]+\s*/, '')
}
Since Ruby isn't Perl, there's certainly another way to do it. I just
don't see any. Do you?
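
The two steps listed above could be written line by line roughly as follows. This is a sketch under assumptions: join_paragraph_lines is a made-up name, and the fragments are collected in an array and joined once, instead of appending to a string for every line:

```ruby
# Hypothetical helper: join the lines of each <p class=...> block, line by line.
def join_paragraph_lines(html)
  result = []
  pending = nil                        # fragments of an open block, or nil
  html.each_line do |raw|
    line = raw.chomp
    if pending
      # Step 2: append subsequent lines to the open block.
      stripped = line.strip
      pending << stripped unless stripped.empty?
      closed = line.include?('</p>')
    elsif (m = line.match(/<p class=[^>]*>/))
      # Step 1: found an opening <p class=...> tag.
      pending = [line.rstrip]
      closed = m.post_match.include?('</p>')
    else
      result << line
      next
    end
    if closed                          # block is complete: emit it as one line
      result << pending.join(' ')
      pending = nil
    end
  end
  result << pending.join(' ') if pending   # unterminated block: keep what we have
  result.join("\n")
end

puts join_paragraph_lines("<p class=\"Meta\">\n  Terrain:\n  1.5\n</p>\nother\n")
```

Lines outside such blocks pass through unchanged, which should keep pages in the old single-line format working.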

For now I'm happy to have found the (rather obvious, if you compare
the HTML sources) fix
that allows newer caches to be parsed...

Steve8x8

Oct 7, 2010, 2:33:01 PM
to geotoad
> 1. finding all occurrences of <p class=...> and
> 2. appending subsequent lines while there's no </p> in the
> (cumulative) line

I have run several tests with different variants of the code, and found
that everything that checks individual lines is horribly slow (slower
than anything else; I therefore suspect that the string concatenations
consume most of the time, perhaps because of garbage collection?).
Multiple gsub()s are the only thing that's sufficiently fast.
Perhaps I've missed the obvious, clever solution :(
I have included three different code snippets in my patch.
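
A comparison of this kind could be timed roughly as below. This is a sketch only: both variants and the synthetic input are made up for illustration and are not the snippets from the patch, and the numbers will depend on the Ruby version and the real page contents:

```ruby
# Time a block with the monotonic clock (core Ruby, no extra library needed).
def seconds
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

# Variant A (hypothetical): one regexp pass over the whole document.
def condense_with_gsub(html)
  html.gsub(/<p class=[^>]*>.*?<\/p>/m) { |block| block.gsub(/\s*[\r\n]+\s*/, ' ') }
end

# Variant B (hypothetical): line by line; every += builds a new, longer string.
def condense_with_concat(html)
  out = ''
  inside = false
  html.each_line do |line|
    inside = true if line =~ /<p class=[^>]*>/
    if inside
      closed = line.include?('</p>')
      out += line.strip
      out += closed ? "\n" : ' '
      inside = false if closed
    else
      out += line
    end
  end
  out
end

# Synthetic input, made up for this sketch; real cdpf.aspx pages will differ.
page = "<p class=\"Meta\">\n  Difficulty:\n  3.5\n</p>\n<div>filler</div>\n" * 1_000

t_gsub   = seconds { condense_with_gsub(page) }
t_concat = seconds { condense_with_concat(page) }
printf("gsub:   %.4f s\nconcat: %.4f s\n", t_gsub, t_concat)
```

If the concatenations really are the bottleneck, appending in place with << or collecting the pieces in an array and joining once might narrow the gap.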

> For now I'm happy to have found the (rather obvious, if you compare
> the HTML sources) fix
> that allows newer caches to be parsed...

I have tested the code with three different searches (home zone, and
two holiday zones) and found nothing alarming. Finally, the
"inactive" tagging also seems to work reliably...
Feedback is welcome.

Steve8x8

Oct 8, 2010, 6:02:18 AM
to geotoad
> There's quite a lot of intermixed changes applied to my geotoad copy
> right now;

> I suppose I'd have to separate them again

> I'm not sure though whether I can make it before leaving for the
> holiday break

> Theoretically it should be possible to cut it into pieces and let
> patch correct the line counts.

There's a re-combined patch set now, with hopefully
self-explanatory names (and references to issue numbers)
which might save you some head-scratching. Or not.
Your comments are appreciated.
(It's again on my file space at
http://steve8x8.faehrwiese.dyndns.org/my-geotoad/
as I cannot figure out how to attach files to the message :(:( )

> Which order would you prefer?

Shouldn't matter, as all of them are pretty much separate unified diffs.

> > If you would like, I can give you access to submit any
> > changes to the trunk (after a code review).

At some point in the future I might want to be notified of newly added
issues, but I'll let you decide what goes into "upstream".

Cheers,
-St

Steve8x8

Oct 24, 2010, 1:18:49 PM
to geotoad
Since dyndns.com no longer provides wildcard DNS for free, I
moved to selfhost.de yesterday.
Patches _to the latest geotoad SVN rev_ can now be found under
http://steve8x8.faehrwiese.selfhost.me/my-geotoad/

BTW: I have checked in the whitespace patch which removes the famous
question marks from the text (tabs, CRs and LFs) as rev 679.
I'm not happy yet with the waypoint code - it looks long, unreadable
and bloated (but it works!) - and have prepared a patch for attributes
(which I'm pretty proud of since it covers _all_ currently known
attributes and is short, if you don't count the hash
initialization...). Unfortunately these two are so intertwined that I
cannot (semi-automatically) separate them at the moment :( The
add_waypoints patch (from an earlier stage) is there for exactly this
reason. I'd appreciate comments.

Note that only the latest *-all.diff file is the authoritative patch I
had applied against the current SVN revision. The older versions are
there only for historical reasons.
The *-patches.tar.gz file is created from the "big" patch by splitting
(awk/csplit) into individual chunks, manually re-arranging them into
"topic" groups, and re-combining them into individual patch files.
This may result in muffled complaints from patch when you try to apply
them - I have found that given a proper context, patch (at least the
Linux one) can stand (and compensate for) quite big line offsets. This
way, you may select individual patches and apply them in random order.
At least that's the idea... your mileage may vary.

Feedback is welcome!