CIA Factbook changed the country HTML

Filipe Sousa

Apr 14, 2015, 7:44:46 AM
to open...@googlegroups.com
Hi guys,

It seems that the CIA changed the HTML of the Factbook. The gem no longer works.


Gerald Bauer

Apr 14, 2015, 7:49:28 AM
to Filipe Sousa, open...@googlegroups.com
Hello,
Thanks for reporting. Can you post some details - what errors do you get?
Cheers.

Filipe Sousa

Apr 14, 2015, 8:06:27 AM
to open...@googlegroups.com

page.rb searches for divs that are no longer in the current Factbook HTML, such as:

          [ @opts[:fields] ? 'Introduction'        : 'intro',    '<div id="CollapsiblePanel1_Intro"'   ],
          [ @opts[:fields] ? 'Geography'           : 'geo',      '<div id="CollapsiblePanel1_Geo"'     ],
          [ @opts[:fields] ? 'People and Society'  : 'people',   '<div id="CollapsiblePanel1_People"'  ],
          [ @opts[:fields] ? 'Government'          : 'govt',     '<div id="CollapsiblePanel1_Govt"'    ],
          [ @opts[:fields] ? 'Economy'             : 'econ',     '<div id="CollapsiblePanel1_Econ"'    ],
          [ @opts[:fields] ? 'Energy'              : 'energy',   '<div id="CollapsiblePanel1_Energy"'  ],
          [ @opts[:fields] ? 'Communications'      : 'comm',     '<div id="CollapsiblePanel1_Comm"'    ],
          [ @opts[:fields] ? 'Transportation'      : 'trans',    '<div id="CollapsiblePanel1_Trans"'   ],
          [ @opts[:fields] ? 'Military'            : 'military', '<div id="CollapsiblePanel1_Military"'],
          [ @opts[:fields] ? 'Transnational Issues': 'issues',   '<div id="CollapsiblePanel1_Issues"'  ]

So the resulting data is always empty.
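
To illustrate the failure mode, here is a minimal sketch. It is not the gem's actual page.rb code: it assumes sections are located by a plain substring search for each marker, and `find_sections`, `MARKERS` and both sample HTML strings are made up for the example.

```ruby
# Hypothetical sketch: locate sections by searching for fixed div markers.
# This is NOT the gem's page.rb; it only illustrates why the result is empty.
MARKERS = [
  ['intro', '<div id="CollapsiblePanel1_Intro"'],
  ['geo',   '<div id="CollapsiblePanel1_Geo"'],
]

# Returns a hash of section key => character offset of its marker.
# Sections whose marker is missing from the page are silently skipped.
def find_sections(html, markers = MARKERS)
  markers.each_with_object({}) do |(key, marker), found|
    pos = html.index(marker)
    found[key] = pos if pos
  end
end

# Invented sample pages: one with the old div ids, one without them.
old_html = '<div id="CollapsiblePanel1_Intro">...</div>' \
           '<div id="CollapsiblePanel1_Geo">...</div>'
new_html = '<h2>Introduction</h2><div>...</div><h2>Geography</h2><div>...</div>'

p find_sections(old_html)   # both markers are found
p find_sections(new_html)   # => {}
```

Because a missing marker is skipped rather than reported, nothing fails loudly; the data just comes back empty.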

Gerald Bauer

Apr 14, 2015, 9:29:53 AM
to Filipe Sousa, open...@googlegroups.com
Hello,
Thanks for the details. Will try to see if an update of the (web
page document) search rules is possible. Cheers.

Filipe Sousa

Apr 14, 2015, 11:17:35 AM
to open...@googlegroups.com
thanks for your feedback :)



Eckhard Licher

Apr 15, 2015, 1:55:25 AM
to open...@googlegroups.com
Thanks for the info.

The Python-based World Factbook scraper silently creates a mess.

I will look into the matter when the download version of the Factbook is updated (I expect this to happen any time now).

Regards,

Eckhard



Gerald Bauer

Sep 28, 2015, 1:56:48 PM
to openmundi
Hello,

  FYI: I'm updating the gem (library) for the "new" HTML format, e.g.:

> page.rb searches for divs that are not in the current HTML of factbook such as:
>
>       [ @opts[:fields] ? 'Introduction'        : 'intro',    '<div id="CollapsiblePanel1_Intro"'   ],
>       [ @opts[:fields] ? 'Geography'           : 'geo',      '<div id="CollapsiblePanel1_Geo"'     ],
>       [ @opts[:fields] ? 'People and Society'  : 'people',   '<div id="CollapsiblePanel1_People"'  ],
>       [ @opts[:fields] ? 'Government'          : 'govt',     '<div id="CollapsiblePanel1_Govt"'    ],
>       [ @opts[:fields] ? 'Economy'             : 'econ',     '<div id="CollapsiblePanel1_Econ"'    ],
>       [ @opts[:fields] ? 'Energy'              : 'energy',   '<div id="CollapsiblePanel1_Energy"'  ],
>       [ @opts[:fields] ? 'Communications'      : 'comm',     '<div id="CollapsiblePanel1_Comm"'    ],
>       [ @opts[:fields] ? 'Transportation'      : 'trans',    '<div id="CollapsiblePanel1_Trans"'   ],
>       [ @opts[:fields] ? 'Military'            : 'military', '<div id="CollapsiblePanel1_Military"'],
>       [ @opts[:fields] ? 'Transnational Issues': 'issues',   '<div id="CollapsiblePanel1_Issues"'  ]
>
>  So the resulting data is always empty.

  Note: the updated factbook page reader now finds sections by their <h2> headings and no longer by divs with ids (see above). That should make it much more stable (let's see if that holds true ;-)
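
  As a rough illustration of the <h2> approach, here is a minimal sketch. It is not the gem's reader: `split_by_h2` and the sample page are invented for the example, and it uses a regular expression where a real reader would more likely use an HTML parser.

```ruby
# Hypothetical sketch: split a page into sections keyed by <h2> heading text.
H2_REGEX = %r{<h2[^>]*>(.*?)</h2>}im

def split_by_h2(html)
  sections = {}
  # String#split with a capture group keeps the captured heading text,
  # giving [preamble, title1, body1, title2, body2, ...].
  parts = html.split(H2_REGEX)
  parts.shift                                 # drop everything before the first <h2>
  parts.each_slice(2) do |title, body|
    name = title.gsub(/<[^>]+>/, '').strip    # strip inline tags from the heading
    sections[name] = body.to_s.strip          # body is nil if the page ends at a heading
  end
  sections
end

# Invented sample page; the real Factbook markup will differ.
sample = <<~HTML
  <div class="header">site chrome</div>
  <h2>Introduction</h2>
  <div>Background text</div>
  <h2>Geography</h2>
  <div>Location text</div>
HTML

split_by_h2(sample).each { |name, body| puts "#{name}: #{body}" }
```

  Keying on the heading text means the lookup keeps working when wrapper ids change, though it would still break if the headings themselves were renamed.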

  Note: another new addition to help avoid breakage: cleaned-up versions of the HTML pages ("sanitized" and "chrome-less", i.e. without any headers, footers, scripts, etc.) are now hosted and version-tracked in a new repo, that is, /factbook [1]. The updated gem (library) uses the cached version by default, which guarantees that the library at least keeps working even if the live format changes.
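
  The "cached by default" idea could look roughly like the sketch below. It is an illustration only, not the gem's API: `read_page`, `cache_dir:`, `live:` and `fetcher:` are hypothetical names, and a temporary directory stands in for the cached repo.

```ruby
require 'tmpdir'

# Hypothetical sketch: read a country page from a local cache by default
# and only call the live fetcher when explicitly asked to.
def read_page(code, cache_dir:, live: false, fetcher: nil)
  if live
    raise ArgumentError, 'fetcher required for live reads' unless fetcher
    return fetcher.call(code)
  end
  File.read(File.join(cache_dir, "#{code}.html"))
end

Dir.mktmpdir do |dir|
  # Stand-in for a cleaned-up, version-tracked page.
  File.write(File.join(dir, 'au.html'), '<h2>Introduction</h2>')

  puts read_page('au', cache_dir: dir)                       # served from the cache
  puts read_page('au', cache_dir: dir, live: true,
                 fetcher: ->(c) { "<live page for #{c}>" })  # explicit opt-in to live
end
```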

   If anyone still follows along, let us know how it goes. Cheers.

