Scraping shipping information

Gordo

Aug 29, 2012, 12:59:20 PM
to scrap...@googlegroups.com
Hi all! First off, I am gonna apologize in advance for my questions, as I am sure that some or all of it will have been answered before. However, as I am not as technically adept as some of you, I am gonna ask anyways:

I am writing an article where the goal is to describe shipping risk (collision, allision, foundering) in the Atlantic, as part of my studies. To do this, I'm dependent on sanitizing large AIS datasets with up-to-date vessel information. The site marinetraffic.com has that information, but manually updating each listing in the AIS dataset would result in me finishing my studies somewhere around 2020 :)

Vessels over 300 dwt usually have what is called an MMSI number. It seems that marinetraffic uses the MMSI number as part of the URL, e.g. http://marinetraffic.com/ais/shipdetails.aspx?mmsi=677032000 where 677032000 is the MMSI. Is there any way to scrape this off the site and into a table, so that I end up with data from all MMSI numbers on the site?

The output data I need for each MMSI number is the MMSI number itself, ship type, year of build, deadweight tonnage, flag and IMO number, in a table. If anyone could advise me on the feasibility of such a project, point me in the right direction or help me in any other way, I would be very thankful.

Thanks for reading

/Gordo

Paul Bradshaw

Aug 29, 2012, 1:06:12 PM
to scrap...@googlegroups.com
Hi Gordo, 
> Vessels over 300 dwt usually have what is called an MMSI number. It seems that marinetraffic uses the MMSI number as part of the URL, e.g. http://marinetraffic.com/ais/shipdetails.aspx?mmsi=677032000 where 677032000 is the MMSI. Is there any way to scrape this off the site and into a table, so that I end up with data from all MMSI numbers on the site?

Ideally you'd want a starting point that links to all of those, but in the absence of that you may need to generate each possible combination of 9 digits! Have you tried asking nicely at http://marinetraffic.com/ais/exporttext.aspx ?
 

> The output data I need for each MMSI number is the MMSI number itself, ship type, year of build, deadweight tonnage, flag and IMO number, in a table. If anyone could advise me on the feasibility of such a project, point me in the right direction or help me in any other way, I would be very thankful.

This second stage should be easy once you have the list, because you just have to generate the URL from it. 
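
As a rough illustration of that second stage, here is a minimal Python sketch that turns a list of MMSI numbers into detail-page URLs. The URL pattern is the one quoted earlier in the thread; the function name `detail_url` and the second sample MMSI are made up for the example, and the 9-digit check is only a basic sanity test.

```python
from urllib.parse import urlencode

# URL pattern taken from the example earlier in the thread.
BASE_URL = "http://marinetraffic.com/ais/shipdetails.aspx"


def detail_url(mmsi):
    """Return the ship-details URL for one MMSI number (hypothetical helper)."""
    mmsi = str(mmsi).strip()
    # An MMSI is a 9-digit number; reject anything else early.
    if not (mmsi.isdigit() and len(mmsi) == 9):
        raise ValueError("MMSI must be 9 digits: %r" % mmsi)
    return BASE_URL + "?" + urlencode({"mmsi": mmsi})


# Sample list for illustration; in practice it would come from the AIS dataset.
# The second value is invented.
mmsi_list = ["677032000", "257000000"]
urls = [detail_url(m) for m in mmsi_list]
```

Each entry in `urls` can then be fetched in turn by whatever scraper you end up using.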

--
Paul Bradshaw

Out now - Scraping for Journalists: http://leanpub.com/scrapingforjournalists 
8,000 Holes: How the 2012 Olympic Torch Relay Lost its Way: https://leanpub.com/8000holes (all proceeds to the Brittle Bone Society)
The Online Journalism Handbook: http://amzn.to/jEND3p 

Online Journalism Blog http://onlinejournalismblog.com 
Help Me Investigate http://helpmeinvestigate.com - Shortlisted for 
Multimedia Publisher of the Year, 2010; winner of Talk About Local investigation of the year 2010

Organiser, Hacks and Hackers Birmingham http://meetupbirmingham.hackshackers.com/

Visiting Professor, City University, London http://www.city.ac.uk/journalism/
Course Leader, MA Online Journalism, Birmingham City University http://bit.ly/maonlinejournalism

http://twitter.com/paulbradshaw
LinkedIn profile and recommendations at http://bit.ly/paulbrecommendations



Paul Bradshaw

Aug 29, 2012, 1:08:39 PM
to scrap...@googlegroups.com
Actually, just spotted this: http://marinetraffic.com/ais/datasheet.aspx?datasource=SHIPS_CURRENT&alpha=A&level0=200

You should be able to scrape the IDs from that using OutWit Hub pretty easily.

--

Paul Bradshaw

Aug 29, 2012, 1:17:00 PM
to scrap...@googlegroups.com
In fact, here they are. You need to use Excel or Google Docs to extract the codes from each URL and then 

You can probably turn that into a list and adapt this scraper to then cycle through and grab the data you need: https://scraperwiki.com/scrapers/free_school_meals_scotland/
OutWit guess export - Current_Vessels_in_Range_AIS.csv
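
If a spreadsheet is awkward, the code extraction can also be sketched in a few lines of Python. This assumes the exported links carry the MMSI as an `mmsi` query parameter, as in the shipdetails URL earlier in the thread; the function name and the sample links are illustrative, not taken from the actual export.

```python
from urllib.parse import urlparse, parse_qs


def mmsi_from_url(url):
    """Return the mmsi query parameter of a URL, or None if there is none."""
    query = parse_qs(urlparse(url).query)
    # Compare keys case-insensitively, in case the export uses "MMSI".
    lowered = {key.lower(): values for key, values in query.items()}
    values = lowered.get("mmsi")
    return values[0] if values else None


# Sample links standing in for the link column of the exported CSV.
links = [
    "http://marinetraffic.com/ais/shipdetails.aspx?mmsi=677032000",
    "http://marinetraffic.com/ais/datasheet.aspx?datasource=SHIPS_CURRENT",
]
codes = [m for m in map(mmsi_from_url, links) if m is not None]
```

Links without an MMSI are simply skipped, so `codes` holds only the usable IDs.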

Gordo

Aug 29, 2012, 1:53:49 PM
to scrap...@googlegroups.com
Hi Paul! Thanks so much for your input. Much appreciated. I have contacted Marinetraffic by mail, but am unsure of the result, as they seem to require you to be participating in their AIS-receiver network to share data (which I am not, as I don't have the AIS equipment).

The list you provided from the site is a good starting point, but it seems they only list the first 500 vessels there. To my knowledge, there are between 80 and 120 thousand ships with MMSI numbers. That's why I thought "brute-forcing" it could be a possibility.
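
For a sense of scale, here is a back-of-the-envelope calculation in Python. The ship count is the middle of the estimate above; the request rate of one page per second is purely an assumption for illustration, not a limit published by the site.

```python
TOTAL_COMBINATIONS = 10 ** 9   # every possible 9-digit number
ESTIMATED_SHIPS = 100_000      # middle of the 80 to 120 thousand estimate
REQUESTS_PER_SECOND = 1        # assumed polite request rate (illustrative)

# Fraction of guessed numbers that would correspond to a real ship.
hit_rate = ESTIMATED_SHIPS / TOTAL_COMBINATIONS

# Time needed to try every combination at the assumed rate.
seconds = TOTAL_COMBINATIONS / REQUESTS_PER_SECOND
years = seconds / (60 * 60 * 24 * 365)

print("hit rate: about 1 in %d" % round(1 / hit_rate))
print("time to try everything: about %.1f years" % years)
```

Under those assumptions only about one guess in 10,000 would hit a real vessel, and trying everything would take roughly 31.7 years, so a naive brute force only makes sense if the range of candidate numbers can be narrowed a lot first.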

/Gordo

Thad Guidry

Aug 29, 2012, 2:03:25 PM
to scrap...@googlegroups.com

Gordo

Aug 29, 2012, 2:57:31 PM
to scrap...@googlegroups.com
True, fun stuff. However, as the AIS data I have access to uses MMSI numbers, that is what I need to sanitize my data. By the way, I am playing around with the school lunch scraper, with a couple of hundred MMSI numbers I already know I need updated. But I keep getting the error "NameError: name 'schoolIDs' is not defined" at line 63. I am probably missing something, as I can't really program, so I would be delighted if someone could take a look.

/Gordo

Paul Bradshaw

Aug 29, 2012, 4:35:12 PM
to scrap...@googlegroups.com
Got a URL for your scraper? Can't find it via browse.

Gordo

Aug 29, 2012, 5:26:08 PM
to scrap...@googlegroups.com
Yeah, it's https://scraperwiki.com/scrapers/gordo/ . I have got it to run, but it is not generating any data, at least not yet, as I have yet to understand how to define the tables/tags in the marinetraffic site so that it catches MMSI, name, IMO, deadweight, flag state and type. Any help with that would be much appreciated.

/Gordo

Paul Bradshaw

Aug 30, 2012, 4:20:05 AM
to scrap...@googlegroups.com
Just took a look - the first part of the scraper (at the bottom) works but the def scrape_table part needs customising too. Instead of:
    rows = root.cssselect("table.destinations tr")

You need to specify the part of the pages you want. A quick look suggests this is the tag wrapped around: <div id='detailtext'>

But the HTML within that isn't helpful, with <br/> or <b> tags used to separate the parts of the data. You can either fiddle around with fetching those in the scraper, or grab the whole div and clean it up in spreadsheet software.

This is how I changed the code to get it to grab that div in your pages instead (this could be simplified, but it represents the fewest changes from the original code):

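    # Select the div holding the ship details instead of table rows
    rows = root.cssselect("div#detailtext")
    for row in rows:
        record = {}
        # 'FSM' is the key name carried over from the original scraper
        record['FSM'] = row.text_content()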
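
If you would rather split the fields inside the scraper than in a spreadsheet, something along these lines might work. It is only a sketch: it assumes the div contains `<b>Label:</b> value<br/>` pairs, which has not been checked against the real pages, and the sample labels and values are invented. It uses only the Python standard library rather than the lxml calls in the scraper above.

```python
import re
from html import unescape


def parse_detail_html(fragment):
    """Split '<b>Label:</b> value<br/>' pairs into a dict.

    The markup pattern is an assumption about the page, not confirmed.
    """
    record = {}
    # Split on any <br>, <br/> or <br /> variant.
    for chunk in re.split(r"<br\s*/?>", fragment, flags=re.IGNORECASE):
        # Drop remaining tags such as <b>, then decode HTML entities.
        text = unescape(re.sub(r"<[^>]+>", "", chunk)).strip()
        if ":" not in text:
            continue  # no label on this line, skip it
        label, value = text.split(":", 1)
        record[label.strip()] = value.strip()
    return record


# Hypothetical sample; labels and values are invented for illustration.
sample = ("<b>Ship Type:</b> Cargo<br/>"
          "<b>Year Built:</b> 1998<br/>"
          "<b>Flag:</b> Norway<br />")
record = parse_detail_html(sample)
```

The resulting dict has one key per label, which maps more naturally onto table columns than a single blob of text.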

Gordo

Aug 30, 2012, 6:24:27 AM
to scrap...@googlegroups.com
Thanks a lot Paul, I have what I need now. Much obliged. If you're ever in Norway, I'll buy you a beer :)
 
/Gordo

Páll Hilmarsson

Aug 30, 2012, 6:39:18 AM
to scrap...@googlegroups.com
Have you looked at: http://www.itu.int/online/mms/mars/ship_search.sh

I don't know how up to date it is, or how inclusive, but it's easy to scrape.

p

Tom Morris

Aug 30, 2012, 9:27:25 AM
to scrap...@googlegroups.com
On Wednesday, August 29, 2012 1:53:49 PM UTC-4, Gordo wrote:

> The list you provided from the site is a good starting point, but it seems they only list the first 500 vessels there.

The search results are capped at 500 ships, but you can adaptively add additional leading characters for the categories which have more than 500 ships, e.g. aa, ab, ac, etc. for the letter A.

Note also that the list in the UI isn't comprehensive, so you'll want to add digits, etc. to ensure complete coverage. You can also use TYPE_SUMMARY=Cargo to restrict your search to cargo ships, if that suits your needs: http://marinetraffic.com/ais/datasheet.aspx?SHIPNAME=0&TYPE_SUMMARY=Cargo&menuid=&datasource=SHIPS_CURRENT&app=&mode=&B1=Search
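
The adaptive prefix idea could be sketched in Python roughly as follows. `count_results` is a hypothetical callable that returns how many rows the site shows for a given name prefix (in practice it would fetch the datasheet page and count the rows); the 500 cap is the one mentioned above, while `max_len` and the alphabet are assumptions.

```python
import string

CAP = 500  # result cap mentioned in the thread
ALPHABET = string.ascii_uppercase + string.digits  # assumed character set


def prefixes_under_cap(count_results, prefix="", max_len=4):
    """Yield name prefixes whose result lists are not truncated by the cap.

    count_results is a hypothetical callable: prefix -> number of rows shown.
    """
    for ch in ALPHABET:
        candidate = prefix + ch
        n = count_results(candidate)
        if n == 0:
            continue  # nothing starts with this prefix
        if n >= CAP and len(candidate) < max_len:
            # The list is probably truncated, so refine with one more character.
            yield from prefixes_under_cap(count_results, candidate, max_len)
        else:
            # Either complete, or max_len reached and possibly still truncated.
            yield candidate


# Fake data standing in for the site, for illustration only.
fake_names = ["AA%03d" % i for i in range(600)] + ["AB1", "B1"]


def fake_count(prefix):
    return min(CAP, sum(name.startswith(prefix) for name in fake_names))


result = list(prefixes_under_cap(fake_count))
```

Two caveats with this sketch: names containing spaces or punctuation would need a wider alphabet, and a name exactly as long as a capped prefix (for example a ship simply called "A") is not matched by any longer prefix, so it would have to be picked up from the capped list itself.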

Tom