NLTK and physical address extraction from HTML

Nick

Dec 23, 2012, 6:39:48 AM

to nltk-...@googlegroups.com

Good Morning,

I have a stack of 300k HTML pages, and I am looking to play around with these, and extract information from the HTML. I've spent days reading and researching NLP and the NLTK. I've read blogs and books as well as the NLTK website, but I am at a loss in a few areas.

One thing I want to do is extract physical addresses from the HTML, similar to the approach outlined in Zheyuan Yu's thesis, http://web.cs.dal.ca/~zyu/research/Thesis.pdf, in which he found that machine learning is the best method. However, I am stuck on even getting started.

Is this called Part-of-Speech Tagging or Chunking / Partial Parsing? The terms seem to be thrown around quite loosely.

I believe I need to do some supervised learning, so I need to create training and testing data sets in the IOB format, but I am not sure how to even create those. Is it a simple matter of adding /O, /I, or /B tags to the text? E.g.

"my/o neighbor/o lives/o at/o 1234/b Hill/i St/i ,/i Christchurch/i New/i Zealand/i"

or is tagging done using some other method?

Any help at all will be appreciated.

Have a great Christmas and a happy New Year.

Kind regards
Nick

Chandan Gupta

Dec 23, 2012, 10:32:37 PM

to nltk-...@googlegroups.com

Hi Nick,

What do you mean by "physical address" here?
And what kind of information extraction are you asking about?
Would you mind being more specific?

Thanks,
Chandan

Nick

Dec 23, 2012, 10:58:18 PM

to nltk-...@googlegroups.com

Hello,

Thanks for replying.

By "physical address" I mean the postal address. This could be a PO Box, a street number, a flat plus street number (e.g. 2/123), etc., which is why I want to use machine learning, as regex will not be enough. I have been trying to solve this and it's doing my head in, as I feel like I am getting nowhere.

For this, do I need to train a POS tagger or a chunker? How do I make the training set? I've got so many questions; I just feel I need to be pointed in the right direction.

Thanks again, ANY help is appreciated.

Regards

Nick

Chandan Gupta

Dec 24, 2012, 11:14:20 AM

to nltk-...@googlegroups.com

Nick,

Sorry for asking for more, but would you mind sharing a few of the files you have (maybe 5 HTML files)? I just want to see what patterns the physical addresses follow in the files.

Sam

Dec 24, 2012, 12:34:57 PM

to nltk-...@googlegroups.com

I think your first step would be creating some training data. I'd recommend collecting a bunch of addresses and hand-labeling them. Something like:

"My/O address/O is/O 123/B-ST Main/I-ST St./I-ST ,/O Anytown/B-TWN ,/O USA/B-CO ,/O 11111/B-ZIP"
where 'ST', 'TWN', 'CO', and 'ZIP' are labels for the street address, town, country, and zip code, respectively, and 'B', 'I', and 'O' mark a token as the beginning of, inside of, or outside of one of those labels (so 'I-TWN' would be any word after the first in the name of a town). The labels you choose depend on the level of granularity you want in your output. If you're planning on extracting the addresses atomically, you'd only need one label (and then basically every address token would just be 'B-ADDR' or 'I-ADDR' or whatever).
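
To make the format concrete, here is a minimal sketch in plain Python that reads a slash-tagged sentence like the one above and groups the B-/I- tokens back into labeled spans. The function names `parse_tagged` and `extract_spans` are made up for this illustration; NLTK has a comparable helper for the first step, `nltk.tag.str2tuple`, which also splits on the last slash.

```python
def parse_tagged(sentence):
    """Split 'token/TAG token/TAG ...' into (token, tag) pairs."""
    pairs = []
    for item in sentence.split():
        # rpartition splits on the LAST slash, so a token such as "2/123"
        # keeps its own slash and only the trailing tag is removed.
        token, _, tag = item.rpartition("/")
        pairs.append((token, tag))
    return pairs


def extract_spans(pairs):
    """Group B-/I- tagged tokens into (label, text) spans."""
    spans = []
    current_label, current_tokens = None, []
    for token, tag in pairs:
        prefix, _, label = tag.partition("-")
        if prefix == "B":
            # A B- tag always starts a new span; close any open one first.
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = label, [token]
        elif prefix == "I" and label == current_label:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span
            # (such a stray I- token is dropped in this sketch).
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_tokens:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


sentence = ("My/O address/O is/O 123/B-ST Main/I-ST St./I-ST ,/O "
            "Anytown/B-TWN ,/O USA/B-CO ,/O 11111/B-ZIP")
spans = extract_spans(parse_tagged(sentence))
print(spans)
# [('ST', '123 Main St.'), ('TWN', 'Anytown'), ('CO', 'USA'), ('ZIP', '11111')]
```

Splitting on the last slash matters for this data, because flat numbers such as 2/123 contain a slash of their own.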

Ideally, you'd strike a balance between variety and depth: with too much homogeneity in your training data, your classifier won't learn to handle enough of the variation you see in real-world data; with not enough depth, your classifier might not really learn much at all.

With sufficient labeled training data, you can then use an NLTK classifier (google 'NLTK IOB') to extract the addresses.

Hope this helps!

Nick

Dec 27, 2012, 5:06:37 AM

to nltk-...@googlegroups.com

Hello,

Thanks for both of your replies.

I've attached 4 pages which contain addresses, and I've also pulled out a few sentences and hand-labeled them (see below). Is that what you're talking about? I assume I should add these to a file and use them to train a classifier?

premises/O :/O view/O larger/O map/O totally/O boating/O :/O 223/B-ST akersten/I-ST street/I-ST port/I-TWN nelson/B-TWN new/B-CO zealand/I-CO phone/O :/O (/O

pm/O Boats/O for/O sale/O yard/O :/O 18/B-ST Library/I-ST Lane/I-ST Albany/B-SUB Village/I-SUB Auckland/I-TWN 0632/B-ZIP Boats/O for/O

IN/O OUR/O NEW/O LOCATION/O -/O 247/B-ST TI/I-ST RAKAU/I-ST DRIVE/I-ST ,/O PAKURANGA/B-SUB PLENTY/O OF/O OFF/O STREET/O PARKING/O

5/O +/O 8/O */O Our/O Location/O :/O 12/B-ST Clearwater/I-ST Cove/I-ST West/B-SUB Park/I-SUB Marina/I-SUB Hobsonville/I-SUB Auckland/B-TWN Opening/O hours/O :/O Mon-Fri/O

Am I on the right path here?

With the town (TWN) label, I can get a database of these, and the same goes for the country (CO). Is there a way I can feed this into the classifier so it knows that CO will always be one of x, and towns will always be one of y? That would make it more accurate.
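
One common way to give a classifier this kind of list knowledge is to turn list membership into features (often called gazetteer features) rather than a hard constraint. The sketch below is only an illustration under assumptions: `TOWNS`, `COUNTRIES`, and `token_features` are hypothetical names, and the tiny sets stand in for a real town and country database. It does not force the classifier to pick from the lists; it only gives it a strong hint.

```python
# Hypothetical, tiny lookup lists; in practice these would be loaded from
# the town and country database mentioned above.
TOWNS = {"nelson", "auckland", "christchurch"}
COUNTRIES = {"new zealand"}


def token_features(tokens, i):
    """Feature dict for tokens[i], including gazetteer lookups."""
    word = tokens[i].lower()
    prev_word = tokens[i - 1].lower() if i > 0 else "<START>"
    next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else "<END>"
    return {
        "word": word,
        "is_digit": word.isdigit(),
        "prev_word": prev_word,
        "in_town_list": word in TOWNS,
        # Country names can span two tokens ("new zealand"), so also check
        # the two-word phrases that end or start at this token.
        "in_country_list": (
            word in COUNTRIES
            or f"{prev_word} {word}" in COUNTRIES
            or f"{word} {next_word}" in COUNTRIES
        ),
    }


tokens = "223 akersten street port nelson new zealand".split()
for i, token in enumerate(tokens):
    feats = token_features(tokens, i)
    print(token, feats["in_town_list"], feats["in_country_list"])
```

A feature dict of this shape is what NLTK's classifiers expect as input for each token, so the lookups can sit alongside ordinary word features.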

Thanks again for your help.

Kind regards
Nick

boat-html.zip

Sam

Dec 27, 2012, 12:06:02 PM

to nltk-...@googlegroups.com

With the exception of a few minor errors (e.g. swapping I and B; by the way, if a span is only one token long, that token should be a B, not an I), it looks good. I'm not sure if you can explicitly limit the classifier to a range of options, but you could try supplementing your training data with appropriately marked-up town and country names. I'm not sure that would be more worthwhile than just adding more 'real world' training data, however.
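
Swapped I and B tags of this kind can be caught mechanically before training. The following is a small sketch, not part of NLTK, with a hypothetical function name `find_iob_errors`. It flags every I- tag that is not preceded by a B- or I- tag with the same label, and is run here on an excerpt from the first hand-labeled line in the previous post.

```python
def find_iob_errors(pairs):
    """Return (index, token, tag) for every I- tag that does not continue a span."""
    errors = []
    prev_tag = "O"
    for i, (token, tag) in enumerate(pairs):
        if tag.startswith("I-"):
            label = tag[2:]
            # A valid I-X must directly follow B-X or I-X.
            if prev_tag not in ("B-" + label, "I-" + label):
                errors.append((i, token, tag))
        prev_tag = tag
    return errors


line = ("223/B-ST akersten/I-ST street/I-ST port/I-TWN nelson/B-TWN "
        "new/B-CO zealand/I-CO")
# Split each item on its last slash to get (token, tag) pairs.
pairs = [tuple(item.rsplit("/", 1)) for item in line.split()]
errors = find_iob_errors(pairs)
print(errors)
# [(3, 'port', 'I-TWN')]
```

The check only finds structurally invalid sequences. It cannot tell that "nelson" should have been I-TWN once "port" is corrected to B-TWN, so a manual pass is still needed.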

If you're looking for a little more accuracy, you could try the BILOU annotation scheme (the aforementioned BIO scheme plus Last and Unit-length, i.e. single-token, labels).
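
Data that is already labeled with BIO tags does not need to be re-annotated by hand to try BILOU. Assuming the BIO tags are valid, the conversion can be sketched as follows (`bio_to_bilou` is a name invented for this example):

```python
def bio_to_bilou(tags):
    """Convert a list of valid BIO tags to BILOU tags."""
    bilou = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bilou.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The span continues only if the next tag is I- with the same label.
        continues = next_tag == "I-" + label
        if prefix == "B":
            # A B- with nothing after it is a single-token span: Unit-length.
            bilou.append(("B-" if continues else "U-") + label)
        else:
            # An I- with nothing after it is the Last token of its span.
            bilou.append(("I-" if continues else "L-") + label)
    return bilou


bio = ["O", "B-ST", "I-ST", "I-ST", "O", "B-TWN", "O", "B-CO", "O", "B-ZIP"]
bilou = bio_to_bilou(bio)
print(bilou)
# ['O', 'B-ST', 'I-ST', 'L-ST', 'O', 'U-TWN', 'O', 'U-CO', 'O', 'U-ZIP']
```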

Nick

Jan 4, 2013, 6:15:22 AM

to nltk-...@googlegroups.com

Sam,

Thanks for getting back in touch.

I've gone through and corrected the items now, thanks for pointing that out.

Before I go through and label a few hundred more, I wanted to train the classifier with the items I have.

I've added each of them to a single file, one per line. I've searched Google and the NLTK book for how to train a classifier, but can't seem to find what I need here.
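
For a file laid out like that, one possible starting point is to treat each token as a classification problem: build a feature dict per token and train `nltk.NaiveBayesClassifier` on (features, tag) pairs. This is a rough sketch under assumptions, not a recommended final design. The feature set, the helper names, and the two sample lines (shortened, corrected versions of the labeled lines earlier in the thread) are all illustrative, and two lines are far too little data to give meaningful predictions. The NLTK book's chapter on extracting information from text builds a classifier-based chunker along similar lines.

```python
import nltk


def parse_tagged(line):
    """Split 'token/TAG ...' into (token, tag) pairs, splitting on the last slash."""
    return [tuple(item.rsplit("/", 1)) for item in line.split()]


def token_features(tokens, i, prev_tag):
    """Hypothetical feature set for tokens[i]; adjust to taste."""
    word = tokens[i]
    return {
        "lower": word.lower(),
        "is_digit": word.isdigit(),
        "is_title": word.istitle(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "prev_tag": prev_tag,
    }


def build_training_set(lines):
    """Turn hand-labeled lines into (features, tag) pairs for NLTK."""
    training = []
    for line in lines:
        pairs = parse_tagged(line)
        tokens = [token for token, _ in pairs]
        prev_tag = "<START>"
        for i, (_, tag) in enumerate(pairs):
            training.append((token_features(tokens, i, prev_tag), tag))
            prev_tag = tag
    return training


def tag_tokens(classifier, tokens):
    """Tag tokens left to right, feeding each prediction into the next step."""
    tags = []
    prev_tag = "<START>"
    for i in range(len(tokens)):
        prev_tag = classifier.classify(token_features(tokens, i, prev_tag))
        tags.append(prev_tag)
    return tags


# Illustrative sample; in practice, read the lines from the labeled file,
# one labeled sentence per line.
lines = [
    "Our/O Location/O :/O 12/B-ST Clearwater/I-ST Cove/I-ST Auckland/B-TWN",
    "yard/O :/O 18/B-ST Library/I-ST Lane/I-ST Auckland/B-TWN 0632/B-ZIP",
]
classifier = nltk.NaiveBayesClassifier.train(build_training_set(lines))
predicted = tag_tokens(classifier, "visit us at 18 Library Lane Auckland".split())
print(predicted)
```

Holding back part of the labeled lines as a test set and passing them to `nltk.classify.accuracy` gives a first measure of how well the features work before labeling a few hundred more.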