Good Morning,
I have a stack of 300k HTML pages, and I am looking to play around with these, and extract information from the HTML. I've spent days reading and researching NLP and the NLTK. I've read blogs and books as well as the NLTK website, but I am at a loss in a few areas.
One thing I want to do is extract physical addresses from the HTML, similar to what is outlined in Zheyuan Yu's thesis, http://web.cs.dal.ca/~zyu/research/Thesis.pdf, in which he found that machine learning was the best method. However, I am stuck on even getting started.
Is this called Part-of-Speech Tagging or Chunking / Partial Parsing? The terms seem to be thrown around quite loosely.
I believe I need to do some supervised learning, so I need to create training and testing data sets in the IOB format, but I am not sure how to even create them. Is it a simple matter of adding /O, /I or /B tags to the text? E.g.
"my/o neighbor/o lives/o at/o 1234/b Hill/i St/i ,/i Christchurch/i New/i Zealand/i"
or is tagging done using some other method?
Any help at all will be appreciated.
Have a great Christmas and New Year.
Kind regards
Nick
--
"My/O address/O is/O 123/B-ST Main/I-ST St./I-ST ,/O Anytown/B-TWN ,/O USA/B-CO ,/O 11111/B-ZIP"
Where 'ST', 'TWN', 'CO', and 'ZIP' are labels for the street address, town, country, and zip code, respectively, and 'B', 'I', and 'O' mark tokens that are at the beginning of, inside of, or outside of one of those labels (so 'I-TWN' would be any word after the first in the name of a town). The labels you choose depend on the level of granularity you want in your output. If you're planning on extracting the addresses atomically, you'd only need one label (and then every address token would just be 'B-ADDR' or 'I-ADDR' or whatever, with everything else 'O').
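To make the format concrete, here is a minimal sketch in plain Python (no NLTK needed) that reads a slash-tagged string like the one above and groups the B-/I- tokens back into labeled spans. The function names parse_tagged and extract_spans are just made up for illustration, and it assumes tokens are separated by whitespace with the tag after the last slash.

```python
def parse_tagged(text):
    """Split whitespace-separated 'token/TAG' items into (token, tag) pairs."""
    pairs = []
    for item in text.split():
        # Split on the LAST slash so tokens that contain '/' still work.
        token, _, tag = item.rpartition("/")
        pairs.append((token, tag))
    return pairs


def extract_spans(pairs):
    """Group B-/I- tagged tokens into (label, text) spans; 'O' tokens are skipped."""
    spans = []
    label, tokens = None, []

    def flush():
        # Store the span collected so far, if there is one.
        if tokens:
            spans.append((label, " ".join(tokens)))

    for token, tag in pairs:
        prefix, _, tag_label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and tag_label != label):
            # A 'B' tag (or a stray 'I' with a different label) starts a new span.
            flush()
            label, tokens = tag_label, [token]
        elif prefix == "I":
            tokens.append(token)
        else:
            # 'O' closes whatever span was open.
            flush()
            label, tokens = None, []
    flush()
    return spans


example = ("My/O address/O is/O 123/B-ST Main/I-ST St./I-ST ,/O "
           "Anytown/B-TWN ,/O USA/B-CO ,/O 11111/B-ZIP")
print(extract_spans(parse_tagged(example)))
# [('ST', '123 Main St.'), ('TWN', 'Anytown'), ('CO', 'USA'), ('ZIP', '11111')]
```

Going the other way (producing the tags) is the manual annotation work: you tokenize each page's text and assign one tag per token.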
Ideally, you'd strike a balance between variety and depth--too much homogeneity in your training data, and your classifier won't learn to handle enough of the variation you see in real-world data; not enough depth, and your classifier might not really learn much at all.
With sufficient labeled training data, you can then train an NLTK classifier (google 'NLTK IOB') to extract the addresses.
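As a rough sketch of what that could look like: one common approach is to turn each token into a dictionary of features and train a classifier to predict its IOB tag. The feature set and the helper names below (token_features, build_training_set, train_tagger, tag_tokens) are my own assumptions for illustration, not something NLTK ships; the only NLTK calls used are nltk.NaiveBayesClassifier.train and the classifier's classify method.

```python
def token_features(tokens, i, prev_tag):
    """Describe token i with a few simple, hand-picked features."""
    token = tokens[i]
    return {
        "lower": token.lower(),
        "is_digit": token.isdigit(),    # street numbers, zip codes
        "is_title": token.istitle(),    # capitalized names
        "prev_tag": prev_tag,           # tag given to the previous token
        "prev_word": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


def build_training_set(tagged_sents):
    """Turn sentences of (token, iob_tag) pairs into (features, tag) pairs."""
    featuresets = []
    for sent in tagged_sents:
        tokens = [tok for tok, _ in sent]
        prev_tag = "<START>"
        for i, (_, tag) in enumerate(sent):
            featuresets.append((token_features(tokens, i, prev_tag), tag))
            prev_tag = tag
    return featuresets


def train_tagger(tagged_sents):
    """Train a Naive Bayes classifier on the hand-labeled sentences."""
    import nltk  # imported here so the helpers above work without NLTK
    return nltk.NaiveBayesClassifier.train(build_training_set(tagged_sents))


def tag_tokens(classifier, tokens):
    """Tag left to right, feeding each predicted tag into the next step."""
    tagged, prev_tag = [], "<START>"
    for i in range(len(tokens)):
        prev_tag = classifier.classify(token_features(tokens, i, prev_tag))
        tagged.append((tokens[i], prev_tag))
    return tagged
```

You would call train_tagger on your labeled sentences, hold some back for testing, and then run tag_tokens on the tokenized text of unseen pages. Naive Bayes is only a starting point here; the same feature dictionaries should work with NLTK's other classifiers.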
Hope this helps!
If you're looking for a little more accuracy, you could try the BILOU annotation scheme (the aforementioned BIO scheme plus Last and Unit-length (i.e. single-token) labels).
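If your data is already tagged with BIO, you don't have to re-annotate it; the BILOU tags can be derived mechanically. A minimal sketch, assuming well-formed BIO input (bio_to_bilou is a made-up helper name):

```python
def bio_to_bilou(tags):
    """Convert a list of BIO tags (e.g. 'B-ST', 'I-ST', 'O') to BILOU."""
    bilou = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bilou.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The span continues only if the next tag is an I- tag with the same label.
        continues = next_tag == "I-" + label
        if prefix == "B":
            bilou.append(("B-" if continues else "U-") + label)
        else:
            bilou.append(("I-" if continues else "L-") + label)
    return bilou


print(bio_to_bilou(["O", "B-ST", "I-ST", "I-ST", "O", "B-TWN", "O"]))
# ['O', 'B-ST', 'I-ST', 'L-ST', 'O', 'U-TWN', 'O']
```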