Help defining/quantizing a problem

Samudra Neelam Bhuyan

Nov 17, 2012, 11:04:47 PM

to mu...@googlegroups.com

Hey all

Needed some advice about a problem we are trying to solve.

The problem involves finding the significant words in a paragraph, picking out the relevant ones, and finding related terms, including concepts that are one level higher than those words (e.g. from "ferromagnetic" to "magnetism").

We are aware of the Zemanta API and AlchemyAPI. They look like a good idea in the short term, but I'm afraid that in the long term we would lose the flexibility of having our own algorithms and dictionaries.

Q: Would it be better to build a layer on top of a 3rd-party API, or to build it from scratch? (We are aware of some libraries like ConceptNet, but haven't used them ourselves.)

I guess the question could be: how difficult is it to reach AlchemyAPI or Zemanta levels of confidence, especially if you define the field of tags narrowly? Approximately how much time and effort will it take to build that tagger?

Thanks for your help!
Samudra

Devendra Rane

Nov 18, 2012, 12:31:54 AM

to mu...@googlegroups.com

I have used 'kea' and also worked on a custom text-mining tool. I have used both of them for supervised learning (ontology/taxonomy based).
The general set of algorithms (stemming, keyword and key-phrase extraction, distance clustering, etc.) is already available in existing libraries, so you probably won't even need to code them yourself in your preferred language.
The software part is easy; building the tagger taxonomy and tuning the parameters for your domain is going to be the tricky part.
I would suggest building on any MIT/BSD-licensed library instead of starting from scratch.
There are no silver bullets for reaching the accuracy levels of commercially available APIs; you will need to keep working on the algorithms. It is doable and beneficial in the long run, but do not expect good results in a short time span.
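
To give a rough idea of how small the software part can be, here is a minimal sketch in plain Python of the keyword-then-broader-concept step from the original question. The stopword list, the BROADER dictionary and the function names are all made up for illustration; raw frequency ranking stands in for a real keyword extractor, and a real system would add stemming and a far larger taxonomy.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real one is much longer.
STOPWORDS = {"the", "a", "an", "of", "in", "is", "are", "and", "to",
             "such", "as", "below", "its"}

# Hypothetical hand-built taxonomy: term -> concept one level higher.
BROADER = {
    "ferromagnetic": "magnetism",
    "paramagnetic": "magnetism",
    "iron": "metal",
    "nickel": "metal",
}

def keywords(text, top_n=5):
    """Rank non-stopword tokens by raw frequency (a stand-in for real keyword extraction)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]

def tag(text, top_n=5):
    """Return (keyword, broader concept) pairs for the keywords the taxonomy knows."""
    return [(w, BROADER[w]) for w in keywords(text, top_n) if w in BROADER]

para = ("Ferromagnetic materials such as iron and nickel stay magnetised. "
        "Iron is ferromagnetic below its Curie temperature.")
tags = tag(para)
print(tags)
```

Everything hard lives in the two data structures at the top, which is the point: the code barely changes from domain to domain, the taxonomy does.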

/dev/
Devendra K. Rane
Ph: +91 900 403 8889
Skype: ranedk
Gtalk: ranedk

--
_________________________________________________
Mumbai Python Users Group - http://www.mumpy.org/
Mailing Group - http://groups.google.com/group/mumpy/
Membership Management - http://groups.google.com/group/mumpy/subscribe/

Samudra Neelam Bhuyan

Nov 19, 2012, 1:40:26 AM

to mu...@googlegroups.com

Hi Dev

Thanks for the answer! :)

I agree with you. It seems to me that building a layer of logic on top of a 3rd-party API will give ROI very fast, but will plateau very quickly too.

Also, any pointers on how we could measure the accuracy of the algorithms? So far we have thought of a crowdsourced tag-correction mechanism to collect errors, followed by statistical analysis of those errors. Is there a better (or standard) way to measure the accuracy of taggers?

Samudra

Raxit Sheth

Nov 17, 2012, 11:49:33 PM

to mu...@googlegroups.com

You may want to have a look at NLTK online, as well as its mailing list. I am
sure you will get definite help from the mailing list.

Raxit


Devendra Rane

Nov 19, 2012, 6:58:18 AM

to mu...@googlegroups.com

There is no direct way to measure the accuracy of taggers without a set of human-verified tags to compare against, so crowdsourced correction seems legit.
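
Once some crowdsourced corrections have been collected, they can double as reference tags, and the usual precision/recall/F1 scores can be computed against them. A minimal sketch, assuming tags are plain strings and one document is compared at a time; the function name and the sample tags are made up for illustration.

```python
def precision_recall_f1(predicted, reference):
    """Score one document's predicted tags against human-verified reference tags."""
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)  # tags the tagger got right
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical tagger output vs. crowd-corrected tags for one document.
predicted = ["magnetism", "metal", "physics"]
reference = ["magnetism", "metal", "chemistry", "curie"]
p, r, f1 = precision_recall_f1(predicted, reference)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Precision tells you how many of the emitted tags were right, recall how many of the right tags were emitted; averaging the scores over a held-out set of documents gives one number to track between versions.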
However, to get better results, instead of starting from a 'random' supervised learning set (or a taxonomy), you can start with a few representative documents you plan to tag. Use an extremely basic taxonomy to tag them, report all untagged records in each document and measure them with some metric (for example, the fraction of terms left untagged). Now keep building the taxonomy so as to minimize this metric. In the process, you will come up with lots of small rules/patterns which will automatically help you build on the taxonomy.
In due time, your metric will come down to an acceptable limit and the patterns/rules will be able to create taxonomies with 50-60% accuracy on newer documents.
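
A rough sketch of that loop in plain Python, assuming the metric is simply the fraction of non-stopword tokens the taxonomy does not cover. The function name, the sample documents and the taxonomy entries are hypothetical; any other metric would slot in the same way.

```python
import re
from collections import Counter

def untagged_report(documents, taxonomy, stopwords=frozenset()):
    """Return (fraction of content tokens not in the taxonomy, misses sorted by frequency)."""
    misses = Counter()
    total = 0
    for doc in documents:
        for token in re.findall(r"[a-z]+", doc.lower()):
            if token in stopwords:
                continue
            total += 1
            if token not in taxonomy:
                misses[token] += 1
    ratio = sum(misses.values()) / total if total else 0.0
    return ratio, misses.most_common()

docs = ["Iron is ferromagnetic", "Nickel is ferromagnetic too"]
stop = {"is", "too"}
taxonomy = {"ferromagnetic": "magnetism"}  # extremely basic starting taxonomy

ratio, misses = untagged_report(docs, taxonomy, stop)
print(ratio, misses)

# Review the most frequent misses by hand, extend the taxonomy, re-run.
taxonomy.update({"iron": "metal", "nickel": "metal"})
ratio_after, _ = untagged_report(docs, taxonomy, stop)
print(ratio_after)
```

The sorted list of misses is the useful part: it tells you which terms to add next for the biggest drop in the metric.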
Hope this helps.

