Information Extraction

rpun...@gmail.com

Apr 9, 2014, 8:14:26 AM
to accor...@googlegroups.com
Hi,

I am trying to extract information from resumes, such as skills, qualifications, personal details, address, etc.

Can you tell me which algorithm is best suited for this? Can the Accord.NET framework be used to do all of this?

Regards,
Rohit

César

Apr 11, 2014, 5:01:28 AM
to accor...@googlegroups.com
Hello Rohit!

Sorry for taking some time to reply. I haven't worked much with text processing so I don't know much about information extraction. But are you talking about extracting this information from text, or from images containing the text (i.e. scanned copies of the resumes)?

If you are working with text, then perhaps you could take a look at using a separate Bag-of-Words to represent each part of the resume (i.e. the different sections or paragraphs). Then, once you have your bags of words, you could try any classifier in the framework, such as Support Vector Machines, to do the classification for you. I am not sure if this would be the best way to do it, but it is certainly something that I would try!
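To make the Bag-of-Words idea concrete, here is a minimal sketch in plain Python (not Accord.NET code). The function names `tokenize`, `build_vocabulary` and `bag_of_words` and the two sample resume sections are made up for illustration. Each section becomes a fixed-length vector of word counts, which is the kind of input a classifier such as an SVM expects.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents):
    """Assign each distinct word an index, in order of first appearance."""
    vocab = {}
    for doc in documents:
        for token in tokenize(doc):
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def bag_of_words(text, vocab):
    """Return a count vector for the text; words outside the vocabulary are ignored."""
    vector = [0] * len(vocab)
    for word, count in Counter(tokenize(text)).items():
        index = vocab.get(word)
        if index is not None:
            vector[index] = count
    return vector


# Hypothetical resume sections, used only to illustrate the representation.
sections = [
    "Skills: C#, SQL, machine learning",
    "Education: B.Tech in computer science",
]
vocab = build_vocabulary(sections)
vectors = [bag_of_words(section, vocab) for section in sections]
```

Every vector has the same length as the vocabulary, so sections of different sizes can be compared or passed to the same classifier.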

In the document I linked above there is an example of how to use Bag-of-Words together with naive Bayes to do a simple text classification. Hope it helps!
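For readers without access to the linked example, the following is a rough sketch of naive Bayes text classification over word counts, written in plain Python. It is not the example from the linked document and not Accord.NET code. The class `NaiveBayesText`, the training sentences and the labels are all hypothetical, and Laplace (add-one) smoothing is an assumption made here so that unseen words do not zero out a class.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesText:
    """Multinomial naive Bayes over word counts with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus the sum of smoothed log likelihoods of each word.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data: resume sentences labeled by section type.
texts = [
    "Proficient in C# SQL and Python programming",
    "Programming skills include Java and SQL",
    "Bachelor of Science degree from State University",
    "Master degree in computer science university",
]
labels = ["skills", "skills", "education", "education"]
model = NaiveBayesText().fit(texts, labels)
```

Calling `model.predict("Python and SQL programming")` then returns the label whose word statistics best match the sentence.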

Best regards,
Cesar

rpun...@gmail.com

Apr 11, 2014, 7:23:46 AM
to accor...@googlegroups.com
Hi Cesar,

Many thanks for replying to my query.

Yes, I am talking about extracting the information from text. Initially I am concentrating only on resumes prepared as Word files (doc, docx).

Can I get any steps/documentation on using HMMs with Accord.NET in parallel?

Regards,
Rohit

César

Apr 15, 2014, 7:44:34 AM
to accor...@googlegroups.com
Hi Rohit,

I don't know very well how to do classification of text sentences using HMMs, but I would guess you could use a symbol dictionary to transform your words into discrete symbols, and then transform your paragraphs into sequences of those symbols. This approach can be seen in this simplified example on how to perform text generation using HMMs. Basically, as a first step, the different words from your text corpus should be converted into a dictionary using the Codification class. Then you should be able to transform your text into sequences of integers. Those sequences of integers can then be fed to a HiddenMarkovClassifier. Examples on how to learn HMM classifiers can be found here.

By the way, parallel computations are already used internally by those learning methods. In short, they are implemented using just .NET's Task Parallel Library.
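The pipeline described above (words to symbols, paragraphs to integer sequences, one HMM per class) can be sketched in plain Python as follows. This is only an illustration of the idea, not the Codification or HiddenMarkovClassifier API. The function names, the tiny corpus and, above all, the HMM parameters are assumptions: the probabilities are set by hand here, whereas in practice they would be estimated from training sequences.

```python
import math


def build_codebook(sentences):
    """Map each distinct word to an integer symbol, in order of first appearance."""
    codebook = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            codebook.setdefault(word, len(codebook))
    return codebook


def encode(sentence, codebook):
    """Convert a sentence into a sequence of symbols, skipping unknown words."""
    return [codebook[w] for w in sentence.lower().split() if w in codebook]


def forward_log_likelihood(seq, initial, transition, emission):
    """Scaled forward algorithm: log P(seq | model) for a discrete HMM."""
    if not seq:
        return 0.0
    n = len(initial)
    alpha = [initial[s] * emission[s][seq[0]] for s in range(n)]
    log_lik = 0.0
    for t in range(len(seq)):
        if t > 0:
            alpha = [
                sum(alpha[p] * transition[p][s] for p in range(n)) * emission[s][seq[t]]
                for s in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return -math.inf
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_lik


def classify(seq, models):
    """Pick the class whose HMM assigns the sequence the highest likelihood."""
    return max(models, key=lambda label: forward_log_likelihood(seq, *models[label]))


# Hypothetical corpus; symbols 0-2 are skill words, 3-5 are education words.
codebook = build_codebook(["java sql python", "bachelor degree university"])

# Hand-set two-state models (assumed values, not learned from data).
initial = [0.5, 0.5]
transition = [[0.5, 0.5], [0.5, 0.5]]
skills_emission = [[0.3, 0.3, 0.3, 0.03, 0.03, 0.04]] * 2
education_emission = [[0.03, 0.03, 0.04, 0.3, 0.3, 0.3]] * 2
models = {
    "skills": (initial, transition, skills_emission),
    "education": (initial, transition, education_emission),
}

label = classify(encode("python sql", codebook), models)
```

Keeping one model per class and comparing likelihoods is what turns a set of generative sequence models into a classifier.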