Training Maui with keywords but no documents possible?


Emery Mersich

Jun 2, 2016, 10:01:05 AM
to Kea and Maui Support
Hi,

I'm sorry for the silly question. I don't quite understand the training process.

I have a list of keywords to use for training. Is it absolutely necessary to use documents for the training?

What if I don't have documents? Can I just use the keywords? My intention is to extract keywords from documents using Maui.

Or can I convert the keywords to the SKOS format and use them that way? Again, without documents?

Thank you very much for your help!

Emery

Richard Cyganiak

Jun 2, 2016, 1:19:12 PM
to kea-and-ma...@googlegroups.com
Hi Emery,

If you don’t have any documents, then why do you want to extract keywords from documents? :-)

You will need documents. But let me ask something else first:

When you say that you have “a list of keywords to use for training”, do you mean that you have a complete list of all the possible keywords, and you want Maui to pick the best match from that list for a document? In other words, do you have a controlled vocabulary? Because that would mean you don’t want to do keyword/keyphrase extraction. It would mean you want to do term assignment.

If that’s what you want, you need to turn the list of keywords into SKOS format first.

But if your list of keywords is just a set of examples, and you want Maui to also return any other highly relevant keywords/keyphrases that it may discover in the document, then you really want keyphrase extraction.

In this case, you won’t need SKOS.

But either way, you need training data. Find some example documents that are typical for the kind of document you want to work with. Start with 20 documents—you can add more later to improve quality. Then have a close look at each document and manually assign five or so topics to each. If you want to do term assignment, then make sure that each topic is *exactly* as it is on your list. If you want to do keyphrase extraction, then forget about your list and just pick some good keywords or keyphrases based on the contents of the document.

Then you have everything you need to train Maui.
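For term assignment, the "exactly as it is on your list" requirement is easy to get wrong by hand, so a quick check before training may help. The following is a minimal sketch, not part of Maui. It assumes each training document `<name>.txt` has a matching `<name>.key` file with one manually assigned topic per line, and that the topics are written as vocabulary labels; the function names are hypothetical.

```python
from pathlib import Path


def load_terms(lines):
    """Return the set of non-empty, whitespace-stripped terms from an iterable of lines."""
    return {line.strip() for line in lines if line.strip()}


def find_unknown_topics(topics, vocabulary):
    """Return the topics that do not appear verbatim in the vocabulary.

    The comparison is deliberately exact (case- and spelling-sensitive).
    """
    return [topic for topic in topics if topic not in vocabulary]


def check_training_dir(doc_dir, vocabulary):
    """Map each .key file name to its topics that are missing from the vocabulary.

    Assumed layout (an assumption, verify against your Maui setup):
    <name>.txt holds the document, <name>.key holds one topic per line.
    """
    problems = {}
    for key_file in sorted(Path(doc_dir).glob("*.key")):
        lines = key_file.read_text(encoding="utf-8").splitlines()
        topics = [line.strip() for line in lines if line.strip()]
        unknown = find_unknown_topics(topics, vocabulary)
        if unknown:
            problems[key_file.name] = unknown
    return problems


# Small in-memory demonstration with made-up terms.
vocabulary = load_terms(["bbq sauce", "dips", "appetizers"])
print(find_unknown_topics(["dips", "Dips", "bbq sauces"], vocabulary))
```

Calling `check_training_dir("data/docs/my_train", vocabulary)` on a real directory (the path here is only an illustration) would list, per `.key` file, the topics that need to be corrected before training.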

Best,
Richard




Emery Mersich

Jun 2, 2016, 3:10:45 PM
to Kea and Maui Support

Step 1: Yes, then it's term assignment I'd like to perform in a first step. I have a fixed set of keywords / keyphrases and would like the system to discover the best matches for articles fed in. I don't have the articles ahead of time. I'll look into creating a SKOS file.

At one point VocBench was recommended in another post as a tool to create the SKOS files. Is this still a good option or does another tool exist?

Step 2: In a second step I will look to extend my project for discovery of keywords and phrases in documents. At that point I'll have gathered good sets of documents and will better know how to train models.

Thanks for your help!

Richard Cyganiak

Jun 3, 2016, 5:59:03 AM
to kea-and-ma...@googlegroups.com

> On 2 Jun 2016, at 20:10, Emery Mersich <emer...@gmail.com> wrote:
>
> At one point VocBench was recommended in another post as a tool to create the SKOS files. Is this still a good option or does another tool exist?


There are various tools for converting CSV and similar simple text-based formats to SKOS out there. Such a tool is quite easy to cobble together with a bit of knowledge of any scripting or programming language. I’m not aware of a really good, free, flexible, easy-to-use out-of-the-box solution for converting to SKOS.

(I feel obliged to mention that my employer makes a SKOS-based ontology and taxonomy management product called TopBraid EVN, and it has pretty good importers from spreadsheets and CSV. It’s not free. It has an integration for running Maui.)
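As an illustration of how little is needed to cobble such a converter together, here is a minimal sketch that turns a flat keyword list into SKOS in RDF/XML using only the Python standard library. The base URI `http://example.org/terms/` and the function name are hypothetical. It emits one `skos:Concept` with a `skos:prefLabel` per keyword; whether Maui additionally expects a concept scheme, a particular language tag, or broader/narrower relations is an assumption to verify against your setup.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"
XML_NS = "http://www.w3.org/XML/1998/namespace"
BASE_URI = "http://example.org/terms/"  # hypothetical namespace for the concepts


def keywords_to_skos(keywords, base_uri=BASE_URI, lang="en"):
    """Return an RDF/XML string with one skos:Concept per distinct keyword."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("skos", SKOS_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    seen = set()
    for keyword in keywords:
        label = keyword.strip()
        if not label or label in seen:
            continue  # skip blank lines and exact duplicates
        seen.add(label)
        concept = ET.SubElement(root, f"{{{SKOS_NS}}}Concept")
        # Build a URI from the label: spaces become underscores, the rest is percent-encoded.
        concept.set(f"{{{RDF_NS}}}about", base_uri + quote(label.replace(" ", "_")))
        pref_label = ET.SubElement(concept, f"{{{SKOS_NS}}}prefLabel")
        pref_label.set(f"{{{XML_NS}}}lang", lang)
        pref_label.text = label
    return ET.tostring(root, encoding="unicode")


# Example with made-up keywords; write the result to a file to use it as a vocabulary.
print(keywords_to_skos(["bbq sauce", "dips", "appetizers"]))
```

Reading the keywords from a CSV column instead of a Python list only changes how the `keywords` argument is built; the output part stays the same.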

Richard

Emery Mersich

Jun 3, 2016, 10:56:58 AM
to Kea and Maui Support
OK, thanks. I'll take a look.

Emery Mersich

Jun 9, 2016, 3:15:27 PM
to Kea and Maui Support

Hi,

I've created a SKOS file and am able to run Maui with it. I still have two questions.

1. If I'm only doing term assignment, why must I still provide a path and documents (in this case the fao_train path) to create a model? I can't get it to run without first training, and I can't train without the -l flag and a path with documents.

java -Xmx1024m -jar maui-standalone-1.1-SNAPSHOT.jar train -l data/docs/fao_train/ -m data/models/term_assignment_model -v data/vocabulary/term_list.rdf -f skos

2. What do scores of the form "4.901960784313724E-4" mean? Is there an explanation of them somewhere?

crockpots 0.1538235294117647
crockpot 0.1538235294117647
she loves me 0.1338235294117647
appetizers 0.0538235294117647
recipes by ingredients 0.0538235294117647
dips 0.0538235294117647
dish 0.0538235294117647
bbq sauce 4.901960784313724E-4
appetizer recipes 4.901960784313724E-4
easy recipes 4.901960784313724E-4

Thanks!



Richard Cyganiak

Jun 9, 2016, 4:50:00 PM
to kea-and-ma...@googlegroups.com


> On 9 Jun 2016, at 20:15, Emery Mersich <emer...@gmail.com> wrote:
>
> 1.  If I'm only doing term assignment, why must I still provide a path and documents, in this case the fao_train path, to create a model? I can't get it to run without first training and I can't just train without the -l flag and path with documents.
>
> java -Xmx1024m -jar maui-standalone-1.1-SNAPSHOT.jar train -l data/docs/fao_train/ -m data/models/term_assignment_model -v data/vocabulary/term_list.rdf -f skos


Does the training data use terms from your vocabulary? If not, the trained model will not be very good at all.

Why does Maui need training data for term assignment? Short answer: Because it's a machine learning algorithm. Longer answer in Alyona's PhD thesis ;-)

 

> 2.  What do the scores of the format "4.901960784313724E-4" mean? Is there an explanation of them to be found somewhere?


It's “E notation”, a version of scientific notation. The number after the E is called the exponent; it's -4 in your example. To read E notation, add a few imaginary zeroes to the front (00004.901960784313724); then, since the exponent is -4, move the decimal point four digits to the left (0.0004901960784313724). An exponent of 4 (or +4) would mean moving the point four digits to the right.
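The same conversion can be checked mechanically. This is a small sketch using Python's standard `decimal` module; the function name `expand_e_notation` is made up for illustration and has nothing to do with Maui itself.

```python
from decimal import Decimal


def expand_e_notation(text):
    """Return the plain decimal form of a number written in E notation."""
    # Decimal keeps every digit of the input exactly, so no rounding noise is added.
    return format(Decimal(text), "f")


score = "4.901960784313724E-4"
print(expand_e_notation(score))  # prints 0.0004901960784313724

# An ordinary float parses both spellings to the same value.
print(float(score) == 0.0004901960784313724)  # prints True
```

So the value is simply a very small score, roughly 0.00049, written compactly.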


Best,
Richard



 




Emery Mersich

Jun 10, 2016, 9:18:33 AM
to Kea and Maui Support

> Does the training data use terms from your vocabulary?

Yes it does. I had understood you as saying I didn't need training documents for term assignment, only for keyword extraction.

>E notation

Yes, ok. I thought it had a deeper meaning without the realm of Maui, since all examples I try at some point suddenly become "4.901960784313724E-4", and not say "2.901960784313724E-4" or "4.501960784313724E-4".

Thanks

Emery Mersich

Jun 10, 2016, 3:49:43 PM
to Kea and Maui Support
Sorry

> E notation

Yes, ok. I thought it had a deeper meaning WITHIN the realm of Maui, since all examples I try at some point suddenly become "4.901960784313724E-4", and not say "2.901960784313724E-4" or "4.501960784313724E-4".