Extract topics from new documents


meir shamay

Nov 2, 2018, 4:51:14 AM
to bigartm-users
Hi,
I have 1000 blog posts and I want to extract topics from each of them.
The result I want to produce is: document_name: [list of topics]

As I understand it, I need to create a batch from each document using the ArtmParseCollection function, and then call:
bigartm --use-batches <batches> --use-modality @default_class,@target --topics 50 --load-model model.bin
          --write-predictions pred.txt --csv-separator=tab --predict-class @target --write-class-predictions pred_class.txt --score ClassPrecision
But it's not working for me. Am I missing something?
Thanks.

Oleksandr Frei

Nov 2, 2018, 9:02:13 AM
to meir....@gmail.com, bigart...@googlegroups.com
Hi,
Here is an example using the eurlex collection (available at http://docs.bigartm.org/en/stable/tutorials/datasets.html):

1. Fit the model and save it to the eurlex.model file:
bigartm.exe -c vw.eurlex.txt -t 20 -p 10 --save-model eurlex.model --use-modality @labels_class,@default_class --force

2. Find the topic distribution of the test documents:
bigartm.exe -c vw.eurlex-test.txt --load-model eurlex.model --write-predictions eurlex-test.pred.txt --use-modality @default_class --force

3. Find the label distribution of the test documents (where the label is one of the modalities of the model):
bigartm.exe -c vw.eurlex-test.txt --load-model eurlex.model --write-predictions eurlex-test.pred.class.txt --predict-class @labels_class --use-modality @default_class --force

4. Find a single label for each test document:
bigartm.exe -c vw.eurlex-test.txt --load-model eurlex.model --write-class-predictions eurlex-test.class.txt --predict-class @labels_class --use-modality @default_class --force

In (2) the output is an N x T matrix (N = number of documents, T = number of topics; 20 in this example).
In (3) the output is an N x L matrix (L = number of distinct labels in the @labels_class modality).
In (4) the output is an N x 1 vector (giving the best label for each document).
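Since what you ultimately want is document_name: [topics list], here is a rough post-processing sketch for the file written in (2). It is not tested against your exact output; it assumes the run used --csv-separator=tab (as in your original command) so the file is tab-separated, that the first row is a header whose trailing columns are topic names, and that each data row starts with a single document-identifier column (adjust N_ID_COLUMNS and the separator to match your file):

import csv

N_ID_COLUMNS = 1   # assumption: number of leading identifier columns in the file
TOP_N = 5          # how many topics to report per document

with open("eurlex-test.pred.txt", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    header = next(reader)
    topic_names = header[N_ID_COLUMNS:]
    for row in reader:
        doc_id = row[0]
        probs = [float(x) for x in row[N_ID_COLUMNS:]]
        top = sorted(zip(topic_names, probs), key=lambda t: t[1], reverse=True)[:TOP_N]
        print(doc_id, [name for name, _ in top])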

Does this help solve your problem? Note that --predict-class changes what's written into --write-predictions <file>.

The simplest way to feed data in is to use the VW format (http://docs.bigartm.org/en/stable/tutorials/datasets.html#), but all of the commands above also work with batches. Batches save time when you work with a large collection (the text-parsing step is skipped), but for a small collection it doesn't matter.
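By the way, if you prefer the Python interface to the CLI, the same workflow looks roughly like this. This is only a sketch (please check the Python API documentation for the exact constructor arguments); the file and folder names are placeholders:

import artm

# Parse the VW file into batches and fit a 20-topic model on two modalities
bv = artm.BatchVectorizer(data_path="vw.eurlex.txt",
                          data_format="vowpal_wabbit",
                          target_folder="eurlex_batches")
model = artm.ARTM(num_topics=20,
                  dictionary=bv.dictionary,
                  class_ids={"@default_class": 1.0, "@labels_class": 1.0})
model.fit_offline(batch_vectorizer=bv, num_collection_passes=10)

# Topic distributions (theta matrix) of the test documents
test_bv = artm.BatchVectorizer(data_path="vw.eurlex-test.txt",
                               data_format="vowpal_wabbit",
                               target_folder="eurlex_test_batches")
theta = model.transform(batch_vectorizer=test_bv)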

Kind regards,
Alex



meir shamay

Nov 2, 2018, 12:22:42 PM
to bigartm-users
Hi,
Thank you for your response.
I am trying to extract topics from a blog post, and I am getting the error: "InvalidOperation :  Error in test_document.txt:1 has too few entries:"
Do I need to pre-process the blog content first? If so, what should I do?

Regards,

Oleksandr Frei

Nov 2, 2018, 12:36:33 PM
to meir shamay, bigart...@googlegroups.com
Hi,
You need to pre-process the collection and convert it into the VW format. The VW format looks as follows:

document_number_one |@labels_class peninsular_malaysia_@LABEL monitoring_of_exports_@LABEL export_of_waste_@LABEL |@default_class posit pursuant amend:5 relev 
document_number_two |@labels_class agricultural_statistics_@LABEL livestock_@LABEL animal_production_@LABEL statistical_method_@LABEL cattle_@LABEL swine_@LABEL |@default_class council:5 made commun:2 treati

Each line starts with a document identifier (which must contain no spaces), followed by tokens.
In the example above both documents are represented as bags of words, i.e. I counted how many times each token occurs in the document and wrote that count as "token:count". That's optional, i.e. it's fine to put just the text; just make sure it doesn't contain a colon (":") or a pipe ("|"). A pipe introduces a class label; for example, |@labels_class means that all subsequent tokens belong to the class @labels_class (until the next pipe, i.e. |, which re-defines the class).

The simplest VW file, where all tokens belong to the default class, looks like this:
doc1 content of my document number
doc2 the quick brown fox jumps over the lazy dog
But remember that tokens must be cleaned, stemmed or lemmatized, converted to a consistent case, etc. "DOG", "dog" and "Dog" would be three different tokens, and "jumps" and "jump" would be different too.
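For a quick start, here is a minimal Python sketch for turning raw documents into such a VW file. It only lowercases the text, counts tokens and drops the characters that VW treats specially (":" and "|"); stop-word removal and stemming/lemmatization are up to you (e.g. with nltk), and the file and variable names are just placeholders:

import re
from collections import Counter

def to_vw_line(doc_id, text):
    # doc_id must contain no spaces
    tokens = re.findall(r"[a-z0-9_]+", text.lower())  # drops ":", "|" and punctuation
    counts = Counter(tokens)
    bag = " ".join(tok if n == 1 else f"{tok}:{n}" for tok, n in counts.items())
    return f"{doc_id} |@default_class {bag}"

docs = {
    "blog_post_1": "The quick brown fox jumps over the lazy dog. The dog sleeps.",
    "blog_post_2": "Extracting topics from blog posts with BigARTM.",
}

with open("vw.blogs.txt", "w") as out:
    for doc_id, text in docs.items():
        out.write(to_vw_line(doc_id, text) + "\n")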

Kind regards,
Alex

