How do I create categories in my own corpus? (NLTK Book chapter 2)

Natalie Nitz

May 18, 2024, 2:01:24 AM

to nltk-users

Hi everyone,

For a university assignment I need to create a corpus of song lyrics. I need to organize these files on the basis of year and genre. In the NLTK book chapter 2, they show how the Brown Corpus is organized into categories. How can I apply that same logic to my own corpus? I am planning to do an analysis of lexical complexity by genre and year and so need to be able to access and analyze the files on the basis of these two characteristics.

Thanks in advance!

Mustafa A

May 20, 2024, 7:10:06 AM

to nltk-users

Hello,

So in this case your categories are "year" and "genre"? I'm not an expert on this, but I believe you could train an ML model to classify lyrics into genres. You would need a labelled dataset to train the model, and could then apply it to your own data.

If you only have a small set of lyrics, a rule-based model is probably the better choice.

Another approach is to manually label your corpus.

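You could store your corpus in a JSONL file, one song per line, with separate keys for the lyrics, the genre and the year, for example (a Beatles song): {"lyrics": "I need somebody\n(Help) not just anybody\n(Help)\n you know I need someone, help...", "genre": "rock", "year": 1965}.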
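To make the JSONL idea a bit more concrete, here is a minimal sketch using only the Python standard library. It assumes one song per line with the keys "lyrics", "genre" and "year"; those key names, the sample songs and the helper functions `load_songs` and `group_by` are all made up for illustration.

```python
import json
from collections import defaultdict

# Hypothetical JSONL content: one JSON object (one song) per line.
# In practice this would be read from a file with open(path, encoding="utf8").
JSONL_TEXT = "\n".join([
    json.dumps({"lyrics": "help me if you can", "genre": "rock", "year": 1965}),
    json.dumps({"lyrics": "one two three four", "genre": "rock", "year": 1971}),
    json.dumps({"lyrics": "la la la la", "genre": "folk", "year": 1965}),
])


def load_songs(lines):
    """Parse an iterable of JSONL lines into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def group_by(songs, *keys):
    """Group lyrics by the given metadata keys, e.g. ("genre", "year")."""
    groups = defaultdict(list)
    for song in songs:
        groups[tuple(song[key] for key in keys)].append(song["lyrics"])
    return dict(groups)


songs = load_songs(JSONL_TEXT.splitlines())
by_genre_year = group_by(songs, "genre", "year")

for (genre, year), lyrics in sorted(by_genre_year.items()):
    print(genre, year, len(lyrics))
```

Grouping on a tuple of keys means the same helper works for genre only, year only, or both together.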

Yet another approach, if your lyrics come from an online database, is to use web scraping (e.g. Beautiful Soup) to extract the genre and the year. If no years are available there, you could extract them from a Wikipedia dump.

I hope you find this useful.

Francis Bond

May 21, 2024, 3:57:13 PM

to nltk-...@googlegroups.com

Hi,

I think you need to use the Categorized Corpus Reader.

See here:

https://www.nltk.org/api/nltk.corpus.reader.html#nltk.corpus.reader.CategorizedCorpusReader
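As a rough sketch of how that could look for a lyrics corpus: `CategorizedPlaintextCorpusReader` (a plaintext reader built on `CategorizedCorpusReader`) accepts a `cat_map` argument mapping each file to a list of categories, so one file can carry both a genre and a year. The file names, lyrics, the "genre:" / "year:" prefixes and the helpers `files_for` and `lexical_diversity` below are assumptions for illustration, not part of NLTK.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Hypothetical toy corpus: file name -> (genre, year, text).
# With real data, the files would already exist in your corpus directory.
SONGS = {
    "song_a.txt": ("rock", "1965", "help me if you can help me now"),
    "song_b.txt": ("rock", "1971", "one two three four five six"),
    "song_c.txt": ("folk", "1965", "la la la la"),
}

root = tempfile.mkdtemp()
for name, (_genre, _year, text) in SONGS.items():
    with open(os.path.join(root, name), "w", encoding="utf8") as fh:
        fh.write(text)

# Each file gets two categories; the prefixes keep genres and years apart.
cat_map = {
    name: ["genre:" + genre, "year:" + year]
    for name, (genre, year, _text) in SONGS.items()
}

corpus = CategorizedPlaintextCorpusReader(root, r".*\.txt", cat_map=cat_map)


def files_for(genre, year):
    """File ids that belong to BOTH the given genre and the given year."""
    in_genre = set(corpus.fileids(categories="genre:" + genre))
    in_year = set(corpus.fileids(categories="year:" + year))
    return sorted(in_genre & in_year)


def lexical_diversity(words):
    """Type/token ratio over alphabetic tokens (one simple complexity measure)."""
    tokens = [w.lower() for w in words if w.isalpha()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


print(corpus.categories())
print(files_for("rock", "1965"))
print(lexical_diversity(corpus.words(categories="genre:rock")))
```

Passing a list of categories to `fileids()` or `words()` returns the union of those categories, which is why the sketch intersects the two sets itself when both genre and year must match.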
