Els Test Your Vocabulary

Mack Mosely

Aug 3, 2024, 3:56:44 PM
to condcapkati

At the moment, I am following best practices and creating a "bag of words" vector with a vocabulary from the training data. My cross validation (and test) datasets are transformed using this model, using the same vocabulary created by the training set. They don't contribute any vocabulary, or affect the document frequency (for "term-frequency inverse document frequency" calculation).
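A minimal sketch of that setup in plain Python, assuming a naive whitespace tokenizer and a smoothed IDF formula (both are illustrative choices, and the names `fit_vocabulary` and `transform` are hypothetical). The vocabulary and document frequencies come from the training documents only; validation documents are transformed with them and contribute nothing.

```python
import math
from collections import Counter


def tokenize(text):
    # Deliberately naive tokenizer (assumption): lowercase, split on whitespace.
    return text.lower().split()


def fit_vocabulary(train_docs):
    """Learn the vocabulary and IDF weights from the training documents only."""
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(tokenize(doc)))  # count each word once per document
    vocab = {word: idx for idx, word in enumerate(sorted(doc_freq))}
    n_docs = len(train_docs)
    # Smoothed IDF; the exact formula is an assumption, not taken from the post.
    idf = {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}
    return vocab, idf


def transform(doc, vocab, idf):
    """Map a document to a sparse {column index: tf-idf weight} vector.

    Words outside the fitted vocabulary are dropped.
    """
    counts = Counter(w for w in tokenize(doc) if w in vocab)
    return {vocab[w]: c * idf[w] for w, c in counts.items()}


train_docs = ["the cat sat", "the dog barked"]
validation_docs = ["the cat purred"]

vocab, idf = fit_vocabulary(train_docs)
validation_vectors = [transform(d, vocab, idf) for d in validation_docs]
```

Here "purred" occurs only in the validation document, so it gets no column and does not change any document frequency.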

However, this is restrictive in a few ways. Firstly, calculating the bag of words model is expensive, which prevents me from carrying out k-fold cross-validation (since it would require recalculating the bag of words for every fold). My dataset is around 10 million words, and I'm calculating a bag of words and a bag of bi-grams, which takes around 5 minutes each time.

Would I be biasing my results significantly if I fit the bag of words on both the training set, and the cross validation set? In other words, if I use the vocabulary in the validation set to calculate the vocabulary for the bag of words? The way I figure, even though they might contribute to the vocabulary, there's no risk of overfitting since the frequency for those specific samples won't be seen at training. This allows me to slice the validation set later however I like, and I still have a "test" set for an accurate predictor of generalisation error (the test set won't be seen at all until test time).

Yes. That should be fine. What you suggest makes sense and should not bias the results. The reasons you give are good ones. For any reasonable classifier, if the value of an attribute is always zero in the training set, that should cause the attribute to be essentially ignored.

There is a simple test to let you confirm this. You can try, for each document in the validation set, zeroing out the entries in the feature vector that correspond to words that were not present in any document in the training set, and see if that changes the classification. If it doesn't, then you know that your method has had no effect and hasn't introduced any bias.
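One way this check could look in Python, as a sketch only. The linear scorer below is a toy stand-in for whatever trained classifier is actually in use, and `zero_unseen` and `count_changed_predictions` are hypothetical helper names. Feature vectors are assumed to be `{word: count}` dictionaries.

```python
def zero_unseen(features, train_vocab):
    """Return a copy of a {word: count} vector without words unseen in training."""
    return {w: c for w, c in features.items() if w in train_vocab}


def count_changed_predictions(predict, validation_features, train_vocab):
    """Count validation documents whose predicted label changes after zeroing."""
    changed = 0
    for features in validation_features:
        if predict(features) != predict(zero_unseen(features, train_vocab)):
            changed += 1
    return changed


# Toy stand-in for a trained model (assumption): a linear scorer whose weights
# were learned from training data, so unseen words have weight 0.
weights = {"good": 1.0, "bad": -1.0}
train_vocab = set(weights)


def predict(features):
    score = sum(weights.get(w, 0.0) * c for w, c in features.items())
    return "pos" if score > 0 else "neg"


validation_features = [
    {"good": 2, "superb": 1},   # "superb" appears only in the validation set
    {"bad": 1, "dreadful": 3},  # "dreadful" appears only in the validation set
]
changed = count_changed_predictions(predict, validation_features, train_vocab)
```

A result of zero means the validation-only words had no influence on any prediction. A non-zero result would suggest the classifier does react to attributes that were always zero in training.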

As a matter of implementation, it's certainly possible to implement that in a more efficient way than re-generating the vocabulary and re-generating the feature vectors for each fold. As long as your implementation generates the same vectors, it doesn't matter how it obtains them.

Generate a sparse feature vector for each document, based on the "superset vocabulary". I suggest that you represent these in an efficient way, e.g., as a Python dictionary that maps from word to count.

Generate a subset vocabulary, containing only the words in the training set. Probably there is a small set $S$ of words that are in the superset vocabulary but not the subset vocabulary, so I suggest storing that set $S$.

For each document in the training set and validation set, generate a derived feature vector for the document, using only the subset vocabulary. I suggest generating this by starting from the feature vector for the superset vocabulary, then removing the words in $S$. This should be more efficient than regenerating the feature vector from scratch.
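The three steps above can be sketched as follows, assuming whitespace tokenization and documents identified by their position in a list (`doc_counts` and `vectors_for_fold` are hypothetical names). The per-document counts are computed once; each fold only recomputes the set $S$ and filters.

```python
from collections import Counter


def doc_counts(text):
    # Step 1: one sparse {word: count} vector per document, computed once.
    return Counter(text.lower().split())


def vectors_for_fold(all_counts, train_idx):
    """Derive training-vocabulary vectors for one fold from cached counts."""
    superset_vocab = set().union(*all_counts)
    # Step 2: the subset vocabulary and the set S of words outside it.
    train_vocab = set().union(*(all_counts[i] for i in train_idx))
    S = superset_vocab - train_vocab
    # Step 3: drop the words in S instead of re-tokenizing every document.
    derived = [
        {w: c for w, c in counts.items() if w not in S} for counts in all_counts
    ]
    return derived, S


docs = ["the cat sat", "the dog barked", "a cat purred"]
all_counts = [doc_counts(d) for d in docs]

# Fold where documents 0 and 1 are training data and document 2 is validation.
derived, S = vectors_for_fold(all_counts, train_idx=[0, 1])
```

For k-fold cross-validation, `vectors_for_fold` would be called once per fold with a different `train_idx`, while `all_counts` is reused unchanged.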

Language learning consists of the following parts: grammar, vocabulary learning and speaking. Of these, grammar is the one that can be completed in the shortest time. Speaking is only possible if you know words. There are tens of thousands of words in a language, so the vocabulary learning process is quite long. You can test your vocabulary to find out how many words you know and plan your learning process accordingly.

Receptive vocabulary consists of words that learners recognize and understand when they are used in context, but which they cannot produce. It is vocabulary that learners recognize when they see or meet it in a text but do not use in speaking and writing (Stuart Webb, 2009).

Productive vocabulary consists of words that language learners understand, can pronounce correctly, and can use constructively in speaking and writing. It involves what is needed for receptive vocabulary plus the ability to speak or write at the appropriate time. Productive vocabulary can therefore be regarded as an active process, because learners can produce the words to express their thoughts to others (Stuart Webb, 2005).

The number of active words is smaller than the number of passive words. People understand thousands of words on different subjects by listening, reading and seeing, but they use only the words that match their interests and the words in common use (especially when speaking). The words people only understand are called passive words, and the words they use to express themselves and build new structures are called active words.

How many words do we actually know? What would it take to identify and count all the words a person knows? A tough nut to crack, isn't it? Sure, you could dig into a thick, hardcover dictionary, flip through page by page, mark and count all the words you know. But do you really want to? And hey, what about the time it would take? It would be like trying to count all the grains of sand in a handful of sand.

Is there a better way? Yes, there is. Let's start by counting, say, 20 grains, weigh them, and then use a mathematical algorithm to find the number of remaining grains based on their weight. In our case, weight refers to the frequency of a word, meaning that common and frequent (i.e. simple) words have less weight, whereas rare and complex words carry more weight. We've created a counter and "scales" for you, so you don't have to bother with all that weight-math stuff.
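The test's actual algorithm is not described here, so the following is only a sketch of one common approach, stratified sampling by frequency band. All band sizes and sample counts are hypothetical. Each known word in a rarer, larger band stands in for more unseen words, which is one way to read the "weight" in the analogy.

```python
def estimate_vocabulary_size(bands):
    """Estimate the number of known words from a sample of each frequency band.

    `bands` is a list of (band_size, words_shown, words_known) tuples.
    """
    total = 0.0
    for band_size, shown, known in bands:
        if shown == 0:
            continue  # no sample from this band, so it contributes nothing
        # Scale the known fraction of the sample up to the whole band.
        total += band_size * known / shown
    return round(total)


# Hypothetical bands, ordered from the most frequent words to the rarest.
bands = [
    (1000, 10, 10),  # knows all 10 sampled common words
    (4000, 10, 7),   # knows 7 of 10 sampled mid-frequency words
    (15000, 10, 2),  # knows 2 of 10 sampled rare words
]
estimate = estimate_vocabulary_size(bands)  # 1000 + 2800 + 3000 = 6800
```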

If you really want an accurate result, be honest. Do not tick words that you might have seen or heard once and that seem familiar but whose meaning you are unsure of. Don't look up words in a dictionary. And don't worry about time: the test takes only a few minutes to complete. If your vocabulary is big enough, you will have to go through at most 152 words. Take the test and compare your score with other people from around the world.

Of course, you already know thousands of words, and you will continue to learn more whether you work at it or not. The fact is that many of the words you know were probably learned simply by coming across them often enough in your reading, in conversation, and even while watching television. But increasing the pace of your learning requires a consistent, dedicated approach. If you learned only one new word a day for the next three years, you would have over a thousand new words in your vocabulary. However, if you decided right now to learn ten new words a day, in one year you would have added over three thousand to what you already know, and probably have established a lifetime habit of learning and self-improvement.
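The arithmetic behind those two figures can be checked quickly, assuming 365 days per year and no forgetting (`words_learned` is a hypothetical helper):

```python
DAYS_PER_YEAR = 365  # assumption: ignores leap years


def words_learned(words_per_day, years):
    # Total new words at a constant daily pace, assuming none are forgotten.
    return words_per_day * DAYS_PER_YEAR * years


one_a_day = words_learned(1, 3)   # 1095, just over a thousand
ten_a_day = words_learned(10, 1)  # 3650, over three thousand
```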

While there are not any magic shortcuts to learning words, the larger your vocabulary becomes, the easier it will be to connect a new word with words you already know, and thus remember its meaning. So your learning speed, or pace, should increase as your vocabulary grows. There are four basic steps to building your vocabulary:

When you have become more aware of words, reading is the next important step to increasing your knowledge of words, because that is how you will find most of the words you should be learning. It is also the best way to check on words you have already learned. When you come across a word you have recently studied, and you understand it, that proves you have learned its meaning.

Once you have begun looking up words and you know which ones to study, vocabulary building is simply a matter of reviewing the words regularly until you fix them in your memory. This is best done by setting aside a specific amount of time each day for vocabulary study. During that time you can look up new words you have noted during the day and review old words you are in the process of learning. Set a goal for the number of words you would like to learn and by what date, and arrange your schedule accordingly. Fifteen minutes a day will bring better results than half an hour once a week or so. However, if half an hour a week is all the time you have to spare, start with that. You may find more time later on, and you will be moving in the right direction.

The steps we have just discussed do not involve the use of vocabulary-building aids such as books, tapes, or CDs; all that is required is a dictionary. But what about such materials? Are they worth using? We say yes.

The first advantage of vocabulary-building books is that they present you with words generally considered important to know, thus saving you time. Another advantage of many of these books is that they will use the words in several sentences, so that you can see the words in different contexts. A third advantage is that they usually have exercises that test what you have learned, which gives you a clear sense of progress.

The major disadvantage of many of these books is that the words in them may sometimes be too difficult for the person who does not have a large vocabulary. Such a person would have a hard time learning these words and could quickly become discouraged. We suggest, therefore, that you scan the materials you are interested in before buying. If most of the words are totally unfamiliar to you, you will probably not get very much out of it. If, however, you recognize many of the words but do not quite know them, then the material is probably at the right level for you.

We know you can expand your vocabulary almost as fast as you wish. There are countless examples of people who have done so. Remember, you started out in life knowing no words, and now you know thousands. You can learn many more. Why not start today?
