Unexpected Tokens

JJohnson

Jan 18, 2023, 2:34:17 AM
to GloVe: Global Vectors for Word Representation
I was using the smallest GloVe pretrained vectors with PyTorch, and a quick test on an untrained model gave back a lot of tokens I would not expect to find in a pretrained word-embedding vocabulary. This is just one example of what the (untrained) model produced in response to a prompt:

['k977-1', 'k587-1', 'js04bb', 'greg.wilcoxdailynews.com', 'bulletinyyy', 'k977-1', '65stk', '15utl', '65stk', '65stk', 'em96', 'k587-1', 'srivalo', 'str95bb', 'k978-1', 'k977-1', 'k978-1', 'bulletinyyy', 'bulletinyyy', 'bulletinyyy', 'bulletinyyy', 'bulletinyyy', 'bulletinyyy', 'bulletinyyy', 'bulletinyyy', 'bulletinyyy', 'bb96', 'bb96', 'bb96', 'srivalo', 'http://www.mediabynumbers.com', 'bb96']
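
For reference, this is roughly how I'm loading the vectors and mapping model outputs back to tokens. It's a simplified sketch: the file path and dimension are placeholders for whichever GloVe file is in use, and the random tensor stands in for my untrained model's embedding predictions.

import torch

def load_glove(path="glove.6B.50d.txt", dim=50):
    """Load GloVe vectors from the plain-text file into a token list and a tensor."""
    tokens, vectors = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            tokens.append(parts[0])
            vectors.append(torch.tensor([float(x) for x in parts[1:]]))
    return tokens, torch.stack(vectors)  # shape: (vocab_size, dim)

tokens, vectors = load_glove()

# Stand-in for my untrained model: random embedding predictions.
predicted = torch.randn(32, vectors.shape[1])

# Map each predicted embedding to the nearest GloVe vector by cosine similarity.
vec_norm = torch.nn.functional.normalize(vectors, dim=1)
pred_norm = torch.nn.functional.normalize(predicted, dim=1)
nearest = (pred_norm @ vec_norm.T).argmax(dim=1)
print([tokens[i] for i in nearest])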

Are there any plans to downsize and remove gibberish and web links from the vector set? Or is there a smaller version I'm just not seeing? 

It would be great if there were a distilled version containing just English words in standard spelling, plus perhaps some common misspellings. The reason I ask is that the model I'm training produces embedding predictions which are then mapped to the nearest word vector to produce output tokens, and that nearest-vector search becomes quite computationally intensive as the number of vectors grows. A rough sketch of the filtering workaround I have in mind is below.
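
This is only an assumption about how I'd work around it myself, not something GloVe provides: intersect the GloVe vocabulary with a plain-text English wordlist (the "words.txt" path is just a placeholder for whatever dictionary file is available) and keep only those rows, so the matrix the nearest-vector search scans is much smaller.

import torch

def filter_vocab(tokens, vectors, wordlist_path="words.txt"):
    """Keep only GloVe entries whose token appears in a plain-text English wordlist."""
    with open(wordlist_path, encoding="utf-8") as f:
        allowed = {line.strip().lower() for line in f if line.strip()}
    keep = [i for i, tok in enumerate(tokens) if tok in allowed]
    kept_tokens = [tokens[i] for i in keep]
    kept_vectors = vectors[torch.tensor(keep)]
    return kept_tokens, kept_vectors

# Using tokens, vectors from the loading snippet above:
# small_tokens, small_vectors = filter_vocab(tokens, vectors)
# print(len(tokens), "->", len(small_tokens))

Since the nearest-vector lookup cost scales linearly with vocabulary size, cutting the vocabulary this way would speed things up, but an officially distilled set would obviously be cleaner.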

TIA