Need Guidance

Supernatural Self

Nov 21, 2022, 7:35:49 AM

to Gensim

I was wondering if anyone could point me in the right direction. I am an amateur Python coder, and I am interested in using the latest word2vec (or another model) to output semantically similar terms/topics as n-grams of up to 5 words, or what some call "LSI terms". Forgive me if I use a term wrong. I am not an AI programmer, nor am I a full-fledged Python coder. I learn as much as I need to make scripts work with each other.

GOAL: Optimize website content for both readers and SEO.

I want to be able to supply the gensim script with one, two, or more words/terms, run the script against the word2vec model, and have it generate a CSV file with the closest words/terms (1-gram, 2-gram, 3-gram, up to 5-gram): terms that are semantically similar to the input terms, plus tightly correlated entities often associated with them. But instead of just producing results based on the supplied documents or URLs, would it be possible to have it draw results from a much larger corpus of documents, i.e. a much larger dataset?

People in the SEO industry use the term "LSI", which is a dinosaur term. Another closely associated term is TF-IDF, but I feel that can be highly limited if it is based only on a smaller corpus of documents.

I use the Google NLP API to parse text, but it can get expensive fast, especially if you use the content classification call. What Google will not reveal, however, is how its entities relate to each other in vector space (for obvious reasons)... but I think that replicating those word/term vector relationships shouldn't be complicated. When I say entities, most people think of named entities such as people, places, or things, but Google NLP has a lot of "other" entities that aren't named entities but are important in relation to named entities.
Example input: semantic SEO, lexical semantics
Example output: semantic search, search engine, syntax semantics, etc. (output all closely related words). It would be nice if a cosine score was provided as well.
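
To make the request concrete, here is a minimal sketch of the kind of script described above. It assumes a pretrained vector file in word2vec binary format that Gensim's `KeyedVectors.load_word2vec_format` can read, and that multi-word phrases in that file are joined with underscores (as in some publicly released phrase vectors); both are assumptions about the model, not something confirmed here. The function names, the file path, and the CSV column names are made up for illustration.

```python
import csv


def load_vectors(path):
    """Load pretrained word2vec-format vectors (the path is a placeholder)."""
    # Imported lazily so the helpers below can be used without Gensim installed.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=True)


def similar_terms(kv, terms, topn=20):
    """Return (term, cosine_similarity) pairs closest to the input terms."""
    # Assumption: phrases in the model are stored as "search_engine",
    # so multi-word input is normalized the same way.
    keys = [t.strip().replace(" ", "_") for t in terms]
    # Terms missing from the model's vocabulary are skipped, not guessed at.
    known = [k for k in keys if k in kv.key_to_index]
    if not known:
        return []
    hits = kv.most_similar(positive=known, topn=topn)
    return [(word.replace("_", " "), float(score)) for word, score in hits]


def write_csv(rows, out_path):
    """Write the terms, their n-gram size, and the cosine score to a CSV file."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["term", "ngram_size", "cosine_similarity"])
        for term, score in rows:
            writer.writerow([term, len(term.split()), f"{score:.4f}"])
```

Usage would look something like `write_csv(similar_terms(load_vectors("vectors.bin"), ["semantic SEO", "lexical semantics"]), "related_terms.csv")`. Whether phrases as long as 5 words come back depends entirely on whether the pretrained model contains them; many models only hold single words and short phrases, and lookups are usually case-sensitive.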

Second, I'd also like to supply the script with a list of URLs and use various scripts to scrape the URLs, clean the pages, crawl the cleaned pages, and use word2vec to pull from the text all semantically similar terms and strong "often used together with" related terms.
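
The second idea could be sketched roughly as follows, assuming the pages are plain HTML that the standard library can fetch and parse, and using Gensim's `Phrases` to merge frequent word pairs into multi-word tokens before training `Word2Vec`. The helper names and all thresholds are illustrative guesses, not recommended settings.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style/noscript content."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_sentences(html):
    """Strip markup and return a list of lowercase token lists, one per sentence."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts)
    # Crude sentence split on end punctuation; good enough for a sketch.
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in re.split(r"[.!?]+", text)]
    return [tokens for tokens in tokenized if tokens]


def fetch_sentences(urls):
    """Download each URL and return all tokenized sentences."""
    sentences = []
    for url in urls:
        with urlopen(url, timeout=10) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        sentences.extend(html_to_sentences(html))
    return sentences


def train_phrase_model(sentences, passes=2):
    """Merge frequent word pairs into phrases, then train word2vec on the result."""
    # Imported lazily so the scraping helpers work without Gensim installed.
    from gensim.models import Word2Vec
    from gensim.models.phrases import Phrases
    # Each pass joins adjacent tokens, so two passes can yield up to 4-word phrases.
    for _ in range(passes):
        phrases = Phrases(sentences, min_count=2, threshold=5.0)
        sentences = [phrases[s] for s in sentences]
    return Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=2)
```

After training, something like `model.wv.most_similar("semantic_search", topn=10)` would return neighbours with cosine scores, provided that phrase was actually formed. A handful of scraped pages is a very small corpus for word2vec, so results from this alone are likely to be noisy, which is the same small-corpus concern raised above.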

I have been playing with Orange3, which is a widget/add-on-based UI that can be used for text mining. It is pretty neat, but it's limited. How nice would it be if there was another program out there like Orange3 that had far more text-mining tools? It's a brilliant idea, but it is primarily focused on biochemistry.

If I have misunderstood what Gensim does or word2vec does, please point me in the right direction.

Thanks
