Keyword Extraction - Project


TW Tones

Oct 21, 2020, 7:17:37 PM
to TiddlyWiki
Folks,

I just thought I would let you know what I am working on to seek input.

Background
  • Many of you categorise tiddlers with tags or with fields such as list fields.
  • The information used to do this is often found in the text you type or data you imported.
  • I am trying to automate this to a high degree to extract and formalise keywords and key phrases and generate tiddlers.

My Project
  • I am currently working on methods to extract content from the text and other fields to create keywords and key phrases as tiddlers. 
  • Once created, the freelinks plugin should highlight them wherever they appear.
  • When viewing a tiddler I can list all the keywords and key phrases found in the content and, with a link widget, click to create keyword tiddlers and add the ones I want to a separate list.
  • Once I have a selected list of keywords and key phrases
    • I would like to order them by importance
    • Be able to insert them in a text field then compose text that uses each selected keyword. I am authoring content for organic search.
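One simple way to approximate "importance" is to count how often each selected keyword or key phrase occurs in the content. The following is only a minimal Python sketch under that assumption; `rank_keywords` is a hypothetical helper, not part of TiddlyWiki or any plugin, and raw frequency is just one possible measure.

```python
import re
from collections import Counter

def rank_keywords(text, keywords):
    """Return keywords sorted by how often each occurs in text, most frequent first."""
    lowered = text.lower()
    counts = Counter()
    for kw in keywords:
        # Whole-word, case-insensitive match; re.escape guards special characters.
        pattern = r"\b" + re.escape(kw.lower()) + r"\b"
        counts[kw] = len(re.findall(pattern, lowered))
    # Sort by descending count, then alphabetically so ties have a stable order.
    return sorted(keywords, key=lambda kw: (-counts[kw], kw.lower()))

if __name__ == "__main__":
    sample = "Organic search rewards content. Good content uses search terms; search often."
    print(rank_keywords(sample, ["content", "search", "organic search", "tiddler"]))
```

A keyword that never occurs still appears in the result, at the end, which makes it easy to see which selected keywords the text has not used yet.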
Help needed
  • I would like to extract words from content, excluding "if, and, or, but, not, the" etc... and generate links to tiddlers of each name.
  • Sometimes we want a phrase rather than a word.
  • What if that content is HTML or JSON?
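For the HTML or JSON case, one option is to flatten the content to plain text first and then run the same word extraction over the result. The sketch below uses only the Python standard library; `html_to_text` and `json_to_text` are hypothetical helper names, and treating every JSON string value as content is an assumption that may not suit every data set.

```python
import json
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(markup):
    """Return the visible text of an HTML fragment as one space-joined string."""
    parser = _TextCollector()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.parts)

def json_to_text(raw):
    """Join every string value found anywhere in a JSON document (keys are ignored)."""
    def walk(node):
        if isinstance(node, str):
            yield node
        elif isinstance(node, dict):
            for value in node.values():
                yield from walk(value)
        elif isinstance(node, list):
            for item in node:
                yield from walk(item)
    return " ".join(walk(json.loads(raw)))
```

Numbers, booleans and object keys are dropped on purpose here; if field names matter as keywords, the walk would need to yield the keys as well.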
Your thoughts would be appreciated.

Regards
Tones

Lin Onetwo

Oct 22, 2020, 11:02:32 AM
to TiddlyWiki
Hi, Tones

Great idea. Automating linking and tagging will be a boost to writing. We normally spend time doing this by hand, and automating it in English (unlike Chinese and Japanese) is very easy, as there are many tools.


You can even do rule-based matching using https://github.com/catalogm/compromise-match2 , so you can create your own precise matching rules to do topic extraction.

Maybe you can do this in some hooks of TiddlyWiki? I am not familiar with hooks.

Hope this helps

LinOnetwo

Atronoush Parsi

Oct 22, 2020, 11:34:28 AM
to tiddl...@googlegroups.com
Hi Lin

LDA is a machine learning tool. It would be great if this kind of library were developed for TiddlyWiki so that it could recognize tiddlers, tiddler fields, ...
 

TW Tones

Oct 22, 2020, 6:07:16 PM
to TiddlyWiki
Interesting insights, Lin and Atro,

I was thinking about the issue with other languages. There could perhaps be two approaches: passive and active keyword identification.

Passive keyword identification
  • Using simple rules to identify words or delimited phrases, perhaps excluding a simple word list.
  • This should be easy to implement in any language with a little knowledge of that language and its word and sentence structures.
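The passive approach can be sketched roughly as follows: split the text at punctuation, drop the words on an exclusion list, and treat each run of consecutive remaining words as a candidate phrase. This is a Python illustration under stated assumptions, not an existing TiddlyWiki feature; `extract_candidates` and the short `STOP_WORDS` set are hypothetical, and the `[a-z0-9']` word pattern only suits plain English text.

```python
import re

# Assumed sample exclusion list; a real one would be longer and language-specific.
STOP_WORDS = {"if", "and", "or", "but", "not", "the", "a", "an", "of", "to", "in", "is", "for"}

def extract_candidates(text, stop_words=STOP_WORDS):
    """Return (keywords, phrases) found in text, in order of first appearance."""
    keywords, phrases = [], []
    # Punctuation delimits fragments so a phrase never spans a sentence break.
    for fragment in re.split(r"[.,;:!?()\[\]\n]+", text.lower()):
        run = []
        # The trailing None flushes the last run of each fragment.
        for word in re.findall(r"[a-z0-9']+", fragment) + [None]:
            if word is not None and word not in stop_words:
                run.append(word)
                if word not in keywords:
                    keywords.append(word)
                continue
            # A stop word (or the end of the fragment) closes the current run.
            if len(run) > 1:
                phrase = " ".join(run)
                if phrase not in phrases:
                    phrases.append(phrase)
            run = []
    return keywords, phrases

if __name__ == "__main__":
    words, phrases = extract_candidates(
        "The freelinks plugin is not the keyword list, but the keyword list is short."
    )
    print(words)
    print(phrases)
```

Only the word pattern and the exclusion list depend on the language, so adapting the sketch mostly means swapping those two pieces.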
Active keyword identification
  • Using more sophisticated textual analysis and keyword databases.
  • More sensitive to the language in use; potentially third-party solutions.
  • We can consider the temporary installation of a tool for keyword identification, or even a utility wiki for analysis of submitted tiddlers.
    • This requires that, once a keyword is identified, the change is saved to the text or to special tiddlers, so the tool can then be removed from the wiki, reducing its size.
  • I would think there should be data sources, made available after the analysis of languages using big data, that are smaller than the input to that system (the language) but reflect the machine learning about that language, which we could use.
  • Makes me wonder if grammatical information about words could be used in wordsmithing TiddlyWiki content, e.g. searching for nouns only.
Lin,

Your links led me to stop words: words that carry too little significance to be useful in search queries. They are usually filtered out of search queries because they return a vast amount of unnecessary information. A better definition is provided below:

See attached, the stop word list in a single tiddler, it is quite short.

Regards
Tones
English Stop words.json