Hey everyone. Last night I was brainstorming ways to build on Hyperaudio, and I concluded that building a data store to house term frequency information from transcripts could prove pretty valuable in both the short and long term. This would entail capturing the textual content from transcripts, both user-created and converted, then pumping it into a module that creates a bag of words. The bag of words is a breakdown of each term in the transcript and its frequency within the document. Once the bag is created, it's stored in a database (likely MongoDB).
Along with the transcript data, a global term -> frequency document is updated to reflect term frequency across the whole system (all transcripts). With a global store of term frequency we can filter out the most common words that carry no value (the, a, that, it, etc.) and keep the ones that do (keywords) for determining categorical data and other useful things.
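To make the idea concrete, here's a minimal in-memory sketch of that ingestion step. The function names (`bag_of_words`, `ingest`) are mine, not part of Hyperaudio, and the MongoDB persistence is left out; in the real system both the per-transcript bag and the global counts would be written to the database.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Break a transcript down into a term -> frequency mapping."""
    terms = re.findall(r"[a-z']+", text.lower())
    return Counter(terms)

def ingest(transcript_text, global_counts):
    """Build the per-transcript bag and fold it into the global counts.

    In the real system both would be persisted (e.g. in MongoDB);
    here they are plain in-memory counters.
    """
    bag = bag_of_words(transcript_text)
    global_counts.update(bag)  # accumulate system-wide term frequencies
    return bag

global_counts = Counter()
bag = ingest("the cat sat on the mat", global_counts)
# bag["the"] == 2, bag["cat"] == 1
```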
So initially what are some benefits of capturing and storing this data?
- Better search. Precision, ranked results, and search hinting.
- Better descriptive metadata. Data about data.
- Assisting transcript cleanup. For example, document breakdown could allow each word to be spell-checked.
- Increasing accessibility. Document breakdown could allow for acronym identification and automatic <abbr></abbr> tag generation in rendered outputs. Words can also be passed through lexical analysis to determine the transcript's language, possibly even providing a jumping-off point for translations.
- Suggestions. In content mixing, the data store can be used to suggest related content that may mix well. In publishing, the data can be used to suggest tags and sharing targets for content.
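As a sketch of how the suggestions point could work: with each transcript's bag of words stored, related content can be ranked by cosine similarity between term-frequency vectors. This is just one plausible scoring approach, and the names here (`cosine_similarity`, `suggest`) are hypothetical:

```python
import math
from collections import Counter

def cosine_similarity(bag_a, bag_b):
    """Cosine similarity between two term -> frequency bags."""
    shared = set(bag_a) & set(bag_b)
    dot = sum(bag_a[t] * bag_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in bag_a.values()))
    norm_b = math.sqrt(sum(v * v for v in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def suggest(target_bag, all_bags, limit=3):
    """Rank other transcripts by similarity to the target transcript."""
    scored = sorted(
        ((cosine_similarity(target_bag, bag), doc_id)
         for doc_id, bag in all_bags.items()),
        reverse=True,
    )
    return [doc_id for score, doc_id in scored[:limit] if score > 0]
```

In practice the common-word filtering described above would be applied to the bags first, so that stop words don't dominate the similarity scores.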
What are some long term benefits?
- The data provides useful insight into what Hyperaudio is being used for; for example, what kind of content is being input, created, and shared. Collecting some additional date/time metadata per transcript upon creation can give us trends.
- Growth of the data store provides increased precision in common term filtering.
- Provides data enrichment for media, which can subsequently be used when developing for the Semantic Web (http://www.w3.org/standards/semanticweb/).
- The data can be used as a jumping-off point for extended natural language processing applications (semantic analysis or topic modeling, for example) and other machine learning work that might find it useful.
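On the common-term filtering point above: one simple approach is to drop any term whose share of all terms system-wide exceeds a cutoff, since stop words like "the" dominate the global counts. A minimal sketch, where the `keywords` name and the 1% threshold are illustrative guesses rather than tuned values:

```python
from collections import Counter

def keywords(bag, global_counts, total_terms, threshold=0.01):
    """Keep terms whose system-wide relative frequency is below the cutoff.

    Very common words ("the", "a", "it") account for a large share of
    the global counts, so a relative-frequency cutoff filters them out.
    """
    return {t: f for t, f in bag.items()
            if global_counts[t] / total_terms < threshold}
```

As the data store grows, these relative frequencies stabilize, which is where the increased filtering precision mentioned above would come from.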
Like the other aspects found within the Hyperaudio ecosystem, this additional capability would conform to any requirements set by Hyperaudio's primary maintainers.
If anyone has additional thoughts, ideas, or even other possible use cases please share them :)