Creating a valuable data store for Hyperaudio

Dustin Blake

Jan 21, 2014, 1:35:17 PM
to hyper...@googlegroups.com
Hey everyone. Last night I was brainstorming ways to build on Hyperaudio, and I concluded that a data store for term frequency information from transcripts could prove pretty valuable in both the short term and the long term. This would entail capturing the textual content of transcripts, both user-created and converted, then pumping it into a module that creates a bag of words. The bag of words is a breakdown of each term in the transcript and its frequency within the document. Once the bag is created, it's stored in a database (likely MongoDB). Along with the transcript data, a global term -> frequency document is updated to reflect term frequency across the whole system (all transcripts). With a global store of term frequency we can filter out the most common words that carry no value (the, a, that, it, etc.) and filter in the ones that do (keywords) for determining categorical data and other useful things.
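
As a rough illustration of the bag-of-words step and the global term -> frequency update described above, here is a minimal Python sketch. It is not the module from this thread: the function names, the tokenizing regex, and the in-memory dictionaries standing in for the database are all assumptions made for the example.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return a term -> frequency mapping for one transcript."""
    # Assumed tokenizer: lowercase, then keep runs of letters, digits, and apostrophes.
    terms = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(terms)


def store_transcript(transcript_id, text, store, global_frequencies):
    """Store one transcript's bag and fold its counts into the global totals."""
    bag = bag_of_words(text)
    store[transcript_id] = dict(bag)   # hypothetical stand-in for a per-transcript document
    global_frequencies.update(bag)     # hypothetical stand-in for the global document
    return bag


store = {}
global_frequencies = Counter()
store_transcript("t1", "The cat sat on the mat.", store, global_frequencies)
store_transcript("t2", "The dog chased the cat.", store, global_frequencies)

print(global_frequencies.most_common(2))  # [('the', 4), ('cat', 2)]
```

In a real deployment the two dictionaries would be replaced by database writes; the sketch only shows the shape of the data.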

So initially what are some benefits of capturing and storing this data?

  • Better search. Precision, ranked results, and search hinting.
  • Better descriptive metadata. Data about data.
  • Assisting transcript cleanup. For example, document breakdown could allow each word to be spell-checked.
  • Increasing accessibility. Document breakdown could allow for acronym identification and automatic <abbr></abbr> tag generation in rendered outputs. Words can also be passed through lexical analysis to determine the language of the transcript, possibly even providing a jumping-off point for translations.
  • Suggestions. In content mixing, the data store can be used to suggest related content that may mix well. In publishing, the data can be used to suggest tags for content and sharing.
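
One way the suggestions idea in the list above could work is to compare transcripts by the overlap of their bags of words. The sketch below uses cosine similarity, which is an assumption chosen for illustration, not a technique anyone in this thread has settled on; the transcript ids and counts are invented.

```python
import math


def cosine_similarity(bag_a, bag_b):
    """Cosine similarity between two term -> frequency mappings."""
    shared = set(bag_a) & set(bag_b)
    dot = sum(bag_a[term] * bag_b[term] for term in shared)
    norm_a = math.sqrt(sum(count * count for count in bag_a.values()))
    norm_b = math.sqrt(sum(count * count for count in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def suggest_related(target_id, bags, limit=3):
    """Return ids of the transcripts most similar to target_id, best first."""
    target = bags[target_id]
    scored = [
        (cosine_similarity(target, bag), other_id)
        for other_id, bag in bags.items()
        if other_id != target_id
    ]
    scored.sort(reverse=True)
    # Drop transcripts that share no terms at all with the target.
    return [other_id for score, other_id in scored[:limit] if score > 0]


# Invented example data: term counts with common words already removed.
bags = {
    "speech": {"economy": 3, "jobs": 2},
    "interview": {"economy": 1, "jobs": 1, "tax": 1},
    "recipe": {"flour": 2, "butter": 1},
}

print(suggest_related("speech", bags))  # ['interview']
```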

What are some long term benefits?

  • The data provides useful insight into what Hyperaudio is being used for: for example, what kind of content is being input, created, and shared. Collecting some additional date/time metadata per transcript upon creation can give us trends.
  • Growth of the data store provides increased precision in common term filtering.
  • Provides data enrichment for media, which can subsequently be used in developing for the Semantic Web (http://www.w3.org/standards/semanticweb/).
  • The data can be used as a jumping-off point for extended applications in natural language processing (semantic analysis or topic modeling, for example) and other machine learning work that might find it useful.
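
To make the common-term filtering point in the list above concrete, here is a small sketch that treats a term as "common" when it appears in more than a chosen share of all transcripts. The 0.8 threshold, the function names, and the sample bags are assumptions for illustration. With only a handful of transcripts the shares are crude, which is why a growing data store should make the filter more precise.

```python
from collections import Counter


def document_frequencies(bags):
    """Count, for each term, how many transcripts contain it at least once."""
    df = Counter()
    for bag in bags:
        df.update(bag.keys())
    return df


def keywords(bag, df, total_docs, max_share=0.8):
    """Keep only terms that appear in at most max_share of all transcripts."""
    return {
        term: count
        for term, count in bag.items()
        if df[term] / total_docs <= max_share
    }


# Invented example bags, one per transcript.
bags = [
    {"the": 4, "economy": 2},
    {"the": 3, "flour": 1},
    {"the": 5, "economy": 1, "tax": 1},
]
df = document_frequencies(bags)

print(keywords(bags[0], df, len(bags)))  # {'economy': 2}
```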

Like the other aspects found within the Hyperaudio ecosystem, this additional capability would conform to any requirements set by Hyperaudio's primary maintainers.


If anyone has additional thoughts, ideas, or even other possible use cases please share them :)


Mark

Jan 21, 2014, 4:38:15 PM
to hyper...@googlegroups.com
Dustin,

Some interesting and very valuable ideas. I think the potential to analyse the text in transcripts could be powerful for a number of reasons, and I'm pretty sure you've listed the main ones. No doubt other uses will emerge, but for search alone I think it's worth doing, let alone all the other wins we could get out of it.

We're actually working with some students who are interested in creating an application that will require search so maybe I can get them to comment on their idea too.

It's great to see the community coming up with ideas. We're essentially just trying to get the basic building blocks in there, along with some interesting examples and applications. There's a lot you could build on top.

As far as architecture is concerned, it seems a shame to be constrained by specific technologies, but yes, we currently use MongoDB and Node.js. Which reminds me: we need to start getting some docs written to detail the back end.

The exciting news is we're going to launch a limited beta this week so if you've signed up at http://hyperaud.io/signup/ watch out for an email. You can also follow the news on https://twitter.com/hyperaud_io

Thanks Dustin, for contributing. Not sure if anyone else has ideas around this subject, but please drop them here if you have.

Better get back to it.

Cheers

Mark



Dustin Blake

Jan 21, 2014, 5:32:18 PM
to hyper...@googlegroups.com
I'll be looking out for an email, I've been signed up for a while now ;)

I look forward to hearing more ideas from everyone, and hope to help in any way I can.

During my storming session last night I concluded that a document store would probably suit this kind of addition quite well. I was actually going to go with CouchDB, merely out of slightly more familiarity, but I've already set up a local Mongo and Node environment to get started with this. With an API, too, those constraints don't necessarily have to be everyone else's constraints.

Forrest Pruitt

Jan 22, 2014, 1:25:45 PM
to hyper...@googlegroups.com
Dustin,

My name is Forrest Pruitt, and I'm one of the students Mark mentioned previously. My team and I are in the starting stages of a feature that could definitely use this kind of functionality. My question to you is this: what do you need to get started? I would love to help.

Cheers,
Forrest

Dustin Blake

Jan 22, 2014, 2:26:33 PM
to hyper...@googlegroups.com
Hi Forrest, thanks for jumping in :)

At the moment I have a simple Node HTTP server I've written that accepts some POST data from a ha-converter. The posted text is lightly filtered and then iterated over to fill an object. As soon as I have a more complete skeleton, with a bit more structure and the objects going into the database, I'll commit the code to GitHub and we can collaborate on that portion there.

In the meantime, an informal design document that covers, in no specific detail, what data needs to exist in the database and other things would be a good starting point, so that we're not running around guessing about it. If you want to create it with Google Docs (or another service of your preference), pulling in what's already been talked about here in the group and what you and your team might have in mind, that would be an awesome way to help get things on track for something more formal.

As for ideas or thoughts about this functionality feel free to express them here :)

Forrest Pruitt

Jan 26, 2014, 10:18:04 PM
to hyper...@googlegroups.com
Hey Dustin! 

Sorry for the delay in getting back to you. I organized the Global Game Jam this weekend for my community and it took a lot of time away from other things.
Have you started the design document yet? 

We're looking at creating a module that will allow one to search all hypertranscripts with a given tag (say, 'Obama') and grab words/phrases from various transcripts. These could then be put together into a mix. The effect can roughly be seen here (http://hyperaud.io/pad/?m=-BNrwA4pS2-Sor98QSSwlQ). The effect of 'make who you want say what you want' could be fun/attention grabbing! 

In the datastore, then, we would need the ability to search by tag and then access the word-level transcriptions. The timestamps for the individual words could be returned and fed into a pad-like environment where they could be pieced together.
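
A minimal sketch of that lookup, assuming a hypothetical data model in which each transcript carries a set of tags and a list of words with start times in milliseconds. The class names, field names, and sample data are invented for illustration and are not Hyperaudio's actual schema.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start_ms: int  # assumed unit: milliseconds from the start of the media


@dataclass
class Transcript:
    transcript_id: str
    tags: set
    words: list


def find_word_timings(transcripts, tag, term):
    """Return (transcript_id, start_ms) for each occurrence of term in transcripts with tag."""
    term = term.lower()
    hits = []
    for transcript in transcripts:
        if tag not in transcript.tags:
            continue
        for word in transcript.words:
            if word.text.lower() == term:
                hits.append((transcript.transcript_id, word.start_ms))
    return hits


# Invented example data.
transcripts = [
    Transcript("speech-1", {"Obama"}, [Word("Yes", 0), Word("we", 400), Word("can", 650)]),
    Transcript("speech-2", {"Obama", "economy"}, [Word("We", 120), Word("will", 380)]),
    Transcript("interview-1", {"cooking"}, [Word("we", 90)]),
]

print(find_word_timings(transcripts, "Obama", "we"))
# [('speech-1', 400), ('speech-2', 120)]
```

The returned pairs are what a pad-like environment would need to cue up each word in its source media.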

If I'm not being clear or something I've said seems off-base, let me know! 
Looking forward to working on this with you.

Hope all is well!

Cheers,
Forrest