[OT?] Best DB/architecture for n-gram corpus?


Sam Raker

Mar 6, 2015, 7:25:05 PM3/6/15
to clo...@googlegroups.com
I'm trying to create an n-gram[1] corpus out of song lyrics. I'm breaking individual songs into lines, which are then split into words, so you end up with something like

{0 {0 "go" 1 "tell" 2 "aunt" 3 "rhodie"} 1 {0 "the" 1 "old" 2 "grey" 3 "goose" 4 "is" 5 "dead"}...}

(Yes, maps with integer keys are kind of dumb; I thought about using vectors, but this is all going into MongoDB temporarily, and I'd rather just deal with maps than mess with Mongo's somewhat lacking array handling.)

The idea, ultimately, is to build a front-end that would allow users to, e.g., search for all songs that contain the (sub)string "aunt rhodie", or see how many times The Rolling Stones use the word "woman" vs how many times the Beatles do, etc. The inspiration comes largely from projects like COCA[2]. 

I'm wondering if any of you have opinions about which database to use (Mongo is most likely just a stopgap) and how best to architect it. I'm most familiar with MySQL and Mongo, but I'd rather not be limited to those two if there's a better option out there. I'm leaning toward storing tokens rather than types--that is, each word occurrence would be stored individually, as opposed to having a single entry for, say, "the" that stores every instance of the word "the." I was also thinking I'll probably have to store each token's "previous" and "next", either as full references or just as strings. This seems potentially inefficient, however.
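
For concreteness, here's roughly the shape of the tokenizing step I have in mind (a minimal sketch in plain Clojure; the function names are mine):

(require '[clojure.string :as str])

(defn tokenize-line [line]
  (-> line str/lower-case (str/split #"\s+")))

(defn line-ngrams [words n]
  (map #(str/join " " %) (partition n 1 words)))

(defn song-ngrams
  "All n-grams of size 1..max-n, per line of the lyrics."
  [lyrics max-n]
  (for [line  (str/split-lines lyrics)
        :let  [words (tokenize-line line)]
        n     (range 1 (inc max-n))
        ngram (line-ngrams words n)]
    ngram))

;; (song-ngrams "Go tell Aunt Rhodie\nThe old grey goose is dead" 2)
;; => ("go" "tell" "aunt" "rhodie" "go tell" "tell aunt" "aunt rhodie" ...)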

(I could've just gone to StackOverflow with this, but figured I'm more likely to get a real answer here, because you all seem so smart and nice?)


Thanks!



Ray Miller

Mar 7, 2015, 5:22:24 AM3/7/15
to clo...@googlegroups.com
On 7 March 2015 at 00:25, Sam Raker <sam....@gmail.com> wrote:
> I'm trying to create an n-gram[1] corpus out of song lyrics. I'm breaking
> individual songs into lines, which are then split into words, so you end up
> with something like
>
> {0 {0 "go" 1 "tell" 2 "aunt" 3 "rhodie"} 1 {0 "the" 1 "old" 2 "grey" 3
> "goose" 4 "is" 5 "dead"}...}

Why split into lines? In this example, "rhodie the" is just as valid a
bigram as "tell aunt". It would be more natural to split at a sentence
boundary.

> (Yes, maps with integer keys is kind of dumb; I thought about using vectors,
> but this is all going into MongoDB temporarily, and I'd rather just deal
> with maps instead of messing with Mongo's somewhat lacking array-handling
> stuff.)
>
> The idea, ultimately, is to build a front-end that would allow users to,
> e.g., search for all songs that contain the (sub)string "aunt rhodie", or
> see how many times The Rolling Stones use the word "woman" vs how many times
> the Beatles do, etc. The inspiration comes largely from projects like
> COCA[2].
>
> I'm wondering if any of you have opinions about which database to use (Mongo
> is most likely just a stopgap), and how best to architect it. I'm most
> familiar with MySQL and Mongo, but I'd rather not be limited by just those
> two if there's a better option out there.

First up, I think you'll likely want to trade space for speed, and the
simplest way to do that is to store every n-gram you're interested in.
This means deciding up-front on the maximum size of n-gram you'll
support. You could fairly easily model this in any relational
database as:

tracks
--------
id (serial, primary key)
name (text)

n_grams
------------
id (serial, primary key)
n (integer)
n_gram (text)

track_n_gram
-------------------
track (references tracks.id)
n_gram (references n_grams.id)
num_occurrences (integer)

With that schema, you should be able to answer all of the queries you
posed with some simple SQL.
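
For example, the "which songs contain 'aunt rhodie'?" query might look like this (a sketch using clojure.java.jdbc; the db-spec is whatever connection map suits your database):

(require '[clojure.java.jdbc :as jdbc])

(defn songs-containing
  "Tracks whose lyrics contain the given n-gram."
  [db-spec ngram]
  (jdbc/query db-spec
              ["SELECT t.name, tn.num_occurrences
                  FROM tracks t
                  JOIN track_n_gram tn ON tn.track = t.id
                  JOIN n_grams g ON g.id = tn.n_gram
                 WHERE g.n_gram = ?"
               ngram]))

;; (songs-containing db-spec "aunt rhodie")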

Of course, this being the Clojure mailing list, you should also
consider Datomic; I think the above could be mapped to a Datomic
schema without too much effort.

Ray.

Matching Socks

Mar 7, 2015, 6:22:20 AM3/7/15
to clo...@googlegroups.com
A lot of people would use Lucene. Lucene calls n-grams of words "shingles". [1]

As for "architecture", here is a suggestion to use Lucene to find keys to records in your "real" database. [2]

[1] https://lucidworks.com/blog/whats-a-shingle-in-lucene-parlance/

[2] https://groups.google.com/d/msg/datomic/8yrCYxcQq34/GIomGaarX5QJ


John Wiseman

Mar 9, 2015, 6:12:43 PM3/9/15
to clo...@googlegroups.com
One thing you can do is index 1-, 2-, 3-, ... n-grams in a simple, fast key-value store (LevelDB, etc.). E.g., you could have entries like

"aunt rhodie" -> song-9, song-44
"woman" -> song-12, song-65, song-96

That's basically how I made the Metafilter N-gram Viewer, a clone of the Google Books Ngram Viewer.
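
In Clojure terms that index is just a map from n-gram to the set of songs containing it. A sketch of the indexing step (in-memory here, but the same shape works in LevelDB or any other KV store):

(defn index-song [index song-id ngrams]
  (reduce (fn [idx ngram]
            (update-in idx [ngram] (fnil conj #{}) song-id))
          index
          ngrams))

;; (-> {}
;;     (index-song "song-9"  ["aunt rhodie" "woman"])
;;     (index-song "song-44" ["aunt rhodie"]))
;; => {"aunt rhodie" #{"song-9" "song-44"}, "woman" #{"song-9"}}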

Another possibility is using Lucene. Just be aware that Lucene calls n-grams of characters ("au", "un", "nt") n-grams, but it calls n-grams of words ("that the", "the old", "old gray") shingles. So you would end up using (I think--I haven't done this) the ShingleFilter.
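
The terminology difference in plain Clojure, just to make it concrete:

(require '[clojure.string :as str])

;; character n-grams:
(map #(apply str %) (partition 2 1 "aunt"))
;; => ("au" "un" "nt")

;; word n-grams, i.e. what Lucene calls shingles:
(map #(str/join " " %) (partition 2 1 ["the" "old" "gray" "goose"]))
;; => ("the old" "old gray" "gray goose")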

You might also find this article by Russ Cox interesting, where he describes building and using an inverted trigram index: http://swtch.com/~rsc/regexp/regexp4.html


John






Sam Raker

Mar 9, 2015, 10:09:39 PM3/9/15
to clo...@googlegroups.com
That's interesting. I've been really reluctant to "hard code" n-grams, but it's probably the best way to go.

John Wiseman

Mar 10, 2015, 1:29:44 PM3/10/15
to clo...@googlegroups.com
By "hard coding" n-grams, do you mean using the simple string representation, e.g. "aunt rhodie" as the key in your database?  If so, then maybe it helps to think of it from the perspective that it's not really just text, it's a string that encodes an n-gram just like "[\"aunt\", \"rhodie\"]" is another way to encode an n-gram--the encoding/decoding uses clojure.string/join and clojure.string/split instead of json/write and json/read, and escaping tokens that contain spaces is on your TODO list at a low priority :)

(And I think the Google n-gram corpus uses the same format.)


John

Sam Raker

Mar 10, 2015, 1:58:35 PM3/10/15
to clo...@googlegroups.com
I meant more deciding on a maximum size and storing them qua n-grams--it seems limiting. On the other hand, after a certain size they stop being n-grams and start being something else--"texts," possibly.

John Wiseman

Mar 10, 2015, 3:03:14 PM3/10/15
to clo...@googlegroups.com
OK, I see.  Well, on non-trivially sized corpora, I think storage requirements can become an issue, and in a situation where you're handling user queries one might wonder how often someone will query a 10-gram.  But if you can make it work, go nuts!

For a lot of statistical language modeling there seems to be a sweet spot at the 3-gram point.  I feel like I even saw a paper recently that compared different human languages and concluded something about the importance of trigrams, but I can't find it now.

Ray Miller

Mar 10, 2015, 3:47:37 PM3/10/15
to clo...@googlegroups.com
On 10 March 2015 at 17:58, Sam Raker <sam....@gmail.com> wrote:
> I more meant deciding on a maximum size and storing them qua ngrams--it
> seems limiting. On the other hand, after a certain size, they stop being
> ngrams and start being something else--"texts," possibly.

Exactly. When I first read your post, I almost suggested you model
this in a graph database like Neo4j or Titan. Each word would be a
node in the graph with an edge linking it to the next word in the
sentence. You could define an index on the words (so retrieving all
nodes for a given word would be fast), then follow edges to find and
count particular n-grams. This is more complicated than the relational
model I proposed, and will be a bit slower to query. But if you don't
want to put an upper-bound on the length of the n-gram when you index
the data, it might be the way to go.
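
A sketch of that model in plain Clojure data, just to show the shape (the node ids, edge map, and index are all mine, and I've made each word occurrence its own node to keep the sketch simple):

(def nodes      {1 {:word "go"} 2 {:word "tell"} 3 {:word "aunt"} 4 {:word "rhodie"}})
(def next-edges {1 2, 2 3, 3 4})
(def word-index {"go" #{1}, "tell" #{2}, "aunt" #{3}, "rhodie" #{4}})

;; Count occurrences of the bigram "tell aunt" by following "next" edges:
(count (for [id    (word-index "tell")
             :let  [nxt (next-edges id)]
             :when (and nxt (= "aunt" (:word (nodes nxt))))]
         id))
;; => 1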

Ray.

Sam Raker

Mar 10, 2015, 4:03:53 PM3/10/15
to clo...@googlegroups.com
That's honestly closer to what I was originally envisioning--I've never really looked into graph dbs before, but I'll check out Neo4j tonight. Do you know whether you can model multiple edges between the same nodes? I'd love to be able to have POS-based wildcarding as a feature, so you could search for e.g. "the ADJ goose", but that's a whole other layer of stuff, so it might go in the "eventually, maybe" pile.

Ray Miller

Mar 10, 2015, 4:27:17 PM3/10/15
to clo...@googlegroups.com
On 10 March 2015 at 20:03, Sam Raker <sam....@gmail.com> wrote:
> That's honestly closer to what I was originally envisioning--I've never
> really looked into graph dbs before, but I'll check out Neo4j tonight. Do
> you know whether you can model multiple edges between the same nodes?

Yes, certainly possible.

If you go for Neo4j you have two options from Clojure: embedded (via
the borneo library or plain Java interop, which means reading Javadoc)
or a stand-alone server with a REST API (via the well-documented
Neocons library from ClojureWerkz). You'll also have to think about
how to model which text (song) each phrase came from - likely another
node type in the graph with a linking edge to the start of the phrase.
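
With the server + Neocons route, a bigram count could look something like this (a sketch only: it assumes a local Neo4j 2.x server, and the :Word label, text property, and :NEXT relationship are illustrative, not anything Neocons defines):

(require '[clojurewerkz.neocons.rest :as nr]
         '[clojurewerkz.neocons.rest.cypher :as cy])

(let [conn (nr/connect "http://localhost:7474/db/data/")]
  (cy/tquery conn
             "MATCH (a:Word {text: {w1}})-[:NEXT]->(b:Word {text: {w2}})
              RETURN count(*) AS n"
             {:w1 "tell" :w2 "aunt"}))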

Great book on Neo4j available for free download, also covers data
modelling: http://neo4j.com/books/

> I'd love to be able to have POS-based wildcarding as a feature, so you could
> search for e.g. "the ADJ goose", but that's a whole other layer of stuff, so
> it might go in the "eventually, maybe" pile.

Sounds like fun, but means doing some natural language processing on
the input texts, which is a much more difficult problem than simply
tokenizing.

Ray.

Sam Raker

Mar 10, 2015, 4:52:53 PM3/10/15
to clo...@googlegroups.com
POS tagging is a solved-enough problem, at least in most domains. Clojure still doesn't have its own NLTK (HINT HINT, GSoC kids!), but I'm sure I can find a Java lib or two that should do the job well enough.
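
For example, clojure-opennlp wraps Apache OpenNLP and makes this pretty painless (a sketch; it assumes you've downloaded the pre-trained English model files, and the exact tags below are illustrative):

(require '[opennlp.nlp :as nlp])

(def tokenize (nlp/make-tokenizer "models/en-token.bin"))
(def pos-tag  (nlp/make-pos-tagger "models/en-pos-maxent.bin"))

;; (pos-tag (tokenize "the old grey goose is dead"))
;; => (["the" "DT"] ["old" "JJ"] ["grey" "JJ"] ["goose" "NN"] ["is" "VBZ"] ["dead" "JJ"])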

Pablo Nussembaum

Mar 11, 2015, 2:34:17 PM3/11/15
to clo...@googlegroups.com
Have you looked at tsearch (http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/)? It comes with Postgres 9.4 and it's really, really powerful and fast.
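
A sketch of what a query looks like with Postgres full-text search (to_tsvector/to_tsquery are built in; the tracks table, lyrics column, and db-spec are illustrative):

(require '[clojure.java.jdbc :as jdbc])

(jdbc/query db-spec
            ["SELECT name FROM tracks
               WHERE to_tsvector('english', lyrics) @@ to_tsquery('english', ?)"
             "aunt & rhodie"])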