More mashup code: extracting a word count and sorting

Michael Harrison (goodmike)

May 8, 2009, 4:38:55 PM
to Clojure Study Group Washington DC
Here's some more mashup code. It's a long email, I realize, but I
wanted to send it out to spur some discussion. I hope you're hungry
for some Clojure.

Building on the simple function for building a data map of RSS item
contents from an RSS feed, I put together some functions for deriving
a word count from some of the contents fields.

First, I imagined a plain-English description of the process I wanted
to code.

1. Combine several RSS builders' outputs into a 'combined-rss'
variable
2. Extract a data map called 'word-counts' of (word => count) pairs
from the title and description values in combined-rss, and weigh
appearances of a word in the title at twice the weight of appearances
in description.
3. Make a sorted-word-count by sorting the 'word-counts' data map
alphabetically by key

Part 1 is easy. Serge and Luke have pointed out that concat will do
all the work for us. It's even lazy:

user> (def cnn-rss (rss-reader "http://rss.cnn.com/rss/cnn_topstories.rss"))
user> (def abc-rss (rss-reader "http://feeds2.feedburner.com/AbcNews_TopStories"))
user> (def combined-rss (concat cnn-rss abc-rss))
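
A quick way to see that laziness, using hypothetical stand-in seqs (with a println side effect) rather than the real feeds:

;; stand-ins for the feed seqs; println shows when elements
;; are actually realized
(def a (map #(do (println "realizing" %) %) [:a1 :a2]))
(def b (map #(do (println "realizing" %) %) [:b1 :b2]))
(def ab (concat a b))  ; nothing printed yet -- concat is lazy
(first ab)             ; realization begins only when we consume ab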

Let's see what we've got:
user> (:title (first combined-rss))
"White House plans to release N.Y. flyover photo"

OK. And the description?
user> (:description (first combined-rss))
"The White House indicated today that a report and a photo from the
controversial low-altitude New York flyover by a 747 plane could be
released soon. Earlier, White House officials had said that there were
no plans to release the photos to the public. Military officials
estimated that the mission and photo shoot, aimed at updating file
photos of Air Force One, cost about $328,835 in taxpayer money.<div
class=\"feedflare\">\n<a href=\"http://rss.cnn.com/~ff/rss/
cnn_topstories?a=olgK3TdEkJI:yROJEaccfUE:yIl2AUoC8zA\"><img src=
[snip]

Uh oh. Those HTML tags are going to mess up our word counting unless
we remove them.


=== Step 1.b. Remove HTML tags from the values of combined-rss ===

I'm too lazy to write bulletproof code for stripping all the HTML out of a string. Here's a quick and dirty way to do it:
(.replaceAll my-string "</?[^>]+(>|$)" "")
Using Java's String#replaceAll method, we can remove all the HTML tags from title and description.

It would also be great to get rid of annoying newlines too:
(.replaceAll my-string "\n" "")

Or, to do both at once in a Clojure-ish way, we could use the ->
("thread") macro:
(-> desc (.replaceAll "</?[^>]+(>|$)" "") (.replaceAll "\n" ""))
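
To see the threading on a throwaway string (the nested form is what -> writes for us, with each result threaded in as the first argument of the next call):

user> (-> "a<b>c</b>\n" (.replaceAll "</?[^>]+(>|$)" "") (.replaceAll "\n" ""))
"ac"
user> ;; equivalent nested form:
user> (.replaceAll (.replaceAll "a<b>c</b>\n" "</?[^>]+(>|$)" "") "\n" "")
"ac"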

Let's do it. First we wrap the cleanup in a helper, remove-html, then use it on each value:

user> (defn remove-html [s]
        (-> s (.replaceAll "</?[^>]+(>|$)" "") (.replaceAll "\n" "")))
user> (defn remove-html-from-values [map]
        (reduce (fn [newmap key]
                  (let [val (key map)]
                    (assoc newmap key
                           (if (nil? val) "" (remove-html val)))))
                {} (keys map)))

The (if (nil? val) ... ) business traps conditions in which values are
nil. Some RSS feeds will produce nil values when parsed.
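
For instance, applied to a small hypothetical item (note the nil description):

user> (remove-html-from-values {:title "<b>Flyover</b> photo" :description nil})
{:title "Flyover photo", :description ""}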

user> (def cleaned-rss (map remove-html-from-values combined-rss))
#'user/cleaned-rss
user> (:description (first cleaned-rss))
"The White House indicated today that a report and a photo from the
controversial low-altitude New York flyover by a 747 plane could be
released soon. Earlier, White House officials had said that there were
no plans to release the photos to the public. Military officials
estimated that the mission and photo shoot, aimed at updating file
photos of Air Force One, cost about $328,835 in taxpayer money."


=== Step 2. Extract a data map called 'word-counts' of (word => count)
pairs from the title and description values in cleaned-rss, and weigh
appearances of a word in the title at twice the weight of appearances
in description. ===

This is a complicated process. I broke down the pieces of the work
into little bites and wrote functions for them. These functions are
non-optimized. I just wanted something to work with.

a. Split a string into a sequence of "words"
user> (defn word-split [str]
        (re-seq #"\w{2,}" str))
user> (word-split "White House plans to release N.Y. flyover photo")
("White" "House" "plans" "to" "release" "flyover" "photo")
(Notice that "N.Y." disappears: the regex keeps only runs of two or more word characters.)

b. Count occurrences of each word in a sequence of words, applying a
multiplier to each occurrence.

user> (defn add-to-word-count [count-map words-seq multiplier]
        (reduce (fn [acc it]
                  (let [count-symbol (symbol (.toLowerCase it))
                        word-count (count-symbol acc)]
                    (assoc acc count-symbol
                           (if word-count (+ multiplier word-count) multiplier))))
                count-map words-seq))

The first item in the let expression, count-symbol, is assigned by making a lowercase copy of the word being examined and turning it into a symbol: strings can be map keys in Clojure, but unlike symbols and keywords they can't be used in function position to look themselves up. The second item, word-count, is the value already in the count-map data map for this key (it may be nil). The assoc function gives us a new copy of the count-map data map with an increased value for the key we've made out of the word (or just the multiplier, if word-count was nil). Reduce lets us perform this operation for every word in words-seq.
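
Here's add-to-word-count on a tiny hand-made word seq with a weight of 2, just to see the shape of the result:

user> (add-to-word-count {} '("White" "House" "white") 2)
{white 4, house 2}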

c. Take a datamap of keys and weights and use it and the counting
functions we've just defined to make a word-counts data map.

Now things got hard. I knew I could easily express the keys and
weights with a data map like {:title 1 :description 2}. I thought of
this as a "template" for the process of splitting and counting. But
how would I use this template in conjunction with the reduction in add-
to-word-count?

I made the problem easier with a function that takes my template and returns a function that, when applied to an RSS contents data map, delivers the arguments I need for add-to-word-count:

user> (defn make-extractor [template]
        (fn [item]
          (map (fn [k] (list (word-split (k item)) (k template)))
               (keys template))))

This is a dense function. It sets up a function that runs through the keys in template, e.g. :title and :description, applies each key (the k variable in the inner fn) to an item such as an RSS contents data map to get the string value for that key, applies word-split to that string, and then pairs the resulting word seq with the value for that key in the template, i.e. a weight.

Here's how it's used:
user> (def extract (make-extractor {:title 2, :description 1}))
user> (def extracted-item (extract (first cleaned-rss)))
user> extracted-item
((("White" "House" "plans" "to" "release" "flyover" "photo") 2)
(("The" "White" "House" "indicated" "today" "that" "report" "and"
"photo" "from" "the" "controversial" "low" "altitude" "New" "York"
"flyover" "by" "747" "plane" ...) 1))

These are word-seqs and weights, ready to go into the add-to-word-count function as arguments 2 and 3.

If we wanted to apply add-to-word-count to a single extracted item,
like the one we just derived, I would use reduce like this:

user> (reduce (fn [word-map text-and-weight]
                (add-to-word-count word-map
                                   (first text-and-weight)
                                   (second text-and-weight)))
              {} extracted-item)
{indicated 1, house 4, new 1, cost 1, air 1, altitude 1, york 1, the
5, file 1 ... }

The reduce function operates over the two lists in extracted-item, one for the title and one for the description. It applies add-to-word-count to the accumulator, word-map (which starts out as an empty data map), and to the first and second parts of each list.

Now, how to combine it into a top-level function that takes the rss
contents seq and a template for arguments and gobbles up the whole
seq?

user> (defn count-words-in-map-values [datamap-seq keys-and-weights]
        (let [extractor (make-extractor keys-and-weights)
              count-item (fn [word-map item]
                           (reduce (fn [word-map text-and-weight]
                                     (add-to-word-count word-map
                                                        (first text-and-weight)
                                                        (second text-and-weight)))
                                   word-map item))]
          (reduce count-item {} (map extractor datamap-seq))))

First, an extractor is set up using make-extractor and the template
(the descriptively-named parameter keys-and-weights). An inner
function, count-item, is defined. It wraps the reduction I just wrote
above with a way to assign a word-map and an extracted item on which
to perform the reduction. This inner function count-item is then used
in another reduce expression on the results of mapping the extractor
to the entire seq of rss contents.

user> (def word-counts (count-words-in-map-values cleaned-rss {:title 2 :description 1}))
user> word-counts
{dark 1, than 1, maine 5, little 1, demand 2, all 2, boy 6, culp 1,
dominate 1, degree 1 ... }

That's a lot of code for an email, but if you have room for a little
bit of dessert (just one mint--it's wafer thin), let's sort the
results.


=== 3. Make a sorted-word-count by sorting the 'word-counts' data map
alphabetically by key ===

The Clojure function sorted-map takes alternating key/value arguments and returns a data map sorted by key (alphabetically, for our symbol keys). This seems like just the thing. However, it's a little complicated to use with an existing data map: the arguments to sorted-map can't be nested. The call has to look like this:
(sorted-map dark 1 than 1 maine 5 little 1 demand 2 all 2 boy 6 culp 1
dominate 1 degree 1)

We can use apply with sorted-map and a list of keys and values, and
apply will construct the call. But how to get a flat list of the
contents of word-counts?
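
Before tackling that, a quick sanity check that apply plus sorted-map behaves as hoped on a small made-up flat list:

user> (apply sorted-map '(than 1 maine 5 dark 1))
{dark 1, maine 5, than 1}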

Well, can we reduce on word-counts? Yes, but then we discover that enumerating a data map yields its entries as [key value] vectors. Seriously.

(reduce conj word-counts)
[dark 1 [than 1] [maine 5] [little 1] [demand 2] [all 2] [boy 6] [culp 1] [dominate 1] ...]

And there's weird nesting: with no initial value, reduce takes the first entry, [dark 1], as its accumulator and conj's the remaining entries onto it, so everything but the first pair stays wrapped. Is there some way to unnest this into a correct flat list?

Well, I wrote my own, but then I googled and discovered that Rich had
cranked out an all-purpose flatten function for seqs:

(defn flatten [x]
  (let [s? #(instance? clojure.lang.Sequential %)]
    (filter (complement s?) (tree-seq s? seq x))))

OK, that is not wafer-thin. This is a big dessert. But if you just
accept that flatten does what you'd expect, then you'll be ready to
use it without having to digest it all right now. :-)
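
On a toy nested vector, for instance:

user> (flatten '[dark 1 [than 1] [maine 5]])
(dark 1 than 1 maine 5)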

(flatten (reduce conj word-counts))
(dark 1 than 1 maine 5 little 1 demand 2 all 2 boy 6 culp 1 dominate
1 ...)

So... then?
user> (defn map->sorted-map [unsorted-map]
        (apply sorted-map (flatten (reduce conj unsorted-map))))
#'user/map->sorted-map
user> (def sorted-word-counts (map->sorted-map word-counts))
user> sorted-word-counts
{000 1, 10 1, 110 1, 20 2, 2005 1, 2009 2, 275 1, 328 1, 46 1, 747 1,
835 1, abduction 2, about 6, abstinence 2, adam 2, ... [snip]

OK, so the numbers are silly: my definition of what a word is needs
tweaking. But I've got a sorted map of word counts.
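
An aside I noticed afterward: there's a shorter route that skips the flattening entirely, since into pours a map's [key value] entries straight onto an empty sorted map. Sketched here as an alternative, not what I ran above:

user> (defn map->sorted-map-2 [unsorted-map]
        (into (sorted-map) unsorted-map))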

Next up, supplying this information to a client-side tag cloud,
possibly via JSON. Anyone got a JSON outputter?

Michael