How can I sort Dict efficiently?


Michiaki Ariga

Dec 5, 2014, 9:57:28 AM
to julia...@googlegroups.com
Hi,
I have a question about Dict().

I found there is no method such as sort_by() after v0.3.
But I want to count word frequencies with a Dict() and sort by value to find the most frequent words.
So, how can I sort a Dict efficiently?

Of course, I know how to sort using a DataFrame, like the following:

```
counts = Dict{String, Int64}("apple" => 100, "town" => 250, "space" => 24)
df = DataFrame(word = collect(keys(counts)), count = collect(values(counts)))
sort(df, cols = [:count], rev = true)
```
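(For comparison, the Dict can also be sorted by value without DataFrames, by collecting its key/value pairs first. A minimal sketch, assuming a recent Julia where collecting a Dict yields a Vector of Pairs:)

```julia
counts = Dict("apple" => 100, "town" => 250, "space" => 24)

# Collect the Dict into a Vector of Pairs and sort by the value (the last
# element of each Pair), descending.
sorted = sort(collect(counts), by = last, rev = true)

sorted[1]  # the most frequent word and its count: "town" => 250
```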

I think it would be natural for `convert(DataFrame, dict)` to return an Nx2 DataFrame instead of a 1xN one.

Thanks,

---
Michiaki Ariga

Steven G. Johnson

Dec 5, 2014, 9:23:58 PM
to julia...@googlegroups.com


On Friday, December 5, 2014 9:57:28 AM UTC-5, Michiaki Ariga wrote:
I found there is no method such as sort_by() after v0.3.
But I want to count word frequencies with a Dict() and sort by value to find the most frequent words.
So, how can I sort a Dict efficiently?

You may want to use a different data structure. For example, you can store word frequencies in a PriorityQueue and then pull out the most frequent word with peek or dequeue. See:

http://julia.readthedocs.org/en/latest/stdlib/collections/

(A PriorityQueue lets you quickly fetch the smallest value, whereas you want the largest frequency, but you can work around this by just storing frequency * -1.)
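(The negative-frequency trick might look like the following sketch, using the PriorityQueue from DataStructures.jl; the enqueue!/dequeue!/peek names assume a reasonably recent version of that package:)

```julia
using DataStructures

counts = Dict("apple" => 100, "town" => 250, "space" => 24)

# Store -frequency as the priority, so the "smallest" priority
# corresponds to the most frequent word.
pq = PriorityQueue{String,Int}()
for (word, c) in counts
    enqueue!(pq, word, -c)
end

peek(pq)      # most frequent word and its (negated) count, without removing it
dequeue!(pq)  # removes and returns the most frequent word ("town")
dequeue!(pq)  # then the next most frequent ("apple")
```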

If you need all of the values in order, you can instead use an OrderedDict from https://github.com/JuliaLang/DataStructures.jl

Michiaki ARIGA

Dec 7, 2014, 8:45:12 AM
to julia...@googlegroups.com
I'm sorry that my example did not explain well what I want to do.

I tried to count words and get the top N most frequent words, and I referred to the following example.


I don't think it should return a Dict for the top N words, so I think a DataFrame is good for getting the top N words using head(). But I wonder if a DataFrame isn't suitable, because a DataFrame converted from a Dict is not in a shape that can be sorted by the Dict's values.

Jeff Waller

Dec 8, 2014, 3:50:16 AM
to julia...@googlegroups.com
This can be done in O(N). Avoid sorting, as it will be O(N log N).

Stefan Karpinski

Dec 8, 2014, 7:21:33 AM
to Julia Users
We have a select function as part of Base, which can do O(n) selection of the top n:

```
julia> v = randn(10^7);

julia> let w = copy(v); @time sort!(w)[1:1000]; end;
elapsed time: 0.882989281 seconds (8168 bytes allocated)

julia> let w = copy(v); @time select!(w,1:1000); end;
elapsed time: 0.054981192 seconds (8192 bytes allocated)
```

So for large arrays, this is substantially faster.
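(In later Julia versions, select/select! were renamed to partialsort/partialsort!. A sketch of the same top-N pattern on Julia ≥ 1.0, using rev = true to get the N largest values:)

```julia
v = randn(10^7)

# partialsort! places the 1000 largest elements, in descending order, at the
# front of the array in O(n) expected time, without fully sorting it.
top1000 = partialsort!(copy(v), 1:1000, rev = true)
```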

Michiaki ARIGA

Dec 14, 2014, 9:13:18 PM
to Julia Users

Of course, I know Julia works fast with Arrays.
But in natural language processing or text analysis, we often count word frequencies to create a dictionary. We usually store word frequencies in something like a Dict, and we always cut off infrequent words (those whose frequency is under a threshold) to exclude noisy words. So I want to remove keys whose values meet some condition.

Finally, I found John Myles White's implementation for creating n-grams, so I will refer to it.

https://github.com/johnmyleswhite/TextAnalysis.jl/blob/master/src/ngramizer.jl

Todd Leo

Dec 14, 2014, 10:15:46 PM
to julia...@googlegroups.com
Is there a partial-sort equivalent of sortperm!? Presumably selectperm!?

Stefan Karpinski

Dec 15, 2014, 2:38:11 PM
to Julia Users
There is not, but if I recall, there may be an open issue about this functionality.
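(For reference, this functionality did land in later Julia versions as partialsortperm/partialsortperm!. A sketch assuming Julia ≥ 1.0:)

```julia
v = [10, 40, 20, 30]

# Indices of the 2 largest elements, without computing a full sortperm.
idx = partialsortperm(v, 1:2, rev = true)

collect(idx)  # indices of the two largest values
v[idx]        # the two largest values themselves, in descending order
```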

Stefan Karpinski

Dec 15, 2014, 2:40:04 PM
to Julia Users
I'm not sure how this technique applies any less to strings than to numbers. Can you clarify what your concern is?

Pontus Stenetorp

Dec 15, 2014, 10:55:48 PM
to julia...@googlegroups.com
On 16 December 2014 at 04:39, Stefan Karpinski <ste...@karpinski.org> wrote:
>
> I'm not sure how this technique applies any less to strings than to numbers.
> Can you clarify what your concern is?

I am suspecting that what he wants to do is to limit the size of a
vocabulary (set of tokens) by imposing an occurrence threshold. This
is fairly standard and what you really want would be a sorted
dictionary. However, since this is usually just a pre-processing step
I don't even bother optimising and just do something like:

```
UNKFORM = "<UNK>"
unkcutoff = 1
tokoccs = Dict{UTF8String,Int}()
for tok in data
    tokoccs[tok] = get(tokoccs, tok, 0) + 1
end
vocab = Set{UTF8String}()
unks = Set{UTF8String}()
push!(vocab, UNKFORM)
for (tok, occs) in tokoccs
    if occs > unkcutoff
        push!(vocab, tok)
    else
        push!(unks, tok)
    end
end
```

This is O(n) + O(n) which should be fine unless you want to focus on
optimising this specific region of code.

Pontus

Todd Leo

Dec 16, 2014, 1:30:56 AM
to julia...@googlegroups.com
Could you provide any clue to help me locate the issue? I'm willing to make a PR, but I am unable to find the related issue.

Michiaki ARIGA

Dec 16, 2014, 3:30:58 AM
to julia...@googlegroups.com
Thanks for Pontus's kind explanation; he answered what I wanted to know.
I want to know the standard way to create a dictionary (that is, a set of words, for ASR or NLP).

To create a dictionary for speech recognition or other NLP tasks, we often control the size of the vocabulary. There are two ways to limit it: one is to cut words under a frequency threshold, as Pontus showed, and the other is to pick the top N most frequent words (n-gram toolkits such as IRSTLM support this, and it is a popular way to control the required memory size). If I want to pick the most frequent words, I think I'll use a DataFrame.
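(The top-N case can also be done directly on the Dict's pairs, without a DataFrame. A sketch; partialsort with by/rev keywords assumes Julia ≥ 1.0:)

```julia
counts = Dict("apple" => 100, "town" => 250, "space" => 24, "tree" => 5)

# The 2 most frequent words, in descending order of count, in O(n)
# expected time rather than a full O(n log n) sort.
topn = partialsort(collect(counts), 1:2, by = last, rev = true)
```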

John Myles White

Dec 16, 2014, 8:53:34 AM
to julia...@googlegroups.com
If you want to retain N words, perhaps a priority queue would be useful?


I'd be cautious about drawing many coding lessons from the TextAnalysis package, which has never been optimized for performance.

 -- John
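(John's suggestion could be sketched as a size-capped priority queue: keep at most N entries in a min-queue and evict the least frequent word whenever the cap is exceeded. A sketch assuming DataStructures.jl:)

```julia
using DataStructures

counts = Dict("apple" => 100, "town" => 250, "space" => 24, "tree" => 5)
N = 2

# With the default ordering, dequeue! removes the entry with the SMALLEST
# priority, i.e. the least frequent word currently kept.
pq = PriorityQueue{String,Int}()
for (word, c) in counts
    enqueue!(pq, word, c)
    if length(pq) > N
        dequeue!(pq)  # evict the least frequent of the N+1 candidates
    end
end

collect(keys(pq))  # the top-N words (in no particular order)
```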

Michiaki ARIGA

Dec 16, 2014, 11:49:15 PM
to julia...@googlegroups.com
Thanks for your answer, John.
I'll use a PriorityQueue to get the top N words, and if I want to cut off by a threshold, I'll create the dictionary with a Dict.

---
Michiaki