Best way in Julia to build a set of unique values?

1,163 views
Skip to first unread message

Scott Jones

unread,
May 18, 2015, 4:46:28 PM5/18/15
to julia...@googlegroups.com
I would like to be able to do the following in Julia:
Take a UInt64 (or UInt128, for that matter), and add it to an array, if it is not already present, returning the index.
(This would be trivial in the language I used to work on, and I think it probably is in Julia as well, but I haven't found the right data structure yet...)
What would be the best performing way of handling that?
What if, instead of an UInt64 or UInt128, I had an array of bytes (like 128 or 256)?  What would be the best way for that?

Thanks, Scott

Jameson Nash

unread,
May 18, 2015, 4:53:02 PM5/18/15
to julia...@googlegroups.com

Scott Jones

unread,
May 18, 2015, 5:24:25 PM5/18/15
to julia...@googlegroups.com
I suppose I could use a set simply to determine if it was present or not, and then push! to another array if not present... just didn't seem as efficient as what I'm used to...

Yichao Yu

unread,
May 18, 2015, 5:31:18 PM5/18/15
to Julia Users
On Mon, May 18, 2015 at 5:24 PM, Scott Jones <scott.pa...@gmail.com> wrote:
> I suppose I could use a set simply to determine if it was present or not,
> and then push! to another array if not present... just didn't seem as
> efficient as what I'm used to...

Why does it have to be an array? And what data structure do you use in
other languages?

Tim Holy

unread,
May 18, 2015, 5:34:39 PM5/18/15
to julia...@googlegroups.com
If you really want to get the index back, perhaps better might be a dict:

counter = 0
d = Dict{KeyType,Int}()
for item in list
idx = get(d, item, counter+1)
if idx > counter
counter += 1
end
# Do whatever you plan to do with idx
end

--Tim

Scott Jones

unread,
May 18, 2015, 5:37:06 PM5/18/15
to julia...@googlegroups.com
How well do Dict’s perform currently in Julia?  I hope pretty well!  Thanks, that’s more what I needed…

-Scott

Scott Jones

unread,
May 18, 2015, 5:47:07 PM5/18/15
to julia...@googlegroups.com


On Monday, May 18, 2015 at 5:31:18 PM UTC-4, Yichao Yu wrote:
On Mon, May 18, 2015 at 5:24 PM, Scott Jones <scott.pa...@gmail.com> wrote:
> I suppose I could use a set simply to determine if it was present or not,
> and then push! to another array if not present... just didn't seem as
> efficient as what I'm used to...

Why does it have to be an array? And what data structure do you use in
other languages?

In COS (CachéObjectScript... i.e. 21st century Mumps), I would have written something like:

If '$data(arr(key),index) { // If the array already has the key, it is set into the variable index, otherwise:
    Set arr(key) = $Increment(index) // atomically increment index, and set the key to that value in the array
}
// Do whatever I want with the index (usually I also set up a reverse mapping, i.e.:
Set index(index) = key

The nice thing in COS, all I have to do to make that persistent, and shared across a system (or also distributed to other systems), is add a "^" before the arrays or values...
`^arr(key)` and `$increment(^index)` and `^index(^index)`.
Very very fast too...

Scott Jones

unread,
May 18, 2015, 5:53:24 PM5/18/15
to julia...@googlegroups.com
Oops... that should have been:
If '$data(arr(key),index) { // If the array already has the key, it is set into the variable index, otherwise:
    Set index = $Increment(arr) // atomically increment index, and set the key to that value in the array
    Set arr(key) = index
    Set index(index) = key  // Set up reverse index

Tim Holy

unread,
May 18, 2015, 6:05:55 PM5/18/15
to julia...@googlegroups.com
As long as the dict is concretely typed, it should be pretty good. (I can't
comment about string keys, but you probably could! I think Symbols are more
performant than strings for indexing.)

Also, small correction: I should have used get! instead of get. You want to
set that new value in the Dict.

--Tim

Steven G. Johnson

unread,
May 18, 2015, 6:07:54 PM5/18/15
to julia...@googlegroups.com
Scott, this looks pretty much exactly like what Tim's example does: you have a dictionary (aka associative array, aka mapping, depending on your terminology) mapping keys to a counter.

Dicts are reasonably fast in Julia, although they could certainly be further optimized (like almost anything).

Jacob Quinn

unread,
May 18, 2015, 6:26:06 PM5/18/15
to julia...@googlegroups.com
You could also take a look at JudyDicts.jl, which wrap the corresponding C library. Supposedly, it's one of the most highly optimized Dict implementations anywhere. I think the Julia package may need an update, however.


-Jacob

Steven G. Johnson

unread,
May 18, 2015, 7:16:20 PM5/18/15
to julia...@googlegroups.com, quinn....@gmail.com


On Monday, May 18, 2015 at 6:26:06 PM UTC-4, Jacob Quinn wrote:
You could also take a look at JudyDicts.jl, which wrap the corresponding C library. Supposedly, it's one of the most highly optimized Dict implementations anywhere. I think the Julia package may need an update, however.


The benchmark numbers on that page make the Julia Dict look pretty darn good.

Scott Jones

unread,
May 18, 2015, 9:38:40 PM5/18/15
to julia...@googlegroups.com
Thanks everybody, Dicts are working fine now, for my tests with UInt64 and UInt128s as keys... (I haven't benchmarked it against my old stuff, I'll have to do that at some point ;-) )
What I'm doing right now isn't really performance critical (I, of course, am obsessively interested in any performance topic!), I am testing building different data structures for handling
the tasks that utf8proc is used for right now... that won't take 1.5MB, and will be much faster... (in the database world, a lot of what I'm doing is changing it from a row store to a column store...).
I don't know how much time in compiling Julia is spent on seeing if a character is a start identifier character or a plain identifier character... but this should speed those checks up quite a bit.
BTW: Jacob & Steven, you are both very evil, pointing me at another need performance thing to get obsessed about!  Now I'll start thinking about Judy as well as Julia... (Judy is LGPL, which is OK
for making a wrapper, but I don't think that would allow me to rewrite it in Julia, would it?)

Isaiah Norton

unread,
May 19, 2015, 8:14:04 AM5/19/15
to julia...@googlegroups.com
Judy is LGPL, which is OK for making a wrapper, but I don't think that would allow me to rewrite it in Julia, would it?

Translations are fine but must be released under the same license. Clean-room algorithm implementations are not subject to that restriction.

Scott Jones

unread,
May 19, 2015, 8:25:03 AM5/19/15
to julia...@googlegroups.com
So, in other words, I'd better not look at the source code to Judy, if I wanted to write a Julia version that could be under the MIT license... (I haven't yet, and maybe I won't now, although I'm deadly curious as to what they are doing!)

David Gold

unread,
May 19, 2015, 9:30:48 AM5/19/15
to julia...@googlegroups.com
Perhaps I'm being naive, but is there anything wrong with findfirst? http://docs.julialang.org/en/release-0.3/stdlib/arrays/#Base.findfirst

function findoradd(A::Array{Int64, 1}, value::Int64)
   
if findfirst(A, value) > 0
       
return findfirst(A, value)
   
else
        push
!(A, value)
       
return length(A)
   
end
end


Though I don't know how well the above would work for arrays of bytes, or how it performs compared to a dictionary.

Scott Jones

unread,
May 19, 2015, 9:51:32 AM5/19/15
to julia...@googlegroups.com
Well, doing a linear search, when inserting n items, isn't that going to be O(n^2)?
I would hope that using a dictionary would give me something more like O(n log n)...

Steven G. Johnson

unread,
May 19, 2015, 10:35:05 AM5/19/15
to julia...@googlegroups.com


On Monday, May 18, 2015 at 9:38:40 PM UTC-4, Scott Jones wrote:
I don't know how much time in compiling Julia is spent on seeing if a character is a start identifier character or a plain identifier character... but this should speed those checks up quite a bit.

I strongly suspect that this is a negliglible fraction of the parsing time, which itself is a small fraction of the compilation time IIRC.

Have you heard the phrase "premature optimization is the root of all evil?"

Scott Jones

unread,
May 19, 2015, 11:40:26 AM5/19/15
to julia...@googlegroups.com
Yes - but this is just one of the optimizations, and is also related to making Julia no longer dependent on utf8proc.c and it's 1.5MB of data... (much better for the Rasperry Pi, wouldn't you think?)

I haven't seen any profiles yet of Julia compilation, I'd be interested in seeing where most of the time is spent... I figure you are probably correct, but the more important thing here is more reducing
all of the bloat...

And yes, I know that quote quite well... but I've been optimizing things for 40 years now... I *think* I have some idea what might be useful to optimize or not (whether for space or speed)...

-Scott
Reply all
Reply to author
Forward
0 new messages