Ordering attributes

argherna

unread,

Jan 15, 2009, 10:47:36 PM1/15/09

to Cernunnos Discussion

I've recently begun a new round of scripts for our deployment
environment and I decided that when I looked at the script's output,
it would be nice to see the attributes in some kind of order. I have
opened issue 71 (http://code.google.com/p/cernunnos/issues/detail?
id=71) and attached a patch to it. The patch does not break backwards
compatibility (it's small). However, I haven't really tested the
performance of sorting the attributes. Drew, could you look it over
and see what you think? Are there any drawbacks to this patch?

Eric Dalquist

unread,

Jan 15, 2009, 11:00:34 PM1/15/09

to cernunnos-...@googlegroups.com

Since it is only done if INFO logging is enabled the cost is probably
minimal. I think another viable approach would be to just replace the
instances of HashMap with TreeMap in RuntimeRequestResponse.

HashMap is O(1) cost for puts and gets, TreeMap is O(log n) for those
operations, with the fairly small (only a few hundred entries) size of
the attributes map in CRN that shouldn't change performance much at all
and would amortize the minor cost increase over each map.put operation.

-Eric

> --~--~---------~--~----~------------~-------~--~----~
> You received this message because you are subscribed to the Google Groups "Cernunnos Discussion" group.
> To post to this group, send email to cernunnos-...@googlegroups.com
> To unsubscribe from this group, send email to cernunnos-discus...@googlegroups.com
> For more options, visit this group at http://groups.google.com/group/cernunnos-discussion?hl=en
> -~----------~----~----~----~------~----~------~--~---
>
>

argherna

unread,

Jan 15, 2009, 11:25:12 PM1/15/09

to Cernunnos Discussion

On Jan 15, 10:00 pm, Eric Dalquist <eric.dalqu...@doit.wisc.edu>
wrote:

> Since it is only done if INFO logging is enabled the cost is probably
> minimal. I think another viable approach would be to just replace the
> instances of HashMap with TreeMap in RuntimeRequestResponse.
>

I thought about that, but I was worried about affecting the rest of
the execution of the script. My change is for display purposes only
for the script "preamble". The SortedSet is constructed from the
keySet of the mergedAttributes HashMap if the log statement executes.
Other than that small amount of sorting to save me a visual headache,
the remaining execution should not have to change anything about how
it works today.

> HashMap is O(1) cost for puts and gets, TreeMap is O(log n) for those
> operations, with the fairly small (only a few hundred entries) size of
> the attributes map in CRN that shouldn't change performance much at all
> and would amortize the minor cost increase over each map.put operation.
>

Yeah, that's true. But I am just trying to make the preamble easier
to read. I think changing mergedAttributes to a TreeMap isn't
necessary. That said, I agree it wouldn't impact execution noticeably
for the small amount of parameters it would contain.

> -Eric
>
> argherna wrote:
> > I've recently begun a new round of scripts for our deployment
> > environment and I decided that when I looked at the script's output,
> > it would be nice to see the attributes in some kind of order. I have
> > opened issue 71 (http://code.google.com/p/cernunnos/issues/detail?
> > id=71) and attached a patch to it. The patch does not break backwards
> > compatibility (it's small). However, I haven't really tested the
> > performance of sorting the attributes. Drew, could you look it over
> > and see what you think? Are there any drawbacks to this patch?
> > >
>
>

> smime.p7s
> 4KViewDownload

Andrew Wills

unread,

Jan 16, 2009, 4:55:13 PM1/16/09

to cernunnos-...@googlegroups.com

Andy,

I think this is a helpful enhancement, and that you should go ahead and commit.

I agree with you that, since (at this point) we're only interested in
sorting them for an instant within the process, we can avoid the
additional overhead of a TreeMap by sorting them only if/when we need
it.

But I also agree with Eric that this overhead is pretty insignificant.
Should we need the attributes sorted for some other purpose down the
road, maybe we should bring the TreeMap idea back to the table.

Cheers,

drew wills

Eric Dalquist

unread,

Jan 16, 2009, 5:51:42 PM1/16/09

to cernunnos-...@googlegroups.com

Thinking about this a bit more doing the sort once per log call is much more expensive than just having these maps be TreeMaps. If you switch the top level maps to TreeMap it costs O(log n) to insert or access each entry instead of O(1) which isn't a big deal since N will likely be very small for everything CRN does. If you do the sort at each log call it will cost you O(n log n) every time you log the data, which will add up over the run of a script.

The only other thing to be careful with either approach is if the keys do not all implement Comparable an exception will be thrown.

-Eric

Andrew Wills wrote:

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Cernunnos Discussion" group.
To post to this group, send email to cernunnos-...@googlegroups.com

Eric Dalquist

unread,

Jan 16, 2009, 7:56:32 PM1/16/09

to cernunnos-...@googlegroups.com

Ah but what about the uPortal import/export where subscripts are used
everywhere and called thousands of times.

On that note I have a few areas I'd like to see the logging switched
from INFO to DEBUG as the messages are very verbose or are not INFO
level info for end users.

-Eric

wills...@gmail.com wrote:
> > Thinking about this a bit more doing the sort once per log call is much
> > more expensive than just having these maps be TreeMaps.
>

> But they're only logged once, in the ScriptRunner, which happens one
> time per invocation of a CRN XML document.
>
> It might be a good idea to move the sorting code itself into a private
> method w/in ScriptRunner, to prevent it from being used unwittingly
> elsewhere.

>
> >
> > The only other thing to be careful with either approach is if the keys
> > do not all implement Comparable an exception will be thrown.
> >
>

> Ah sure. In our case, the keys are all Strings per the API.
>
> drew wills

Andy Gherna

unread,

Jan 16, 2009, 9:05:29 PM1/16/09

to cernunnos-...@googlegroups.com

On Fri, Jan 16, 2009 at 6:56 PM, Eric Dalquist <eric.d...@doit.wisc.edu> wrote:

Ah but what about the uPortal import/export where subscripts are used everywhere and called thousands of times.

By switching mergedAttributes from a HashMap implementation to a TreeMap implementation (or LinkedHashMap or any other implementation where the sort order of the keys is guaranteed) the keys are (re-)sorted every time you add an attribute to it. It wouldn't be noticeable for little jobs but for an export or import of thousands of records, it could be significant.

On that note I have a few areas I'd like to see the logging switched from INFO to DEBUG as the messages are very verbose or are not INFO level info for end users.

I agree. For example, if the preamble was logged at DEBUG rather than INFO, then the patch I submitted is fine. Furthermore, large jobs with lots of script invocations execute without any noticeable overhead (maybe they would even run faster since the log statements may not even be ran at all).

-Eric

wills...@gmail.com wrote:

> Thinking about this a bit more doing the sort once per log call is much
> more expensive than just having these maps be TreeMaps.

But they're only logged once, in the ScriptRunner, which happens one time per invocation of a CRN XML document.

Right.

It might be a good idea to move the sorting code itself into a private method w/in ScriptRunner, to prevent it from being used unwittingly elsewhere.

That's why my patch scoped the method as package private. It felt like the "right" place to put it but it certainly does not feel "right" for that method to be public. I'm not intimately familiar with the runtime package and where else something like that would be useful so busting it down to a private method in ScriptRunner is reasonable (even in-lining it in the logging statement before the iterating through the keys is fine too).

argherna

unread,

Jan 17, 2009, 9:59:39 AM1/17/09

to Cernunnos Discussion

Committed.

On Jan 16, 3:55 pm, "Andrew Wills" <wills.d...@gmail.com> wrote:
> Andy,
>
> I think this is a helpful enhancement, and that you should go ahead and commit.
>
> I agree with you that, since (at this point) we're only interested in
> sorting them for an instant within the process, we can avoid the
> additional overhead of a TreeMap by sorting them only if/when we need
> it.
>
> But I also agree with Eric that this overhead is pretty insignificant.
> Should we need the attributes sorted for some other purpose down the
> road, maybe we should bring the TreeMap idea back to the table.
>
> Cheers,
>
> drew wills
>

Andy Gherna

unread,

Jan 17, 2009, 10:03:30 AM1/17/09

to Cernunnos Discussion

And for free, I accidentally committed files related to Issue 53 (http://code.google.com/p/cernunnos/issues/detail?id=53). This obviously is not related to this patch. Should they be removed?

Output from SVN attached.

crn-rev441.txt

Eric Dalquist

unread,

Jan 17, 2009, 1:57:05 PM1/17/09

to cernunnos-...@googlegroups.com

Andy Gherna wrote:

On Fri, Jan 16, 2009 at 6:56 PM, Eric Dalquist <eric.d...@doit.wisc.edu> wrote:

Ah but what about the uPortal import/export where subscripts are used everywhere and called thousands of times.

By switching mergedAttributes from a HashMap implementation to a TreeMap implementation (or LinkedHashMap or any other implementation where the sort order of the keys is guaranteed) the keys are (re-)sorted every time you add an attribute to it. It wouldn't be noticeable for little jobs but for an export or import of thousands of records, it could be significant.

The Maps are binary trees, they are not re-sorted every time you add an attribute, the attribute is placed in the correct order into the map in average log(n) time. The difference in cost is put/get of an entry to a HashMap is a linear cost, it always takes X instructions to figure out where to put or get the entry. A TreeMap is a binary tree, each put or get costs log2(n), where N is the number of entries in the Map, time to put or get an entry in the map.

A full sort, which the patch does on each log call costs N*log2(N) since you have to loop over every key (N) and add it to the tree (log2(N)).

In the case of the attribute Map sizes we have (only a few hundred elements) the log2(N) may actually be faster than the linear time work as calculating the hash position is fairly complex.

Just advocating that switching the attributes map to a TreeMap would solve the problem and likely have other benefits down the road with no cost most of the time and a speed up when using lots of subscripts and logging the start stanza.

On that note I have a few areas I'd like to see the logging switched from INFO to DEBUG as the messages are very verbose or are not INFO level info for end users.

I agree. For example, if the preamble was logged at DEBUG rather than INFO, then the patch I submitted is fine. Furthermore, large jobs with lots of script invocations execute without any noticeable overhead (maybe they would even run faster since the log statements may not even be ran at all).

-Eric

wills...@gmail.com wrote:

> Thinking about this a bit more doing the sort once per log call is much
> more expensive than just having these maps be TreeMaps.

But they're only logged once, in the ScriptRunner, which happens one time per invocation of a CRN XML document.

Right.

It might be a good idea to move the sorting code itself into a private method w/in ScriptRunner, to prevent it from being used unwittingly elsewhere.

That's why my patch scoped the method as package private. It felt like the "right" place to put it but it certainly does not feel "right" for that method to be public. I'm not intimately familiar with the runtime package and where else something like that would be useful so busting it down to a private method in ScriptRunner is reasonable (even in-lining it in the logging statement before the iterating through the keys is fine too).

>
> The only other thing to be careful with either approach is if the keys
> do not all implement Comparable an exception will be thrown.
>

Ah sure. In our case, the keys are all Strings per the API.

drew wills

Andy Gherna

unread,

Jan 17, 2009, 3:34:07 PM1/17/09

to cernunnos-...@googlegroups.com

On Sat, Jan 17, 2009 at 12:57 PM, Eric Dalquist
<eric.d...@doit.wisc.edu> wrote:
>
> The Maps are binary trees, they are not re-sorted every time you add an attribute, the attribute is placed in the correct order into the map in average log(n) time. The difference in cost is put/get of an entry to a HashMap is a linear cost, it always takes X instructions to figure out where to put or get the entry. A TreeMap is a binary tree, each put or get costs log2(n), where N is the number of entries in the Map, time to put or get an entry in the map.

Clearly there are some performance advantages here. But I was under
the impression that HashMaps are generally faster. From the API:
"This implementation provides constant-time performance for the basic
operations (get and put), assuming the hash function disperses the
elements properly among the buckets." Obviously the hashCode methods
for the element types need to be written carefully. String's hashCode
function would be as close to a guarantee that this happened. With
the small number of elements in the attribute maps, the difference in
implementations is negligible.

>
> A full sort, which the patch does on each log call costs N*log2(N) since you have to loop over every key (N) and add it to the tree (log2(N)).

OK.

>
> In the case of the attribute Map sizes we have (only a few hundred elements) the log2(N) may actually be faster than the linear time work as calculating the hash position is fairly complex.

It would actually be quite a bit faster (I'm a mathematician more than
a data structure expert). The curve for a log function is more
constant (shallower) over time.

>
> Just advocating that switching the attributes map to a TreeMap would solve the problem and likely have other benefits down the road with no cost most of the time and a speed up when using lots of subscripts and logging the start stanza.

>> On that note I have a few areas I'd like to see the logging switched from INFO to DEBUG as the messages are very verbose or are not INFO level info for end users.

Sure that would be fine too. Also though, if the preamble were logged
at DEBUG level (which is probably more appropriate as you hinted at
earlier) that would also work too.

Reply all

Reply to author

Forward