Sorting a very large number of objects

Shevek

Nov 9, 2018, 10:08:23 AM
to mechanical-sympathy
Hi,

I'm trying to sort/merge a very large number of objects in Java, and
failing more spectacularly than normal. The way I'm doing it is this:

* Read a bunch of objects into an array.
* Sort the array, then merge neighbouring objects as appropriate.
* Re-fill the array, re-sort, re-merge until compaction is "not very
successful".
* Dump the array to file, repeat for next array.
* Then stream all files through a final merge/combine phase.

This is failing largely because I have no idea how large to make the
array. Estimating the ongoing size using something like JAMM is too
slow, and my hand-rolled memory estimator is too unreliable.

The thing that seems to be working best is messing around with the array
size in order to keep some concept of runtime.maxMemory() -
runtime.totalMemory() + runtime.freeMemory() within a useful bound.
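
Concretely, the check amounts to something like this (just a sketch; the 25%
threshold is an arbitrary illustration, not a tuned value):

```java
Runtime rt = Runtime.getRuntime();
// Heap the JVM may still grow into, plus free space in the already-committed heap.
long headroom = rt.maxMemory() - rt.totalMemory() + rt.freeMemory();
// Only grow the batch array while headroom stays above some fraction of max heap.
boolean mayGrow = headroom > rt.maxMemory() / 4;
```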

But there must be a better solution. I can't quite think of a way around
this with SoftReference, because I need to dump the data to disk when the
reference gets broken, and that's defeating me right now.

Other alternatives would include keeping all my in-memory data
structures in serialized form, and paying the ser/deser cost to compare,
but that's expensive - my main overhead right now is gc. Serialization
is protobuf, although that's changeable, since it's annoying the hell
out of me (please don't say thrift - but protobuf appears to have no way
to read from a stream into a reusable object - it has to allocate the
world every single time).

Issues:
* This routine is not the sole tenant of the JVM. Other things use RAM.
* This has to be deployed and work on systems whose memory config is
unknown to me.

Can anybody please give me pointers?

S.

Georges Gomes

Nov 9, 2018, 10:52:56 AM
to mechanica...@googlegroups.com
Hi

That's a nice problem :)

I would probably use a serialised form that allows accessing the data without deserialising, like SBE. This way there is no extra allocation, just the byte stream.

Then I would maybe try to divide and conquer: shard objects into buckets. For example, with strings: a bucket for those starting with A, a bucket for those starting with B, etc.
Then you can sort inside the buckets (in parallel?) and finish by concatenating the buckets.
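
As a rough sketch (string keys assumed, non-empty; the bucketing and types are purely illustrative):

```java
import java.util.*;
import java.util.stream.*;

class BucketSort {
    // Shard records into buckets by first character, sort each bucket
    // (possibly in parallel), then concatenate the buckets in key order.
    static List<String> shardSort(List<String> records) {
        Map<Character, List<String>> buckets = new TreeMap<>();
        for (String r : records) {
            buckets.computeIfAbsent(r.charAt(0), c -> new ArrayList<>()).add(r);
        }
        buckets.values().parallelStream().forEach(Collections::sort);
        return buckets.values().stream().flatMap(List::stream).collect(Collectors.toList());
    }
}
```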

My 2 cents
Good luck
Georges




Alex Kashchenko

Nov 9, 2018, 3:30:39 PM
to mechanica...@googlegroups.com, Shevek
Hi,
It may help to move all data processing off-heap. With unsafe-tools [1]
you can load the data into an OffHeapStructArrayList, sort it with
OffHeapStructSorter (DualPivotQuicksort), and do a binary search over the
sorted array if needed.

As mentioned in a neighbouring comment, you also need a serialization
format that allows accessing the data without deserialising (some kind
of packed structs).

On GC pressure: it may be better to avoid heap allocation in all
processing routines that are called for every array element.


[1]
https://groups.google.com/d/msg/mechanical-sympathy/_miYYok2Rrg/vRVerQqFCgAJ

--
-Alex

Gil Tene

Nov 10, 2018, 11:51:17 AM
to mechanica...@googlegroups.com
In general, whenever I see "my overhead is gc" and "unknown memory size" together, I see it as a sign of someone pushing heap utilization high and getting into the inefficient GC state. Simplistically, you should be able to drop the GC cost to an arbitrary % of overall computation cost by increasing the amount (or relative portion) of empty heap in your setup. So GC should never be "a bottleneck" from a throughput point of view unless you have constraints (such as a minimum required live set and a maximum possible heap size) that force you towards a high utilization of the heap (in terms of LiveSet/HeapSize). The answer to such a situation is generally "get some more RAM for this problem" rather than "put in tons of work to fit this in".

For most GC algorithms, or at least for the more costly parts of such algorithms, GC efficiency is roughly linear in EmptyHeap/LiveSet. Stated otherwise, GC cost grows with LiveSet/EmptyHeap or LiveSet/(HeapSize-LiveSet). As you grow the amount you try to cram into a heap of a given size, you increase the GC cost to the square of your cramming efforts. And for every doubling of the empty heap [for a given live set] you will generally halve the GC cost.
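
As a toy illustration of that relationship (numbers made up; only the ratios matter):

```java
// Toy model: relative GC cost ~ LiveSet / (HeapSize - LiveSet).
static double relativeGcCost(double liveSetGb, double heapGb) {
    return liveSetGb / (heapGb - liveSetGb);
}
// With an 8 GB live set: a 10 GB heap gives 8/2 = 4.0,
//                        a 12 GB heap gives 8/4 = 2.0,
//                        a 16 GB heap gives 8/8 = 1.0
// -- each doubling of the empty heap roughly halves the relative cost.
```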

This should [hopefully] make it obvious why using SoftReferences is a generally terrible idea.
 

> Issues:
> * This routine is not the sole tenant of the JVM. Other things use RAM.
You can try to establish what an "efficient enough" heap utilization level is for your use case (a level that keeps overall GC work as a % of CPU spend to e.g. below 10%), and keep your heap use to a related fraction of whatever heap size you get to have on the system you land on.

Shevek

Nov 12, 2018, 12:53:45 AM
to mechanica...@googlegroups.com, Gil Tene
I'm using the serialized approach, with ergonomics-driven heap
management. protobuf really sucks for a lot of reasons which would be easily
remediable if one were a protobuf maintainer, but I'm not and I don't care
to be, and I don't have the time or inclination to write a new serializer.

However, and to Gil's point, following the ergonomics approach, when I
resize an array, I drop a garbage-collectable array which I no longer
"own", so from the PoV of the memory management code, it forms part of
the "other" part of the heap, i.e. I have to subtract it from the total
memory before computing the proportion of the heap I'm allowed to use.
Over time, given that GC isn't timely, I can lose all my heap to these
collectable arrays, and if I follow the rule about "Never call
System.gc()" this actually isn't a very efficient approach either.

The next best approach is to allocate all my memory in 1Mb blocks, make
each memory "region" be backed by some list of these blocks, so I never
actually hand them to the GC, and just reduce my overall memory usage by
dropping blocks, rather than totally resizing byte arrays.
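
i.e. roughly this shape (just a sketch; block size and names are illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// A pool of fixed 1 MB blocks that are recycled rather than released to the GC,
// so dropped capacity never shows up as floating garbage.
final class BlockPool {
    static final int BLOCK_SIZE = 1 << 20; // 1 MB
    private final Deque<byte[]> free = new ArrayDeque<>();

    byte[] acquire() {
        byte[] b = free.pollFirst();
        return (b != null) ? b : new byte[BLOCK_SIZE];
    }

    void release(byte[] block) {
        free.addFirst(block); // keep the block strongly reachable for reuse
    }
}
```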

My main disbelief is that I'm inventing all of this from scratch. It HAS
to have been done before, right? All I want to do is sort a load of
objects...!

S.


Gil Tene

Nov 12, 2018, 1:32:21 AM
to mechanical-sympathy


On Sunday, November 11, 2018 at 9:53:45 PM UTC-8, Shevek wrote:
> However, and to Gil's point, following the ergonomics approach, when I
> resize an array, I drop a garbage-collectable array which I no longer
> "own", so from the PoV of the memory management code, it forms part of
> the "other" part of the heap, i.e. I have to subtract it from the total
> memory before computing the proportion of the heap I'm allowed to use.
> Over time, given that GC isn't timely, I can lose all my heap to these
> collectable arrays, and if I follow the rule about "Never call
> System.gc()" this actually isn't a very efficient approach either.

The importance or impact of timeliness of GC is often misunderstood. In fact, from a GC cpu overhead perspective, the less timely it is, the better.

Once your code no longer holds any references to an array or object, you shouldn't count it as part of the live set, and shouldn't treat it as something that holds down any memory or that you are "losing all your heap to". From a GC efficiency (and GC cpu cost) perspective, if the array is no longer reachable, it is empty (non-live) heap regardless of whether or not GC collects it in a timely manner. The GC will get to recycling the memory it used to be in eventually, and the more of it there is (as a % of Xmx when GC actually triggers) the more efficient that GC will be. The thing to keep in mind when imagining/modeling this is that most GC algorithms work hard to avoid spending CPU cycles on dead stuff. It's the live (reachable) stuff that costs (and causes GC cpu spend). The larger the empty (dead when GC is triggered) parts of the heap are as a % of the overall Xmx, the less often that CPU cost (of dealing with the live stuff) needs to be paid.

When trying to evaluate the amount of empty heap available, in order to e.g. figure out how to size your temporary structures or caches, the things to estimate are not the "current" amount of heap use or the "current" amount of empty heap. They are the live set (the sum of getUsed() calls on MemoryPoolMXBean.getCollectionUsage() for the various pools in the collector) and the empty heap after collection (the sum of the differences between getMax() and getUsed() on the same), which will best approximate what you need to consider.

It is critical to look at the values that reflect use levels immediately after GC is performed, by using values obtained from MemoryPoolMXBean.getCollectionUsage() as opposed to MemoryPoolMXBean.getUsage(). The latter simply show up as random numbers that fall somewhere between the live set and Xmx, depending on when you measure them. They make for nice plots for tracking GC activity but are generally useless for estimating available empty heap when trying to make sizing choices.
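
A minimal sketch of that estimate (heap pools only; pools that report no collection usage, or an undefined max, are skipped):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Approximate live set and post-GC empty heap from the heap pools'
// collection usage (i.e. usage measured right after the last collection).
static long[] liveSetAndEmptyHeap() {
    long live = 0, empty = 0;
    for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
        if (pool.getType() != MemoryType.HEAP) continue;
        MemoryUsage afterGc = pool.getCollectionUsage();
        if (afterGc == null || afterGc.getMax() < 0) continue; // not collected / max undefined
        live += afterGc.getUsed();
        empty += afterGc.getMax() - afterGc.getUsed();
    }
    return new long[] { live, empty };
}
```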
 

Mindaugas Žakšauskas

Nov 12, 2018, 9:24:36 AM
to mechanical-sympathy
Hi,

Do you require the entire object to be loaded into memory in order to compare it with another object? Do these objects have IDs, and could they be accessed by ID quickly after sorting? If so, you could derive a lightweight proxy containing only a few attributes of each object and work with those, reducing the amount of heap needed. After the lightweights are sorted, you would know the position of each one and, in turn, of its parent.

If you can't extract a lightweight attribute subset, perhaps you can come up with some sort of universal object score for each object and work with that?

m.

Shevek

Nov 12, 2018, 4:57:17 PM
to mechanica...@googlegroups.com, Mindaugas Žakšauskas
We are doing sorting by proxy. Right now I have a byte[] serialized as:

[sort-key0, data0, sort-key1, data1, ...]

and a comparator which can compare either key-bytes or values-by-key. We
then sort a separate integer array which points to each sort-key's
offset in the underlying array, then we emit by walking the integer array.
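
In sketch form it looks something like this (simplified to fixed-length keys and signed byte comparison; the real comparator walks the protobuf key bytes):

```java
import java.util.Arrays;
import java.util.Comparator;

class OffsetSort {
    // Indirect sort: records stay packed in one byte[]; we sort an array of
    // offsets pointing at each record's sort key instead of moving records.
    static Integer[] sortByKey(byte[] data, Integer[] keyOffsets, int keyLength) {
        Comparator<Integer> byKeyBytes = (a, b) -> {
            for (int i = 0; i < keyLength; i++) {
                // Signed byte comparison for simplicity; a real key encoding
                // may need unsigned ordering.
                int cmp = Byte.compare(data[a + i], data[b + i]);
                if (cmp != 0) return cmp;
            }
            return 0;
        };
        Arrays.sort(keyOffsets, byKeyBytes);
        return keyOffsets; // emit records by walking the sorted offsets
    }
}
```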

The challenge is to merge duplicate neighbouring objects by key, then
emit in order, so any proxy method which requires me to do an
index-lookup for an object on disk will die horribly in seek. So we are
sorting each block in a (now fairly appropriately-sized) byte array,
dumping that to file, then stream-merging files.

We still need another two orders of magnitude of performance, but at
least we're back to the profiler for a new hypothesis now.

I think to improve from here, we will need to (a) use EWMA to compute
appropriate buffer sizes per buffer, since throughput is not uniform,
and (b) use an interface to front a set of 4Mb-sized byte[] arrays
rather than using a single 1Gb array, so that (b1) we can exceed 2Gb,
and (b2) we can allocate and free at a finer granularity.
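
For (b), the shape would be roughly this (sketch; chunk size and naming illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// A byte store fronted by fixed 4 MB chunks: addressable beyond 2 GB,
// and shrinkable by dropping whole chunks instead of reallocating one huge array.
final class ChunkedBytes {
    static final int CHUNK_BITS = 22;             // 4 MB chunks
    static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    static final int CHUNK_MASK = CHUNK_SIZE - 1;
    private final List<byte[]> chunks = new ArrayList<>();

    byte get(long index) {
        return chunks.get((int) (index >>> CHUNK_BITS))[(int) (index & CHUNK_MASK)];
    }

    void put(long index, byte value) {
        int chunk = (int) (index >>> CHUNK_BITS);
        while (chunks.size() <= chunk) chunks.add(new byte[CHUNK_SIZE]);
        chunks.get(chunk)[(int) (index & CHUNK_MASK)] = value;
    }

    void truncate(long newLength) { // free tail capacity at chunk granularity
        int keep = (int) ((newLength + CHUNK_MASK) >>> CHUNK_BITS);
        while (chunks.size() > keep) chunks.remove(chunks.size() - 1);
    }
}
```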

The garbage collector is no longer a significant participant in our
computation, but protobuf still has disappointingly bad mechanical
sympathy for stream processing, and the cost of <init> of protobuf
objects is currently an unavoidably large percentage of runtime.

S.

Michael Meehan

Nov 15, 2018, 1:02:05 AM
to mechanica...@googlegroups.com, min...@gmail.com
Is the thing that's consuming your list a stream-consumer? Does the consumer of the next operation need the whole sorted list, or just the first element (the largest/smallest)? If both are true, then what matters most is delivering the first element quickly and achieving reasonable throughput afterwards (no slower than the next consumer in the stream can process). In that case, you don't want to take a lot of time to generate a fully sorted list; you want to take a very small time to find the first element to pass on!




Shevek

Nov 15, 2018, 2:19:04 AM
to mechanica...@googlegroups.com, Michael Meehan
It's a batch job. We are taking log files from N days, merging records
by name, and building a sqlite database out of them. So the time to
first record isn't important at all, it's entirely about time to
completion for the whole dataset. What we have is effectively an N-way
map-reduce in Java.

We need another two orders of magnitude in performance, but I can see a
set of paths which ought to get us at least one.

There is also the option to swap out sqlite for C* or scylla or
something, if we think that the distributed update will be better than a
local sort-merge, but this isn't currently clear, and it isn't going to
happen this cycle.

S.

Siva Velusamy

Nov 15, 2018, 1:22:05 PM
to mechanica...@googlegroups.com
In the thread below, gil@ makes the following statement:

> For most GC algorithms, or at least for the more costly parts of such algorithms,
> GC efficiency is roughly linear to EmptyHeap/LiveSet. ...
> This should [hopefully] make it obvious why using SoftReferences is a generally terrible idea.

I'm not following that conclusion. Is that because SoftReferenced objects are still considered to be part of the LiveSet in the calculation above, and that leads to increased GC cost?

As a follow up, what is the recommended way to cache objects? Currently many places in our codebase use Guava's CacheBuilder with softValues. Does it even make sense to use weakValues, or is the suggestion to just use strong values, but try to restrict the number of entries?


Benjamin Manes

Nov 15, 2018, 10:50:46 PM
to mechanical-sympathy
Jeremy Manson describes the process fairly well in an old blog post.

Because soft reference retention is based on the amount of memory given to the JVM, the cost increases as more memory is added. This can be surprising, e.g. if soft references are abused to fill up the heap, then full collections will happen much more frequently, and resolving that by increasing the instance size makes the problem worse.

I believe that this is less of a problem on region-based collectors, like G1, compared to generational ones. The regions are independent and can be evacuated much more aggressively, so the impact on pause times may be lessened. Of course, that also negatively impacts application cache hit rates if used in that manner.

> Does it even make sense to use weakValues?

Yes, weak values can be aggressively collected, so the impact is much smaller. However, their use cases are often quite different.

> what is the recommended way to cache objects?

Prefer an explicit size bound on strongly referenced objects, expose a setting, and report the statistics for monitoring.

As soft references are evicted using a global LRU when there is GC pressure, very poor victims may be chosen. For example, a performance-sensitive application cache might have its entries removed in favor of keeping noisy, low-value entries from a cache buried deep within an external dependency. It can be difficult to predict and replicate performance problems, since the cache is reactive to environmental conditions.

An application-designed cache can also leverage algorithmic improvements. Modern eviction policies can significantly outperform LRU when frequency provides a better indicator. In those cases it may also be more GC hygienic, as low-value items are quickly evicted, whereas LRU can cause unnecessary old-gen promotion because it ages items much more slowly.

Historically, the arguments in favor of soft caches were application simplicity (less tuning) and better concurrency (no explicit locking, as the GC maintains it). Neither bears out except in demo code and microbenchmarks, and they tend to result in worse behavior overall. You should strive to remove soft references from your application and generally be very wary of their usage when encountered.
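
For the CacheBuilder usage mentioned above, the size-bounded shape is roughly this (the bound itself is an arbitrary placeholder you would expose as a setting):

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

// Strong values with an explicit, monitorable bound instead of softValues():
Cache<String, byte[]> cache = CacheBuilder.newBuilder()
        .maximumSize(100_000)   // expose this limit as a configuration setting
        .recordStats()          // cache.stats() then gives hit/miss/eviction counts
        .build();
```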

Kodewerk

Nov 15, 2018, 11:56:02 PM
to mechanica...@googlegroups.com

On Nov 15, 2018, at 7:50 PM, Benjamin Manes <ben....@gmail.com> wrote:

> Jeremy Manson describes the process fairly well in an old blog post.
>
> Because soft reference retention is based on the amount of memory given to the JVM, the cost increases as more memory is added. This can be surprising, e.g. if soft references are abused to fill up the heap, then full collections will happen much more frequently, and resolving that by increasing the instance size makes the problem worse.

There are two factors taken into consideration when deciding to process a soft reference. The first is the amount of free memory in the heap, and the second is the time since last access. So, increasing the heap will result in more soft references being accumulated, but I'm not sure that this is the only cause of longer pause times. With generational collectors, at some point the scan for roots will dominate, and that activity is linear in the size of the heap. IOW, a larger heap yields longer scan-for-roots times. The next inflation would be copy costs (compaction). Finally, processing a SoftReference is generally more expensive than processing any other reference type. That said, IME, it's roughly constant for any application regardless of the size of the heap.


> I believe that this is less of a problem on region-based collectors, like G1, compared to generational ones. The regions are independent and can be evacuated much more aggressively, so the impact on pause times may be lessened. Of course, that also negatively impacts application cache hit rates if used in that manner.

SoftReference (in fact, any reference type) can cause the collector to degenerate into a condition where the only way to recover is via Full GC. It's a bug, and it's a bugger to reproduce, so a bugger to debug, but I've seen it in many different environments with G1 running on 7, 8, 9, 10, and 11.


> Does it even make sense to use weakValues?
>
> Yes, weak values can be aggressively collected, so the impact is much smaller. However, their use cases are often quite different.

I'm not sure what you mean here by aggressively collected. The WeakReference collection policy is less expensive to calculate than Soft's (likewise for Phantom), but I wouldn't say it's aggressively collected. But then, I'd need to understand what you mean by aggressively collected.


> what is the recommended way to cache objects?
>
> Prefer an explicit size bound on strongly referenced objects, expose a setting, and report the statistics for monitoring.
>
> As soft references are evicted using a global LRU when there is GC pressure, very poor victims may be chosen. For example, a performance-sensitive application cache might have its entries removed in favor of keeping noisy, low-value entries from a cache buried deep within an external dependency. It can be difficult to predict and replicate performance problems, since the cache is reactive to environmental conditions.
>
> An application-designed cache can also leverage algorithmic improvements. Modern eviction policies can significantly outperform LRU when frequency provides a better indicator. In those cases it may also be more GC hygienic, as low-value items are quickly evicted, whereas LRU can cause unnecessary old-gen promotion because it ages items much more slowly.
>
> Historically, the arguments in favor of soft caches were application simplicity (less tuning) and better concurrency (no explicit locking, as the GC maintains it). Neither bears out except in demo code and microbenchmarks, and they tend to result in worse behavior overall. You should strive to remove soft references from your application and generally be very wary of their usage when encountered.

SoftReference puts stress on the allocators, mutators, and the collector; however, all reference types put some level of stress on the allocators and the collector (and sometimes the mutators).
That said, reference types allow the garbage collector, which has a global view, to make decisions about object lifecycle that otherwise would be difficult if not impossible to make. There are use cases for them, as has been pointed out here. So, if you need them, you need them...

Kind regards,
Kirk


Steven Stewart-Gallus

Jan 19, 2019, 6:28:45 PM
to mechanica...@googlegroups.com
I'm really confused.

You're talking about putting the data into sqlite, which suggests there really isn't that much log data and it could be filtered with a hacky shell script. But then you're talking about a lot of heavy optimisation, which suggests you really may need to put in custom effort. Precisely how much log data really needs to be filtered? You're unlikely to be able to filter much of the data faster than the system utilities, which are often very old and well-optimised C code. I'm reminded of the old story of the McIlroy and Knuth word count programs.

Anyway, while this is a very enlightening discussion, it is probably worthwhile to reuse as much existing system utility code as you can instead of writing your own.

You know more about your problem than I do, so I'll trust you if you think you really need to delve into optimisation; it's just that you mentioned sqlite.

Greg Young

Jan 19, 2019, 7:40:13 PM
to mechanica...@googlegroups.com
Given that the list has people with many levels of experience/age/etc., it's probably worth expounding on "I'm reminded of the old story of the McIlroy and Knuth word count programs":

https://franklinchen.com/blog/2011/12/08/revisiting-knuth-and-mcilroys-word-count-programs/




--
Studying for the Turing test

Shevek

Jan 20, 2019, 4:10:51 AM
to mechanica...@googlegroups.com
This project is very much in-progress.

We need to sort about 1e13 records, several terabytes when compressed,
sort-merge, and end up with about 1e10 in sqlite. Right now, we are
running sqlite with 1e9 objects, and it isn't an issue. sqlite is much
better than one would naively believe it to be, if used appropriately.
Oddly enough, its VM is several times faster than pg, for IO-free raw
mathematical computation, too.

Our current bottleneck is the serialization and allocation overhead of
protobuf. Many of the serializers recommended on this list can only
serialize fixed-size structures, but we're working on an implementation
with flatbuf right now. Thank you, Georges. flatbuf will also permit us
to avoid having a separate serialized copy of the sort-key.

We are going to experiment with reading files via mmap rather than I/O,
but we have not yet done so. It's tempting to find some way to call
madvise(SEQUENTIAL) on the mmap. I'm not sure what the other effects are
likely to be, but it may help us keep most/all of the data
effectively off-heap during the merge phase.
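
The mapping part is straightforward enough (sketch; there is no portable madvise from plain Java, so that bit would need JNI/JNA or similar):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Map a sorted run read-only instead of streaming it through the heap.
// Note: a single MappedByteBuffer is limited to ~2 GB, so larger run files
// would need to be mapped in windows.
static MappedByteBuffer mapRun(Path file) throws IOException {
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
        return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
    }
}
```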

We have mastered all the (currently known) GC issues, thank you, Gil.

Accessing the objects fast by id is not currently possible, although
it's definitely an angle we could pursue. A major purpose of this sort
is to merge identical objects, or data under the same key, so even if we
did store by id, it would have to be a mutable store, which would have
its own issues.

We started with https://github.com/cowtowncoder/java-merge-sort and
assumed that, due to the simplicity of that implementation, it would be
easy to do better, but it turns out that its simplicity is not actually
a significant limiting factor. However, once one has done the
serialization, a custom version of Guava's Iterators.mergeSorted() is
somewhat better.
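
i.e. roughly this shape (sketch; our custom version layers the duplicate-combining on top of the stock call):

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import com.google.common.collect.Iterators;

// k-way merge of the pre-sorted run files; duplicate neighbours can then be
// combined by key as they come off the merged iterator.
static <T> Iterator<T> mergeRuns(List<Iterator<T>> sortedRuns, Comparator<T> byKey) {
    return Iterators.mergeSorted(sortedRuns, byKey);
}
```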

S.



Muruga Prasath Ganesan

Feb 12, 2019, 4:31:45 PM
to mechanical-sympathy

Shevek,
Can you please tell us the final solution that you implemented to fix the issue?

Florian Enner

Nov 22, 2019, 4:04:41 AM
to mechanical-sympathy
I ran into similar issues with Protobuf-Java in a different use case, and ended up maintaining a zero-allocation fork of JavaNano for a few years. However, between needing to support additional features, and Google abandoning the project, I decided to write a new Protobuf library that's entirely Java based.

It's allocation free in steady state, can be used with off-heap memory, and for most use cases has roughly 2x the throughput of Protobuf-Java.


I know it's a bit late for this thread, but I couldn't share the previous version. Maybe it still helps.

Florian