reading LAS chunk by chunk using LASzip

Simone Rapposelli

Nov 26, 2016, 9:34:58 AM
to LAStools - efficient tools for LiDAR processing
Hi everybody,

I am trying to read a set of points at once, instead of a single point, from a LAS file using LASzip, but I haven't found anything except the example that uses laszip_read_point for a single point.
Could you please give me some advice?

Thank you in advance

Martin Isenburg

Nov 26, 2016, 12:15:04 PM
to LAStools - efficient command line tools for LIDAR processing
Hello,

Under the hood of the dynamic link library API (DLL) is the LASzip library, which decompresses the points one by one, so from a decompression-speed point of view there is nothing to be gained from decompressing points in chunks (those would be implemented by decompressing single point after single point anyway). What would improve is the number of DLL calls and the repeated copying of point attributes to and from the point struct (assuming the chunk buffer that the points would be decompressed into maps directly to your internal representation). By linking statically (LASzip is LGPL 2.1 with a static-linking exception) you could overcome the penalty of the repeated DLL calls (I am not sure how much overhead that is).

In summary, laszip_read_chunk_of_points() does not currently exist. Would it be worthwhile to add it from a performance point of view? Maybe you could experiment yourself? It should not be too hard to write such a function given the info above.
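
For anyone who wants to experiment, here is a minimal sketch of what such a user-side wrapper could look like on top of the existing DLL calls (laszip_get_point_pointer plus laszip_read_point). The name read_chunk_of_points, the my_point struct, and the choice of copied fields are placeholders, not part of the LASzip API:

#include "laszip_api.h"

/* purely illustrative: my_point and read_chunk_of_points() are NOT part of
   the LASzip API -- just what a caller-side wrapper could look like */
typedef struct { laszip_I32 X, Y, Z; laszip_U16 intensity; } my_point;

/* returns the number of points actually read (may be less at end of file) */
static laszip_I64 read_chunk_of_points(laszip_POINTER reader,
                                       my_point* buf, laszip_I64 count)
{
    laszip_point_struct* p = 0;
    if (laszip_get_point_pointer(reader, &p)) return -1;
    laszip_I64 i;
    for (i = 0; i < count; i++)
    {
        if (laszip_read_point(reader)) break;  /* error or no more points */
        buf[i].X = p->X;                       /* copy only what the caller needs */
        buf[i].Y = p->Y;
        buf[i].Z = p->Z;
        buf[i].intensity = p->intensity;
    }
    return i;
}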

Regards,

Martin @rapidlasso

Evon Silvia

Nov 30, 2016, 7:24:24 PM
to last...@googlegroups.com
I like this idea because it gives the option to multi-thread the LAZ read/write without having simultaneous drive seeks, thereby speeding up the most expensive part of the LAZ IO – the (de)compression. The way I see this working is that the user would request a chunk of 50k (or whatever) points, LASzip would lock a mutex, read the raw data chunk from disk, unlock the mutex, and then perform the decompression on the raw bytes. Writing could work the same way, except in reverse.

The end result would be that multiple threads could read and write at the same time, but the mutex would protect against simultaneous disk IO.
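
A rough sketch of that locking pattern, assuming pthreads; decompress_chunk() is hypothetical and just stands in for whatever routine would decode the raw bytes of one LAZ chunk:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* hypothetical decoder for the raw bytes of one compressed chunk */
int decompress_chunk(const unsigned char* raw, size_t nbytes, void* out_points);

static pthread_mutex_t io_lock = PTHREAD_MUTEX_INITIALIZER;

/* serialize only the seek+read; run the expensive decompression lock-free */
static int read_and_decompress(FILE* f, long offset, size_t nbytes, void* out_points)
{
    unsigned char* raw = malloc(nbytes);
    if (!raw) return -1;

    pthread_mutex_lock(&io_lock);
    int ok = (fseek(f, offset, SEEK_SET) == 0) &&
             (fread(raw, 1, nbytes, f) == nbytes);
    pthread_mutex_unlock(&io_lock);

    int rc = ok ? decompress_chunk(raw, nbytes, out_points) : -1;
    free(raw);
    return rc;
}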

Alternatively, you could add a read/write interface that accepts a raw buffer instead of a file pointer, and then leave the actual reading/writing of raw bytes up to the user.

Evon

Terje Mathisen

Dec 1, 2016, 9:42:02 AM
to last...@googlegroups.com
Evon Silvia wrote:
> I like this idea because it gives the option to multi-thread the LAZ
> read/write without having simultaneous drive seeks, thereby speeding
> up the most expensive part of the LAZ IO – the (de)compression. The
> way I see this working is that the user would request a chunk of 50k
> (or whatever) points, LASzip would lock a mutex, read the raw data
> chunk from disk, unlock a mutex, and then perform the decompression on
> the raw bytes. Writing could work the same way except in reverse.

Until I saw this thread I (naively?) assumed that this was the way it worked, since it is the "only sensible" way to do it. :-)

However, as long as Martin is using the standard C file IO or C++ stream library calls, this happens under the hood anyway: even if you do single-byte fgetc() file reads, the IO library will perform a block operation to fill a 2-64 KB internal buffer and then supply data from that buffer. This is (among many other things) in order to allow the library code to suppress CR ('\r') bytes from an input stream on a Windows platform.

I have however also written quite a bit of serious
compression/decompression code, in the form of DVD and BluRay decoding,
the world's fastest Ogg Vorbis decoder and probably also the fastest LZ4
code: I always allocate a significant chunk of buffer space (a multiple
of the 4K page and common file block allocation size) and do all
subsequent processing to/from this buffer.

>
> The end result would be that multiple threads could read and write at
> the same time, but the mutex would protect against simultaneous disk IO.

You _can_ do this, but I have run tests that clearly show it isn't needed and can often hurt: you just need a local buffer large enough that each OS file IO call becomes sufficiently big. Around 1 MB is the point where a single laptop hard drive can handle at least 10 simultaneous full-resolution h264 video streams with no stutter or glitching.

For lastile, which can easily have 1000+ open file handles, this would require a GB or more of buffer space, so there the cutoff should probably be lowered.
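
For reference, one way to experiment with larger OS-level reads without touching the reading code itself is to enlarge the stdio buffer right after opening the file (a sketch, not something LAStools necessarily does; the 1 MB figure simply mirrors the number above):

#include <stdio.h>

/* give the stream a 1 MB fully buffered internal buffer;
   setvbuf() must be called before the first read on the stream */
FILE* open_buffered(const char* name)
{
    FILE* f = fopen(name, "rb");
    if (f) setvbuf(f, NULL, _IOFBF, 1 << 20);
    return f;
}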

One very interesting point though is that the Microsoft Visual C++
compiler library will in fact do a lot of mutex handling under the hood!
I.e. when you use the MT (multithreading) library, every single fgetc()
byte read is internally protected by a mutex to make sure that multiple
users of the same file handle cannot stomp on each other. This is
superfluous for at least 99% of all code, even code which does a lot of
true MT work. This one misfeature means that the same C(++) source code
running on the same hardware can be 3X faster on Linux than Windows!
>
> Alternatively, you could add a read/write interface that accepts a raw
> buffer instead of a file pointer, and then leave the actual
> reading/writing of raw bytes up to the user.

This is a good idea which follows quite naturally when you start with a buffer interface.

Terje
--
- <Terje.M...@tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Andrew Bell

Dec 1, 2016, 10:23:05 AM
to last...@googlegroups.com
On Thu, Dec 1, 2016 at 3:23 AM, Terje Mathisen <terje.m...@tmsw.no> wrote:
Evon Silvia wrote:
I like this idea because it gives the option to multi-thread the LAZ read/write without having simultaneous drive seeks, thereby speeding up the most expensive part of the LAZ IO – the (de)compression. The way I see this working is that the user would request a chunk of 50k (or whatever) points, LASzip would lock a mutex, read the raw data chunk from disk, unlock a mutex, and then perform the decompression on the raw bytes. Writing could work the same way except in reverse.

Until I saw this thread I (naiviely?) assumed that this was the way it worked, since it is the "only sensible" way to do it. :-)

However, as long as Martin is using the standard C file IO or C++ stream library calls, then this happens underneath the hood anyway, i.e. even if you do single-byte fgetc() file reads the IO library will under the hood perform a block operation to fill a 2-64K internal buffer, then supply data from that buffer. This is (among many other things) in order to allow the library code to suppress CR ('\r') bytes from an input stream on a Windows platform.

People who write disk drivers and I/O libraries know what they're doing.  It's generally true that the buffers of which you speak are already constructed optimally for the I/O, especially in the case of sequential reads, which is what LAZ requires.  Note also that using SSD, which is becoming common, is different from using a magnetic disk in many ways.

As for chunks and threads, the ability to do this depends on the way that the LAZ file is written.  You must read the data in each chunk sequentially and the chunksize is up to the LAZ writer, not the reader.  Whether there is real benefit depends on what you do with the data after you read it.  For example, if your application wants to stream data, not having your input data come in sequentially complicates things if you want to maintain point order.  On the other hand, if you need to buffer and store all the data before performing some operation, it might be a win -- but it's still more complex, which typically means more maintenance effort.

--

Terje Mathisen

Dec 1, 2016, 10:50:02 AM
to last...@googlegroups.com
Andrew Bell wrote:
> On Thu, Dec 1, 2016 at 3:23 AM, Terje Mathisen <terje.m...@tmsw.no
> <mailto:terje.m...@tmsw.no>> wrote:
>
> Evon Silvia wrote:
>
> I like this idea because it gives the option to multi-thread
> the LAZ read/write without having simultaneous drive seeks,
> thereby speeding up the most expensive part of the LAZ IO –
> the (de)compression. The way I see this working is that the
> user would request a chunk of 50k (or whatever) points, LASzip
> would lock a mutex, read the raw data chunk from disk, unlock
> a mutex, and then perform the decompression on the raw bytes.
> Writing could work the same way except in reverse.
>
>
> Until I saw this thread I (naiviely?) assumed that this was the
> way it worked, since it is the "only sensible" way to do it. :-)
>
> However, as long as Martin is using the standard C file IO or C++
> stream library calls, then this happens underneath the hood
> anyway, i.e. even if you do single-byte fgetc() file reads the IO
> library will under the hood perform a block operation to fill a
> 2-64K internal buffer, then supply data from that buffer. This is
> (among many other things) in order to allow the library code to
> suppress CR ('\r') bytes from an input stream on a Windows platform.
>
>
> People who write disk drivers and I/O libraries know what they're
> doing. It's generally true that the buffers of which

Actually, that isn't true: the disk driver level is normally OK, but as I wrote, the lowest-level file IO library on Windows really sucks if you actually use the Unix tradition of single-byte reads and writes for sequential processing!

On unix/linux fgetc() is normally a macro that simply picks up the next
byte from the IO buffer, using very efficient inline code generation,
only falling back on a function call in the case of an empty buffer.

Something like this:

return (fh && (fh->bufpos < fh->bufmax))
         ? fh->buf[fh->bufpos++]
         : _fgetc(fh);

which is pretty close to optimal.

The Microsoft library which implements the same function consists of
1000+ lines of code and at least 3 kernel calls.
> you speak are already constructed optimally for the I/O, especially in
> the case of sequential reads, which is what LAZ requires. Note also
> that using SSD, which is becoming common, is different from using a
> magnetic disk in many ways.
Nope, you have to work with larger data items in order to reduce the
library/OS overhead. Please try some benchmarks!
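
If anyone wants to try such a benchmark, a minimal sketch could look like this (it sums the bytes so the compiler cannot drop the reads; run it twice, or swap the order of the two passes, so OS caching does not skew the comparison):

#include <stdio.h>
#include <time.h>

int main(int argc, char** argv)
{
    static unsigned char buf[1 << 20];   /* 1 MB block buffer */
    unsigned long long sum = 0;
    if (argc < 2) return 1;
    FILE* f = fopen(argv[1], "rb");
    if (!f) return 1;

    /* pass 1: byte-at-a-time fgetc() */
    clock_t t0 = clock();
    int c;
    while ((c = fgetc(f)) != EOF) sum += (unsigned char)c;
    double t_byte = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* pass 2: 1 MB fread() blocks */
    rewind(f);
    t0 = clock();
    size_t n, i;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        for (i = 0; i < n; i++) sum += buf[i];
    double t_block = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("fgetc: %.2f s   fread(1 MB): %.2f s   (checksum %llu)\n",
           t_byte, t_block, sum);
    fclose(f);
    return 0;
}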

>
> As for chunks and threads, the ability to do this depends on the way
> that the LAZ file is written. You must read the data in each chunk
> sequentially and the chunksize is up to the LAZ writer, not the
> reader. Whether there is real benefit depends on what you do with the
> data after you read it. For example, if your application wants to
> stream data, not having your input data come in sequentially
> complicates things if you want to maintain point order. On the other
> hand, if you need to buffer and store all the data before performing
> some operation, it might be a win -- but it's still more complex,
> which typically means more maintenance effort.

I would probably never try to handle LAZ data the same way as a BluRay
video stream, i.e. using parallel streams to process individual parts of
the screen! Such code is simply far too easy to get wrong.

For LiDAR the solution is much simpler: Just use buffered tiles and run
all the jobs on as many cores as you have available. :-)

Terje
>
> --
> Andrew Bell
> andrew....@gmail.com <mailto:andrew....@gmail.com>

Albert Godfrind

Dec 1, 2016, 11:25:35 AM
to last...@googlegroups.com
Very interesting discussion. The secret sauce of processing large amounts of data in acceptable time is definitely to exploit parallelism at all levels.

Which prompts this question: it seems like it would make a lot of sense to start processing lidar data using big-data processing frameworks like Hadoop or Spark. Is anyone looking in that direction?

What about using NoSQL (key-value) databases for storing point clouds?

Albert

Evon Silvia

Dec 1, 2016, 12:33:45 PM
to last...@googlegroups.com
Terje Mathisen wrote:
Evon Silvia wrote:
The end result would be that multiple threads could read and write at the same time, but the mutex would protect against simultaneous disk IO.

You _can_ do this, but I have performed tests that show clearly that it isn't needed and can often hurt: You just need a large enough local buffer so that each OS file IO call becomes large enough, with 1 MB as the point where you can use a single laptop hard drive to handle at least 10 simultaneous h264 full-res video streams with no stutter or glitching.

I suppose I just assumed that I would have to mutex-guard against simultaneous reads. How could they possibly protect against race conditions if the user isn't necessarily reading the file in sequence? For example, when pulling the spatial buffer from a neighbor tile, I'm only reading the relevant points within the buffer (using the LAX index), not the entire file.

My tests with multithreading IO of a single file on Windows seemed to agree with your assertion of 1MB as optimal. I landed on around 500k points as a nice number, but I defer to your expertise as you clearly have more experience and a deeper understanding than I do. 

For LiDAR the solution is much simpler: Just use buffered tiles and run all the jobs on as many cores as you have available. :-)

Generally this multi-process idiom works great, but I repeatedly ran into issues from having my multiple processes trying to access the same disk at the same time and thrashing (thanks, Windows...), so I took a deeper dive into multi-threaded processing instead. When I run my LAS IO using 4 threads on a buffer like I described, I ended up with a 50% IO boost.
My LAS read/write does more memory expansion than Martin's clever design, so he might not see the same gains I did, but I'd like to explore the option with LAZ, and that's not possible with the "read next point" design it currently has, AFAIK.

Besides, LAZ lends itself very well to internal buffering because I believe it's already stored in data blocks of roughly 50k points.

Albert – I'm just a hack programmer, so I've never done a deep dive into true spatial databases. They seemed to require a level of infrastructure beyond what I currently have access to.

Fun discussion!

Evon

Christopher Crosby

Dec 1, 2016, 7:57:21 PM
to last...@googlegroups.com
We did a bit of work on MapReduce/Hadoop for gridding lidar point cloud data back in 2010:

Krishnan S, Baru C, Crosby CJ.  2010.  Evaluation of MapReduce for Gridding LIDAR Data. 2010 IEEE 5th International Conference on Cloud Computing Technology and Science: 33-40 (http://dx.doi.org/10.1109/CloudCom.2010.34)

-cc

--
Christopher Crosby
Geodetic Imaging & OpenTopography Project Manager
UNAVCO, Inc.

Terje Mathisen

Dec 1, 2016, 7:58:08 PM
to last...@googlegroups.com
Evon Silvia wrote:
> Terje Mathisen wrote:
>
> Evon Silvia wrote:
>
> The end result would be that multiple threads could read and
> write at the same time, but the mutex would protect against
> simultaneous disk IO.
>
>
> You _can_ do this, but I have performed tests that show clearly
> that it isn't needed and can often hurt: You just need a large
> enough local buffer so that each OS file IO call becomes large
> enough, with 1 MB as the point where you can use a single laptop
> hard drive to handle at least 10 simultaneous h264 full-res video
> streams with no stutter or glitching.
>
>
> I suppose I just assumed that I would have to mutex-guard against
> simultaneous reads. How could they possibly protect against race
> conditions if the user isn't necessarily reading the file in sequence?
> For example, when pulling the spatial buffer from a neighbor tile, I'm
> only reading the relevant points within the buffer (using the LAX
> index), not the entire file.

You use separate file handles in each thread in that case; the OS+lib takes care of all the simultaneous access rules, which are dead simple for the normal read-only reads we do.
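
In code form, the separate-handle approach is simply each worker thread opening its own FILE* on the same path (a sketch; the path argument and the elided body are placeholders):

#include <pthread.h>
#include <stdio.h>

/* each worker gets a private handle, and with it an independent file
   position, so no mutex is needed around the read-only reads */
static void* worker(void* arg)
{
    const char* path = (const char*)arg;
    FILE* f = fopen(path, "rb");
    if (f)
    {
        /* ... seek to this thread's tile/region and read it ... */
        fclose(f);
    }
    return NULL;
}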

> My tests with multithreading IO of a single file on Windows seemed to agree with your assertion of 1MB as optimal. I landed on around 500k points as a nice number, but I defer to your expertise as you clearly have more experience and a deeper understanding than I do.
>
>> For LiDAR the solution is much simpler: Just use buffered tiles and run all the jobs on as many cores as you have available. :-)
>
> Generally this multi-process idiom works great, but I repeatedly ran into issues from having my multiple processes trying to access the same disk at the same time and thrashing (thanks, Windows...), so I took a deeper dive into multi-threaded processing instead. When I run my LAS IO using 4 threads on a buffer like I described, I ended up with a 50% IO boost. My LAS read/write does more memory expansion than Martin's clever design, so he might not see the same gains I did, but I'd like to explore the option with LAZ, and that's not possible with the "read next point" design it currently has, AFAIK.

Right.

Terje