libarchive performance when scanning through a large archive with many files (7z) is poor

Emanuele Oriani

Feb 6, 2021, 5:08:37 AM

to libarchiv...@googlegroups.com

Dear libarchive maintainers/devs,

First of all, thanks for this library: it's really simple to use and supports basically all known archive file formats.

I do have a performance issue though. I'm not sure if this has been brought to your attention, but when one has to scan through an archive and then extract files/data in a non-linear fashion, the performance is atrocious.
For example, I have a 7z file containing:

/dir/abc.bin
/dir2/cdef.bin
/dir2/xyz.bin

and many more files (10,000+). Now, I have to first extract '/dir2/xyz.bin', then potentially '/dir/abc.bin' and optionally '/dir2/cdef.bin'.

I have implemented this non-linear access by scanning the whole archive from scratch each time and extracting the file if its name matches (via regex). For each file/regex path I execute the following pseudo-code:

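#include <archive.h>
#include <archive_entry.h>
#include <string.h>

void extract_file(const char* filename) {
   struct archive_entry* entry;
   struct archive* a_ = archive_read_new();
   archive_read_support_filter_all(a_);
   archive_read_support_format_all(a_);
   // 10240 is the read block size in bytes
   if(archive_read_open_filename(a_, "my1GiB.archive.7z", 10240) != ARCHIVE_OK) {
      archive_read_free(a_);
      return;
   }
   while(archive_read_next_header(a_, &entry) == ARCHIVE_OK) {
      // this is pseudo code, I'm using a regex etc etc
      if(strcmp(filename, archive_entry_pathname(entry)) == 0) {
         // use archive_read_data to get data
      }
   }
   // dispose of a_ properly
   archive_read_free(a_);
}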
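
To make the cost of rescanning concrete, here is a small self-contained model. It is not libarchive code: the "archive" is a plain array of entry names, visiting one element stands in for one archive_read_next_header() call, and the function names visits_rescan_each and visits_single_pass are made up for this illustration. It only counts header visits and says nothing about decompression cost, which for 7z may well dominate.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model: an archive is an ordered list of entry names.
   One visit stands in for one archive_read_next_header() call. */

/* Rescan from the first entry for every wanted name (stopping at the
   match, which is kinder than the loop in the post). Returns the total
   number of entries visited. */
size_t visits_rescan_each(const char *const *entries, size_t n_entries,
                          const char *const *wanted, size_t n_wanted)
{
    size_t visits = 0;
    for (size_t w = 0; w < n_wanted; ++w) {
        for (size_t e = 0; e < n_entries; ++e) {
            ++visits;
            if (strcmp(entries[e], wanted[w]) == 0)
                break; /* found: stop this scan */
        }
    }
    return visits;
}

/* One pass over the entries, checking each against the whole wanted list.
   found[i] is set to 1 once wanted[i] has been seen. Returns the number
   of entries visited. */
size_t visits_single_pass(const char *const *entries, size_t n_entries,
                          const char *const *wanted, size_t n_wanted,
                          int *found)
{
    assert(found != NULL || n_wanted == 0);
    size_t visits = 0;
    size_t remaining = n_wanted;
    for (size_t i = 0; i < n_wanted; ++i)
        found[i] = 0;
    for (size_t e = 0; e < n_entries && remaining > 0; ++e) {
        ++visits;
        for (size_t w = 0; w < n_wanted; ++w) {
            if (!found[w] && strcmp(entries[e], wanted[w]) == 0) {
                found[w] = 1;
                --remaining;
                break; /* this entry is accounted for */
            }
        }
    }
    return visits;
}
```

In this model the single pass visits each entry at most once, but it hands back matches in archive order rather than in the order I need them, so the data would have to be buffered somewhere until it is actually used.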

I am not explicitly calling archive_read_data_skip, as per the notes at https://github.com/libarchive/libarchive/wiki/Examples#List_contents_of_Archive_stored_in_File .

Am I doing it 'right'? Is there a better way to somehow 'cache' the position/entry of a_ in the archive stream?
To be honest, I'd understand if the answer is "no", given the universal support for all archive formats (some of which, I imagine, can only be accessed linearly).

Thanks again for your great work!
