[python] which library to use to read/write huge amounts of data


fruity

Oct 5, 2016, 4:10:41 AM
to Python Programming for Autodesk Maya
Hi there !

I'm working on a skin weight IO tool (or similar), and I was wondering what the fastest way of reading/writing the data is, mainly between ConfigParser, json, xml and raw text.
From my research, it seems that json is the best option (even better if ujson can be installed), but that sounds surprising, as I thought xml was designed precisely to handle huge datasets (and xml is also what Maya uses to export weights, by the way). And finally, I'd like to know whether writing raw text, without using any parser, could be faster? I don't really need to access things line by line, so maybe that would be the fastest solution. But if I go for raw text, I don't really know how the memory will be handled (for instance, if I have to read a 100,000-line file, does that mean the entire document needs to be loaded into memory, resulting in memory leaks or similar issues? If I use a parser, I guess the memory management comes for free?)
Any help will be more than welcome, I'm not familiar with this kind of problem!

Many thanks =]



Marcus Ottosson

Oct 5, 2016, 4:43:27 AM
to python_in...@googlegroups.com

Before getting into how to make something fast, I must point out that your journey towards performance is meaningless without having something to run. Your first port of call should be to build your tool in the simplest, most straightforward and obvious way possible, and then start to look at how to make it faster.


Now, when it comes to speed, reading and writing are generally in conflict with each other: the faster it writes, the slower it reads.

The fastest thing to write is also the smallest - that could mean compressing your data for example.

import zlib

data = b"My very long string, "
compressed = zlib.compress(data)

At this point, writing will be at peak speed, dependent only on the quality of your chosen compression method and the amount of content you compress.

But reading suffers.

The fastest thing to read is also the smallest - which in this case means decompressing your data after reading it.

import zlib

data = b"My very long string, "
compressed = zlib.compress(data)

decompressed = zlib.decompress(compressed)
assert decompressed == data

If you compare the size of compressed against decompressed, you’ll find that the compressed version is actually larger than the original.

import zlib

data = b"My very long string, "
compressed = zlib.compress(data)
decompressed = zlib.decompress(compressed)
assert len(compressed) > len(decompressed)

What gives? Because the data is so small, the added overhead of the compression method outweighs the benefits. Compression, with zlib and likely other algorithms, is most effective on large, repetitive data.

import zlib

data = b"My very long string, " * 10000
compressed = zlib.compress(data)
decompressed = zlib.decompress(compressed)
assert len(compressed) < len(decompressed)  # Notice the flipped comparison sign

To get a sense of the savings made, you could try something like this.

import sys
import zlib

data = b"My very long string, " * 10000
compressed = zlib.compress(data)
decompressed = zlib.decompress(compressed)
assert decompressed == data

original_size = sys.getsizeof(data)
compressed_size = sys.getsizeof(compressed)

print("original: %.2f kb\ncompressed: %.2f kb\n= %i times smaller" % (
    original_size / 1024.0,
    compressed_size / 1024.0,
    original_size / compressed_size)
)

Which on my system (Windows 10, Python 3.3, 64-bit) prints:

original: 205.11 kb
compressed: 0.58 kb
= 356 times smaller

Now which is it, do you need it to read fast, or write fast? :)

Best,
Marcus

Alok Gandhi

Oct 5, 2016, 9:25:11 AM
to python_in...@googlegroups.com
Generally speaking, binary compressed data would help in faster io. 


Justin Israel

Oct 5, 2016, 2:36:01 PM
to python_in...@googlegroups.com


On Thu, 6 Oct 2016, 2:25 AM Alok Gandhi <alok.ga...@gmail.com> wrote:
Generally speaking, binary compressed data would help in faster io. 

Yes agreed. 

Using something like zlib, as Marcus suggested, would actually mean you have to do more work, because first you have to serialize your Python objects to strings and then zlib-compress them. While you may save disk space on the resulting string data, you now need to zlib-decompress before marshaling the plain strings back into objects again.
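
As a rough illustration of that extra round trip (a minimal sketch; the weights dict and file name below are just stand-ins):

import json
import zlib

weights = {"vtx0": [0.5, 0.5], "vtx1": [1.0, 0.0]}  # stand-in for real skin data

# writing: serialize to a string first, then compress the resulting bytes
blob = zlib.compress(json.dumps(weights).encode("utf-8"))
with open("weights.json.z", "wb") as f:
    f.write(blob)

# reading: decompress first, then deserialize back into Python objects
with open("weights.json.z", "rb") as f:
    restored = json.loads(zlib.decompress(f.read()).decode("utf-8"))

assert restored == weights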

If you can serialize to a binary format in the first place, you can describe your data in a way that is fast to read. This also means you may have the ability to stream the data back in, as opposed to reading the entire blob at once, depending on which format you choose. JSON, for instance, is capable of streaming when you have one object per line.
A simple example of a binary format is the cPickle module, using protocol 2. It is decently fast for a built-in standard-library option, and it is a self-describing binary format. ujson may beat its encoding performance, however. A long time back I did a post comparing serialization options.
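
For example, a minimal cPickle sketch (assuming Python 2, where the module is called cPickle; in Python 3 it is simply pickle, and the data and file name here are made up):

import cPickle  # Python 3: "import pickle as cPickle"

weights = {"vtx%d" % i: [0.5, 0.5] for i in range(10)}  # stand-in data

# protocol 2 is a compact binary representation
with open("weights.pkl", "wb") as f:
    cPickle.dump(weights, f, 2)

with open("weights.pkl", "rb") as f:
    restored = cPickle.load(f)

assert restored == weights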

Some newer options that are not on that list are protocol buffers and flatbuffers. These are also fast binary formats. 

But also, as Marcus suggested, you will have to test both encoding and decoding performance, as each format has its own strengths. It depends on what your needs are.

Justin 



Justin Israel

Oct 5, 2016, 4:09:05 PM
to python_in...@googlegroups.com
On Thu, Oct 6, 2016 at 7:35 AM Justin Israel <justin...@gmail.com> wrote:

Using something like zlib, as Marcus suggested, would actually mean you have to do more work, because first you have to serialize your Python objects to strings and then zlib-compress them. While you may save disk space on the resulting string data, you now need to zlib-decompress before marshaling the plain strings back into objects again.

Clarifying this bit a little, I wanted to point out that the compression step would be something you would do if you had unstructured data, like huge amounts of text, or some image data maybe (or the final serialized state of your encoded object data), and your goal was to reduce the size so that it could be transported faster and take less space. 
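
For that kind of already-serialized blob, the gzip module can do the compression transparently as the file is written and read (a small sketch; the payload and file name are just stand-ins):

import gzip

payload = b"lots of already-serialized text...\n" * 1000  # stand-in payload

# gzip.open compresses on write and decompresses on read, transparently
with gzip.open("payload.txt.gz", "wb") as f:
    f.write(payload)

with gzip.open("payload.txt.gz", "rb") as f:
    restored = f.read()

assert restored == payload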

Marcus Ottosson

Oct 5, 2016, 5:04:47 PM
to python_in...@googlegroups.com
@fruity, I think you'll need to clarify whether you are looking for advice on serialisation/deserialisation or on reading/writing from disk. You can have something that serialises quickly, but writes slowly. You can also have something which serialises slowly, but writes quickly.

For your tool, you'll need to determine where the bottleneck is, and focus on that. Perhaps it will be in serialising the data to json. Perhaps it will be writing because you are writing to the cloud. Best advice will differ based on which you are having problems with.
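
For example, you could time the two steps separately to see which one dominates (a rough sketch; the payload is made up and the numbers will depend entirely on your data and disk):

import json
import time

data = {"weights": [[0.5, 0.5]] * 100000}  # stand-in payload

t0 = time.time()
text = json.dumps(data)            # serialisation
t1 = time.time()
with open("weights.json", "w") as f:
    f.write(text)                  # disk write
t2 = time.time()

print("serialise: %.3fs  write: %.3fs" % (t1 - t0, t2 - t1))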

Alok Gandhi

Oct 6, 2016, 1:02:10 AM
to python_in...@googlegroups.com
Generally speaking, binary compressed data would help in faster io. 

Just to expand a bit on this: I have successfully used modules like gzip (for compression/decompression) and struct (for reading/writing binary data) to write and read huge chunks of data. JSON, YAML etc. provide excellent functionality for 'structuring'/layering data, but when it comes to things like baked-out simulations, skin weights etc. (as in your case), which require fast and efficient I/O, I would not care for sophistication but would design my own custom binary file that is closer to the metal (with some kind of header for maintaining schema version info and misc metadata), and then lay out the bytes in binary format. Seek operations are much faster this way. Also, I presume that as you have weight info that needs to be applied once, the seeks would be mostly linear, making it even faster (or is it per-frame data samples?).
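
To make the idea concrete, a minimal sketch (the header layout below - vertex count and influence count followed by packed floats - is just an example, not a real production format):

import gzip
import struct

num_influences = 2
weights = [0.5, 0.5, 1.0, 0.0, 0.25, 0.75]  # stand-in: 3 vertices x 2 influences

# write: fixed-size header first, then the raw floats, all gzip-compressed
with gzip.open("weights.bin.gz", "wb") as f:
    f.write(struct.pack("<II", len(weights) // num_influences, num_influences))
    f.write(struct.pack("<%df" % len(weights), *weights))

# read: the header tells us exactly how many floats follow
with gzip.open("weights.bin.gz", "rb") as f:
    num_verts, num_influences = struct.unpack("<II", f.read(8))
    count = num_verts * num_influences
    restored = struct.unpack("<%df" % count, f.read(4 * count))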

Another advantage of this type of custom binary (library-agnostic) data is that, later on, a suite of faster compiled tools in C++, C#, C etc. can more easily be written to read/write/apply it in DCCs.

You can perhaps also use Alembic as a data container, as it supports arbitrary data and the whole tool ecosystem is available to you (with Python bindings).

I wrote a custom binary transform-cache format a few years back (before we had Alembic), with tools/plugins in C++ for Maya, Houdini and Softimage to do the I/O, but I also had some Python code to do the same thing when reading/writing was not time-critical.



fruity

Oct 6, 2016, 4:11:59 AM
to Python Programming for Autodesk Maya
Wow, I can see a lot of knowledge here, thanks a lot!
So, to clarify a bit, I already have a working prototype, that's why I'd like to optimize it now =] And I didn't mention it, but I'd also like to keep it editable outside of Maya (i.e. I'd go for a human-readable encoding). I didn't know zlib, but it looks very handy, thanks for the tip! But as it is compressed, I guess I can't export it in a readable way (and even if I could, I'd lose the whole benefit of using it). However, for other purposes it could be really great, I'll keep that in mind ^^

cPickle looks great as well, based on your test, Justin. It even looks faster than json; I'm surprised it isn't used more.
So from what I've read here, json seems to be the best option for me (I also want something native, so no msgpack or ujson), but still, I guess I'll have to run some tests to compare it to raw text with no module at all. The part I'm not sure of is the memory management when working with huge chunks of data, and I'm not even comfortable with how to measure and control it. I assume this comes for free when working with modules, but I'd have to handle it myself if I export my stuff without one?

Marcus Ottosson

Oct 6, 2016, 4:29:18 AM
to python_in...@googlegroups.com

cPickle looks great

If you want it to be human readable, you can’t encrypt, compress or otherwise obfuscate it, so both pickling and zlib are out. But think about that for a second. If you are talking about multi-megabytes/gigabytes worth of data, the mere quantity would make it uneditable regardless of the format.

The part I’m not sure of is the memory management when working with huge chunks of data, and I’m not even comfortable with how to measure and control it.

I’m not sure why you would worry about that, unless you’ve experienced some issues already? So long as you read the file into a variable, and that variable isn’t kept around globally, then garbage collection would handle this for you no matter which format you choose.

To get a memory leak, you’d really have to try.

import json

leak = list()

def read_file(fname):
  with open(fname) as f:
    data = json.load(f)
    leak.append(data)  # builds up over time

fruity

Oct 6, 2016, 5:13:55 AM
to Python Programming for Autodesk Maya
Actually, the file would be split into two parts: the first would be the info that I want the user to be able to modify (a couple of lines), the second would be a text of n lines (n being the number of vertices of a mesh, for instance) that the user will definitely not modify. So you're right, maybe I should split it into two files, the first in json or similar, the second encoded.
For the memory leak, yes, I did experience something similar during my last project (yes, in rigging too, it was nothing but fun on this project =p ), but with in-house tools. So I've seen it, but didn't cause it myself. That's why I'm concerned about it. And by memory leak, I'm also talking about memory management in general. For instance, you work on super heavy sets (like really super heavy sets... ;-) and you want to load some data attached to that set. By default, you'd have to load the entire data file, which would result in huge RAM consumption and ultimately, probably, a crash from lack of memory. So it wouldn't be a memory leak so to speak, but something you'd have to handle by reading your file chunk after chunk and flushing the memory after each iteration. I'm not sure any module would do that automatically, though.
I'll think about my problem differently and try to split it into two parts; that seems so obvious now that you've mentioned it! Thanks!

Marcus Ottosson

Oct 6, 2016, 5:31:51 AM
to python_in...@googlegroups.com

Cool :)

About reading a file in a little bit at a time - it’s not as difficult as it seems.

f = open("100gb.json")

At this point, you’ve opened a file of an imaginary 100 gigabytes. What is the effect on memory? 0. Nada.

Now, if you were to do this..

data = f.read()

You’d be in trouble.

Now you’ve told Python (and the OS) to go out there and bring back the entire contents of this file and put it in your variable data. That’s no good.

But there are other ways of reading from a file.

first_line = next(f)

Bam, you’ve opened a huge file, read the first line and stopped. No more data is read, memory is barely affected.

You can iterate this too.

for line in f:
  print(line)

It will read one line at a time, print it and throw the data away. Memory is barely affected.

The thing about various file formats is that some of them can’t be read like this; some of them won’t make sense until you’ve read the entire file.

For example, consider JSON.

{
  "key": "value"
}

That first line is meaningless. The second line too. For this file to make sense, you will need to read the entire file.

Some formats, including a variation of JSON, support “streaming”, which is what we did up there. So if you’re looking to conserve memory, you’ll have to add this criterion to your desired file format.

ps. Don’t forget to close the file, or use a context manager.

f.close()
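
Putting the two together - a context manager plus the one-object-per-line variation of JSON (a small sketch; the file name and keys are made up):

import json

# assumes every line of "weights.jsonl" is one complete JSON object,
# e.g. {"vtx": 0, "weights": [0.5, 0.5]}
with open("weights.jsonl") as f:
    for line in f:                  # one line at a time, memory stays flat
        record = json.loads(line)
        print(record["vtx"], record["weights"])
# the context manager closes the file for us when the block ends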


Marcus Ottosson

Oct 13, 2016, 8:12:20 AM
to python_in...@googlegroups.com

Hey @fruity, how did it go with this? Did you make any progress? :)

I came to think of another constraint, or method with which to do what you’re after - in regards to random access. That is, being able to query weights for any given vertex without (1) reading it all into memory and (2) physically searching for it.

There’s a file format called HDF5 which was designed for this purpose (and which has Python bindings as well). It’s written by the scientific community, but applies well to VFX in that they also deal with large, high-precision datasets (in this case, millions of vertices and floating-point weights). To give you some intuition for how it works, I asked a StackOverflow question about it a while back that compares it to a “filesystem in a file”, and it has some good discussion around it.

In more technical terms, you can think of it as Alembic. In fact Alembic is a “fork” of HDF5, which was later rewritten (i.e. “Ogawa“) but maintains (to my knowledge) the gist of how things are organised and accessed internally.

At the end of the day, it means you can store your weights in one of these hdf5 files and read them back either as you would any normal file (i.e. entirely into memory) or with random access - for example, if you’re only interested in applying weights to a selected area of a highly dense polygonal mesh. Or, if you have multiple “channels” or “versions” of weights within the same file (e.g. 50 GB of weights), you could pick one without requiring all that memory to be readily available.
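
For example, a minimal sketch with the h5py bindings (the file and dataset names are made up; h5py itself requires numpy):

import numpy
import h5py

weights = numpy.random.rand(1000000, 2)  # stand-in: 1M vertices, 2 influences

# write the whole array as one dataset inside the file
with h5py.File("weights.h5", "w") as f:
    f.create_dataset("skin/weights", data=weights)

# read back only a slice - only vertices 5000-5999 are pulled from disk
with h5py.File("weights.h5", "r") as f:
    chunk = f["skin/weights"][5000:6000]

print(chunk.shape)  # (1000, 2)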

fruity

Oct 13, 2016, 4:35:36 PM
to Python Programming for Autodesk Maya
Hi Marcus !

Thanks for your answer and help! Well, I'm still working on the optimisation; I used json for the readable info (influences, etc.) and cPickle for the weights array. But I think most of the optimisation should now come from how the values are exported from / imported onto the vertices.
Exporting is not that expensive (0.777652025223s for 39k vertices and 2 influences), but importing is still 4.7600607872s. It seems there are different ways of reading/writing weights, and it takes some time to try all of them! For now, it seems that skinPercent is definitely the worst idea (about 29sec for importing ^^), and I read that MFnSkinCluster is not necessarily the best option either, at least using getWeights() and setWeights() (http://www.macaronikazoo.com/?p=417). The fastest may be based on querying the values via API plugs.
Long story short, there are a lot of ways of doing it, so I need to try all of them, but I think the part I need to work on is more the Maya part than the 'data' part?
hdf5 looks great (I wish I could have a look at the book you mentioned on StackOverflow, too late now... ;-), but it's not native (because of the use of numpy?), unfortunately. I'm not really informed about Alembic's possibilities and what you can or can't do with it, but it's definitely something I want to investigate, it looks super powerful!
Thanks for the help!

fruity

Oct 19, 2016, 11:11:53 PM
to Python Programming for Autodesk Maya
Hi !

I realized that I didn't post the results of the tests I did, so here they are, in case they can help someone else!
Treating the Maya part (getting/setting the weights) and the IO part (writing/reading the file) separately, I tried:
Maya:
cmds.skinPercent
MFnSkinCluster.setWeights
get/setAttr (maya.cmds)
get/setAttr (MPlug)
I got the best results with get/setAttr via MPlug. I don't remember the number of vertices of my test mesh, but I guess it was the same as in my previous email (i.e. 39k), and it was around 1.4s for importing weights (which is the tricky part), against 4.7s with maya.cmds. This time includes the time to read the file (as I was interested in the entire operation, I didn't take time to format the code and separate my timers).
Exporting is roughly similar between cmds and MPlug (less than 1s for the same 10k mesh).
IO:
I finally went for a json dict, and the weights entry is serialized with cPickle. It seemed to be the fastest way, and it lets me keep everything in one file, easily editable and understandable!
There must be better options (e.g. using zlib or hdf5, although I'd like to keep something native); I'll look into that in more depth later!
Thanks a lot for your help, anyway!