Before getting into how to make something fast, I must point out that your journey towards performance is meaningless without having something to run. Your first port of call should be to build your tool in the simplest, most straightforward and obvious way possible, and only then start to look at how to make it faster.
Now, when it comes to speed, reading and writing are generally in conflict with each other: the faster it writes, the slower it reads.
The fastest thing to write is also the smallest - that could mean compressing your data, for example.
import zlib
data = b"My very long string, "
compressed = zlib.compress(data)
At this point, writing is at peak speed, dependent only on the cost of your chosen compression method and the amount of content you compress.
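For completeness, getting those compressed bytes to disk is then a one-liner; a minimal sketch, with data.bin as a placeholder name.
import zlib
data = b"My very long string, "
compressed = zlib.compress(data)
# Write the compressed bytes straight to disk
with open("data.bin", "wb") as f:
    f.write(compressed)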
But reading suffers.
The fastest thing to read is also the smallest - but that then means decompressing your data before you can use it.
import zlib
data = b"My very long string, "
compressed = zlib.compress(data)
decompressed = zlib.decompress(compressed)
assert decompressed == data
If you measure compressed against decompressed, you’ll find that the compressed data is actually larger than the original.
import zlib
data = b"My very long string, "
compressed = zlib.compress(data)
decompressed = zlib.decompress(compressed)
assert len(compressed) > len(decompressed)
What gives? Because the data is so small, the added overhead of the compression method outweighs the benefits. Compression - with zlib, and likely other algorithms too - is most effective on large, repetitive data.
import zlib
data = b"My very long string, " * 10000
compressed = zlib.compress(data)
decompressed = zlib.decompress(compressed)
assert len(compressed) < len(decompressed) # Notice the flipped comparison sign
To get a sense of the savings made, you could try something like this.
import sys
import zlib
data = b"My very long string, " * 10000
compressed = zlib.compress(data)
decompressed = zlib.decompress(compressed)
assert decompressed == data
original_size = sys.getsizeof(data)
compressed_size = sys.getsizeof(compressed)
print("original: %.2f kb\ncompressed: %.2f kb\n= %i times smaller" % (
original_size / 1024.0,
compressed_size / 1024.0,
original_size / compressed_size)
)
Which on my system (Windows 10, Python 3.3 x64) prints:
original: 205.11 kb
compressed: 0.58 kb
= 356 times smaller
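If you’d like to measure the speed trade-off itself, and not just the size, you could time both paths. A rough sketch - weights.bin is just a placeholder name, and real numbers will depend on your disk and data.
import time
import zlib
data = b"My very long string, " * 10000
# Time the write path: compress, then write to disk
t0 = time.perf_counter()
with open("weights.bin", "wb") as f:
    f.write(zlib.compress(data))
write_time = time.perf_counter() - t0
# Time the read path: read from disk, then decompress
t0 = time.perf_counter()
with open("weights.bin", "rb") as f:
    restored = zlib.decompress(f.read())
read_time = time.perf_counter() - t0
assert restored == data
print("write: %.4f s, read: %.4f s" % (write_time, read_time))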
Now which is it, do you need it to read fast, or write fast? :)
Best,
Marcus
Generally speaking, binary compressed data would help in faster io.
On Thu, 6 Oct 2016, 2:25 AM Alok Gandhi <alok.ga...@gmail.com> wrote:
Generally speaking, binary compressed data would help in faster io.
Yes, agreed. Using something like zlib, as Marcus suggested, would actually mean more work, because first you have to serialise your Python objects to strings, and then zlib-encode them. While you may save disk space on the resulting string data, you now need to zlib-decompress before marshaling the plain strings back into objects again.
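To make that concrete, here’s a rough sketch of the extra round trip involved; pickle stands in for whichever serialiser you use, and the example weights are made up.
import zlib
import pickle
weights = {"vtx[0]": 0.5, "vtx[1]": 0.25}  # stand-in for your real data
# Writing: serialise first, then compress
blob = zlib.compress(pickle.dumps(weights))
# Reading: decompress first, then deserialise
restored = pickle.loads(zlib.decompress(blob))
assert restored == weights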
@fruity, I think you’ll need to clarify whether you are looking for advice on serialisation/deserialisation or on reading/writing from disk. You can have something that serialises quickly but writes slowly, and something that serialises slowly but writes quickly.
For your tool, you’ll need to determine where the bottleneck is, and focus on that. Perhaps it will be in serialising the data to JSON. Perhaps it will be in writing, because you are writing to the cloud. The best advice will differ depending on which you are having problems with.
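A rough way to find out is to time the two steps separately. A sketch under those assumptions - json as the serialiser, weights.json as a placeholder path, and a made-up payload:
import json
import time
data = {"vtx[%d]" % i: 0.5 for i in range(100000)}  # stand-in payload
t0 = time.perf_counter()
serialised = json.dumps(data)  # cost of serialisation
t1 = time.perf_counter()
with open("weights.json", "w") as f:
    f.write(serialised)  # cost of the actual write
t2 = time.perf_counter()
print("serialise: %.4f s, write: %.4f s" % (t1 - t0, t2 - t1))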
cPickle looks great
If you want it to be human readable, you can’t encrypt, compress or otherwise obfuscate it, so both pickling and zlib are out. But think about that for a second: if you are talking about multiple megabytes or gigabytes worth of data, the mere quantity would make it uneditable regardless of the format.
The part that I’m not sure of is the memory management when working with huge chunks of data, and I’m not even comfortable with how to measure and control it.
I’m not sure why you would worry about that, unless you’ve experienced some issues already? So long as you read the file into a variable, and that variable isn’t kept around globally, then garbage collection would handle this for you no matter which format you choose.
To get a memory leak, you’d really have to try.
import json

leak = list()

def read_file(fname):
    with open(fname) as f:
        data = json.load(f)
    leak.append(data)  # builds up over time, never collected
Cool :)
About reading in a file, a little bit at a time, it’s not as difficult as it seems.
f = open("100gb.json")
At this point, you’ve opened a file of an imaginary 100 gigabytes. What is the effect on memory? 0. Nada.
Now, if you were to do this..
data = f.read()
You’d be in trouble.
Now you’ve told Python (and the OS) to go out there and bring back the entire contents of this file and put it in your variable data. That’s no good.
But there are other ways of reading from a file.
first_line = next(f)
Bam, you’ve opened a huge file, read the first line and stopped. No more data is read, memory is barely affected.
You can iterate this too.
for line in f:
    print(line)
It will read one line at a time, print it and throw the data away. Memory is barely affected.
The thing about various file formats, is that some of them can’t be read like this. Some of them won’t make sense until you’ve read the entire file.
For example, consider JSON.
{
    "key": "value"
}
That first line is meaningless. The second line too. For this file to make sense, you will need to read the entire file.
Some formats, including a variation of JSON, support “streaming”, which is what we did up there. So if you’re looking to conserve memory, you’ll have to add this criterion to your desired file format.
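One common streaming-friendly variation is line-delimited JSON, where each line is a complete document. A minimal sketch, with weights.jsonl as a made-up file name:
import json
# Writing: one complete JSON document per line
with open("weights.jsonl", "w") as f:
    for i in range(3):
        f.write(json.dumps({"vertex": i, "weight": 0.5}) + "\n")
# Reading: one line - and therefore one document - at a time
with open("weights.jsonl") as f:
    for line in f:
        entry = json.loads(line)
        print(entry["vertex"], entry["weight"])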
ps. Don’t forget to close the file, or use a context manager.
f.close()
Hey @fruity, how did it go with this? Did you make any progress? :)
I came to think of another method to do what you’re after - in regards to random access. That is, being able to query weights for any given vertex without (1) reading it all into memory and (2) physically searching for it.
There’s a file format called HDF5 which was designed for this purpose (and which has Python bindings as well). It comes out of the scientific community, but applies well to VFX, in that both deal with large, high-precision datasets (in this case, millions of vertices and floating-point weights). To give you some intuition for how it works, I formulated a StackOverflow question about it a while back, comparing it to a “filesystem in a file”, which has some good discussion around it.
In more technical terms, you can think of it like Alembic. In fact, Alembic is a “fork” of HDF5 that was later rewritten (i.e. “Ogawa”) but maintains (to my knowledge) the gist of how things are organised and accessed internally.
At the end of the day, it means you can store the results of your weights in one of these HDF5 files and read them back either as you would any normal file (i.e. entirely into memory) or via random access - for example, if you’re only interested in applying weights to a selected area of a highly dense polygonal mesh. Or, if you have multiple “channels” or “versions” of weights within the same file (e.g. 50 gb of weights), you could pick one without requiring all that memory to be readily available.
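To give a flavour of what that looks like in code, here’s a minimal sketch using the h5py bindings; the file name, dataset path and weight values are all made up.
import h5py
# Writing: store a large weight array as a named dataset
with h5py.File("weights.h5", "w") as f:
    f.create_dataset("weights/head", data=[0.5] * 1000000)
# Reading: slice out just the vertices you care about;
# only that region is pulled into memory
with h5py.File("weights.h5", "r") as f:
    subset = f["weights/head"][100:200]
    print(len(subset))  # 100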