(De)Serializing a list of expressions and memory usage

Saleh Jamali

Dec 2, 2024, 5:30:11 AM
to symengine
Hi there,
Greetings,

When I serialize a list of expressions (constructed with Add, Mul, Pow, FuncSymbols), save them to disk, wipe everything, and then load them back from the disk, the memory consumption skyrockets. 

I prepared a benchmark suite that measures in-use memory over time for each stage: https://github.com/salehjg/Symengine-benchmarks

Screenshot From 2024-12-02 11-22-50.png

Pay attention to "Total In-use". Compare the dark blue part, which is the memory in use while loading expressions from disk (with a 5-second sleep at the end to show the final memory in use), with the orange line, which is the memory needed to generate the expressions and keep them all in RAM. This is for 1024 expressions. For 1M expressions, things get a LOT worse.

I understand that SymEngine uses a customized Cereal implementation to serialize smart pointers to SymEngine class instances. My initial guess is that serializing and deserializing expressions individually causes instances shared between expressions to be serialized repeatedly.
For example, if I have the two expressions E1 and E2 below and serialize them individually (not in the same Cereal archive), the symbols B and C will each be duplicated after loading (B0, B1 and C0, C1), even though in RAM, before serialization, both expressions point to the single instances B0 and C0.

E1 = A + B + C
E2 = D + B + C

Do I have a correct intuition here?
Maybe we can introduce a method to take a vec_basic as input and serialize it all in a single Cereal binary archive, returning the binary data as std::string?
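
To make the hypothesis concrete, here is a minimal standard-library sketch of a serializer that tracks pointers per archive. It is a toy model only: `Sym`, `Expr`, `Archive`, `save`, `load`, and `distinct_objects` are hypothetical names invented for illustration, and it does not show how SymEngine or Cereal actually behave.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <set>
#include <string>
#include <vector>

// Toy stand-ins (hypothetical, not SymEngine types): a symbol, and an
// expression modelled as a flat list of shared symbols.
struct Sym { std::string name; };
using SymPtr = std::shared_ptr<const Sym>;
using Expr = std::vector<SymPtr>;

// Toy archive: each distinct object's name is stored once; every expression
// is a list of indices into that name table.
struct Archive {
    std::vector<std::string> names;
    std::vector<std::vector<std::size_t>> exprs;
};

// Save all expressions into ONE archive. The pointer table lives only for
// the duration of this call, so sharing is preserved per archive, not globally.
Archive save(const std::vector<Expr> &es) {
    Archive ar;
    std::map<const Sym *, std::size_t> seen; // object address -> stored index
    for (const Expr &e : es) {
        std::vector<std::size_t> refs;
        for (const SymPtr &s : e) {
            auto it = seen.find(s.get());
            if (it == seen.end()) {
                it = seen.emplace(s.get(), ar.names.size()).first;
                ar.names.push_back(s->name);
            }
            refs.push_back(it->second);
        }
        ar.exprs.push_back(refs);
    }
    return ar;
}

// Load: create one fresh object per stored name; all references inside
// this archive share those objects.
std::vector<Expr> load(const Archive &ar) {
    std::vector<SymPtr> objs;
    for (const std::string &n : ar.names)
        objs.push_back(std::make_shared<Sym>(Sym{n}));
    std::vector<Expr> out;
    for (const auto &refs : ar.exprs) {
        Expr e;
        for (std::size_t i : refs) {
            assert(i < objs.size());
            e.push_back(objs[i]);
        }
        out.push_back(e);
    }
    return out;
}

// Count distinct Sym objects (by address) across a set of expressions.
std::size_t distinct_objects(const std::vector<Expr> &es) {
    std::set<const Sym *> addrs;
    for (const Expr &e : es)
        for (const SymPtr &s : e) addrs.insert(s.get());
    return addrs.size();
}
```

In this model, saving E1 and E2 into one archive and loading them back yields 4 distinct symbol objects (A, B, C, D), while saving each into its own archive yields 6, because B and C are recreated once per archive.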

** Benchmarks are built against SymEngine 0.13 with G++ 14.2, -O3, on Arch Linux.

Saleh Jamali

Dec 11, 2024, 5:29:24 AM
to symengine
It seems that my assumption was correct.
When the same symbol appears in terms of more than one expression and these expressions are serialized using `Basic::dumps()`, the shared symbols are serialized multiple times (which is no surprise, as each expr has its own Cereal archive).
I wrote a visitor class to visit all symbols in expressions of a vec_basic. This visitor searches for symbols that have the same name but not the same object address in memory.
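
The check itself can be sketched with the standard library alone. This is not the visitor class from the benchmark repository; `Sym`, `Expr`, `Duplicate`, and `find_duplicates` are hypothetical names, and an expression is simplified to a flat list of symbol pointers rather than a tree that has to be visited.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins, not SymEngine types.
struct Sym { std::string name; };
using SymPtr = std::shared_ptr<const Sym>;
using Expr = std::vector<SymPtr>; // flat list of the symbols in an expression

// One report: a name seen at two different addresses.
struct Duplicate {
    std::string name;
    const Sym *first;
    const Sym *second;
};

// Remember the first address seen for each symbol name and report every
// later occurrence of that name that lives at a different address.
std::vector<Duplicate> find_duplicates(const std::vector<Expr> &es) {
    std::map<std::string, const Sym *> first_seen;
    std::vector<Duplicate> dups;
    for (const Expr &e : es) {
        for (const SymPtr &s : e) {
            assert(s != nullptr);
            auto res = first_seen.emplace(s->name, s.get());
            const Sym *known = res.first->second;
            if (!res.second && known != s.get())
                dups.push_back({s->name, known, s.get()});
        }
    }
    return dups;
}
```

Two expressions that share one `b` object produce no reports, while two expressions holding separate `b` objects produce one report carrying both addresses, which is the shape of the log entries in this message.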

Here is the log showing that, after deserializing, there are many duplicate instances of the Symbol class for mathematically identical symbols in the expressions of the vec_basic:

```
cmake-build-release/src/benchmarks/bench01/bench01_main
==============================================
*** Benchmark bench01 created
*** Benchmark bench01 started
*** Benchmark bench01 preparation finished
Generating 16 expressions of length 2048 and power 5
######################  Current memory usage: 0.08 GB, Allocated: 0.00 GB, In-use: 0.00 GB, Ext-fragmantation: 1.00  ****
######################  Current memory usage: 0.16 GB, Allocated: 0.01 GB, In-use: 0.01 GB, Ext-fragmantation: 1.00  ****
######################  Current memory usage: 0.16 GB, Allocated: 0.01 GB, In-use: 0.01 GB, Ext-fragmantation: 1.00  ****
######################  Current memory usage: 0.17 GB, Allocated: 0.02 GB, In-use: 0.02 GB, Ext-fragmantation: 1.00  ****
Saving the exprs onto the disk.
expr_0 size: 359099 has been saved
expr_1 size: 359774 has been saved
expr_2 size: 359849 has been saved
expr_3 size: 359174 has been saved
expr_4 size: 357899 has been saved
expr_5 size: 359399 has been saved
expr_6 size: 359249 has been saved
expr_7 size: 360974 has been saved
expr_8 size: 361499 has been saved
expr_9 size: 360074 has been saved
expr_10 size: 361049 has been saved
expr_11 size: 360074 has been saved
expr_12 size: 357974 has been saved
expr_13 size: 356924 has been saved
expr_14 size: 357899 has been saved
expr_15 size: 357749 has been saved
Checking for duplicates (symbols)
Total number of symbols registered: 6144
Number of duplicate symbols found: 0
Wiping everything
Loading the exprs from the disk.
Loaded expr_0
Loaded expr_1
Loaded expr_2
Loaded expr_3
######################  Current memory usage: 0.17 GB, Allocated: 0.02 GB, In-use: 0.00 GB, Ext-fragmantation: 1.00  ****
Loaded expr_4
Loaded expr_5
Loaded expr_6
Loaded expr_7
Loaded expr_8
Loaded expr_9
Loaded expr_10
Loaded expr_11
Loaded expr_12
Loaded expr_13
Loaded expr_14
Loaded expr_15
######################  Current memory usage: 0.19 GB, Allocated: 0.04 GB, In-use: 0.04 GB, Ext-fragmantation: 1.00  ****
######################  Current memory usage: 0.19 GB, Allocated: 0.04 GB, In-use: 0.04 GB, Ext-fragmantation: 1.00  ****
######################  Current memory usage: 0.19 GB, Allocated: 0.04 GB, In-use: 0.04 GB, Ext-fragmantation: 1.00  ****
######################  Current memory usage: 0.19 GB, Allocated: 0.04 GB, In-use: 0.04 GB, Ext-fragmantation: 1.00  ****
######################  Current memory usage: 0.19 GB, Allocated: 0.04 GB, In-use: 0.04 GB, Ext-fragmantation: 1.00  ****
######################  Current memory usage: 0.19 GB, Allocated: 0.04 GB, In-use: 0.04 GB, Ext-fragmantation: 1.00  ****
######################  Current memory usage: 0.19 GB, Allocated: 0.04 GB, In-use: 0.04 GB, Ext-fragmantation: 1.00  ****
######################  Current memory usage: 0.19 GB, Allocated: 0.04 GB, In-use: 0.04 GB, Ext-fragmantation: 1.00  ****
######################  Current memory usage: 0.19 GB, Allocated: 0.04 GB, In-use: 0.04 GB, Ext-fragmantation: 1.00  ****
######################  Current memory usage: 0.19 GB, Allocated: 0.04 GB, In-use: 0.04 GB, Ext-fragmantation: 1.00  ****
Checking for duplicates (symbols) again
Duplicate symbol found: a_1702
|___> Address 1: 0x62c5bf5f9700, Address 2: 0x62c5c0606d50
Duplicate symbol found: b_1702
|___> Address 1: 0x62c5bf5f9650, Address 2: 0x62c5c0606bd0
Duplicate symbol found: c_1701
|___> Address 1: 0x62c5bf5f9580, Address 2: 0x62c5c0607240
Duplicate symbol found: a_302
|___> Address 1: 0x62c5bf5f94d0, Address 2: 0x62c5bf5aa0e0
Duplicate symbol found: a_1698
|___> Address 1: 0x62c5bf5f9400, Address 2: 0x62c5c0608290
Duplicate symbol found: a_1078
|___> Address 1: 0x62c5bf5f9350, Address 2: 0x62c5bf72fb50
Duplicate symbol found: b_1078
|___> Address 1: 0x62c5bf5f9280, Address 2: 0x62c5bf72f9d0
Duplicate symbol found: c_1078
|___> Address 1: 0x62c5bf5f97d0, Address 2: 0x62c5bf72f7e0
Duplicate symbol found: a_295
|___> Address 1: 0x62c5bf5f9d50, Address 2: 0x62c5c06c1d80
Duplicate symbol found: b_295
|___> Address 1: 0x62c5bf5f9c80, Address 2: 0x62c5c06c0840
Duplicate symbol found: c_295
|___> Address 1: 0x62c5bf5f9bd0, Address 2: 0x62c5c06b7640
Duplicate symbol found: b_1695
|___> Address 1: 0x62c5bf5f9a90, Address 2: 0x62c5bf781ca0
Duplicate symbol found: c_1695
|___> Address 1: 0x62c5bf5f99e0, Address 2: 0x62c5bf781ab0
Duplicate symbol found: a_1533
|___> Address 1: 0x62c5bf5f9940, Address 2: 0x62c5bf65cd30
Duplicate symbol found: b_1533
|___> Address 1: 0x62c5bf5f9880, Address 2: 0x62c5bf7f3120
Duplicate symbol found: c_1533
|___> Address 1: 0x62c5bf5f9e00, Address 2: 0x62c5c0567000
Duplicate symbol found: a_1506
|___> Address 1: 0x62c5bf5fa2f0, Address 2: 0x62c5bf7e6960
Duplicate symbol found: c_1506
|___> Address 1: 0x62c5bf5fa240, Address 2: 0x62c5bf7e67e0
Duplicate symbol found: b_1506
|___> Address 1: 0x62c5bf5fa1a0, Address 2: 0x62c5bf7e65f0
Duplicate symbol found: a_1691
|___> Address 1: 0x62c5bf5fa0e0, Address 2: 0x62c5bf803240
Duplicate symbol found: b_1691
|___> Address 1: 0x62c5bf5fa030, Address 2: 0x62c5bf8030c0
Duplicate symbol found: c_1691
|___> Address 1: 0x62c5bf5f9fa0, Address 2: 0x62c5bf802ed0
Duplicate symbol found: a_1687
|___> Address 1: 0x62c5bf5f9ef0, Address 2: 0x62c5bf8281c0
Duplicate symbol found: b_1687
|___> Address 1: 0x62c5bf5fa430, Address 2: 0x62c5bf828040
Duplicate symbol found: c_1687
|___> Address 1: 0x62c5bf5fa940, Address 2: 0x62c5bf827e50
Duplicate symbol found: b_1686
|___> Address 1: 0x62c5bf5fa890, Address 2: 0x62c5bf8277e0
Duplicate symbol found: a_1680
|___> Address 1: 0x62c5bf5fa800, Address 2: 0x62c5bf8265a0
Duplicate symbol found: b_1680
|___> Address 1: 0x62c5bf5fa750, Address 2: 0x62c5bf826420
Duplicate symbol found: c_1680
|___> Address 1: 0x62c5bf5fa660, Address 2: 0x62c5bf826230
Duplicate symbol found: a_1679
|___> Address 1: 0x62c5bf5fa5b0, Address 2: 0x62c5bf733090
Duplicate symbol found: a_1678
|___> Address 1: 0x62c5bf5fa4e0, Address 2: 0x62c5bf631bc0
Duplicate symbol found: b_1678
|___> Address 1: 0x62c5bf5faa00, Address 2: 0x62c5bf631a40
Duplicate symbol found: c_1678
|___> Address 1: 0x62c5bf5fafb0, Address 2: 0x62c5bf631850
Duplicate symbol found: a_668
|___> Address 1: 0x62c5bf5faec0, Address 2: 0x62c5bf5eff20
Duplicate symbol found: b_668
|___> Address 1: 0x62c5bf5fae10, Address 2: 0x62c5bf5efda0
Duplicate symbol found: c_668
|___> Address 1: 0x62c5bf5fad40, Address 2: 0x62c5bf5efbb0
...
...
...
```

Isuru Fernando

Dec 11, 2024, 5:47:49 AM
to syme...@googlegroups.com
Hi Saleh,

Yes, you are right that if the same symbol is used in different expressions, new symbols are created.

It should be easy to add a method to serialize a vec_basic or a DenseMatrix. Here's a PR that serializes a DenseMatrix:
https://github.com/symengine/symengine/pull/2069

Isuru


Saleh Jamali

Dec 11, 2024, 5:55:22 AM
to symengine
Thank you for your kind reply.
I am experimenting, but it seems that even with a single Cereal archive for all expressions in a vec_basic, I still get duplicates:
```
// Serialize every expression in v into one Cereal archive, prefixed by the
// SymEngine version and the number of expressions.
std::string Basic::dumps0(const vec_basic &v)
{
    std::ostringstream oss;
    unsigned short major = SYMENGINE_MAJOR_VERSION;
    unsigned short minor = SYMENGINE_MINOR_VERSION;
    size_t vec_len = v.size();
    cereal::PortableBinaryOutputArchive ser(oss);

    // Header: version and element count.
    ser(major, minor, vec_len);
    for (size_t i = 0; i < vec_len; i++) {
        ser(v[i]);
    }
    return oss.str();
}

// Inverse of dumps0: read the header, then read each expression back from
// the same archive.
vec_basic Basic::loads_vec0(const std::string &str)
{
    unsigned short major, minor;
    size_t vec_len;
    vec_basic v;
    std::istringstream iss(str);
    cereal::PortableBinaryInputArchive iarchive{iss};
    iarchive(major, minor, vec_len);
    // Reject data written by a different SymEngine major/minor version.
    if (major != SYMENGINE_MAJOR_VERSION or minor != SYMENGINE_MINOR_VERSION) {
        throw SerializationError(StreamFmt()
                                 << "SymEngine-" << SYMENGINE_MAJOR_VERSION
                                 << "." << SYMENGINE_MINOR_VERSION
                                 << " was asked to deserialize an object "
                                 << "created using SymEngine-" << major << "."
                                 << minor << ".");
    }
    // An empty vector is treated as an error here.
    if (vec_len == 0) {
        throw SerializationError("Cannot deserialize an empty vector.");
    }

    for (size_t i = 0; i < vec_len; ++i) {
        RCP<const Basic> p;
        iarchive(p);
        v.push_back(p);
    }
    return v;
}
```

I specifically checked to see if we have duplicates in a single expr and that's NOT the case.
Any advice?

Isuru Fernando

Dec 12, 2024, 7:14:53 AM
to syme...@googlegroups.com

Saleh Jamali

Dec 14, 2024, 3:18:04 PM
to symengine
Thank you so much.
This PR solves the memory usage problem completely!

I used the vec_basic versions of dumps() and loads() that I posted above to run the benchmark shown in the screenshot below.
Also, the symbol visitor can no longer find any duplicate symbols. I assume there are no duplicate sub-expressions either, since the memory in use stays the same before and after serialization/deserialization.

Again, thank you!

Screenshot From 2024-12-14 13-41-13.png