DataFrame Performance


Sean DeNigris

Apr 11, 2022, 10:19:36 AM
to PolyMath
I am wondering about the performance of `DataFrame`. I recently needed value counts for a DataSeries with about 8,000 rows.

In GT, I did:
```smalltalk
phoneSeries := aDataFrame column: 'col_name'.
phoneCounts := phoneSeries valueCounts
```
It took enough time to get a cup of coffee - about 3 minutes.

As a sanity check, I then did in Python:
```python
import pandas as pd
#...
df['col_name'].value_counts()
```
which evaluated near-instantly.
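For a sense of scale: value counts is conceptually a single pass over the column with a hash table, so 8,000 rows is a small job. Below is a minimal sketch in plain Python, using only the standard library. The `value_counts` helper and the sample phone numbers are made up for illustration, and the sketch says nothing about how DataFrame or pandas implement the operation internally.

```python
from collections import Counter


def value_counts(values):
    """Return (value, count) pairs, most frequent first."""
    counts = Counter(values)      # one pass over the data, hash-table lookups
    return counts.most_common()   # sorted by count, descending


# Hypothetical data standing in for an 8,000-row column.
phones = ["555-0100", "555-0101", "555-0100", "555-0102"] * 2000
phone_counts = value_counts(phones)
print(phone_counts[0])  # ('555-0100', 4000)
```

If a value-counts call on a column this size takes minutes, the cost is probably somewhere other than the counting itself, though only profiling the Pharo code would confirm that.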

Of course, the next steps in Pharo were much easier/faster due to the live, dynamic environment, but the difference was a bit startling.

Since I am not a data science professional, I wanted to discuss this with the community to better understand the situation.

Offray Vladimir Luna Cárdenas

Apr 11, 2022, 11:23:35 AM
to polymath...@googlegroups.com

Hi Sean,

I can't speak to DataFrame performance in particular. But recently, for TiddlyWikiPharo[1], we needed to read the data store inside the HTML TiddlyWiki. We built a fluent prototype of it in Pharo/GT, but when performance was needed, we replaced the data reader/writer with a custom Nim[2] implementation and got 10 to 15 times faster performance. The debugging data story notes (in Spanish) are at [3].

[1] https://code.tupale.co/Offray/TiddlyWikiPharo/
[2] https://nim-lang.org/
[3] http://mutabit.com/repos.fossil/mutabit/doc/trunk/wiki/es/notas-offray--4p69o.md.html

In my experience, once the amount of data to read/serialize outside the image is no longer small, these hybrid approaches of quick prototyping in Pharo and (de)serialization in faster languages (like the pretty cool Nim) seem like the best way to balance developer time against machine execution time... Hmm, I wonder what the future holds for such hybridizations, like writing in Pharo and exporting to Nim :-).

Cheers,

Offray

Sean DeNigris

Apr 19, 2022, 12:33:48 PM
to PolyMath
Cross-posting to the Google Group and the Discord channel. I'm not sure which is more appropriate, so feedback on that is welcome...

Is the following anything like what you had in mind? I'm experimenting with keeping frames and series on the Python side until they are needed on the Pharo side, so I get all the speed until I need Pharo's special magical powers.

For example, I create an object which holds onto the name of a Python variable that lives on the Python side of the bridge:
```smalltalk
PyDataFrame>>#readFromCsv: file
   
    | pyVarName |
    pyVarName := PBApplication uniqueInstance newCommandSourceFactory
        source: 'import pandas as pd
pd.read_csv(file)';
        useRandomVariableName;
        bindings: {
            #file -> file fullName };
        sendAndWait;
        resultVariableName.
       
    ^ self new
        pythonVariableName: pyVarName;
        yourself
```
Here is an example of using that object as the target of a Python command:
```smalltalk
PyDataFrame>>#at: index
    | cmd |
    cmd := PBApplication uniqueInstance newCommandSourceFactory
        template: '{self}[{index}]'
        format: {
            #self -> self pythonVariableName.
            #index -> index printString }.
    ^ CadSeries fromCommand: cmd
```

And here another such object is converted to a "real" PolyMath DataSeries object:
```smalltalk
PyDataSeries>>#asDataSeries
    ^ DataSeries newFrom: self toDictionary associations
```
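
The pattern in the snippets above (a local handle that stores only a remote variable name, builds commands from templates, and copies data across only on request) can be sketched in plain Python. Everything here is hypothetical: `FakeRemote` simulates the far side of a bridge inside the same process using `eval`, and `RemoteHandle`, `run`, `at`, and `materialize` are illustration names, not part of PythonBridge or PolyMath.

```python
import uuid


class FakeRemote:
    """Stand-in for the far side of a bridge: a namespace of named objects."""

    def __init__(self):
        self.namespace = {}

    def run(self, source, bindings=None):
        """Evaluate an expression, keep the result remotely, return its variable name."""
        name = "v_" + uuid.uuid4().hex          # random variable name
        env = dict(self.namespace, **(bindings or {}))
        self.namespace[name] = eval(source, env)  # simulation only
        return name

    def fetch(self, name):
        """Copy the named object back to the caller."""
        return self.namespace[name]


class RemoteHandle:
    """Local proxy that holds only the name of a remote variable."""

    def __init__(self, remote, var_name):
        self.remote = remote
        self.var_name = var_name

    def at(self, index):
        # Same idea as the '{self}[{index}]' template: only names and
        # printed literals are sent, and the result stays remote.
        source = "{self}[{index}]".format(self=self.var_name, index=repr(index))
        return RemoteHandle(self.remote, self.remote.run(source))

    def materialize(self):
        # The only step that moves actual data to the local side.
        return self.remote.fetch(self.var_name)


remote = FakeRemote()
table = RemoteHandle(remote, remote.run("{'phone': ['555-0100', '555-0101']}"))
column = table.at("phone")       # still remote; only a name came back
print(column.materialize())      # ['555-0100', '555-0101']
```

The design choice this illustrates is that chained operations cost one small command each, and the expensive copy happens once, at the point where a local object is actually needed.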

Offray Vladimir Luna Cárdenas

Apr 29, 2022, 6:14:39 PM
to polymath...@googlegroups.com

Hi Sean,

Sorry for my late response. I'm on a tight deadline for a data narratives project (I hope to share it soon with the broader Pharo community).

Yes, what you're doing is similar to what we're doing, except that we use Nim instead of Python, so there is no bridge. We just use Nim for quick HTML parsing and data serialization from/to JSON, and after that we keep the data as dynamic objects on the Pharo side. So we use Nim when we hit a performance problem in data reading and (de)serialization from/to JSON and the file system, and all the rest is done in Pharo.

Cheers,

Offray
