Support both Python2 and Python3 using virtualenv for plpython


Yandong Yao

May 28, 2018, 3:56:10 AM
to Greenplum Developers
Hi Community,

PostgreSQL's PL/Python has supported both Python 2 and Python 3 since 9.0, while Greenplum supports only Python 2 for PL/Python because many Greenplum tools, such as gpstart and gpinitsystem, use Python 2.7.

This proposal supports both plpython2u (plpythonu) and plpython3u in the same cluster. (Not in the same session. As pointed out in https://www.postgresql.org/docs/9.1/static/plpython-python23.html: "It is not allowed to use PL/Python based on Python 2 and PL/Python based on Python 3 in the same session, because the symbols in the dynamic modules would clash, which could result in crashes of the PostgreSQL server process.")

Step 1: build plpython3.so

The first step is to build plpython3.so using Python 3 (PR #5052). With this PR, we can build plpython3.so even if the system Python is 2.7:

    PYTHON=/path/to/python3 ./configure --with-python ...

Step 2: Activate Python 3 for plpython3u

virtualenv has an excellent 'activate_this.py' script that activates a specific Python environment for the running process. By using it, we can activate Python 3 for plpython3u on both the master and the segments. (Greenplum uses Python 2.7 by default, so there is no need to activate Python 2, though we could if we wanted to.)

We defined two UDFs, activate_python2() and activate_python3(), to activate Python 2 and Python 3 respectively. (They need to be enhanced to avoid the hardcoded paths.)

TODO: execute those UDFs on the master and all segments. This is trivial to do by leveraging 'EXECUTE ON MASTER' and 'EXECUTE ON ALL SEGMENTS'.

    CREATE OR REPLACE FUNCTION activate_python2()
      RETURNS text
    AS $$
      activate_this_file = "/home/gpadmin/.venv/python2/bin/activate_this.py"
      execfile(activate_this_file, dict(__file__=activate_this_file))
      return "succeed"
    $$ LANGUAGE plpythonu;

    CREATE OR REPLACE FUNCTION activate_python3()
      RETURNS text
    AS $$
      def execfile(filepath, globals=None, locals=None):
          '''execfile was removed in Python 3, so redefine it'''
          if globals is None:
              globals = {}
          globals.update({
              "__file__": filepath,
              "__name__": "__main__",
          })
          with open(filepath, 'rb') as file:
              exec(compile(file.read(), filepath, 'exec'), globals, locals)

      activate_this_file = "/home/gpadmin/.venv/python3/bin/activate_this.py"
      execfile(activate_this_file, dict(__file__=activate_this_file))
      return "succeed"
    $$ LANGUAGE plpython3u;
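
The execfile() replacement inside activate_python3() can be tried outside the database. The following is a minimal sketch, not the code from the PR: run_file and the fake_activate.py script are hypothetical names, and the script only records its own path instead of touching sys.path the way a real activate_this.py would.

```python
import os
import tempfile


def run_file(filepath, globals_dict=None):
    """Python 3 stand-in for execfile(): compile and execute a file."""
    if globals_dict is None:
        globals_dict = {}
    # Scripts such as activate_this.py rely on __file__ to locate themselves.
    globals_dict.update({"__file__": filepath, "__name__": "__main__"})
    with open(filepath, "rb") as source:
        exec(compile(source.read(), filepath, "exec"), globals_dict)
    return globals_dict


# Hypothetical stand-in for activate_this.py: it only records its own path.
with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "fake_activate.py")
    with open(script, "w") as handle:
        handle.write("activated_from = __file__\n")
    result = run_file(script)
    activated = result["activated_from"] == script
```

Returning the globals dictionary is only for inspection here; the UDF above discards it because the side effects on the interpreter are what matter.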

Step 3: Verification

gpadmin=# Create extension plpython3u;

gpadmin=# SELECT * from activate_python3();
 activate_python3
------------------
 succeed
(1 row)

CREATE OR REPLACE FUNCTION python3version ()
       RETURNS text
AS $$
       import sys
       return sys.version
$$ LANGUAGE plpython3u;


gpadmin=# SELECT * from python3version();
             python3version
-----------------------------------------
 3.6.5 (default, Apr 10 2018, 17:08:37) +
 [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)]
(1 row)
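
Besides sys.version, it can help to check whether the interpreter really runs inside the virtualenv. This is a hedged sketch of a body that could be wrapped in a plpython3u function; in_virtualenv is a hypothetical helper, and it assumes that activate_this.py sets sys.real_prefix, which is how the virtualenv releases I know of behave.

```python
import sys


def in_virtualenv():
    """Return True if the running interpreter looks like a virtualenv."""
    # Assumption: virtualenv's activate_this.py sets sys.real_prefix, while
    # the standard-library venv makes sys.base_prefix differ from sys.prefix.
    if hasattr(sys, "real_prefix"):
        return True
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


report = {
    "version": sys.version_info[:3],
    "prefix": sys.prefix,
    "in_virtualenv": in_virtualenv(),
}
```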


gpadmin=#    DROP TYPE IF EXISTS named_value;
gpadmin=#  CREATE TYPE named_value AS (
  name  text,
  value  integer
);

--Returning a set of results using SETOF
gpadmin=#  CREATE OR REPLACE FUNCTION make_pair_sets (name text)
RETURNS SETOF named_value
AS $$
import numpy as np
# Note: numpy is located under Python 3's site-packages.
return ((name, i) for i in np.arange(3))
$$ LANGUAGE plpython3u;
gpadmin=# SELECT * from make_pair_sets('test');
 name | value
------+-------
 test |     0
 test |     1
 test |     2
(3 rows)

Thoughts?

--
Best Regards,
Yandong

Hubert Zhang

May 28, 2018, 6:31:15 AM
to Yandong Yao, Greenplum Developers
Hi Yandong,

I followed your PR to compile plpython3.so, and I was able to run a Python 3 numpy UDF without calling activate_python3().

In fact, ldd shows that it is already linked against libpython3.6m.so:
ldd ~/gpdb.devel/lib/postgresql/plpython3.so
linux-vdso.so.1 =>  (0x00007ffcb4b3d000)
libpython3.6m.so.1.0 => /lib64/libpython3.6m.so.1.0 (0x00007f3d6f904000)

For Python 2, the ldd result is:
ldd ~/gpdb.devel/lib/postgresql/plpython2.so
linux-vdso.so.1 =>  (0x00007fff4c302000)
libpython2.7.so.1.0 => /usr/lib64/libpython2.7.so.1.0 (0x00007f763e4ac000)

The embedded Python will find the Python 3 symbols automatically, without the need to reset PYTHONPATH. So it seems that Python 3 can work directly.

--  Hubert


--
Thanks

Hubert Zhang

Yandong Yao

May 28, 2018, 6:33:45 AM
to Hubert Zhang, Greenplum Developers
For standard packages, it works fine without activate_python3(). If you want to use other packages, such as numpy or pandas, you need to call it.
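
One way to see which case applies on a given host is to ask the interpreter where it would load a module from. A minimal sketch follows; module_origin is a hypothetical helper, and the same body could run inside a plpython3u function before and after calling activate_python3().

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    # find_spec() locates a top-level module without importing it.
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


# A standard-library module resolves anywhere; numpy resolves only if it is
# installed in a directory that is currently on sys.path.
origins = {name: module_origin(name) for name in ("json", "numpy")}
```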
--
Best Regards,
Yandong

Xin Zhang

May 29, 2018, 12:12:30 PM
to Yandong Yao, Hubert Zhang, Greenplum Developers
A very newbie question. How does pl/python work with pl/container for multiple python versions? I am referring to https://gpdb.docs.pivotal.io/570/ref_guide/extensions/pl_container.html#topic_ehl_r3q_dw.

What does it take to support multiple python versions through pl/container?

I am thinking more about containerizing most of the environment dependencies (PL/*) out of the GPDB code base and putting them in containers, which could even simplify Greenplum deployment and its environment dependencies.

Thoughts?

Thanks,
Shin
--
Shin
Pivotal | Sr. Principal Software Eng, Data R&D

Robert Eckhardt

May 29, 2018, 12:15:44 PM
to Xin Zhang, Yandong Yao, Hubert Zhang, Greenplum Developers
On Tue, May 29, 2018 at 12:11 PM, Xin Zhang <xzh...@pivotal.io> wrote:
A very newbie question. How does pl/python work with pl/container for multiple python versions? I am referring to https://gpdb.docs.pivotal.io/570/ref_guide/extensions/pl_container.html#topic_ehl_r3q_dw.

PL/Python leverages the Python version on the OS; this is why it is untrusted.
 

What does it take to support multiple python versions through pl/container?

PL/Container leverages whatever language is in the container image. 

-- Rob

Xin Zhang

May 29, 2018, 1:15:43 PM
to Robert Eckhardt, Yandong Yao, Hubert Zhang, Greenplum Developers
Yeah, is there anything people can do with pl/python that cannot be done using pl/container?

Sorry, this has already diverged from the original `virtualenv` discussion. I am trying to move away from complicated OS environment dependencies toward a cleaner containerized environment.

If that's possible, then instead of using a UDF to switch between environments, which depends on different environments on the hosting OS, keeping them all inside different versions of containers might be a cleaner solution, e.g. adding `plpy.setEnv(version)`?

Thanks,
Shin

Hubert Zhang

May 29, 2018, 11:14:04 PM
to Robert Eckhardt, Xin Zhang, Yandong Yao, Greenplum Developers
PL/Container supports multiple Python versions through its "runtime" setting.
When you create a PL/Container UDF, you must specify the runtime name at the beginning of the UDF.
The runtime specifies the Docker image (you can choose the Python version in your image).

Containerization is the trend toward decoupling, but embedded plpython/plr still have their role. They are good at reducing data transfer time, since all the data is processed in the same processes (QD/QE), while containerization has to introduce at least one additional process and IPC. Of course, if the Python UDF is complex, the IPC overhead can be ignored.


--
Thanks

Hubert Zhang

Hubert Zhang

May 29, 2018, 11:46:27 PM
to Robert Eckhardt, Xin Zhang, Yandong Yao, Greenplum Developers
Let's move back to the original topic on supporting plpython3u on Greenplum.

Suppose we've already built plpython3.so and plpython2.so.

Running a plpython3u UDF will load plpython3.so, while running a plpython2u UDF will load plpython2.so.

Both shared libraries call Py_Initialize() in _PG_init() when they are loaded.
Py_Initialize() searches for Python modules in PYTHONPATH and PYTHONHOME/lib before the system path.
If PYTHONHOME is set to Python 2 but you are loading a Python 3 library, an error occurs.

In Greenplum, PYTHONHOME is set to GPHOME/ext/python, which is used by tools like gpstart.
gpadmin starts the postmaster with PYTHONHOME=GPHOME/ext/python, and the forked QD/QE processes inherit the PYTHONHOME environment variable.
So one way to support plpython3u on Greenplum is to unset the PYTHONHOME and PYTHONPATH environment variables before running Py_Initialize().
Py_Initialize() will then search for Python modules in the system path, such as /usr/lib64/. If you installed Python 3 with yum, you can already run plpython3u on Greenplum.
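
The inheritance effect described above can be illustrated at the process level. This is only an analogy in Python, not the proposed change inside plpython (which would be made in C before Py_Initialize()); scrubbed_env and child_prefix are hypothetical helper names.

```python
import os
import subprocess
import sys

# Variables that redirect where a freshly started interpreter looks for modules.
INHERITED_VARS = ("PYTHONHOME", "PYTHONPATH")


def scrubbed_env(env):
    """Return a copy of env without the variables that redirect module search."""
    return {key: value for key, value in env.items() if key not in INHERITED_VARS}


def child_prefix(env):
    """Start a fresh interpreter with a scrubbed environment; return its sys.prefix."""
    output = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.prefix)"],
        env=scrubbed_env(env),
    )
    return output.decode().strip()
```

Calling child_prefix(os.environ) from a shell that has sourced greenplum_path.sh would show which installation a new interpreter picks once the inherited variables are gone.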

For users who install Python in a separate directory, we could add a GUC, gp_python3_home, to record the user-specified Python 3 location, and then apply it before Py_Initialize() with Py_SetPythonHome(gp_python3_home) (and LD_LIBRARY_PATH if needed).

A GUC is better than the activate_python3 UDF in Greenplum, since the UDF only takes effect on one gang of QEs; a later complex query in the same session may create more gangs, and the new QEs would not have the new Python environment.

The above method doesn't use virtualenv, but it follows the same principle. Any thoughts on it?

--Hubert

--
Thanks

Hubert Zhang

Robert Eckhardt

May 29, 2018, 11:57:08 PM
to Hubert Zhang, Xin Zhang, Yandong Yao, Greenplum Developers


On Tue, May 29, 2018, 11:46 PM Hubert Zhang <hzh...@pivotal.io> wrote:
Let's move back to the original topic on supporting plpython3u on Greenplum.

What are the advantages of Python in a virtual environment vs PL/Container?

Rob 

Yandong Yao

May 30, 2018, 2:13:27 AM
to Hubert Zhang, Robert Eckhardt, Xin Zhang, Greenplum Developers
Well done, Hubert!

Shall we set PATH also?
--
Best Regards,
Yandong

Hubert Zhang

May 30, 2018, 2:55:06 AM
to Yandong Yao, Robert Eckhardt, Xin Zhang, Greenplum Developers
---- Shall we set PATH also?
Probably yes. And if yes, we could set it "PATH=$PYTHONHOME/bin:$PATH".
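
As a rough sketch of that PATH setting, expressed on an environment dictionary: with_python_bin_first is a hypothetical helper and /opt/python3 is a placeholder location, not a path from this thread.

```python
import os


def with_python_bin_first(env, python_home):
    """Return a copy of env whose PATH starts with <python_home>/bin."""
    updated = dict(env)
    bin_dir = os.path.join(python_home, "bin")
    old_path = env.get("PATH", "")
    # Prepend so that the chosen interpreter wins over any other python on PATH.
    updated["PATH"] = (bin_dir + os.pathsep + old_path) if old_path else bin_dir
    return updated


# Placeholder values for illustration only.
example = with_python_bin_first({"PATH": "/usr/bin"}, "/opt/python3")
```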
--
Thanks

Hubert Zhang