Integrate PyHive and Alembic

727 views
Skip to first unread message

Alexander Peletz

unread,
Feb 10, 2017, 7:41:07 AM2/10/17
to sqlalchemy-alembic
Hello,

I would like to use Alembic to manage my Hive Metastore. I have installed, PyHive, SqlAlchemy, and Alembic. I am able to create a functional engine object using the 'hive' dialect in sqlalchemy, however I cannot get Alembic to recognize this dialect. The problem appears to be a lack of a HiveImpl class in the Alembic package. I attempted to resolve this by creating an alembic/ddl/hive.py module and pasting the following code into that module:


from .impl import DefaultImpl

class HiveImpl(DefaultImpl):
__dialect__ = 'hive'



I simply want to be able to execute raw SQL against a Hive instance (no ORM implementation needed) and I was hoping to use Alembic to manage the minimum upgrade/downgrade functionality. Are there any simple edits I can make to the Alembic source code to allow me to achieve this goal?


Thanks,
Alexander

mike bayer

unread,
Feb 10, 2017, 9:45:38 AM2/10/17
to sqlalchem...@googlegroups.com


On 02/10/2017 07:41 AM, Alexander Peletz wrote:
> Hello,
>
> I would like to use Alembic to manage my Hive Metastore. I have
> installed, PyHive, SqlAlchemy, and Alembic. I am able to create a
> functional engine object using the 'hive' dialect in sqlalchemy, however
> I cannot get Alembic to recognize this dialect. The problem appears to
> be a lack of a HiveImpl class in the Alembic package. I attempted to
> resolve this by creating an alembic/ddl/hive.py module and pasting the
> following code into that module:
>
>
> from .impl import DefaultImpl
>
> class HiveImpl(DefaultImpl):
> __dialect__ = 'hive'



you don't actually have to create a "hive.py" file. Anywhere in your
hive dialect, simply put the above code that you have (using the correct
imports of course). The DefaultImpl uses a metaclass that will allow
the "hive" name to be available to alembic as a result of this class
being created.


>
>
>
>
> I simply want to be able to execute raw SQL against a Hive instance (no
> ORM implementation needed) and I was hoping to use Alembic to manage the
> minimum upgrade/downgrade functionality. Are there any simple edits I
> can make to the Alembic source code to allow me to achieve this goal?
>
>
>
> Thanks,
>
> Alexander
>
> --
> You received this message because you are subscribed to the Google
> Groups "sqlalchemy-alembic" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to sqlalchemy-alem...@googlegroups.com
> <mailto:sqlalchemy-alem...@googlegroups.com>.
> For more options, visit https://groups.google.com/d/optout.

Alexander Peletz

unread,
Feb 10, 2017, 12:41:48 PM2/10/17
to sqlalchemy-alembic
Great thank you for the quick response! I actually found this morning that I can put the class definition in alembic/ddl/impl.py and it works (seemed like a logical module for it?).

Next issue is the CREATE TABLE statement used to create the alembic_version table uses keywords that are not valid HiveQL keywords. I assume somewhere in the HiveImpl class I could override the contents of this statement, but I couldn't figure out where/how to do that.


Thanks,
Alexander

mike bayer

unread,
Feb 10, 2017, 1:26:20 PM2/10/17
to sqlalchem...@googlegroups.com


On 02/10/2017 12:41 PM, Alexander Peletz wrote:
> Great thank you for the quick response! I actually found this morning
> that I can put the class definition in alembic/ddl/impl.py and it works
> (seemed like a logical module for it?).
>
> Next issue is the CREATE TABLE statement used to create the
> alembic_version table uses keywords that are not valid HiveQL keywords.
> I assume somewhere in the HiveImpl class I could override the contents
> of this statement, but I couldn't figure out where/how to do that.


for CREATE TABLE assuming you've written a whole SQLAlchemy dialect,
you'll want to look into your DDLCompiler in the visit_create_table
method. This is on the SQLAlchemy side, not Alembic.

However, usually you don't need to override visit_create_table as a
whole, I'd imagine you're needing to deal with something special about
columns, datatypes, or constraints. There are individual methods that
deal with those things, I'd take a look at some of the existing
SQLAlchemy dialects to see how those things get set up.
> > an email to sqlalchemy-alem...@googlegroups.com
> <javascript:>
> > <mailto:sqlalchemy-alem...@googlegroups.com
> <javascript:>>.
> > For more options, visit https://groups.google.com/d/optout
> <https://groups.google.com/d/optout>.
>
> --
> You received this message because you are subscribed to the Google
> Groups "sqlalchemy-alembic" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to sqlalchemy-alem...@googlegroups.com
> <mailto:sqlalchemy-alem...@googlegroups.com>.

Alexander Peletz

unread,
Mar 1, 2017, 2:13:09 PM3/1/17
to sqlalchemy-alembic
I am using the pyhive library which includes a partial implementation of the 'hive' dialect. I have edited various aspects of this dialect but I am still unable to affect the initial create table statement that Alembic is issuing to build the alembic_version table. Here is what Alembic is issuing:

'CREATE TABLE `alembic_version` (\n\t`version_num` STRING NOT NULL, \n\tCONSTRAINT `alembic_version_pkc` PRIMARY KEY (`version_num`)\n)\n\n'

And here is what I need:
'CREATE TABLE `alembic_version` (\n\t`version_num` STRING\n)\n\n'

I have attempted to add a visit_primary_key_constraint() method to my dialect which simply returns '' but this is having no effect. And I am a bit lost in finding which method I should override to ensure I never append a 'NOT NULL' string to the end of a column specification.

Any recommendations?

Thanks,
Alexander
>     > an email to sqlalchemy-alembic+unsub...@googlegroups.com
>     <javascript:>
>     > <mailto:sqlalchemy-alembic+unsub...@googlegroups.com
>     <javascript:>>.
>     > For more options, visit https://groups.google.com/d/optout
>     <https://groups.google.com/d/optout>.
>
> --
> You received this message because you are subscribed to the Google
> Groups "sqlalchemy-alembic" group.
> To unsubscribe from this group and stop receiving emails from it, send

mike bayer

unread,
Mar 1, 2017, 2:51:42 PM3/1/17
to sqlalchem...@googlegroups.com
does your dialect override DDLCompiler.get_column_specification()?
That's the best way to get in extra things like NOT NULL, etc. Example:
https://github.com/zzzeek/sqlalchemy/blob/master/lib/sqlalchemy/dialects/mysql/base.py#L982
> > > an email to sqlalchemy-alem...@googlegroups.com
> <javascript:>
> > <javascript:>
> > > <mailto:sqlalchemy-alem...@googlegroups.com
> <javascript:>
> > <javascript:>>.
> > > For more options, visit https://groups.google.com/d/optout
> <https://groups.google.com/d/optout>
> > <https://groups.google.com/d/optout
> <https://groups.google.com/d/optout>>.
> >
> > --
> > You received this message because you are subscribed to the Google
> > Groups "sqlalchemy-alembic" group.
> > To unsubscribe from this group and stop receiving emails from it,
> send
> > an email to sqlalchemy-alem...@googlegroups.com
> <javascript:>
> > <mailto:sqlalchemy-alem...@googlegroups.com
> <javascript:>>.
> > For more options, visit https://groups.google.com/d/optout
> <https://groups.google.com/d/optout>.
>
> --
> You received this message because you are subscribed to the Google
> Groups "sqlalchemy-alembic" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to sqlalchemy-alem...@googlegroups.com
> <mailto:sqlalchemy-alem...@googlegroups.com>.

Alexander Peletz

unread,
Mar 1, 2017, 3:46:40 PM3/1/17
to sqlalchemy-alembic
The dialect subclassed the SQLCompiler but not the DDLCompiler. Putting the visit_* statements in a DDLCompiler subclass worked! Thanks!

Now to override the UPDATE and DELETE statements since Hive doesn't support those... Will give it a whirl before asking more questions.

Thanks,
Alexander
>     >     > an email to sqlalchemy-alembic+unsub...@googlegroups.com
>     <javascript:>
>     >     <javascript:>
>     >     > <mailto:sqlalchemy-alembic+unsub...@googlegroups.com
>     <javascript:>
>     >     <javascript:>>.
>     >     > For more options, visit https://groups.google.com/d/optout
>     <https://groups.google.com/d/optout>
>     >     <https://groups.google.com/d/optout
>     <https://groups.google.com/d/optout>>.
>     >
>     > --
>     > You received this message because you are subscribed to the Google
>     > Groups "sqlalchemy-alembic" group.
>     > To unsubscribe from this group and stop receiving emails from it,
>     send
>     > an email to sqlalchemy-alembic+unsub...@googlegroups.com
>     <javascript:>
>     > <mailto:sqlalchemy-alembic+unsub...@googlegroups.com
>     <javascript:>>.
>     > For more options, visit https://groups.google.com/d/optout
>     <https://groups.google.com/d/optout>.
>
> --
> You received this message because you are subscribed to the Google
> Groups "sqlalchemy-alembic" group.
> To unsubscribe from this group and stop receiving emails from it, send

Ke Zhu

unread,
Jan 24, 2020, 1:56:59 PM1/24/20
to sqlalchemy-alembic
Just discovered this post when trying to do exact same thing (besides planning to support one more dialect).

> Anywhere in your  hive dialect, simply put the above code that you have (using the correct imports of course). 

Does it mean it must introduce dependency to alembic (since it uses alembic.ddl.impl.DefaultImpl) in a package (.e.g, pyHive) that supports sqlalchemy interfaces?

If not, is there any guidance to support this at alembic level in a plug-gable way? E.g., declare a HiveImpl class in `env.py` of a project uses alembic?

PS: I raised this question in stackoverflow but raised this group is a better place to get help.


On Friday, February 10, 2017 at 9:45:38 AM UTC-5, mike bayer wrote:

Mike Bayer

unread,
Jan 25, 2020, 6:19:23 PM1/25/20
to sqlalchem...@googlegroups.com


On Fri, Jan 24, 2020, at 1:56 PM, Ke Zhu wrote:
Just discovered this post when trying to do exact same thing (besides planning to support one more dialect).

> Anywhere in your  hive dialect, simply put the above code that you have (using the correct imports of course). 

Does it mean it must introduce dependency to alembic (since it uses alembic.ddl.impl.DefaultImpl) in a package (.e.g, pyHive) that supports sqlalchemy interfaces?

well you have to put it in a try/except ImportError block so that if alembic isn't installed, it silently passes.   there's a github issue to add support for real entrypoints but it hasn't been that critical.


If not, is there any guidance to support this at alembic level in a plug-gable way? E.g., declare a HiveImpl class in `env.py` of a project uses alembic?

you could put one in your env.py also but if you are the person working on the dialect you can have this built in, see the example in sqlalchemy-redshift: https://github.com/sqlalchemy-redshift/sqlalchemy-redshift/blob/master/sqlalchemy_redshift/dialect.py#L27



--
You received this message because you are subscribed to the Google Groups "sqlalchemy-alembic" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sqlalchemy-alem...@googlegroups.com.

Ke Zhu - kzhu@us.ibm.com

unread,
Feb 24, 2020, 3:49:19 PM2/24/20
to sqlalchem...@googlegroups.com
Mike,

Thanks for the pointers. I've figured out the programming part and discovers more things when integrating PyHive:

1. It requires the table alembic_version to be transactional so that it can do Update/Delete when upgrade/downgrade. which is challege for Hive3 which has limited ACID support. Since I could not figure out a way to program a transactional table via SQLAlechemy API (see https://github.com/dropbox/PyHive/issues/314), it ended up a decoration to patch sqlalchemy.schema.CreateTable by appending table properties.
2. PyHive doesn't fully support Update/Delete for Hive/Presto yet. it's easy to patch PyHive but the key problem is it doesn't support transactional DDL like rollback in Hive3.

Although I've managed to get `alembic upgrade` and `alembic downgrade` to work on Hive 3, it's still not a fully transactional experience (e.g., changed schema sccessfully but failed to update table alembic_version).

I wonder if there's any design direction in Alembic to allowing storing the version table `alembic_version` in another db when dealing with non-RDBMS SQL engine (Hive/Presto). e.g., supporting a postgres db to store table alembic while delivering the actual changes to Hive. I had a PoC to using multi-db template to manage the table `alembic_version` in a RDBMS while denying any operation on table `alembic_version` in Hive/Presto. it works now but does it sound right?

Mike Bayer

unread,
Feb 24, 2020, 6:28:26 PM2/24/20
to sqlalchem...@googlegroups.com


On Mon, Feb 24, 2020, at 3:44 PM, Ke Zhu - kz...@us.ibm.com wrote:
Mike,

Thanks for the pointers. I've figured out the programming part and discovers more things when integrating PyHive:

1. It requires the table alembic_version to be transactional so that it can do Update/Delete when upgrade/downgrade. which is challege for Hive3 which has limited ACID support. Since I could not figure out a way to program a transactional table via SQLAlechemy API (see https://github.com/dropbox/PyHive/issues/314), it ended up a decoration to patch sqlalchemy.schema.CreateTable by appending table properties.

it's not a hard requirement that there's a DB transaction in use, you could run alembic in an "autocommit" mode and that shouldn't cause any problem.  you just won't be able to roll back if something fails.   im not completely sure what you mean by "the table to be transactional so that it can do update/delete" but from Alembic's point of view it just needs to run INSERT/UPDATE/DELETE but there doesn't have to be any ACID guarantees.


2. PyHive doesn't fully support Update/Delete for Hive/Presto yet. it's easy to patch PyHive but the key problem is it doesn't support transactional DDL like rollback in Hive3.

Although I've managed to get `alembic upgrade` and `alembic downgrade` to work on Hive 3, it's still not a fully transactional experience (e.g., changed schema sccessfully but failed to update table alembic_version).

so..this is a database that can change schema structures but not insert/update/delete rows?  im not following.  is there some kind of SQL layer that has to be in use that isn't there when you create structures?



I wonder if there's any design direction in Alembic to allowing storing the version table `alembic_version` in another db when dealing with non-RDBMS SQL engine (Hive/Presto). e.g., supporting a postgres db to store table alembic while delivering the actual changes to Hive. I had a PoC to using multi-db template to manage the table `alembic_version` in a RDBMS while denying any operation on table `alembic_version` in Hive/Presto. it works now but does it sound right?

that wouldn't be very easy but also it would be way better to store the version info in the target DB itself.   I don't know anything about Hive/Presto, but they *are* databases so I'd assume you can put data in them.


Ke Zhu - kzhu@us.ibm.com

unread,
Feb 25, 2020, 11:07:03 AM2/25/20
to sqlalchem...@googlegroups.com
According to Apache Hive, Update/Delete can only be performed on tables that supports ACID. see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Update

> this is a database that can change schema structures but not insert/update/delete rows? is there some kind of SQL layer that has to be in use that isn't there when you create structures?

Yes! Basically it separate data store and data model so that you can update table schema as metadata while the actual data can be loaded/streamed into external data store w/o SQL interface.

> that wouldn't be very easy but also it would be way better to store the version info in the target DB itself.

totally understand this. I see this design as advantage as well just like other schema management tool like sqitch. I'm just looking for possiblilities since I've seen it uses the same connection to execute schema changes and versioning change in a migration context.

> I don't know anything about Hive/Presto, but they *are* databases so I'd assume you can put data in them

Yes to Hive, when using PyHive, it just needs to fix https://github.com/dropbox/PyHive/issues/315

While, the answer will be no to PrestoSQL which is just a SQL query engine that delegates data model and data store to query targets (MySQL/Postgres/Kafka/Elasticsearch etc) via connectors.

Mike Bayer

unread,
Feb 26, 2020, 11:07:42 AM2/26/20
to sqlalchem...@googlegroups.com


On Tue, Feb 25, 2020, at 11:06 AM, Ke Zhu - kz...@us.ibm.com wrote:
According to Apache Hive, Update/Delete can only be performed on tables that supports ACID. see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Update

> this is a database that can change schema structures but not insert/update/delete rows? is there some kind of SQL layer that has to be in use that isn't there when you create structures?

Yes! Basically it separate data store and data model so that you can update table schema as metadata while the actual data can be loaded/streamed into external data store w/o SQL interface.

> that wouldn't be very easy but also it would be way better to store the version info in the target DB itself.

totally understand this. I see this design as advantage as well just like other schema management tool like sqitch. I'm just looking for possiblilities since I've seen it uses the same connection to execute schema changes and versioning change in a migration context.

> I don't know anything about Hive/Presto, but they *are* databases so I'd assume you can put data in them

Yes to Hive, when using PyHive, it just needs to fix https://github.com/dropbox/PyHive/issues/315

oh, that issue is talking about rowcount.  Alembic does not need rowcount to function correctly.    I see that Alembic is doing this now, however there is a dialect-level flag called "supports_sane_rowcount", if this were False, Alembic should be checking this and skipping that particular check.

is the "rowcount" check the only issue here?   this part can be changed on the Alembic side.


Ke Zhu - kzhu@us.ibm.com

unread,
Feb 27, 2020, 2:49:29 PM2/27/20
to sqlalchem...@googlegroups.com
On Wed, 2020-02-26 at 11:07 -0500, Mike Bayer wrote:
oh, that issue is talking about rowcount.  Alembic does not need rowcount to function correctly.    I see that Alembic is doing this now, however there is a dialect-level flag called "supports_sane_rowcount", if this were False, Alembic should be checking this and skipping that particular check.

I see. it would be great if Alembic will respect the "support_sane_rowcount" specified in a SQLAlchemy dialect when updating/deleting revision. If this is a suggested design, I can patch it via a PR.

According to specific dialect, it just need to use this option instead of returning "-1". I can document this in the issue https://github.com/dropbox/PyHive/issues/315

is the "rowcount" check the only issue here?   this part can be changed on the Alembic side.

Besides the above issue, I want to get input on https://github.com/dropbox/PyHive/issues/314 as well. How could I use SQLAlchemy API to create table like what this sql does:
CREATE TABLE hello_acid (key int, value int)
STORED AS ORC TBLPROPERTIES ('transactional'='true');
I've no idea how to make `STORED AS and TBLPROPERTIES` part for a Hive table. Now I just use SQLAlchemy compiler to append them for CreateTable.


Mike Bayer

unread,
Feb 27, 2020, 3:39:18 PM2/27/20
to sqlalchem...@googlegroups.com


On Thu, Feb 27, 2020, at 2:49 PM, Ke Zhu - kz...@us.ibm.com wrote:
On Wed, 2020-02-26 at 11:07 -0500, Mike Bayer wrote:
oh, that issue is talking about rowcount.  Alembic does not need rowcount to function correctly.    I see that Alembic is doing this now, however there is a dialect-level flag called "supports_sane_rowcount", if this were False, Alembic should be checking this and skipping that particular check.

I see. it would be great if Alembic will respect the "support_sane_rowcount" specified in a SQLAlchemy dialect when updating/deleting revision. If this is a suggested design, I can patch it via a PR.

yes.



According to specific dialect, it just need to use this option instead of returning "-1". I can document this in the issue https://github.com/dropbox/PyHive/issues/315

the pyhive dialect advertises the attribute correctly right?


is the "rowcount" check the only issue here?   this part can be changed on the Alembic side.

Besides the above issue, I want to get input on https://github.com/dropbox/PyHive/issues/314 as well. How could I use SQLAlchemy API to create table like what this sql does:
CREATE TABLE hello_acid (key int, value int)
STORED AS ORC TBLPROPERTIES ('transactional'='true');
I've no idea how to make `STORED AS and TBLPROPERTIES` part for a Hive table. Now I just use SQLAlchemy compiler to append them for CreateTable.

There's a few paths to make this work depending on where and how far you want to go, or at least how this works.


Assuming these are *optional* arguments for the tables that you create with the pyhive dialect, the official way is that the dialect can be made to support these (I thought we had a generic "suffixes" section right now, but we don't).   The way they work can be seen in the Postgresql dialect.  start with the keywords you want, here I might call it pyhive_stored_as="ORC"  pyhive_tblproperties="('transactional'='true').  then they'd be set up as "stored_as" and "tblproperties" in the dialect's construct_arguments like Postgresql does here:


then you consume these in post_create_table:



if you need Alembic's alembic_version to have these keywords, im not sure we have a keyword for that right now however you can use metadata events to intercept when alembic_version is built up, or you can intercept the compiler as you're doing now.

if alternatively all pyhive tables need these keywords unconditionally, then you'd just implement post_create_table in the compiler to add them in.






--
You received this message because you are subscribed to the Google Groups "sqlalchemy-alembic" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sqlalchemy-alem...@googlegroups.com.

Ke Zhu - kzhu@us.ibm.com

unread,
Mar 2, 2020, 10:36:45 AM3/2/20
to sqlalchem...@googlegroups.com
Mike,

Thanks for merging my change on Alembic!

On Thu, 2020-02-27 at 15:38 -0500, Mike Bayer wrote:
>
>
> On Thu, Feb 27, 2020, at 2:49 PM, Ke Zhu - kz...@us.ibm.com wrote:
> > On Wed, 2020-02-26 at 11:07 -0500, Mike Bayer wrote:
> > > oh, that issue is talking about rowcount. Alembic does not need
> > > rowcount to function correctly. I see that Alembic is doing
> > > this now, however there is a dialect-level flag called
> > > "supports_sane_rowcount", if this were False, Alembic should be
> > > checking this and skipping that particular check.
> >
> > I see. it would be great if Alembic will respect the
> > "support_sane_rowcount" specified in a SQLAlchemy dialect when
> > updating/deleting revision. If this is a suggested design, I can
> > patch it via a PR.
>
> yes.
>
>
> > According to specific dialect, it just need to use this option
> > instead of returning "-1". I can document this in the issue
> > https://github.com/dropbox/PyHive/issues/315
>
> the pyhive dialect advertises the attribute correctly right?

That's the fix I proposed in that issue.
Thanks for the pointers! It's very useful. I will try these paths since
not all Hive tables need such keywords so far. It depends on how data
is ingested in Hive/hadoop.
Reply all
Reply to author
Forward
0 new messages