configure a Kerberized Hive datasource

Alex Woolford

Dec 25, 2016, 10:34:10 PM
to airbnb_superset
I'm trying to add a Hive database, running on a Kerberized cluster, as a Superset datasource. I'm able to connect from Airbnb's Airflow and from Tableau using the following settings:

Airflow:
    Conn type: Hive Server 2 Thrift
    Port: 10500
    Schema: [empty]
    Extras:
    {
        "proxy_user": "login",
        "use_beeline": true,
        "principal": "hive/hadoop01.w...@WOOLFORD.IO"
    }

Tableau:

    [Tableau Kerberos connection settings screenshot not preserved]
... so I know it's possible to connect, authenticated via Kerberos, to Hive.

Like Airflow, Superset's database connection form also has an 'Extras' field. The JSON format to enter in this field is slightly different: Superset asks for separate metadata and engine parameters, whereas Airflow accepts flat JSON containing key/values. It's therefore not possible to simply cut/paste the 'Extras' JSON from Airflow to Superset.
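To illustrate the shape difference, here is a hedged sketch that reshapes Airflow's flat 'Extras' JSON into the nested structure Superset expects. The `connect_args` key names are assumptions based on pyhive's connection parameters, not a documented mapping, and the hostname/realm are placeholders:

```python
import json

# Airflow accepts a flat key/value 'Extras' dict (placeholder principal shown).
airflow_extras = {
    "proxy_user": "login",
    "use_beeline": True,
    "principal": "hive/hadoop01.example.com@EXAMPLE.IO",
}

# Superset instead wants separate metadata_params and engine_params objects;
# connection options nest under engine_params.connect_args.
superset_extras = {
    "metadata_params": {},
    "engine_params": {
        "connect_args": {
            # pyhive takes an auth mode rather than a raw principal string;
            # the service name is the first half of the hive/_HOST@REALM principal.
            "auth": "KERBEROS",
            "kerberos_service_name": airflow_extras["principal"].split("/")[0],
        }
    },
}

print(json.dumps(superset_extras, indent=4))
```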

Looking through the Superset config.py, I didn't see a section for Kerberos. Does anyone have any advice for setting this up (perhaps an example of the 'metadata_params' and 'engine_params' JSON, values in superset_config.py, keytab creation steps for FreeIPA, etc...)?

Cheers,

Alex Woolford

Maxime Beauchemin

Dec 27, 2016, 1:12:10 PM
to airbnb_superset
Airflow uses a different Python library (impyla) than Superset (pyhive) to access Hive. The API and feature support differ.

I'm unclear on whether pyhive supports Kerberos authentication as we don't use Kerberos authentication on Hive at Airbnb (yet). Kerberos support was community contributed to Airflow. 

You'll have to find a way to get a SQLAlchemy connection with Kerberos authentication for Hive. Superset should expose all the hooks you need for that. When you find a way, please add the how-to to the Superset documentation.
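A minimal sketch of what such a SQLAlchemy connection might look like, assuming a pyhive build that supports Kerberos (e.g. once the PR discussed below is merged). The `connect_args` keys and the hostname are assumptions for illustration, not a released API:

```python
def hive_engine_config(host, port=10000, database="default"):
    """Build the URL and connect_args a Kerberized Hive SQLAlchemy engine would need."""
    url = "hive://{0}:{1}/{2}".format(host, port, database)
    connect_args = {
        # Auth mode proposed in the pyhive Kerberos PR (assumption, not released API)
        "auth": "KERBEROS",
        # Service-name half of the Hive principal (hive/_HOST@REALM)
        "kerberos_service_name": "hive",
    }
    return url, connect_args

url, connect_args = hive_engine_config("hadoop01.example.com")
# With pyhive + sqlalchemy installed, the engine would then be created as:
#   from sqlalchemy import create_engine
#   engine = create_engine(url, connect_args=connect_args)
print(url)
```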

Thanks,

Max

Alex Woolford

Dec 27, 2016, 4:57:12 PM
to airbnb_superset
Thanks Max. I appreciate your insight/wisdom.

The most recent release of PyHive (v0.2.1) doesn't support Kerberos. I noticed that someone from Yahoo, Wu Junxian (Rupert), created a pull request that adds Kerberos support. This PR hasn't yet been merged or released, so I copied the code from Rupert's PR into `/usr/lib/python2.7/site-packages/pyhive/hive.py` on the Superset server and added the following params:
{
    "metadata_params": {},
    "engine_params": {
        "connect_args": {
            "auth": "KERBEROS",
            "kerberos_service_name": "hive"
        }
    }
}

I then ran `kinit [superset_hive_user]` as the superset user and am now able to use Kerberized Hive as a datasource. I'll need to find a way to refresh the Kerberos ticket (perhaps expect + cron).
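One way to avoid scripting a password with expect: `kinit -kt` against a keytab needs no password at all, so a plain cron entry suffices. A sketch, where the paths, server, and principal below are placeholders for illustration:

```shell
# One-off: export a keytab for the superset user (FreeIPA example)
#   ipa-getkeytab -s ipa01.example.io -p superset@EXAMPLE.IO \
#       -k /home/superset/superset.keytab
#   chmod 600 /home/superset/superset.keytab

# Cron entry (crontab -e as the superset user): re-kinit every 6 hours,
# well inside a typical ticket lifetime
# 0 */6 * * * /usr/bin/kinit -kt /home/superset/superset.keytab superset@EXAMPLE.IO
```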

This is pretty hacky. I've asked the maintainer of the PyHive package if he'll merge/release Rupert's PR. If he does that, I'll add a 'how-to'. 

Maxime Beauchemin

Dec 27, 2016, 5:26:06 PM
to airbnb_superset
I `+1`ed the PR on the pyhive side as I know we'll need this in the future at Airbnb.

Thanks for sharing your hack; it could be helpful to others. For Airflow, the contributor (@bolke if I remember right) added an `airflow kerberos` CLI subcommand that takes care of refreshing the Kerberos ticket. It would be nice if `pyhive` implemented something similar as part of the library.

Max

Alex Woolford

Dec 28, 2016, 3:09:24 AM
to airbnb_superset
Thanks Max.

I looked at the code for the `airflow kerberos` utility and used it to create a dirty hack, outside of Superset, that seems to work: https://github.com/alexwoolford/kerberos_reinit_hack. `airflow kerberos`, which reads Kerberos properties from `airflow.cfg`, is certainly a better way to do this.
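The renewal-loop idea behind `airflow kerberos` and the reinit hack can be sketched as follows: periodically re-acquire a ticket from a keytab via `kinit`. The keytab path and principal are illustrative placeholders, and the loop/interval structure is an assumption, not the actual code of either tool:

```python
import subprocess

def kinit_command(keytab, principal, ccache=None):
    """Build the kinit argv; kept separate so it is easy to inspect and test."""
    cmd = ["kinit", "-kt", keytab, principal]
    if ccache:
        # Optional: write the ticket to an explicit credential cache
        cmd += ["-c", ccache]
    return cmd

def renew_forever(keytab, principal, interval_seconds=3600):
    """Re-kinit from the keytab on a fixed interval (blocks forever)."""
    import time
    while True:
        subprocess.check_call(kinit_command(keytab, principal))
        time.sleep(interval_seconds)

# Example argv this would run:
print(kinit_command("/home/superset/superset.keytab", "superset@EXAMPLE.IO"))
```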