Dynamic creation of Source Tables


Tarsis Azevedo

Mar 31, 2017, 5:16:09 PM
to airbnb_superset
Hi,
I have to create a lot of tables and I want to create dynamic source tables in superset and manipulate it.
Is there any API I can use for that?

Maxime Beauchemin

Mar 31, 2017, 9:00:13 PM
to airbnb_superset
Yes. So many options:

1. If you're ok with writing Python, you can use the SQLAlchemy ORM.

```
from superset.connectors.sqla.models import SqlaTable, SqlMetric, TableColumn
```
http://docs.sqlalchemy.org/en/latest/orm/

2. you could write some sort of data pipeline or script that loads data in your metadata database directly
3. FAB (Flask-AppBuilder) ships with a REST API that is mostly usable. Reading is easy, though it's not super easy to get the create and update endpoints working (CSRF!) and you may have to tweak the ModelViews to do so.
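A sketch of what option 1 might look like in practice. Note these are toy stand-in models built with plain SQLAlchemy, not Superset's actual `SqlaTable` schema; in a real script you'd import the models above and use Superset's own session instead:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

# Toy stand-in for superset.connectors.sqla.models.SqlaTable;
# the real model has many more fields (database_id, schema, ...).
class SqlaTableStub(Base):
    __tablename__ = "tables"
    id = Column(Integer, primary_key=True)
    table_name = Column(String(250), unique=True)

engine = create_engine("sqlite://")  # stand-in for the Superset metadata DB
Base.metadata.create_all(engine)

with Session(engine) as session:
    # upsert-ish: only create the datasource row if it doesn't exist yet
    existing = (
        session.query(SqlaTableStub).filter_by(table_name="user").one_or_none()
    )
    if existing is None:
        session.add(SqlaTableStub(table_name="user"))
    session.commit()
```

The same add-if-missing pattern extends to metrics and columns attached to the table row.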

About 3, I have a PR out on FAB to make things a bit better:

Max

Fabian Menges

Jun 2, 2017, 5:22:38 PM
to airbnb_superset
Wanted to resurrect this thread. I followed #1 and wrote some Python that can load datasources from a YAML file looking like this:

```
databases:
  - database_name: TestDB
    verbose_name: "Test DB"
    sqlalchemy_uri: "sqlite:////home/test/.superset/superset.db"
    password: null
    cache_timeout: 1000
    expose_in_sqllab: true
    tables:
      - schema: null
        table_name: 'user'
        columns:
          - column_name: 'id'
            verbose_name: 'User Id'
            is_active: true
            type: BIGINT
            groupby: true
            count_distinct: true
            sum: true
            filterable: true
          - column_name: 'name'
            verbose_name: 'User Name'
            is_active: true
            type: VARCHAR
        metrics:
          - metric_name: count
            verbose_name: count
            metric_type: count
            expression: "COUNT(*)"
```

It's pretty straightforward and makes use of the models in superset.connectors.sqla.models and the `export_fields` field on those models. Currently I have it as a standalone script, but I think it would be helpful to include this right in Superset.
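For context on the mechanics: the `export_fields` idea boils down to filtering each parsed YAML mapping down to the whitelisted attributes a model declares, then handing those to the constructor. A stdlib-only toy sketch (the stub class and helper are illustrative, not Superset code):

```python
class TableStub:
    # mirrors the idea of Superset's `export_fields`: the whitelist of
    # attributes that round-trip through YAML import/export
    export_fields = ("table_name", "schema", "description")

    def __init__(self, **kwargs):
        for field in self.export_fields:
            setattr(self, field, kwargs.get(field))


def from_yaml_dict(model_cls, data):
    """Build a model from a parsed YAML mapping, ignoring unknown keys
    (e.g. nested `columns`/`metrics` lists, which are handled separately)."""
    known = {k: v for k, v in data.items() if k in model_cls.export_fields}
    return model_cls(**known)


tbl = from_yaml_dict(
    TableStub,
    {"table_name": "user", "schema": None, "columns": [{"column_name": "id"}]},
)
```

Nested children (columns, metrics) would be recursed into with their own model classes and whitelists.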

I'm imagining something like adding it to the Superset CLI: `superset add_datasources <yaml file>`
Another option is to add an "Import Datasource" page in the "Manage" tab of the web UI that allows us to load datasources through the web app. Obviously the CLI and web options are not mutually exclusive.
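A minimal stdlib sketch of what such a CLI entry point could parse (Superset's actual CLI is built on its own manager; the subcommand name and `--sync` flag here are hypothetical, just matching the proposal above):

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="superset")
    sub = parser.add_subparsers(dest="command")
    # hypothetical subcommand, mirroring `superset add_datasources <yaml file>`
    imp = sub.add_parser("add_datasources", help="load datasources from a YAML file")
    imp.add_argument("path", help="path to the YAML file")
    imp.add_argument(
        "--sync",
        action="store_true",
        help="also delete datasources not present in the file",
    )
    return parser


args = build_parser().parse_args(["add_datasources", "datasources.yaml", "--sync"])
```

The handler for the subcommand would then parse the YAML and feed it through the model-building logic described earlier.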

I could probably add YAML functionality to the dashboard import/export as well (in a second step), in addition to the existing pickle import/export. Benefits of YAML are:

1. can edit in editor
2. store in source control
3. more human-readable than JSON (supports comments, etc.)
4. can easily be generated from another language

Any thoughts?

If you are interested in this feature, I would implement the CLI part first and go from there.

Fabian

Nitin Dhawan

Jun 3, 2017, 7:09:57 AM
to Fabian Menges, airbnb_superset
Hi Fabian,

That sounds great.
I would like to try your python import script even in the standalone form.

Would you be willing to share it?

Regards,
-Nitin



Maxime Beauchemin

Jun 4, 2017, 8:41:15 PM
to airbnb_superset
This is great! We've had a few significant incidents where people altered datasource configurations and broke slices/dashboards as a result. Having the option to manage those from source control is something we've been thinking about.

It totally should be documented and part of the CLI. I like the idea of allowing datasource exports as well. I'd say this should be supported in the CLI, and perhaps in the UI too, as you mentioned.

On top of that, I'd add a new Boolean field to BaseDatasource, `managed_by_script`, which would prevent editing from the UI for datasources that have been synced by script/YAML. The idea would be to put the YAML in source control and make it the single source of truth for the datasources defined in there, and prevent users from getting it out of sync in the UI. The CLI's `superset import_datasources foo.yaml` would set that `managed_by_script` flag to True for all datasources defined in there. When trying to edit those datasources in the UI, you'd just see text that says "This datasource is maintained by a script; here's the related YAML".

As a side note, in the near future we want to allow multiple datasource entries for the same physical table (currently there's a unique constraint) and manage ownership (the same way we do for slices and dashboards), meaning that for a given physical table, two different users could have diverging entries with different configurations. That way one user cannot mess up someone else's configuration and there's no need for consensus. Each user would only be able to alter his or her own entry. It raises the question of which unique identifier we'd use to upsert/merge the entries when loading the YAML; perhaps a new arbitrary `unique_identifier` field.
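The upsert/merge semantics described at the end could key each entry on that identifier. A toy stdlib sketch, with hypothetical field names (entries are the dicts parsed from YAML):

```python
def upsert_datasources(existing, incoming, key="unique_identifier"):
    """Merge incoming YAML entries into existing ones, matching on `key`:
    entries with a new key are appended, matching keys are overwritten."""
    by_key = {entry[key]: entry for entry in existing}
    for entry in incoming:
        by_key[entry[key]] = entry
    return list(by_key.values())


merged = upsert_datasources(
    [{"unique_identifier": "a", "table_name": "user"}],
    [
        {"unique_identifier": "a", "table_name": "user_v2"},
        {"unique_identifier": "b", "table_name": "events"},
    ],
)
```

The same matching rule would let two users keep independent entries for one physical table, since each entry's identifier, not the table name, decides what gets updated.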

Max

Fabian Menges

Jun 5, 2017, 11:53:00 AM
to airbnb_superset
Hi Max,

Great, I'll start working on that. To address your comments:
  • Is the proposed column `managed_by_script` necessary? When we add databases/datasources using the CLI (e.g. currently the example datasets), "created_by_fk" and "changed_by_fk" are NULL, since the actions were not performed by a "user". I propose: if the DB/table/metric etc. was not created by a user (created_by_fk == NULL), it can't be changed/edited by a user.
  • I already ran into the issue of multiple entries for the same physical table. The data model for SQL databases already (kind of) supports this, since only the SQLAlchemy connection string is relevant. You can create multiple database entries with different names pointing to the same physical resource. We should probably make this consistent between Druid and SqlA.
@Nitin: Here we go: https://github.com/tc-dc/superset/blob/DatasourceUpload/upload_datasources.py It's poorly named, since it currently supports only SQL databases/tables/columns, but it should be straightforward to add the logic for Druid datasources.
It's a bit rough and I won't continue working on it, since I'll get the same functionality into the CLI, but I don't see the YAML format changing (much).
`print-schema` and `download` are too verbose (they spit out too many columns) to be that useful.

Fabian

Fabian Menges

Jun 21, 2017, 4:13:27 PM
to airbnb_superset
I'm sure you saw the pull request already, just wanted to close the loop here.
https://github.com/airbnb/superset/pull/2993

Besides adding and fixing some unique constraints, I have not made any changes to the data model so far (e.g. adding the `managed_by_script` field).
The Importer/Exporter picks up changes to the schema dynamically so it should support multiple datasources for a single physical table.

I'm happy to add more functionality and/or make changes, but wanted to get some feedback before continuing work on this.
I think the most important thing to settle is the YAML schema/format, so we don't have to change it again going forward.
Currently it's identical to what I posted earlier in this thread.

Ran into this bug https://github.com/airbnb/superset/pull/3011 when running the integration tests; I factored the fix out of the main pull request so it can be applied sooner.

Thanks,

Fabian