Queries regarding CDAP Wrangler and namespaces


Harshad Thombare

Sep 9, 2022, 5:51:42 AM
to CDAP User

Hi Folks, 

I am evaluating CDAP to see if it suits our use case. Here are some questions I have:

  1. In CDAP Wrangler, can we only create connections for S3, CloudSQL MySQL, CloudSQL PostgreSQL, Database, SQL Server, MySQL, Oracle, PostgreSQL, File, BigQuery, GCS, Spanner and Kafka? Is there no way to create connections to Azure Data Lake Storage?
  2. We have a use case where we want multiple namespaces in CDAP to segregate the user base. For each namespace we want a different datasource, and we want to limit access to just that datasource. Is this possible with CDAP?
  3. In CDAP, is it possible to restrict users to Wrangler only, without access to the Studio?
  4. Is there a REST API to access the insights data (profiled data) from Wrangler?

Can someone please help me out with this? Sharing some references would also help.

Thank you

Regards,
Harshad

Albert Shau

Sep 9, 2022, 2:05:36 PM
to cdap...@googlegroups.com
Hi Harshad,

1. ADLS currently has not been added as a wrangler connection. There is no technical reason why it can't be done, it just hasn't been prioritized. There is a source plugin, so it is possible to read from ADLS in the pipeline, it just can't be browsed and sampled without updating the plugin (https://github.com/data-integrations/azure -- contributions here would be very welcome!).

2. I'm not quite sure what you mean by a datasource so not sure if this will answer your question, but it is possible to control the sets of plugins available in each namespace. To do this you would delete all the system plugin artifacts that are packaged out of the box, then deploy just the ones you want in each individual namespace. See https://cdap.atlassian.net/wiki/spaces/DOCS/pages/477692148/Artifact+Microservices for more info about the artifacts API and https://cdap.atlassian.net/wiki/spaces/DOCS/pages/480379172/Plugins for more information about plugins.

3. CDAP does have authorization support (See https://cdap.atlassian.net/wiki/spaces/DOCS/pages/597492346/Authorization+Policies and https://cdap.atlassian.net/wiki/spaces/DOCS/pages/477790661/Security+Microservices). The authorization module is pluggable as well. So it is possible to make sure users don't have permission to create applications (like pipelines). But from a UI perspective, I don't think there's a way to hide those portions of the UI, you would just get errors whenever the UI interacts with the REST APIs. 

4. We are in the process of revamping the insights page. It is currently one of the only things in the UI that is not backed by a REST API.
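
As a rough illustration of point 2, the sequence "delete the system plugin artifacts, then deploy only the wanted ones per namespace" could be planned as below. This is a sketch, not a tested procedure: the base URL, the artifact names and versions, and the parent version range are all hypothetical, and the paths and headers follow the Artifact Microservices page as best recalled, so check them against that page before use. The code only builds request descriptions and sends nothing.

```python
from urllib.parse import quote

# Assumption: the CDAP router is reachable here; adjust host and port for your install.
BASE_URL = "http://localhost:11015/v3"


def delete_artifact_request(namespace, name, version):
    """Describe a DELETE of one artifact version in the given namespace."""
    path = (
        f"/namespaces/{quote(namespace, safe='')}"
        f"/artifacts/{quote(name, safe='')}"
        f"/versions/{quote(version, safe='')}"
    )
    return {"method": "DELETE", "url": BASE_URL + path, "headers": {}}


def deploy_artifact_request(namespace, name, version, extends):
    """Describe a POST that uploads a plugin jar into one namespace.

    The jar bytes would be the request body; they are left out of this sketch.
    """
    path = f"/namespaces/{quote(namespace, safe='')}/artifacts/{quote(name, safe='')}"
    headers = {"Artifact-Version": version, "Artifact-Extends": extends}
    return {"method": "POST", "url": BASE_URL + path, "headers": headers}


def plan_namespace_plugins(system_artifacts, per_namespace):
    """Return the requests in order: system deletions first, then per-namespace deploys."""
    plan = [delete_artifact_request("system", n, v) for n, v in system_artifacts]
    for namespace, artifacts in per_namespace.items():
        for name, version, extends in artifacts:
            plan.append(deploy_artifact_request(namespace, name, version, extends))
    return plan


if __name__ == "__main__":
    # Hypothetical artifact name, version, and parent range, for illustration only.
    parent = "system:cdap-data-pipeline[6.0.0,7.0.0)"
    plan = plan_namespace_plugins(
        [("s3-plugins", "1.0.0")],
        {"A": [("s3-plugins", "1.0.0", parent)]},
    )
    for request in plan:
        print(request["method"], request["url"])
```

Building the plan as data first makes it easy to review what would be deleted before anything is sent to the server.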

Best Regards,
Albert


Harshad Thombare

Sep 12, 2022, 3:05:01 AM
to CDAP User
Hi Albert,

Thank you for the response. I have a follow-up question.

As you mentioned, it is possible to control the sets of plugins available in each namespace. Our scenario is to provide provisioned namespaces for different sets of users in CDAP, and we want to restrict data access across multiple namespaces.
E.g. Namespace = A, B
     Source = S1, S2, S3
     User = U1, U2
The controls we want are:
- User U1 has access to Namespace A, which can only read/write data from sources S1 and S2
- Similarly, user U2 has access to Namespace B, which can only read/write data from source S3

If the above setup is possible, how can we achieve it?
 

Regards,
Harshad Thombare

Albert Shau

Sep 12, 2022, 6:01:50 PM
to cdap...@googlegroups.com
Hi Harshad,

It sounds like you're not talking about limiting which plugins are shown in which namespaces, but more about what data users in each namespace are allowed to access. The general pattern for this is to give a set of admins permissions to create and edit compute profiles and wrangler connections, while making sure that other users cannot touch them. Compute profiles and connections can be local to just a specific namespace. In the compute profile, you can specify which service account to use on Dataproc and restrict the permissions for that service account to what you desire. Similarly, the wrangler connections can have user/password information configured in them, and your admins would ensure that those users only have permission to access the right data. Of course, if a user happens to know the credentials for some other table, they will be able to input those directly into the source instead of using one of the admin defined connections, but then that means that user would have been able to access that table anyway through other tooling.
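
To make the pattern concrete, here is a toy model of the U1/U2 example from the question. It is only a sketch of the intended policy, not a CDAP API: the `Namespace` class, the connection names, and `can_access` are all hypothetical. In a real deployment the enforcement comes from the authorization setup plus the permissions of the service accounts and credentials behind each admin-defined connection.

```python
from dataclasses import dataclass, field


@dataclass
class Namespace:
    name: str
    # Users allowed to work in this namespace.
    users: set = field(default_factory=set)
    # Admin-defined connection name -> source its credentials can reach.
    connections: dict = field(default_factory=dict)


# Hypothetical setup mirroring the example: A -> S1, S2 and B -> S3.
NAMESPACES = {
    "A": Namespace("A", {"U1"}, {"conn-s1": "S1", "conn-s2": "S2"}),
    "B": Namespace("B", {"U2"}, {"conn-s3": "S3"}),
}


def can_access(namespaces, user, namespace_name, source):
    """True if the user belongs to the namespace and one of its connections reaches the source."""
    ns = namespaces.get(namespace_name)
    if ns is None or user not in ns.users:
        return False
    return source in ns.connections.values()


if __name__ == "__main__":
    print(can_access(NAMESPACES, "U1", "A", "S1"))  # U1 works in A, and A has a connection to S1
    print(can_access(NAMESPACES, "U1", "A", "S3"))  # A has no connection to S3
    print(can_access(NAMESPACES, "U2", "A", "S1"))  # U2 is not a user of A
```

The model leaves out the caveat from the reply above: a user who already knows credentials for another source can still type them into a source plugin directly.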

Regards,
Albert

Harshad Thombare

Sep 23, 2022, 1:42:23 AM
to CDAP User
Hi Albert,

Thank you for your response. As you mentioned, ADLS is not supported for wrangling, and the current plugin implementation can be extended to support it. We would like to contribute by enabling Microsoft Azure Blob support for wrangling. We have started looking at the code base and have a few questions about it.
  1. Is there any documentation of the code design, specifically for developing a plugin?
  2. Can you share a reference to the code where the plugin code (e.g. S3Connector) is invoked?
  3. We noticed that the Azure plugin implementation is quite different from the S3 and GCP implementations. Is there an intentional reason for the difference in design between these plugins?

Some inputs on this would be very helpful. Thank you.

Albert Shau

Sep 23, 2022, 12:56:05 PM
to cdap...@googlegroups.com
Hi Harshad,

Thanks for your interest, these are great questions! Do note that in order to contribute, you will need to sign a Contributor License Agreement (CLA -- cla.developers.google.com) otherwise the github PR will be automatically blocked from merging. 

1. https://cdap.atlassian.net/wiki/spaces/DOCS/pages/480313897/Developing+Plugins+Guide has information about developing pipeline plugins, and https://cdap.atlassian.net/wiki/spaces/DOCS/pages/480379172/Plugins has more information about the CDAP plugin system. 

2. Connection related code for browsing can be found at https://github.com/cdapio/cdap/blob/025fa21e0929d184f0df0bc5ada648a5ad229baa/cdap-app-templates/cdap-etl/cdap-data-pipeline-base/src/main/java/io/cdap/cdap/datapipeline/service/ConnectionHandler.java#L331. You won't need to modify anything there, but can take a look to see how the system is using the Connection API. In terms of implementing the plugin, it would be best to look at the GCS and S3 plugins as examples.

3. There is no intentional difference, this is only due to the fact that we haven't had time to maintain the Azure plugin so it is out of date. If we were to update it, we would have it follow the same pattern as the S3 and GCS implementation, where they extend the AbstractFileSource, and also implement a Connector so that it can be used in Wrangler to browse and sample. 

Best Regards,
Albert

SN Chakravarthy

Sep 27, 2022, 8:28:58 AM
to CDAP User
Hi Albert, 

Thanks for the response (Harshad and I are from the same team). 
Wanted your input on one of the points that we mentioned earlier regarding restricting user access to a specified namespace.

We have been trying to follow the https://cloud.google.com/data-fusion/docs/concepts/rbac page to get the desired result. However, it does not seem to work.

Our requirement:
  1. Provision one Cloud Data Fusion instance.
  2. Create two namespaces (N1 and N2).
  3. Give user1 dedicated access to N1, so that they can only see namespace N1 (and cannot view other namespaces, i.e. N2) and can create and run pipelines in it.

Can you help us figure out how we can achieve this?

Thank you!
Chakravarthy


Harshad Thombare

Sep 29, 2022, 5:01:16 AM
to CDAP User
Hi Albert,

As discussed earlier, we tried to implement an Azure connector with the basic ability to browse blob containers. We were able to deploy the plugin jar on CDAP version 6.6.0, but the same jar didn't work on 6.7.1. The Wrangler connections UI starts breaking.

[Attachment: Screenshot 2022-09-29 at 2.24.31 PM.png]

Do we need to make any specific changes to make it compatible with the latest CDAP version? Can you please help us with this?

Regards,
Harshad Thombare

Albert Shau

Sep 30, 2022, 5:57:18 PM
to cdap...@googlegroups.com
There's not much to go on in those screenshots. Is there anything in the cdap.log or cdap-debug.log files? It could be some response the UI is not expecting. Do you have version 6.7.1 of the relevant CDAP dependencies in your pom.xml?
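
One quick way to answer the pom.xml question is a small script that lists the CDAP dependencies and flags any that are not on the expected version. This is a minimal sketch under a few assumptions: it treats `io.cdap.cdap` as the group ID of the core CDAP artifacts, it only resolves `${...}` properties defined in the same pom (not in a parent), and the sample pom in the usage is made up. A dependency with no explicit version is reported as a mismatch so that it gets checked by hand.

```python
import re
import xml.etree.ElementTree as ET

POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}
CDAP_GROUP = "io.cdap.cdap"  # assumption: group ID of the core CDAP artifacts


def cdap_dependency_versions(pom_xml):
    """Map 'groupId:artifactId' -> version for every CDAP dependency in the pom text."""
    root = ET.fromstring(pom_xml)

    # Collect <properties> so that ${cdap.version}-style references can be resolved.
    props = {}
    props_el = root.find("m:properties", POM_NS)
    if props_el is not None:
        for child in props_el:
            # Strip the "{namespace}" prefix ElementTree puts on tag names.
            props[child.tag.split("}", 1)[-1]] = (child.text or "").strip()

    versions = {}
    for dep in root.iterfind(".//m:dependency", POM_NS):
        group = dep.findtext("m:groupId", default="", namespaces=POM_NS).strip()
        if group != CDAP_GROUP:
            continue
        artifact = dep.findtext("m:artifactId", default="", namespaces=POM_NS).strip()
        version = dep.findtext("m:version", default="", namespaces=POM_NS).strip()
        match = re.fullmatch(r"\$\{(.+)\}", version)
        if match:
            version = props.get(match.group(1), version)
        versions[f"{group}:{artifact}"] = version
    return versions


def mismatched_versions(pom_xml, expected):
    """Return only the CDAP dependencies whose version differs from the expected one."""
    found = cdap_dependency_versions(pom_xml)
    return {name: version for name, version in found.items() if version != expected}


if __name__ == "__main__":
    # Made-up pom for illustration; read your real pom.xml from disk instead.
    sample = """<project xmlns="http://maven.apache.org/POM/4.0.0">
      <properties><cdap.version>6.6.0</cdap.version></properties>
      <dependencies>
        <dependency>
          <groupId>io.cdap.cdap</groupId>
          <artifactId>cdap-etl-api</artifactId>
          <version>${cdap.version}</version>
        </dependency>
      </dependencies>
    </project>"""
    print(mismatched_versions(sample, "6.7.1"))
```

Running `mvn dependency:tree` remains the authoritative check, since it also shows versions pulled in transitively or from a parent pom.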

Harshad Thombare

Oct 3, 2022, 1:43:33 PM
to CDAP User
Hi Albert,

I tried updating the CDAP dependencies, but no luck. I also checked the cdap.log and cdap-debug.log files and didn't find any errors.

Attaching the dependency tree for your reference.
[Attachment: dependency_tree]