impersonating hdfs super user


Anirudh Kala

Mar 2, 2016, 11:29:41 AM
to sdc-user
Hi there,

I am trying to connect to my Hadoop server using StreamSets, but I am unable to do so because of the following error:

Permission Denied: Data Collector user cannot impersonate the hadoop user. I made all the changes to core-site.xml, but still no success.

Regards
Anirudh Kala

Adam Kunicki

Mar 2, 2016, 12:46:29 PM
to Anirudh Kala, sdc-user
Hello Anirudh

As you mentioned, the user that SDC runs as must be allowed as a proxy user for the "hadoop" user in core-site.xml. Keep in mind that updating core-site.xml requires restarting your HDFS cluster so that the NameNodes and DataNodes pick up the change. This is a change to the HDFS daemons, not simply a client configuration change.
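For reference, the proxy-user entries in core-site.xml typically look like the following. This is a sketch that assumes the data collector runs as the unix user 'sdc'; substitute your own user, and narrow the wildcard values in production:

```xml
<!-- core-site.xml on the NameNode: allow the user 'sdc' (an assumed
     example user) to impersonate other users. '*' is permissive;
     restrict hosts and groups in a production cluster. -->
<property>
  <name>hadoop.proxyuser.sdc.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.sdc.groups</name>
  <value>*</value>
</property>
```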

Have you restarted those services?

-Adam
--
You received this message because you are subscribed to the Google Groups "sdc-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sdc-user+u...@streamsets.com.
Visit this group at https://groups.google.com/a/streamsets.com/group/sdc-user/.


Anirudh Kala

Mar 2, 2016, 10:10:19 PM
to Adam Kunicki, sdc-user
Yes Adam, I did.
--
thanks
Anirudh
Data Scientist - Big Data Analytics
Visit me at :
www.anirudhkala.com

Jose Rejas Hernan

May 9, 2016, 11:33:19 AM
to sdc-user, aniru...@gmail.com


Hello everyone,

I have a similar problem with the HDFS origin. I tested it with the "cloudera-scm" user and the validation succeeded.

However, when I execute the pipeline, I get the following log error:

2016-05-09 15:19:42,361 [user:*?] [pipeline:PrimerTutorialcopy] [thread:runner-pool-2-thread-1] DEBUG ClusterProviderImpl - Waiting for application id, elapsed seconds: 6
2016-05-09 15:19:43,362 [user:*?] [pipeline:PrimerTutorialcopy] [thread:runner-pool-2-thread-1] DEBUG ClusterProviderImpl - Waiting for application id, elapsed seconds: 7

And finally:

 [user:*?] [pipeline:PrimerTutorialcopy] [thread:runner-pool-2-thread-1] DEBUG FilePipelineStateStore - Changing state of pipeline 'PrimerTutorialcopy','0','anonymous' to 'START_ERROR' in execution mode: 'CLUSTER_BATCH';status msg is 'Unexpected error starting pipeline: java.lang.IllegalStateException: Timed out after waiting 121 seconds for for cluster application to start. Submit command is not alive.


It seems that I'm not able to get a response from the HDFS server. I checked the URI and it was correct, so I couldn't execute even a simple pipeline with an HDFS origin and a LocalFS destination.

In spite of that, if I preview the pipeline, it does get the data from the HDFS file (see previewHDFSdata.png).

Could you please help me? How should I configure core-site.xml on the HDFS server? Could you show me an example?

Regards
previewHDFSData.PNG

Werner Lamprecht

Jul 19, 2016, 8:29:00 AM
to sdc-user
Hi,

How did you solve this problem?

Werner

Alejandro Abdelnur

Jul 19, 2016, 10:31:31 AM
to Werner Lamprecht, sdc-user
Hi Werner,

If your Hadoop cluster is Kerberized, you must have a Kerberos service principal for the data collector, typically sdc/<HOST> (where HOST is the hostname where the data collector runs), and the data collector user name for Hadoop is 'sdc'.

If your Hadoop cluster is not Kerberized, the data collector user name for Hadoop is the unix user that started the data collector. This could be 'sdc' if you are running it as a service, or your own user name otherwise.

Please determine your data collector user name for Hadoop. For the remainder of this email I'll refer to it as user 'foo'.

In the Hadoop FS destination, if you want to impersonate a different Hadoop user than the one running the data collector (user 'foo'), set the 'HDFS User' in the 'Hadoop FS' tab to the desired user. That is all you have to do in the data collector.

Next, you'll have to configure the HDFS NameNode to allow the data collector user (user 'foo') to be a proxy user for other users. You do that by setting the following properties in the core-site.xml of your NameNode:

hadoop.proxyuser.foo.hosts=*
hadoop.proxyuser.foo.groups=*

Remember, this is assuming your data collector is using the Hadoop user name 'foo'.

Once you make those changes, you need to restart the Namenode. 

Then you should be all set.

If you are running a production setup, make sure you configure the proxyuser properties above as restrictively as possible for your usage (instead of using '*', which means ALL).
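As a sketch, a more restrictive pair of core-site.xml entries might look like the following; the hostname and group name are hypothetical examples, not values from this thread:

```xml
<!-- Only allow impersonation requests from the host running the data
     collector, and only for users in the 'etl' group (example values). -->
<property>
  <name>hadoop.proxyuser.foo.hosts</name>
  <value>sdc-host.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.foo.groups</name>
  <value>etl</value>
</property>
```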

NOTE: If you leave the 'HDFS User' configuration of the 'Hadoop FS' destination empty, your pipeline will interact with HDFS as the Hadoop user running the data collector (user 'foo').

Hope this helps.

Alejandro

