Use case: XML file ingestion from SFTP to HDFS


Sadok Ben Yahia

Feb 3, 2016, 11:25:43 AM
to gobblin-users
As the title of this post states, I want to transfer XML files from an SFTP server to Hadoop HDFS with Gobblin.
I am really new to these things, which is why I will be asking a lot of questions.
1- Is that possible?
2- When I try to build Gobblin on my Ubuntu VM, I get the following error:

$ ./gradlew clean build -PuseHadoop2        
Parallel execution with configuration on demand is an incubating feature.
From https://github.com/linkedin/gobblin
 * branch            master     -> FETCH_HEAD
Using latest tag for version: gobblin_0.6.2-4-g1f0109a
name=gobblin group=com.linkedin.gobblin
project.version=0.6.2-4-g1f0109a

FAILURE: Build failed with an exception.

* Where:
Build file '/home/sbyahya/downloads/gobblin/build.gradle' line: 393

* What went wrong:
A problem occurred evaluating project ':gobblin-admin'.
> Cannot invoke method getURLs() on null object

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.

BUILD FAILED

Total time: 2.815 secs



Sahil Takiar

Feb 3, 2016, 4:12:55 PM
to Sadok Ben Yahia, gobblin-users
For your build error, take a look at this related GitHub issue: https://github.com/linkedin/gobblin/issues/615

SFTP ingestion is possible; unfortunately, it does not seem to be well documented on the wiki. Let me see if we can get that fixed.

--
You received this message because you are subscribed to the Google Groups "gobblin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gobblin-user...@googlegroups.com.
To post to this group, send email to gobbli...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gobblin-users/11985d06-53b9-43b6-90a8-29617f1dd78e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Sahil Takiar

Feb 3, 2016, 4:40:46 PM
to Sadok Ben Yahia, Pradhan Cadabam, gobblin-users
Pradhan has provided me a sample .pull file you can use for SFTP ingestion:

job.name=SftpDistcp
job.group=Distcp
job.description=Job to copy data from sftp to hdfs

# Source properties
source.filebased.fs.uri=sftp:///my.hostname.com:2222
source.class=gobblin.data.management.copy.CloseableFsCopySource
source.conn.private.key=/path/to/id.rsa
source.conn.username=mySftpUsername
source.conn.host=my.hostname.com
source.conn.port=2222

# Dataset properties
gobblin.dataset.pattern=/path/to/directory/to/be/copied/on/sftp

# Publisher properties
#data.publisher.type=gobblin.data.management.copy.publisher.CopyDataPublisher
data.publisher.final.dir=/path/to/destination/on/HDFS

# Writer properties
writer.builder.class=gobblin.data.management.copy.writer.FileAwareInputStreamDataWriterBuilder

Issac Buenrostro

Feb 3, 2016, 4:48:24 PM
to Sahil Takiar, Sadok Ben Yahia, Pradhan Cadabam, gobblin-users
Note that the above is simply a copy job: it will transfer the files but not interpret any XML. If what you want is to process XML records, then you'll need to write an XML source, but you can leverage the SFTP connection code from the above job.
Could you give some more detail on what you are trying to do?
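
For contrast with a plain copy, here is a minimal Python sketch of what "interpreting XML records" means. It is not a Gobblin source (Gobblin sources are written in Java against Gobblin's own interfaces); the `<order>` element name and the `iter_records` helper are assumptions made up for this illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(stream, tag):
    """Yield the attributes of each <tag> element as the stream is parsed."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == tag:
            yield dict(elem.attrib)
            elem.clear()  # release the content of records already processed


# Hypothetical sample document; a real source would read from the SFTP stream.
xml = b'<orders><order id="1"/><order id="2"/></orders>'
records = list(iter_records(io.BytesIO(xml), "order"))
print(records)
```

A copy job skips this step entirely and only moves the bytes; record-level parsing is needed only if the job itself has to understand the XML.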

Sadok Ben Yahia

Feb 3, 2016, 6:07:07 PM
to gobblin-users, sta...@linkedin.com, ben.yah...@gmail.com, pcad...@linkedin.com
I am working on a project that aims to do data mining on XML files with the help of Hadoop (and the tools that exist in the Hadoop ecosystem). We are following the ELT process, and my job is to periodically extract these XML files from our SFTP server and load them into HDFS. I have tried Kafka and Flume, but without success; each tool has its own problems. So for now I want to test Gobblin to find out whether it can fulfill this objective.

Pradhan Cadabam

Feb 3, 2016, 6:31:10 PM
to Sadok Ben Yahia, gobblin-users, Sahil Takiar
Hi Sadok,
Based on your description, it seems like the example SFTP configs Sahil provided should work for you. They will help you set up a pipeline to copy data from your SFTP server to HDFS. Once you have the XML files on HDFS, you can use other tools in the Hadoop ecosystem to analyze your data.
 
--
- Pradhan

Sadok Ben Yahia

Feb 3, 2016, 6:38:54 PM
to gobblin-users, ben.yah...@gmail.com, pcad...@linkedin.com
Thank you both, Sahil and Pradhan. I will use this sample .pull file in my work.

Sadok Ben Yahia

Feb 3, 2016, 6:48:24 PM
to gobblin-users, ben.yah...@gmail.com, sta...@linkedin.com
Is it possible to extract each XML file and load it into HDFS exactly as it existed on the SFTP server, or will there be some serialization/deserialization problem? I ask because with Flume I had the problem that it reads each line of each XML file as an event and loads the events into HDFS when the buffer is full, so I lose my original XML file structure.

Issac Buenrostro

Feb 3, 2016, 6:50:57 PM
to Sadok Ben Yahia, gobblin-users, Sahil Takiar
Hi Sadok,
You will have no problem with ser/de. The setup suggested by Pradhan and Sahil is simply a byte transfer, unaware of any file formats.
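
To illustrate the difference, here is a minimal Python sketch. It is not Gobblin's implementation, and the function names `copy_bytes` and `line_events` are made up for the example. A byte-level copy streams the file in chunks and never parses it, so the destination is identical to the source, whereas a line-oriented reader turns one XML document into many separate events.

```python
import hashlib
import io
import shutil


def copy_bytes(src, dst, chunk_size=64 * 1024):
    """Copy a binary stream chunk by chunk without interpreting its content."""
    shutil.copyfileobj(src, dst, chunk_size)


def line_events(src):
    """Split a binary stream into one event per line, as a line-based reader would."""
    return src.read().splitlines()


# Hypothetical sample document standing in for a file on the SFTP server.
xml = b'<?xml version="1.0"?>\n<orders>\n  <order id="1"/>\n</orders>\n'

dst = io.BytesIO()
copy_bytes(io.BytesIO(xml), dst)
copied = dst.getvalue()

# The byte copy hashes to the same value as the source: nothing was altered.
same = hashlib.sha256(copied).digest() == hashlib.sha256(xml).digest()

# The line-based view yields independent fragments instead of one document.
events = line_events(io.BytesIO(xml))
print(same, len(events))
```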

Sadok Ben Yahia

Feb 4, 2016, 4:50:53 AM
to gobblin-users
Hello everybody,

I got a private server today; it is faster than my laptop.
But now I get this error when building Gobblin with -PuseHadoop2 -PhadoopVersion=2.7.1 -x test, and also without -PhadoopVersion=2.7.1:

$ ./gradlew clean build -PuseHadoop2 -x test

Parallel execution with configuration on demand is an incubating feature.
From https://github.com/linkedin/gobblin
 * branch            master     -> FETCH_HEAD
Using latest tag for version: gobblin_0.6.2-4-g1f0109a
name=gobblin group=com.linkedin.gobblin
project.version=0.6.2-4-g1f0109a

FAILURE: Build failed with an exception.

* Where:
Build file '/home/sbyahya/downloads/gobblin/build.gradle' line: 393

* What went wrong:
A problem occurred evaluating project ':gobblin-admin'.
> Cannot invoke method getURLs() on null object

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output.

BUILD FAILED

Total time: 2.5 secs

Can anyone help?

Sadok Ben Yahia

Feb 4, 2016, 6:09:51 AM
to gobblin-users
Here is the detailed error that I got when trying to build Gobblin:
-------
$ ./gradlew clean build -PuseHadoop2 -x test --stacktrace

Parallel execution with configuration on demand is an incubating feature.
From https://github.com/linkedin/gobblin
 * branch            master     -> FETCH_HEAD
Using latest tag for version: gobblin_0.6.2-4-g1f0109a
name=gobblin group=com.linkedin.gobblin
project.version=0.6.2-4-g1f0109a

FAILURE: Build failed with an exception.

* Where:
Build file '/home/sbyahya/downloads/gobblin/build.gradle' line: 393

* What went wrong:
A problem occurred evaluating project ':gobblin-admin'.
> Cannot invoke method getURLs() on null object

* Try:
Run with --info or --debug option to get more log output.

* Exception is:
org.gradle.api.GradleScriptException: A problem occurred evaluating project ':gobblin-admin'.
    at org.gradle.groovy.scripts.internal.DefaultScriptRunnerFactory$ScriptRunnerImpl.run(DefaultScriptRunnerFactory.java:54)
    at org.gradle.configuration.DefaultScriptPluginFactory$ScriptPluginImpl.apply(DefaultScriptPluginFactory.java:152)
    at org.gradle.configuration.project.BuildScriptProcessor.execute(BuildScriptProcessor.java:40)
    at org.gradle.configuration.project.BuildScriptProcessor.execute(BuildScriptProcessor.java:26)
    at org.gradle.configuration.project.ConfigureActionsProjectEvaluator.evaluate(ConfigureActionsProjectEvaluator.java:34)
    at org.gradle.configuration.project.LifecycleProjectEvaluator.evaluate(LifecycleProjectEvaluator.java:55)
    at org.gradle.api.internal.project.AbstractProject.evaluate(AbstractProject.java:493)
    at org.gradle.api.internal.project.AbstractProject.evaluate(AbstractProject.java:80)
    at org.gradle.execution.TaskPathProjectEvaluator.evaluateByPath(TaskPathProjectEvaluator.java:43)
    at org.gradle.execution.ProjectEvaluatingAction.configure(ProjectEvaluatingAction.java:50)
    at org.gradle.execution.DefaultBuildExecuter.configure(DefaultBuildExecuter.java:42)
    at org.gradle.execution.DefaultBuildExecuter.select(DefaultBuildExecuter.java:35)
    at org.gradle.initialization.DefaultGradleLauncher.doBuildStages(DefaultGradleLauncher.java:155)
    at org.gradle.initialization.DefaultGradleLauncher.doBuild(DefaultGradleLauncher.java:113)
    at org.gradle.initialization.DefaultGradleLauncher.run(DefaultGradleLauncher.java:81)
    at org.gradle.launcher.exec.InProcessBuildActionExecuter$DefaultBuildController.run(InProcessBuildActionExecuter.java:64)
    at org.gradle.launcher.cli.ExecuteBuildAction.run(ExecuteBuildAction.java:33)
    at org.gradle.launcher.cli.ExecuteBuildAction.run(ExecuteBuildAction.java:24)
    at org.gradle.launcher.exec.InProcessBuildActionExecuter.execute(InProcessBuildActionExecuter.java:35)
    at org.gradle.launcher.daemon.server.exec.ExecuteBuild.doBuild(ExecuteBuild.java:45)
    at org.gradle.launcher.daemon.server.exec.BuildCommandOnly.execute(BuildCommandOnly.java:34)
    at org.gradle.launcher.daemon.server.exec.DaemonCommandExecution.proceed(DaemonCommandExecution.java:125)
    at org.gradle.launcher.daemon.server.exec.WatchForDisconnection.execute(WatchForDisconnection.java:42)
    at org.gradle.launcher.daemon.server.exec.DaemonCommandExecution.proceed(DaemonCommandExecution.java:125)
    at org.gradle.launcher.daemon.server.exec.ResetDeprecationLogger.execute(ResetDeprecationLogger.java:24)
    at org.gradle.launcher.daemon.server.exec.DaemonCommandExecution.proceed(DaemonCommandExecution.java:125)
    at org.gradle.launcher.daemon.server.exec.StartStopIfBuildAndStop.execute(StartStopIfBuildAndStop.java:33)
    at org.gradle.launcher.daemon.server.exec.DaemonCommandExecution.proceed(DaemonCommandExecution.java:125)
    at org.gradle.launcher.daemon.server.exec.ReturnResult.execute(ReturnResult.java:34)
    at org.gradle.launcher.daemon.server.exec.DaemonCommandExecution.proceed(DaemonCommandExecution.java:125)
    at org.gradle.launcher.daemon.server.exec.ForwardClientInput$2.call(ForwardClientInput.java:71)
    at org.gradle.launcher.daemon.server.exec.ForwardClientInput$2.call(ForwardClientInput.java:69)
    at org.gradle.util.Swapper.swap(Swapper.java:38)
    at org.gradle.launcher.daemon.server.exec.ForwardClientInput.execute(ForwardClientInput.java:69)
    at org.gradle.launcher.daemon.server.exec.DaemonCommandExecution.proceed(DaemonCommandExecution.java:125)
    at org.gradle.launcher.daemon.server.exec.LogToClient.doBuild(LogToClient.java:60)
    at org.gradle.launcher.daemon.server.exec.BuildCommandOnly.execute(BuildCommandOnly.java:34)
    at org.gradle.launcher.daemon.server.exec.DaemonCommandExecution.proceed(DaemonCommandExecution.java:125)
    at org.gradle.launcher.daemon.server.exec.EstablishBuildEnvironment.doBuild(EstablishBuildEnvironment.java:60)
    at org.gradle.launcher.daemon.server.exec.BuildCommandOnly.execute(BuildCommandOnly.java:34)
    at org.gradle.launcher.daemon.server.exec.DaemonCommandExecution.proceed(DaemonCommandExecution.java:125)
    at org.gradle.launcher.daemon.server.exec.StartBuildOrRespondWithBusy$1.run(StartBuildOrRespondWithBusy.java:45)
    at org.gradle.launcher.daemon.server.DaemonStateCoordinator.runCommand(DaemonStateCoordinator.java:186)
    at org.gradle.launcher.daemon.server.exec.StartBuildOrRespondWithBusy.doBuild(StartBuildOrRespondWithBusy.java:49)
    at org.gradle.launcher.daemon.server.exec.BuildCommandOnly.execute(BuildCommandOnly.java:34)
    at org.gradle.launcher.daemon.server.exec.DaemonCommandExecution.proceed(DaemonCommandExecution.java:125)
    at org.gradle.launcher.daemon.server.exec.HandleStop.execute(HandleStop.java:36)
    at org.gradle.launcher.daemon.server.exec.DaemonCommandExecution.proceed(DaemonCommandExecution.java:125)
    at org.gradle.launcher.daemon.server.exec.DaemonHygieneAction.execute(DaemonHygieneAction.java:39)
    at org.gradle.launcher.daemon.server.exec.DaemonCommandExecution.proceed(DaemonCommandExecution.java:125)
    at org.gradle.launcher.daemon.server.exec.CatchAndForwardDaemonFailure.execute(CatchAndForwardDaemonFailure.java:32)
    at org.gradle.launcher.daemon.server.exec.DaemonCommandExecution.proceed(DaemonCommandExecution.java:125)
    at org.gradle.launcher.daemon.server.exec.DefaultDaemonCommandExecuter.executeCommand(DefaultDaemonCommandExecuter.java:51)
    at org.gradle.launcher.daemon.server.DefaultIncomingConnectionHandler$ConnectionWorker.handleCommand(DefaultIncomingConnectionHandler.java:155)
    at org.gradle.launcher.daemon.server.DefaultIncomingConnectionHandler$ConnectionWorker.receiveAndHandleCommand(DefaultIncomingConnectionHandler.java:128)
    at org.gradle.launcher.daemon.server.DefaultIncomingConnectionHandler$ConnectionWorker.run(DefaultIncomingConnectionHandler.java:116)
    at org.gradle.internal.concurrent.DefaultExecutorFactory$StoppableExecutorImpl$1.run(DefaultExecutorFactory.java:64)
Caused by: java.lang.NullPointerException: Cannot invoke method getURLs() on null object
    at build_5fog9nuua7s6tlhblsb8qg02ac$_run_closure11_closure33_closure35_closure46.doCall(/home/sbyahya/downloads/gobblin/build.gradle:393)
    at org.gradle.api.internal.ClosureBackedAction.execute(ClosureBackedAction.java:58)
    at org.gradle.util.ConfigureUtil.configure(ConfigureUtil.java:130)
    at org.gradle.util.ConfigureUtil.configure(ConfigureUtil.java:91)
    at org.gradle.api.internal.project.AbstractProject.dependencies(AbstractProject.java:912)
    at org.gradle.api.internal.BeanDynamicObject$GroovyObjectAdapter.invokeMethod(BeanDynamicObject.java:268)
    at org.gradle.api.internal.BeanDynamicObject.invokeMethod(BeanDynamicObject.java:129)
    at org.gradle.api.internal.ConfigureDelegate.invokeMethod(ConfigureDelegate.java:69)
    at build_5fog9nuua7s6tlhblsb8qg02ac$_run_closure11_closure33_closure35.doCall(/home/sbyahya/downloads/gobblin/build.gradle:375)
    at org.gradle.api.internal.ClosureBackedAction.execute(ClosureBackedAction.java:58)
    at org.gradle.util.ConfigureUtil.configure(ConfigureUtil.java:130)
    at org.gradle.util.ConfigureUtil.configure(ConfigureUtil.java:91)
    at org.gradle.api.internal.AbstractNamedDomainObjectContainer.configure(AbstractNamedDomainObjectContainer.java:68)
    at org.gradle.api.internal.AbstractNamedDomainObjectContainer.configure(AbstractNamedDomainObjectContainer.java:24)
    at org.gradle.api.internal.project.AbstractProject.configurations(AbstractProject.java:904)
    at build_5fog9nuua7s6tlhblsb8qg02ac$_run_closure11_closure33.doCall(/home/sbyahya/downloads/gobblin/build.gradle:373)
    at org.gradle.api.internal.ClosureBackedAction.execute(ClosureBackedAction.java:58)
    at org.gradle.internal.Actions$FilteredAction.execute(Actions.java:203)
    at org.gradle.listener.ActionBroadcast.execute(ActionBroadcast.java:39)
    at org.gradle.api.internal.DefaultDomainObjectCollection.doAdd(DefaultDomainObjectCollection.java:164)
    at org.gradle.api.internal.DefaultDomainObjectCollection.add(DefaultDomainObjectCollection.java:159)
    at org.gradle.api.internal.plugins.DefaultPluginContainer.addPluginInternal(DefaultPluginContainer.java:69)
    at org.gradle.api.internal.plugins.DefaultPluginContainer.apply(DefaultPluginContainer.java:34)
    at org.gradle.api.internal.plugins.DefaultObjectConfigurationAction.applyPlugin(DefaultObjectConfigurationAction.java:116)
    at org.gradle.api.internal.plugins.DefaultObjectConfigurationAction.access$200(DefaultObjectConfigurationAction.java:36)
    at org.gradle.api.internal.plugins.DefaultObjectConfigurationAction$3.run(DefaultObjectConfigurationAction.java:85)
    at org.gradle.api.internal.plugins.DefaultObjectConfigurationAction.execute(DefaultObjectConfigurationAction.java:129)
    at org.gradle.api.internal.project.AbstractPluginAware.apply(AbstractPluginAware.java:41)
    at org.gradle.api.Project$apply.call(Unknown Source)
    at org.gradle.api.internal.project.ProjectScript.apply(ProjectScript.groovy:34)
    at org.gradle.api.Script$apply.callCurrent(Unknown Source)
    at build_6s40jau7flrr7jbsui9nu7u453.run(/home/sbyahya/downloads/gobblin/gobblin-admin/build.gradle:12)
    at org.gradle.groovy.scripts.internal.DefaultScriptRunnerFactory$ScriptRunnerImpl.run(DefaultScriptRunnerFactory.java:52)
    ... 56 more


BUILD FAILED

Total time: 2.141 secs
-------


On Wednesday, February 3, 2016 at 17:25:43 UTC+1, Sadok Ben Yahia wrote:

Sahil Takiar

Feb 9, 2016, 11:01:44 PM
to Sadok Ben Yahia, gobblin-users
Check out this related GitHub issue: https://github.com/linkedin/gobblin/issues/615


Sadok Ben Yahia

Feb 10, 2016, 1:51:43 AM
to gobblin-users, ben.yah...@gmail.com
@Sahil Takiar Thank you, I have already solved the problem. Indeed, JCE was the solution. I will post a message when I am finished with my use case.

Saurabh Paliwal

May 5, 2016, 3:45:29 PM
to gobblin-users, ben.yah...@gmail.com
Hello @Sahil and @Sadok. I have exactly the same use case; unfortunately, I had to work through a lot of issues to finally arrive at that job config file, only to find that there is already a thread here.
Anyway, my problem is that if gobblin.dataset.pattern is used for finding datasets, the datasetUrn is a long string that also contains "/". So when the state for a job/task is persisted, it is not saved in the usual state store directory but in a really odd path inside it.

for example -> ~/work-dir-gobblin/state-store/SftpMove/CopyableFile.DatasetAndPartition\(dataset=CopyableDatasetMetadata\(datasetRoot=/home/saurabh/Desktop/new_queries\,\ datasetTargetRoot=/home/saurabh/work-dir-gobblin/job-output/sftpmove\)\,\ partition=/home/saurabh/Desktop/
Now, when the next run happens, it doesn't see any state persisted for that job, and all the files are re-downloaded.
Thanks in advance.
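
The effect described above can be reproduced outside Gobblin. The following Python sketch is an illustration only: it is not Gobblin's state store code, and `state_path`, `sanitize_urn`, the `.state` suffix, and the sample paths are hypothetical names chosen for the example. It shows how a dataset URN that contains "/" turns into nested directories when it is joined onto a store directory, and how replacing the separators keeps the state file directly under the job's directory.

```python
from pathlib import PurePosixPath


def state_path(store_root, job_name, dataset_urn):
    """Join the URN onto the store path as-is; every "/" starts a new directory."""
    return PurePosixPath(store_root) / job_name / (dataset_urn + ".state")


def sanitize_urn(dataset_urn):
    """Replace path separators so the URN maps to a single file name."""
    return dataset_urn.replace("/", "_")


# Hypothetical values, loosely modeled on the path shown in the post.
store_root = "/work-dir-gobblin/state-store"
urn = "dataset=/home/user/new_queries"

job_dir = PurePosixPath(store_root) / "SftpMove"
raw = state_path(store_root, "SftpMove", urn)
flat = state_path(store_root, "SftpMove", sanitize_urn(urn))

# Number of path components below the job directory in each case.
raw_depth = len(raw.relative_to(job_dir).parts)
flat_depth = len(flat.relative_to(job_dir).parts)
print(raw)
print(flat)
print(raw_depth, flat_depth)
```

Whether Gobblin itself should sanitize the URN, or whether the lookup on the next run should follow the nested path, is a question for the maintainers; the sketch only shows why the write location and the expected location can differ.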