Defining a job for a daily pull

lbe...@gmail.com

Apr 13, 2015, 11:03:02 AM

to gobbli...@googlegroups.com

Hi,

Let's assume that a full dump of a table was done using the following settings:
...
extract.is.full=true
#can we use snapshot_append here as defined in one of the examples?
extract.table.type=snapshot_only
source.querybased.extract.type=snapshot
...

Now if I want to have a daily load, what's the way to define that job?
According to the documentation we can have two options:

1. append:
...
source.querybased.extract.type=append_daily
extract.table.type=append_only
source.querybased.append.max.watermark.limit=CURRENTDATE-1
...

2. incremental:
...
extract.table.type=snapshot_append
source.querybased.extract.type=snapshot
source.querybased.low.watermark.backup.secs=86400
...

What is the difference between the two, and when should each be used?

On a fresh project checkout (built with the Hadoop2 profile) I wasn't able to run either of them because of the following exception:
ERROR [AbstractJobLauncher] Failed to get work units for job job_mysql_import_mytable_1428935941024
java.lang.IllegalArgumentException: Invalid format: "0" is too short
    at org.joda.time.format.DateTimeFormatter.parseDateTime(DateTimeFormatter.java:673)
    at gobblin.source.extractor.utils.Utils.toDateTime(Utils.java:287)
    at gobblin.source.extractor.utils.Utils.toDateTime(Utils.java:299)
    at gobblin.source.extractor.partition.Partitioner.getAppendLowWatermark(Partitioner.java:183)
    at gobblin.source.extractor.partition.Partitioner.getLowWatermark(Partitioner.java:130)
    at gobblin.source.extractor.partition.Partitioner.getPartitions(Partitioner.java:75)
    at gobblin.source.extractor.extract.QueryBasedSource.getWorkunits(QueryBasedSource.java:67)
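
One possible reading of the trace, offered as a sketch rather than a diagnosis: the frames show getAppendLowWatermark handing the string "0" to a date parser, which rejects it as too short. The snippet below reproduces that kind of failure with the JDK's java.time instead of Joda-Time, so it runs without extra dependencies. The class name, the parses helper and the yyyyMMddHHmmss pattern are assumptions made for illustration; the pattern Gobblin actually uses is not shown in this thread.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class WatermarkParseDemo {
    // Assumed timestamp layout (hypothetical); not taken from the Gobblin source.
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    /** Returns true if the text parses as a full timestamp in the assumed layout. */
    static boolean parses(String watermark) {
        try {
            LocalDateTime.parse(watermark, FORMAT);
            return true;
        } catch (DateTimeParseException e) {
            // The text does not match the pattern, e.g. it has too few digits.
            return false;
        }
    }

    public static void main(String[] args) {
        // A complete 14-digit value matches the assumed pattern.
        System.out.println(parses("20150413000000"));
        // A bare "0" is far shorter than the pattern requires.
        System.out.println(parses("0"));
    }
}
```

If this reading is right, the thing to check is where the "0" watermark comes from, for example whether a previous run left any state for the job to read.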

I noticed that the full dump didn't produce a task state file (.tst) under /state-store in HDFS.
Is there anything here I'm not aware of?

Thanks,
Lorand

Yinan Li

Apr 13, 2015, 1:28:15 PM

to gobbli...@googlegroups.com

Lorand,

Regarding the .tst task state file: we stopped persisting task states into .tst files because a job state already includes the task states and is itself persisted, which made the .tst files redundant. You should still see .jst files being written into the state store.