Exporting data from a CDAP Table dataset to an RDBMS based on a Table column


Shoaib Quraishi
Jun 13, 2017, 5:12:39 PM
to CDAP User
Hi all,

I am trying to export data from a CDAP Table dataset using a timestamp column. There is a CDAP plugin in Hydrator that reads an entire CDAP Table and emits structured records. Can I use this along with a JavaScript filter transform plugin to filter out records based on the timestamp column? Is this efficient? Or can I modify the Table batch source plugin to only scan the desired records (those greater than or equal to a timestamp)? Also, can this task (selecting and exporting to a DB) be achieved with Hive queries on the CDAP Table dataset, either via Hydrator or a custom CDAP application?

Thanks,

Shoaib

Vinisha Vyasa
Jun 19, 2017, 5:25:40 PM
to CDAP User
Hey Shoaib,

Q: I am trying to export data from a CDAP Table dataset using a timestamp column. There is a CDAP plugin in Hydrator that reads an entire CDAP Table and emits structured records. Can I use this along with a JavaScript filter transform plugin to filter out records based on the timestamp column? Is this efficient?
A: Yes, this can be achieved with a JavaScript transform. However, please note that the JavaScript transform only supports lookups against KeyValueTable datasets.

Q: Can I modify the Table batch source plugin to only scan the desired records (those greater than or equal to a timestamp)?
A: Could you please provide more information on how your data is structured and what the row key for the table is?

Q: Also, can this task (selecting and exporting to a DB) be achieved with Hive queries on the CDAP Table dataset, either via Hydrator or a custom CDAP application?
A: Hive queries can be executed on CDAP datasets using Explore queries. The SELECT statement can include whatever filters you want; you can then download the result file and use it to import into the relational DB.
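For example, the query could also be submitted programmatically through the Explore REST endpoint, which would let you automate this step. This is only a sketch: the host, port, namespace, the dataset_table_name Hive table name, and the timestamp literal are placeholders to adapt to your setup.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SubmitExploreQuery {
  public static void main(String[] args) throws Exception {
    // Explore exposes a Table dataset named "table_name" as the Hive table "dataset_table_name".
    String sql = "SELECT * FROM dataset_table_name WHERE time_stamp >= '1497387159000'";
    String body = "{\"query\": \"" + sql + "\"}";

    URL url = new URL("http://localhost:11015/v3/namespaces/default/data/explore/queries");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("POST");
    conn.setDoOutput(true);
    conn.setRequestProperty("Content-Type", "application/json");
    try (OutputStream out = conn.getOutputStream()) {
      out.write(body.getBytes(StandardCharsets.UTF_8));
    }
    // A 200 response carries a query handle that can then be polled for results.
    System.out.println("HTTP " + conn.getResponseCode());
  }
}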


Thanks,
Vinisha


Shoaib Quraishi
Jun 20, 2017, 9:34:30 AM
to CDAP User
Vinisha,

I am trying to export a Table dataset (not a KeyValueTable) to an RDBMS. This export has to be automated, and it would be nice to see the records being sent to the RDBMS (as with Hydrator plugins, where the record count is visible). The structure of the Table dataset is as follows:

Schema logSchema = Schema.recordOf(
    "table_name",
    Schema.Field.of("key", Schema.of(Schema.Type.STRING)),
    Schema.Field.of("col1", Schema.of(Schema.Type.STRING)),
    Schema.Field.of("col2", Schema.of(Schema.Type.STRING)),
    Schema.Field.of("col3", Schema.of(Schema.Type.STRING)),
    Schema.Field.of("time_stamp", Schema.of(Schema.Type.STRING)), // epoch timestamp
    // ... further columns elided ...
    Schema.Field.of("coln", Schema.nullableOf(Schema.of(Schema.Type.STRING)))
);

createDataset("table_name", Table.class.getName(), DatasetProperties.builder()
    .add(Table.PROPERTY_SCHEMA, logSchema.toString())
    .add(Table.PROPERTY_SCHEMA_ROW_FIELD, "key")
    .build());


Here the key is defined as col1:col2:col3. I would like to periodically export this data to an RDBMS based on the last exported timestamp, in an automated fashion. Running a manual Explore query is not an option unless it can be automated (including the DB export). The other option is to read the entire Table dataset using the Table plugin and have a JavaScript filter transform that looks up the last exported timestamp (stored in a key value table) and filters the records to be exported through a Database sink. Is this the best option, or is there something better?
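(For reference, if the JavaScript filter turns out to be limiting, I imagine the same filter could also be written as a small custom Java transform plugin. This is just a sketch: the class name is made up, the checkpoint is hard-coded here where it would really come from the key value table or plugin config, and the packages assume the CDAP 4.x ETL API.)

import co.cask.cdap.api.annotation.Name;
import co.cask.cdap.api.annotation.Plugin;
import co.cask.cdap.api.data.format.StructuredRecord;
import co.cask.cdap.etl.api.Emitter;
import co.cask.cdap.etl.api.Transform;

@Plugin(type = Transform.PLUGIN_TYPE)
@Name("TimestampFilter")
public class TimestampFilter extends Transform<StructuredRecord, StructuredRecord> {

  // Hypothetical checkpoint; a real plugin would read this from config or a lookup.
  private static final long LAST_EXPORTED = 1497387159000L;

  @Override
  public void transform(StructuredRecord input, Emitter<StructuredRecord> emitter) throws Exception {
    String ts = input.get("time_stamp"); // epoch timestamp stored as a string
    if (ts != null && Long.parseLong(ts) >= LAST_EXPORTED) {
      emitter.emit(input); // pass through only rows newer than the last export
    }
  }
}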





Vinisha Vyasa
Jun 21, 2017, 1:38:10 AM
to CDAP User
Hey Shoaib,

HBase scans work on row keys, meaning filtering is possible if the prefix of the row key is a timestamp. I am wondering how big your CDAP Table dataset is. Also, is it possible for you to construct the row key with a timestamp prefix?
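The idea is that a fixed-width timestamp prefix makes rows sort chronologically, so a range scan over [startKey, stopKey) selects a time window. A minimal sketch (the field values are made up):

// Zero-padding keeps lexicographic order consistent with numeric order.
long epochMillis = System.currentTimeMillis();
String rowKey = String.format("%017d:%s:%s:%s", epochMillis, "ARRIVED", "abc.txt", "Feed00");
// e.g. "00001498513927000:ARRIVED:abc.txt:Feed00"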

Thanks,
Vinisha


Shoaib Quraishi
Jun 26, 2017, 6:44:02 PM
to CDAP User
Vinisha,

Do you have an example I can follow for filtering out rows based on a timestamp range in the table key using a Hydrator pipeline? I can build on top of that. The key has the following format:


YYYYMMDDHHMiSSsss:EventName:FileName:FeedName

Eg:

201706260541123:ARRIVED:abc.txt:Feed00

201706260730456:DETECTED:xyz.txt:Feed00

Is it possible to get all rows of the HBase table between the above two keys?

Thanks,
Shoaib

Vinisha Vyasa
Jun 27, 2017, 6:52:07 PM
to CDAP User
Hey Shoaib,

In order to understand your use case better, could you please answer the following questions:

- What is the rate of your writes to the Table dataset? Since you are using monotonically increasing row keys, this may lead to hotspotting in HBase.
- Another option is to use a TimePartitionedFileSet for incremental data processing. This creates time-based partitions on HDFS that can be used to filter data by timestamp (see the sketch below the list). I was wondering if that would satisfy your requirements?
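A minimal sketch of the TimePartitionedFileSet approach (the dataset name "exports" and the checkpoint value are assumptions; this would run inside a program with dataset access, e.g. a MapReduce initialize()):

import co.cask.cdap.api.dataset.lib.TimePartitionDetail;
import co.cask.cdap.api.dataset.lib.TimePartitionedFileSet;
import java.util.Set;

TimePartitionedFileSet tpfs = context.getDataset("exports");
long lastExport = 1497387159000L; // illustrative checkpoint of the last export
// Only the partitions written since the last export are selected.
Set<TimePartitionDetail> fresh = tpfs.getPartitionsByTime(lastExport, System.currentTimeMillis());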

To answer your questions:

1.) For a CDAP Table dataset:
You can scan the CDAP Table based on a start and stop key by modifying BatchReadableSource.java to compute splits from a startKey and stopKey, along these lines:

// startKey and stopKey are byte[] row keys; the stop key is exclusive.
Table dataset = context.getDataset("dataset");
context.setInput(Input.ofDataset(properties.get(Properties.BatchReadableWritable.NAME),
    dataset.getSplits(numSplits, startKey, stopKey)));

2.) For a raw HBase Table:
If you are using a raw HBase table, you can configure the `hbase.mapreduce.scan.row.start` and `hbase.mapreduce.scan.row.stop` properties of TableInputFormat, for example:
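(This is a sketch using the key range from your earlier example; the table name is an assumption, and note that the stop row of an HBase scan is exclusive, so the second event itself would not be included.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;

Configuration conf = new Configuration();
conf.set(TableInputFormat.INPUT_TABLE, "table_name"); // raw HBase table name (assumed)
conf.set(TableInputFormat.SCAN_ROW_START, "201706260541123:ARRIVED:abc.txt:Feed00");
conf.set(TableInputFormat.SCAN_ROW_STOP, "201706260730456:DETECTED:xyz.txt:Feed00");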

Please let us know if you have any questions.

Thanks,
Vinisha
