The raw data generated by the TPCx-BB PDGF tool contains dirty data.


王力锋

Apr 20, 2017, 3:17:50 AM
to Big Data Benchmark for BigBench
I used the default TPCx-BB data generation tool to generate 1GB of raw data and found that the product_reviews table contains dirty data.
TPCx-BB uses "|" as the field delimiter, but in the product_reviews table some records also contain the "|" symbol inside the pr_review_content column.
For example, one record from the product_reviews table in the 1GB dataset looks like this:

22659|2005-05-09|22:38:43|3|15236|26449|29296|This product does the job if you like hard bed don't buy it.But if you do) Once you download path for it. It is totally disgusting. Besides the fact that this one is defective too, but over all don't think this is such great bed for ten days while on the market.||Exodus||

In this record, the value of the pr_review_content column itself contains the "|" symbol.
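
To see how many records are affected, a quick check along these lines may help. It is only a sketch: it assumes, based on the example record above, that product_reviews has 8 columns with pr_review_content last, and the function name `find_dirty_records` is made up for illustration.

```python
# Assumption: product_reviews has 8 columns (7 fields precede the review
# text in the example record) and lines carry no trailing delimiter.
EXPECTED_COLUMNS = 8


def find_dirty_records(lines, delimiter="|", expected_columns=EXPECTED_COLUMNS):
    """Return (line_number, field_count) for every line with extra fields."""
    dirty = []
    for line_number, line in enumerate(lines, start=1):
        # N delimiters split a line into N + 1 fields.
        field_count = line.rstrip("\n").count(delimiter) + 1
        if field_count > expected_columns:
            dirty.append((line_number, field_count))
    return dirty
```

It can be fed an open file handle for the generated product_reviews data file, since it only needs an iterable of lines.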

Michael Frank

Apr 20, 2017, 2:47:12 PM
to Big Data Benchmark for BigBench
Hi,

Yes, the review_content data is pretty dirty, as big data tends to be. Currently we are not concerned with this and rely on Hive's relaxed CSV parsing, which pretty much ignores any additional unintentional columns produced by dirty data. The review content simply gets truncated at the first extra delimiter.

Do you have a serious issue with this behaviour?

If the truncating behaviour is an issue for you, you could write a simple pre-processor. As the schema has a known, fixed number of columns, simply quote the pr_review_content field, e.g. as "<pr_review_content>", and escape any " characters occurring within the text.
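
A minimal sketch of such a pre-processor follows. It assumes that product_reviews has 8 columns with pr_review_content as the last one (as the example record suggests), and it escapes quotes by doubling them, which is just one common CSV convention; the parser that later reads the file has to be configured to honor the same quoting and escaping. The names `quote_review` and `preprocess` are hypothetical.

```python
# Assumptions (not confirmed by the benchmark kit): 8 columns in
# product_reviews, pr_review_content is the last one, and embedded
# double quotes are escaped by doubling them.
NUM_COLUMNS = 8


def quote_review(line, delimiter="|", num_columns=NUM_COLUMNS, escaped_quote='""'):
    """Wrap the last column in double quotes so embedded delimiters survive."""
    line = line.rstrip("\n")
    # Split only on the first num_columns - 1 delimiters; everything after
    # them, including stray "|" characters, stays in the last field.
    fields = line.split(delimiter, num_columns - 1)
    if len(fields) < num_columns:
        return line  # too few fields; leave the record untouched
    content = fields[-1].replace('"', escaped_quote)
    fields[-1] = '"' + content + '"'
    return delimiter.join(fields)


def preprocess(lines):
    """Yield a quoted version of every input line."""
    for line in lines:
        yield quote_review(line)
```

For example, iterating over `preprocess(open(path))` and writing each result followed by a newline produces a file in which the review text is a single quoted field.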

Cheers,
Michael