Can't Batch Ingest Parquet File from Hadoop


Chanh Le

Aug 11, 2016, 6:53:29 AM
to Druid User
Hi everyone,
I am using imply-1.3.0.

The task to create the index is below. I have already loaded the extensions:
druid.extensions.loadList=["druid-datasketches", "druid-avro-extensions", "druid-parquet-extensions", "postgresql-metadata-storage", "druid-hdfs-storage"]
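If any of these extensions are missing from dist/druid/extensions, Druid's pull-deps tool can fetch them. A minimal sketch, assuming the Imply directory layout and the community Maven coordinates io.druid.extensions.contrib:druid-parquet-extensions:0.9.1.1:

java -classpath "dist/druid/lib/*" io.druid.cli.Main tools pull-deps \
    --no-default-hadoop \
    -c "io.druid.extensions.contrib:druid-parquet-extensions:0.9.1.1"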


The task detail:

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "hdfs://master1:9000/AD_COOKIE_REPORT"
      }
    },
    "dataSchema": {
      "dataSource": "no_metrics",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {
            "column": "time",
            "format": "yyyy-mm-dd-HH"
          },
          "dimensionsSpec": {
            "dimensions": [
              "advertiser_id",
              "campaign_id",
              "payment_id",
              "creative_id",
              "website_id",
              "channel_id",
              "section_id",
              "zone_id",
              "ad_default",
              "topic_id",
              "interest_id",
              "inmarket_id",
              "audience_id",
              "os_id",
              "browser_id",
              "device_type",
              "device_id",
              "location_id",
              "age_id",
              "gender_id",
              "network_id",
              "merchant_cate",
              "userId"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec": [
        { "name": "count", "type": "count" },
        { "name": "impression", "type": "longSum", "fieldName": "impression" },
        { "name": "viewable", "type": "longSum", "fieldName": "viewable" },
        { "name": "revenue", "type": "longSum", "fieldName": "revenue" },
        { "name": "proceeds", "type": "longSum", "fieldName": "proceeds" },
        { "name": "spent", "type": "longSum", "fieldName": "spent" },
        { "name": "click_fraud", "type": "longSum", "fieldName": "click_fraud" },
        { "name": "click", "type": "longSum", "fieldName": "clickdelta" },
        { "name": "user_unique", "type": "hyperUnique", "fieldName": "userId" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "ALL",
        "intervals": ["2016-08-09/2016-08-11"]
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 5000000
      },
      "jobProperties": {}
    }
  }
}
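For reference, a spec like this is submitted to the Overlord's task endpoint. A minimal sketch, assuming the spec is saved as parquet-task.json and the Overlord listens on the quickstart default localhost:8090:

curl -X POST -H 'Content-Type: application/json' \
    -d @parquet-task.json \
    http://localhost:8090/druid/indexer/v1/task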

But it failed without raising any error or leaving a log file I can investigate:

{
  "task": "index_hadoop_no_metrics_2016-08-11T02:35:45.667Z",
  "status": {
    "id": "index_hadoop_no_metrics_2016-08-11T02:35:45.667Z",
    "status": "FAILED",
    "duration": 7385
  }
}

How do I find the logs for this task?

I already looked in /data/imply-1.3.0/var/druid/task/, but the task failed too quickly to leave a log file there.
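The Overlord can also serve a task's log over HTTP while it still has the task's metadata. A minimal sketch, assuming the Overlord listens on the quickstart default localhost:8090:

curl http://localhost:8090/druid/indexer/v1/task/index_hadoop_no_metrics_2016-08-11T02:35:45.667Z/log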


Thanks in advance.



Chanh Le

Aug 11, 2016, 11:24:49 PM
to Druid User
I included the required jars:
➜  imply-1.3.0 ll dist/druid/extensions/druid-parquet-extensions
total 16368
-rw-r--r--@ 1 giaosudau  staff     8937 Aug 12 09:13 druid-parquet-extensions-0.9.1.1.jar
-rw-r--r--@ 1 giaosudau  staff   109569 Aug 12 09:45 parquet-avro-1.8.1.jar
-rw-r--r--@ 1 giaosudau  staff   945914 Aug 12 09:36 parquet-column-1.8.1.jar
-rw-r--r--@ 1 giaosudau  staff    38604 Aug 12 09:43 parquet-common-1.8.1.jar
-rw-r--r--@ 1 giaosudau  staff   285479 Aug 12 09:53 parquet-encoding-1.8.1.jar
-rw-r--r--@ 1 giaosudau  staff   390733 Aug 12 09:47 parquet-format-2.3.1.jar
-rw-r--r--@ 1 giaosudau  staff   218076 Aug 12 09:30 parquet-hadoop-1.8.1.jar
-rw-r--r--@ 1 giaosudau  staff  1048117 Aug 12 09:53 parquet-jackson-1.8.1.jar
-rw-r--r--@ 1 giaosudau  staff  5320231 Aug 12 09:53 parquet-tools-1.8.1.jar

It works.

But I have a problem with partitioned Parquet files.

FACT_AD_STATS_DAILY/time=2016-07-16/network_id=31713/part-r-00000-5e5c7291-e1e1-462d-9cc6-7ef2d5be892f.snappy.parquet

The timestamp field comes from the folder name.
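One way to confirm what the parser actually sees is to dump the file schema with the parquet-tools jar listed above; the Parquet parser reads only the columns stored inside the file, so a time value that exists only in the partition directory name will likely be invisible to it. A minimal sketch, assuming the same hdfs://master1:9000 namenode and that the jar is run through hadoop jar:

hadoop jar parquet-tools-1.8.1.jar schema \
    hdfs://master1:9000/FACT_AD_STATS_DAILY/time=2016-07-16/network_id=31713/part-r-00000-5e5c7291-e1e1-462d-9cc6-7ef2d5be892f.snappy.parquet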

2016-08-12T03:17:33,059 ERROR [task-runner-0-priority-0] io.druid.indexing.overlord.ThreadPoolTaskRunner - Exception while running task[HadoopIndexTask{id=index_hadoop_no_metrics_2016-08-12T03:17:23.355Z, type=index_hadoop, dataSource=no_metrics}]
java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
    at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.1.jar:?]
    at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:204) ~[druid-indexing-service-0.9.1.1.jar:0.9.1.1]
    at io.druid.indexing.common.task.HadoopIndexTask.run(HadoopIndexTask.java:208) ~[druid-indexing-service-0.9.1.1.jar:0.9.1.1]
    at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:436) [druid-indexing-service-0.9.1.1.jar:0.9.1.1]
    at io.druid.indexing.overlord.ThreadPoolTaskRunner$ThreadPoolTaskRunnerCallable.call(ThreadPoolTaskRunner.java:408) [druid-indexing-service-0.9.1.1.jar:0.9.1.1]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_77]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_77]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_77]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_77]
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_77]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_77]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_77]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_77]
    at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:201) ~[druid-indexing-service-0.9.1.1.jar:0.9.1.1]
    ... 7 more
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: No buckets?? seems there is no data to index.
    at io.druid.indexer.IndexGeneratorJob.run(IndexGeneratorJob.java:211) ~[druid-indexing-hadoop-0.9.1.1.jar:0.9.1.1]
    at io.druid.indexer.JobHelper.runJobs(JobHelper.java:323) ~[druid-indexing-hadoop-0.9.1.1.jar:0.9.1.1]
    at io.druid.indexer.HadoopDruidIndexerJob.run(HadoopDruidIndexerJob.java:94) ~[druid-indexing-hadoop-0.9.1.1.jar:0.9.1.1]
    at io.druid.indexing.common.task.HadoopIndexTask$HadoopIndexGeneratorInnerProcessing.runTask(HadoopIndexTask.java:261) ~[druid-indexing-service-0.9.1.1.jar:0.9.1.1]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_77]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_77]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_77]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_77]
    at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:201) ~[druid-indexing-service-0.9.1.1.jar:0.9.1.1]
    ... 7 more
Caused by: java.lang.RuntimeException: No buckets?? seems there is no data to index.
    at io.druid.indexer.IndexGeneratorJob.run(IndexGeneratorJob.java:172) ~[druid-indexing-hadoop-0.9.1.1.jar:0.9.1.1]
    at io.druid.indexer.JobHelper.runJobs(JobHelper.java:323) ~[druid-indexing-hadoop-0.9.1.1.jar:0.9.1.1]
    at io.druid.indexer.HadoopDruidIndexerJob.run(HadoopDruidIndexerJob.java:94) ~[druid-indexing-hadoop-0.9.1.1.jar:0.9.1.1]
    at io.druid.indexing.common.task.HadoopIndexTask$HadoopIndexGeneratorInnerProcessing.runTask(HadoopIndexTask.java:261) ~[druid-indexing-service-0.9.1.1.jar:0.9.1.1]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_77]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_77]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_77]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_77]
    at io.druid.indexing.common.task.HadoopTask.invokeForeignLoader(HadoopTask.java:201) ~[druid-indexing-service-0.9.1.1.jar:0.9.1.1]
    ... 7 more

2016-08-12T03:17:33,071 INFO [task-runner-0-priority-0] io.druid.indexing.overlord.TaskRunnerUtils - Task [index_hadoop_no_metrics_2016-08-12T03:17:23.355Z] status changed to [FAILED].

2016-08-12T03:17:33,077 INFO [task-runner-0-priority-0] io.druid.indexing.worker.executor.ExecutorLifecycle - Task completed with status: {
  "id" : "index_hadoop_no_metrics_2016-08-12T03:17:23.355Z",
  "status" : "FAILED",
  "duration" : 6102
}





I uploaded the logs here.

How do I import this kind of file?

Chanh Le

Aug 15, 2016, 5:55:14 AM
to Druid User
Hi everyone,
Any ideas on this would be appreciated.

Thanks.




Fangjin Yang

Aug 15, 2016, 8:36:04 PM
to Druid User
Hi Chanh, Parquet support is a community extension that is not officially supported by the Druid committers, but the original author is around to help.

In this particular case, the problem is that

"intervals": [
          "2016-08-09/2016-08-11"
        ]
      }

does not match your actual data. Make sure the timezone is correct for your data, and that the data at the path you listed actually falls within the interval you provided.
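A minimal sketch of the adjustment, assuming the data under the time=2016-07-16 partition shown earlier is what should be indexed, is to shift or widen the intervals so they cover the dates actually present:

"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "DAY",
  "queryGranularity": "ALL",
  "intervals": ["2016-07-16/2016-07-17"]
}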

Chanh Le

Aug 15, 2016, 10:19:03 PM
to druid...@googlegroups.com
Hi Fangjin,
Thanks for the suggestion.
By the way, is there any way to ignore the intervals and just import everything in the file?
Sometimes we just have a bunch of data and want to import it all.


Regards,
Chanh



Fangjin Yang

Aug 16, 2016, 1:08:09 PM
to Druid User
This functionality should most definitely be added in the near future :)