I created a cluster of 5 workers. I ran a modified version of the BigQuery connector example [1] (I don't need an output table), but the Spark BigQuery connector does not seem to be well configured. My table is approximately 500 MB and 3,000,000 rows. I can see in my logs:
15/10/27 14:33:44 WARN com.google.cloud.hadoop.io.bigquery.ShardedExportToCloudStorage: Estimated max shards < desired num maps (2 < 20); clipping to 2.
15/10/27 14:33:44 INFO com.google.cloud.hadoop.io.bigquery.ShardedExportToCloudStorage: Computed '2' shards for sharded BigQuery export.
Using only 2 shards seems suboptimal, as I have more than 20 cores available; it takes more than 10 minutes just to count all the rows. Is there a way to make the BigQuery connector smarter? My goal is to run Spark jobs on terabytes of data (currently stored in BigQuery).
import com.google.cloud.hadoop.io.bigquery.BigQueryConfiguration
import com.google.cloud.hadoop.io.bigquery.GsonBigQueryInputFormat
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable
val projectId = "xxx"
val fullyQualifiedInputTableId = "xxx:yyy.zzz"
val jobName = "wordcount"
val conf = sc.hadoopConfiguration
// Set the job-level projectId.
conf.set(BigQueryConfiguration.PROJECT_ID_KEY, projectId)
// Use the systemBucket for temporary BigQuery export data used by the InputFormat.
val systemBucket = conf.get("fs.gs.system.bucket")
conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, systemBucket)
// Configure input and output for BigQuery access.
BigQueryConfiguration.configureBigQueryInput(conf, fullyQualifiedInputTableId)
val fieldName = "word"
// Load the table as an RDD of (LongWritable, JsonObject) pairs.
val tableData = sc.newAPIHadoopRDD(conf,
  classOf[GsonBigQueryInputFormat], classOf[LongWritable], classOf[JsonObject])
tableData.count()
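As a workaround I considered repartitioning the RDD right after loading it, so that downstream stages at least use all cores even if the export itself only produces 2 shards (the partition count of 20 is my guess based on my core count, not something the connector prescribes):

```scala
// Hypothetical workaround: the BigQuery export yields only 2 input
// splits, so spread the rows across more partitions before doing
// any heavier work. This adds a shuffle and does not speed up the
// export/read itself.
val widened = tableData.repartition(20)
widened.count()
```

This obviously doesn't fix the root cause (the export is still clipped to 2 shards), which is why I'm asking whether the shard computation itself can be influenced.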