Hi Fangjin:
I've been trying to understand the problem a little better, so I have some background data to share before I answer your question.
First, the data I'm trying to load has about 80 columns, about 450 bytes per line. In a given day there are about 38 million rows, plus or minus. Previously, when I was using the old indexer on 0.5.48, I had managed to load a partial dayfile that contained about 30M rows covering 20 hours of data. At that time, after the load completed, it took several hours for queries to return data (that is, they returned empty arrays until then), including the time-boundary query. I think I'm just now understanding your comment about partial query results when data hasn't fully loaded yet, and I'm wondering if the time-boundary query actually requires that everything be loaded (it would make sense).
When I switched to the 0.6 indexing service, I monitored the task in the overlord console and saw it succeed; when I replaced the "status" URI fragment with "segments", I could see the segment appear, matching the row in the MySQL database. When I stop everything (all Druid processes, ZooKeeper, etc.) and restart, the console is empty, which seems a little odd: the segment files are still in place on disk at the location given by the metadata returned from the console and from a MySQL query, even though the overlord console no longer shows the segment after a restart.
At the overlord's "cluster.html" link, I get "Loading segment data... this may take a few minutes" and it never finishes.
For my background investigation, I went back to the HadoopDruidIndexerMain, but in 0.6 (that is, "io.druid.cli.Main index hadoop"), starting with a small file and working up through successively larger subsets of the original 1M-row file, produced with head -nnn. For files containing the first 10, 100, 1000, 10000, and 100000 lines, the time-boundary query returns a non-empty array within at most 15-30 seconds of the load completing. But when I load the full 1M lines, I get the same result as with the new indexing service (which I believe is just a thin wrapper around the old Hadoop indexer, right?): all queries on that load (which I place in a different datasource to avoid confusion) return empty arrays.
I'm assuming that my configuration must be producing a segment which the historical node cannot bring into memory. I feel I'm somewhat constrained on granularity, etc. because of the one-reducer/one-segment issue, but I'd like to know more about why this doesn't seem to work well before I go off and build a full Hadoop cluster just to see what happens.
Re: querying over a certain time range: the time-boundary query should work regardless, right? I suspect that query must return data before it's worth trying any other query type. Or does it require that everything in the datasource be loaded?
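For reference, the time-boundary query I've been issuing is essentially the minimal one (the datasource name here is a placeholder), POSTed to the broker at /druid/v2/:

  {
    "queryType": "timeBoundary",
    "dataSource": "my_datasource"
  }

When a load has worked, this comes back within seconds as a one-element array with minTime and maxTime in the result; in the failure cases above it's just "[]".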
If I have a file that is approximately 17GB in size, containing 24 hours of data at minute granularity, where there are about 30K rows of data per minute, what are reasonable values to use for the granularity spec and the rollup granularity?
I've had to set targetPartitionSize to 0 to avoid the partitioning-algorithm error noted elsewhere.
I suspect that if I could segment my data per hour, things would improve (although this 1M-row example covers only 25 minutes of data), but I don't know whether I can even do that with the default local Hadoop instance unless I chop my data files into hour files. Should I set up a real Hadoop cluster?
One last thing -- just a couple of minutes ago, I tried loading the smaller 100K-row table using the indexing service. That appeared to succeed in the console, and within seconds I was able to query it. I did not see any "Announcing segment..." messages in any of the Druid-node log files.
2014-01-10 21:48:16,221 INFO [Thread-33] org.apache.hadoop.mapred.Task - Task 'attempt_local_0002_m_000000_0' done.
2014-01-10 21:48:16,223 INFO [Thread-33] org.apache.hadoop.mapred.Task - Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@7928ec65
2014-01-10 21:48:16,223 INFO [Thread-33] org.apache.hadoop.mapred.LocalJobRunner -
2014-01-10 21:48:16,224 INFO [Thread-33] org.apache.hadoop.mapred.Merger - Merging 1 sorted segments
2014-01-10 21:48:16,224 INFO [Thread-33] org.apache.hadoop.mapred.Merger - Down to the last merge-pass, with 1 segments left of total size: 33 bytes
2014-01-10 21:48:16,224 INFO [Thread-33] org.apache.hadoop.mapred.LocalJobRunner -
2014-01-10 21:48:16,227 INFO [Thread-33] io.druid.indexer.DeterminePartitionsJob - Determining partitions for interval: 2013-08-02T00:00:00.000Z/2013-08-03T00:00:00.000Z
2014-01-10 21:48:16,228 WARN [Thread-33] org.apache.hadoop.mapred.LocalJobRunner - job_local_0002
com.metamx.common.ISE: No suitable partitioning dimension found!
    at io.druid.indexer.DeterminePartitionsJob$DeterminePartitionsDimSelectionReducer.innerReduce(DeterminePartitionsJob.java:689)
    at io.druid.indexer.DeterminePartitionsJob$DeterminePartitionsDimSelectionBaseReducer.reduce(DeterminePartitionsJob.java:480)
    at io.druid.indexer.DeterminePartitionsJob$DeterminePartitionsDimSelectionBaseReducer.reduce(DeterminePartitionsJob.java:453)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:417)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)

By "noted elsewhere" I was referring to another post on the group titled "HadoopIndexerProblem using CSV". I don't know if my issue is similar to that one (I certainly have more rows!), but on the other hand I've not yet been able to ingest any file without having to set this value to 0 first.
Hi Fangjin:
I have a partial solution to my issue. The problem turned out to be this: I had started the historical node with the default properties shown in the documentation, and it wasn't clear to me at the time that "druid.server.maxSize" is the total size of segments the node is willing to load and serve; a segment that would push it past that limit never gets loaded, and queries just return empty arrays. It was set to 100M. When I restarted with it set to 10G, everything worked! Note: index.zip is about 63MB, which is below the old max, so presumably what counts against "druid.server.maxSize" is the unzipped segment, which must be larger than 100MB.
I don't quite understand the relationship between the following Historical-node properties:
druid.server.maxSize=10000000000
druid.processing.buffer.sizeBytes=1000000000
druid.segmentCache.locations=[{"path": "/tmp/druid/indexCache", "maxSize": 10000000000}]
but I bumped all 3 of them up. Did I need to do that?
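For anyone else who trips over this, here's my current reading of the three, as comments (my annotations, so take them with a grain of salt):

  # Total bytes of segments this node is willing to load and announce;
  # anything that would push it past this never gets served.
  druid.server.maxSize=10000000000

  # Size of each off-heap buffer used to compute query results; this is
  # a query-processing knob, not a segment-loading one.
  druid.processing.buffer.sizeBytes=1000000000

  # Local disk locations where segments are unzipped and memory-mapped,
  # each with its own byte cap; druid.server.maxSize shouldn't exceed
  # the sum of these caps.
  druid.segmentCache.locations=[{"path": "/tmp/druid/indexCache", "maxSize": 10000000000}]

If I'm reading that right, only the first and third actually mattered for my empty-array problem, and the processing buffer was a separate knob I didn't strictly need to touch.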
btw "maxSegmentSize" might have caught my eye a lot sooner than "maxSize". ;->
So, until I can figure out why my ingest crashes whenever I set targetPartitionSize to anything other than 0, I can at least control segment size by loading my data by the hour. I still haven't solved the failure I get when I try to load a full day; for now I'll load per hour, and once I figure out how to control the partition size I'll try a full day again.
Wow, nice! Wayne, you seem to have generated a segment with inverted
indexes larger than 2GB total.
I just got caught up on this thread and you mentioned that you
separated your data out into hour chunks to get some stuff loaded and
that you were worried about doing that for everything. I think you
should actually just do that for everything. With Druid, you can index each
individual hour into an hour segment and Druid will expose them to you
the same as if you had indexed the day together.
In general, we've kept segments to less than 10 million rows and
actually try to target 5 million or so per segment. You mentioned about
38 million rows generated regularly over 1 day, so if you do them in
hourly chunks, you will have roughly 1.6M rows per segment, which
should be good. You should be able to do this by specifying the
segment granularity at "hour".
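In the 0.6 Hadoop indexer config that would look something like this (from memory, so double-check it against the docs; the interval is the one from your stack trace, and the aggregator is just an illustration):

  "granularitySpec": {
    "type": "uniform",
    "gran": "HOUR",
    "intervals": ["2013-08-02T00:00:00.000Z/2013-08-03T00:00:00.000Z"]
  },
  "rollupSpec": {
    "aggs": [{"type": "count", "name": "rows"}],
    "rollupGranularity": "minute"
  }

The "gran" field controls the segment granularity, and minute rollup matches the per-minute resolution of your data.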
Also, you were asking about targetPartitionSize; what values did you
set for it? A targetPartitionSize of "5000000", for example,
specifies that you want roughly 5,000,000 rows per segment. I forget
the exact algorithm that it uses to pick a partition dimension, but I
know it depends on finding boundaries that get close to the target,
somehow. If you just index at hourly granularity, though, you won't
have to worry about partitioning, because the segments will be small
enough.
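If you do want to revisit partitioning later, the relevant piece of the indexer config is roughly this (again from memory, and the exact shape has moved around across 0.6 point releases):

  "partitionsSpec": {
    "targetPartitionSize": 5000000,
    "partitionDimension": null
  }

where a null partitionDimension asks the indexer to pick the dimension itself, which is the step that's blowing up with "No suitable partitioning dimension found!" in the log you posted.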
Sorry for the trouble you've been having. The ingestion code was
written with the assumption that large files would be indexed via
Hadoop jobs on a Hadoop cluster, and I think you've proven that it has
many areas that can be further improved ;).