Small files not combined in mapper

Nikhil J Joshi

Jan 6, 2017, 5:52:21 PM
to Scalding Development
Hi,


I recently converted a Pig script to an equivalent Scalding job. When I run the Pig script on an input consisting of many small files, the logs show that the inputs are combined:


org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1000
06-01-2017 14:37:58 PST referral-scoring_scoring_feature-generation-v2_extract-postfeast-fields-jobs-basic org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1000
06-01-2017 14:37:58 PST referral-scoring_scoring_feature-generation-v2_extract-postfeast-fields-jobs-basic org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 77
06-01-2017 14:37:58 PST referral-scoring_scoring_feature-generation-v2_extract-postfeast-fields-jobs-basic INFO - 2017-01-06 22:37:58,517 org.apache.hadoop.mapreduce.JobSubmitter - number of splits:77

However, the Scalding job doesn't seem to combine the inputs and runs 1000 mappers, one per input file, which is causing bad performance. Is there something wrong with the way I am executing the Scalding job?

The part of the script responsible for the step above is:

private val ids: TypedPipe[Int] = TypedPipe
    .from(PackedAvroSource[Identifiers](args("identifiers")))
    .map{ featureNamePrefix match {
      case "member" => _.getMemberId.toInt
      case "item" => _.getItemId.toInt
    }}

Any help is greatly appreciated.
Thanks,
Nikhil

Oscar Boykin

Jan 6, 2017, 7:59:07 PM
to Nikhil J Joshi, Scalding Development
You want to set this config:


"cascading.hadoop.hfs.combine.files" -> true

which you can do in the job:

override def config: Map[AnyRef, AnyRef] = super.config + ("cascading.hadoop.hfs.combine.files" -> "true")

or with the -Dcascading.hadoop.hfs.combine.files=true option to hadoop.

That should work. Let us know if it does not.

--
You received this message because you are subscribed to the Google Groups "Scalding Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scalding-dev...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Alex Levenson

Jan 6, 2017, 8:07:19 PM
to Oscar Boykin, Nikhil J Joshi, Scalding Development
I think you can set this per-source as well (instead of for all sources) by overriding `tapConfig` here: https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/HfsConfPropertySetter.scala#L55

--
Alex Levenson
@THISWILLWORK

Alex Levenson

Jan 6, 2017, 8:07:45 PM
to Oscar Boykin, Nikhil J Joshi, Scalding Development
You probably also need to tune how much or how little combining you want to happen.
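
A rough sketch of what that tuning could look like at the job level. It assumes Cascading's companion property cascading.hadoop.hfs.combine.max.size, which caps the size in bytes of each combined split. That property comes from Cascading's HfsProps and is not mentioned elsewhere in this thread, so please check it against your Cascading version. The job class name, the TextLine and TypedTsv sources, and the 256 MB value are placeholders, not recommendations.

```scala
import com.twitter.scalding._

// Hypothetical job: the class name, the sources, and the 256 MB cap are illustrative only.
class CombineSmallFilesJob(args: Args) extends Job(args) {

  // Upper bound, in bytes, for each combined split (assumed property, see the note above).
  private val maxCombinedSplitBytes: Long = 256L * 1024 * 1024

  // Job-wide settings: enable combining and limit how much data goes into one split.
  override def config: Map[AnyRef, AnyRef] =
    super.config ++ Map(
      "cascading.hadoop.hfs.combine.files" -> "true",
      "cascading.hadoop.hfs.combine.max.size" -> maxCombinedSplitBytes.toString
    )

  TypedPipe.from(TextLine(args("input")))
    .write(TypedTsv[String](args("output")))
}
```

A larger cap should mean fewer, bigger mappers and a smaller cap more, smaller ones, so the right value depends on the input and the cluster.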

Kostya Salomatin

Jan 6, 2017, 9:00:46 PM
to Alex Levenson, Oscar Boykin, Nikhil J Joshi, Scalding Development
Wow, per-source config is really useful. I've needed this feature for a while and did not know it already existed.

Kostya
Konstantin                              mailto:salo...@gmail.com

Alex Levenson

Jan 6, 2017, 9:12:43 PM
to Kostya Salomatin, Oscar Boykin, Nikhil J Joshi, Scalding Development
Yeah, per-source config is done via Tap.sourceConfInit and Tap.sinkConfInit -- so these custom settings will only apply after one of those methods is called.

So it can't be used to control things that happen before then, e.g. the heap size of your mappers or things like that.

Nikhil J Joshi

Jan 6, 2017, 9:23:44 PM
to Alex Levenson, Kostya Salomatin, Oscar Boykin, Scalding Development
Thanks Oscar and Alex. I will follow up and update you on these incredible ideas. 
Have a great weekend,
Nikhil

--

Nikhil J Joshi
Senior Applied Researcher - Machine Learning, Data Science
LinkedIn Corp.

Nikhil J Joshi

Jan 10, 2017, 12:49:58 PM
to Alex Levenson, Kostya Salomatin, Oscar Boykin, Scalding Development
Hi Alex,

I am trying the `HfsConfPropertySetter` approach, but I couldn't find an example showing how to implement it correctly. Could you share some more details on this? Example code would be great.

Thanks again,
Nikhil

Alex Levenson

Jan 10, 2017, 4:05:56 PM
to Nikhil J Joshi, Kostya Salomatin, Oscar Boykin, Scalding Development
If PackedAvroSource extends FileSource (which extends HfsTapProvider) -- or if it just extends HfsTapProvider on its own, then you can just do something like:

new PackedAvroSource[Identifiers](args("identifiers")) with HfsConfPropertySetter {
  override def tapConfig = Config(Map("foo" -> "bar"))
}

Does that make sense?
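
For the small-files case in this thread, the "foo" -> "bar" placeholder would become the combine setting mentioned earlier. Below is a minimal sketch of that, assuming a Scalding version that ships HfsConfPropertySetter with a tapConfig member. TextLine stands in for PackedAvroSource so the example does not depend on the Avro classes, and the job name and argument names are made up for illustration.

```scala
import com.twitter.scalding._

// Hypothetical job: only the source below gets the combine setting,
// other sources in the same job keep the job-wide defaults.
class PerSourceCombineJob(args: Args) extends Job(args) {

  // TextLine is a stand-in here; the same mixin pattern is what the message
  // above suggests for PackedAvroSource.
  val smallFiles = new TextLine(args("input")) with HfsConfPropertySetter {
    override def tapConfig: Config =
      Config(Map("cascading.hadoop.hfs.combine.files" -> "true"))
  }

  TypedPipe.from(smallFiles)
    .write(TypedTsv[String](args("output")))
}
```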

Nikhil J Joshi

Jan 10, 2017, 5:11:21 PM
to Alex Levenson, Kostya Salomatin, Oscar Boykin, Scalding Development
Hi Alex,

Thanks for the explanation. I realized that we are still on 0.13 with Scala 2.10, and some of these things were not introduced until 0.16. I will need to figure out a workaround for this issue.

Thanks,
Nikhil

Alex Levenson

Jan 10, 2017, 5:21:36 PM
to Nikhil J Joshi, Kostya Salomatin, Oscar Boykin, Scalding Development
If you look at how HfsConfPropertySetter is implemented, you just need to use a Tap that overrides sourceConfInit and adds some things to the (mutable) config object there. So you can do that pretty easily yourself without using HfsConfPropertySetter if you need to.
The important bit is getting your settings into the configuration via the sourceConfInit method of the Tap.

Or just set it globally if that works for you -- all this is to keep the configs separated for each source.
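
The sourceConfInit approach described above could be sketched roughly as follows for versions that lack HfsConfPropertySetter. This assumes the Cascading 2.x signatures, where Hfs is parameterized on JobConf. The class name ConfiguredHfs and its sourceSettings parameter are hypothetical. Wiring the tap into a Scalding source (for example by overriding how the source creates its tap) depends on the Scalding version and is not shown.

```scala
import cascading.flow.FlowProcess
import cascading.scheme.Scheme
import cascading.tap.SinkMode
import cascading.tap.hadoop.Hfs
import org.apache.hadoop.mapred.{JobConf, OutputCollector, RecordReader}

// Hypothetical Tap: an Hfs that injects extra settings when it is used as a source.
// Assumes Cascading 2.x, where Hfs and FlowProcess are parameterized on JobConf.
class ConfiguredHfs(
  scheme: Scheme[JobConf, RecordReader[_, _], OutputCollector[_, _], _, _],
  path: String,
  sinkMode: SinkMode,
  sourceSettings: Map[String, String]
) extends Hfs(scheme, path, sinkMode) {

  override def sourceConfInit(process: FlowProcess[JobConf], conf: JobConf): Unit = {
    // Write the settings into the mutable JobConf first, so that Hfs sees them
    // while it finishes initializing this source.
    sourceSettings.foreach { case (key, value) => conf.set(key, value) }
    super.sourceConfInit(process, conf)
  }
}
```

Constructing it with Map("cascading.hadoop.hfs.combine.files" -> "true") as sourceSettings would then apply the combine setting to that one source only.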
