Does spark support bz2 (or other compressed) files?

4,477 views

Lingyun Zhang

Apr 5, 2013, 2:42:13 PM
to spark...@googlegroups.com
I am using pyspark in Spark 0.7.

When I use
tfile=sc.textFile("tt")
it works fine and uses all the workers/CPUs when I run map/reduce functions on it.

When I use
tfile=sc.textFile("tt.bz2")
it still gets the results eventually, but very slowly, and it seems to use only one CPU on one machine.

(Hadoop handles bz2 with no problem because it is a splittable compression format.)

Thank you!
Lingyun

Matei Zaharia

Apr 5, 2013, 5:58:20 PM
to spark...@googlegroups.com
Spark uses the exact same input library as Hadoop, so it should get the same performance. Make sure you link against the right version of Hadoop for the features you need, though (for example, maybe Cloudera's Hadoop has splittable compression while our default of Hadoop 1.0.4 doesn't). Also make sure to put the Hadoop native libraries on SPARK_LIBRARY_PATH in conf/spark-env.sh (otherwise Spark will complain about that and fall back to the slower Java ones).
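[Editor's note: as a sketch of the setting Matei mentions, an entry in conf/spark-env.sh might look like the following. The HADOOP_HOME location and the native-library subdirectory name are assumptions; they vary by distribution and platform, so adjust them for your install.]

```shell
# conf/spark-env.sh -- illustrative only; paths below are assumptions
export HADOOP_HOME=/usr/lib/hadoop
# Point Spark at Hadoop's native compression libraries so it does not
# fall back to the slower pure-Java codec implementations.
export SPARK_LIBRARY_PATH=$HADOOP_HOME/lib/native/Linux-amd64-64
```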

Matei

--
You received this message because you are subscribed to the Google Groups "Spark Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to spark-users...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Lingyun Zhang

Apr 5, 2013, 7:30:54 PM
to spark...@googlegroups.com
Hi Matei,
Thank you for your response. I went to your talk at UCSD a few weeks
ago. I was very inspired and am now trying to set Spark and Shark up
on our servers.

It's good to know that bz2 is supported and that something is just
wrong with my setup.
The issue does not seem to be the native Hadoop codec vs. the Java one.
The main symptom is that when I use a plain text file, I can see many
CPUs on many workers busy running Python.
But when I use a bz2 file, no CPU is busy on any server except one on
the machine where the job was submitted ...
Any idea what may cause that?

Or is there a way to feed sc standard input through a bash command,
something like tfile=sc.textFile("bzcat tt.bz2")?

Lingyun Zhang

Apr 5, 2013, 8:09:31 PM
to spark...@googlegroups.com
OK, I confirmed that Spark is picking up the Hadoop native library. Still
having the same problem.

Has anyone successfully used pyspark with bz2 files?
Does anyone have a similar problem?

Lingyun Zhang

Apr 5, 2013, 8:20:51 PM
to spark...@googlegroups.com
I have the same problem when I read bz2 from HDFS: no visible CPU activity.

When I use Hadoop streaming on the same file, it works fine, with all CPUs busy.

Ashish

Apr 5, 2013, 9:29:48 PM
to spark...@googlegroups.com
Not sure what your exact problem is, but we had a similar problem reading Snappy-compressed files.

Please see:
https://groups.google.com/forum/m/?fromgroups#!topic/spark-developers/pmg0b68jQZs

Ashish

Apr 5, 2013, 9:31:18 PM
to spark...@googlegroups.com
Are you running in Standalone mode?

Lingyun Zhang

Apr 5, 2013, 11:34:49 PM
to spark...@googlegroups.com
Yes, I am running in Standalone mode with 5 servers. I still haven't figured out what is wrong. It is not the codec ...


On Fri, Apr 5, 2013 at 6:31 PM, Ashish <aran...@gmail.com> wrote:
> Are you running in Standalone mode?

Matei Zaharia

Apr 8, 2013, 12:23:06 AM
to spark...@googlegroups.com
Did you guys update your Hadoop version in the SBT build file? Try using the Cloudera version there; there might be bugs in the Hadoop 1.0.4 that we depend on.
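[Editor's note: a sketch of the change Matei suggests. In the Spark 0.7 source tree the Hadoop dependency is set in project/SparkBuild.scala; the variable name and the CDH version string below are assumptions, so check your checkout before editing.]

```scala
// project/SparkBuild.scala -- illustrative fragment; names and versions are assumptions
// Replace the default Hadoop dependency version:
//   val HADOOP_VERSION = "1.0.4"
// with a Cloudera build, for example:
val HADOOP_VERSION = "0.20.2-cdh3u5"
```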

Matei


Lingyun Zhang

Apr 8, 2013, 12:46:37 AM
to spark...@googlegroups.com
I thought I didn't need to have Hadoop installed to run Spark? My Hadoop works fine with bz2, and so does Hive.

I used the prebuilt file for Spark 0.7 (I also tried 0.6, with a similar problem; my Shark has the same issue in that it works fine with plain text but not bz2).

Do you think it could be a problem with the prebuilt file? Should I try building it myself instead?

Lingyun Zhang

Apr 8, 2013, 12:53:51 AM
to spark...@googlegroups.com
Ah, I see what you are saying now.
I have the same issue even when I am using local files -- plain text is OK, bz2 is not (no HDFS in the picture).
When I use HDFS, same thing: plain text is OK (so Spark talks to HDFS just fine), but bz2 is not.
Same with Shark: it creates tables and loads files into HDFS just fine, but queries only work right with plain text, not bz2.

I can try to build from source tomorrow, though.