Flume + Hive realtime problem with temporary files

Flume + Hive realtime problem with temporary files r.pavlenko 11/8/12 10:50 PM
Hello. We have a problem with real-time data collection and analytics using Flume + Hive.
Flume collects data and writes files to an HDFS folder. This folder is the data location for a Hive table.
Flume rolls the file every minute. While the MR job for a Hive query is running, Flume closes the current file and renames it from "bla-bla.tmp" to "bla-bla", dropping the ".tmp" extension.
The MR container then crashes with a "file not found bla-bla.tmp" exception!

Can I tune Hive so that it skips .tmp files? Or are there other settings that would help?

We use CDH version 4.1.2 with Flume 1.2.
Sorry for my poor English :)
Re: Flume + Hive realtime problem with temporary files Brock Noland 11/9/12 7:30 AM
Hi,

FileInputFormat should have a setInputPathFilter method which you can use to ignore the .tmp files.

Brock

--
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
Re: Flume + Hive realtime problem with temporary files djl 11/11/12 2:19 PM
You can implement a very simple PathFilter class with your own specific filtering rules, e.g.:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class FileFilterExcludeTmpFiles implements PathFilter {
    public boolean accept(Path p) {
        String name = p.getName();
        // Skip hidden files ("_" or "." prefix) and in-progress ".tmp" files.
        return !name.startsWith("_") && !name.startsWith(".") && !name.endsWith(".tmp");
    }
}

You then need to assign this class to the "mapred.input.pathFilter.class" property.

e.g. from the Hive CLI:
hive> SET mapred.input.pathFilter.class=com.yoururl.FileFilterExcludeTmpFiles;
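
Before deploying the filter, the naming rule itself can be checked without a cluster. The following is only a minimal standalone sketch: it copies the accept() condition into a plain Java class with no Hadoop dependency, and the class name TmpFileRule and the sample file names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Standalone check of the naming rule; no Hadoop classes are needed.
class TmpFileRule {
    // Same condition as the PathFilter's accept(), applied to a bare file name.
    static boolean accept(String name) {
        return !name.startsWith("_") && !name.startsWith(".") && !name.endsWith(".tmp");
    }

    public static void main(String[] args) {
        // Hypothetical names of the kind Flume might leave in the table directory.
        List<String> names = Arrays.asList("bla-bla", "bla-bla.tmp", "_bla-bla.tmp", ".hidden");
        for (String name : names) {
            System.out.println(name + " -> " + (accept(name) ? "read" : "skipped"));
        }
    }
}
```

Only names for which accept() returns true would be handed to the MR job as input.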

By default Hive ignores files beginning with an underscore ("_") or a dot (".").
It may be easier to have Flume suffix these .tmp files with an underscore to ensure full compatibility with Hive/Hadoop?
Re: Flume + Hive realtime problem with temporary files djl 11/11/12 2:25 PM
Sorry, that post was meant to say:
It may be easier to have Flume 'prefix' these .tmp files with an underscore to ensure full compatibility with Hive/Hadoop?
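
As a rough sketch, the prefix approach might look like this in the Flume agent configuration. It assumes a Flume release whose HDFS sink supports the hdfs.inUsePrefix property, which is documented in later Flume user guides; whether it is available in Flume 1.2 is not confirmed here. The agent name "agent1", the sink name "hdfsSink", and the path are hypothetical.

```
# Hypothetical agent and sink names; hdfs.inUsePrefix may not exist in Flume 1.2.
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = /path/to/hive/table
# Files still being written get a leading underscore, which Hive/Hadoop skip by default.
agent1.sinks.hdfsSink.hdfs.inUsePrefix = _
```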
Re: Flume + Hive realtime problem with temporary files r.pavlenko 11/11/12 9:50 PM
Thank you for your answer. I'll try using the PathFilter class.


It may be easier to have Flume 'prefix' these .tmp files with an underscore to ensure full compatibility with Hive/Hadoop?

I do not know how to do that :(

On Monday, November 12, 2012, 2:25:55 AM UTC+4, djl wrote:
Re: Flume + Hive realtime problem with temporary files Dan Sandler 1/20/13 7:46 AM
I had the same issue when implementing the Cloudera Twitter feed example. I followed Brock and djl's workaround to FLUME-1702 and it worked as advertised. The code I implemented is exactly as djl listed (plus the required packages). There were also configuration changes to hive-site.xml. The code and the configuration changes are listed below. For those interested in the CDH Twitter example and how to make it real-time with this workaround (and Oozie configuration), please refer to the Cloudera blog post on the tutorial and see the comments there.


Java code: build the class, package it into a JAR, and place the JAR in /usr/lib/hadoop.

package com.twitter.util;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class FileFilterExcludeTmpFiles implements PathFilter {
    public boolean accept(Path p) {
        String name = p.getName();
        // Skip hidden files ("_" or "." prefix) and in-progress ".tmp" files.
        return !name.startsWith("_") && !name.startsWith(".") && !name.endsWith(".tmp");
    }
}

Changes to hive-site.xml.  Bounce the hive services following the configuration changes:

<property>
  <name>hive.aux.jars.path</name>
  <value>file:///usr/lib/hadoop/hive-serdes-1.0-SNAPSHOT.jar,file:///usr/lib/hadoop/TwitterUtil.jar</value>   
</property>

<property>
  <name>mapred.input.pathFilter.class</name>
  <value>com.twitter.util.FileFilterExcludeTmpFiles</value>
</property>