Is the Avro format supported by the RHadoop mapreduce function?


sai krishna bala

unread,
Sep 15, 2014, 2:39:52 PM9/15/14
to rha...@googlegroups.com
I was looking into the mapreduce function provided by the rmr library and wanted to know whether there is any support for reading data that is in Avro-serialized format.
Right now my data is Avro-serialized, and I want to extract some aggregated metrics from the logs using the mapreduce functionality in RHadoop, with make.input.format=avro and schema=<path_of_schema>. Can this be done?

Antonio Piccolboni

unread,
Sep 16, 2014, 5:39:55 PM9/16/14
to rha...@googlegroups.com
You can check this example https://github.com/RevolutionAnalytics/rmr2/blob/15a0dbd7233087cfd7b020f90e317d65a207c872/pkg/examples/avro.R

Built-in support is expected in the next minor release; it is already in the dev branch, input only.


Antonio

Renan Pinzon

unread,
Sep 17, 2014, 5:15:24 PM9/17/14
to rha...@googlegroups.com
Hi Piccolboni,

I'm trying to read an Avro file from HDFS based on your example and also on the dev branch, and neither works. With your example I get an error saying that it expects a quote (") but gets a left curly brace ({), and the code based on the dev branch always expects a local file. So, is there a way to read files from HDFS?

I looked at the code in the dev branch and noticed that the variable AVRO_LIBS is loaded, but I couldn't find which libs are expected. Do you have any idea?

When I call read.avro('/path/to/my/file.avro') from R it works fine, so I believe my Avro file is fine.

Regards,
Renan

Antonio Piccolboni

unread,
Sep 17, 2014, 5:42:25 PM9/17/14
to RHadoop Google Group
It'd be better if you shared your code exactly. There is a way, but it is not your way; that's all I can tell. Luckily for you, I ran the unit tests for Avro input in dev and they don't pass, so you don't need to take on the humble task of actually sharing your code in addition to your opinion. I will let you know what I find out.

That AVRO_LIBS setting is not documented very well, is it? That has to improve before we release. If you have ravro installed, take a look at ravro::AVRO_TOOLS; setting AVRO_LIBS the same way should do it.


Antonio

 


--
post: rha...@googlegroups.com ||
unsubscribe: rhadoop+u...@googlegroups.com ||
web: https://groups.google.com/d/forum/rhadoop?hl=en-US
---
You received this message because you are subscribed to the Google Groups "RHadoop" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rhadoop+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Antonio Piccolboni

unread,
Sep 17, 2014, 6:09:08 PM9/17/14
to rha...@googlegroups.com, ant...@piccolboni.info
Alright, so the only problem with the failed tests was the setting of AVRO_LIBS. We need to document that clearly. For now, start with:

Sys.setenv(AVRO_LIBS = ravro::AVRO_TOOLS)


If that doesn't do it, please share a small example and I'll try to run it. Thanks.



Renan Pinzon

unread,
Sep 17, 2014, 6:18:56 PM9/17/14
to rha...@googlegroups.com, ant...@piccolboni.info
Hi Piccolboni,

Thank you for your prompt answer.

So, I'm defining an Avro input format for my mapreduce like this:

avro.input.format = function(schema.file, ..., read.size = 10^5) {
  schema = ravro:::avro_get_schema(file = schema.file)
  function(con) {
    lines = readLines(con = con, n = read.size)
    if (length(lines) == 0)
      NULL
    else {
      x = splat(paste.fromJSON)(lines)
      y = ravro:::parse_avro(x, schema, encoded_unions=FALSE, ...)
      keyval(NULL, y)
    }
  }
}

For the input.format argument of mapreduce I'm using this:

  input.format = make.input.format(
    format = avro.input.format,
    mode = 'text',
    streaming.format = "org.apache.avro.mapred.AvroAsTextInputFormat",
    backend.parameters = list(hadoop = list(
      libjars = '/home/hdfs/avro-mapred-1.7.4-hadoop2.jar'
    ))
  )

As ravro:::avro_get_schema expects a local file, I'm stuck at this point, because I'm trying to read the file from HDFS.

About AVRO_LIBS, that's OK; running ravro::AVRO_TOOLS I get "/usr/lib64/R/library/ravro/java/avro-tools-1.7.4.jar".
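Since ravro:::avro_get_schema only accepts a local path, one possible workaround (a sketch only, assuming rhdfs is installed and the schema sits in its own .avsc file; the HDFS path is a made-up example) is to copy the schema down from HDFS first:

```r
# Sketch: copy the schema file from HDFS to a local temp file, then hand
# the local path to ravro. The HDFS path below is a made-up example.
library(rhdfs)
hdfs.init()
local.schema = tempfile(fileext = ".avsc")
hdfs.get("/path/to/hdfs/schema.avsc", local.schema)
schema = ravro:::avro_get_schema(file = local.schema)
```

The local temp file can then be passed as schema.file to an input format like the one above.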


Antonio Piccolboni

unread,
Sep 17, 2014, 6:27:25 PM9/17/14
to Renan Pinzon, RHadoop Google Group
Not sure where you got that, but let's focus on dev. Find pkg/tests/mapreduce.R and look for the Avro test; that's how it's done from now on. One thing: in theory you should be able to specify the schema with a complete URL, hdfs://path. That's not in the unit tests, but I think we tested it while developing ravro, so it should work here as well.
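Putting this together, a minimal sketch of what the dev-branch call might look like (assumptions: a dev build of rmr2 and ravro are installed, and the schema URL and input path are hypothetical):

```r
# Sketch of the dev-branch "avro" input format; all paths are made-up
# examples. AVRO_LIBS must point at a suitable avro-tools jar.
library(rmr2)
library(ravro)
Sys.setenv(AVRO_LIBS = ravro::AVRO_TOOLS)
fmt = make.input.format(
  "avro",
  schema.file = "hdfs://localhost:8020/data/logs.avsc")
out = mapreduce(
  input = "/data/logs.avro",
  input.format = fmt,
  map = function(k, v) keyval(NULL, nrow(v)))
```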

Renan Pinzon

unread,
Sep 17, 2014, 7:15:08 PM9/17/14
to rha...@googlegroups.com, rpi...@gmail.com, ant...@piccolboni.info
Hi Piccolboni,

I wrote this piece of code based on tests and dev branch.

I looked at the test you mentioned and noticed that the schema is read from a temporary file, which is local.

When I run ravro::read.avro('/path/to/local/file.avro') it works fine, but if I run ravro::read.avro('hdfs://localhost:8020/path/to/hdfs/file.avro') it fails.

In fact, I'm not using the dev branch, because I'm not sure whether it is compatible with version 3.2.0 (packaged and available in the wiki).

Is it OK to switch to the dev branch, or could I face compatibility issues?

Regards,
Renan

Antonio Piccolboni

unread,
Sep 17, 2014, 7:44:43 PM9/17/14
to Renan Pinzon, RHadoop Google Group
On Wed, Sep 17, 2014 at 4:15 PM, Renan Pinzon <rpi...@gmail.com> wrote:
Hi Piccolboni,

I wrote this piece of code based on tests and dev branch.

You will forgive a degree of perplexity when I see the code of the library pasted back in a message. Libraries are used with library(), not by cut and paste.
 
I saw the test that you said and I noticed that the schema is gotten from a temporary file which is local.

And? 


When I run the command ravro::read.avro('/path/to/local/file.avro') it works fine but if I run ravro::read.avro('hdfs://localhost:8020/path/to/hdfs/file.avro') it fails.

That's talking. I will try to repro. You just did an hdfs dfs -put local.file.avro hdfs.file.avro, correct?
 

I'm not using the dev branch in fact cause I'm not sure if it is compatible with the version 3.2.0 (packaged and available in the wiki).

You mean as in backward compatibility? It is supposed to be. Backward compatibility is determined by the highest version number (same first number). Data compatibility has been a little more shaky of late, and in general it's not recommended to use the "native" format for archival purposes.
 

Is it ok to change to dev branch or I can face with compatibility issues?

Sorry, if you build from git there's no guarantee whatsoever, independent of branch. I mean, legally there isn't any warranty for anything, but for releases we at least tested them. Between commits, you are living dangerously. Listen, I could have told you not-available-yet-sorry, end of conversation. We have a chance to develop this thing together, but you can't have it in production tomorrow.

Antonio

Renan Pinzon

unread,
Sep 18, 2014, 10:41:04 AM9/18/14
to rha...@googlegroups.com, rpi...@gmail.com, ant...@piccolboni.info


On Wednesday, September 17, 2014 8:44:43 PM UTC-3, Antonio Piccolboni wrote:


On Wed, Sep 17, 2014 at 4:15 PM, Renan Pinzon <rpi...@gmail.com> wrote:
Hi Piccolboni,

I wrote this piece of code based on tests and dev branch.

You will forgive a degree of perplexity when I see the code of the library pasted back in a message. Libraries are used with library(), not by cut and paste.

Yes, I'm sure about this, but I wrote this piece of code to create an Avro input format because it's only available on the dev branch, which I'm not confident using, so I'm trying to do it in my own code running the latest released version of the library.
 
 
I saw the test that you said and I noticed that the schema is gotten from a temporary file which is local.

And? 

I was wondering about reading the schema from the same file in HDFS that contains the entire data.
 


When I run the command ravro::read.avro('/path/to/local/file.avro') it works fine but if I run ravro::read.avro('hdfs://localhost:8020/path/to/hdfs/file.avro') it fails.

That't talking. Will try to repro. You just did an hdfs dfs -put local.file.avro hdfs.file.avro, correct?

Yes, the file is already in HDFS. If I run hdfs dfs -ls hdfs://localhost:8020/path/to/hdfs I can list the contents of the directory and see the file there, so localhost:8020 is correct and accessible.
 

I'm not using the dev branch in fact cause I'm not sure if it is compatible with the version 3.2.0 (packaged and available in the wiki).

You mean as in backward compatibility? It is supposed to.  Backwards compatibility is determined by the highest number (same first number). Data compatibility has been a little more shaky of late and in general it's not recommended to use the "native" format for archival purposes. 
 
Yes, that's the point. I believe that answers the next question too.
 

Is it ok to change to dev branch or I can face with compatibility issues?

Sorry, if you build from git there's no guarantee whatsoever, independent of branch. I mean, legally there isn't any warranty for anything but for releases we at least tested them. Between commits, you are living dangerously. Listen, I could have told you not-available-yet-sorry end of conversation. We have a chance to develop this thing together, but you can't have it in production tomorrow. 

Ok, I understand that, but as I said in the previous question I was concerned about backward compatibility. So I know I cannot go to production while the dev branch is not released.
 
Renan

Antonio Piccolboni

unread,
Sep 18, 2014, 12:36:16 PM9/18/14
to RHadoop Google Group, Renan Pinzon
On Thu, Sep 18, 2014 at 7:41 AM, Renan Pinzon <rpi...@gmail.com> wrote:


On Wednesday, September 17, 2014 8:44:43 PM UTC-3, Antonio Piccolboni wrote:


On Wed, Sep 17, 2014 at 4:15 PM, Renan Pinzon <rpi...@gmail.com> wrote:
Hi Piccolboni,

I wrote this piece of code based on tests and dev branch.

You will forgive a degree of perplexity when I see the code of the library pasted back in a message. Libraries are used with library(), not by cut and paste.

Yes, I'm sure about this, but I wrote this piece of code to create an avro input format cause it's only available on dev branch which I'm not secure to use, so I'm trying to do in my own code running the latest released version of the library.


So the way software engineers speak about this is that you tried to backport the Avro format to the current stable version. You can't say "I wrote" for stuff that someone else wrote; that's called plagiarism. If you write that in an email to the author of the plagiarized code, there are other ways to describe it, none of which is flattering enough for me to use on this forum.
 
 
 
I saw the test that you said and I noticed that the schema is gotten from a temporary file which is local.

And? 

I was wondering about reading the schema from the same file in HDFS that contains the entire data.

Well, the manual is pretty clear on this (make.input.format, "avro" entry):

"(input only) It has one mandatory additional argument, schema.file that should provide the URL of a file containing an appropriate avro schema, can be the same as file to be read. The user can specify the protocol, for instance file: or hdfs: as part of the URL, with the first being the default."

I am having a problem with this feature though; let me get back to you.
 
 


When I run the command ravro::read.avro('/path/to/local/file.avro') it works fine but if I run ravro::read.avro('hdfs://localhost:8020/path/to/hdfs/file.avro') it fails.

That't talking. Will try to repro. You just did an hdfs dfs -put local.file.avro hdfs.file.avro, correct?

Yes, the file is already in HDFS. If I run the command hdfs dfs -ls hdfs://localhost:8020/path/to/hdfs I can list the contents of the directory and see the file there, so the localhost:8020 is correct and accessible.


Well, I need to be sure it's absolutely the same file. I will send  a script that we can both try.

Renan Pinzon

unread,
Sep 18, 2014, 1:19:19 PM9/18/14
to rha...@googlegroups.com, rpi...@gmail.com, ant...@piccolboni.info


On Thursday, September 18, 2014 1:36:16 PM UTC-3, Antonio Piccolboni wrote:



So the way software engineers speak about this is that you tried to backport the Avro format to the current stable version. You can't say "I wrote" for stuff that someone else wrote; that's called plagiarism. If you write that in an email to the author of the plagiarized code, there are other ways to describe it, none of which is flattering enough for me to use on this forum.

I understand what you say, but I only said that "I wrote" cause I was trying another solution using avro-utils (https://github.com/tomslabs/avro-utils) which requires to create an input format similar to this one and then I changed the code to this. In fact now the code is a bit different because I removed the line that reads the schema from the local file and placed the schema there, but it didn't solved the problem yet.
 
 
 
 

I was wondering about reading the schema from the same file in HDFS that contains the entire data.

Well the manual is pretty clear on this (make.input.format, "avro" entry)

"(input only) It has one mandatory additional argument, schema.file that should provide the URL of a file containing an appropriate avro schema, can be the same as file to be read. The user can specify the protocol, for instance file: or hdfs: as part of the URL, with the first being the default."

I am having problem with this feature though, let me get back to you.

Ok, I'll be waiting and I'll also package the git dev branch and try to use it.
 
 
 




Well, I need to be sure it's absolutely the same file. I will send  a script that we can both try.

The file that I tried is exactly the same, I just ran the put command to send it to HDFS.

Antonio Piccolboni

unread,
Sep 18, 2014, 2:52:10 PM9/18/14
to RHadoop Google Group, Renan Pinzon
On Thu, Sep 18, 2014 at 10:19 AM, Renan Pinzon <rpi...@gmail.com> wrote:


I was wondering about reading the schema from the same file in HDFS that contains the entire data.

Well the manual is pretty clear on this (make.input.format, "avro" entry)

"(input only) It has one mandatory additional argument, schema.file that should provide the URL of a file containing an appropriate avro schema, can be the same as file to be read. The user can specify the protocol, for instance file: or hdfs: as part of the URL, with the first being the default."

I am having problem with this feature though, let me get back to you.

Ok, I'll be waiting and I'll also package the git dev branch and try to use it.

It looks like we won't be able to support this feature and it'll have to be dropped. It was dropped from ravro because it was not supported in avro_tools and there was a lack of communication to the rmr2 team. Sorry about that. Is it a show stopper or something that can be worked around?


A

Renan Pinzon

unread,
Sep 19, 2014, 10:24:14 AM9/19/14
to rha...@googlegroups.com, rpi...@gmail.com, ant...@piccolboni.info


On Thursday, September 18, 2014 3:52:10 PM UTC-3, Antonio Piccolboni wrote:


It looks like we won't be able to support this feature and it'll have to be dropped. It was dropped from ravro because it was not supported in avro_tools and there was a lack of communication to the rmr2 team. Sorry about that. Is it a show stopper or something that can be worked around?

Ok. It is not a show stopper at all, but I'll have to change my input to not use Avro for now. I'm also considering forking Avro to add support in avro-tools for files from HDFS, but I'm not sure I'll have enough time to do it.

I really appreciate your time and attention on this case. I'll get back to you as soon as possible on whether I can do this in avro-tools, but first I have to check whether their code is easily extendable, because my deadline for this is today.


Renan
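One possible interim workaround, sketched here only as an idea (the paths are made-up examples, and it assumes the avro-tools jar shipped with ravro supports the tojson command): pull the file out of HDFS, dump it to JSON text, and process that with a plain text/JSON input format instead:

```r
# Sketch: copy the avro file locally, convert it to one-JSON-record-per-line
# text with avro-tools' tojson, then treat it as ordinary text input.
# All paths are made-up examples.
system("hdfs dfs -get /path/to/hdfs/file.avro /tmp/file.avro")
system2("java", c("-jar", ravro::AVRO_TOOLS, "tojson", "/tmp/file.avro"),
        stdout = "/tmp/file.json")
```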

Renan Pinzon

unread,
Sep 19, 2014, 3:52:32 PM9/19/14
to rha...@googlegroups.com, rpi...@gmail.com, ant...@piccolboni.info

I forked Avro and have been working on a new branch off release 1.7.4, which is the version that I'm running, and I got ravro:::avro_get_schema working fine with files from HDFS.

Now I'm facing problems with the next steps. Considering the following code, I made some changes to use rhdfs:::hdfs.line.reader instead of readLines.

avro.input.format = function(schema.file, ..., read.size = 10^5) {
  schema = ravro:::avro_get_schema(file = schema.file)
  function(con) {
    lines = readLines(con = con, n = read.size)
    if (length(lines) == 0)
      NULL
    else {
      x = splat(paste.fromJSON)(lines)
      y = ravro:::parse_avro(x, schema, encoded_unions=FALSE, ...)
      keyval(NULL, y)
    }
  }
}


I'm stuck on the line "x = splat(paste.fromJSON)(lines)".

The first problem I found is the function "splat", which is not defined. Searching on GitHub I found this code https://github.com/hadley/plyr/blob/master/R/splat.r but I'm not sure whether this function is supposed to be used here.

The second thing I noticed is that the function "paste.fromJSON" is not defined either; I took a look at the "rjson" library and found only "fromJSON".

Antonio Piccolboni

unread,
Sep 19, 2014, 4:56:39 PM9/19/14
to RHadoop Google Group, Renan Pinzon
Yep, that's what you get cutting and pasting code. First a question: can you push your Avro change upstream? We can support the URI syntax only if your patch gets accepted; please send me a link if you do, and I will follow and/or support your request. The splat function is from plyr, correct, and paste.fromJSON is from the rmr2 dev branch. If you look at the code you will see there's a message to use Avro 1.7.7 or trunk. The jar you got with ravro is a patched 1.7.4; since you forked off 1.7.4, you will miss the necessary patch, see https://issues.apache.org/jira/browse/AVRO-1454


Antonio
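For the record, plyr::splat just turns a function of several arguments into a function of one list of arguments; a base-R approximation (not the plyr source) behaves like this:

```r
# A base-R approximation of plyr::splat: wrap f so it takes its arguments
# bundled in a single list and applies them via do.call.
splat_like = function(f) {
  function(args) do.call(f, args)
}
splat_like(sum)(list(1, 2, 3))     # 6
splat_like(paste)(list("a", "b"))  # "a b"
```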

Renan Pinzon

unread,
Sep 19, 2014, 7:09:33 PM9/19/14
to rha...@googlegroups.com, rpi...@gmail.com, ant...@piccolboni.info

Yes, I can do it, but first of all I need to do some refactoring to correct things that I simplified in my test code.

Regarding Avro, I had to patch 1.7.4 because it's the version running on my cluster due to CDH 4.7. I believe it won't be a problem, since I can backport the patch that you mentioned.

Renan

Antonio Piccolboni

unread,
Sep 19, 2014, 7:11:39 PM9/19/14
to RHadoop Google Group, Renan Pinzon
On Fri, Sep 19, 2014 at 4:09 PM, Renan Pinzon <rpi...@gmail.com> wrote:

Yes, I can do it but first of all I need to do a refactoring in order to correct things that I simplified in my test code.

Absolutely, first things first.
 

Regarding to the avro I had to patch 1.7.4 cause it's the version that is running on my cluster due to CDH 4.7. I believe that it won't be a problem since I can backport the patch that you mentioned.

Excellent.

Antony Rudkin

unread,
Jul 2, 2015, 8:07:35 AM7/2/15
to rha...@googlegroups.com, rpi...@gmail.com, ant...@piccolboni.info
Is there a solution for reading Avro from HDFS using ravro?

Antonio Piccolboni

unread,
Jul 2, 2015, 9:31:38 AM7/2/15
to rha...@googlegroups.com
Please no thread hijacking. Starting a new thread is free.