datadr divide() Java error


Jakub Paulina

May 17, 2016, 3:54:55 PM
to Tessera-Users
Hello,
I have a problem dividing 261,744 rows and 11 columns with the divide() function:
divide(newYorkData, update = TRUE, by = "lang",
       output = hdfsConn("/paulina/hdfsFiles29", autoYes = TRUE),
       control = rhctl1)
rhctl1 <- rhipeControl(mapred = list(
  rhipe_map_buff_size = 100,
  mapred.max.split.size = 1024 * 1024,
  mapred.task.timeout = 0,
  mapred.tasktracker.map.tasks.maximum = 4,
  mapreduce.map.cpu.vcores = 2,
  mapreduce.map.memory.mb = 3072 + 3072), jobname = "pucTest")


The first thing I noticed is that none of my divide jobs run through Rhipe and Hadoop; for some reason they always run locally.
Secondly, for this medium-sized dataset I am getting the error:
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, : java.lang.OutOfMemoryError: Java heap space

I tried typing options(java.parameters = "-Xmx8000m") or options(java.parameters = "-Xmx3072m") in RStudio and restarting, but neither worked. Any ideas? I don't know whether this is a Tessera-related problem or something else.

Jakub Paulina

May 17, 2016, 5:26:12 PM
to Tessera-Users
I fixed it. It was an rJava allocation problem:
- First, restart the R session.
- Then set options(java.parameters = "-Xmx6g"), where "6g" means 6 GB.
- After that you can load library(rJava).
- If that doesn't work, try allocating more memory if you can. (A consolidated sketch follows this list.)
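A minimal consolidated sketch of the steps above; the heap size must be set before rJava starts, so run it in a fresh R session:

  options(java.parameters = "-Xmx6g")  # 6 GB heap; raise this if more RAM is available
  library(rJava)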

But I still have a question about divide(). Why is there a control argument? I don't think divide is using it.

Ryan Hafen

May 17, 2016, 5:33:06 PM
to Jakub Paulina, Tessera-Users
Glad to hear you got part of it working.  Hadoop configuration can be very frustrating and difficult to debug.

A few questions / notes:

- Is your newYorkData object on HDFS or in memory?  If it is not on HDFS, the divide will not be carried out using Rhipe / Hadoop.  The back end used for computation is determined by the type of the input.  If the output type is different from the input type, it will simply make a conversion to the output format after the computation has completed.  So if newYorkData is not on HDFS, your Rhipe options will be ignored since it is not using Hadoop.  If it is in memory and you want to write it to HDFS, there are several options (see ?convert or ?addData, and the sketch after this list).
- Is the newYorkData object a single key-value pair?
- Check some of the values of rhoptions()$mropts to make sure that your Hadoop config is correct and that Rhipe is talking to your cluster
- You should declare rhctl1 before the divide
- How big is each subset - is the distribution skewed?  How big is the biggest subset (roughly)?  
- Note that several Hadoop parameters cannot be overridden at the job level
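A minimal sketch of the conversion mentioned in the first point, assuming newYorkData is an in-memory ddf (the HDFS path is illustrative):

  conn <- hdfsConn("/paulina/newYorkData", autoYes = TRUE)
  newYorkHdfs <- convert(newYorkData, conn)
  # divide() on newYorkHdfs now runs through Rhipe/Hadoop and honors rhctl1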

Hopefully some of that will help sort things out.



Jakub Paulina

May 18, 2016, 6:52:09 AM
to Tessera-Users, jakub.pa...@gmail.com
Thanks for the reply; I understand it better now. rhctl1 was already loaded in memory.
But the problem is that I couldn't convert my data to a ddf, because after that my divide() wouldn't work.
I had a problem with list(list()) variables:
Error in data.frame(list(), list(indices = list(c(42L, 57L)), text = "EARTHangelHOUR"),  : 
  arguments imply differing number of rows: 0, 1, 2, 7, 6, 3, 5, 11, 8, 4, 12, 9, 10, 13, 14, 16

ddfData = ddf(newData)
system.time({
  twitterDDF = divide(ddfData, update = TRUE, by = "lang",
    output = hdfsConn("/paulina/hdfsFiles40", autoYes = TRUE), control = rhctl1)
  test = addTransform(ddfData, TimeSeparator)
  varMeans <- recombine(test, control = rhctl1, verbose = TRUE)
})

This is my vicious circle: I want to use addTransform after divide, but to use divide I need the transformed data already. The transformation functions take care of the problem above. And the funny thing is that if I don't use the ddf() function, it converts fine from a data.frame, but locally.

When I try what you suggested:

ddfData = ddf(newYorkData)
rhmkdir("/paulina/hdfsFiles42")
rhchmod("paulina/hdfsFiles42", "777")
conn <- hdfsConn("/paulina/hdfsFiles42")
testNY = convert(ddfData, conn)
test1 = addTransform(testNY, TimeSeparator)
twitterDDF = divide(test1, update = TRUE, by = "lang", output = hdfsConn("/paulina/hdfsFiles40", autoYes = TRUE), control = rhctl1)

the problem is with addTransform: it tests the function on a subset, and that takes too long because the data aren't divided.
If I am not wrong, ddf(newYorkData) makes one key and one value out of all the data, and addTransform then tests the function on all of it. That is the vicious circle I have: I need transformed data to divide, but I need divide for a fast transform.
Do you have a solution for this? Am I understanding it correctly? Maybe addTransform should have a logical parameter to enable or disable testing on a subset, or a control parameter so the test can run faster.

Ryan Hafen

May 18, 2016, 3:59:08 PM
to Jakub Paulina, Tessera-Users
It sounds like your transformation function might be pretty computationally intensive.  It would probably be nice to have the option not to test the transformation on a subset.  What is the transformation doing to the data?  If it needs the entire data set to be able to compute whatever it needs, you could just transform the data first and then cast the result as a ddf and write it to hdfs.  Another option is to write your data set to hdfs in chunks, such that the transformation is applied to only a small subset of the data when it is tested.  For example, you could save each chunk of 10k rows as a separate key/value pair to hdfs.  This would assume that your transformation function can operate independently on different chunks of the data.  I can help walk you through this latter case if it is applicable.
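For example, a minimal sketch of that latter case, assuming the transformation can operate independently on 10k-row chunks (the path and names are illustrative):

  conn <- hdfsConn("/paulina/chunked", autoYes = TRUE)
  chunks <- split(newYorkData, (seq_len(nrow(newYorkData)) - 1) %/% 10000)
  addData(conn, Map(list, names(chunks), chunks))   # one key/value pair per chunk
  ddfChunked <- ddf(conn, update = TRUE)
  test <- addTransform(ddfChunked, TimeSeparator)   # now tested on one small chunk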

Ryan



--
You received this message because you are subscribed to the Google Groups "Tessera-Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tessera-user...@googlegroups.com.
To post to this group, send email to tesser...@googlegroups.com.

Jakub Paulina

May 19, 2016, 7:51:07 AM
to Tessera-Users, jakub.pa...@gmail.com
These are my transformation functions.
TimeSeparator <- function(x)
{
  for(i in 1:nrow(x)){
    x$Day[i] = x$created[[i]][1]
    x$Month[i] = x$created[[i]][2]
    x$DayNumber[i] = x$created[[i]][3]
    x$Time[i] = x$created[[i]][4]
    x$Year[i] = x$created[[i]][6]
  }
  x$created_at = NULL
  x = TimeHourMinSeparator(x)
  z = ListFixer(x)
  z$media = NULL
  z$user_mentions = NULL
  z$coordinates = NULL
  z$hashtags = NULL
  return(z)
}
ListFixer <- function(x){
  for(i in 1:nrow(x)){
    x$id[i] = x$id[[i]][1]
    x$coordLong[i] = x$coordinates[[i]][1]
    x$coordLat[i] = x$coordinates[[i]][2]
    if(length(x$hashtags[[i]]$text) == 0){
      x$hashtagsText[[i]] = NA
    } else {
      for(y in 1:length(x$hashtags[[i]]$text)){
        x$hashtagsText[[i]][y] = as.list(x$hashtags[[i]]$text[y])
      }
    }
    if(length(x$user_mentions[[i]]) == 0){
      x$user_mentions[[i]] = NA
    } else {
      x$user_mentions[[i]]$id_str = NULL
      x$user_mentions[[i]]$indices = NULL
     
    }
    if(length(x$media[[i]]) == 0){
      x$media_url[[i]] = NA
      x$media_type[[i]] = NA
    } else {
      for(y in 1:length(x$media[[i]]$type)){
        x$media_url[[i]][y] = as.list(x$media[[i]]$url[y])
        x$media_type[[i]][y] = as.list(x$media[[i]]$type[y])
      }
    }
  }
  return(x)
}
TimeHourMinSeparator <- function(x)
{
  TimeSepared = strsplit(x$Time, split = ":")
  for(i in 1:nrow(x)){
    x$Hours[i] = TimeSepared[[i]][1]
    x$Mins[i] = TimeSepared[[i]][2]
    x$Secs[i] = TimeSepared[[i]][3]  # index 3 gives the seconds field
  }
  return(x)
}

I know it's not optimal, but I am mainly a classic C/C# programmer, so for loops are natural to me. I want to use Hadoop to compute this transformation. I will try to split the data - that's a pretty good idea - and then I will post my results.
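For reference, a vectorized sketch of TimeHourMinSeparator that drops the per-row loop (it assumes every Time string has exactly the HH:MM:SS form):

  TimeHourMinSeparator <- function(x) {
    parts <- do.call(rbind, strsplit(x$Time, split = ":"))
    x$Hours <- parts[, 1]
    x$Mins  <- parts[, 2]
    x$Secs  <- parts[, 3]
    x
  }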

Jakub Paulina

May 19, 2016, 6:17:53 PM
to Tessera-Users, jakub.pa...@gmail.com

So for now I used:

rhctl3 <- rhipeControl(mapred = list(
  mapreduce.map.memory.mb = 3072+3072,
  mapreduce.reduce.memory.mb = 3072+3072
), jobname="Test1")
rhmkdir("/paulina/hdfsFiles18", "777")
conn <- hdfsConn("/paulina/hdfsFiles18")
rhwrite(varMeans, "/paulina/hdfsFiles18", kvpairs = TRUE)  # varMeans are key-value pairs
ddfTest = ddf(conn,update = TRUE, control = rhctl3, verbose = TRUE)
test2 = addTransform(ddfTest, ListFixer)
varMeans2 <- recombine(test2, control=rhctl3, verbose = TRUE)

I get an error on ddf(). It's strange, because I am using a control function and the data are on HDFS, so I don't understand why it doesn't work. It looks like it is ignoring the control parameter again, and rhoptions() doesn't work either. Maybe this is some Hadoop problem; I will look at it tomorrow.
Container [pid=15470,containerID=container_1463693104474_0011_01_000002] is running beyond physical memory limits. Current usage: 3.0 GB of 1 GB physical memory used; 4.0 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1463693104474_0011_01_000002 :
	|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
	|- 15475 15470 15470 15470 (java) 725 37 1461231616 176779 /usr/java/jdk1.7.0_67-cloudera/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Djava.net.preferIPv4Stack=true -Xmx820m -Djava.io.tmpdir=/yarn/nm/usercache/paulina1/appcache/application_1463693104474_0011/container_1463693104474_0011_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1463693104474_0011/container_1463693104474_0011_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 147.232.202.109 59648 attempt_1463693104474_0011_m_000000_0 2 
	|- 15511 15475 15470 15470 (RhipeMapReduce) 617 79 2717884416 618134 /usr/local/lib64/R/library/Rhipe/bin/RhipeMapReduce --slave --silent --vanilla 
	|- 15470 15468 15470 15470 (bash) 0 0 108613632 338 /bin/bash -c /usr/java/jdk1.7.0_67-cloudera/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN  -Djava.net.preferIPv4Stack=true -Xmx820m -Djava.io.tmpdir=/yarn/nm/usercache/paulina1/appcache/application_1463693104474_0011/container_1463693104474_0011_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1463693104474_0011/container_1463693104474_0011_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 147.232.202.109 59648 attempt_1463693104474_0011_m_000000_0 2 1>/var/log/hadoop-yarn/container/application_1463693104474_0011/container_1463693104474_0011_01_000002/stdout 2>/var/log/hadoop-yarn/container/application_1463693104474_0011/container_1463693104474_0011_01_000002/stderr  

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Container [pid=15561,containerID=container_1463693104474_0011_01_000003] is running beyond physical memory limits. Current usage: 1.2 GB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1463693104474_0011_01_000003 :
	|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
	|- 15566 15561 15561 15561 (java) 688 29 1590136832 188857 /usr/java/jdk1.7.0_67-cloudera/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Djava.net.preferIPv4Stack=true -Xmx820m -Djava.io.tmpdir=/yarn/nm/usercache/paulina1/appcache/application_1463693104474_0011/container_1463693104474_0011_01_000003/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1463693104474_0011/container_1463693104474_0011_01_000003 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 147.232.202.109 59648 attempt_1463693104474_0011_m_000000_1 3 
	|- 15601 15566 15561 15561 (RhipeMapReduce) 444 18 807518208 123816 /usr/local/lib64/R/library/Rhipe/bin/RhipeMapReduce --slave --silent --vanilla 
	|- 15561 15559 15561 15561 (bash) 0 0 108613632 338 /bin/bash -c /usr/java/jdk1.7.0_67-cloudera/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN  -Djava.net.preferIPv4Stack=true -Xmx820m -Djava.io.tmpdir=/yarn/nm/usercache/paulina1/appcache/application_1463693104474_0011/container_1463693104474_0011_01_000003/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1463693104474_0011/container_1463693104474_0011_01_000003 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 147.232.202.109 59648 attempt_1463693104474_0011_m_000000_1 3 1>/var/log/hadoop-yarn/container/application_1463693104474_0011/container_1463693104474_0011_01_000003/stdout 2>/var/log/hadoop-yarn/container/application_1463693104474_0011/container_1463693104474_0011_01_000003/stderr  

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Container [pid=15631,containerID=container_1463693104474_0011_01_000004] is running beyond physical memory limits. Current usage: 1.1 GB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1463693104474_0011_01_000004 :
	|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
	|- 15636 15631 15631 15631 (java) 724 31 1594765312 177198 /usr/java/jdk1.7.0_67-cloudera/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Djava.net.preferIPv4Stack=true -Xmx820m -Djava.io.tmpdir=/yarn/nm/usercache/paulina1/appcache/application_1463693104474_0011/container_1463693104474_0011_01_000004/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1463693104474_0011/container_1463693104474_0011_01_000004 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 147.232.202.109 59648 attempt_1463693104474_0011_m_000000_2 4 
	|- 15671 15636 15631 15631 (RhipeMapReduce) 446 17 807526400 123817 /usr/local/lib64/R/library/Rhipe/bin/RhipeMapReduce --slave --silent --vanilla 
	|- 15631 15629 15631 15631 (bash) 0 0 108613632 338 /bin/bash -c /usr/java/jdk1.7.0_67-cloudera/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN  -Djava.net.preferIPv4Stack=true -Xmx820m -Djava.io.tmpdir=/yarn/nm/usercache/paulina1/appcache/application_1463693104474_0011/container_1463693104474_0011_01_000004/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1463693104474_0011/container_1463693104474_0011_01_000004 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 147.232.202.109 59648 attempt_1463693104474_0011_m_000000_2 4 1>/var/log/hadoop-yarn/container/application_1463693104474_0011/container_1463693104474_0011_01_000004/stdout 2>/var/log/hadoop-yarn/container/application_1463693104474_0011/container_1463693104474_0011_01_000004/stderr







By the way, I found your other project, rbokeh (http://hafen.github.io/rbokeh/#installation). It's a nice package. Will it become part of Tessera?

And finally, some small output from my work on Tessera :)


 

Jakub Paulina

May 20, 2016, 10:29:10 AM
to Tessera-Users, jakub.pa...@gmail.com
I managed to solve this problem only with global Hadoop options in Cloudera Manager, where I increased the memory for mapreduce.map.memory.mb = 3072+3072 and mapreduce.reduce.memory.mb = 3072+3072. I don't like this solution, but it works :/

Ryan Hafen

May 20, 2016, 11:11:56 AM
to Jakub Paulina, Tessera-Users
Hi Jakub,

Sorry for the delay - I've been traveling.  Yes, unfortunately, if any collection being processed by Hadoop exceeds the global Hadoop memory limits, you will run into issues.  This can often be managed by setting the Rhipe map buffer size to be very small (e.g. 1 at the smallest).  Memory limits for containers cannot be changed on a per-job basis as I understand it, but it seems like you should be able to tweak map and reduce memory limits per job.  As hard as we try to make things as simple as possible when moving up to Hadoop, unfortunately, thanks to the complexities of Hadoop, there will always be snags.
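For example, a sketch of a per-job control with a very small map buffer (the values are illustrative):

  rhctlSmall <- rhipeControl(mapred = list(
    rhipe_map_buff_size = 1,           # read one key/value pair at a time
    mapreduce.map.memory.mb = 6144,
    mapreduce.reduce.memory.mb = 6144
  ), jobname = "smallBuffer")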

Also, thanks for sharing your output!  rbokeh is technically an independent product in that its use goes well beyond Tessera for general-purpose visualization in R.  I use it almost exclusively for general plotting.  However, it pairs nicely with Trelliscope and is therefore an honorary member of the framework.

Ryan



Jakub Paulina

May 20, 2016, 12:09:14 PM
to Tessera-Users, jakub.pa...@gmail.com
OK, I will try to use a smaller buffer size. Thank you very much. I will keep posting my progress with Tessera. By the way, in another article you wrote about some upcoming updates to Tessera. Are they planned for this year? As you may know, I am writing my thesis about Tessera, and one of its chapters could cover planned future features. Can you provide me some more info about that? Of course, only if this information isn't secret :)

Jakub Paulina

May 22, 2016, 11:15:54 AM
to Tessera-Users, jakub.pa...@gmail.com
OK, I tried many things, and I think divide with MapReduce is behaving horribly. I am sure I am doing something wrong, but I still don't know what.
So let's start: I have 46,546 rows and 14 columns.
rhctl <- rhipeControl(mapred = list(
  rhipe_map_buff_size = 1,
  mapred.max.split.size = 1024 * 1024,
  mapreduce.map.memory.mb = 3072 + 3072 + 3072 + 3072,
  mapreduce.reduce.memory.mb = 3072 + 3072 + 3072 + 3072
), jobname = "pucTest")
rhmkdir("paulina/BTfile10", "777")
rhexists("paulina/BTfile10")
connect <- hdfsConn("/paulina/BTfile10/", autoYes = TRUE)
addData(connect, list(list("Split", twitterData)))
ddfDat = ddf(connect, control = rhctl, update = TRUE)
twitterByLang = divide(ddfDat, update = TRUE, by = "lang", output = hdfsConn("/paulina/hdfsFiles26", autoYes = TRUE), control = rhctl)
or I tried:
twitterByLang = divide(connect, update = TRUE, by = "lang", output = hdfsConn("/paulina/hdfsFiles26", autoYes = TRUE), control = rhctl)

MapReduce tries to take more than 12 GB for this job. While divide is verifying parameters on the node where I have my R session, it takes 30 GB of RAM! That is a huge amount of RAM for such small data. I tried changing the Rhipe buffer size and deleting split.size, but that doesn't help.
By the way, when I load the data with divide locally, it takes just a few seconds. Any idea what I am doing wrong? Thanks.
> head(twitterData)
  lang              place                                                                                                  text        id Day Month
1   en         Queens, NY                                                          @JayyMarley LMFAOOO you know I do \U0001f602 321782606 Tue   Mar
2   en       Brooklyn, NY                                         Thank you @AliciaSilv for your support of #EARTHangelHOUR !!! 917249834 Tue   Mar
3   en       Brooklyn, NY                                                                       @aidenleslie Same to you. Enjoy 605652326 Tue   Mar
4   en Midtown, Manhattan               in the beautiful city with my sweet heart I couldn't be happier. http://t.co/Wx0olVhuuj 100870683 Tue   Mar
5   en         Verona, NJ                                              @Eminem @rosenberg I neeeeeed to. http://t.co/fzppEFOivk 967058612 Tue   Mar
6   es      Manhattan, NY n_earley took me to #Despaña for #lunch. Era delicioso. @ Despaña Vinos y Mas https://t.co/JumyRrIw7v  20028465 Tue   Mar
  DayNumber coordLong coordLat   hashtagsText          date Hours Mins Secs
1        17  -73.8007  40.6935             NA 3907903-11-03    18   23   23
2        17  -73.9396  40.7225 EARTHangelHOUR 3907903-11-03    18   23   23
3        17  -73.9229  40.6620             NA 3907903-11-04    18   24   24
4        17  -73.9824  40.7679             NA 3907903-11-04    18   24   24
5        17  -74.2482  40.8405             NA 3907903-11-05    18   24   24
6        17  -73.9983  40.7213 Despaña, lunch 3907903-11-06    18   24   24
Or is divide not meant to be used with MapReduce?

jeremiah rounds

May 22, 2016, 12:46:36 PM
to Jakub Paulina, Tessera-Users
Out of curiosity, what is the output of

x = ddfDat[1]
str(x)

Maybe nothing; I just want to see.


Jakub Paulina

May 22, 2016, 1:25:11 PM
to Tessera-Users, jakub.pa...@gmail.com
List of 1
 $ :List of 2
  ..$ key  : chr "Split"
  ..$ value:'data.frame':	46546 obs. of  14 variables:
  .. ..$ lang        : Factor w/ 40 levels "ar","bg","bs",..: 8 8 8 8 8 9 8 8 8 8 ...
  .. ..$ place       : chr [1:46546] "Queens, NY" "Brooklyn, NY" "Brooklyn, NY" "Midtown, Manhattan" ...
  .. ..$ text        : chr [1:46546] "@JayyMarley LMFAOOO you know I do \U0001f602" "Thank you @AliciaSilv for your support of #EARTHangelHOUR !!!" "@aidenleslie Same to you. Enjoy" "in the beautiful city with my sweet heart I couldn't be happier. http://t.co/Wx0olVhuuj" ...
  .. ..$ id          :List of 46546
  .. .. ..$ : int 321782606
  .. .. ..$ : int 917249834
  .. .. ..$ : int 605652326
  .. .. ..$ : int 100870683
  .. .. ..$ : int 967058612
  .. .. ..$ : int 20028465
  .. .. ..$ : chr "2444259996"
  .. .. ..$ : chr "2270517068"
  .. .. ..$ : int 43392430
  .. .. ..$ : int 238255696
  .. .. ..$ : int 161583058
  .. .. ..$ : int 1089567061
  .. .. ..$ : chr "3086389435"
  .. .. ..$ : int 18264864
  .. .. ..$ : int 14094137
  .. .. ..$ : chr "2866041279"
  .. .. ..$ : int 19243536
  .. .. ..$ : int 156030607
  .. .. ..$ : int 18399094
  .. .. ..$ : int 830805157
  .. .. ..$ : int 43392430
  .. .. ..$ : int 698073
  .. .. ..$ : int 16211722
  .. .. ..$ : chr "3018536627"
  .. .. ..$ : int 303395427
  .. .. ..$ : int 59613403
  .. .. ..$ : chr "2285152153"
  .. .. ..$ : int 506520758
  .. .. ..$ : chr "2469647625"
  .. .. ..$ : int 1852746906
  .. .. ..$ : int 245429055
  .. .. ..$ : int 509282628
  .. .. ..$ : int 133847036
  .. .. ..$ : int 1320161460
  .. .. ..$ : int 20753474
  .. .. ..$ : int 227146218
  .. .. ..$ : chr "2444259996"
  .. .. ..$ : int 99529293
  .. .. ..$ : int 1258631880
  .. .. ..$ : int 865940286
  .. .. ..$ : chr "2232233572"
  .. .. ..$ : int 601929914
  .. .. ..$ : int 268944886
  .. .. ..$ : chr "2726856134"
  .. .. ..$ : int 979295208
  .. .. ..$ : int 542828934
  .. .. ..$ : chr "2866041279"
  .. .. ..$ : chr "2232233572"
  .. .. ..$ : int 39769919
  .. .. ..$ : chr "2160863864"
  .. .. ..$ : int 317397017
  .. .. ..$ : chr "2716523343"
  .. .. ..$ : chr "3006971021"
  .. .. ..$ : int 238255696
  .. .. ..$ : int 36107445
  .. .. ..$ : int 74694356
  .. .. ..$ : int 1708951410
  .. .. ..$ : int 19998429
  .. .. ..$ : int 133847036
  .. .. ..$ : int 23880149
  .. .. ..$ : int 303395427
  .. .. ..$ : chr "3013313075"
  .. .. ..$ : int 1525605420
  .. .. ..$ : int 24313651
  .. .. ..$ : int 281586001
  .. .. ..$ : int 189932645
  .. .. ..$ : int 63351193
  .. .. ..$ : int 315009700
  .. .. ..$ : int 105560657
  .. .. ..$ : chr "2488519472"
  .. .. ..$ : int 353327386
  .. .. ..$ : int 37797873
  .. .. ..$ : int 171163027
  .. .. ..$ : int 601929914
  .. .. ..$ : int 917215478
  .. .. ..$ : int 477300945
  .. .. ..$ : int 860841
  .. .. ..$ : int 19243536
  .. .. ..$ : int 390553714
  .. .. ..$ : int 43392430
  .. .. ..$ : chr "2878181717"
  .. .. ..$ : int 611955240
  .. .. ..$ : int 860578316
  .. .. ..$ : int 883174674
  .. .. ..$ : int 281586001
  .. .. ..$ : int 61375281
  .. .. ..$ : int 979295208
  .. .. ..$ : int 236644373
  .. .. ..$ : int 317397017
  .. .. ..$ : int 406530891
  .. .. ..$ : int 601929914
  .. .. ..$ : int 143596360
  .. .. ..$ : int 158657057
  .. .. ..$ : int 19243536
  .. .. ..$ : int 1439808247
  .. .. ..$ : int 491999779
  .. .. ..$ : chr "2690799971"
  .. .. ..$ : int 15805390
  .. .. ..$ : int 181715941
  .. .. .. [list output truncated]
  .. ..$ Day         : Factor w/ 7 levels "Mon","Tue","Wed",..: 2 2 2 2 2 2 2 2 2 2 ...
  .. ..$ Month       : Factor w/ 12 levels "Jan","Feb","Mar",..: 3 3 3 3 3 3 3 3 3 3 ...
  .. ..$ DayNumber   : chr [1:46546] "17" "17" "17" "17" ...
  .. ..$ coordLong   : num [1:46546] -73.8 -73.9 -73.9 -74 -74.2 ...
  .. ..$ coordLat    : num [1:46546] 40.7 40.7 40.7 40.8 40.8 ...
  .. ..$ hashtagsText:List of 46546
  .. .. ..$ : logi NA
  .. .. ..$ :List of 1
  .. .. .. ..$ : chr "EARTHangelHOUR"
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ :List of 2
  .. .. .. ..$ : chr "Despaña"
  .. .. .. ..$ : chr "lunch"
  .. .. ..$ : logi NA
  .. .. ..$ :List of 2
  .. .. .. ..$ : chr "BiggieSmalls"
  .. .. .. ..$ : chr "goat"
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ :List of 7
  .. .. .. ..$ : chr "Laduree"
  .. .. .. ..$ : chr "lamodaya"
  .. .. .. ..$ : chr "dayagram"
  .. .. .. ..$ : chr "Rose"
  .. .. .. ..$ : chr "parisinny"
  .. .. .. ..$ : chr "pausa"
  .. .. .. ..$ : chr "break"
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ :List of 2
  .. .. .. ..$ : chr "Designing"
  .. .. .. ..$ : chr "Icons"
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ :List of 6
  .. .. .. ..$ : chr "penn_station"
  .. .. .. ..$ : chr "nyc"
  .. .. .. ..$ : chr "newyork"
  .. .. .. ..$ : chr "newyorkcity"
  .. .. .. ..$ : chr "florida"
  .. .. .. ..$ : chr "amtrak"
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ :List of 3
  .. .. .. ..$ : chr "myoffice"
  .. .. .. ..$ : chr "jazzlife"
  .. .. .. ..$ : chr "yep"
  .. .. ..$ :List of 5
  .. .. .. ..$ : chr "Newark"
  .. .. .. ..$ : chr "Transportation"
  .. .. .. ..$ : chr "Job"
  .. .. .. ..$ : chr "Jobs"
  .. .. .. ..$ : chr "TweetMyJobs"
  .. .. ..$ :List of 5
  .. .. .. ..$ : chr "TweetMyJobs"
  .. .. .. ..$ : chr "Marketing"
  .. .. .. ..$ : chr "Job"
  .. .. .. ..$ : chr "Newark"
  .. .. .. ..$ : chr "Jobs"
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ :List of 1
  .. .. .. ..$ : chr "Coast2Coast"
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ :List of 3
  .. .. .. ..$ : chr "TheRoyals"
  .. .. .. ..$ : chr "WilliamMoseley"
  .. .. .. ..$ : chr "ForNarnia"
  .. .. ..$ :List of 11
  .. .. .. ..$ : chr "spring"
  .. .. .. ..$ : chr "errands"
  .. .. .. ..$ : chr "peep"
  .. .. .. ..$ : chr "the"
  .. .. .. ..$ : chr "chauvinist"
  .. .. .. ..$ : chr "pigs"
  .. .. .. ..$ : chr "photobombing"
  .. .. .. ..$ : chr "my"
  .. .. .. ..$ : chr "selfie"
  .. .. .. ..$ : chr "so"
  .. .. .. ..$ : chr "meta"
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ :List of 1
  .. .. .. ..$ : chr "JaneTheVirgin"
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ :List of 1
  .. .. .. ..$ : chr "Coast2Coast"
  .. .. ..$ :List of 3
  .. .. .. ..$ : chr "InstallingMuscle"
  .. .. .. ..$ : chr "BestTrainer"
  .. .. .. ..$ : chr "BestPersonalTrainer"
  .. .. ..$ : logi NA
  .. .. ..$ :List of 1
  .. .. .. ..$ : chr "CSW59"
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ :List of 1
  .. .. .. ..$ : chr "SuperfruitLive"
  .. .. ..$ :List of 1
  .. .. .. ..$ : chr "CrapWeasel"
  .. .. ..$ :List of 3
  .. .. .. ..$ : chr "StPatricksDay"
  .. .. .. ..$ : chr "blackhistory"
  .. .. .. ..$ : chr "comedy"
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ :List of 8
  .. .. .. ..$ : chr "50DaysofVegan"
  .. .. .. ..$ : chr "Day23"
  .. .. .. ..$ : chr "breakfast"
  .. .. .. ..$ : chr "toasted"
  .. .. .. ..$ : chr "tortillas"
  .. .. .. ..$ : chr "refriedbeans"
  .. .. .. ..$ : chr "peppered"
  .. .. .. ..$ : chr "rice"
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ :List of 1
  .. .. .. ..$ : chr "1GottaGo"
  .. .. ..$ :List of 1
  .. .. .. ..$ : chr "StPatricksDay"
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ :List of 4
  .. .. .. ..$ : chr "ALEXMIKA"
  .. .. .. ..$ : chr "Choker"
  .. .. .. ..$ : chr "Hamsa"
  .. .. .. ..$ : chr "Necklace"
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ :List of 12
  .. .. .. ..$ : chr "ASLAM"
  .. .. .. ..$ : chr "LOCATIONS"
  .. .. .. ..$ : chr "SCOUTS"
  .. .. .. ..$ : chr "MOVIES"
  .. .. .. ..$ : chr "COMMERCIALS"
  .. .. .. ..$ : chr "FILMS"
  .. .. .. ..$ : chr "Luxury"
  .. .. .. ..$ : chr "Furnished"
  .. .. .. ..$ : chr "Condos"
  .. .. .. ..$ : chr "FDALLEN"
  .. .. .. ..$ : chr "NYC"
  .. .. .. ..$ : chr "fdallengroup"
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. ..$ : logi NA
  .. .. .. [list output truncated]
  .. ..$ date        : Date[1:46546], format: "3907903-11-03" "3907903-11-03" "3907903-11-04" "3907903-11-04" ...
  .. ..$ Hours       : chr [1:46546] "18" "18" "18" "18" ...
  .. ..$ Mins        : chr [1:46546] "23" "23" "24" "24" ...
  .. ..$ Secs        : chr [1:46546] "23" "23" "24" "24" ...
  ..- attr(*, "class")= chr [1:2] "kvPair" "list"

Jakub Paulina

May 22, 2016, 1:26:45 PM
to Tessera-Users, jakub.pa...@gmail.com
Could factors be the problem?

Jakub Paulina

May 22, 2016, 2:01:43 PM
to Tessera-Users, jakub.pa...@gmail.com
The list of IDs was wrong; I already fixed it, and I removed the Date format too, so it should work better now. But it's the same :/

jeremiah rounds

May 22, 2016, 3:52:24 PM
to Jakub Paulina, Tessera-Users
Is hashtagsText fixed?  data.frames let you put some strange stuff in the columns (usually unintended by the user, in my experience, but allowed by design), and I wouldn't divide unless it looks like this all the way down, because I am not sure divide has been tested with lists inside of data.frame columns:
  .. ..$ date        : Date[1:46546], format: "3907903-11-03" "3907903-11-03" "3907903-11-04" "3907903-11-04" ...
  .. ..$ Hours       : chr [1:46546] "18" "18" "18" "18" ...
  .. ..$ Mins        : chr [1:46546] "23" "23" "24" "24" ...
  .. ..$ Secs        : chr [1:46546] "23" "23" "24" "24" ...


Also, you have one more lever to use, which is rhipe_reduce_buff_size (the number of key/value pairs emitted from the map that are loaded into a reduce buffer at a time).  datadr::divide will accumulate chunks in the reduce.  I am thinking the "en" level of lang is emitting very large key/value pairs out of the map, so I would make that buffer small.  In fact, one of my default behaviors when using these systems is to put buffer sizes down to very low numbers and ramp them up once things work.
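A sketch of that default, with both buffers at their smallest (values illustrative; ramp them up once the job succeeds):

  rhctlTiny <- rhipeControl(mapred = list(
    rhipe_map_buff_size = 1,      # key/value pairs buffered into each map call
    rhipe_reduce_buff_size = 1    # key/value pairs buffered into each reduce call
  ), jobname = "tinyBuffers")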



Jakub Paulina

May 23, 2016, 12:00:42 PM
to Tessera-Users, jakub.pa...@gmail.com
The lists within lists were definitely the problem. But that doesn't change how divide() works: it can cast a data.frame containing those lists to a ddf really fast locally, without any problem. The main trouble begins with divide and HDFS, where those lists start a memory leak; when I removed them, it worked. Of course, I found many bugs in my implementation that I am not proud of :/ . I wanted to build a word cloud per day in Trelliscope, so now I need to find another way. Maybe if the lists were of fixed length, as you said, that would fix my problem. I will keep posting my progress. The sad thing is that I am starting to run out of time for more tuning, but I want to continue with Tessera in my final thesis. Maybe I will be able to contribute some fixes, or at least help with the documentation; the future will show me my path! :D

jeremiah rounds

May 23, 2016, 3:07:51 PM
to Jakub Paulina, Tessera-Users
divide is just a helper function for constructing distributed data frame (ddf) objects, and even though it says "data.frame" in the name, there are some uses of data.frames, allowed by R, that make more sense as datadr's "distributed data objects" (ddo).  Lists of lists, for example, are a classic case of what Ryan had in mind when designing distributed data objects.

As a matter of fact, most of my personal uses of Trelliscope are with ddos.

What that means for you is that if you want to break out of the constraints of divide, you can use mrExec, define a map/reduce function, and get the liberty of not being burdened by divide or transform assumptions.  For example, you can just write a map to divide up that data object pretty fast.
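A minimal sketch of that idea in datadr's map/reduce expression style (untested; the output path is illustrative):

  langMap <- expression({
    for (v in map.values) {
      pieces <- split(v, v$lang, drop = TRUE)
      for (k in names(pieces)) collect(k, pieces[[k]])  # one pair per language
    }
  })
  langReduce <- expression(
    pre    = { chunk <- NULL },
    reduce = { chunk <- rbind(chunk, do.call(rbind, reduce.values)) },
    post   = { collect(reduce.key, chunk) }
  )
  byLang <- mrExec(ddfDat, map = langMap, reduce = langReduce,
    output = hdfsConn("/paulina/byLangMR", autoYes = TRUE), control = rhctl)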




hafen

Jun 3, 2016, 7:08:16 PM
to Tessera-Users, jakub.pa...@gmail.com
Hi Jakub,

I just wanted to check in to see if you were able to resolve your issue.  Sorry I haven't been able to respond more quickly - I've been traveling.  The divide method was indeed designed to be used with MapReduce and is actually just a simple wrapper around a MapReduce specification.  As Jeremiah mentioned, if you want to bypass divide, you can write your own MapReduce code with mrExec.  However, I would speculate that you'll still experience issues.  Debugging distributed code can be difficult.  It looks like your data structure could be part of the problem, as well as the heavy use of iteration instead of vectorized computations.  However, my guess is that the problem goes beyond this - it could have something to do with how your data is being written to HDFS and how big each input chunk is.  If you have more time to experiment, I would iterate through chunks of rows of the data frame and add each chunk with the addData() command (look for the section on "adding data" here: http://tessera.io/docs-datadr/#large_hdfs_rhipe).  When properly configured, we have been using datadr smoothly on very large data sets.
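A sketch of that incremental pattern, assuming the transformation can run independently per chunk (the path and chunk size are illustrative):

  conn <- hdfsConn("/paulina/chunkedNY", autoYes = TRUE)
  chunkSize <- 10000
  for (s in seq(1, nrow(twitterData), by = chunkSize)) {
    rows <- s:min(s + chunkSize - 1, nrow(twitterData))
    addData(conn, list(list(as.character(s), twitterData[rows, ])))  # one k/v pair per chunk
  }
  ddfNY <- ddf(conn, update = TRUE, control = rhctl)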

Ryan



