ecdf different colour points and fit line ggplot


Maria Lathouri

Jan 4, 2019, 8:04:18 AM
to Ggplot2
Dear all, 

First of all, Happy New year to all of you. May 2019 be a prosperous year.

I am writing to ask for your help with the following. I am trying to plot two different ECDFs on the same plot in ggplot, based on val and grouped by var1. I would like to fit a line through the points of each ECDF (the "val" values) and also to colour the points by var2 (the taxonomic groups). 

head(dat)
  ID    val var1  var2
1  1 1.0556 log1 algae
2  2 1.0556 log1 algae
3  3 1.3893 log1 algae
4  4 1.3893 log1 algae
5  5 1.4264 log1 algae
6  6 1.4264 log1 algae

71   0.9367 log2 algae
72   0.9367 log2 algae
73   1.2816 log2 algae
74   1.2816 log2 algae
75   1.3200 log2 algae
76   1.3200 log2 algae
77   1.5791 log2 higher plant
78   1.5791 log2 higher plant


str(dat)
'data.frame':    140 obs. of  4 variables:
$ ID  : int  1 2 3 4 5 6 7 8 9 10 ...
$ val : num  1.06 1.06 1.39 1.39 1.43 ...
$ var1: Factor w/ 2 levels "log1","log2": 1 1 1 1 1 1 1 1 1 1 ...
$ var2: Factor w/ 9 levels "algae","bivalve",..: 1 1 1 1 1 1 6 6 8 8 …

library(ggplot2)

#create a new dataframe
df<-data.frame(x=dat$val, g1=factor(dat$var1), g2=factor(dat$var2))

#plot ecdf
ecd<-ggplot(df, aes(x, colour=g1))+stat_ecdf(geom="point")

I have been able to plot the two different ECDFs, but when I use colour=g2 instead, the points end up scattered all over the plot. I also tried to add a geom_line(), but without any luck. 

I have attached my data in text file (I hope it will go through!)

Thank you very much in advance.

Kind regards,
Maria
dat.txt

Brandon Hurr

Jan 4, 2019, 11:37:22 AM
to Maria Lathouri, Ggplot2
Maria, 

You could plot the interaction of the two variables. 

library(tidyverse)

df1 <- read_tsv("~/Downloads/dat.txt")

ggplot(df1, aes(val)) +
  stat_ecdf(geom="point", aes(colour = interaction(var1, var2)))


Screen Shot 2019-01-04 at 8.28.30 AM.png

The trouble here is that there are too many combinations of factor levels to tell apart by colour.

An alternative would be to facet by one variable or another. Here I'm showing var1:

ggplot(df1, aes(val)) +
  stat_ecdf(geom="point", aes(colour = var2)) +
  facet_grid(. ~ var1)

Screen Shot 2019-01-04 at 8.31.24 AM.png
This lets you compare the different var2 groups within each level of var1, but comparing the same var2 group across the two var1 levels will be tough. 
Faceting the other way is better for comparisons within each species type, but poor for comparing whole curves by var1:

ggplot(df1, aes(val)) +
  stat_ecdf(geom="point", aes(colour = var1)) +
  facet_grid(. ~ var2)


Screen Shot 2019-01-04 at 8.34.44 AM.png

You can facet by both variables, but that is essentially one plot per interaction. Code below without picture. 

ggplot(df1, aes(val)) +
  stat_ecdf(geom="point") +
  facet_grid(var1 ~ var2)



HTH, 
B




Maria Lathouri

Jan 4, 2019, 11:59:45 AM
to Brandon Hurr, Ggplot2
Dear Brandon,

Thank you very much for your help. Unfortunately, I really need a single cumulative distribution plot like the one below, with the two ECDFs defined by var1. 

ecd<-ggplot(dat, aes(val, colour=var1))+stat_ecdf(geom="point")

[Embedded image]
I can understand that it might be difficult to colour each point differently based on var2. But I was wondering if I can add two fitted lines, one for each ecdf. I tried but I couldn't find any solution. 

Thank you very much.

Kind regards,
Maria



Brandon Hurr

Jan 4, 2019, 7:17:07 PM
to Maria Lathouri, Ggplot2
Maria, 

Sorry, I'm not much help here; I've rarely used ECDFs. I was thinking that you could extract the computed curve data from the plot, align it with your data and plot it with your colours directly, but your data has 140 rows and the ECDF output from ggplot has only 80. I'm not sure what the ECDF computation does or how to align the two. 

plot1 <- 
ggplot(df1, aes(val)) +
stat_ecdf(geom="point", aes(colour = var1))

ggplot_build(plot1)

I hope someone else can give you a better answer. 

B

Brian Shine

Jan 5, 2019, 5:06:34 AM
to Brandon Hurr, Brian Shine, Maria Lathouri, Ggplot2
Is this any good to you?  

The reason there are fewer points on the graph than data values is that many of the data values are repeated.

ggplot(dd1, aes(x = val)) +
    stat_ecdf(geom = "point", aes(shape = var1, colour = var2))
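
To illustrate the point about repeated values: an ECDF only steps at distinct values, so duplicated observations collapse onto a single position, which is consistent with the shorter output from ggplot. A minimal base R sketch, using a few made-up numbers rather than the full dat.txt:

```r
# Toy data (illustrative only): each value appears twice
val <- c(1.0556, 1.0556, 1.3893, 1.3893, 1.4264, 1.4264)

Fn    <- ecdf(val)   # empirical CDF as a step function
steps <- knots(Fn)   # distinct x positions where the ECDF jumps

length(val)          # number of observations
length(steps)        # number of distinct values
Fn(steps)            # cumulative proportion reached at each distinct value
```

Comparing nrow() of the data with the number of distinct val values per var1 group would show whether ties alone account for the difference.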

Best wishes,
Brian
PastedGraphic-2.pdf

Luke Tudge

Jan 5, 2019, 8:01:58 AM
to Maria Lathouri, Ggplot2
Hi Maria.

To get both the points and the lines for each ecdf by var1, you could just add the ecdf geom twice, once as a line and once as a point, like this:

library(ggplot2)
d = read.table('dat.txt', sep='\t', header=TRUE)
fig = ggplot(d, aes(x=val, fill=var1)) +
  stat_ecdf(aes(color=var1)) +
  stat_ecdf(geom='point', shape='circle filled')
print(fig)

To get the points also colored by var2, the best I could come up with is to compute the ECDF values (the cumulative proportions, stored below in a column called density) separately first and then use them as a new variable for geom_point(). In this case it is probably better to avoid also using color for var1, as this could lead to some confusion about what is represented. You could use the linetype instead. Like this:

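d$density = NA
# unique() works whether var1 was read as a factor or as character
# (read.table() no longer creates factors by default since R 4.0)
for(i in unique(as.character(d$var1))){
  valSubset = d$val[d$var1==i]
  cdfFun = ecdf(valSubset)
  d$density[d$var1==i] = cdfFun(valSubset)
}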
fig = ggplot(d, aes(x=val, y=density)) +
  stat_ecdf(aes(lty=var1)) +
  geom_point(aes(fill=var2), shape='circle filled')
print(fig)

There is probably a snazzier way to get the separate density values by var1, mapping or applying function or something. But I'm a bit of an R dinosaur, all loops and ifs.
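
One loop-free possibility, offered as a sketch rather than the definitive way: base R's ave() applies a function within each level of a grouping variable and returns the results in the original row order. The toy data frame below is invented for illustration; with the real data, d would be read from dat.txt as above.

```r
# Toy stand-in for d (made-up values, not the contents of dat.txt)
d <- data.frame(
  val  = c(1.1, 1.4, 1.4, 1.7, 0.9, 1.3, 1.3, 1.6),
  var1 = rep(c("log1", "log2"), each = 4)
)

# Within each var1 group, evaluate that group's ECDF at its own values
d$density <- ave(d$val, d$var1, FUN = function(v) ecdf(v)(v))

d
```

Tied values get the same height here, exactly as in the loop version, so the points should sit on the steps drawn by stat_ecdf().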

Luke

Maria Lathouri

Jan 5, 2019, 8:57:42 AM
to Brian Shine, Ggplot2


Dear Brian, 

Thank you very much. Following your point that there are repeated values and that the data are too long, I split the val, var1 and var2 columns into four columns: log1 with Taxa1 and log2 with Taxa2. 

dd1<-read.csv("dd1.csv")

head(dd1)
  ID   log1   log2        Taxa1        Taxa2
1  1 1.0556 0.9367        algae        algae
2  3 1.3893 1.2816        algae        algae
3  5 1.4264 1.3200        algae        algae
4  7 1.6748 1.5791 higher plant higher plant
5  9 1.7069 1.6514      rotifer      rotifer
6 11 1.7958 1.7412        snail        snail

str(dd1)
'data.frame':    43 obs. of  5 variables:
$ ID   : int  1 3 5 7 9 11 13 15 17 19 ...
$ log1 : num  1.06 1.39 1.43 1.67 1.71 ...
$ log2 : num  0.937 1.282 1.32 1.579 1.651 ...
$ Taxa1: Factor w/ 8 levels "algae","bivalve",..: 1 1 1 5 7 8 6 8 2 3 ...
$ Taxa2: Factor w/ 8 levels "algae","bivalve",..: 1 1 1 5 7 8 6 8 2 3 ...


fig1<-ggplot(data=data.frame(value=dd1$log1, ecdf=ecdf(dd1$log1)(dd1$log1), spec=dd1$Taxa1)) +
  geom_point(aes(x=value, y=ecdf, col=spec))

fig2<-ggplot(data=data.frame(value=dd1$log2, ecdf=ecdf(dd1$log2)(dd1$log2), spec=dd1$Taxa2)) +
  geom_point(aes(x=value, y=ecdf, col=spec))

I managed to create two separate ggplots with different colours based on Taxa columns. 



But now I cannot manage to combine these two plots into one. Any ideas?

Many thanks.

Kind regards,
Maria
dd1.txt
1546696621554blob.jpg

marios...@gmail.com

Jan 5, 2019, 9:15:21 AM
to Maria Lathouri, Brian Shine, ggp...@googlegroups.com

A more robust solution: use order() rather than sort() when computing the ECDF, so that every column of the data frame is reordered together with val. Apologies for the mistake 😝

 

 

# Session ! Simple Analysis Script

library(tidyverse)
library(scales)
library(stringr)
library(extrafont)

# Read the source file
source_file_path<- paste('….path…. /dat.txt', sep = "")

source_data <- read.csv(file = source_file_path, sep = "\t")
source_data <- source_data %>% mutate_if(is.character, trimws )
source_data$unique_group <- interaction(source_data$var1, source_data$var2, sep = "_")

# Keep only var1 groups with more than 10 observations
source_data <- source_data %>%
  group_by(var1) %>%
  mutate(count = n())  %>%
  filter(count > 10 )

# For each var1 group: sort whole rows by val, then ecdf = rank / group size
data_and_ecdf<- lapply(split(source_data,source_data$var1, drop = TRUE), function(x){
  x<- x[order(x$val, decreasing = FALSE), ]
  #xx <-  sort(x$val) # x$folding
  x$ecdf <- 1:length(x$val) / length(x$val)
  #x$val <- xx
  return(x)
} )

data_and_ecdf <- bind_rows(data_and_ecdf)

# One line per var1 group; points coloured by var2 and shaped by var1
ecdf_overlap <- ggplot(data = data_and_ecdf, mapping = aes(x= val, y=ecdf, group= var1,  colour= var2, shape=var1 ))+
  geom_line(size=1)+
  scale_shape_manual(values = c(5,15))+
  geom_point(size=2.5)+
  scale_color_viridis_d()+
  theme_bw()

# windows() opens a graphics device on Windows only
{windows(height = 15, width = 15)
  ecdf_overlap
  }

 

From: marios...@gmail.com <marios...@gmail.com>
Sent: Saturday, January 5, 2019 3:07 PM
To: 'Maria Lathouri' <mlat...@yahoo.gr>; 'Brian Shine' <brians...@gmail.com>
Subject: RE: ecdf different colour points and fit line ggplot

 

Good morning Maria, all,

From your email, what I understand is that you want to calculate the ECDF of your populations for each var1 group.

Then you want to see where the var2 groups fall along each ECDF. If there is more to it than that, you need to be careful, because you have too few observations per var2 level to compute a meaningful statistic.

Under these assumptions, just calculate the ECDF manually: by definition it is the cumulative proportion of the ordered observations. Here is code that solves your problem as I understood it.

There is an even neater way to calculate the function using dplyr's group_by and do commands; I'll let you work that one out.
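
For readers who want the dplyr route mentioned here, one possible sketch follows. It uses group_by() with mutate() rather than do(), which is a substitution and not necessarily what was intended, and the toy data frame (made-up values) stands in for dat.txt.

```r
library(dplyr)

# Toy stand-in for the real data (made-up values)
source_data <- data.frame(
  val  = c(1.7, 1.1, 1.4, 1.6, 0.9, 1.3),
  var1 = rep(c("log1", "log2"), each = 3),
  var2 = c("snail", "algae", "algae", "rotifer", "algae", "algae")
)

data_and_ecdf <- source_data %>%
  arrange(val) %>%                        # sort whole rows so var2 stays aligned with val
  group_by(var1) %>%
  mutate(ecdf = row_number() / n()) %>%   # rank within group divided by group size
  ungroup()

data_and_ecdf
```

Note that rank divided by group size gives each tied observation its own height, whereas ecdf(val)(val) would give tied values the same height; which behaviour is wanted depends on how the points should sit on the curve.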

 

A picture of how it looks is below.

Best,

Marios

P.S. Don't try to use the high-level functions before you understand the math behind them; it just makes solving the problem that much more complicated.

 

#-------------------------------------------------------------------------------------------------------------------------------------
# Remember to paste the right path in the source_file_path line below
# -------------------------------------------------------------------------------------------------------------------------------------------------

library(tidyverse)
library(scales)
library(stringr)
library(extrafont)

# Read the source file
source_file_path<- paste('put your path here /dat.txt', sep = "")

source_data <- read.csv(file = source_file_path, sep = "\t")
source_data <- source_data %>% mutate_if(is.character, trimws )
source_data$unique_group <- interaction(source_data$var1, source_data$var2, sep = "_")

source_data <- source_data %>%
  group_by(var1) %>%
  mutate(count = n())  %>%
  filter(count > 10 )

data_and_ecdf<- lapply(split(source_data,source_data$var1, drop = TRUE), function(x){
  # Earlier version: sort() reorders val only, so the other columns (var2 etc.)
  # no longer line up with val; the order()-based script above corrects this
  xx <-  sort(x$val)
  x$ecdf <- 1:length(xx) / length(xx)
  x$val <- xx
  return(x)
} )

data_and_ecdf <- bind_rows(data_and_ecdf)

ecdf_overlap <- ggplot(data = data_and_ecdf, mapping = aes(x= val, y=ecdf, group= var1,  colour= var2, shape=var1 ))+
  geom_line(size=1)+
  scale_shape_manual(values = c(5,15))+
  geom_point(size=2.5)+
  scale_color_viridis_d()+
  theme_bw()

{windows(height = 15, width = 15)
  ecdf_overlap
  }

image001.png
image002.jpg

Maria Lathouri

Jan 6, 2019, 5:56:22 AM
to marios...@gmail.com, ggp...@googlegroups.com
Hi Marios,

Thank you so much for this. This is exactly what I want to show. The group_by in dplyr perhaps could also do the trick so I will definitely go through this. 

Best,
Maria