Error with unite function


Helen

Mar 24, 2014, 11:16:10 PM
to methylkit_...@googlegroups.com
Hello,

I am getting an error most of the time with the unite function. The default unite works fine, but if I specify min.per.group I usually get the error below.
I have installed the development version from github (as was suggested in another post) but I still have the same problem.

An example of my input and the error message are below.

Any help would be appreciated.

Regards
Helen

> file.listFull = list("69827_EA2_liver_noXY.call","69828_EA2_liver_noXY.call","72364_EA2_liver_noXY.call","69829_EA2_liver_noXY.call","69830_EA2_liver_noXY.call","72365_EA2_liver_noXY.call" ,"75535_ED2_liver_noXY.call","75536_ED2_liver_noXY.call","75541_ED2_liver_noXY.call","75538_ED2_liver_noXY.call","75539_ED2_liver_noXY.call","75542_ED2_liver_noXY.call")

> myobjFull = read(file.listFull, sample.id = list ("69827EA2","69828EA2","72364EA2","69829EA2","69830EA2","72365EA2","75535ED2","75536ED2","75541ED2","75538ED2","75539ED2","75542ED2"), assembly = "mm10", treatment = c(0,0,0,1,1,1,0,0,0,1,1,1), context = "CpG")

> fil.myobFull=filterByCoverage(myobjFull,lo.count=20,lo.perc=NULL, hi.count=NULL,hi.perc=99.9)

> full.min3=unite(fil.myobFull,min.per.group=3L, destrand=TRUE)
Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x),  :
  Join results in 2067386 rows; more than 2067384 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.


Altuna Akalin

Mar 25, 2014, 5:52:24 AM
to methylkit_...@googlegroups.com
Please make sure there are no duplicate coordinates in your files. If the problem persists, please send a subset of the dataset that reproduces the problem. I cannot reproduce this with the example dataset in the package.
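For anyone wanting to check this directly: a duplicate-coordinate check on one input file might look like the sketch below (it assumes the .call files are tab-delimited methylKit output with a header row and chr, base and strand columns; adjust the column names to match your files).

```r
# Sketch: check one methylKit .call file for duplicate coordinates.
# Assumes the default methylKit text layout with a header row and
# chr / base / strand columns -- adjust names to match your files.
calls <- read.table("69827_EA2_liver_noXY.call",
                    header = TRUE, stringsAsFactors = FALSE)

# Rows that repeat an earlier chr + base + strand combination
dup <- duplicated(calls[, c("chr", "base", "strand")])
sum(dup)      # 0 means no duplicate coordinates in this file
calls[dup, ]  # inspect the offending positions, if any
```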

Helen McCormick

Mar 25, 2014, 6:23:15 PM
to methylkit_...@googlegroups.com
Hi Altuna,

I would say there are duplicate coordinates in my files where there is coverage at the same position for both the + and - strand. Is this likely to cause the error? I'd have thought this would be a common occurrence in this type of data.

In the meantime I will send you a subset of the data.

Thanks for your quick reply.
Helen



roberts...@gmail.com

Mar 27, 2014, 10:13:26 PM
to methylkit_...@googlegroups.com
Hi Helen, I had this same error, and had no duplicate values. I downloaded the development version of methylKit and tried reducing the data given to it, among various other things, but the only way I got it to work was changing the server I was running it on. I never did figure out why the error was occurring, so if you do find out, I would be keen to know.

Cheers

Hester



Altuna Akalin

Mar 28, 2014, 4:02:28 AM
to methylkit_...@googlegroups.com
I wonder if this is related to the data.table version. Could you check the data.table versions on both servers? Maybe the server on which unite() didn't work had an older version.

Best
Altuna
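For reference, the data.table version on each server can be checked from R with standard utils functions (nothing methylKit-specific):

```r
# Print the installed data.table version on this machine
packageVersion("data.table")

# Or capture the full session details (R version, all loaded package versions)
sessionInfo()

# If the versions differ, updating to the current CRAN release is one option:
# install.packages("data.table")
```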
jonathan corbi

Feb 18, 2015, 10:46:24 AM
to methylkit_...@googlegroups.com
Hi all,
I am having the same problem and have tried the different solutions proposed here...

I have two groups (let's say 10 and 12 samples, respectively), split the data into 3 different files (one per methylation context) and tried this command:

methUnite=unite(myobj, destrand=FALSE, min.per.group=5L)

I have no issues with CpG or CHG, but I keep getting this error message with CHH:

Error in vecseq(f__, len__, if (allow.cartesian || notjoin) NULL else as.integer(max(nrow(x),  : 
  Join results in 10626660 rows; more than 10170266 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

I would be happy to check for duplicate positions, but I don't know how to do it on a methylRawList object (on myobj)...
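For anyone stuck at the same point: one way to count duplicate positions per sample on a methylRawList is sketched below (it assumes getData() exposes each sample's underlying data.frame with chr, start and strand columns, as in current methylKit versions).

```r
# Sketch: per-sample duplicate-position counts on a methylRawList.
# getData() is methylKit's accessor for the raw data.frame of each sample.
sapply(myobj, function(x) {
  d <- getData(x)
  sum(duplicated(d[, c("chr", "start", "strand")]))
})
# All zeros would mean no sample contains duplicate coordinates.
```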

I also tried different min.per.group values (from 1L to 15L) and keep having the same issue...

I downloaded the dev version; still the same problem...

I would really like to keep the maximum number of positions... eliminating a position because the data is not available in ALL samples is very conservative...

Any help would be appreciated! 

Thanks! 

Jonathan 

Altuna Akalin

Feb 18, 2015, 10:49:13 AM
to methylkit_...@googlegroups.com
Could you check if there are duplicate positions in the CHH context in one of your samples? If there are, we need to find out how they are generated.

Best,
Altuna

jonathan corbi

Feb 18, 2015, 11:27:24 AM
to methylkit_...@googlegroups.com
I talked to the person in charge of the cluster I am working on, and apparently there was a problem with data.table...
I tried it now and it works. I hope the problem is solved for the other tests!
Thanks so much for your fast reply!
Jo

dilorenzo...@gmail.com

Mar 17, 2015, 11:14:07 AM
to methylkit_...@googlegroups.com
Hi, I am having the same problem on two of my datasets.

I updated data.table to 1.9.4 using install.packages("data.table") and still have the problem. How exactly did you solve this? Which versions?

The default data.table was also 1.9.4, but it printed this message when loading:

"*** NB: by=.EACHI is now explicit. See README to restore previous behaviour."

It is 27 samples, divided into two groups and subsetted into regions using regionCounts(). Promoter and TSS regions work; exons and introns don't.

Any help much appreciated.

/S

Altuna Akalin

Mar 17, 2015, 11:20:31 AM
to methylkit_...@googlegroups.com
I guess the problem is that transcript isoforms can have the same exon and intron coordinates, and that creates duplicate entries in the tables. By default, TSSes and promoters are unique (AFAIR). You need to make sure you don't have duplicate coordinates in the exons and introns tables before you feed them to regionCounts().

Best,
Altuna
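Following that suggestion, deduplicating the exon ranges before counting might look like this (a sketch; it assumes gene.obj comes from readTranscriptFeatures(), so gene.obj$exons is a GRanges object on which unique() drops identical ranges):

```r
# Sketch: remove duplicate exon coordinates before region counting.
# unique() on a GRanges keeps one copy of each identical range.
exons.uniq <- unique(gene.obj$exons)
length(gene.obj$exons) - length(exons.uniq)  # number of ranges dropped

# Then count methylation over the deduplicated regions
exon.counts <- regionCounts(myobj, exons.uniq)
```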


dilorenzo...@gmail.com

Mar 17, 2015, 11:26:58 AM
to methylkit_...@googlegroups.com

You may very well be right!

I did this test:

> length(gene.obj$exons)
[1] 470552
> length(unique(gene.obj$exons))
[1] 241297
> length(gene.obj$promoters)
[1] 31745
> length(unique(gene.obj$promoters))
[1] 31745


I'll just add a unique() before regionCounts() and see if it works better.

Thanks for the superhumanspeed reply.

/S



Altuna Akalin

Mar 17, 2015, 11:31:21 AM
to methylkit_...@googlegroups.com
Sometimes you get a superhumanspeed reply, sometimes it takes a couple of days. Depends on variables I can't control : )
