invalid insert distance

Sinclair Cooper

Feb 18, 2013, 7:26:39 AM
to hku-...@googlegroups.com
Hi, 

I have been trying to use IDBA_UD to run some tests on simulated data that I generated from previously sequenced data.

I keep getting the error 'Invalid insert distance'. I've tried changing the insert distances for the reads that I'm generating as well as the standard deviation for the insert distances. What assumptions does IDBA make about insert distance?  


This is the kind of output I'm getting:

number of threads 8
reads 22874
long reads 0
extra reads 0
read_length 100
kmer 20
kmers 9258 9290
merge bubble 0
contigs: 54 n50: 437 max: 941 mean: 182 total length: 9867 n80: 237
aligned 13734 reads
confirmed bases: 7325 correct reads: 10354 bases: 0
distance mean 254.027 sd 196.944
seed contigs 20 local contigs 108
kmer 40
kmers 9680 9698
merge bubble 0
contigs: 33 n50: 937 max: 1040 mean: 327 total length: 10795 n80: 815
aligned 18451 reads
confirmed bases: 8787 correct reads: 16856 bases: 0
distance mean 212.975 sd 288.682
seed contigs 11 local contigs 66
kmer 60
kmers 9930 9937
merge bubble 0
contigs: 20 n50: 1038 max: 1079 mean: 554 total length: 11093 n80: 1004
aligned 20999 reads
confirmed bases: 9719 correct reads: 19184 bases: 0
distance mean 153.616 sd 364.841
invalid insert distance
kmer 80
kmers 10003 10005
merge bubble 0
contigs: 11 n50: 1082 max: 1102 mean: 984 total length: 10834 n80: 1047
aligned 21586 reads
confirmed bases: 10025 correct reads: 19425 bases: 0
distance mean 113.821 sd 402.032
invalid insert distance
kmer 100
kmers 9853 9734
merge bubble 0
contigs: 10 n50: 1082 max: 1102 mean: 1072 total length: 10724 n80: 1060
reads 22874
aligned 21586 reads
distance mean 113.821 sd 402.032
invalid insert distance
Segmentation fault (core dumped)




Any help would be greatly appreciated.

Incidentally, when I run IDBA on a real data set (Illumina PE reads) it seems to work OK, but I'm trying to QC the assembly with a simulated data set so that I can quantify copy number etc. more accurately.
 
Thanks 
Sinclair



Yu PENG

Feb 19, 2013, 2:00:32 PM
to hku-...@googlegroups.com
Hi,

We assume the paired-end library is in forward-reverse orientation (->, <-). Please make sure your library format is correct.

Thanks,
Yu Peng
 


Sinclair Cooper

Feb 21, 2013, 11:36:30 AM
to hku-...@googlegroups.com
Hi, 

I'm quite sure my reads are in the correct orientation. I tried reverse complementing them just to test, but I still get similar error messages.

Does IDBA have a lower limit on the size of data set it can use? (I am using a very small dataset to test it). 


Thanks 
Sinclair 

Yu PENG

Feb 21, 2013, 12:35:59 PM
to hku-...@googlegroups.com
Hi,

There is no such limitation. Did you try Velvet and SOAP? If they work, I think IDBA should also work.

Thanks,
Yu Peng

Sinclair Cooper

Feb 21, 2013, 12:37:32 PM
to hku-...@googlegroups.com
Hi, 

I tried Velvet and got good results.

rohita sinha

Feb 21, 2013, 12:40:02 PM
to hku-...@googlegroups.com
Why don't you share a fraction of your reads, or one sample?

We can try it, and it will be a learning experience for all of us. I used IDBA_UD and got good results.

Rohita
Rohita Sinha, Ph.D.
Core for Applied Genomics and Ecology (CAGE)
University of Nebraska-Lincoln
Dept. Food Science & Tech. 330 FIC
Lincoln, NE 68583-0919

Sinclair Cooper

Feb 21, 2013, 12:54:23 PM
to hku-...@googlegroups.com
That's a great idea. I have a small data set that would be good for that. 

I appreciate your input. IDBA has also worked well for me with real Illumina output, but it does not seem to work on my simulated data...

I've attached some reads.

Thank you!
sim_reads.zip

Sinclair Cooper

Feb 21, 2013, 1:07:32 PM
to hku-...@googlegroups.com
Sorry for sending separate fq files, but I suspect that the way I'm pre-processing them may be the problem.

Kevin Chen

Feb 22, 2013, 12:09:28 AM
to hku-...@googlegroups.com
Actually, I have the same problem. I always turn on pre_correction with idba_ud. For one of the many samples that I assembled with idba_ud, it prints "invalid insert distance" at k = 40, 60, 80 and 100 and then dumps core, while the same sample assembles fine with other assemblers such as Velvet, Ray and ABySS.
idba_ud -r read.fa -o output --num_threads 16 --pre_correction

number of threads 16
reads 2572190
long reads 0
extra reads 0
read_length 101
kmer 60
kmers 9439265 9392915
merge bubble 487
contigs: 13600 n50: 495 max: 129925 mean: 293 total length: 3988167 n80: 191
aligned 1547939 reads
confirmed bases: 2773649 correct reads: 1229186 bases: 167168
kmer 20
kmers 8391898 8704445
merge bubble 1542
contigs: 129740 n50: 294 max: 12132 mean: 56 total length: 7278867 n80: 22
aligned 1164980 reads
confirmed bases: 2664809 correct reads: 1014972 bases: 1294
distance mean 238.293 sd 60.7358
seed contigs 5599 local contigs 259480
kmer 40
kmers 12257698 12403264
merge bubble 1579
contigs: 42557 n50: 483 max: 56663 mean: 172 total length: 7350052 n80: 79
aligned 1404858 reads
confirmed bases: 3627855 correct reads: 1182843 bases: 3945
distance mean 238.007 sd 511.259
invalid insert distance
kmer 60
kmers 11490515 11522912
merge bubble 478
contigs: 20127 n50: 683 max: 130857 mean: 337 total length: 6789597 n80: 271
aligned 1690005 reads
confirmed bases: 4021375 correct reads: 1259011 bases: 29693
distance mean 220.6 sd 5615.2
invalid insert distance
kmer 80
kmers 8123863 8083455
merge bubble 109
contigs: 8858 n50: 984 max: 130897 mean: 657 total length: 5821418 n80: 389
aligned 1685593 reads
confirmed bases: 4037684 correct reads: 1265135 bases: 20058
distance mean 208.114 sd 8162.33
invalid insert distance
kmer 100
kmers 5297804 5242226
merge bubble 45
contigs: 5840 n50: 1191 max: 130897 mean: 912 total length: 5327754 n80: 461
reads 2572190
aligned 1741678 reads
distance mean 208.566 sd 7872.14
invalid insert distance

Kevin Chen

Feb 22, 2013, 10:18:50 AM
to hku-...@googlegroups.com
Oh, it probably has nothing to do with pre_correction.

I reran the sample without --pre_correction and still get a segmentation fault.

24257 Segmentation fault      (core dumped) idba_ud -r read.fa -o output2 --num_threads 16

number of threads 16
reads 2572190
long reads 0
extra reads 0
read_length 101
kmer 20
kmers 8487703 8806327
merge bubble 2642
contigs: 131829 n50: 289 max: 12132 mean: 55 total length: 7327282 n80: 22
aligned 1158662 reads
confirmed bases: 2661707 correct reads: 1008503 bases: 114609
distance mean 238.338 sd 60.0754
seed contigs 5617 local contigs 263658
kmer 40
kmers 12387008 12535507
merge bubble 1788
contigs: 43050 n50: 480 max: 56663 mean: 171 total length: 7376898 n80: 79
aligned 1406097 reads
confirmed bases: 3625088 correct reads: 1182231 bases: 29921
distance mean 237.921 sd 465.821
seed contigs 7248 local contigs 86100
kmer 60
kmers 11771870 11808943
merge bubble 502
contigs: 21020 n50: 671 max: 57410 mean: 327 total length: 6886687 n80: 266
aligned 1649679 reads
confirmed bases: 4021802 correct reads: 1233773 bases: 30336
distance mean 234.072 sd 746.447
invalid insert distance
kmer 80
kmers 8268691 8227832
merge bubble 114
contigs: 9160 n50: 974 max: 130897 mean: 642 total length: 5885783 n80: 387
aligned 1690081 reads
confirmed bases: 4053068 correct reads: 1267961 bases: 25744
distance mean 203.673 sd 8129.93
invalid insert distance
kmer 100
kmers 5350595 5292609
merge bubble 36
contigs: 5876 n50: 1180 max: 130897 mean: 912 total length: 5363417 n80: 463
reads 2572190
aligned 1734549 reads
distance mean 204.888 sd 7930.91
invalid insert distance

Sinclair Cooper

Mar 5, 2013, 9:29:23 AM
to hku-...@googlegroups.com
Hi all, 

Has anyone had any luck with the sample data I sent?

Thanks
Sinclair 

Jonathan Ligo

Mar 6, 2013, 1:09:20 PM
to hku-idba
How did you generate this data? (I'm getting a core dump as well; only
reads generated from the included read simulator seem to work for me.
I've tried generating data with MetaSim using the empirical model and
setting the paired-end probability to 1, but it still dumps core.)

Did it produce any useful contig/etc. files?

Yu PENG

Mar 6, 2013, 2:50:47 PM
to hku-...@googlegroups.com
Hi all,

If the insert distance is invalid, the scaffolding step has a high chance of hitting a segmentation fault. But contig generation should still work; you can find contig.fa or contig-maxk.fa as the result. We estimate the insert distance by aligning the reads to the contigs and computing the distance between the two reads of each pair. I am not sure why it doesn't work on this data. Can you also send the reference sequence for testing? How is the assembly quality? Can the contigs be mapped back to the reference well? If you align the reads to the reference, can you detect the insert distance well?
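
A minimal Python sketch of the estimation described above, for illustration only. It is not IDBA's actual code: the `Alignment` record, the function names, and the rule of keeping only forward-reverse pairs that land on the same contig are all assumptions.

```python
from statistics import mean, pstdev
from typing import NamedTuple


class Alignment(NamedTuple):
    """Hypothetical record for one read aligned to a contig."""
    contig: str
    start: int      # 0-based leftmost aligned position on the contig
    end: int        # exclusive rightmost aligned position
    forward: bool   # True if the read aligned in forward orientation


def pair_distance(a, b):
    """Outer distance of a forward-reverse pair, or None if the pair is unusable."""
    if a.contig != b.contig or a.forward == b.forward:
        return None
    fwd, rev = (a, b) if a.forward else (b, a)
    distance = rev.end - fwd.start
    return distance if distance > 0 else None


def estimate_insert_distance(pairs):
    """Return (mean, sd, number of pairs used), or None if no pair is usable."""
    distances = []
    for a, b in pairs:
        d = pair_distance(a, b)
        if d is not None:
            distances.append(d)
    if not distances:
        return None
    return mean(distances), pstdev(distances), len(distances)
```

With no usable pairs there is nothing to average, so the sketch returns `None` rather than a mean and sd.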

Thanks,
Yu Peng


Sinclair Cooper

Mar 7, 2013, 5:38:22 AM
to hku-...@googlegroups.com
Hi everyone, 

contig-100.fa is actually quite a good assembly for the test data, and the read counts match up with the 'reference'. (I passed contig-100 and the 'reference' sequences through dotter and they seemed to be a good match.)


I've attached the reference sequences. Some of the sequences have been multiplied to try and reflect the copy number variation in the system I'm attempting to sequence. In this small (relatively low complexity) data set the copy numbers are as follows:
copy#   length
   10      984
    5     1005
    1     1015
    1     1013
    1      992
    1     1018
    1     1004
    1      982
    1      969
    1     1024
 
The script I used to generate the reads from this data set was from here: http://socwiki.wordpress.com/2011/05/11/simulateseq-pl-perl-script-for-simulating-ngss-sequence/ 
The commands I used were simulateSeq.pl --PE 100X --ISPE 300 --ISSD 10 --RL 100-100 --ReadsSD 0  --type stdfq --circle  test_mini.fasta 


I've attached a clc report for a larger data set I generated in the same way (it gave the same results in idba). 


Many thanks for all of your input with this.


Sinclair
test_mini.fasta
CLC report (2).pdf

Jérémy Tournayre

Mar 14, 2013, 6:33:19 AM
to hku-...@googlegroups.com
Hi,

Same problem for me. 

Some additional observations:

- With version 1.1.0 of idba_ud I get this error. I think it is correlated with the sd: when the sd is greater than 400, I always get "Invalid insert distance".

- On the other hand, version 1.0.9 of idba_ud runs like clockwork.
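
To check how the reported sd lines up with the error across a run, the log can be tabulated. The sketch below is in Python and assumes only the log line formats visible in the outputs posted in this thread ("kmer N", "distance mean X sd Y", "invalid insert distance"); the function name `summarize_insert_distance` is made up for illustration.

```python
import re

KMER_RE = re.compile(r"^kmer (\d+)$")
DIST_RE = re.compile(r"^distance mean (\S+) sd (\S+)$")


def summarize_insert_distance(log_lines):
    """Collect one row per 'distance mean ... sd ...' line in an idba_ud log.

    Each row records the k-mer size in effect, the reported mean and sd,
    and whether 'invalid insert distance' was printed right after it.
    """
    rows = []
    kmer = None
    for raw in log_lines:
        line = raw.strip()
        m = KMER_RE.match(line)
        if m:
            kmer = int(m.group(1))
            continue
        m = DIST_RE.match(line)
        if m:
            rows.append({
                "kmer": kmer,
                "mean": float(m.group(1)),   # float() also accepts "-nan"
                "sd": float(m.group(2)),
                "invalid": False,
            })
            continue
        if line == "invalid insert distance" and rows:
            rows[-1]["invalid"] = True
    return rows
```

Printing the rows for a failing run and for a working one side by side shows at which k the error first appears and what sd was reported there.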

Thanks,
Jérémy Tournayre

daniel aguirre

Jun 14, 2013, 3:52:47 AM
to hku-...@googlegroups.com
Hi all,

I tried idba_ud with several simulated datasets and it worked fine in most instances. In one instance I got the core dump several times before I got it right!

Now, with real datasets, I'm getting the same problem as you guys, and it is clearly an issue with a maximum sd for the insert distance.
I tried the new version and modifying the script, but still got the same error.

As Yu Peng says, we can still use the generated contigs, which are quite fine.

Does anyone know if we can 'build' the scaffolds from the contigs with any available software? For example, we have our contigs and the initial paired reads; if a read pair aligns against the ends of two different contigs, we make a scaffold out of them and put X number of Ns in the middle.

Does this make sense? How can we do it?
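
The joining step described above could be sketched as follows in Python. This is only an illustration under strong assumptions: a single supporting read pair, both contigs already in the same orientation, and a known mean insert distance. The function names and parameters are hypothetical, and a real scaffolder would require many supporting pairs per link.

```python
def estimate_gap(len_a, fwd_start_a, rev_end_b, insert_mean, min_gap=1):
    """Estimate how many Ns separate contig A from contig B using one read pair.

    len_a        length of contig A
    fwd_start_a  0-based start of the forward mate on contig A
    rev_end_b    exclusive end of the reverse mate on contig B
    insert_mean  estimated mean insert distance of the library
    """
    # Part of the insert that is already covered by contig sequence.
    covered = (len_a - fwd_start_a) + rev_end_b
    # Never return less than min_gap, even if the contigs appear to overlap.
    return max(min_gap, round(insert_mean - covered))


def join_with_gap(contig_a, contig_b, gap):
    """Concatenate two contigs with `gap` Ns between them."""
    return contig_a + "N" * gap + contig_b
```

For example, a forward mate starting 100 bp before the end of contig A and a reverse mate ending 120 bp into contig B, with a mean insert distance of 300, leaves an estimated gap of 80 Ns.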

thanks

Zhuofei Xu

Aug 29, 2013, 9:57:16 AM
to hku-...@googlegroups.com
Hello All,

I ran into the same problem with 2 read datasets. One run looks normal and the other is abnormal: it lacks the scaffold file, scaffold.fa. I noticed that both runs generate the contig file contig.fa, but there is a small difference between them. The contig.fa from the normal run is complete, with a newline character at the end of the last line, but the other one might be incomplete, as it has no newline at the end.

So do you think such a contig.fa file is complete or not?
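
The trailing-newline check described above can be automated. The Python sketch below only tests the last byte of the file; the function name is made up, and a missing newline merely suggests (it does not prove) that writing was interrupted.

```python
import os


def ends_with_newline(path):
    """Return True if the file is non-empty and its last byte is a newline."""
    with open(path, "rb") as handle:
        handle.seek(0, os.SEEK_END)
        if handle.tell() == 0:
            return False          # empty file: nothing was written at all
        handle.seek(-1, os.SEEK_END)
        return handle.read(1) == b"\n"
```

Running it on the contig.fa from both runs gives a quick first indication of whether the abnormal run stopped in the middle of writing a record.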

Thanks a lot for your suggestions in advance!

Zhuofei

Ben Temperton

Sep 13, 2013, 12:19:08 PM
to hku-...@googlegroups.com
I can confirm this with my data sets.
Running IDBA-UD v.1.1.0 and 1.1.1 results in the following and a core dump (and no scaffold.fa)

number of threads 12
reads 150271916
long reads 0
extra reads 0
read_length 101
kmer 60
kmers 387875467 383443362
merge bubble 5826
contigs: 2947859 n50: 61 max: 22905 mean: 66 total length: 196442066 n80: 60
aligned 1252490 reads
confirmed bases: 4574771 correct reads: 454019 bases: 178643
kmer 20
kmers 1384235686 1413690000
merge bubble 505452
contigs: 29757001 n50: 25 max: 4640 mean: 30 total length: 897762003 n80: 20
aligned 379930 reads
confirmed bases: 3178588 correct reads: 127877 bases: 12507
distance mean -nan sd -nan
invalid insert distance
kmer 40
kmers 1981905064 2040150779
merge bubble 43548
contigs: 22163387 n50: 41 max: 51167 mean: 44 total length: 985913133 n80: 40
aligned 2949052 reads
confirmed bases: 18777173 correct reads: 679201 bases: 94180
distance mean 7287 sd 0
seed contigs 20653 local contigs 44326774
kmer 60
kmers 1247105612 1231462707
merge bubble 5606
contigs: 3468326 n50: 62 max: 127764 mean: 72 total length: 251963043 n80: 60
aligned 3102081 reads
confirmed bases: 19290186 correct reads: 702455 bases: 11289
distance mean 7287 sd 0
seed contigs 18623 local contigs 6936652
kmer 80
kmers 178091417 171168814
merge bubble 1091
contigs: 343387 n50: 113 max: 136176 mean: 151 total length: 52131602 n80: 80
aligned 3235230 reads
confirmed bases: 18979983 correct reads: 707838 bases: 4459
distance mean 7287 sd 0
seed contigs 17003 local contigs 686774
kmer 100
kmers 25446130 24372291
merge bubble 337
contigs: 20961 n50: 1611 max: 136176 mean: 1126 total length: 23618799 n80: 787
reads 0
aligned 0 reads
distance mean -nan sd -nan
invalid insert distance

Whereas, running v. 1.0.9 on the same data works fine:

number of threads 12
reads 75135958
long reads 0
read_length 101
kmer 60
kmers 297307891 289329626
merge bubble 5826
contigs: 102120 n50: 143 max: 22905 mean: 129 total length: 13262928
aligned 1091346 reads
confirmed bases: 4245315 correct reads: 428113 bases: 164712
kmer 20
kmers 1357876996 1369398738
merge bubble 507973
contigs: 17664727 n50: 43 max: 4819 mean: 37 total length: 660146933
aligned 370774 reads
confirmed bases: 3112494 correct reads: 127264 bases: 12129
distance mean 406.22 sd 232.893
seed contigs 6579 local contigs 13256
kmer 40
kmers 507006951 497715340
merge bubble 45438
contigs: 705710 n50: 109 max: 51167 mean: 105 total length: 74667222
aligned 2057022 reads
confirmed bases: 15214356 correct reads: 605465 bases: 70798
seed contigs 19696 local contigs 44049
kmer 60
kmers 72071102 71471755
merge bubble 9758
contigs: 119204 n50: 828 max: 97175 mean: 294 total length: 35063797
aligned 2562735 reads
confirmed bases: 18329394 correct reads: 659078 bases: 24008
seed contigs 19334 local contigs 43896
kmer 80
kmers 34215782 34084594
merge bubble 5187
contigs: 37084 n50: 1477 max: 97957 mean: 751 total length: 27863642
aligned 2903968 reads
confirmed bases: 19569921 correct reads: 696816 bases: 19404
seed contigs 17429 local contigs 39533
kmer 100
kmers 27151707 27093734
merge bubble 3509
contigs: 23267 n50: 1728 max: 98103 mean: 1149 total length: 26749209
aligned 2969925 reads
expected coverage 0.126465
num edges 8773
contigs: 17469 n50: 3584 max: 226190 mean: 1403 total length: 24510954

Stephen Turner

Dec 5, 2013, 10:09:33 AM
to hku-...@googlegroups.com, yp...@cs.hku.hk
I'd like to re-open this discussion. I am having the same invalid insert distance / segfault issues on a single-end fasta file that runs just fine in velvet. I attached the file (only 1000 reads). I'm using 1.1.0, but have also tried 1.1.1 and neither works. 

Using the file attached:

idba_ud -r tmp.fa -o idba

output:

number of threads 12
reads 1000
long reads 0
extra reads 0
read_length 151
kmer 20
kmers 5401 5400
merge bubble 0
contigs: 1 n50: 5405 max: 5405 mean: 5405 total length: 5405 n80: 5405
aligned 937 reads
confirmed bases: 5273 correct reads: 903 bases: 62
distance mean 173.028 sd 2177.7
invalid insert distance
kmer 40
kmers 5366 5365
merge bubble 0
contigs: 1 n50: 5405 max: 5405 mean: 5405 total length: 5405 n80: 5405
aligned 937 reads
confirmed bases: 5273 correct reads: 903 bases: 0
distance mean 173.028 sd 2177.7
invalid insert distance
kmer 60
kmers 5346 5345
merge bubble 0
contigs: 1 n50: 5405 max: 5405 mean: 5405 total length: 5405 n80: 5405
aligned 937 reads
confirmed bases: 5273 correct reads: 903 bases: 0
distance mean 173.028 sd 2177.7
invalid insert distance
kmer 80
kmers 5326 5325
merge bubble 0
contigs: 1 n50: 5405 max: 5405 mean: 5405 total length: 5405 n80: 5405
aligned 937 reads
confirmed bases: 5273 correct reads: 903 bases: 0
distance mean 173.028 sd 2177.7
invalid insert distance
kmer 100
kmers 5306 5305
merge bubble 0
contigs: 1 n50: 5405 max: 5405 mean: 5405 total length: 5405 n80: 5405
reads 1000
aligned 937 reads
distance mean 173.028 sd 2177.7
invalid insert distance
Segmentation fault

Anyone have any thoughts? Many thanks.

Stephen

tmp.fa

Joe Anderson

Dec 5, 2013, 1:40:33 PM
to hku-...@googlegroups.com, yp...@cs.hku.hk
I've received the exact same fault with Stephen's input file, using IDBA 1.1.0 on Ubuntu 13.10.
 
-Joe

Stephen Turner

Dec 5, 2013, 3:59:43 PM
to hku-...@googlegroups.com, yp...@cs.hku.hk
I've just discovered that running the exact same commands on the same data but specifying --num_threads 1 eliminates the problem. So, segfault with multithreaded job, runs fine on a single thread. Any thoughts Yu?

Thanks,

Stephen

Stephen Turner

Dec 5, 2013, 4:03:04 PM
to hku-...@googlegroups.com, yp...@cs.hku.hk
Even more strangely, I get a segfault with 12 cores, no segfault with 16 cores, segfault again at 24 cores, no segfault at 3 cores. Not sure what's going on here.

Thanks,

Stephen

Ben Temperton

Dec 11, 2013, 4:55:44 PM
to hku-...@googlegroups.com, yp...@cs.hku.hk
I can also confirm this on real datasets and that it does not segfault at 40 cores.



Em Seven

Jul 10, 2014, 8:52:53 AM
to hku-...@googlegroups.com, yp...@cs.hku.hk
Bet this has been resolved since, but am surfing the group for a problem of my own, and thought I might mention these lines from the IDBA README file:
"IDBA-UD IDBA-Hybrid and IDBA-Tran require paired-end reads stored in single FastA file and a pair of reads is in consecutive two lines."

I am thus more surprised that it actually works for your SE data with some values of --num_threads...
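
To illustrate the layout the README quote describes, here is a minimal Python sketch that interleaves two FASTQ files into one FASTA stream with the two reads of each pair as consecutive records. It assumes plain four-line FASTQ records in matching order in both files; the function names are made up for illustration. The IDBA package ships its own fq2fa converter for this job, and the sketch is not a replacement for it.

```python
from itertools import islice


def read_fastq(handle):
    """Yield (name, sequence) from a four-line-per-record FASTQ stream."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return                      # end of file or truncated last record
        name = record[0].rstrip("\n")[1:]   # drop the leading '@'
        sequence = record[1].rstrip("\n")
        yield name, sequence


def interleave_to_fasta(fq1, fq2, out):
    """Write mate 1 then mate 2 of every pair as consecutive FASTA records.

    Returns the number of pairs written. zip() stops at the shorter input,
    so unpaired trailing reads are silently dropped.
    """
    pairs = 0
    for (name1, seq1), (name2, seq2) in zip(read_fastq(fq1), read_fastq(fq2)):
        out.write(f">{name1}\n{seq1}\n>{name2}\n{seq2}\n")
        pairs += 1
    return pairs
```

Typical use would open the two FASTQ files and an output file and pass the three handles to `interleave_to_fasta`.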

Matthew

Dec 3, 2014, 5:44:41 PM
to hku-...@googlegroups.com
IDBA_UD is still crashing for me and saying "invalid insert size". Has this problem not been fixed yet?

Peter King

Dec 31, 2014, 2:29:25 PM
to hku-...@googlegroups.com
I don't believe so. I'm using the "Current Release" listed on the IDBA website (v1.1.1), and I'm having the same problems. Also, the scripts in this release don't seem to have been modified since July 2013, which suggests to me that the current release is still buggy.