Post-review: The length of fasta does not match that suggested by the cprops file


Yuan Tao

Oct 8, 2018, 3:44:49 PM
to 3D Genomics

Dear 3D Genomic Group,

I just ran the post-review script and got this error right after the first fragment:

The length of fasta does not match that suggested by the cprops file. Exiting!

The first fragment is:

>scaffold_0:::fragment_1:::debris
TATCAGTAGATGCATGCCAGGGGTGGATCAGGCATGCAAGAGGATTTTACCCCCGCTGCCTGGCTGGGGCCAATATAGCC
TGTGATGTGGATGAGATTCTCTGGCCTGACCCAGACCACAGACAAGATGCTGAGGTGGGATAATGTTTCTGGTGTGTGTT
GTACTGTATAGTACATGAATGACAATGTGCCACTGTACATACATTGGTTGCGTCTTTTGTGTTTATCATGTGACAAGGGT
TTTGAAGGGGAGAACCATAAGCAACCTATTTGTTTTGGAACTATTGTTTTGAATTAATTGTACCAGTGTGTAAAACTCTG
TTGTAGTGTGTGTGTTTTTGAGGGCTTGTGTTTGATGTCTGAGGGCAAAGTTTGGTTTTTCAGCAGGAGTGAATAGTTTT
GGGTGTAGAGCTTCATTTTGACCTGGAAATAGGATGTCTGGGAAATTGAGTGAGATGTTATGAATTTGTGTTTACTGTTG
TGAGGATAGGAGGCGTAGTTTCAAGAAATGTGCTTTAGCAATCGAGAAAAACTGTAATATGCCTCTGCATCTCTGCACTA
AACTTCCTGTTTCTCAGTCTGGCTGAGCACTTGGTTTGTGGGTCAAAGCTGGCCTTCTTTTATACAGGCCTTTATTATTA
AAATACATAAAACAATATAATCAGAAAATAACCACTGAGAACATATATTTATTCTTTAACAGCATTTTTGAAAACACAGC
TGTTTCTATGCGTTTGGGCCTTTTGTCCACACGTAAACGGCGTTTCCAATCACCCAAAACTGAGGACGTGTGGACGAGGA
ATACAGAGTTCGTCAAGCAACATCACAGGTACGTGGCTTTTTTAAAGTTTTTAAGTTTTTACTGTTTGTTATTGTTTACA
TGAGATGAATTGCAGAATGGCAGATAGAGACAAAATACTGTTACTGTTAATCTTAATATCTTCAGTTTTTACATGCTTAC
ATATAGACTACACATGGAGTTACTGTCCTTCCATTTTGAAAGGCAGAGGCGTCATGGTGT


I counted the first fragment in the final fasta file and got a length of 1033 bp, while the assembly file says 1020:

>scaffold_0:::fragment_1:::debris 1 1020
>scaffold_0:::fragment_2 2 3430736
>scaffold_0:::fragment_3:::debris 3 5495
>scaffold_0:::fragment_4 4 8833000
>scaffold_0:::fragment_5:::debris 5 25000
>scaffold_0:::fragment_6 6 550000


I didn't edit this part of the final assembly, so it is the same as in the output assembly file.
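
For a check like the one described above, a minimal Python sketch follows. It counts only sequence characters per fasta record and compares them with the lengths in the `>name index length` lines of the assembly file, stripping line endings (including any carriage returns) on both sides. The function names are hypothetical and not part of 3d-dna, and the assembly layout is assumed from the excerpt quoted above. As an aside, the fragment shown spans 13 lines, so a byte count that includes one newline per line would give 1020 + 13 = 1033, which may explain the first count; that is only a guess.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {record name: number of bases}, ignoring all line-ending characters."""
    lengths: Dict[str, int] = {}
    name: Optional[str] = None
    for raw in lines:
        line = raw.strip()  # drops "\n", "\r" and surrounding spaces
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def assembly_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Parse '>name index length' lines; other lines are skipped (assumed layout)."""
    lengths: Dict[str, int] = {}
    for raw in lines:
        line = raw.strip()
        if not line.startswith(">"):
            continue
        fields = line[1:].split()
        lengths[fields[0]] = int(fields[-1])
    return lengths


def mismatches(
    fasta: Dict[str, int], assembly: Dict[str, int]
) -> List[Tuple[str, Optional[int], int]]:
    """List (name, fasta length or None, assembly length) where the two disagree."""
    return [
        (name, fasta.get(name), length)
        for name, length in assembly.items()
        if fasta.get(name) != length
    ]
```

Both parsing functions accept any iterable of lines, so open file handles for the fasta and assembly files can be passed directly.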

I've also checked that I used the corresponding fasta and hic files (both `final`), and everything looks good when I reload them into JBAT in the correct order (hic, original assembly, review assembly).

What do you think could have gone wrong during the process?

Thank you for any help.

Best,
Yuan

Yuan Tao

Oct 9, 2018, 1:25:13 PM
to 3D Genomics
Sorry, I counted it wrong. It is exactly 1020 bp, which makes me even more confused...

Olga Dudchenko

Oct 9, 2018, 2:24:10 PM
to 3D Genomics
Hey Yuan,

Hmm, any chance you did the review on a Windows machine? Older versions of JBAT wrote a carriage return at the end of each line by default, which can lead to issues on Windows machines. Open the file in something that can display those characters and, if they are there, do something like

tr -d '\r'


to get rid of everything except the newline characters.

The master branch of Juicebox/JBAT handles this, but I am not sure whether the build does as well.
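
To see which line endings a file actually contains before deciding how to clean it, a rough Python sketch is given here. It counts CRLF, bare CR, and LF terminators in the raw bytes and can convert all of them to LF. The function names are illustrative and not part of JBAT or 3d-dna.

```python
from typing import Dict


def count_line_endings(data: bytes) -> Dict[str, int]:
    """Count Windows (CRLF), bare CR, and Unix (LF) line terminators."""
    crlf = data.count(b"\r\n")
    bare_cr = data.count(b"\r") - crlf  # carriage returns not followed by "\n"
    lf = data.count(b"\n") - crlf       # newlines not preceded by "\r"
    return {"CRLF": crlf, "CR": bare_cr, "LF": lf}


def normalize_newlines(data: bytes) -> bytes:
    """Turn CRLF and bare CR terminators into plain LF."""
    # CRLF must be handled first so it does not become two newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

Reading the assembly file in binary mode (`open(path, "rb").read()`) and passing the bytes to `count_line_endings` shows whether any carriage returns are present; writing back the result of `normalize_newlines` to a new file leaves the original untouched.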

Best,
Olga

Yuan Tao

Oct 9, 2018, 2:33:19 PM
to 3D Genomics
Hi Olga,

I did everything on Linux, except for JBAT, which I ran on OS X.
Would the carriage returns come from the fasta file? That one was simply generated by the pipeline...
Or do you mean I should clean up the review assembly file?

Best,
Yuan

Yuan Tao

Oct 9, 2018, 2:35:05 PM
to 3D Genomics
By the way, is there any test data I could use to try this out more quickly?

Best,
Yuan

Olga Dudchenko

Oct 10, 2018, 4:17:24 AM
to 3D Genomics
Yuan,

If you did this on macOS, there should not be any problem with carriage returns in the assembly file. Please share the files and I will take a look. I would need your original fasta and assembly files.

Regarding example files, I am not sure what you are referring to. There are some example hic and assembly files shared as part of the JBAT tutorial. In any case, this is the first time I have seen this error triggered. I suspect some file corruption again.

Best,
Olga

Yuan Tao

Oct 10, 2018, 11:18:04 AM
to 3D Genomics
Olga, 

I see! I just found that the original assembly file has a CR at the end of each line... Could that be the problem?

Best,
Yuan

Olga Dudchenko

Oct 10, 2018, 2:15:32 PM
to 3D Genomics
Yep, that might be it. (Presumably it was saved at some point on a Windows machine?)

Anyhow, just remove those with tr or any editor and feed the file to 3d-dna.

Best,
Olga

Yuan Tao

Oct 18, 2018, 4:14:04 PM
to 3D Genomics
Hi Olga,

I found a better-quality draft assembly of a closely related species and switched to that one. I unmasked the new assembly and put each fasta sequence on a single line instead of wrapping at 80 characters per line. (Am I supposed to do that? How would you preprocess a soft-masked assembly as input?) The old problems did not show up with the new assembly, but a new problem appeared. After the first round, it failed to generate the HiC map. The first HiC map (*.0.hic) was OK, but it hit this error later:

java.lang.NumberFormatException: For input string: "K00339:108:HV3T3BBXX:8:1103:19776:35761/2"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:580)
        at java.lang.Integer.parseInt(Integer.java:615)
        at juicebox.tools.utils.original.AsciiPairIterator.advance(AsciiPairIterator.java:149)
        at juicebox.tools.utils.original.AsciiPairIterator.next(AsciiPairIterator.java:194)
        at juicebox.tools.utils.original.Preprocessor.computeWholeGenomeMatrix(Preprocessor.java:493)
        at juicebox.tools.utils.original.Preprocessor.writeBody(Preprocessor.java:371)
        at juicebox.tools.utils.original.Preprocessor.preprocess(Preprocessor.java:283)
        at juicebox.tools.clt.old.PreProcessing.run(PreProcessing.java:108)
        at juicebox.tools.HiCTools.main(HiCTools.java:86)
Could not read hic file: null


I grepped for this read name in `merged_nodup.txt` and the line looks just like any other, nothing special... I have no idea why it tried to parse this string as an integer. The line is in the middle of the mnd file.

Thank you,
Yuan

Olga Dudchenko

Oct 19, 2018, 12:05:21 PM
to 3D Genomics
Hi Yuan,

Soft masking does not matter. I would advise pre-wrapping the fasta (there is a wrap-fasta utility in 3d-dna/utils), as single-line output may result in long compute times at the finalize stage. Other than that, it should not matter.
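
As an illustration of what re-wrapping involves, here is a minimal Python sketch that joins each record's sequence and re-emits it at a fixed width. It is not the wrap-fasta utility from 3d-dna/utils, whose behavior is not shown in this thread; the function name and the default width of 80 are assumptions.

```python
from typing import Iterable, Iterator, List


def wrap_fasta(lines: Iterable[str], width: int = 80) -> Iterator[str]:
    """Yield fasta lines (without terminators) with sequences wrapped at `width`."""
    parts: List[str] = []

    def flush() -> Iterator[str]:
        # Join the buffered sequence of the current record and cut it into chunks.
        # Note: this holds one whole record in memory at a time.
        seq = "".join(parts)
        parts.clear()
        for start in range(0, len(seq), width):
            yield seq[start:start + width]

    for raw in lines:
        line = raw.strip()  # also removes any carriage returns
        if not line:
            continue
        if line.startswith(">"):
            yield from flush()  # finish the previous record first
            yield line
        else:
            parts.append(line)
    yield from flush()  # finish the last record
```

Writing the result is a matter of iterating over `wrap_fasta(handle)` and writing each yielded line followed by `"\n"` to a new file.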

As for the error, it again looks like there might be some file corruption. If you want, put the mnd somewhere and I can try to dig out where the issue is. In general, pre should not even be seeing the read names: 3d-dna does the mapping to the 'assembly' chromosome and substitutes all the non-essential entries. Actually, try running 3d-dna/visualize/run-assembly-visualizer.sh on both the original draft.assembly and the 0.assembly. If the draft does not work, then the issue is the original mnd file. If it does, something went wrong in the remapping, due to either an issue with the .0.assembly file or some compute issue.
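
One quick way to look for this kind of corruption is to scan the mnd file for lines whose field count differs from the rest. The Python sketch below assumes that every record in a healthy file has the same number of whitespace-separated fields, so a truncated or merged line stands out; the function name is hypothetical and the check does not validate the individual columns.

```python
from typing import Iterable, List, Optional, Tuple


def irregular_lines(
    lines: Iterable[str], expected: Optional[int] = None, limit: int = 10
) -> List[Tuple[int, int]]:
    """Return up to `limit` (line number, field count) pairs that deviate.

    If `expected` is None, the field count of the first line is used as the
    reference (an assumption: the first line is taken to be well formed).
    """
    found: List[Tuple[int, int]] = []
    for number, raw in enumerate(lines, start=1):
        count = len(raw.split())
        if expected is None:
            expected = count
        if count != expected:
            found.append((number, count))
            if len(found) >= limit:
                break  # stop early; a few examples are enough to inspect
    return found
```

Because the function streams over the lines, it can be given an open file handle for a large `merged_nodup.txt` without loading the file into memory.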

Olga 