tail: invalid number of bytes: ‘+’

李兴正

Aug 12, 2020, 1:21:21 AM
to 3D Genomics
Hi 3D Genomics group,

Thanks for creating the genome assembly tools and the 3D Genomics forum.

I am having an issue running the 3d-dna assembly pipeline on a draft genome assembly (the step before modifying the assembly in Juicebox Assembly Tools): ./3d-dna/run-asm-pipeline.sh draft.fa merged_nodups.txt. The log file contains 2964 lines of the same error message: tail: invalid number of bytes: ‘+’. I believe this error comes from the tail commands at lines 62 and 64 of the script construct-fasta-from-asm.sh.

This is an illustration of the "tail" command:
"The -c option is less tolerant than the -n option. That is, there is no default number of bytes, and thus some integer must be supplied. Also, the letter c cannot be omitted as can the letter n, because in such case tail would interpret the hyphen and integer combination as the -n option. Thus, for example, the following would produce an error message something like tail: aardvark: invalid number of bytes: tail -c aardvark" --- http://www.linfo.org/tail.html

Based on the usage information for tail, it seems that the tail commands at lines 62 and 64 of construct-fasta-from-asm.sh did not receive their -c parameter properly:
tail -c +${index[${contig}]} ${fasta} | awk '$0~/>/{exit}1' | awk -f ${pipeline}/utils/reverse-fasta.awk -
tail -c +${index[${contig}]} ${fasta} | awk '$0~/>/{exit}1'

But I can't figure out the original source from which the "invalid number of bytes" error came. I have tried some possible solutions related to the problem in this forum and theaidenlab/3d-dna github issues, but none of them solved the "invalid number of bytes" errors.
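
One possible reading of the message, not confirmed anywhere in this thread: if ${index[${contig}]} expands to an empty string, the command becomes "tail -c + file", and GNU tail then reports the bare "+" as an invalid byte count. The sketch below guards against that case before calling tail. The helper name safe_tail_from is hypothetical and is not part of 3d-dna.

```shell
# Hypothetical helper (not part of 3d-dna): call "tail -c +OFFSET FILE"
# only when OFFSET is a non-empty string of digits.
safe_tail_from() {
    case "$1" in
        ''|*[!0-9]*)
            # An empty or non-numeric offset is what would leave a bare "+"
            echo "no valid byte offset ('$1') for $2" >&2
            return 1
            ;;
    esac
    tail -c "+$1" "$2"
}

# Usage (file name and offset are placeholders):
#   safe_tail_from 9 draft.fa | awk '$0~/>/{exit}1'
```

Failing loudly on a missing offset points at the contig whose index lookup failed, instead of producing thousands of identical tail errors.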

However, I found that the temporary index file (32.3 MB) created (and then removed) by construct-fasta-from-asm.sh seems larger than usual, and there were many hidden characters (^@) before its last line (please see the attached screenshot of tmp.index viewed in less). I think this might give some clues about the tail issue.
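
The ^@ characters that less displays are NUL bytes. A minimal sketch for counting them in a file, assuming only standard tr and wc; the helper name count_nul_bytes is hypothetical, and the thread does not establish why such bytes ended up in the index file.

```shell
# Hypothetical helper: print the number of NUL bytes in a file by comparing
# its size with and without them.
count_nul_bytes() {
    total=$(wc -c < "$1")
    kept=$(tr -d '\000' < "$1" | wc -c)
    echo $((total - kept))
}

# Usage (file name is a placeholder):
#   count_nul_bytes tmp.index
```

A non-zero count on the index file or on the input fasta would be worth reporting alongside the log.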

By the way, the draft.fa file was generated by a colleague of mine using wtdbg2.

Here is the environment in which I am running 3d-dna:
  • lastz (version 1.04.00 released 20170312)
  • java version "1.7.0_45"
  • GNU bash, version 4.1.2(1)-release (x86_64-redhat-linux-gnu)
  • GNU Awk 5.1.0, API: 3.0
  • GNU coreutils sort version 8.31
  • Python 2.7.8
  • GNU parallel 20200722
Attached are the log file, the rawchrom.assembly file, the Hi-C map (derived from rawchrom.hic and rawchrom.assembly), and a screenshot of the tmp.index file viewed in less; the tmp.index file itself was too large to upload.

It's my first time posting a question on this Google forum, so please bear with me if I didn't make myself clear, and let me know if I missed any essential point.

Thanks!

Regards,
Xingzheng Li
3d-dna.log
horse.Equus_caballus_61.rawchrom.assembly
2020.08.12.10.24.51.HiCImage.pdf
tmp.index_less.png

Olga Dudchenko

Aug 12, 2020, 9:50:55 AM
to 3D Genomics
Hi Xingzheng,

Wonderful to have so many details included in the post to help understand the issue. Thank you.

Regarding the problem: as you point out yourself, the issue is likely the draft fasta. Make sure you remove all carriage return characters and any empty lines. 3d-dna assumes \n for newline.
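
A minimal sketch of that cleanup, assuming a plain-text fasta and standard tr and awk. The function name clean_fasta and the file names in the usage comment are hypothetical; nothing here is part of 3d-dna.

```shell
# Hypothetical helper: write a copy of a fasta to stdout with carriage
# returns removed and empty (or whitespace-only) lines dropped.
clean_fasta() {
    tr -d '\r' < "$1" | awk 'NF > 0'
}

# Usage (file names are placeholders):
#   clean_fasta draft.fa > draft.clean.fa
```

Writing to a new file rather than editing in place keeps the original draft available for comparison.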

Hope this helps,

Best,
Olga

李兴正

Aug 12, 2020, 8:56:59 PM
to 3D Genomics
Hi Olga,

Thank you for the quick reply!

The draft fasta was the only input file whose format I was not sure about, because it was produced by a colleague of mine.

I am re-running the genome assembly steps for the PacBio long reads using wtdbg2, and hopefully the result will work in the 3d-dna pipeline.

By the way, it would be great to add a format check module for the input files.

Thank you again!

Regards,
Xingzheng Li

HY

Jan 11, 2021, 11:42:58 PM
to 3D Genomics
Hello Xingzheng Li,

Has the "tail: invalid number of bytes: ‘+’" problem you met when running 3D-DNA been solved? I got the same error and don't know how to solve it. Could you please give me some suggestions? Thanks.

Best Wishes!

HY

Olga Dudchenko

Jan 14, 2021, 2:27:11 PM
to 3D Genomics
Hello HY,

The reason is usually either 1) wrong input during the post-review step, or 2) carriage return characters in the fasta or review.assembly file (if using an old JBAT on a Windows machine).

Best,
Olga

HY

Jan 15, 2021, 12:54:57 AM
to 3D Genomics
I have a question about the 3D-DNA script "run-asm-pipeline-post-review.sh": I'm not sure which fasta file should be its input. I used the "*rawchrom.hic" and "rawchrom.assembly" files to generate the "*review.assembly" file. Should I use the original fasta file (the one I used to run "juicer.sh") or the "*rawchrom.fasta" file generated by "run-asm-pipeline.sh"?

Olga Dudchenko

Jan 15, 2021, 2:21:17 PM
to 3D Genomics
Hi HY,

Please consult the Genome Assembly Cookbook (dnazoo.org/methods) or --help.

The input is the original fasta file, along with the original mnd file. 

Olga

Valentina Peona

Jan 18, 2021, 3:53:02 AM
to 3D Genomics
Hi Olga,

I have a similar problem. I'm sure I'm using the right input files, and I checked for carriage return characters as you suggested, but I am still getting the same error. I then looked at the reviewed assembly file: its last lines look "strange", which made me think the problem may reside there (?):

>scaffold1630,32,f1489Z32 3796 32

>scaffold1631,22,f1374Z22 3797 22

>scaffold1632,22,f1819Z22 3798 22

3232 2689 -2751 -3479 -3291 -3305 2939 -2999 -3018 -2823 2701 1997 -1802 2806 3143 -2492 -2752 -2045 3507 -2588 3271 -3041 2017 -2052 2078 -2401 2969 -2472 3108 -2835 -1257 -3468 2458 2804 2440 -3278 -3110 2451 -2748

-1870

-1349

372

374

376

378

-910

-908

361 957

74 948 950 -210 302 -70 3530 3489 -3195 -317 202

55

57 -194

858

-356 860 -104 -101 123 -515

-149 713 -147 -157 84 21 -1212 -211 1758 207 -208 -131

-129

468 -96
[... continues like this for several more lines]


Is the end of the assembly file supposed to look like this? I'm using Juicebox 1.11.08 on macOS Catalina.

Best,
Valentina

Olga Dudchenko

Jan 20, 2021, 2:37:58 PM
to 3D Genomics
Hi Valentina,

It looks like your assembly file somehow has blank lines after every scaffold. I don't know what editor introduced them, but if this is indeed the case, that's your problem.
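
To check for this before re-running the post-review script, blank lines in an .assembly file can be counted and stripped with awk. This is a sketch under the assumption that every meaningful line in the file contains at least one non-whitespace field; the helper names and the file names in the usage comments are hypothetical.

```shell
# Hypothetical helpers for inspecting a reviewed .assembly file.

# Print how many lines are empty or whitespace-only.
count_blank_lines() {
    awk 'NF == 0 { n++ } END { print n + 0 }' "$1"
}

# Print the file with those lines removed.
strip_blank_lines() {
    awk 'NF > 0' "$1"
}

# Usage (file names are placeholders):
#   count_blank_lines my.review.assembly
#   strip_blank_lines my.review.assembly > my.review.noblank.assembly
```

If the count is zero, the blank lines seen in a pasted excerpt were probably introduced by copying rather than being present in the file.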

Olga