Generating Assembly File from FASTA that includes contigs/gaps


Jerry Jenkins

Sep 17, 2019, 10:35:18 AM
to 3D Genomics
All,

  I have a FASTA file, and a HIC file that was generated externally using the SALSA pipeline by a collaborator.  I can load them into Juicebox for viewing, and there are obvious changes I need to make to the FASTA.  I tried creating an assembly file using "generate-assembly-file-from-fasta.awk"; however, it did not have the individual contigs/gaps broken out in the assembly file.  Is there a way to generate an assembly file from a FASTA that includes the contig/gap information?  I tried reverse engineering an assembly file from an existing assembly file I had, but was unsuccessful.  It seems like it should be fairly straightforward.  Any help would be appreciated.

Best Regards,

Jerry Jenkins

Olga Dudchenko

Sep 18, 2019, 7:43:09 AM
to 3D Genomics
Hi Jerry,

You can edit the sequences directly using the JBAT edit function. You do not need to convert your assembly back to contigs for this. You need a JBAT-compatible hic file, as generated by the 3d-dna visualize function, and an assembly file created by the script you cite.

If you want to generate contigs, you can use the generate-gap-bed.awk script in the utils directory of 3d-dna.

Best,
Olga

Jerry Jenkins

Sep 18, 2019, 9:14:37 AM
to 3D Genomics
Olga,

  What I am interested in is using the JUICER interface to perform my edits interactively.  I generally load an assembly file generated by run-asm-pipeline.sh.  However, in this case I already have a scaffolded assembly and a *.hic file.  The thing I am missing is the *.assembly file.  When I used the function generate-assembly-file-from-fasta.awk, it did generate an assembly file, but none of the underlying contig/gap information was included.  Is there a way to generate an assembly file that includes the gap/contig information from an existing FASTA file that I can load into JUICER and perform edits interactively?

Thank you for your help,

Jerry

Olga Dudchenko

Sep 18, 2019, 7:52:55 PM
to 3D Genomics
Jerry,

Sounds like you want to use Juicebox Assembly Tools (JBAT) rather than Juicer.

To use JBAT you need two things: a JBAT-compatible hic file and a .assembly file. Whether you have the former depends on how your collaborators generated the .hic file: the JBAT format is built against an 'assembly' chromosome. If the .hic file is not built against an assembly chromosome, you will need to rebuild the hic with the 3d-dna/visualize package. If you have built the .assembly file based on the fasta that your collaborators shared with you, you already have the second necessary component.

You do not need the gap/contig information to edit the assembly.

Olga

Jerry Jenkins

Oct 14, 2019, 10:49:24 AM
to 3D Genomics
I am still having trouble with this, and maybe I am not describing what I want to do properly.  I have a hic file that I generated using the following command:

java -Xmx10G -jar juicer_tools.jar pre -f assembly.fasta_MboI.txt alignments_sorted.txt scaffolds.hic chromosome_sizes.tsv

Where the alignments_sorted.txt file is of the following format:

0 Chr01 100000028 0 1 Chr01 99999941 1
0 Chr01 100000050 0 1 Chr01 99999949 1
0 Chr01 100000051 0 1 Chr01 99999941 1
0 Chr01 100000089 0 1 Chr01 99999941 1
0 Chr01 100000091 0 1 Chr01 99999941 1
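
A quick sanity check on this input can be sketched as follows, assuming the eight columns are strand, chromosome, position and fragment for each read end (the short format with fragments described on the Juicer Pre wiki). The sample file is a hypothetical stand-in holding the first two lines quoted above, written with tabs; the real file may be space-separated, which the default awk field splitting also handles.

```shell
# Hypothetical stand-in for alignments_sorted.txt (first two lines quoted above).
sample=$(mktemp)
printf '0\tChr01\t100000028\t0\t1\tChr01\t99999941\t1\n' > "$sample"
printf '0\tChr01\t100000050\t0\t1\tChr01\t99999949\t1\n' >> "$sample"

# Count lines that do not have exactly 8 whitespace-separated fields.
bad_lines=$(awk 'NF != 8 {n++} END {print n + 0}' "$sample")
echo "lines without 8 fields: $bad_lines"
```

Pointing the awk line at the real alignments_sorted.txt instead of the stand-in shows whether any records deviate from the eight-column layout.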

I then generated an assembly file using the following command:

awk -f ./3d-dna-master/utils/generate-assembly-file-from-fasta.awk assembly.fasta > assembly.fasta.assembly

However, when I load scaffolds.hic and assembly.fasta.assembly into JUICEBOX, all of the assembly information (blue boxes) is placed into the first chromosome.

What I am trying to do is to get to a point where I can use the *.assembly file to make edits on an existing assembly in a manner similar to what I do when working with an assembly that I have scaffolded using JUICER.  However, assembly.fasta was not generated using JUICER.

My question is:  How can I generate a *.hic and *.assembly file using an existing *.fasta file that will enable me to perform edits in a manner similar to a JUICER-scaffolded assembly?

Thank you,

Jerry Jenkins

Olga Dudchenko

Oct 14, 2019, 9:57:56 PM
to 3D Genomics
Jerry,

You will have to rebuild the .hic file. A file created like this:
java -Xmx10G -jar juicer_tools.jar pre -f assembly.fasta_MboI.txt alignments_sorted.txt scaffolds.hic chromosome_sizes.tsv
has data sandboxed into putative chromosomes. This does not work with Juicebox assembly tools (JBAT). You will have to reconstruct the merged_nodups.txt file from your alignments_sorted.txt or rerun Juicer to get a proper merged_nodups.txt file.

You can try the following:

1) try to reconstruct the mnd file in long format (https://github.com/aidenlab/juicer/wiki/Pre); this assumes all mapping scores are 1, as this information seems to be lost in your alignments_sorted.txt file:
awk 'BEGIN{FS="\t";OFS="\t"}{print $0, 1, "-", "-", "-", 1, "-", "-", "-","-","-"}' alignments_sorted.txt > pseudo_mnd_file.txt
2) make an assembly (presumably done):
awk -f ./3d-dna-master/utils/generate-assembly-file-from-fasta.awk assembly.fasta > assembly.fasta.assembly
3) build a JBAT-compatible hic file using 3D-DNA:
bash <path-to-3d-dna>/visualize/run-assembly-visualizer.sh <assembly.fasta.assembly> <pseudo_mnd_file.txt>
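
Before running the visualizer on the full data set, it may be worth inspecting what step 1 produces. The sketch below applies the step 1 command to a hypothetical two-line, tab-separated stand-in for alignments_sorted.txt and reports the number of fields per output line, so the count can be compared with the long-format column list on the wiki page linked in step 1. The temporary file names are illustrative only.

```shell
# Hypothetical two-line, tab-separated stand-in for alignments_sorted.txt.
sample=$(mktemp)
printf '0\tChr01\t100000028\t0\t1\tChr01\t99999941\t1\n' > "$sample"
printf '0\tChr01\t100000050\t0\t1\tChr01\t99999949\t1\n' >> "$sample"

# Step 1 from the list above, applied to the stand-in file.
pseudo=$(mktemp)
awk 'BEGIN{FS="\t";OFS="\t"}{print $0, 1, "-", "-", "-", 1, "-", "-", "-","-","-"}' "$sample" > "$pseudo"

# Report the distinct field counts found in the output.
n_fields=$(awk -F'\t' '{print NF}' "$pseudo" | sort -u)
echo "fields per line: $n_fields"
```

If the reported count or the position of the two mapping-quality columns differs from what the wiki describes, the list of appended placeholders in step 1 would need adjusting before step 3.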

See page 5 of the cookbook and the instructions that follow for details. Note that I would much rather recommend a proper Juicer alignment and a proper 3D-DNA rerun than this reconstruction surrogate.

Olga

Jerry Jenkins

Oct 15, 2019, 9:53:26 AM
to 3D Genomics
Olga,

  Thank you for your reply.  I will give it a go and let you know how it works out.

Best Regards,

Jerry Jenkins