Mapping .fasta sequences to .hic bins

Vanessa Roy

Mar 4, 2021, 8:10:42 AM

to 3D Genomics

Dear all,

I am working with the sandboxed _HiC files from DNA Zoo and have run into the following problem:

I don't seem to understand how to map the sequences from the .fasta files to the corresponding bin positions of the HiC contact maps.

For instance:

I am working with the Clonorchis sinensis datasets at 1Mb resolution from DNA Zoo. I have a .hic file, from which I can extract the contact maps chromosome by chromosome, but I don't understand which sequences from the .fasta file correspond to a given chromosome, or how to get the sequences corresponding to specific bins of the HiC contact maps.

As far as I understand, the .assembly files describe how the .fasta sequences and the .hic files relate, and the chromosome-length scaffolds are listed there. In the excerpt below, the first line lists the scaffold IDs of chromosome 1 in order and orientation, the second line those of chromosome 2, etc.

>hic_gap_3305 3305 500

-728 3305 159 3305 -956 3305 -933 3305 -957 3305 912 3305 1139 ... --> chromosome 1

1126 3305 949 3305 552 3305 -549 3305 888 ... --> chromosome 2

etc.

But when I look at the scaffold lengths from all of the scaffolds placed in the corresponding chromosomes, and at the number of bins of the corresponding HiC contact map, the numbers don't add up.

So my question is, how can I get the corresponding chromosome length sequences and map those to the bins in the HiC chromosomal contact maps?

Sorry for these basic questions, I am new to working with HiC data.

Many thanks!

Best,

Vanessa

Olga Dudchenko

Mar 5, 2021, 6:28:06 PM

to 3D Genomics

Hi Vanessa,

Here's a list of files in the Clonorchis_sinensis folder:

[F] 64223906 ASM360417v1.hic - this is the contact map that corresponds to the draft, ASM360417v1 (the "before" contact map on standard dnazoo.org/assembly pages). This is built against the "assembly" chromosome, since there are no chromosomes to speak of

[F] 84100 ASM360417v1.assembly - the assembly file corresponding to the contact map (in what order the original pieces are shown)

[F] 359613 ASM360417v1_asm.scaffold_track.txt - 2D annotation files in case you can't/don't want to load the assembly, e.g. if working in juicebox.js

[F] 306874 ASM360417v1_asm.superscaf_track.txt - same; for the draft these two 2D annotation files are equivalent

[F] 62684746 ASM360417v1.rawchrom.hic - contact map that corresponds to the final ordering and orientation of the draft pieces. This is built against the "assembly" chromosome to be fully comparable to the draft

[F] 111465 ASM360417v1.rawchrom.assembly - corresponding assembly file

[F] 434734 ASM360417v1.rawchrom_asm.scaffold_track.txt - corresponding track files

[F] 282527 ASM360417v1.rawchrom_asm.superscaf_track.txt

[F] 20425903 ASM360417v1_HiC.hic - contact map that corresponds to the final ordering and orientation of the draft pieces, built only for chromosome-length scaffolds, in a coordinate system that corresponds to the chromosomes. In contrast to rawchrom, N-overhangs are removed from draft sequences/fragments of sequences if necessary & gaps are introduced between scaffolded sequences

[F] 180296580 ASM360417v1_HiC.fasta.gz - chromosome-length fasta. This fasta matches the _HiC.hic contact map

[F] 117228 ASM360417v1_HiC.assembly - assembly file that describes the relationship between the draft and the _HiC.fasta, including removal of overhangs and gaps.

[F] 4945 README.json - metadata file with comments on species, samples and assembly
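Since the _HiC.fasta matches the _HiC.hic contact map, the bases behind a given bin can be taken by slicing the matching FASTA record. The following is only a minimal sketch of that idea, not DNA Zoo tooling: the function names are made up for illustration, and it assumes that bin i at resolution r covers the 0-based half-open interval [i*r, (i+1)*r) of the record with the same name as the .hic chromosome.

```python
import gzip
from typing import Iterable, Iterator, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def load_record(path: str, record_name: str) -> Optional[str]:
    """Return one record's sequence from a gzipped FASTA, or None if absent."""
    with gzip.open(path, "rt") as handle:
        for name, seq in read_fasta(handle):
            if name == record_name:
                return seq
    return None


def n_bins(seq_length: int, resolution: int) -> int:
    """Number of bins needed to cover a sequence (ceiling division)."""
    return -(-seq_length // resolution)


def bin_sequence(seq: str, bin_index: int, resolution: int) -> str:
    """Bases covered by one bin (assumed 0-based); the last bin may be shorter."""
    start = bin_index * resolution
    return seq[start:start + resolution]
```

For example, `seq = load_record("ASM360417v1_HiC.fasta.gz", "HiC_scaffold_1")` followed by `bin_sequence(seq, 0, 1_000_000)` would give the sequence assumed to sit behind the first 1Mb bin of HiC_scaffold_1, and `n_bins(len(seq), 1_000_000)` can be compared with the size of the extracted contact matrix as a sanity check.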

Best,

Olga

Vanessa Roy

Mar 6, 2021, 1:15:53 PM

to 3D Genomics

Dear Olga,

Thank you for the information.

I am still confused about the following, however:

The _HiC.fasta.gz corresponds to the _HiC.hic contact map and is the chromosome-length fasta.

But the _HiC.hic header lists 7 chromosomes (HiC_scaffold_1 .. HiC_scaffold_7), whereas the _HiC.fasta file contains 2555 scaffolds (>HiC_scaffold_1 ... >HiC_scaffold_2555) with their sequences. I therefore expected the _HiC.assembly file to also contain 2555 scaffolds, in their final order and orientation and split over 7 lines, one line per chromosome, but there are more (unplaced scaffolds?). The first 7 lines after >hic_gap_3305 3305 500 in the _HiC.assembly file contain only 725 scaffolds (excluding the gap scaffold), while the _HiC.assembly file as a whole contains 3304 scaffolds. Why do the _HiC.fasta and the _HiC.assembly files contain different numbers of scaffolds? Which scaffolds from the _HiC.fasta file correspond to which chromosomes?

Many thanks for the clarifications!

Best,

Vanessa

Olga Dudchenko

Mar 8, 2021, 11:05:48 AM

to 3D Genomics

Hi Vanessa,

The _HiC.hic is built only for chromosome-length scaffolds, and not for unplaced scaffolds. For the number of chromosomes identified in the assembly see the README.json.

With respect to your question about assembly, I think there might be a confusion about the structure of the assembly file. See this slide shared elsewhere on this forum: https://docs.google.com/viewer?a=v&pid=forums&srcid=MTQwMjA4ODQ1NDM4NDQ3NjQ4NzABMDg0NTM3ODMxMzA3NTM1MTg3MzcBY25kcHRhUkVBd0FKATAuMQEBdjI&authuser=0
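For readers without access to the slide, here is a rough sketch of how an .assembly file could be read. It assumes the layout suggested by the excerpt quoted earlier in this thread: header lines of the form ">name index length" (such as ">hic_gap_3305 3305 500"), followed by one line per output scaffold listing signed piece indices, where a minus sign means the piece is reverse-complemented. The function names are hypothetical, and the computed totals should be checked against the actual FASTA record lengths.

```python
from typing import Dict, Iterable, List, Tuple


def parse_assembly(lines: Iterable[str]) -> Tuple[Dict[int, str], Dict[int, int], List[List[int]]]:
    """Split .assembly lines into piece names, piece lengths and scaffold layouts."""
    names: Dict[int, str] = {}       # piece index -> piece name
    lengths: Dict[int, int] = {}     # piece index -> length in bp
    scaffolds: List[List[int]] = []  # one list of signed piece indices per line
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header: ">name index length" (assumed layout).
            name, index, length = line[1:].rsplit(None, 2)
            names[int(index)] = name
            lengths[int(index)] = int(length)
        else:
            scaffolds.append([int(field) for field in line.split()])
    return names, lengths, scaffolds


def scaffold_length(layout: List[int], lengths: Dict[int, int]) -> int:
    """Total length of one scaffold, counting gap pieces listed in the layout."""
    return sum(lengths[abs(i)] for i in layout)


def piece_spans(layout: List[int], lengths: Dict[int, int]) -> List[Tuple[int, str, int, int]]:
    """(piece index, strand, start, end) of each piece, 0-based half-open."""
    spans, offset = [], 0
    for i in layout:
        end = offset + lengths[abs(i)]
        spans.append((abs(i), "-" if i < 0 else "+", offset, end))
        offset = end
    return spans
```

Under these assumptions, the gap entries (500 bp each in the excerpt above) contribute to the scaffold length, which is one reason why summing only the draft scaffold lengths may not match the number of bins in the contact map.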

Best,

Olga

Vanessa Roy

Mar 8, 2021, 12:29:01 PM

to 3D Genomics

Hi Olga,

Thank you very much for your answer. Regarding the .assembly file, I did take a look at that file (also previously), but still don't seem to understand the structure of it.

My aims are the following:

I would like to know which scaffold sequences from the _HiC.fasta file (HiC_scaffold_1, HiC_scaffold_2, etc.) belong to which chromosome. I also need to know in which order and orientation they are placed, so that I end up with sequences (megascaffolds) that correspond to one chromosome each.

How / where can I retrieve this information such that I end up with the sequences for each chromosome?

Many thanks!

Best,

Vanessa

Olga Dudchenko

Mar 8, 2021, 12:46:01 PM

to 3D Genomics

Vanessa,

I am not sure I understand the question. HiC_scaffold_1 is chromosome 1, HiC_scaffold_2 is chromosome 2, etc. in the _HiC.fasta. These are what you refer to as megascaffolds.

Olga

Vanessa Roy

Mar 8, 2021, 12:49:39 PM

to 3D Genomics

Ok, thanks for your answer. But why then are there 2555 entries? In that _HiC.fasta, I find the sequences of >HiC_scaffold_1 through to >HiC_scaffold_2555. If HiC_scaffold_1 to HiC_scaffold_7 are the megascaffolds, what then are the rest of them?

Many thanks for the clarification!

Best

Vanessa

Olga Dudchenko

Mar 15, 2021, 11:29:06 AM

to 3D Genomics

Hi Vanessa,

These are unanchored sequences, i.e. sequences whose chromosome of origin remains unknown. This can be for a plethora of reasons, e.g. a) the sequence is too small to place reliably; b) the sequence has a conflicting signal, suggestive of a misjoin; c) the sequence has unusual coverage, suggesting some error modality; d) the sequence represents contamination of the nuclear genome, e.g. from a mitochondrion, chloroplast, symbiont or adapter; e) the sequences are alternative haplotypes of sequences already included in the main assembly; .... The point is that most genome assemblies you encounter, including the very best ones like human, do not consist of chromosomal sequences alone. One does not want to simply discard those sequences, as they may be legitimate parts of the genome. Hence, it is not unusual to include them in the genome assembly as unanchored sequences.
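In practice, the chromosome-length records can be separated from the unanchored ones by name once the chromosome count is known. The sketch below is illustrative only: `split_scaffolds` is a hypothetical helper, and the chromosome count is passed in by hand (7 here, as reported from the _HiC.hic header earlier in this thread) rather than read from README.json, whose field names are not shown here.

```python
import re
from typing import Iterable, List, Tuple


def split_scaffolds(names: Iterable[str], n_chromosomes: int) -> Tuple[List[str], List[str]]:
    """Separate HiC_scaffold_1..n (chromosome-length) from all other records."""
    chromosomes, unanchored = [], []
    for name in names:
        match = re.fullmatch(r"HiC_scaffold_(\d+)", name)
        # Assumption: the first n_chromosomes numbered scaffolds are the chromosomes.
        if match and 1 <= int(match.group(1)) <= n_chromosomes:
            chromosomes.append(name)
        else:
            unanchored.append(name)
    return chromosomes, unanchored
```

Called with the record names of the _HiC.fasta and `n_chromosomes=7`, this would return HiC_scaffold_1 through HiC_scaffold_7 as chromosomes and the remaining records as unanchored sequences.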

Olga

Vanessa Roy

Mar 16, 2021, 3:31:49 AM

to 3D Genomics

Dear Olga

Thank you very much for the clarification!

Best

Vanessa
