Hi Stephen,
here is the script:
To run it:
$ awk -f mergeSuperContig.awk All.fasta All.gtf
It has the following hardcoded parameters that you can edit inside the script:
shortL=64000 is the maximum length of contigs that are merged into the supercontig; longer contigs are kept separate.
Choose this value so that the number of separate contigs stays below roughly 50,000-100,000.
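To check a candidate shortL before running the script, you can count how many contigs would stay separate. This is a small Python sketch, not part of the awk script; the function names are made up for illustration, and it assumes you already have the contig lengths in a list.

```python
def count_separate(lengths, short_l):
    """Number of contigs that stay separate, i.e. are longer than short_l."""
    return sum(1 for n in lengths if n > short_l)


def smallest_short_l(lengths, max_separate):
    """Smallest threshold that leaves at most max_separate contigs separate."""
    ordered = sorted(lengths, reverse=True)
    if max_separate >= len(ordered):
        # Every contig may stay separate, so nothing needs to be merged.
        return 0
    # The contig at rank max_separate (0-based) must be merged,
    # so the threshold has to be at least its length.
    return ordered[max_separate]
```

For example, count_separate(lengths, 64000) tells you how many separate contigs the default setting would leave.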
pN=60 is the length of the N-padding inserted between the merged short contigs. It should be approximately equal to the read length.
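Conceptually, the two parameters interact as in the Python sketch below. It only illustrates the idea and is not a translation of the awk script: the function name, the 0-based start positions, and the choice to pad only between contigs (not before the first one) are assumptions.

```python
def merge_short_contigs(contigs, short_l=64000, pad_n=60):
    """Split contigs into long ones and one padded supercontig.

    contigs: list of (name, sequence) pairs.
    Returns (long_contigs, supercontig_sequence, starts), where starts is a
    list of (name, 0-based start of that contig inside the supercontig).
    """
    long_contigs = []
    pieces = []
    starts = []
    offset = 0
    for name, seq in contigs:
        if len(seq) > short_l:
            # Longer than the threshold: keep as a separate contig.
            long_contigs.append((name, seq))
            continue
        if pieces:
            # Separate neighbouring short contigs with a run of N.
            pieces.append("N" * pad_n)
            offset += pad_n
        starts.append((name, offset))
        pieces.append(seq)
        offset += len(seq)
    return long_contigs, "".join(pieces), starts
```

The padding keeps a read of about that length from aligning across the junction between two unrelated short contigs, which is why pN should track the read length.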
The script will generate Long.out.fasta, Short.out.fasta and Annot.out.gtf files that have to be fed to STAR for genome generation.
After mapping, the coordinates of reads that map to the supercontig will need to be converted back to the local coordinates of the individual short contigs.
This can be done using the ChrStart.tab file, which contains the start positions of the short contigs within the supercontig.
If you test the genome generation and mapping, and it works fine for you, I can write a simple script to make this transformation.
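In the meantime, the transformation amounts to a lookup like the Python sketch below. It assumes the start positions have already been loaded from ChrStart.tab into a list of (name, start) pairs sorted by start, and that positions are 0-based; the exact column layout and coordinate base of the file are not described here, so adjust accordingly. The function name is hypothetical.

```python
from bisect import bisect_right


def to_local(pos, starts):
    """Map a supercontig position to (contig name, local position).

    starts: list of (name, start) pairs sorted by start, 0-based (assumed).
    Positions that fall inside N-padding are not detected here, because
    that would also require the contig lengths.
    """
    offsets = [start for _, start in starts]
    # Index of the last contig whose start is <= pos.
    i = bisect_right(offsets, pos) - 1
    if i < 0:
        raise ValueError("position lies before the first contig")
    name, start = starts[i]
    return name, pos - start
```

With starts = [("a", 0), ("c", 7)], position 8 in the supercontig corresponds to position 1 in contig "c".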
Cheers
Alex