Regarding sequence alignment using SW but using fasta files

Muhammad Mobeen Movania

unread,

Jun 12, 2015, 1:14:12 AM6/12/15

to nvbio...@googlegroups.com

I have successfully compiled and run all examples given in the introduction section http://nvlabs.github.io/nvbio/nvbio_page.html. I have also successfully used the proteinSW example.

I would now like to carryout sequence alignment using SW algorithm on two fasta files. I do not know where to start. I know how to load fasta files but thats it. How do i extract the data from the fasta file and pass it forward to the SW aligner? Is there any simple example which might help me? What are the steps that I need to follow in order to align a source fasta with the reference fasta? Is there any documentation on this? I would like to obtain the alignment score and the alignment result.

Any help will be appreciated.

Thanks,
Mobeen

Jacopo Pantaleoni

unread,

Jun 12, 2015, 2:57:36 AM6/12/15

to Muhammad Mobeen Movania, nvbio...@googlegroups.com

Hello Muhammad,

this goes a bit beyond the help I can quickly offer to you over email, but the idea is that you have to look at how to load sequences with

the io::SequenceData objects (http://nvlabs.github.io/nvbio/sequence_io_page.html).

Once you have done that, transferring to the GPU is trivial (see SequenceDataDevice) - and at that point you can pick your string sets

using the SequenceDataAccess class.

But if all you want to do is just protein sequence alignment from files, you can also look into CUDASW++ 3.0 which is designed to do just that.

--
You received this message because you are subscribed to the Google Groups "nvbio-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nvbio-users...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Muhammad Mobeen Movania

unread,

Jun 14, 2015, 9:55:19 AM6/14/15

to nvbio...@googlegroups.com, mmmo...@gmail.com

Thanks Jacopo,
I will go through the steps you have mentioned and share my code once its done. As for using CUDASW++ I will also give it a try after I am done with nvbio's implementation. By the way, can CUDASW++ handle global alignment (Needleman Wunsch) also or is it restricted to local alignment only?

Jacopo Pantaleoni

unread,

Jun 15, 2015, 2:26:19 AM6/15/15

to Muhammad Mobeen Movania, nvbio...@googlegroups.com

I suppose CUDASW does only local alignment, but modifying it should be fairly straightforward...

Muhammad Mobeen Movania

unread,

Jun 15, 2015, 6:31:30 AM6/15/15

to nvbio...@googlegroups.com, mmmo...@gmail.com

Hi Jacopo,
OK I have got some success by looking at the seeding.cu example in nvbio distribution. I am currently trying to read all sequences in my first (reference) fasta file. This is how I do it currently. Note that I wanted to see the contents of current seed but the code crashes when I try to call d_seed_set[i].m_string and this I think is probably because I am trying to read device data on host directly? How should I go about this? Thanks for your help so far.

nvbio::SharedPointer<nvbio::io::SequenceDataInputStream> reads_file(nvbio::io::open_sequence_file(ref_filename));
nvbio::io::SequenceDataHost reads;
    
// declare how much sequence data we want to load in each batch
const nvbio::uint32  seqs_per_batch = 128 * 1024;        // the maximum number of sequences
const nvbio::uint32   bps_per_batch = 128 * 1024 * 100;    // the maximum number of base pairs
    
// loop through the stream in batches
while (nvbio::io::next(nvbio::DNA_N, &reads, reads_file.get(), seqs_per_batch, bps_per_batch))
{
   // copy the loaded batch on the device
   const nvbio::io::SequenceDataDevice device_reads(reads);
      
   // and do something with it...
   printf("File contains %u sequences.\n", reads.size());
         
   // prepare some typedefs for the involved string-sets and infixes
   typedef nvbio::io::SequenceDataAccess<nvbio::DNA_N>        read_access_type;
   typedef read_access_type::sequence_string_set_type        string_set_type;    // the read string-set
   typedef nvbio::string_set_infix_coord_type                infix_coord_type;   // the infix coordinate type, for string-sets
   typedef nvbio::vector<nvbio::device_tag, infix_coord_type>     infix_vector_type;  // the device vector type for infix coordinates
   typedef nvbio::InfixSet<string_set_type, const nvbio::string_set_infix_coord_type*>   seed_set_type;      // the infix-set type for representing seeds

   // build a read accessor
   const read_access_type d_read_access(device_reads);

   // fetch the actual read string-set
   const string_set_type d_read_string_set = d_read_access.sequence_string_set();

   // prepare enough storage for the seed coordinates
   infix_vector_type d_seed_coords;

   // extract the seeds and get the corresponding string-set representation
   const seed_set_type d_seed_set = extract_seeds(    d_read_string_set, 20u,    10u, d_seed_coords);

   //see the contents of the all seeds
   for (size_t i = 0; i < d_seed_set.size(); ++i) {
      log_info(stderr, "Seed: %s\n", d_seed_set[i].m_string); //this line crashes
   }

   return 0;
}

Reply all

Reply to author

Forward