Re: Project 5

27 views
Skip to first unread message
Message has been deleted

Keith Bradnam

unread,
Oct 5, 2013, 1:47:20 PM10/5/13
to unix-and-perl-...@googlegroups.com
This code has syntax errors that prevent it from running at all. You can always check the syntax of your code by simply running:


You should get into the habit, especially when learning a language, of writing one line of code and then testing it. Then write another line and test it again. Obviously if you are writing a block of code (with curly braces) you need to have the opening and closing lines of your block in place, but you can still proceed by adding one line at a time inside your block.

Apart from the syntax errors that you would discover easily if you test-as-you-go, there are other problems with your code:

1) Block style. Your while and foreach loops are not set out as we suggest. There is no indentation and this makes your code hard to read.

2) You assign the contents of @ARGV to two variables (which is good) but then continue to use $ARGV[0] instead of $sequences.

3) There are no comments in your code to indicate what each block of code is trying to do. This makes it harder to follow.


On Saturday, October 5, 2013 6:14:56 AM UTC-7, Ema wrote:
This is my code for project 5. It is not completed yet. Can you please tell me if i'm going right or wrong? Is there anything thats need to be modified?
#!/usr/bin/perl
# kmeranalysis.pl
use strict; use warnings;
die "usage: kmeranalysis.pl <file name> <k>\n" unless @ARGV == 2;
my $sequences=$ARGV[0];
my $k=$ARGV[1];

my %count1 = ()
my %count2 = ()
my $total1 = 0;
my $total2 = 0;
open(IN, "<$ARGV[0]") or die "error reading $ARGV[0] for reading";

while (<IN>)
{
chomp;
$key1 = $1 if /^(>\S+i1\S+)/;
$sequences1{$key1} = $_;
 for (my $i = 0; $i < length($sequences1{$key1}); $i ++)
{
    my $kmer = substr($sequences1{$key1}, $i, $k);
if (exists $count1{$kmer}) {$count1{$kmer}++}
else {$count1{$kmer} = 1}
$total1++;

 }

foreach my $kmer (sort keys %count1) {
my $frequency = $count1{$kmer}/$total1;
printf "%s\t%d\t%.4f\n", $kmer, $count1{$kmer}, $frequency;
 }}
close IN;

Message has been deleted
Message has been deleted
Message has been deleted

Keith Bradnam

unread,
Oct 7, 2013, 12:13:45 AM10/7/13
to unix-and-perl-...@googlegroups.com
Can you post a version of your latest code. Also, are you using a modified version of the IME intron file so that each DNA sequence is only on one line?

On Saturday, October 5, 2013 1:44:08 PM UTC-7, Ema wrote:

I've modified the code a bit. It is ruuning but there is some problem. It also reads the header line with the sequence and gives result. how i can get the fasta file in only a line . can you please give me a sample code just for the beginning.this project seems to me quite hard.If you can please give a sample code of how to count kmers in intron1 i will understand the procedure. I am at a loss and i need help. please.

Ema

unread,
Oct 8, 2013, 1:35:35 PM10/8/13
to unix-and-perl-...@googlegroups.com
#!/usr/bin/perl
# kmeranalysis.pl
use strict; use warnings;
die "usage: kmeranalysis.pl <file name> <k>\n" unless @ARGV == 2;

my $k=$ARGV[1];

my (%count) = ();
my (%count1) = () ;
my (%frequency) = ();
my (%frequency1) = ();
my $total = 0;
my $total1 = 0;
my $kmer;
my $kmer1 ;


open(IN, "<$ARGV[0]") or die "error reading $ARGV[0] for reading";

my ($seq) = "";
while (<IN>){
     my ($line) = $_;
     if($line =~ m/^>\S+i1\S+/){

        ($seq) = <IN>;
        chomp($seq);

        for (my $i = 0; $i < length($seq); $i ++){
            my $kmer = substr($seq, $i, $k);

            if (length($kmer)==$k){
                if (exists $count{$kmer}) {$count{$kmer}++}
                else {$count{$kmer} = 1}
               
            }
            $total++;          
        }      
    }

   

     elsif ($line =~ m/^>\S+i2\S+/){
       
        ($seq) = <IN>;
        chomp($seq);

        for (my $i = 0; $i < length($seq); $i ++){
            my $kmer1 = substr($seq, $i, $k);

            if (length($kmer1)==$k){
                if (exists $count1{$kmer1}) {$count1{$kmer1}++}
                else {$count1{$kmer1} = 1}
               
            }

            $total1++;

        }
    }


}
close IN;


    print "kmer usage in intron 1:\n";
    foreach my $kmer (sort keys %count){
    my $frequency= $count{$kmer}/$total;
        printf "%s\t%d\t %.4f\n", $kmer, $count{$kmer}, $frequency;
    }

print "kmer usage in intron 2:\n";
    foreach my $kmer1 (sort keys %count1){
    my $frequency1= $count1{$kmer1}/$total1;
        printf "%s\t%d\n", $kmer1, $count1{$kmer1},$frequency1;
    }
Tell me the problems and how to solve it and how to modify the fasta file. Thanks.

Keith Bradnam

unread,
Oct 21, 2013, 11:45:00 AM10/21/13
to unix-and-perl-...@googlegroups.com
Hi Ema,

Your code does not contain any syntax errors and runs. If you provide a test sequence (always a good idea) it becomes easier to check the output. So if the first sequence in the file is:

>AT1G68260.1_i1_204_CDS
ACGTACGT

Then your script outputs the following (assuming k = 2):

kmer usage in intron 1:
AC 2 0.2500
CG 2 0.2500
GT 2 0.2500
TA 1 0.1250
kmer usage in intron 2:

So while the counts are correct, it looks like your frequencies have errors as they sum to more than 1.

Some comments on your code (which should help you find the problems):

  1. You extract $k from @ARGV, but not the filename. Ideally, you don't want to keep on accessing @ARGV throughout your script.
  2. Your code has two sets of variables (e.g. $kmer, $kmer1). These variable names are not ideal as they are potentially confusing (i.e. the variable name that includes a '1' is the variable that doesn't hold the 1st intron)
  3. You have no comments in your script. This always make it harder to follow (both for yourself and for others looking at your code).
  4. You read a second line of the file depending on whether the first line is a first intron or not. But you will always need to read a second line irrespective. So you could restructure this as 'read two lines from the file then process the output based on the contents of the first line'
  5. Likewise, your for loop that processes each sequence is written twice even though the code has to do the same thing no matter if it is a first or non-first intron. The only thing it has to do differently is work out which hash to add the details to.
  6. Your 'elsif' code is looking to see whether the intron might be a second intron. But really, this code should be looking for any non-first intron. I.e. this can be an else statement rather than elsif.
  7. You test whether the length of your kmer is equal to k, but this should always be true because you subtract length $k using the substr function. However, your problem is that the condition for when your for loop ends is wrong. If you count from 1-10 in groups of 4 bp, what is the last value of $i that you want to consider? In your code you just ask whether $i is less than the length of the sequence (so you're asking whether $i < 10). This will produce errors (which is probably why you are testing the length of $kmer later on.
Hope this helps.

daisy yu

unread,
Aug 7, 2014, 12:51:36 PM8/7/14
to unix-and-perl-...@googlegroups.com
Hi,

The last part of my code which calculates the logs odd ratio doesn't work. I get an error messages as follows: 

Use of uninitialized value $kmer in hash element at project5_kmer.pl line 69.

Use of uninitialized value $kmer in hash element at project5_kmer.pl line 69.

Use of uninitialized value in division (/) at project5_kmer.pl line 69.

Use of uninitialized value in division (/) at project5_kmer.pl line 69.

Illegal division by zero at project5_kmer.pl line 69.

Any hints or suggestions would be greatly appreciated. My code is: 

# perl cat intron_IME_data.fasta | sed 's/\(^>.*\)/!\1!/' | tr '\n' '@' | tr '!' '\n' | sed 's/@//g' > intron_IME_data_v2.fasta 
#!/usr/bin/perl  
use strict; use warnings;

my $k = $ARGV[1]; #kmer length
my (%frequency1, %frequency2);
my (%count1, %count2);
my $total1 = 0;
my $total2 = 0;
my $kmer = 0;
my $freq1 = 0;
my $freq2 = 0;

open(IN, "<$ARGV[0]") or die "error reading $ARGV[0] for reading";

my ($seq) = "";
my ($old) = "";
while (<IN>) { 
    my ($line) =  $_;
    ($seq) = $line;
    chomp($seq);
if ($seq =~ m/\S+\D+/ and $seq !~ m/>A*/){
print "sequences: $seq \n";
print "line1: $old \n";
for(my $i = 0; $i < length($seq)-$k; $i ++){
my $kmer = substr($seq, $i, $k);
if ($old =~ m/^>\S+i1\S+/){
                print "line1: $old \n";
                if (exists $count1{$kmer}) {
                $count1{$kmer}++;} 
                else {$count1{$kmer} = 1}
                $total1 ++;
}
elsif ($old =~ m/_i/){
print "line2: $old \n";
                if (exists $count2{$kmer}) {
                $count2{$kmer}++;} 
                else {$count2{$kmer} = 1}
                $total2 ++;
}
}
}
$old = $seq; #looks up one seq/line above in file

#while close
close IN;

print "intron1: kmer_seq, occurance, freq \n";
foreach $kmer(sort keys %count1){
$freq1 = $count1{$kmer}/$total1;
$frequency1{$kmer} = $freq1;
printf "%s\t%d\t %.4f\n ", $kmer, $count1{$kmer}, $freq1;
}    

print "intron2 & up: kmer_seq, occurance, freq \n";
foreach $kmer(sort keys %count2){
$freq2 = $count2{$kmer}/$total2;
$frequency2{$kmer} = $freq2;
printf "%s\t%d\t %.4f\n ", $kmer, $count2{$kmer}, $freq2;
}

#calculates and reports the log_odds ratio   
foreach my $kmer (sort keys %frequency1 and sort keys%frequency2){
printf "%s %.3f\n", $kmer, log($frequency1{$kmer} / $frequency2{$kmer}) / log(2);}

Keith Bradnam

unread,
Aug 7, 2014, 1:53:10 PM8/7/14
to unix-and-perl-...@googlegroups.com
Some general comments:

1. Assign everything out of @ARGV to named variables  as  soon as possible. $ARGV[0] is a meaningless variable name when you are several lines into your code as it tells you nothing about what it contains. E.g. you can have the following in your script:

my ($file, $k) = @ARGV;

2. You make $kmer a global variable and assign it a value of zero. You then use a completely different $kmer inside your while loop. In the subsequent two foreach loops you re-use the global $kmer, but in the subsequent foreach loop you switch to using a locally declared $kmer (foreach my $kmer…). Declare variables in the smallest scope possible. I.e. you don't need a global $kmer in this script.

3. $seq is declared globally but is only needed inside the while loop

4. You can declare empty variables with just my $var;. No need for my ($var) = "";. The parentheses means you are assigning to a list (of one element), not needed in this instance.

5. You read a line from a file and assign to the magic variable $_. You then reassign this to $line (as a list assignment) and then reassign $line to $seq. In this case you only need $line or $seq (not both), and can assign it more simply with while (my $line = <IN>){.

6. I'm not sure what you mean by your comment '#looks up one seq/line above in file'. At this point you are just overwriting $old with the current line. In this sense $previous_line may be a better variable name.

7. You need to add more comments. It is very hard to follow most of what happens in your while loop as there are no comments inside it (apart from the single one mentioned in my previous point).

8. When using increment operator (++), the preferred style is not to put a space before it.

9. You don't need to set hash values to 1 if they don't already exist. Perl can do something very magical. If you just use $hash{$key}++, then Perl will add 1 if a hash key already exists *or* create a value of 1 if it is the first time it sees that key.


Reply all
Reply to author
Forward
0 new messages