Question

PERL SCRIPT PROGRAMMING

A researcher has a file containing information about the number of times particular k-mers (peptide sequences of length k, derived from actual protein sequences) occur in the human proteome. The information for each k-mer is on one line of the file and is divided into columns. The first column is the position of the start of the k-mer in its source protein. The next column is the k-mer itself. Then come two counts: the number of times that the k-mer occurs in the human proteome, and the number of proteins in the human proteome which contain the k-mer. The columns are delimited by tab characters. For example, a portion of the data file might look like:

The researcher is interested in those k-mers for which the counts in the last two columns are both 0; i.e. the researcher is interested in k-mers which do not occur in the human proteome. For instance, given the data above, the researcher would be interested in being informed of the k-mer IDTLQ.

Write a Perl script that will output, on the standard output, the k-mers that do not occur in the human proteome assuming input as described above. Each k-mer is to be on a separate line. The script is to read from standard input. Assume that the input file contains nothing other than lines of k-mer information.

Hint: Use the pattern-extraction facilities of Perl.

Your scripts should be independent of the value of k (provided, of course, that k ≥ 1). That is, your scripts should work for data files of k-mers of any size. Further, k should not be a parameter to your scripts.

Note: this must be a Perl script, not a shell script.

Explanation / Answer

The input file is as below:

Note: The entries on each line are shown here separated by a single space; in the file described in the question they are separated by tab characters.

data.txt

110 DPRRR 18 18
111 PRRRS 58 54
112 RRRSR 173 112
113 RRSRN 12 12
114 RSRNL 13 13
115 SRNLG 14 14
116 RNLGK 22 22
117 NLGKV 9 9
118 LGKVI 23 23
119 GKVID 19 19
120 KVIDT 12 12
121 VIDTL 4 4
122 IDTLQ 0 0
123 DTLQE 4 3

count.pl

use strict;
use warnings;

# Read the k-mer information from standard input, one record per line,
# as the question requires.
while (my $row = <STDIN>) {
    # Strip the newline character at the end of the line.
    chomp $row;

    # Convert the line to an array of columns. Splitting on ' ' breaks the
    # line on any run of whitespace, so tabs and single spaces both work.
    my @words = split ' ', $row;

    # Skip blank or incomplete lines.
    next unless @words >= 4;

    # Store the 3rd column (occurrence count) and 4th column (protein count).
    my $kmerCount    = $words[2];
    my $proteinCount = $words[3];

    # Check whether the 3rd and 4th columns are both zero.
    # If so, print the k-mer on its own line.
    if ($kmerCount == 0 && $proteinCount == 0) {
        print "$words[1]\n";
    }
}

OUTPUT:

>perl count.pl < data.txt
IDTLQ
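
The hint in the question points at Perl's pattern-extraction facilities. As an alternative to split, a minimal sketch using a regular-expression capture might look like the following. The helper name absent_kmer is made up for illustration, and the pattern assumes that the columns are separated by whitespace (tabs or spaces) and that a zero count is written as a plain 0. Because the k-mer is captured with \S+, the script does not depend on the value of k.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original answer): returns the k-mer
# if both counts on the line are 0, otherwise returns undef.
sub absent_kmer {
    my ($line) = @_;
    # position, k-mer (captured), occurrence count 0, protein count 0
    return $line =~ /^\s*\d+\s+(\S+)\s+0\s+0\s*$/ ? $1 : undef;
}

# Read records from standard input and print each qualifying k-mer
# on its own line.
while (my $line = <STDIN>) {
    my $kmer = absent_kmer($line);
    print "$kmer\n" if defined $kmer;
}
```

It is run the same way as the script above, with the data file redirected to standard input.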