Question
Project 3: Open Reading Frame Locator
Due: At the beginning of lab in week 8
Objective:
In this project, you will gain practical understanding of the branching constructs, such as the if construct, logical and relational operations, and loops. You will further reinforce the concepts of data I/O, including loading data files and inputting data at the command prompt, with the requirement of some user entry validation.
Background
One of the most important and highly regulated functions of every eukaryotic cell is the transcription of DNA in the nucleus. During transcription, the code to produce a protein that is initially described by a DNA sequence is rewritten in RNA so that it can be transported out of the nucleus and eventually interpreted, or translated into a corresponding amino acid sequence that makes up a useful protein. Genes, which are segments of DNA that code for individual proteins, are scattered throughout the chromosomes, with hundreds or thousands of genes located on each chromosome. Proteins are translated from the information in genes with the help of the genetic code. Here, bases (or nucleotides) are read three-at-a-time (these triplets are called codons), and, for each codon, the corresponding amino acid from the genetic code is added to the growing protein (see fig. 1).
Figure 1: Genetic Code
There are four special codons represented in the genetic code. The first is ATG. This codon is known as the start codon; it is necessary for translation of a protein to begin. It also causes the amino acid MET (methionine) to be placed as the first amino acid in each protein. Following the start codon, codons are read and translated until a stop codon is reached, at which point translation is terminated. As shown in fig. 1, the stop codons are TAA, TGA and TAG.
Since translation occurs three bases at a time, there are three different reading frames corresponding to a given segment of DNA. Because of this, what might be a start or stop codon in one reading frame is not necessarily a start or stop codon in another reading frame. Codons in the same reading frame are said to be in-frame with each other.
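To make the idea of reading frames concrete, here is a minimal sketch that splits one short sequence into codons for each of the three frames. The 8-base sequence and the variable names (seq, frames, codons) are made up for illustration and are not part of the assignment.

```matlab
seq = 'CATGATAG';                 % hypothetical 8-base sequence
frames = cell(1, 3);              % one codon listing per reading frame
for f = 1:3
    % Number of complete codons available when reading starts at base f
    nCodons = floor((numel(seq) - f + 1) / 3);
    % Reshape the usable bases so that each row holds one codon
    codons = reshape(seq(f : f + 3*nCodons - 1), 3, []).';
    frames{f} = strjoin(cellstr(codons).', ' ');
    fprintf('Frame %d: %s\n', f, frames{f});
end
```

In this example ATG is read as a codon only in the second frame, while TGA and TAG are read as codons only in the third, so neither stop codon is in-frame with that ATG.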
As researchers attempt to identify new genes, one way to search for potential genes is to search for open reading frames (ORFs), or long stretches of codons with no stop codons. To illustrate what does and does not constitute an open reading frame, see the examples in fig. 2. For this illustration, the minimum acceptable length for an open reading frame is 20 bases.
Figure 2: Illustration of Open Reading Frame (ORF) examples when minimum ORF length is 20 bases.
The main goal of this project is to write a program that identifies open reading frames from long sequences of DNA bases provided via input files. Your program will have much of the same functionality as the Open Reading Frame Finder web tool available to researchers through the National Library of Medicine website (http://www.ncbi.nlm.nih.gov/projects/gorf/orfig.cgi). To familiarize yourself with this tool, download the file ras.txt from Blackboard, and copy and paste the DNA sequence from the file into the FASTA format input box on the webpage. Finally, click OrfFind. The output should match that shown in fig. 3.
Figure 3: ORF Finder Output for a human ras gene
The output on the right side of fig. 3 shows the identified open reading frames that are at least 100 bases in length. Note that we are only interested in the frames identified with positive numbers (+1, +2 or +3) here. The negative frames correspond to the DNA strand that would be complementary to the provided input. Here, there are three open reading frames of interest, starting at bases 905, 2203 and 692, respectively. If you click the blue box next to each, the DNA and corresponding amino acid sequences appear. You will notice that each begins with ATG and ends with one of the three stop codons.
Project Description
Your job is to write a user-friendly open reading frame (ORF) locator that meets the following specifications. Your program should be named ORFfinder_yourlastname.m.
Program Specifications
Your program must be able to load input data files in .txt format that contain DNA base data in FASTA format (for an example of the desired format, see the provided ras.txt file).
To load the data into the program, ask the user to input a file name and load a file with that name into the MATLAB workspace. To load the data, use the textread command (load will not work in this case as you have a character array).
Sample usage of textread:
>> Data=textread('datafile.txt','%c')
You are not required to validate the user entry for the file name.
The program should ask the user to enter a minimum ORF size to identify. If the user enters a string (with single quotes) OR a number that is less than 50 or greater than the size of the sequence, the program should warn the user of an invalid entry and provide another opportunity for input. This should happen perpetually as long as the user keeps entering these types of invalid entries.
Note that MATLAB will automatically display an error and provide another opportunity for input for most cases in which the user enters a non-numeric input without single quotes (because it will be assumed to be a variable, which probably does not exist), so you need not worry about handling such cases.
If you cannot get this part to work, you may skip the validation of user entry, but you will receive a deduction of 7 points from your grade for doing so.
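One possible shape for the validation loop described above is sketched here. It is only an illustration: seqLength, isValid, and entries are hypothetical names, and the entries cell array stands in for successive results of the input command so that the sketch runs without a keyboard.

```matlab
seqLength = 3000;                 % hypothetical; the real program would use the loaded sequence length
% An entry is valid only if it is a single number from 50 to seqLength
isValid = @(x) isnumeric(x) && isscalar(x) && x >= 50 && x <= seqLength;

entries = {'abc', 20, 5000, 100}; % stand-ins for what the user might type
k = 1;
minLen = entries{k};              % real program: minLen = input('Enter minimum ORF size: ');
while ~isValid(minLen)
    fprintf('Invalid entry. Enter a number from 50 to %d.\n', seqLength);
    k = k + 1;
    minLen = entries{k};          % real program: ask again with input(...)
end
fprintf('Minimum ORF size accepted: %d\n', minLen);
```

The while loop keeps asking for as long as the entry is invalid, which matches the "perpetually" requirement. Here the first three stand-in entries are rejected and the fourth (100) is accepted.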
Next, the program should identify ALL ORF’s in the DNA sequence that meet the following criteria:
Begins with the start codon.
Ends with a stop codon in the same reading frame as the start codon.
Has a length (in bases) of at least the number specified by the user in part (2).
Has no intervening in-frame stop codons (i.e., the stop codon at the end of the ORF is the only one in-frame within the ORF).
The result of this part of your program should be a two-column matrix, where each row corresponds to an ORF that meets the criteria above. The first column should contain the starting position of the ORF, and the second column should be the total number of bases in the ORF (including the start and stop codon bases). Note: see suggestions below for hints on how you might approach your search.
[Note: This step corresponds to Project 3a; it should only require editing of the code you wrote for menu implementation, although the “response” to a menu selection will be different] Unless there were no ORF’s found (in which case your program should provide a statement that no ORF was found, and then end), your program must then perpetually provide a menu like that below to the user until the user enters the selection to exit. Note that the entries in the menu should correspond to the ORF’s that your program identified, so the number of items in the menu will depend on how many ORF’s were identified.
Choose which ORF to view:
(1) ORF #1 (start: 692, length: 159)
(2) ORF #2 (start: 905, length: 306)
(3) ORF #3 (start: 2203, length: 240)
(4) Exit
Enter selection:
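Because the number of menu items depends on how many ORFs were found, the menu text can be generated in a loop over the rows of the ORF matrix. The following is a minimal sketch under that assumption. The matrix orfs is filled with the values from the sample menu above, and menuLines is a hypothetical name.

```matlab
orfs = [692 159; 905 306; 2203 240];   % [start, length] rows taken from the sample menu
nOrf = size(orfs, 1);
menuLines = cell(nOrf + 1, 1);
for k = 1:nOrf
    menuLines{k} = sprintf('(%d) ORF #%d (start: %d, length: %d)', ...
                           k, k, orfs(k, 1), orfs(k, 2));
end
menuLines{nOrf + 1} = sprintf('(%d) Exit', nOrf + 1);   % Exit is always the last item

fprintf('Choose which ORF to view:\n');
fprintf('%s\n', menuLines{:});
```

With this layout the Exit selection is always size(orfs, 1) + 1, so a single loop can handle any number of ORFs.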
Once the user makes a selection (other than the Exit selection), the program should display the sequence of the entire ORF, broken up into codons, in the following format:
DNA Sequence for ORF #X:
ABC ABC ABC ABC ABC
ABC ABC ABC ABC ABC
ABC ABC ABC ABC ABC
ABC ABC ABC ABC ABC
…,
where X is the ORF number from the menu and the A,B,C’s are replaced with the actual bases. Hint: if you are having difficulty achieving this formatting for the bases, see what happens when you type the following two commands:
>> a = 'abcdefghijklmnopqrstuvwxyz';
>> fprintf('%c%c ',a);
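The hint relies on fprintf reusing its format string until every character of the input has been consumed. As a hedged illustration of how that could be extended to codons, the sketch below prints a made-up 30-base sequence five codons per line. The sequence and the names orf, fmt, and txt are hypothetical.

```matlab
orf = 'ATGAAATTTGGGCCCAAATTTGGGCCCTAA';            % hypothetical 30-base ORF (10 codons)
% Five groups of three characters per line, separated by single spaces
fmt = [repmat('%c%c%c ', 1, 4), '%c%c%c\n'];
txt = sprintf(fmt, orf);                           % format repeats until all bases are used
fprintf('DNA Sequence for ORF #%d:\n', 1);
fprintf('%s', txt);
```

This example was chosen so that the codon count is a multiple of five. When it is not, the last line is likely to end partway through the format, so a final fprintf('\n') may be needed.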
The program should then return to the menu.
Once the user selects to exit, the program should end.
User entry validation is not required for the menu selection; however, extra credit will be awarded, as described in project 3a for this menu implementation.
ORF Search Strategy
While there are many ways to approach the search for ORF’s in a DNA sequence, outlined here is one possible approach. You may use parts or all of the methods described here in your own approach. Note: this approach is somewhat sophisticated and efficient. If you are having trouble implementing it, you are encouraged to attempt other methods. You should not expect your instructor to walk you through each step of this method.
Once the test string has been input and the user has entered the minimum ORF length, the strfind command in MATLAB can be used to find the locations of the start codon and each of the three stop codons in the test string. The locations of the three types of stop codons can all be put together into one vector.
After initializing an array to contain the ORF data (creating an empty array to start), a loop can be created to go through each of the elements in the start codon location vector. For each start codon, a logical array can be set up, with one element per entry in the stop codon location vector, that identifies which of the stop codon locations are in the same reading frame as the start codon being considered (hint: consider using the mod command). Similarly, another logical array can be set up that identifies which stop codon locations come after the start codon being considered. A final logical vector can be created that combines the information from the first two to identify which stop codon locations meet both criteria. This logical vector can then be used as an index with the stop codon location vector to identify the actual stop codon locations that meet the criteria, and the minimum of these can be identified. If a minimum was found (hint: consider using the isempty command to make this determination), then the distance between the start and stop codons can be checked. If it would result in an ORF that meets or exceeds the minimum size specified by the user, then the array containing the ORF data can have the relevant information (start location and length) appended as a new row.
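The strategy above can be sketched roughly as follows. This is one possible reading of it, not a definitive implementation. The test sequence and the minimum length of 9 are made up so the result can be checked by hand (the assignment itself requires a minimum of at least 50), and all variable names are hypothetical.

```matlab
seq = 'CCATGAAATTTGGGTAACCATGCCCTGA';   % hypothetical test sequence, not from ras.txt
minLen = 9;                             % hypothetical minimum ORF length for this small example

seq = upper(seq(:).');                  % force an uppercase row vector before using strfind
starts = strfind(seq, 'ATG');
stops  = sort([strfind(seq, 'TAA'), strfind(seq, 'TGA'), strfind(seq, 'TAG')]);

orfs = zeros(0, 2);                     % columns: start position, length in bases
for k = 1:numel(starts)
    s = starts(k);
    inFrame = mod(stops - s, 3) == 0;   % stops in the same reading frame as this start
    after   = stops > s;                % stops located after this start
    candidates = stops(inFrame & after);
    if ~isempty(candidates)
        stopPos = min(candidates);      % the first in-frame stop ends the ORF
        len = stopPos + 3 - s;          % counts both the start and stop codon bases
        if len >= minLen
            orfs(end + 1, :) = [s, len];
        end
    end
end
disp(orfs)
```

For this test sequence the sketch finds two ORFs: one starting at base 3 with length 15 and one starting at base 20 with length 9. Note that it records a row for every start codon that qualifies, including start codons that lie inside a longer ORF.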
What to Submit?
Please compose a mini-report as follows. This should all be put together as a single Word document named ORFfinder_yourlastname.docx.
Include a cover page with standard information (name of class, your name, date, name of project, etc.). Also cut and paste (or retype) the following disclosure:
I affirm, in accordance with MSOE’s Policy on Student Integrity, that while completing this project, I wrote and tested my code independently. I did not receive code from other students or from the internet. I used the following resources, and only these resources, while completing the assignment (the course textbook and Dr. LaMack are the only resources that need not be listed):
Resource (e.g., student name, etc.) Nature of Use (e.g., discussed approaches for part X, etc.)
Signed: [provide electronic signature] Date:
A statement of whether or not your program performs the basic requirements described above.
A description of how you tested your program. This should include:
Demonstration that the program provides the correct interface with the user. Include screen shots of the program as it is run, showing user entry, menu selection, and output display. Show what happens when an invalid entry is attempted for the minimum ORF length to show that your user entry validation works correctly. If you included user input validation for the menu selection, explain to what capacity it validates entry and give an example of how your program responds.
Demonstration that the program output for the ras.txt file is correct. Note that this data file contains the DNA sequence for the following gene:
Human SK2 c-Ha-ras-1 oncogene-encoded protein gene, exon 1
GenBank: M30539.1
Compare your program’s output with that obtained when you run the ORF finder web tool (http://www.ncbi.nlm.nih.gov/projects/gorf/orfig.cgi) on this sequence (see fig. 3). Note that the web tool does allow you to select among a few choices for minimum ORF length.
Demonstrate that your program produces correct output for the DNA sequence of one other human gene. You can search for gene sequences on the following website: http://www.ncbi.nlm.nih.gov/nuccore. You might type in keywords of genetic diseases to find an interesting gene. Once you find an entry for a new gene, click on FASTA to get it displayed in a useful format (see fig. 4). Then, highlight the DNA sequence and copy and paste it into either the ORF finder web tool or into a Notepad document (which can be saved as a .txt file and then used as an input file into your program). In your report, identify the gene you select (with its name and GenBank number as above). Be sure to select a gene that yields at least one ORF. You may notice that your program produces more possible ORF’s than the web tool for some genes. If you encounter this, see if you can figure out why (hint: it may have to do with whether or not extra start codons are allowed in the ORF).
Figure 4: Sample gene search output with the FASTA option boxed.
If your program did not work, provide an explanation of what did not work and how you attempted to fix it. Note: it is in your best interest to test each major part of your program independently to get it to work. For example, if you cannot get the ORF locator part of your program to work, you could temporarily comment out that section of the program (if it produces errors) and then make up a matrix of test ORF location data to proceed with the menu and display. Please see me if you would like assistance in isolating the parts of your program.
Your code provided as an appendix (copy and paste it into your Word document).
E-mail me your program prior to the beginning of lab in week 8.
Submit a hard copy of your mini-report at the beginning of lab in week 8.
The grading rubric to be used to grade your submissions is attached. Please see the final page for descriptions of how to earn the final 5 points and up to +10 extra credit points.
Lab 3 Grading Rubric
Below is the scoring that will be used for Lab 3. The scale is a 5 point scale with 5 being superior, 4 being satisfactory, 3 being average, 2 being unsatisfactory and 1 being not undertaken. Some items have been scaled to higher values.
Program Documentation
A header was included with the correct format, which included an adequate statement of purpose, record of revisions, and list of important variables. 1 2 3 4 5
Comments adequately described the logic used in the program. 2 4 6 8 10
Program Execution
User is correctly prompted for a file name, and the corresponding data file is read in correctly. 1 2 3 4 5
Minimum ORF size is correctly obtained from the user, with entry validation. 2 4 6 8 10
ORF’s that meet the required criteria are correctly identified and assembled into an array. 5 10 15 20 25
The ORF menu is correctly implemented, allowing users to continue choosing several options until they choose to exit. 2 4 6 8 10
ORF’s are correctly displayed if the user chooses to view one. 2 4 6 8 10
Mini-Report
Correct and complete cover page. 1 2 3 4 5
Program testing is complete and described adequately for the ras gene and one other human gene. 2 4 6 8 10
A description of what worked and what didn’t work is complete and accurate. 1 2 3 4 5
Point Subtotal: _____________ (out of 95; note opportunity for up to +15 described on next page)
(continued on next page)
Adjustments:
Extraneous data was output to the command window (up to -3):
Did not use specified file names (-3):
Did not e-mail program file appropriately (-3):
Did not submit printed copy of report--only e-mailed submission (-3):
Did not include m-file as appendix (-3):
Late penalty (-15 per day):
Additional user entry validation included (up to +5):
Additional clever and useful features added to the program (up to +5):
Exceptional efficiency in programming approach (up to +5):
Final Grade:
Additional Comments:
Here is the code for part 3(a); you will need this part for section 4.
clear all;
close all;
clc;
% Uses while loops to show a menu with the number of options requested
% by the user (between 1 and 4, inclusive) until Exit is selected.
% variables:
% numoptions: number of menu options requested by the user
% i: the user's menu selection
numoptions = input('Enter a Number between 1 and 4 (1-4):');
% Menu with one option plus Exit
while (numoptions == 1)
    fprintf('(1) Option 1.\n');
    fprintf('(2) Exit.\n');
    i = input('Enter selection:');
    if (i == 1)
        fprintf('Option 1 was chosen.\n');
    elseif (i == 2)
        break
    else
        fprintf('Please choose a number between 1 and 2.\n');
    end
end
% Menu with two options plus Exit
while (numoptions == 2)
    fprintf('(1) Option 1.\n');
    fprintf('(2) Option 2.\n');
    fprintf('(3) Exit.\n');
    i = input('Enter selection:');
    if (i == 2)
        fprintf('Option 2 was chosen.\n');
    elseif (i == 1)
        fprintf('Option 1 was chosen.\n');
    elseif (i == 3)
        break
    else
        fprintf('Please choose a number between 1 and 3.\n');
    end
end
% Menu with three options plus Exit
while (numoptions == 3)
    fprintf('(1) Option 1.\n');
    fprintf('(2) Option 2.\n');
    fprintf('(3) Option 3.\n');
    fprintf('(4) Exit.\n');
    i = input('Enter selection:');
    if (i == 3)
        fprintf('Option 3 was chosen.\n');
    elseif (i == 1)
        fprintf('Option 1 was chosen.\n');
    elseif (i == 2)
        fprintf('Option 2 was chosen.\n');
    elseif (i == 4)
        break
    else
        fprintf('Please choose a number between 1 and 4.\n');
    end
end
% Menu with four options plus Exit
while (numoptions == 4)
    fprintf('(1) Option 1.\n');
    fprintf('(2) Option 2.\n');
    fprintf('(3) Option 3.\n');
    fprintf('(4) Option 4.\n');
    fprintf('(5) Exit.\n');
    i = input('Enter selection:');
    if (i == 4)
        fprintf('Option 4 was chosen.\n');
    elseif (i == 1)
        fprintf('Option 1 was chosen.\n');
    elseif (i == 2)
        fprintf('Option 2 was chosen.\n');
    elseif (i == 3)
        fprintf('Option 3 was chosen.\n');
    elseif (i == 5)
        break
    else
        fprintf('Please choose a number between 1 and 5.\n');
    end
end
Explanation / Answer
Solution:
>> filename=get(hObject,'String')
>> evalin('base','load(fileName{1})')
as well as:
>> fileName = inputdlg('Enter file name:');
>> evalin('base','load(fileName{1})')
I get the error that fileName does not exist.