


Question

Your task is to write a GFF3 feature exporter.

A user should be able to run your script like this:

    $ export_gff3_feature.py --source_gff=/path/to/some.gff3 --type=gene --attribute=ID --value=YAR003W

There are 4 arguments here: the path to the GFF3 file, and three values to match against the GFF3 columns. In this case, your script should read the GFF3 file at the given path and find any gene (column 3) which has ID=YAR003W among its attributes (column 9).
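As a rough illustration of the command line and the matching step described above, here is a minimal standard-library sketch. The function names (build_parser, parse_attributes, find_features) are hypothetical, and the sketch assumes tab-separated feature lines with key=value pairs separated by semicolons in column 9. It is one possible approach, not a complete solution.

```python
import argparse
from urllib.parse import unquote


def build_parser():
    """Command-line interface with the four options named in the question."""
    parser = argparse.ArgumentParser(description="Export one GFF3 feature as FASTA")
    parser.add_argument("--source_gff", required=True, help="path to the GFF3 file")
    parser.add_argument("--type", required=True, help="feature type (column 3)")
    parser.add_argument("--attribute", required=True, help="attribute key (column 9)")
    parser.add_argument("--value", required=True, help="attribute value to match")
    return parser


def parse_attributes(column9):
    """Turn 'ID=YAR003W;Name=YAR003W' into a dict of raw (still escaped) values."""
    attributes = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def find_features(lines, feature_type, attribute, value):
    """Return the nine split columns of every feature line matching the query."""
    matches = []
    for line in lines:
        if line.startswith("##FASTA"):
            break  # annotation ends where the embedded sequence begins
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9 or columns[2] != feature_type:
            continue
        # An attribute may hold several comma-separated, percent-encoded values.
        raw = parse_attributes(columns[8]).get(attribute, "")
        if value in [unquote(item) for item in raw.split(",")]:
            matches.append(columns)
    return matches
```

Returning every match as a list, rather than stopping at the first hit, leaves the caller free to warn when the list is empty or holds more than one feature.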

When it finds this, it should use the coordinates for that feature (columns 4, 5 and 7) and the FASTA sequence at the end of the document to return its FASTA sequence. Your script should work regardless of the parameter values passed, warning the user if no features were found that matched their query. (It should also check and warn if more than one feature matches the query.)
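The coordinate step can be sketched as follows, assuming the sequences follow a ##FASTA directive at the end of the file, that coordinates are 1-based and inclusive, and that features on the minus strand should be reverse-complemented. The helper names are hypothetical, and sending warnings to STDERR (so that STDOUT carries only the FASTA record) is a design choice of this sketch, not something the question specifies.

```python
import sys

# Translation table for complementing DNA, including N and lowercase bases.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_embedded_fasta(lines):
    """Collect the sequences that follow the '##FASTA' directive, keyed by ID."""
    sequences, current, in_fasta = {}, None, False
    for line in lines:
        line = line.strip()
        if line.startswith("##FASTA"):
            in_fasta = True
        elif in_fasta and line.startswith(">"):
            # The ID is the first word of the header line.
            current = (line[1:].split() or [""])[0]
            sequences[current] = []
        elif in_fasta and current is not None and line:
            sequences[current].append(line)
    return {name: "".join(parts) for name, parts in sequences.items()}


def extract(sequence, start, end, strand):
    """Slice 1-based inclusive coordinates; reverse-complement '-' features."""
    piece = sequence[start - 1:end]
    if strand == "-":
        piece = piece.translate(COMPLEMENT)[::-1]
    return piece


def report_match_count(count, stream=sys.stderr):
    """Warn on zero or multiple matches; return True if there is anything to export."""
    if count == 0:
        print("Warning: no features matched the query", file=stream)
        return False
    if count > 1:
        print(f"Warning: {count} features matched the query", file=stream)
    return True
```

The sequence to slice is looked up by the feature's column 1 value, which should equal one of the IDs returned by read_embedded_fasta.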

The output should just be printed on STDOUT (no writing to a file is necessary). It should have a header which matches their query, like this:

>gene:ID:YAR003W
…. sequence here …
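A small sketch of building that record is shown here. The function name format_fasta and the 60-character line width are assumptions of this example; the question only fixes the header layout.

```python
def format_fasta(feature_type, attribute, value, sequence, width=60):
    """Build a FASTA record whose header echoes the query as type:attribute:value."""
    header = f">{feature_type}:{attribute}:{value}"
    # Break the sequence into fixed-width lines (the width is an arbitrary choice).
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + chunks)
```

Passing the result to print() writes the record to STDOUT, as the question requires.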

(Using the genome and annotation for Saccharomyces cerevisiae S288C.)

Explanation / Answer

The bcbio-gff package (imported as BCBio.GFF), which builds on Biopython, provides a full-featured GFF parser that handles several versions of GFF: GFF3, GFF2, and GTF. It supports writing GFF3, the latest version.

GFF parsing differs from parsing other file formats like GenBank or PDB in that it is not record-oriented. In a GenBank file, sequences are broken into discrete parts which can be parsed as a whole. In contrast, GFF is a line-oriented format with support for nesting features. GFF is also commonly used to store only biological features, and not the primary sequence.

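These differences have some consequences in how you will deal with GFF: nested features have to be linked to their parents as the file is read line by line, and the primary sequence often has to be matched to the features separately.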

The documentation below provides a practical guide to examining, parsing and writing GFF files in Python.

Examining your GFF file

Since GFF is a very general format, it is extremely useful to start by getting a sense of the type of data in the file and how it is structured. GFFExaminer provides an interface to examine and query the file. To examine relationships between features, examine a dictionary mapping parent to child features:
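To make the idea of a parent-to-child mapping concrete, here is a standard-library approximation. It is not the GFFExaminer API: the function name parent_child_types is hypothetical, and the sketch assumes GFF3-style ID and Parent attributes in column 9.

```python
from collections import defaultdict


def _attributes(column9):
    """Split 'ID=x;Parent=y' into a dict (no percent-decoding in this sketch)."""
    return dict(pair.split("=", 1) for pair in column9.split(";") if "=" in pair)


def parent_child_types(lines):
    """Map each parent (source, type) to the sorted (source, type) pairs of its children."""
    kinds = {}   # feature ID -> (source, type)
    links = []   # (parent ID, child (source, type))
    for line in lines:
        if line.startswith("##FASTA"):
            break  # no features after the embedded sequence starts
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue
        kind = (columns[1], columns[2])
        attrs = _attributes(columns[8])
        if "ID" in attrs:
            kinds[attrs["ID"]] = kind
        # A feature may list several parents separated by commas.
        for parent_id in attrs.get("Parent", "").split(","):
            if parent_id:
                links.append((parent_id, kind))
    # Resolve links after the full pass, so children may precede their parents.
    mapping = defaultdict(set)
    for parent_id, child_kind in links:
        if parent_id in kinds:
            mapping[kinds[parent_id]].add(child_kind)
    return {parent: sorted(children) for parent, children in mapping.items()}
```

Calling parent_child_types(open(path)) on an annotation file returns a plain dictionary that can be pretty-printed to see which feature types nest under which.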

The example file contains a flexible three-level description of coding sequences: genes have mRNA transcripts; those mRNA transcripts each contain the common features of a coding sequence, namely the CDS itself, exons, introns, and 5’ and 3’ untranslated regions. This is a common GFF structure that allows multiple transcripts to be represented.

Another item of interest for designing your parse strategy is understanding the various tags used to label the features. These include the sequence identifier (column 1), the source (column 2), and the feature type (column 3).

The available_limits function in the examiner gives you a high level summary of these feature attributes, along with counts for the number of times they appear in the file:
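The kind of summary described here can be approximated with the standard library. The sketch below is not the examiner's implementation: the function name summarize_limits and the dictionary keys are hypothetical, chosen only to show per-column counts.

```python
from collections import Counter


def summarize_limits(lines):
    """Count how often each sequence ID, source, and type occurs in the feature lines."""
    counts = {"seqid": Counter(), "source": Counter(), "type": Counter()}
    for line in lines:
        if line.startswith("##FASTA"):
            break  # stop before the embedded sequence
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue
        counts["seqid"][columns[0]] += 1   # column 1
        counts["source"][columns[1]] += 1  # column 2
        counts["type"][columns[2]] += 1    # column 3
    return counts
```

Looking at counts["type"] first is a quick way to check that the value passed as --type actually occurs in the file.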