Question


Galaxy is an open-source platform for analysis of Next Generation Sequencing (NGS) data. Navigate to the Galaxy website at https://usegalaxy.org/. Briefly summarize each of the following functions in Galaxy. Be sure to explain how these functions allow for the analysis of NGS data: Bowtie, FastQC, TopHat, Cuffdiff/Cufflinks/Cuffmerge/CummeRbund. Compare FASTQ files (Wiki page: https://en.wikipedia.org/wiki/FASTQ_format) to FASTA files. What are the differences in the information they contain? What are the differences in their utility? Why is it important to include quality control data in the output of gene sequencing data?

Explanation / Answer

Next generation sequencing (NGS) is extremely high throughput, generating orders of magnitude more data than traditional Sanger sequencing. This is made possible by generating millions of sequence clusters in parallel and reading the sequences of all of these clusters base by base, through repeated cycles of nucleotide incorporation, fluorescence reading, and dye cleavage.

Bowtie is an ultrafast, memory-efficient short read aligner geared toward quickly aligning large sets of short DNA sequences (reads) to large genomes. It aligns 35-base-pair reads to the human genome at a rate of 25 million reads per hour on a typical workstation.

The purpose of alignment is to determine reads’ point of origin with respect to the reference genome. Once points of origin are identified, downstream tools use that information, for example, to characterize differences between the subject and reference genome (e.g. when calling SNPs), or to relate the reads to annotations defined with respect to the reference genome (e.g. for digital gene expression). Alignment programs, together with appropriate reference sequences, serve this purpose because genomes of individuals of the same species tend to be highly similar. For example, two humans typically have on the order of 3–4 million single-nucleotide differences between them out of a total of 3 billion bases.
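To make "point of origin" concrete, here is a minimal sketch of read alignment. It is a brute-force scan that tolerates a small number of mismatches, and it is not how Bowtie works internally (Bowtie relies on an index of the reference to avoid scanning every position). The reference string, the reads, and the function names are made up for illustration.

```python
def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def align_read(read, reference, max_mismatches=1):
    """Return (offset, mismatches) for the best placement of `read`, or None.

    Brute-force scan over every offset in the reference; real aligners
    use an index instead of trying each position.
    """
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = hamming(read, window)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (offset, mismatches)
    return best


# Toy data (hypothetical): a 21-base "genome" and three short reads.
reference = "ACGTTGCATGACCGTAGGCTA"
reads = ["TGCATG", "CCGAAG", "GGGGGG"]

for read in reads:
    print(read, align_read(read, reference))
# TGCATG (4, 0)   exact match at offset 4
# CCGAAG (11, 1)  one mismatch, the kind of difference a SNP caller looks at
# GGGGGG None     no placement within the mismatch budget
```

The second read shows why alignment feeds SNP calling: the read is placed at offset 11 even though one base differs from the reference, and that difference is what downstream tools examine.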

FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.

The main functions of FastQC are:

· Import of data from BAM, SAM or FastQ files (any variant)
· Providing a quick overview to tell you in which areas there may be problems
· Summary graphs and tables to quickly assess your data
· Export of results to an HTML based permanent report
· Offline operation to allow automated generation of reports without running the interactive application

The FastQC project's own documentation is the best place to look for details on each analysis module; it is very good.
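As a rough idea of the kind of summary a quality control tool computes, the sketch below averages quality scores at each read position across several reads. It is not FastQC's code. It assumes the quality strings use the Phred+33 encoding (one of several FASTQ variants), and the example strings are invented.

```python
def phred_scores(quality_line, offset=33):
    """Decode one FASTQ quality string into integer Phred scores.

    Assumes Phred+33 encoding unless another offset is passed.
    """
    return [ord(ch) - offset for ch in quality_line]


def per_position_mean_quality(quality_lines, offset=33):
    """Mean Phred score at each read position; reads may differ in length."""
    totals, counts = [], []
    for line in quality_lines:
        for i, score in enumerate(phred_scores(line, offset)):
            if i == len(totals):  # first read long enough to reach position i
                totals.append(0)
                counts.append(0)
            totals[i] += score
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]


# Hypothetical quality strings: 'I' decodes to 40, '5' decodes to 20.
quals = ["IIII", "II5", "I5"]
means = per_position_mean_quality(quals)
print([round(m, 2) for m in means])  # [40.0, 33.33, 30.0, 40.0]
```

A drop in the mean toward the end of the reads is the sort of pattern that would suggest trimming before alignment.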

TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons. TopHat has entered a low maintenance, low support stage, as it is now largely superseded by HISAT2, which provides the same core functionality (i.e. spliced alignment of RNA-Seq reads) in a more accurate and much more efficient way. One of its later releases improved the detection of linker options for the Boost::Thread library, a problem that had prevented TopHat from building from source on some systems.
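The idea of a spliced alignment can be illustrated with a toy example: an RNA-Seq read that spans an exon-exon junction does not match the genome as one contiguous piece, but its two halves match on either side of an intron. The sketch below is only an illustration of that idea under simplifying assumptions (exact matching, a single junction, made-up sequences); it is not TopHat's algorithm.

```python
def spliced_align(read, genome, min_anchor=3):
    """Place `read` as two exact pieces separated by a gap (the intron).

    Returns (left_start, split, right_start) or None, where read[:split]
    matches at left_start and read[split:] matches at right_start.
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_start = genome.find(left)
        while left_start != -1:
            # The right piece must begin at least one base after the left piece ends.
            right_start = genome.find(right, left_start + split + 1)
            if right_start != -1:
                return left_start, split, right_start
            left_start = genome.find(left, left_start + 1)
    return None


# Hypothetical genome: exon "ACGTAC", intron "GTTTTTAG", exon "CATGCA".
genome = "ACGTAC" + "GTTTTTAG" + "CATGCA"
read = "GTAC" + "CATG"  # end of the first exon joined to the start of the second

hit = spliced_align(read, genome)
print(hit)  # (2, 4, 14)
left_start, split, right_start = hit
print(genome[left_start + split:right_start])  # GTTTTTAG, the skipped intron
```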

Cufflinks includes a program, “Cuffdiff”, that you can use to find significant changes in transcript expression, splicing, and promoter use. From the command line, run cuffdiff as follows:

cuffdiff [options]* <transcripts.gtf>
    <sample1_replicate1.sam[,…,sample1_replicateM.sam]>
    <sample2_replicate1.sam[,…,sample2_replicateM.sam]> …
    [sampleN_replicate1.sam[,…,sampleN_replicateM.sam]]
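The usage pattern above (one argument per sample, with that sample's replicates joined by commas) can be sketched as a small helper that assembles the argument list. The helper name and all file names are hypothetical, and the sketch only builds the command; it does not run cuffdiff.

```python
def build_cuffdiff_command(transcripts_gtf, samples, options=()):
    """Build a cuffdiff argument list following the usage line above.

    `samples` is a list of lists: one inner list of SAM paths per sample,
    holding that sample's replicates. `options` is passed through untouched.
    """
    if len(samples) < 2:
        raise ValueError("a differential comparison needs at least two samples")
    command = ["cuffdiff", *options, transcripts_gtf]
    for replicates in samples:
        # Replicates of one sample are comma-separated within a single argument.
        command.append(",".join(replicates))
    return command


# Hypothetical file names: two samples, the first with two replicates.
command = build_cuffdiff_command(
    "transcripts.gtf",
    [["control_rep1.sam", "control_rep2.sam"], ["treated_rep1.sam"]],
)
print(" ".join(command))
# cuffdiff transcripts.gtf control_rep1.sam,control_rep2.sam treated_rep1.sam
```

The resulting list could be handed to `subprocess.run` on a machine where cuffdiff is installed.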

cummeRbund is a visualization package for Cufflinks high-throughput sequencing data. It is designed to help you navigate through the large amount of data produced from a Cuffdiff RNA-Seq differential expression analysis. CummeRbund begins by re-organizing output files of a cuffdiff analysis, and storing these data in a local SQLite database. CummeRbund indexes the data to speed up access to specific feature data (genes, isoforms, TSS, CDS, etc.), and preserves the various relationships between these features. Access to data elements is managed via the RSQLite package and data are presented in appropriately structured R classes with various convenience functions designed to streamline your workflow. This persistent database storage means that inter-connected expression values are rapidly accessible and quickly searchable in future analyses.

FASTA & FASTQ files:

Next-generation sequencing machines usually produce FASTA or FASTQ files containing many short-read sequences (with per-base quality information in the case of FASTQ).

The main processing step for such FASTA/FASTQ files is mapping (also called aligning) the sequences to reference genomes or other databases using specialized programs. Examples of such mapping programs are Blat, SHRiMP, LastZ, MAQ, and many others.

However, it is sometimes more productive to preprocess the FASTA/FASTQ files before mapping the sequences to the genome, manipulating the sequences to produce better mapping results.
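One common kind of preprocessing is trimming low-quality bases from the 3' end of each read. The sketch below shows the idea in its simplest form; it assumes Phred+33 quality encoding and a cutoff of 20, both chosen for illustration, and dedicated trimming tools use more refined rules.

```python
def trim_3prime(sequence, quality, min_quality=20, offset=33):
    """Drop trailing bases whose Phred score is below `min_quality`.

    Assumes `quality` is a Phred+33 string of the same length as `sequence`.
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


# Hypothetical read: the last two bases have scores 2 ('#') and 0 ('!').
trimmed = trim_3prime("ACGTACGT", "IIIII5#!")
print(trimmed)  # ('ACGTAC', 'IIIII5')
```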

FASTQ has emerged as a de facto common file format for exchanging sequencing read data between tools, despite lacking any formal definition to date and existing in at least three incompatible variants.

It is a simple extension of the FASTA format: in addition to the sequence itself, it stores a numeric quality score associated with each nucleotide in the sequence.
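The relationship between the two formats can be seen by converting one to the other. The sketch below reads FASTQ records and writes them as FASTA, which simply discards the quality line. It assumes the common layout of four lines per record with unwrapped sequences and the Phred+33 variant for decoding; the two records are invented.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream.

    Stops at end of input (or at the first empty header line).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], sequence, quality


def to_fasta(records):
    """Render records as FASTA text; the quality strings are dropped."""
    return "".join(f">{name}\n{sequence}\n" for name, sequence, _ in records)


# Two hypothetical reads in FASTQ form.
fastq_text = "@read1\nACGT\n+\nII5!\n@read2\nGGCA\n+\nIIII\n"
records = list(read_fastq(io.StringIO(fastq_text)))

print(to_fasta(records), end="")
# >read1
# ACGT
# >read2
# GGCA

# Per-base Phred scores of the first read, assuming Phred+33 encoding.
print([ord(ch) - 33 for ch in records[0][2]])  # [40, 40, 20, 0]
```

Going from FASTQ to FASTA is lossless for the sequence but loses the quality scores, while the reverse direction is not possible without inventing scores. That asymmetry is why quality-aware steps such as quality control and trimming need FASTQ input.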