Example Final Projects

Selected example final projects

The following project descriptions are taken from final projects turned in by students. In some cases, I lightly edited the title or description to clarify the nature of the project. — F. J. Pineda


Correlation of GC content with average isoelectric point.

Amber L. Hartman

Recent halophile genome papers (such as those for Halobacterium NRC-1 and Haloarcula marismortui) have reported unusually low average isoelectric points for these organisms, 4.5 and 5, respectively. These low values indicate an especially acidic proteome. The authors hypothesize that these acidic proteomes are a halophilic adaptation important for preventing the “salting out” of proteins in high-salt environments.

However, a member of our group, while working on intracellular endosymbionts, demonstrated an inverse linear relationship between GC content and average pI, i.e., as GC content increases, pI decreases. Given the high GC content (65-71%) of halophile genomes, it is conceivable that the low pI of their proteomes is simply the result of high GC content and not necessarily adaptive.

For the final project I have selected 35 genomes of varied GC content to examine this relationship between isoelectric point and GC content. For each genome (.con file = nucleotides, .pep file = peptides) I have calculated the average GC content and the average isoelectric point.
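The GC-content half of this calculation, and the correlation step that follows it, can be sketched as below. This is a minimal illustration in Python and not the student's code: it assumes the .con files are FASTA-formatted, and the function names (read_fasta, gc_content, pearson_r) are made up for this example. Computing the isoelectric point itself is left out, since it depends on a choice of pKa values that the description does not specify.

```python
from math import sqrt


def read_fasta(text):
    """Return the sequences found in FASTA-formatted text (assumed file format)."""
    seqs, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line closes the previous record, if there was one.
            if current:
                seqs.append("".join(current))
                current = []
        else:
            current.append(line.upper())
    if current:
        seqs.append("".join(current))
    return seqs


def gc_content(seqs):
    """Fraction of G and C among unambiguous bases (A, C, G, T) over all sequences."""
    gc = sum(s.count("G") + s.count("C") for s in seqs)
    at = sum(s.count("A") + s.count("T") for s in seqs)
    total = gc + at
    return gc / total if total else 0.0


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / sqrt(sxx * syy)
```

With one GC value and one average pI value per genome, pearson_r(gc_values, pi_values) gives a single number summarizing the relationship; a value near -1 would correspond to the inverse linear relationship described above.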


Comparison of Confidence Intervals for Gene Location

Ani Manichaikul

In this project, I performed a simulation to compare the performance of two methods for estimation and confidence intervals in human gene mapping. The methods compared were non-parametric LOD (NPL), which is something of a standard in gene mapping, and GEE, which has been proposed as an alternative method of gene mapping.


Parsing output file from SAGE program for linkage analysis

Ching-Yu Cheng

This program generates a text file from the output files of single-point linkage analysis performed with the Statistical Analysis for Genetic Epidemiology (SAGE) program. The text file is ready for plotting with R, and the plot shows how many significant linkage signals are detected across the 23 human chromosomes. Because SAGE produces a separate output file for each chromosome, using Perl to parse those files is an efficient way to combine and summarize all of the outputs. At present I have output files available only for chromosomes 1 to 21, so the script below parses 21 linkage output files. Each output file contains results for 6 traits, such as BP and BMI. Although the script is not long, it is useful for my genetic analysis project. Scripts using the same principle can be written to parse other analysis outputs for each of the 23 chromosomes.
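The combine-and-summarize step can be sketched as follows. The project itself was written in Perl; this is a Python illustration only, and the input layout is an assumption, not the real SAGE output format: each result line is taken to hold a trait name, a marker name, and a p-value separated by whitespace, and every other line is skipped. The names parse_chromosome, combine, and to_table are hypothetical.

```python
import re

# Hypothetical result-line layout (NOT the real SAGE format):
# trait name, marker name, p-value, separated by whitespace.
RESULT_LINE = re.compile(r"^\s*(\S+)\s+(\S+)\s+([0-9.eE+-]+)\s*$")


def parse_chromosome(text, chromosome):
    """Extract (chromosome, trait, marker, p_value) rows from one output file's text."""
    rows = []
    for line in text.splitlines():
        match = RESULT_LINE.match(line)
        if not match:
            continue  # headers, blank lines, and anything else are ignored
        try:
            p_value = float(match.group(3))
        except ValueError:
            continue  # third column looked numeric but was not a number
        rows.append((chromosome, match.group(1), match.group(2), p_value))
    return rows


def combine(outputs):
    """Merge per-chromosome results; outputs maps chromosome number to file text."""
    rows = []
    for chromosome in sorted(outputs):
        rows.extend(parse_chromosome(outputs[chromosome], chromosome))
    return rows


def to_table(rows):
    """Format rows as tab-delimited text with a header line."""
    lines = ["chromosome\ttrait\tmarker\tp_value"]
    lines += [f"{c}\t{t}\t{m}\t{p:g}" for c, t, m, p in rows]
    return "\n".join(lines) + "\n"
```

A tab-delimited table with a header line can be read in R with read.delim, which keeps the parsing and the plotting in separate steps.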


Parsing output file for Visualization of disease alleles

Euiju Jung

The purpose of this project is to visualize the presumed disease allele at each SNP clearly and at a glance, and to check whether single-genotype analyses and various combinations of haplotype analyses agree on the disease allele.

I made a preliminary setup for chromosome 1 as part of visualizing results across different sets of genome-wide association studies.

First, I made two parser modules (single.pm, haplo.pm) to extract information from the different formats of genetic association study output files. Then I made an R batch file to plot the extracted information (slp.pl). For this, I made a template R file (slp.r) so that the graphs can be plotted efficiently.


Script that drives Genehunter-plus program and parses the output

Evaristus Nwulia

A Perl script that uses Genehunter-plus to generate nullprobs and probs.dat for the autosomal chromosomes, and then runs ASM, which takes these two outputs and computes likelihood ratio statistics from the Kong and Cox model of allele sharing.


Automated download and BLAST analysis

Jonathan Pelsis

Create a Perl script that will download a nucleotide or protein database, search for a given RefSeq, and do a BLAST search using the two. After the initial BLAST search is complete, the program will mutate the sequence and use the mutant sequence as the new query in the BLAST search. The sequence is mutated 1000 times and every 100th mutation is saved in a new file. A dynamic folder is created based on the current time, date, and database being used. Below are links to all the code used for the project.
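The mutate-and-save loop and the dynamic folder name can be sketched as below. The project was written in Perl; this is a Python illustration under stated assumptions: each mutation is taken to be a single random point substitution (the description does not say what kind of mutation was used), the folder name format is invented, and the download and BLAST steps are omitted. The names mutate_once, mutation_series, and run_folder_name are hypothetical.

```python
import random
from datetime import datetime

NUCLEOTIDES = "ACGT"


def mutate_once(seq, rng, alphabet=NUCLEOTIDES):
    """Replace one randomly chosen position with a different letter (assumed mutation model)."""
    if not seq:
        raise ValueError("cannot mutate an empty sequence")
    pos = rng.randrange(len(seq))
    choices = [c for c in alphabet if c != seq[pos]]
    return seq[:pos] + rng.choice(choices) + seq[pos + 1:]


def mutation_series(seq, n_mutations=1000, save_every=100, seed=None,
                    alphabet=NUCLEOTIDES):
    """Apply n_mutations successive mutations; return {step: sequence} for every save_every-th step."""
    rng = random.Random(seed)
    snapshots = {}
    for step in range(1, n_mutations + 1):
        seq = mutate_once(seq, rng, alphabet)
        if step % save_every == 0:
            snapshots[step] = seq  # in the project, each snapshot went to its own file
    return snapshots


def run_folder_name(database, now=None):
    """Build a folder name from the date, time, and database name (format is an assumption)."""
    now = now or datetime.now()
    return f"{now:%Y-%m-%d_%H-%M-%S}_{database}"
```

With the defaults, mutation_series returns ten snapshots (steps 100, 200, ..., 1000), each of which would serve as the query for one follow-up BLAST search. Passing a seed makes a run reproducible.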


FANTOM3 data and cancer

Luigi Marchionni

Retrieve sense/antisense (S/AS) pairs of transcripts from FANTOM3-db and map their sequences onto mouse arrays in order to evaluate the presence and nature of differential expression of S/AS pairs in cancer models (or during embryogenesis).

Gene Annotation for Microarray Chip

Qiushan Tao

The number of large-scale experimental datasets generated by high-throughput technologies has grown rapidly. For example, an Affy microarray chip can measure the expression levels of more than ten thousand genes simultaneously. Although the analysis packages for these gene chips include a gene annotation profile, those annotation profiles are usually too simple and out of date for most studies. This project aims to write a Perl program to update the gene annotation profile of the Affy chip HU95av2. It will have the following functions: