Purdue Discovery Pipeline
Bindley Contact
Cheolhwan Oh, Ph.D.
oh2@purdue.edu
(765) 496-3170
BIND 227-G
Faculty Contact
Xiang Zhang, Ph.D.
zhang100@purdue.edu
(765) 496-1153
BIND 227-D
Spectrum deconvolution
The purpose of spectrum deconvolution is to distinguish signals that arise from the real analyte
from signals that arise from contaminants or instrumental noise. A second purpose is to reduce
data dimensionality, which benefits downstream statistical analysis. XMass is a software package
that uses chemical noise filtering, charge state fitting, and de-isotoping to analyze complex
peptide samples.
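As a rough illustration of charge state fitting and de-isotoping, the sketch below infers a charge state from the spacing of isotopic peaks and converts the monoisotopic m/z to a neutral mass. It is not XMass's actual algorithm; the function names, the tolerance, and the example peak list are assumptions made for illustration, and it assumes protonated ions of the form [M + zH]z+.

```python
# Illustrative sketch only; this is not how XMass is implemented.
C13_C12_DELTA = 1.00335   # mass difference between 13C and 12C, in Da
PROTON_MASS = 1.007276    # mass of a proton, in Da


def infer_charge(isotope_mz, max_charge=6, tol=0.01):
    """Infer the charge state from the spacing of adjacent isotopic peaks.

    Isotopic peaks of an ion with charge z are spaced about 1.00335 / z
    apart on the m/z axis. Returns None if no charge up to max_charge fits.
    """
    spacings = [b - a for a, b in zip(isotope_mz, isotope_mz[1:])]
    if not spacings:
        return None
    mean_spacing = sum(spacings) / len(spacings)
    for z in range(1, max_charge + 1):
        if abs(mean_spacing - C13_C12_DELTA / z) <= tol:
            return z
    return None


def neutral_monoisotopic_mass(mono_mz, charge):
    """Convert the monoisotopic m/z of an [M + zH]z+ ion to a neutral mass."""
    return (mono_mz - PROTON_MASS) * charge


# Hypothetical isotopic envelope of a doubly charged peptide ion.
envelope = [500.7500, 501.2517, 501.7533]
z = infer_charge(envelope)
mass = neutral_monoisotopic_mass(envelope[0], z)
print(z, round(mass, 4))
```

Collapsing each isotopic envelope to a single (mass, charge) entry in this way is one reason deconvolution reduces the number of data points passed to later steps.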
Peak alignment
Ideally, the same molecule detected on the same system should yield the same measured values. For example, if a peptide is measured on an LC-MS system, its retention time and molecular weight should be the same in different samples. In practice this is usually not the case because of experimental variation. The objective of peak alignment is to recognize peaks of the same molecule occurring in different samples among the thousands of peaks detected during the course of an experiment.
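One simple way to picture alignment is tolerance-based matching: two peaks are treated as the same molecule if their m/z values and retention times both agree within fixed windows. The sketch below is a minimal greedy version of that idea, not the method used in the pipeline; the function name, tolerances, and peak lists are assumptions for illustration.

```python
def align_peaks(reference, sample, mz_tol=0.01, rt_tol=0.5):
    """Pair each reference peak with the closest unused sample peak.

    Peaks are (mz, retention_time) tuples. A sample peak is a candidate
    only if it lies within both tolerances of the reference peak.
    Returns a list of (reference_index, sample_index) pairs.
    """
    used = set()
    matches = []
    for i, (ref_mz, ref_rt) in enumerate(reference):
        best, best_score = None, None
        for j, (mz, rt) in enumerate(sample):
            if j in used:
                continue
            d_mz, d_rt = abs(mz - ref_mz), abs(rt - ref_rt)
            if d_mz > mz_tol or d_rt > rt_tol:
                continue
            # Scale each difference by its tolerance so the two are comparable.
            score = d_mz / mz_tol + d_rt / rt_tol
            if best_score is None or score < best_score:
                best, best_score = j, score
        if best is not None:
            used.add(best)
            matches.append((i, best))
    return matches


# Hypothetical (m/z, retention time in minutes) peak lists from two runs.
reference = [(500.750, 12.1), (623.310, 18.4), (745.420, 25.0)]
sample = [(623.312, 18.7), (500.751, 12.3), (810.100, 30.2)]
print(align_peaks(reference, sample))
```

A greedy match like this ignores systematic retention time drift; real data would usually need a drift correction before fixed windows work well.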
Peak normalization
To allow analysis across multiple experiments, the data must first be normalized to minimize experimental variation. Normalization (scaling the intensities of spectra) improves the ability to compare samples by reducing the variability of intensity between spectra.
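The text does not say which scaling is used. As one common possibility, the sketch below scales every spectrum to the same total intensity; the function name and the example intensities are assumptions for illustration.

```python
def normalize_total_intensity(spectra):
    """Scale each spectrum so its intensities sum to the same total.

    The target total is the mean of the per-spectrum totals, so the
    overall intensity scale of the dataset is preserved.
    """
    totals = [sum(s) for s in spectra]
    if any(t <= 0 for t in totals):
        raise ValueError("every spectrum needs a positive total intensity")
    target = sum(totals) / len(totals)
    return [[x * target / t for x in s] for s, t in zip(spectra, totals)]


# Two hypothetical spectra with the same profile but a twofold loading difference.
spectra = [[100.0, 300.0, 600.0], [50.0, 150.0, 300.0]]
for row in normalize_total_intensity(spectra):
    print(row)
```

After scaling, the two example spectra become identical, which shows how a pure loading difference is removed while relative peak intensities within each spectrum are kept.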
Significance test
Several statistical criteria can be used either to identify data elements that contribute significantly to the protein profile of a sample or to distinguish one group of samples from others. The procedures employed by the BBC for biomarker discovery can be summarized by the following criteria.
Qualitative analysis for significant peaks that are present in one group but not in the other. A qualitative difference is one in which a peak is present in only a few samples (e.g., fewer than 50%) in one group but in most samples in the other group. The two groups are compared using a contingency table whose columns correspond to the groups and whose rows correspond to presence or absence of the peak. A chi-square test then assesses whether the presence of the peak differs significantly between the two groups.
Statistical tests for peaks that differ quantitatively between two groups. Some peaks are present in both sample groups, and the difference in their intensity between the groups can be assessed with statistical significance tests. A quantitative difference is one in which a peak is present in most (or all) samples but has different intensities in the two groups. The standard two-sample t-test or the Wilcoxon-Mann-Whitney rank test is used to evaluate the difference between groups.
Ranking the p-values of the t-statistics to control the false discovery rate (FDR). The false discovery rate approach can be used to handle multiple testing (tens of thousands of peptide peaks tested simultaneously). The peptide peaks are ranked by the p-values from the statistical tests, and a cut-off is calculated that gives a chosen false discovery rate (e.g., 5%). All peptides with a p-value below the cut-off are selected as differentially expressed between the two groups (for example, metastasis and non-metastasis) and are listed as potential biomarkers.
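The first and third criteria can be sketched in a few lines. The example below computes a Pearson chi-square test for a 2x2 presence/absence table and then applies the Benjamini-Hochberg step-up procedure to a list of p-values. The text does not name a specific FDR procedure, so Benjamini-Hochberg is an assumption here, as are the function names and the example counts and p-values.

```python
import math


def chi_square_2x2(present_a, absent_a, present_b, absent_b):
    """Pearson chi-square test (no continuity correction) for a 2x2 table.

    Returns (statistic, p_value). With one degree of freedom the p-value
    is erfc(sqrt(statistic / 2)).
    """
    a, b, c, d = present_a, absent_a, present_b, absent_b
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the table is empty")
    stat = n * (a * d - b * c) ** 2 / denom
    return stat, math.erfc(math.sqrt(stat / 2))


def benjamini_hochberg(p_values, fdr=0.05):
    """Return indices of the tests selected at the given false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value satisfies p_(k) <= k / m * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Hypothetical peak: present in 3 of 20 samples in one group, 16 of 20 in the other.
stat, p = chi_square_2x2(3, 17, 16, 4)
print(round(stat, 3), p < 0.001)

# Hypothetical p-values for five peaks from a two-sample test.
p_values = [0.041, 0.001, 0.205, 0.008, 0.039]
print(benjamini_hochberg(p_values))
```

The chi-square approximation becomes unreliable when expected cell counts are small; an exact test may be preferable for small groups.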
Pattern recognition
Most bioinformatics data mining systems fall into one of two categories: supervised and unsupervised. Supervised systems require knowledge or data in which the outcome or classification is known ahead of time, so that the system can be trained to recognize and distinguish outcomes. Unsupervised systems cluster or group records without previous knowledge of outcome or classification.
The most frequently used and most straightforward unsupervised method is principal component analysis (PCA). PCA was developed for the analysis of datasets with high dimensionality. Its main function is to reduce multivariate, multi-channel data to a few manageable dimensions: a new set of uncorrelated variables called principal components (PCs). These PCs approximate the original data and allow an analyst to view the data in the reduced dimensions and to study how the different cases and variables contribute and relate to the overall variability of the data.
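To make the idea concrete, the sketch below estimates only the first principal component, using power iteration on the sample covariance matrix. It is a teaching sketch under stated assumptions rather than a production PCA: the function name and data are hypothetical, and power iteration can fail if the starting vector happens to be orthogonal to the dominant direction.

```python
import math


def first_principal_component(rows, iterations=200):
    """Estimate the first principal component by power iteration.

    rows is a list of samples, each a list of variable values.
    Returns (component, scores): a unit vector and each sample's
    projection onto it after mean-centering.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # start vector carries no variance; estimate is unusable
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores


# Hypothetical intensities of two perfectly correlated peaks in four samples.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
component, scores = first_principal_component(rows)
print([round(x, 4) for x in component])
print([round(s, 4) for s in scores])
```

Because the two example variables are perfectly correlated, a single component captures all of the variance, and the scores give a one-dimensional view of the four samples.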
Supervised learning systems include discriminant function analysis, partial least squares, artificial neural networks, and nearest-neighbor methods. The most popular supervised learning system is the support vector machine (SVM).
Molecular networks
Molecular correlation is complementary to abundance-level information and provides a powerful approach for defining relationships among molecules in a biological sample. It not only reveals important relationships among the various components, but also provides information about the biochemical processes underlying the disease or drug response. As a simple example, two molecules have a positive correlation if their concentrations tend to increase together across samples. Conversely, two molecules have a negative correlation if the concentration of one tends to increase while the other decreases. A common approach is to estimate molecular correlations by calculating Pearson's correlation coefficient.
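The positive and negative cases described above can be checked with a direct implementation of Pearson's correlation coefficient. The function name and the abundance values below are hypothetical and serve only to illustrate the calculation.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sum(a * b for a, b in zip(dx, dy)) / math.sqrt(sxx * syy)


# Hypothetical abundances of three molecules across five samples.
mol_a = [1.0, 2.0, 3.0, 4.0, 5.0]
mol_b = [2.1, 3.9, 6.2, 8.0, 9.8]   # rises with mol_a
mol_c = [9.0, 7.5, 6.1, 3.9, 2.0]   # falls as mol_a rises

print(round(pearson(mol_a, mol_b), 3))
print(round(pearson(mol_a, mol_c), 3))
```

Computing this coefficient for every pair of molecules yields a correlation matrix, which is one way to define the edges of a molecular network.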
Thousands of molecules can be measured in a single differential proteomics experiment. Bioinformatics tools play critical roles in extracting scientific information about protein behavior from the experimental data. Interactive visualization of differential data is one of the major components of ‘omics data analysis. SysNet is a software package for ‘omics expression data analysis that combines interactive visualization and data mining. SysNet can integrate molecular expression data obtained from different ‘omics experiments, interactively analyze intermolecular correlations using different statistical models, and perform interactive analysis of time-lapse data to assess molecular evolution.