Statistical Methods for the Modelling of Label-Free Shotgun Proteomic Data in Cell Line Biomarker Discovery

Gregori Font, Josep

Please use this identifier to cite or link to this item: https://dipositint.ub.edu/dspace/handle/2445/57364

Title:	Statistical Methods for the Modelling of Label-Free Shotgun Proteomic Data in Cell Line Biomarker Discovery
Author:	Gregori Font, Josep
Director/Tutor:	Sànchez, Àlex (Sànchez Pla); Villanueva i Cardús, Josep; Ocaña i Rebull, Jordi
Keywords:	Proteomics; Biochemical markers; Statistical methods
Issue Date:	11-Jul-2014
Publisher:	Universitat de Barcelona
Abstract:	[cat, in English translation] This thesis develops, designs, and implements a solution for the analysis of comparative proteomics data in biomarker discovery. Specifically, the solution is optimized for the analysis of tumor cell line secretomes by label-free LC-MS/MS, with quantification by the number of peptide spectra assigned to each protein. During the development of the methodology, the incidence and relevance of batch effects in the comparative analysis of unlabeled peptides by LC-MS/MS were demonstrated, as were the characteristics that identify a potential biomarker as reproducible. The models were developed with the help of simulations and of empirical data obtained from samples with controlled protein mixtures. The software solution implementing the model consists of two R/Bioconductor packages, each with a graphical interface that eases its use by non-experts. The first package, msmsEDA, provides functions for exploratory data analysis; it allows assessing the quality of the dataset from a spectral-count-based LC-MS/MS experiment and exploring the possible presence of outliers, confounding factors, or batch effects. The second package, msmsTests, encapsulates functions for inference in biomarker discovery. The model used is a GLM that allows blocking factors to be included for batch effect correction, and it incorporates a generalized normalization through offsets that allows secretomes to be compared at the level of a single cell. The implemented distributions are the Poisson and the negative binomial, together with the quasi-likelihood extension. Taken together, the model and its software implementation make it possible to:
• Assess the quality of an LC-MS/MS dataset.
• Identify outliers.
• Identify the presence of confounding factors or batch effects.
• Discover biomarkers using the distribution that best fits the data.
• Ensure a good level of reproducibility by means of a post-test filter.
The packages and their documentation are freely available at bioconductor.org, and the graphical interfaces at github.com.

[eng] This work develops and implements a data analysis pipeline for the discovery of biomarkers by high-throughput shotgun proteomics. Specifically, the solution has been optimized for the analysis of secretomes of tumor cell lines by label-free LC-MS/MS, with proteins quantified by peptide spectral counts. During development, the incidence and relevance of batch effects in the comparative analysis of label-free proteomics by LC-MS/MS were shown, and the features that make potential biomarkers reproducible were identified. The model was developed on empirical data obtained from a series of spiked experiments, with simulations used to evaluate its performance. The pipeline comprises an exploratory data analysis (EDA) R/Bioconductor package, msmsEDA, based on multidimensional analysis tools, and an R/Bioconductor inference package, msmsTests, based on generalized linear models (GLM) with Poisson or negative binomial distributions, or the quasi-likelihood GLM extension. Two graphical interfaces have also been produced to ease use of the solution by non-experts in an MS lab, and are freely available on GitHub. The model is designed to discover differentially expressed proteins in tumor cell line secretomes, using the cell as the unit of interest. It allows blocking factors as a means of batch effect correction. Normalization to cell units is embedded in the model through offsets, so no prior data treatment is required. The two packages developed, msmsEDA and msmsTests, allow for:
• Dataset quality assessment.
• The identification of outliers.
• The identification of confounding factors or batch effects.
• The discovery of potential biomarkers by using the distribution that best fits the available data.
• The improvement of reproducibility by a post-test filter based on effect size and signal levels.
Several papers have been published in peer-reviewed proteomics journals, each developing a data treatment step and demonstrating its use and value in biological experiments carried out in the Tumor Biomarker lab at VHIO.
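
As a rough illustration of how an offset embeds normalization in a count model, the sketch below fits a two-condition Poisson GLM, log E[y_i] = log(offset_i) + b0 + b1·[group_i = 1], for a single protein and compares it with the no-effect model by a likelihood ratio test. It is a minimal Python sketch under stated assumptions, not the msmsTests implementation: the function name, the sample values, and the choice of offsets (for example, relative cell numbers per sample) are hypothetical, and blocking factors, the negative binomial, and quasi-likelihood are left out.

```python
import math

def poisson_offset_lrt(counts, offsets, groups):
    """Two-group Poisson model with a log offset (hypothetical helper).

    counts  : spectral counts of one protein, one value per sample
    offsets : positive normalizing quantities per sample (assumed, e.g. cell numbers)
    groups  : 0 or 1 per sample (condition label)
    Returns (log2 fold change of group 1 vs group 0, LRT p-value).
    """
    y = [0.0, 0.0]  # total counts per group
    t = [0.0, 0.0]  # total offset per group
    for c, o, g in zip(counts, offsets, groups):
        y[g] += c
        t[g] += o
    if y[0] == 0 or y[1] == 0:
        raise ValueError("each group needs at least one count for a finite fold change")

    # With a single two-level factor the maximum likelihood rates are closed form.
    full = [y[0] / t[0], y[1] / t[1]]        # one rate per condition
    pooled = (y[0] + y[1]) / (t[0] + t[1])   # common rate under no effect
    null = [pooled, pooled]

    def loglik(rates):
        # Poisson log-likelihood, dropping the term that does not depend on the rates.
        total = 0.0
        for c, o, g in zip(counts, offsets, groups):
            mu = rates[g] * o
            total += (c * math.log(mu) if c > 0 else 0.0) - mu
        return total

    lr = max(0.0, 2.0 * (loglik(full) - loglik(null)))
    # Upper tail of a chi-square with 1 degree of freedom: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(lr / 2.0))
    return math.log2(full[1] / full[0]), p_value

# Illustrative values only: three replicates per condition.
counts = [10, 12, 8, 20, 22, 18]
groups = [0, 0, 0, 1, 1, 1]
lfc_raw, p_raw = poisson_offset_lrt(counts, [1, 1, 1, 1, 1, 1], groups)
lfc_norm, p_norm = poisson_offset_lrt(counts, [1, 1, 1, 2, 2, 2], groups)
```

With equal offsets the doubled counts in the second condition appear as a twofold change; when the offsets say those samples carry twice the normalizing quantity, the same counts give no change. That is the sense in which the normalization sits inside the model rather than in a separate preprocessing step.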
URI:	https://hdl.handle.net/2445/57364
Appears in Collections:	Tesis Doctorals - Departament - Estadística

Files in This Item:

File	Description	Size	Format
JGF_THESIS.pdf		5.01 MB	Adobe PDF

This item is licensed under a Creative Commons License