Machine learning-based tools to model and to remove the off-target effect

Pattern Analysis and Applications - Volume 20 - Pages 87-100 - 2015
Riwal Lefort (1), Ludovico Fusco (2), Olivier Pertz (2), François Fleuret (1, 3)
(1) Idiap Research Institute, Centre du Parc, Martigny, Switzerland
(2) Department of Biomedicine, University of Basel, Basel, Switzerland
(3) Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland

Abstract

RNA interference, also called gene knockdown, is a biological technique that inhibits a targeted gene in a cell. It allows one to identify statistical dependencies between a gene and a cell phenotype. However, the inhibition process may also modify additional genes. This is called the “off-target effect”, and it causes additional phenotype perturbations that are unrelated to the targeted gene. In this paper, we study new machine learning tools that both model the cell phenotypes and remove the off-target effect. We propose two new automatic methods to remove the off-target components from a data sample. The first method is based on vector quantization (VQ). The second relies on a classification forest. Both methods analyze the homogeneity of several repetitions of a gene knockdown. The baseline we consider is a Gaussian mixture model whose parameters are learned under constraints with a standard Expectation–Maximization algorithm. We evaluate these methods on a real data set, a semi-synthetic data set, and a synthetic toy data set. The real and semi-synthetic data sets are composed of cell growth dynamic quantities measured in time-lapse movies. The main result is that the probabilistic version of the VQ-based method achieves the best recognition performance.
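The abstract does not spell out either algorithm, so the following is only a minimal sketch of the general idea of checking homogeneity across repetitions with vector quantization. It is not the authors' method. It pools samples from several repetitions of one knockdown, quantizes them with a plain k-means, and flags the cells that at least one repetition almost never visits. The function names, the `min_share` threshold, the farthest-point initialisation, and the toy data are all assumptions made for illustration.

```python
import math
import random


def quantize(samples, k, n_iter=20):
    """Lloyd's k-means with a deterministic farthest-point initialisation.

    Returns the index of the codeword assigned to each sample.
    """
    codebook = [samples[0]]
    while len(codebook) < k:
        # Next codeword: the sample farthest from all current codewords.
        codebook.append(
            max(samples, key=lambda s: min(math.dist(s, c) for c in codebook))
        )

    def nearest(s):
        return min(range(k), key=lambda j: math.dist(s, codebook[j]))

    for _ in range(n_iter):
        assign = [nearest(s) for s in samples]
        for j in range(k):
            members = [s for s, a in zip(samples, assign) if a == j]
            if members:  # an empty cell keeps its previous codeword
                codebook[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return [nearest(s) for s in samples]


def flag_off_target(samples, rep, k=2, min_share=0.2):
    """Flag samples lying in cells that some repetition (almost) never visits.

    samples   : list of feature tuples pooled over repetitions of one knockdown
    rep       : list giving the repetition id of each sample
    min_share : hypothetical threshold on the fraction of a repetition's
                samples that must fall in a cell for it to count as shared
    """
    assign = quantize(samples, k)
    rep_ids = sorted(set(rep))
    size = {q: rep.count(q) for q in rep_ids}
    off_cells = set()
    for j in range(k):
        # Fraction of each repetition's samples that fall in cell j.
        shares = [
            sum(1 for a, r in zip(assign, rep) if a == j and r == q) / size[q]
            for q in rep_ids
        ]
        if min(shares) < min_share:  # cell is not shared by all repetitions
            off_cells.add(j)
    return [a in off_cells for a in assign]


def cloud(center, n, spread=0.3):
    """Toy 2-D Gaussian blob standing in for phenotype features."""
    return [(random.gauss(center, spread), random.gauss(center, spread)) for _ in range(n)]


random.seed(0)
# Three repetitions share one phenotype; repetition 2 also has an extra blob.
samples = cloud(0.0, 40) + cloud(0.0, 40) + cloud(0.0, 40) + cloud(5.0, 40)
rep = [0] * 40 + [1] * 40 + [2] * 80
flags = flag_off_target(samples, rep)
print(sum(flags), "samples flagged as off-target")
```

In this toy setting the extra blob appears in only one repetition, so its cell fails the homogeneity check, while the cell visited by all three repetitions is kept. A probabilistic variant, such as the one the abstract reports as performing best, would replace the hard threshold with soft weights. How that is done is not stated in the abstract.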
