Nội dung được dịch bởi AI, chỉ mang tính chất tham khảo

Về bất đẳng thức tam giác của các khoảng cách dựa trên tương quan đối với các biểu thức gene

BMC Bioinformatics - Tập 24 - Trang 1-16 - 2023

Jiaxing Chen^1,2, Yen Kaow Ng¹, Lu Lin¹, Xianglilan Zhang³, Shuaicheng Li¹

¹Department of Computer Science, City University of Hong Kong, Hong Kong, China

²Department of Computer Science, Beijing Normal University - Hong Kong Baptist University United International College, Zhuhai, People’s Republic of China

³State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Beijing, People’s Republic of China

Tóm tắt

Các hàm khoảng cách là cơ sở để đánh giá sự khác biệt giữa các biểu thức gene. Một hàm như vậy sẽ cho ra giá trị thấp nếu các biểu thức này có mối tương quan mạnh mẽ - dù là âm hay dương - và ngược lại. Một hàm khoảng cách phổ biến là khoảng cách tương quan tuyệt đối, $$d_a=1-|\rho |$$, trong đó $$\rho$$ là một phép đo độ tương đồng, chẳng hạn như tương quan Pearson hoặc Spearman. Tuy nhiên, khoảng cách tương quan tuyệt đối không thỏa mãn bất đẳng thức tam giác, điều này sẽ đảm bảo hiệu suất tốt hơn trong phân đoạn vector, cho phép định vị dữ liệu nhanh chóng, cũng như tăng tốc quá trình phân cụm dữ liệu. Trong công trình này, chúng tôi đề xuất $$d_r=\sqrt{1-|\rho |}$$ như một sự thay thế. Chúng tôi chứng minh rằng $$d_r$$ thỏa mãn bất đẳng thức tam giác khi $$\rho$$ đại diện cho tương quan Pearson, tương quan Spearman, hoặc độ tương đồng Cosine. Chúng tôi cho thấy $$d_r$$ tốt hơn $$d_s=\sqrt{1-\rho ^2}$$, một biến thể khác của $$d_a$$ thỏa mãn bất đẳng thức tam giác, cả trong phân tích lý thuyết và thực nghiệm. Chúng tôi đã so sánh thực nghiệm $$d_r$$ với $$d_a$$ trong thí nghiệm phân cụm gene và phân cụm mẫu sử dụng dữ liệu sinh học thực tế. Hai khoảng cách này thực hiện tương tự nhau trong cả phân cụm gene và phân cụm mẫu thông qua phân cụm phân cấp và phân cụm PAM (phân vùng xung quanh medoids). Tuy nhiên, $$d_r$$ cho thấy đặc điểm phân cụm mạnh mẽ hơn. Theo thí nghiệm bootstrap, $$d_r$$ thường tạo ra phân vùng cặp mẫu mạnh mẽ hơn (P-value $$<0.05$$). Các thống kê về thời gian mà một lớp “tan rã” cũng ủng hộ lợi thế của $$d_r$$ về độ tin cậy. $$d_r$$, như một biến thể của khoảng cách tương quan tuyệt đối, thỏa mãn bất đẳng thức tam giác và có khả năng phân cụm mạnh mẽ hơn.

Từ khóa

Tài liệu tham khảo

Hardin J, Mitani A, Hicks L, VanKoten B. A robust measure of correlation between two genes on a microarray. BMC Bioinformatics. 2007;8(1):220. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci. 1998;95(25):14863–8. Ernst J, Nau GJ, Bar-Joseph Z. Clustering short time series gene expression data. Bioinformatics. 2005;21(suppl1):159–68. Deng Y, Jiang Y-H, Yang Y, He Z, Luo F, Zhou J. Molecular ecological network analyses. BMC Bioinformatics. 2012;13(1):113. Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla Favera R, Califano A. Aracne: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics. 2006;7:7. Langfelder P, Horvath S. Wgcna: an r package for weighted correlation network analysis. BMC Bioinformatics. 2008;9(1):559. Datta S, Datta S. Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics. 2003;19(4):459–66. ttnphns (https://stats.stackexchange.com/users/3277/ttnphns): Is triangle inequality fulfilled for these correlation-based distances? Cross Validated. URL:https://stats.stackexchange.com/q/135231 (version: 2017-04-13). https://stats.stackexchange.com/q/135231 Baraty S, Simovici DA, Zara C. The impact of triangular inequality violations on medoid-based clustering. In: international symposium on methodologies for intelligent systems, pp. 2011:280–289 . Springer McCune B, Grace JB, Urban DL. Analysis of ecological communities, vol. 28. Gleneden Beach, Oregon: MjM software design; 2002. Pan J-S, McInnes FR, Jack MA. Fast clustering algorithms for vector quantization. Pattern Recogn. 1996;29(3):511–8. Moore AW. The anchors hierarchy: Using the triangle inequality to survive high dimensional data. In: proceedings of the sixteenth conference on uncertainty in artificial intelligence, Morgan Kaufmann Publishers Inc. 2000;397–405. Elkan C. Using the triangle inequality to accelerate k-means. In: proceedings of the 20th international conference on machine learning (ICML-03), 2003;147–153 Kryszkiewicz M, Lasek P. Ti-dbscan: Clustering with dbscan by means of the triangle inequality. In: international conference on rough sets and current trends in computing. Springer 2010;60–69 Prasad TV, Babu RP, Ahson SI. Gedas-gene expression data analysis suite. Bioinformation. 2006;1(3):83. Van Dongen S, Enright AJ. Metric distances derived from cosine similarity and pearson and spearman correlations. arXiv preprint arXiv:1208.3145 (2012) Priness I, Maimon O, Ben-Gal I. Evaluation of gene-expression clustering via mutual information distance measure. BMC Bioinformatics. 2007;8(1):1–12. Zapala MA, Schork NJ. Multivariate regression analysis of distance matrices for testing associations between gene expression patterns and related variables. Proc Natl Acad Sci. 2006;103(51):19430–5. Jaskowiak PA, Campello RJ, Costa IG. On the selection of appropriate distances for gene expression data clustering. BMC Bioinformatics. 2014;15(2):1–7. Kaufman L, Rousseeuw PJ. Finding groups in data: an introduction to cluster analysis, vol. 344. USA: John Wiley & Sons; 2009. Santos JM, Embrechts M. On the use of the adjusted rand index as a metric for evaluating supervised classification. In: international conference on artificial neural networks, Springer 2009;175–184 Hennig C, et al. Dissolution point and isolation robustness: robustness criteria for general cluster analysis methods. J Multivar Anal. 2008;99(6):1154–76. Jaskowiak PA, Campello RJ, Costa Filho IG. Proximity measures for clustering gene expression microarray data: a validation methodology and a comparative analysis. IEEE/ACM Trans Comput Biol Bioinform (TCBB). 2013;10(4):845–57. de Souto MC, Costa IG, de Araujo DS, Ludermir TB, Schliep A. Clustering cancer gene expression data: a comparative study. BMC Bioinformatics. 2008;9(1):497. Lake BB, Ai R, Kaeser GE, Salathia NS, Yung YC, Liu R, Wildberg A, Gao D, Fung H-L, Chen S, et al. Neuronal subtypes and diversity revealed by single-nucleus rna sequencing of the human brain. Science. 2016;352(6293):1586–90. Macosko EZ, Basu A, Satija R, Nemesh J, Shekhar K, Goldman M, Tirosh I, Bialas AR, Kamitaki N, Martersteck EM, et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell. 2015;161(5):1202–14. Wilk AJ, Rustagi A, Zhao NQ, Roque J, Martínez-Colón GJ, McKechnie JL, Ivison GT, Ranganath T, Vergara R, Hollis T, et al. A single-cell atlas of the peripheral immune response in patients with severe Covid-19. Nat Med. 2020;26(7):1070–6. Chua RL, Lukassen S, Trump S, Hennig BP, Wendisch D, Pott F, Debnath O, Thürmann L, Kurth F, Völker MT, et al. Covid-19 severity correlates with airway epithelium-immune cell interactions identified by single-cell analysis. Nat Biotechnol. 2020;38(8):970–9. Aizarani N, Saviano A, Mailly L, Durand S, Herman JS, Pessaux P, Baumert TF, Grün D, et al. A human liver cell atlas reveals heterogeneity and epithelial progenitors. Nature. 2019;572(7768):199–204. Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ, Teichmann SA, Marioni JC, Stegle O. Computational analysis of cell-to-cell heterogeneity in single-cell rna-sequencing data reveals hidden subpopulations of cells. Nat Biotechnol. 2015;33(2):155–60. Loo L, Simon JM, Xing L, McCoy ES, Niehaus JK, Guo J, Anton E, Zylka MJ. Single-cell transcriptomic analysis of mouse neocortical development. Nat Commun. 2019;10(1):1–11. Caliński T, Harabasz J. A dendrite method for cluster analysis. Commun Stat theory Methods. 1974;3(1):1–27. Huber W, Carey VJ, Gentleman R, Anders S, Carlson M, Carvalho BS, Bravo HC, Davis S, Gatto L, Girke T, et al. Orchestrating high-throughput genomic analysis with bioconductor. Nat Methods. 2015;12(2):115. Carlson M, Falcon S, Pages H, Li N. org. hs. eg. db: Genome wide annotation for human. R package version 3.3;2013 Falcon S, Gentleman R. Using gostats to test gene lists for go term association. Bioinformatics. 2006;23(2):257–8. Lapointe J, Li C, Higgins JP, Van De Rijn M, Bair E, Montgomery K, Ferrari M, Egevad L, Rayford W, Bergerheim U, et al. Gene expression profiling identifies clinically relevant subtypes of prostate cancer. Proc Natl Acad Sci. 2004;101(3):811–6. Li WV, Li JJ. An accurate and robust imputation method scimpute for single-cell rna-seq data. Nat Commun. 2018;9(1):1–9.

Scholar Hub - Công cụ hỗ trợ trích dẫn và phân tích khoa học Việt Nam

Scholar Hub là công cụ hỗ trợ trích dẫn và phân tích ảnh hưởng của các bài báo, công bố khoa học Việt Nam và Quốc tế.
ScholarHub KHÔNG đăng thông tin tổng hợp, KHÔNG đăng lại nội dung từ các trang báo chí Việt Nam hoặc trang thông tin điện tử khác tại Việt Nam.

Thông tin, cập nhật

Đăng ký Tạp chí tham gia Scholar Hub

Phản hồi ý kiến về Scholar Hub

Bài viết, nội dung cập nhật

Chủ đề khoa học

Website liên kết

Hệ thống CSDL Khoa học & Công nghệ SciBase

Phần mềm kiểm tra trùng lặp Kiểm Tra Tài Liệu

Phần mềm xuất bản tạp chí điện tử VOJS

Hệ thống hội thảo khoa học Việt Nam

Nền tảng trắc nghiệm và đề thi đa lĩnh vực LetQA

Thông tin liên hệ & hỗ trợ

Đơn vị chủ quản, phát triển và vận hành: Công ty Cổ phần Metis

Địa chỉ liên hệ: 26A Lê Đức Thọ, Phường Từ Liêm, Thành phố Hà Nội

Số giấy chứng nhận ĐKKD: 0109293202 cấp ngày 03/08/2020 tại Sở Kế hoạch và Đầu tư thành phố Hà Nội

Người quản lý và chịu trách nhiệm nội dung: Nguyễn Ngọc Sơn

Hotline: 0566.685.688

Email: [email protected]