repo: an R package for data-centered management of bioinformatic pipelines

BMC Bioinformatics - Volume 18 - Pages 1-9 - 2017

Francesco Napolitano¹

¹Telethon Institute of Genetics and Medicine (TIGEM), Pozzuoli, NA, Italy

Abstract

Reproducibility in Data Analysis research has long been a significant concern, particularly in the areas of Bioinformatics and Computational Biology. Towards the aim of developing reproducible and reusable processes, Data Analysis management tools can help give structure and coherence to complex data flows. Nonetheless, improved software quality comes at the cost of additional design and planning effort, which may become impractical in rapidly changing development environments. I propose that shifting the focus from processes to data in the management of Bioinformatic pipelines may help improve reproducibility with minimal impact on preexisting development practices. In this paper I introduce the repo R package for bioinformatic analysis management. The tool supports a data-centered philosophy that aims at improving analysis reproducibility and reusability with minimal design overhead. The core of repo lies in its support for easy data storage, retrieval, distribution and annotation. In repo, the data analysis flow is derived a posteriori from dependency annotations. The repo package constitutes an unobtrusive data and flow management extension of the R statistical language. Its adoption, together with good development practices, can help improve data analysis management, sharing and reproducibility, especially in the fields of Bioinformatics and Computational Biology.
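
The idea of deriving the analysis flow a posteriori from dependency annotations can be illustrated with a small toy. The sketch below is written in Python purely for illustration and is not the repo package or its R API: the class `ToyRepo` and its methods `put`, `get` and `flow` are hypothetical names. It assumes only that each stored item records the names of the items it was computed from, so that the flow is never declared up front but is recovered afterwards by ordering the items according to those annotations.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class ToyRepo:
    """Hypothetical in-memory store; an illustration, not the repo R package."""

    def __init__(self):
        self._items = {}

    def put(self, name, obj, description="", tags=(), depends=()):
        # Store the object together with its annotations.
        # `depends` names the items this one was computed from.
        missing = [d for d in depends if d not in self._items]
        if missing:
            raise KeyError(f"unknown dependencies: {missing}")
        self._items[name] = {
            "obj": obj,
            "description": description,
            "tags": set(tags),
            "depends": tuple(depends),
        }

    def get(self, name):
        # Retrieve a stored object by name.
        return self._items[name]["obj"]

    def flow(self):
        # Derive the analysis flow after the fact: order the items so that
        # every item appears after the items it depends on.
        graph = {name: item["depends"] for name, item in self._items.items()}
        return list(TopologicalSorter(graph).static_order())


repo = ToyRepo()
repo.put("raw", [3, 1, 2], description="raw measurements", tags={"input"})
repo.put("sorted", sorted(repo.get("raw")), depends=["raw"])
repo.put("median", repo.get("sorted")[1], depends=["sorted"])
print(repo.flow())  # ['raw', 'sorted', 'median']
```

In this toy, the analysis code only stores and retrieves data; the pipeline structure emerges from the `depends` annotations alone, which is the sense in which the approach is data-centered.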

