Sustainable data analysis with Snakemake

F1000Research - Volume 10 - Page 33

Felix Mölder¹,², Kim Philipp Jablonski³,⁴, Brice Letcher⁵, Michael B. Hall⁵, Christopher H. Tomkins-Tinch⁶,⁷, Vanessa Sochat⁸, Jan Förster¹,⁹, Soohyun Lee¹⁰, Sven Twardziok¹¹, Alexander Kanitz¹²,¹³, Andreas Wilm¹⁴, Manuel Holtgrewe¹⁵,¹¹, Sven Rahmann¹⁶, Sven Nahnsen¹⁷, Johannes Köster¹,¹⁸

¹Algorithms for Reproducible Bioinformatics, Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany

²Institute of Pathology, University Hospital Essen, University of Duisburg-Essen, Essen, Germany

³Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland

⁴Swiss Institute of Bioinformatics (SIB), Basel, Switzerland

⁵EMBL-EBI, Hinxton, UK

⁶Broad Institute of MIT and Harvard, Cambridge, USA

⁷Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, USA

⁸Stanford University Research Computing Center, Stanford University, Stanford, USA

⁹German Cancer Consortium (DKTK), Partner Site Essen, and German Cancer Research Center (DKFZ), Heidelberg, Germany

¹⁰Biomedical Informatics, Harvard Medical School, Harvard University, Boston, USA

¹¹Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health (BIH), Center for Digital Health, Berlin, Germany

¹²Biozentrum, University of Basel, Basel, Switzerland

¹³SIB Swiss Institute of Bioinformatics / ELIXIR Switzerland, Lausanne, Switzerland

¹⁴Microsoft Singapore, Singapore, Singapore

¹⁵CUBI (Core Unit Bioinformatics), Berlin Institute of Health, Berlin, Germany

¹⁶Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany

¹⁷Quantitative Biology Center (QBiC), University of Tübingen, Tübingen, Germany

¹⁸Medical Oncology, Harvard Medical School, Harvard University, Boston, USA

Abstract

Data analysis often involves many heterogeneous steps, from applying various command-line tools to using scripting languages such as R or Python to generate plots and tables. It is widely recognized that data analyses should ideally be conducted in a reproducible way. Reproducibility enables technical validation and regeneration of results on the original data or even on new data. However, reproducibility alone is not sufficient to deliver an analysis of lasting impact (i.e., a sustainable one) for the field, or even for a single research group. We argue that ensuring adaptability and transparency is equally important. Adaptability describes the ability to modify the analysis to answer extended or slightly different research questions. Transparency describes the ability to understand the analysis well enough to judge whether it is valid not only technically but also methodologically.

Here, we analyze the properties a data analysis needs in order to be reproducible, adaptable, and transparent. We show how the popular workflow management system Snakemake can be used to guarantee these properties, and how it enables a unified, combined, and ergonomic representation of all steps involved in data analysis, from raw data processing, through quality control, to fine-grained, interactive exploration and plotting of final results.
