PeerJ

SCOPUS (2013-2023), SCIE-ISI

  2167-8359

  United States

Publisher: PeerJ Inc.

Subject areas:
Biochemistry, Genetics and Molecular Biology (miscellaneous); Agricultural and Biological Sciences (miscellaneous); Neuroscience (miscellaneous); Medicine (miscellaneous)

Featured articles

VSEARCH: a versatile open source tool for metagenomics
Volume 4 - Page e2584
Torbjørn Rognes, Tomáš Flouri, Ben Nichols, Christopher Quince, Frédéric Mahé
Background

VSEARCH is an open source and free of charge multithreaded 64-bit tool for processing and preparing metagenomics, genomics and population genomics nucleotide sequence data. It is designed as an alternative to the widely used USEARCH tool (Edgar, 2010) for which the source code is not publicly available, algorithm details are only rudimentarily described, and only a memory-confined 32-bit version is freely available for academic use.

Methods

When searching nucleotide sequences, VSEARCH uses a fast heuristic based on words shared by the query and target sequences in order to quickly identify similar sequences; a similar strategy is probably used in USEARCH. VSEARCH then performs optimal global sequence alignment of the query against potential target sequences, using full dynamic programming instead of the seed-and-extend heuristic used by USEARCH. Pairwise alignments are computed in parallel using vectorisation and multiple threads.
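The two-stage strategy described above (a shared-word prefilter followed by full dynamic-programming global alignment) can be sketched in a few lines of Python. This is only an illustration of the general idea, not VSEARCH's implementation: the function names, the word length `k`, the candidate limit, and the match, mismatch and linear gap scores are all assumptions chosen for readability, and the sketch is neither vectorised nor multithreaded.

```python
def kmer_set(seq, k):
    """Return the set of distinct words of length k found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def rank_targets_by_shared_words(query, targets, k=8):
    """Order target indices by the number of distinct k-mers shared with the query."""
    query_words = kmer_set(query, k)
    shared = [(len(query_words & kmer_set(t, k)), i) for i, t in enumerate(targets)]
    # Most shared words first; ties are broken by the original target order.
    shared.sort(key=lambda pair: (-pair[0], pair[1]))
    # Targets sharing no word with the query are dropped.
    return [i for count, i in shared if count > 0]


def global_alignment_score(a, b, match=2, mismatch=-4, gap=-2):
    """Needleman-Wunsch score with a linear gap penalty (hypothetical scores)."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


def search(query, targets, k=8, max_candidates=4):
    """Return (score, target index) of the best-aligning candidate, or None."""
    candidates = rank_targets_by_shared_words(query, targets, k)[:max_candidates]
    scored = [(global_alignment_score(query, targets[i]), i) for i in candidates]
    return max(scored) if scored else None


if __name__ == "__main__":
    database = ["ACGTACGTAC", "TTTTTTTTTT", "ACGTACGTAA"]
    print(search("ACGTACGTAC", database, k=4))
```

The prefilter keeps the expensive alignment step restricted to a handful of promising targets, which is the reason a word-based heuristic is paired with exact alignment in the first place.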

Results

VSEARCH includes most commands for analysing nucleotide sequences available in USEARCH version 7 and several of those available in USEARCH version 8, including searching (exact or based on global alignment), clustering by similarity (using length pre-sorting, abundance pre-sorting or a user-defined order), chimera detection (reference-based or de novo), dereplication (full length or prefix), pairwise alignment, reverse complementation, sorting, and subsampling. VSEARCH also includes commands for FASTQ file processing, i.e., format detection, filtering, read quality statistics, and merging of paired reads. Furthermore, VSEARCH extends functionality with several new commands and improvements, including shuffling, rereplication, masking of low-complexity sequences with the well-known DUST algorithm, a choice among different similarity definitions, and FASTQ file format conversion. VSEARCH is here shown to be more accurate than USEARCH when performing searching, clustering, chimera detection and subsampling, while on a par with USEARCH for paired-end read merging. VSEARCH is slower than USEARCH when performing clustering and chimera detection, but significantly faster when performing paired-end read merging and dereplication. VSEARCH is available at https://github.com/torognes/vsearch under either the BSD 2-clause license or the GNU General Public License version 3.0.

Discussion

VSEARCH has been shown to be a fast, accurate and full-fledged alternative to USEARCH. A free and open-source versatile tool for sequence analysis is now available to the metagenomics community.

scikit-image: image processing in Python
Volume 2 - Page e453
Stéfan van der Walt, Johannes L. Schönberger, Juan Nunez-Iglesias, François Boulogne, Joshua Warner, Neil Yager, Emmanuelle Gouillart, Tony Yu
MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities
Volume 3 - Page e1165
Dongwan Kang, Jeff Froula, Rob Egan, Zhong Wang
A brief introduction to mixed effects modelling and multi-model inference in ecology
Volume 6 - Page e4794
Xavier A. Harrison, Lynda Donaldson, Maria Correa-Cano, Julian Evans, David N. Fisher, Cecily Goodwin, Beth Robinson, David J. Hodgson, Richard Inger

The use of linear mixed effects models (LMMs) is increasingly common in the analysis of biological data. Whilst LMMs offer a flexible approach to modelling a broad range of data types, ecological data are often complex and require complex model structures, and the fitting and interpretation of such models is not always straightforward. The ability to achieve robust biological inference requires that practitioners know how and when to apply these tools. Here, we provide a general overview of current methods for the application of LMMs to biological data, and highlight the typical pitfalls that can be encountered in the statistical modelling process. We tackle several issues regarding methods of model selection, with particular reference to the use of information theory and multi-model inference in ecology. We offer practical solutions and direct the reader to key references that provide further technical detail for those seeking a deeper understanding. This overview should serve as a widely accessible code of best practice for applying LMMs to complex biological problems and model structures, and in doing so improve the robustness of conclusions drawn from studies investigating ecological and evolutionary questions.
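As a small illustration of the information-theoretic side of multi-model inference mentioned above, the following sketch computes Akaike weights from a set of AIC values. The abstract does not name a specific statistic, so the choice of Akaike weights is an assumption made here for illustration; the function name and the example AIC values are hypothetical.

```python
import math


def akaike_weights(aic_values):
    """Convert AIC values for candidate models into Akaike weights.

    Each weight is exp(-delta/2) normalised over all candidates, where
    delta is the model's AIC minus the smallest AIC in the set.
    """
    best = min(aic_values)
    relative_likelihoods = [math.exp(-0.5 * (aic - best)) for aic in aic_values]
    total = sum(relative_likelihoods)
    return [value / total for value in relative_likelihoods]


if __name__ == "__main__":
    # Hypothetical AIC values for three candidate models.
    for aic, weight in zip([100.0, 102.0, 110.0], akaike_weights([100.0, 102.0, 110.0])):
        print(f"AIC={aic:.1f}  weight={weight:.3f}")
```

The weights sum to one across the candidate set, so they can be read as the relative support for each model given the data and the models considered.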

Swarm: robust and fast clustering method for amplicon-based studies
Volume 2 - Page e593
Frédéric Mahé, Torbjørn Rognes, Christopher Quince, Colomban de Vargas, Micah Dunthorn
SDMtoolbox 2.0: the next generation Python-based GIS toolkit for landscape genetic, biogeographic and species distribution model analyses
Volume 5 - Page e4095
Jason L. Brown, Joseph Bennett, Connor M. French

SDMtoolbox 2.0 is a software package for spatial studies of ecology, evolution, and genetics. The release of SDMtoolbox 2.0 allows researchers to use the most current ArcGIS software and MaxEnt software, and reduces the amount of time that would be spent developing common solutions. The central aim of this software is to automate complicated and repetitive spatial analyses in an intuitive graphical user interface. One core tenet facilitates careful parameterization of species distribution models (SDMs) to maximize each model’s discriminatory ability and minimize overfitting. This includes careful processing of occurrence data, environmental data, and model parameterization. This program directly interfaces with MaxEnt, one of the most powerful and widely used species distribution modeling software programs, although SDMtoolbox 2.0 is not limited to species distribution modeling or restricted to modeling in MaxEnt. Many of the SDM pre- and post-processing tools have ‘universal’ analogs for use with any modeling software. The current version contains a total of 79 scripts that harness the power of ArcGIS for macroecology, landscape genetics, and evolutionary studies. For example, these tools allow for biodiversity quantification (such as species richness or corrected weighted endemism), generation of least-cost paths and corridors among shared haplotypes, assessment of the significance of spatial randomizations, and enforcement of dispersal limitations of SDMs projected into future climates, to name only a few functions contained in SDMtoolbox 2.0. Lastly, dozens of generalized tools exist for batch processing and conversion of GIS data types or formats, which are broadly useful to any ArcMap user.

MaxEnt’s parameter configuration and small samples: are we paying attention to recommendations? A systematic review
Volume 5 - Page e3093
Narkis S. Morales, Ignacio C. Fernández, Victoria Baca-González

Environmental niche modeling (ENM) is commonly used to develop probabilistic maps of species distribution. Among available ENM techniques, MaxEnt has become one of the most popular tools for modeling species distribution, with hundreds of peer-reviewed articles published each year. MaxEnt’s popularity is mainly due to the use of a graphical interface and automatic parameter configuration capabilities. However, recent studies have shown that using the default automatic configuration may not always be appropriate because it can produce non-optimal models, particularly when dealing with a small number of species presence points. Thus, the recommendation is to evaluate the best potential combination of parameters (feature classes and regularization multiplier) to select the most appropriate model. In this work we reviewed 244 articles published between 2013 and 2015 to assess whether researchers are following recommendations to avoid using the default parameter configuration when dealing with small sample sizes, or if they are using MaxEnt as a “black box tool.” Our results show that authors evaluated the best feature classes in only 16% of the analyzed articles, the best regularization multipliers in 6.9%, and both parameters simultaneously in a meager 3.7% before producing the definitive distribution model. We analyzed 20 articles to quantify the potential differences in resulting outputs when using software default parameters instead of the alternative best model. Results from our analysis reveal important differences between the use of default parameters and the best model approach, especially in the total area identified as suitable for the assessed species and the specific areas that are identified as suitable by both modelling approaches. These results are worrying, because publications are potentially reporting over-complex or over-simplistic models that can undermine the applicability of their results. Of particular importance are studies used to inform policy making. Therefore, researchers, practitioners, reviewers and editors need to be very judicious when dealing with MaxEnt, particularly when the modelling process is based on small sample sizes.

Rhea: a transparent and modular R pipeline for microbial profiling based on 16S rRNA gene amplicons
Volume 5 - Page e2836
Ilias Lagkouvardos, Sandra E. Fischer, Neeraj Kumar, Thomas Clavel

The importance of 16S rRNA gene amplicon profiles for understanding the influence of microbes in a variety of environments coupled with the steep reduction in sequencing costs led to a surge of microbial sequencing projects. The expanding crowd of scientists and clinicians wanting to make use of sequencing datasets can choose among a range of multipurpose software platforms, the use of which can be intimidating for non-expert users. Among available pipeline options for high-throughput 16S rRNA gene analysis, the R programming language and software environment for statistical computing stands out for its power and increased flexibility, and the possibility to adhere to most recent best practices and to adjust to individual project needs. Here we present the Rhea pipeline, a set of R scripts that encode a series of well-documented choices for the downstream analysis of Operational Taxonomic Units (OTUs) tables, including normalization steps, alpha- and beta-diversity analysis, taxonomic composition, statistical comparisons, and calculation of correlations. Rhea is primarily a straightforward starting point for beginners, but can also be a framework for advanced users who can modify and expand the tool. As the community standards evolve, Rhea will adapt to always represent the current state-of-the-art in microbial profiles analysis in the clear and comprehensive way allowed by the R language. Rhea scripts and documentation are freely available at https://lagkouvardos.github.io/Rhea.

Leopard (Panthera pardus) status, distribution, and the research efforts across its range
Volume 4 - Page e1974
Andrew P. Jacobson, Peter Gerngross, Joseph Lemeris, Rebecca F. Schoonover, Corey Anco, Christine Breitenmoser‐Würsten, Sarah M. Durant, Mohammad S. Farhadinia, Philipp Henschel, Jan F. Kamler, Alice Laguardia, Susana Rostro‐García, Andrew Stein, Luke Dollar

The leopard’s (Panthera pardus) broad geographic range, remarkable adaptability, and secretive nature have contributed to a misconception that this species might not be severely threatened across its range. We find that not only are several subspecies and regional populations critically endangered but also the overall range loss is greater than the average for terrestrial large carnivores. To assess the leopard’s status, we compile 6,000 records at 2,500 locations from over 1,300 sources on its historic (post 1750) and current distribution. We map the species across Africa and Asia, delineating areas where the species is confirmed present, is possibly present, is possibly extinct or is almost certainly extinct. The leopard now occupies 25–37% of its historic range, but this obscures important differences between subspecies. Of the nine recognized subspecies, three (P. p. pardus, fusca, and saxicolor) account for 97% of the leopard’s extant range while another three (P. p. orientalis, nimr, and japonensis) have each lost as much as 98% of their historic range. Isolation, small patch sizes, and few remaining patches further threaten the six subspecies that each have less than 100,000 km² of extant range. Approximately 17% of extant leopard range is protected, although some endangered subspecies have far less. We found that while leopard research was increasing, research effort was primarily on the subspecies with the most remaining range whereas subspecies that are most in need of urgent attention were neglected.

De novo clustering methods outperform reference-based methods for assigning 16S rRNA gene sequences to operational taxonomic units
Volume 3 - Page e1487
Sarah L. Westcott, Patrick D. Schloss

Background. 16S rRNA gene sequences are routinely assigned to operational taxonomic units (OTUs) that are then used to analyze complex microbial communities. A number of methods have been employed to carry out the assignment of 16S rRNA gene sequences to OTUs leading to confusion over which method is optimal. A recent study suggested that a clustering method should be selected based on its ability to generate stable OTU assignments that do not change as additional sequences are added to the dataset. In contrast, we contend that the quality of the OTU assignments, the ability of the method to properly represent the distances between the sequences, is more important.

Methods. Our analysis implemented six de novo clustering algorithms, including single linkage, complete linkage, average linkage, abundance-based greedy clustering, distance-based greedy clustering, and Swarm, as well as the open and closed-reference methods. Using two previously published datasets, we used the Matthews correlation coefficient (MCC) to assess the stability and quality of OTU assignments.
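For readers unfamiliar with the MCC, the sketch below shows how it can be computed for a set of OTU assignments. The formula itself is standard; how the confusion counts are obtained is an assumption made here for illustration, since the abstract does not spell it out: each pair of sequences is treated as a true positive when the two sequences are within a distance threshold and share an OTU, and so on. The function names, the distance dictionary layout, and the 0.03 threshold are hypothetical.

```python
import math
from itertools import combinations


def matthews_correlation(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Convention: return 0 when any marginal total is zero.
        return 0.0
    return (tp * tn - fp * fn) / denominator


def pair_confusion_counts(otu_of, distance, threshold):
    """Count sequence pairs by agreement between distance and OTU membership.

    otu_of maps sequence name to OTU label; distance maps a sorted
    (name_a, name_b) tuple to the pairwise distance (assumed layout).
    """
    tp = tn = fp = fn = 0
    for a, b in combinations(sorted(otu_of), 2):
        close = distance[(a, b)] <= threshold
        together = otu_of[a] == otu_of[b]
        if close and together:
            tp += 1
        elif not close and not together:
            tn += 1
        elif together:
            fp += 1  # same OTU although the pair is farther apart than the threshold
        else:
            fn += 1  # different OTUs although the pair is within the threshold
    return tp, tn, fp, fn


if __name__ == "__main__":
    otus = {"s1": 1, "s2": 1, "s3": 2, "s4": 2}
    distances = {
        ("s1", "s2"): 0.01, ("s1", "s3"): 0.10, ("s1", "s4"): 0.02,
        ("s2", "s3"): 0.20, ("s2", "s4"): 0.20, ("s3", "s4"): 0.05,
    }
    counts = pair_confusion_counts(otus, distances, threshold=0.03)
    print(counts, matthews_correlation(*counts))
```

An MCC of 1 indicates that OTU membership agrees perfectly with the distance threshold for every pair, while values near 0 indicate agreement no better than chance.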

Results. The stability of OTU assignments did not reflect the quality of the assignments. Depending on the dataset being analyzed, the average linkage and the distance and abundance-based greedy clustering methods generated OTUs that were more likely to represent the actual distances between sequences than the open and closed-reference methods. We also demonstrated that for the greedy algorithms VSEARCH produced assignments that were comparable to those produced by USEARCH making VSEARCH a viable free and open source alternative to USEARCH. Further interrogation of the reference-based methods indicated that when USEARCH or VSEARCH were used to identify the closest reference, the OTU assignments were sensitive to the order of the reference sequences because the reference sequences can be identical over the region being considered. More troubling was the observation that while both USEARCH and VSEARCH have a high level of sensitivity to detect reference sequences, the specificity of those matches was poor relative to the true best match.

Discussion. Our analysis calls into question the quality and stability of OTU assignments generated by the open and closed-reference methods as implemented in the current version of QIIME. This study demonstrates that de novo methods are the optimal method of assigning sequences into OTUs and that the quality of these assignments needs to be assessed for multiple methods to identify the optimal clustering method for a particular dataset.