18 Statistical Methods
Statistical methods are a far larger subject than can be covered in this handbook. Our team uses a wide variety of methods that differ by project and change via advances in the field. Nonetheless, here are some methods that we favor when they are appropriate.
18.1 Multivariate Regression
- We like Generalized Linear Models. Many of us approach these from a Bayesian perspective, and build and fit these models using Stan. A great introduction to this type of modeling is found in Statistical Rethinking, by Richard McElreath which we have a couple of copies of. Richard also has accompanying lectures for this book online as YouTube videos.
- For nonlinear models, we like Generalized Additive Models, for which we use the mgcv R package. Noam has a bunch of materials on this, including slides and exercises from a workshop he co-teaches, and the seminal text is Generalized Additive Models in R, by Simon Wood. (Example - Host-Pathogen Phylogeny Project: Paper, GitHub)
- When a multivariate analysis is primarily about prediction, and less about variable inference, machine-learning methods such as boosted regression trees are useful. We make use of these through the dismo or xgboost packages. (Example - Hotspots 2: Paper, GitHub)
18.2 Networks
- We like networks for both descriptive visualizations and quantitative analyses. R is full of packages to achieve both of these goals, check out ggnetwork and bipartite.
- We assess individual components within networks looking for high-impact individuals due to their high degree, centrality, or ability to bridge different communities. For a tutorial on basic metrics check out Randi Griffin’s tutorial on primate social networks which utilizes the igraph package.
- Often, our networks are bipartite, meaning the networks shows interactions between two type of nodes. For us, these are usually viruses and hosts. You can look at example networks in two PREDICT publications: Figure 4 in Anthony et al. 2017 and Figure 3 in Willoughby et al. 2017.
- To assess overall network structure and identify communities, we measure the modularity. The modularity of a network model is a type of network partitioning into subgroups or modules. This is an alternative to clustering alogrithms. For bipartite networks, we like the Barber’s modularity (Q), calculated through the lpbrim package. Alternatives modularity algorithms include NetCarto which uses a simulated annealing (SA) algorithm or using a map equation (ME) algorithm.
18.3 Bioinformatics and Sequence Analysis
- Bioconductor is a comprehensive ecosystem of R packages focused on sequence analyses. Typing your general topic of interest into their searchable software table should at least provide an introduction to relevant software.
- In general, best practices in bioinformatics are changing rapidly, so it is difficult to recommend particular procedures. However, the journal Nature Protocols covers cutting-edge methods, ranging from the lab to the laptop, in great depth. Many articles include step-by-step instructions to complete a given analysis. For a scattershot introduction to high-quality bioinformatics software and relevant applications, it might be worth checking out the Bedford, Pachter, and Patro lab websites. Note that many contemporary bioinformatics tools are accessed through the command line rather than R.
- For a general overview of RNA sequencing analysis (other bioinformatics pipelines have strong similarities), the Simple Fool’s Guide from the Palumbi lab is a great learning resource.