21 Sequences alignment

21.1 Heredity

Today, we are capable of reconstructing kinship relationships between organisms through bioinformatics analysis of biological sequences, particularly proteins, and DNA. This is possible thanks to the principle of heredity. The biological sequences are transmitted from one generation to the next through sexual and asexual reproduction. From one generation to the next, the sequences accumulate variation as a result of stochastic or selective processes. Despite this variation, sequences from two closely related organisms are expected to be more similar to each other than sequences from two distantly related organisms. Between two organisms with a distant relationship, more time has elapsed since the last shared ancestor, accumulating more variation. This is the principle on which the search for homology between sequences is based, two DNA or protein sequences are homologous if they share ancestry. This means that the similarity between these sequences is the product of sharing the same ancestor placed somewhere in the phylogenetic tree of life. The higher the sequence similarity, the more recent the common ancestor between the sequences is expected to be. However, the identification of a hit with a high similarity is not enough to determine the homology of a sequence; it is also necessary to model the evolutionary process through phylogenetic tools. This chapter will address the methodological principles of pairwise alignments, the alignment between two biological sequences. Multiple sequence alignment is the topic of the next chapter. Before introducing pairwise alignment, I would like to introduce some homology concepts that are essential to the application and interpretation of alignments and phylogenies.

21.2 Orthologs and paralogs

Homology relationships between sequences are defined by two important events in the evolution of species: duplication and speciation. Two sequences (genes or proteins) are orthologous if they are the product of speciation. Two sequences are paralogs if they are the products of duplication events. When two different organisms present sequences that are the product of duplication events prior to speciation, these sequences are out-paralogs. When duplication occurs after the speciation process, the sequences are called in-paralogs.

21.3 BLAST

21.3.1 Conda Installation

One of the easy ways to install bioinformatics software is with Conda. Download miniconda for ubuntu (If you have another OS choose the appropriate file)

https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

Set the correct authorizations

chmod +x Miniconda3-latest-Linux-x86_64.sh

Install

bash Miniconda3-latest-Linux-x86_64.sh

Export path

export PATH="$HOME/miniconda3/bin:$PATH"

Reinitialize the session.

21.4 Install BLAST+ with Conda

conda install -c bioconda blast=2.12.0

21.5 Install database reference sequences

The first step is to create the directories where the databases will be installed. For example, in your home directory:


mkdir NCBI

cd NCBI

Create two subdirectories, one for nucleic acid sequences and one for protein sequences.


mkdir nt

mkdir nr

21.5.1 There are different ways to install the pre-formatted BLAST databases:

21.5.1.1 The ‘wget’ command

Download NCBI nucleotide sequence database


cd nt

wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.??.tar.gz

Uncompress the database files. You will find the uncompress.sh script in the scripts directory of this repository


./uncompress.sh

Download NCBI amino acid sequence database (In the nr directory)

wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr.??.tar.gz
./uncompress.sh

21.5.1.2 The update_blastdb.pl script.

BLAST+ offers the update_blastdb.pl script that allows you to download the NCBI databases. This script uncompress the files for you.

Obtain information from the available databases to download


update_blastdb.pl --showall [*]

Download the nucleotide sequence database (NCBI_DB/nt directory)


update_blastdb.pl --decompress nt [*]

Download the amino acid database (NCBI_DB/nr directory)


update_blastdb.pl --decompress nr [*]

21.6 Pairwise sequence alignment

21.6.1 BLASTn

Here is an example of alignment with BLASTn. Remember that BLAST offers different softwares to perform alignments according to the type of sequence and the reference database (BLAST software). BLASTn involves nucleotide-to-nucleotide alignments, this means that both the query sequences and the reference database are nucleotide sequences.


blastn -db /home/alexarmero2022/shared/NCBI/nt/nt   -evalue 0.0000000001     -num_threads 10  -outfmt '6 std qcovs slen' -query  test.fasta  >    testout.csv

In the last command line there are several important points to keep in mind:

  1. Specify the full directory path to the NCBI database. In this example I’m using my path (/home/alexarmero2022/shared/NCBI/nt), replace it with the appropriate directory path on your system.

  2. The -db option expects the name of a database. It is for this reason that in addition to the path, it is also necessary to give the name of the database in the previous example nt.

  3. The -num_threads option refers to the number of cores or logic CPUs that the user wants to assign to the task. Specify it according to availability in your system.

  4. The results of an alignment can be reported in different types of formats (-outfmt). In the example, the tabular format (6) was used; this format is easy to manipulate. In this example the standard alignment information (std) was requested to be printed, but also some other information such as the alignment coverage in the query sequence (qcovs) and the length of the target sequence (slen). Tabular format options.

21.6.2 BLASTp

Here an example of protein-protein alignment with BLASTp. Notice that the path to the database changed; now it points to the nr directory, where the amino acid sequences were installed. In the same way the name of the database also changed (nr). The sequences in the test.fasta file must necessarily be amino acid sequences.


blastp -db /home/alexarmero2022/shared/NCBI/nr/nr   -evalue 0.0000000001     -num_threads 10  -outfmt '6 std qcovs slen' -query  test.fasta  >    testout.csv

21.6.3 Configuration files and environmental variables

BLAST+ allows you to configure the blast runs. Configuration can be done through a .ncbirc file. In the scripts directory of this directory, there is an example of this file. Once the variables of interest have been modified, place this file in your home directory. For example, you could indicate where your NCBI database is located:


; Start the section for BLAST configuration
[BLAST]
; Specifies the path where BLAST databases are installed
BLASTDB=/home/alexarmero2022/shared/NCBI/
    
    

In this last example we modify the variable BLASTDB and we assign the directory path to the NCBI databases. Note, that is the path we use in the blastn and blastp alignment examples. With this modification the blastn command line would subtly change:


blastn -db nt/nt   -evalue 0.0000000001     -num_threads 10  -outfmt '6 std qcovs slen' -query  test.fasta  >    testout.csv

You could also use environmental variables. For example from your command line, you could write:

BLASTDB=/home/alexarmero2022/shared/NCBI/
    

There are several other variables that can modify.

This last line is equivalent to using the .ncbirc file.

Note. The .ncbirc is a hidden file. To be able to observe it in your directory, execute the following line:

ls -ad .*
    

21.7 BLAST with high-throughput data

BLAST is perhaps the best software in the identification of homology relationships between biological sequences. The sensitivity of this tool has a cost both in the use of physical computing resources and in processing time. These limitations are particularly evident when are used NCBI databases. Several efforts have been made to use high performance computing (HPC) resources to efficiently run BLAST on large datasets. Here is one such tool: Crocroblast. This software is presented as an example, and a starting point for higher performance strategies. Crocroblast determines what is the optimal file size to perform the alignment according to the HPC physical capabilities. In this way this software divides a large fasta file into files with sizes optimized to the available system resources. This is an easy to use tool; we recommend a good reading of the CrocroBLAST documentation. The following example shows how to perform a CrocoBLAST blastn alignment.

21.7.1 Installation

Get the right version for your operating system. Install the software in an appropriate directory

mkdir Crocroblast
cd Crocroblast
unzip CrocoBLAST_linux_64.zip
chmod +x crocoblast

21.7.2 Add sequence database

In order to use CrocroBLAST you need to add the sequence database. Please adapt the following line according to the directory path where your database is located:

./crocoblast -add_database --formatted_db /home/alexarmero2022/shared/NCBI/nt/nt.00.nsq

21.7.3 Add the job to the queue

CrocroBLAST handles job queues; this software has different functions to stop, pause, delete and run a job. The next line adds a job to the queue:

./crocoblast -add_to_queue blastn nt data/test.fa output_test_4  --blast_options -outfmt 6

The following line will allow you to check if the job was added to the queue

./crocoblast -status

21.7.4 Run the job

./crocoblast -run --num_threads 4

CrocoBLAST shows messages with the time remaining to complete the process. We hope that this example motivates you to carefully read the CrocoBLAST wiki to efficiently exploit the different possibilities of this nice tool.