Structural annotation of complete genomes, including CRISPR-Cas systems, and evaluation of selective pressure in pangenomes

Rubio Valle, Alejandro

Supervised by:

Antonio J. Pérez-Pulido (Director)
Juan Jiménez (Co-director)

Defense university: Universidad Pablo de Olavide

Defense date: 24 May 2024

Type: Thesis

Teseo: 834248

Abstract

Genomic sequences began to be determined in the mid-20th century, and sequencing techniques have evolved continuously since then. This has led to a paradigm shift in genomic research in this century, as the new technologies offer unprecedented capabilities to analyze DNA and RNA molecules in a cost-effective, high-throughput manner. It is now possible to sequence millions of DNA fragments rapidly, providing a complete view of genome structure, genetic variation, gene expression profiles, and epigenetic modifications. The genomics era is generating an ever-growing number of biological sequences, stored in various databases and amounting to more and more terabytes of information. Annotation is therefore crucial to making biological sense of the thousands of sequences determined daily. While gene finders and predictors have traditionally focused on protein-coding sequences, new bioinformatics algorithms based on different strategies have been developed. One example is the AnABlast tool, which detects evolutionary fingerprints from the accumulation of small, low-scoring alignments that are typically discarded in evolutionary analyses.

This thesis was carried out in that context, with the main objective of improving the structural annotation of complete bacterial genomes. Bacteria of clinical interest were used as model organisms. In addition, the CRISPR-Cas systems of these bacteria were analyzed as examples of acquired immunity systems, which also need to be annotated and are found in 40% of bacterial genomes.

In the first stage, the AnABlast algorithm was used to annotate the genome of the opportunistic bacterium Acinetobacter baumannii, which causes infections that are highly resistant to conventional antibiotics. The analysis yielded over 40 candidates for new coding regions or fossil regions. These candidates were then subjected to a battery of methods to associate functional information with them. Although little functional annotation was obtained, the candidates were highly conserved in A. baumannii, and many showed gene expression. This suggests that the candidates could be related to the virulence of the bacterium, especially considering the terms associated with membrane proteins and oxidoreductases.

Next, we tested whether studying selection pressure on the complete set of genes of a bacterial species, known as its pangenome, could improve genome annotation and validate potential new coding regions. Most genes in a pangenome are expected to be under purifying selection, which ensures the conservation of their function. This is not the case, however, for genes involved in virulence, which need to escape the host immune system, or for spurious sequences, which do not need to retain their coding potential. To develop an automated protocol based on this measure, we used approximately 180 closed genomes of the bacterium Helicobacter pylori, which colonizes the gastric mucosa of half of the world's population and has a small number of genes. After obtaining the pangenome, we calculated the evolutionary pressure for each gene and found that 85% of them were subject to purifying selection. As anticipated, genes under positive selection were associated with membrane proteins that interact with host tissues. Additionally, many of these genes encoded spurious proteins, suggesting that they could be false positives detected by gene predictors. These results confirm the usefulness of this type of analysis for validating gene predictions and functionally characterizing proteins in whole bacterial genomes.

After being designed and tested, the protocol was applied to a more complete pangenome, and selection pressure was estimated for 19,271 genes of the A. baumannii pangenome. Once again, most of the genes appeared to be subject to negative (purifying) selection. However, 23% of the genes showed values compatible with positive selection; these were uncharacterized genes or genes needed to evade the host defense system. We also evaluated whether measuring selection pressure could help detect sequencing errors by comparing the official annotations of two versions of the A. baumannii ATCC 17978 genome. The more recent sequence contained no fragmented genes, and genes free of the sequencing errors that cause premature stop codons showed better selection pressure values. Finally, we validated the candidates previously identified by AnABlast, which obtained good selection pressure values.

The protocol was subsequently applied to a subset of SARS-CoV-2 coronavirus genomes. This virus has caused a global health crisis with numerous deaths in recent years. Calculating selection pressure on its sequence revealed positive selection in genes coding for surface proteins, indicating a high rate of evolution during the short life of this virus. In addition, where genes with overlapping reading frames are present, the sequence shared by the two genes was found to be under stronger purifying selection than the average of the non-overlapping regions of the main gene.

While applying AnABlast to the genome of A. baumannii, we analyzed the regions containing the CRISPR-Cas system. These regions presented an atypical peak morphology, with a large number of alignments in the repeats that make up this type of system. Studying these alignments showed that the UniProtKB reference protein database contains spurious sequences translated from DNA containing clustered regularly interspaced short palindromic repeats, similar to those present in CRISPR-Cas systems. It was also found that the spacers, the DNA fragments of CRISPR-Cas systems that prevent viral infection of the bacteria, had almost no alignments. As expected, approximately 80% of the spacers had no clear relationship to any known sequence, a phenomenon known as CRISPR dark matter. By analyzing the spacers of tens of thousands of genomes from six bacterial species, it was possible to reduce this dark matter to as little as 15% in some species. Furthermore, genomes carrying CRISPR-Cas systems were observed to also carry a specific set of membrane proteins. These results suggest that bacteria with specific membrane proteins that act as phage receptors, and that enhance their competitive ability in their environment, are compelled to acquire CRISPR-Cas defense systems to resist infection.

In summary, this thesis improved gene annotation in complete genomes and classified pangenome genes based on selection pressure. This led to new findings in the functional annotation and characterization of CRISPR-Cas systems in pathogenic bacterial genomes, using new bioinformatics tools.
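
The classification of pangenome genes by selection pressure summarized above can be illustrated with a minimal sketch. It assumes that selection pressure is expressed as a dN/dS ratio (non-synonymous over synonymous substitution rate), with values below 1 read as purifying selection and values above 1 as positive selection. The abstract names neither the estimator nor a threshold, so the ratio, the cutoff of 1, the function names, and the gene values below are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def classify_selection(omega, neutral=1.0):
    """Label a gene from its dN/dS ratio (assumed estimator, assumed cutoff)."""
    if omega < neutral:
        return "purifying"
    if omega > neutral:
        return "positive"
    return "neutral"


def summarize(pangenome):
    """Return the fraction of genes that fall into each selection class."""
    counts = Counter(classify_selection(w) for w in pangenome.values())
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Hypothetical gene names and ratios, for illustration only.
example = {"geneA": 0.12, "geneB": 0.35, "geneC": 1.80, "geneD": 0.05}
fractions = summarize(example)
print(fractions)
```

A summary of this kind is what makes figures such as "85% of genes under purifying selection" comparable across species; in practice the ratios would come from codon-aware alignments of each gene family rather than from a hand-written dictionary.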