Abstract
Bacteria are ubiquitous organisms; they can be found wherever life is possible. Distinct bacteria are able to cope with highly diverse lifestyles; for instance, they can be classified as host associated or free living. Therefore, these organisms must possess a large and varied genomic arsenal to withstand different environmental conditions. To investigate the genetic repertoire that might be associated with a given lifestyle, we developed two approaches. Both methodologies combine evolutionary sequence analysis with statistical learning methods (Random Forest with feature selection, model tuning and robustness analysis).
Initially, we searched for homologous gene sets that could distinguish Actinobacterial pathogenicity classes. We included 240 Actinobacteria classified into four pathogenicity classes: human pathogens (HP), broad-spectrum pathogens (BP), opportunistic pathogens (OP), and non-pathogens (NP). Essentially, we found homologous gene sets that computationally distinguish pathogens from non-pathogens. We further show a clear limit in differentiating opportunistic pathogens from both non-pathogens and pathogens. Human pathogens also may not be distinguishable from bacteria annotated as broad-spectrum pathogens on the basis of a small set of orthologous genes alone, as many human pathogens could target a broad range of mammals but have not been annotated accordingly.
Finally, to facilitate the identification of genomic features that might influence bacterial adaptation to a specific niche, we introduce LifeStyle-Specific-Islands (LiSSI). The LiSSI pipeline is an expansion of our previous strategy. In summary, our strategy aims to identify conserved consecutive homology sequences (islands) in genomes and to identify the most discriminant islands for each lifestyle. To illustrate the main functionalities, we expanded our search from exclusively pathogenic classes to include tolerance to atmospheric oxygen (aerobe, anaerobe, facultative) and habitat (soil and aquatic). Essentially, we found that islands seem to carry less weight in the classification performance. It seems that gene order is poorly conserved among bacterial species, which might make individual genes more useful as classifiers.
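The island idea can be illustrated with a minimal Python sketch. It is not the LiSSI implementation: it assumes each genome is already reduced to an ordered list of homologous-family identifiers, treats an island as a fixed-length run of `k` consecutive families found in at least `min_genomes` genomes, and ignores strand orientation. The function names, parameters, and toy genomes are all hypothetical.

```python
from collections import defaultdict


def find_islands(genomes, k=2, min_genomes=2):
    """Return runs of k consecutive homolog families shared by >= min_genomes genomes.

    `genomes` maps a genome name to its ordered list of family identifiers
    (an assumed, simplified input format).
    """
    seen_in = defaultdict(set)
    for name, families in genomes.items():
        for i in range(len(families) - k + 1):
            window = tuple(families[i:i + k])
            # Treat a run and its reverse as the same island (strand-agnostic).
            key = min(window, tuple(reversed(window)))
            seen_in[key].add(name)
    return {isl: names for isl, names in seen_in.items() if len(names) >= min_genomes}


def presence_matrix(genomes, islands):
    """Build a genome-by-island 0/1 matrix usable as classifier input."""
    columns = sorted(islands)
    rows = {
        name: [int(name in islands[isl]) for isl in columns]
        for name in genomes
    }
    return columns, rows


# Hypothetical toy genomes, each an ordered list of homolog family IDs.
genomes = {
    "pathogen_1": ["F1", "F2", "F3", "F7"],
    "pathogen_2": ["F9", "F1", "F2", "F3"],
    "nonpathogen_1": ["F3", "F2", "F8", "F7"],
}

islands = find_islands(genomes, k=2, min_genomes=2)
columns, matrix = presence_matrix(genomes, islands)
for name, row in matrix.items():
    print(name, dict(zip(columns, row)))
```

The resulting presence/absence matrix is the kind of feature table that a classifier such as Random Forest could be trained on, with feature importance then indicating which islands discriminate best between lifestyles.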
In conclusion, we illustrate that even in the post-genome era and despite next-generation sequencing technology, our ability to efficiently deduce real-world conclusions, such as pathogenicity classification, remains quite limited. Further, we introduce LiSSI, a bioinformatics pipeline, in order to identify signature genes or islands (conserved consecutive homology sequences) that distinguish bacterial lifestyles.
| Original language | English |
|---|---|
| Awarding Institution | |
| Supervisors/Advisors | |
| Place of Publication | Odense |
| Publisher | |
| Publication status | Published - Jun 2016 |