Statistics, bioinformatics and Artificial Intelligence

By Adriana López-Doriga, Chief Data Quality Nennisiwok

When people hear the word statistics, they usually tell me it was the subject they dreaded most at university and ask how I ever came to study such a thing. The truth is that it was by chance. I was won over by a talk at the faculty's open day, where we were told about the many areas in which statistics is applied, and I was fascinated by the meteorological models and the predictions they could make. I should admit that the obsession with the temperature at every place and time of day runs in my family. In the end, though, life has led me to study and use statistics in another field: bioinformatics.

Bioinformatics, defined as the application of computational technologies and statistics to the management and analysis of biological data, is used, among other things, to predict protein structure, to predict the function of certain genes, in evolutionary studies, in the discovery of new therapies, in vaccine development, and in the handling of large amounts of genomic data.

Each of these applications is a broad topic in itself, so in this short article I will focus on how bioinformatic algorithms detect variants within a gene, and how their output feeds Artificial Intelligence (AI) algorithms, especially deep learning, to enable personalized medicine.

I will outline the main steps in the detection of genetic variants, which involve many statistical concepts, and show how the result allows AI to be applied to help in medicine.

I will walk through the process using the example of a cancer patient, because this is the research field where I have the most experience. I stress again that this is a very broad scientific area and a very complex disease, so here I only intend to give a general idea, leaving out some relevant details, so that it is easier to follow and reaches a wider audience.

Let’s put ourselves in context: once a patient has been diagnosed with lung cancer, for example, a treatment is prescribed according to established protocols and the characteristics of the disease. In some cases, however, the oncologist has doubts or the prescribed treatment does not produce the expected results, so it is decided to sequence the DNA of the tumor. Currently, the most common approach is to sequence a gene panel: a set of genes that have been associated with the diagnosis or prognosis of certain types of cancer and for which treatments are available if a mutation is found. Once the DNA sample has been prepared, which requires meticulous work by specialized laboratory technicians, it is run on one of the sequencers on the market (Illumina, for example, offers several massively parallel sequencers, and centers use one or another depending on their characteristics and demand). When sequencing ends, millions of images have been obtained that correspond to the nucleotides (A, T, C, G) in each of the reads (short fragments of the sequenced genes). This is where bioinformatics algorithms come in to continue the analysis.

The main steps are:

1. Reading the images, with their different colors and intensities, into base calls (BCL format) and converting them to FASTQ format (a text format with the nucleotides of each read and their qualities). In this step, the role of statistics is to compute average intensities per sequencing cycle and phase, so that each base is reported with the highest possible precision. A quality score is also computed for each base, which will be key in the subsequent steps.

2. Alignment of the reads (short pieces of DNA) to the reference genome. This step is critical and computationally expensive. The algorithms are becoming more precise, but it is important to preprocess the data and choose appropriate parameters. In this step, each candidate alignment of a read receives a match score, and the statistical significance of that score is expressed as a probability, comparable to a p-value, that the read has been placed incorrectly (most aligners report it as a mapping quality). These probabilities are key in determining the final alignment of a read.

3. Variant detection. This step consists of detecting the differences between the reference genome and the reads of the sequenced sample. It is also complex, since whether a variant is considered valid depends on the base and alignment qualities from the previous steps. Likewise, each variant is assigned a probability that it is real, calculated mainly by Bayesian inference, although a large number of methods and algorithms exist.

4. Annotation of variants. This step consists of annotating the position and the nucleotide change(s) detected in the corresponding gene, and predicting whether the change can affect the protein the gene encodes and, therefore, whether it can have an impact on the development of the tumor and be a clinically relevant therapeutic target.
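To make the statistics behind steps 1 and 3 a little more concrete, here is a minimal Python sketch. It assumes the standard Phred+33 encoding of FASTQ qualities, in which a quality Q corresponds to an error probability of 10^(-Q/10), and a deliberately simplified diploid genotype model. The function names and the prior values are illustrative assumptions of mine, not those of any particular variant caller; real tumor callers also account for arbitrary allele fractions, alignment quality and much more.

```python
import math


def phred_to_error_prob(quality_char, offset=33):
    """Convert one FASTQ quality character (Phred+33 assumed) to an error probability."""
    q = ord(quality_char) - offset
    return 10 ** (-q / 10)


def genotype_posteriors(bases, quals, ref, alt, priors=None):
    """Posterior probability of each genotype at one position, via Bayes' rule.

    bases: bases observed at this position, one character per read.
    quals: the matching FASTQ quality characters.
    priors: illustrative prior probabilities (an assumption, not real data).
    """
    if priors is None:
        priors = {"ref/ref": 0.998, "ref/alt": 0.001, "alt/alt": 0.001}
    # Expected fraction of reads carrying the alternative allele per genotype.
    alt_fraction = {"ref/ref": 0.0, "ref/alt": 0.5, "alt/alt": 1.0}

    log_post = {}
    for genotype, f in alt_fraction.items():
        log_p = math.log(priors[genotype])
        for base, qual in zip(bases, quals):
            # Clamp at 0.75: a base with no information is a 1-in-4 guess.
            e = min(phred_to_error_prob(qual), 0.75)
            if base == ref:
                p = (1 - f) * (1 - e) + f * (e / 3)
            elif base == alt:
                p = f * (1 - e) + (1 - f) * (e / 3)
            else:
                # A third base can only be explained by a sequencing error.
                p = e / 3
            log_p += math.log(p)
        log_post[genotype] = log_p

    # Normalize in log space to avoid numerical underflow.
    top = max(log_post.values())
    weights = {g: math.exp(v - top) for g, v in log_post.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


if __name__ == "__main__":
    # Hypothetical pileup: ten reads covering one position, all with quality 'I' (Q40).
    posteriors = genotype_posteriors("CCCCCTTTTT", "IIIIIIIIII", ref="C", alt="T")
    for genotype, prob in posteriors.items():
        print(f"{genotype}: {prob:.6f}")
```

The point of the sketch is the chain of dependence: the base qualities from step 1 enter the likelihood of every read, so low-quality bases carry little weight when deciding whether a variant is real.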

It is from this point, once the variants have been annotated, that Artificial Intelligence begins to play an important role: it predicts the pathogenicity of the variant and the treatment likely to be most beneficial for the patient, based on all the cases studied, their molecular and clinical characteristics, and all retrospective knowledge. Artificial Intelligence allows for more beneficial and cost-effective personalized medicine.

The outcome of the treatment is then fed back into the databases to enrich the models and make the predictions increasingly precise.

If you have read this far, I hope I have conveyed an overview of the role of statistics in the bioinformatic algorithms that detect genetic variants in DNA, and of how Artificial Intelligence then comes in to deliver quality personalized medicine.

*Figure 1 represents the main steps in the detection of genetic variants.

Nennisiwok AI Lab Blog
