Merging differential expression and Gene Ontology enrichment in a single plot

By Nicolas Gambardella

I recently came across the package GOplot by Wencke Walter. In particular, I liked the function GOBubble. However, I found it difficult to customise the plot; in particular, I wanted to colour the bubbles differently and to control the plotting area. So I took the idea and extended it. Many aspects of the plot can now be configured. It is a work in progress: not all features of GOBubble are implemented at the moment. For instance, we cannot separate the different branches of Gene Ontology or add a table listing the labelled terms. I also have a few ideas to make the plot more versatile. If you have suggestions, please tell me. The code and the example below can be found at

What we want to obtain at the end is the following plot:

The function plotGODESeq() takes two mandatory inputs: 1) a file containing Gene Ontology enrichment data, and 2) a file containing differential gene expression data. Note that the function works better if the dataset is limited, in particular the number of GO terms. It is useful for analysing the effect of a perturbation, chemical or genetic, or for comparing two cell types that are not too dissimilar. Comparing samples that exhibit several thousands of differentially expressed genes, resulting in thousands of enriched GO terms, will not only slow the function to a halt, it is also useless (GO enrichment should not be used in these conditions anyway; the results always show things like “neuronal transmission” enriched in neurons versus “immune process” enriched in leucocytes). A large variety of other arguments can be used to customise the plot, but none are mandatory.

To use the function, you need to source the script from wherever it is located; in this example, it is in the session directory. (I know, I should make a package out of the function. It is on my to-do list.)
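Assuming the script was saved as plotGODESeq.R (the file name here is a guess; use whatever you called it):

```r
# Load the plotGODESeq() function into the current session
source("plotGODESeq.R")
```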



The Gene Ontology enrichment data must be a data frame containing at least the following columns: ID – the identifier of the GO term; description – the description of the term; Enrich – the ratio of observed over expected genes annotated with the GO term; FDR – the False Discovery Rate (a.k.a. adjusted p-value), computed e.g. with the Benjamini-Hochberg correction; and genes – the list of observed genes annotated with the GO term. Any other column may be present; it will not be taken into account. The order of the columns does not matter. Here, we will load results coming from an analysis run on the WebGestalt server. Feel free to use whatever Gene Ontology enrichment tool you want, as long as the format of the input fits.

# load results from WebGestalt
goenrich_data <- read.table("GO-example.csv", header = TRUE, sep = ",")  # adjust header/sep to your export

# rename the columns to make them less weird 
# and compatible with the GOPlot package
colnames(goenrich_data)[
  colnames(goenrich_data) %in% c("geneset","R","OverlapGene_UserID")
] <- c("ID","Enrich","genes")  # assumes the three columns appear in this order

# remove commas from GO term descriptions, because they suck
goenrich_data$description <- gsub(',',"",goenrich_data$description)

The differential expression data must be a data frame in which the row names are gene symbols, from the same namespace as the genes column of the GO enrichment data above. In addition, one column must be named log2FoldChange, containing the quantitative difference in expression between the two conditions. Any other column may be present; it will not be taken into account. The order of the columns does not matter.

# Load results from DESeq2
deseq_data <- read.table("DESeq-example.csv", header = TRUE, sep = ",")  # adjust header/sep to your export

Now we can create the plot.


The y-axis is the negative log of the FDR (adjusted p-value). The x-axis is the zscore, which is, for a given GO term:

zscore = (n_up − n_down) / sqrt(n_up + n_down)

where n_up and n_down are the numbers of up- and down-regulated genes annotated with the term.

The genes associated with each GO term are taken from the GO enrichment input, while the up- or down-regulated status of each gene is taken from the differential expression input. The area of each bubble is proportional to the enrichment (number of observed genes divided by the number of expected genes). Scaling the area, rather than the radius, is the proper way of doing it, although, of course, the visual impact is smaller.
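As a sketch, here is how the zscore of a single term could be computed from the two inputs. This assumes the genes column is a “;”-separated string of gene symbols, as in WebGestalt output; adapt the separator to your tool:

```r
# zscore of one GO term: (up - down) / sqrt(up + down), where up and down are
# counted from the sign of log2FoldChange in the differential expression data
go_zscore <- function(genes, deseq) {
  symbols <- strsplit(genes, ";")[[1]]
  l2fc    <- deseq[symbols, "log2FoldChange"]
  up      <- sum(l2fc > 0, na.rm = TRUE)
  down    <- sum(l2fc < 0, na.rm = TRUE)
  (up - down) / sqrt(up + down)
}
```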

Choosing what to plot

The console output tells us that we plotted 1431 bubbles. That is not very pretty or informative… The first thing we can note is that we have a big mess at the bottom of the plot, which corresponds to the highest values of FDR. Let’s restrict ourselves to the most significant results by setting the argument maxFDR to 1e-8.

This is better; we now plot only 181 GO terms. Note the large number of terms aligned at the top of the plot. Those are terms with an FDR of 0. The y-axis being logarithmic, we plot them by setting their FDR to a tenth of the smallest non-zero value. GO over-representation results are often very redundant. We can use GOplot’s function reduce_overlap by setting the argument collapse to the proportion of genes that need to be identical for GO terms to be merged into one bubble. Let’s use collapse = 0.9 (GO terms are merged if 90% of their annotated genes are identical).

Now we only plot 62 bubbles, i.e., two-thirds of the terms are now “hidden”. Use this procedure with caution. Note how the plot now looks distorted towards one condition: more “green” terms have been hidden than “red” ones.
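Incidentally, the handling of FDR = 0 terms mentioned above can be sketched in a couple of lines (this mirrors what the function does internally):

```r
# A log axis cannot show FDR = 0; replace zeros with a tenth of the
# smallest non-zero FDR so those terms still appear at the top of the plot
fix_zero_fdr <- function(fdr) {
  fdr[fdr == 0] <- min(fdr[fdr > 0]) / 10
  fdr
}
```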

The colour used by default for the bubbles encodes the zscore, which is somewhat redundant with the x-axis. Also, the zscore only considers the number of genes up- or down-regulated; it does not take into account the amplitude of the change. By setting the argument color to l2fc, we can instead use the average fold change of all the genes annotated with the GO term.

Now we can see that while the proportion of genes annotated by GO:0006333 that are down-regulated is lower than for GO:0008380, the amplitude of their average down-regulation is larger.

WARNING: The current code does not work if the variable chosen to colour the bubbles, l2fc or zscore, does not contain both negative and positive values. Sometimes, the “collapsing” can cause this situation if there is an initial imbalance between zscores and/or l2fc. It is a bug, I know. On the to-do list…

Using GO identifiers is handy and terse, but since I do not know GO by heart, it makes the plot hard to interpret. We can use the full description of each term instead, by setting the argument label to description.

Customising the bubbles

The width of the labels can be modified by setting the argument wrap to the maximum number of characters (the default, used here, is 15). Depending on the spread of FDR and zscore values, bubble size can be an issue, with bubbles either overlapping too much or, on the contrary, being tiny. We can change that with the argument scale, which scales the radius of the bubbles. Let’s set it to 0.7, to decrease the size of each bubble by 30% (the radius, not the area!).

There is often a big crowd of terms at the bottom and centre of the plot. This is not so clear here, with the harsh FDR threshold, but look at the first plot of the post. These terms are generally the least interesting, since they have a lower significance (higher FDR) and a mild zscore. We can decide to label the bubbles only under a certain FDR with the argument maxFDRLab and/or above a certain absolute zscore with the argument minZscoreLab. Let’s fix them to 1e-12 and 2.5 respectively.

Finally, you are perhaps not too fond of the default colour scheme. This can be changed with the arguments lowCol, midCol, and highCol. Let’s set them to “deepskyblue4”, “#DDDDDD”, and “firebrick”.

Customising the plotting area

The first modifications my collaborators asked me to introduce were to centre the plot on a zscore of 0 and to add space around so they could annotate the plot. One can centre the plot by declaring centered = TRUE (the default is FALSE). Since our example is extremely skewed towards negative zscores, this would not be a good idea. However, adding some space on both sides will come in handy in the last step of beautification. We can do that by declaring extrawidth=3 (default is 1).

The legend position can be tuned with the arguments leghoffset and legvoffset; here we set them to -0.5 and 1.5. Putting everything together, the full call looks like this (the two data frames loaded above are passed as the first two arguments; check the script for their exact parameter names):

plotGODESeq(goenrich_data,
            deseq_data,
            maxFDR = 1e-8,
            collapse = 0.9,
            lowCol = "deepskyblue4",
            midCol = "#DDDDDD",
            highCol = "firebrick",
            label = "description",
            scale = 0.7,
            maxFDRLab = 1e-12,
            minZscoreLab = 2.5,
            extrawidth = 3,
            leghoffset = -0.5,
            legvoffset = 1.5,
            wrap = 15)

Now we can export an SVG version and play with the labels in Inkscape. This part is unfortunately the most demanding …


A vaccine effective in all subpopulations but apparently not in the entire population – Simpson’s paradox

By Nicolas Gambardella

The latest statistics from Israel and the UK on COVID-19 in vaccinated and unvaccinated populations are going viral. One of the main reasons for this success in some circles is that they apparently show that the vaccines against COVID-19 are no longer effective! This is, of course, not the case. While the circulating antibodies triggered by a vaccination course seem to decline with a half-life of about six months, the protection remains very strong against disease, mild or severe. The protection against infection also remains robust during the first months after vaccination, whatever the variant. What, then, could explain the apparently paradoxical result that people die from COVID-19 as frequently in vaccinated populations as in unvaccinated ones? Several factors might be involved. For instance, in most datasets used to compute effectiveness, unvaccinated pre-infected people are not removed. However, today I would like to highlight another reason, because I think it is a trap into which casual data analysts fall very frequently: Simpson’s paradox.

Simpson’s paradox is a situation where a trend present in several subpopulations disappears, or even reverses, when all those populations are pooled together. This is often due to hidden confounding variables. The situation is well illustrated in the following figure, obtained from Wikimedia Commons. While the correlation between Y and X is positive in each of the five subpopulations, this correlation becomes negative if we do not distinguish the subpopulations.

What about the vaccination against SARS-CoV-2? Jeffrey Morris explains on his blog the impact of Simpson’s paradox on the analysis of the Israeli data in a precise and enlightening manner, way better than I could. However, his excellent explanation is relatively long and detailed. So I thought I could give a short version here, with an imaginary, simplified, albeit realistic, population.

As discussed in a past post, the crucial data here is the age structure of the population. To simplify, we’ll take a pretty simple age pyramid, close to what we observe in developed countries, i.e., homogeneous with only a decrease at the top; here, 1 million people per decade, and 1 million for everyone over 80.

The first important variable is the vaccination rate. Because vaccination campaigns started with the elderly, and vaccine hesitancy strongly decreases with age, the vaccination rate is much lower in younger populations.

The second important variable is the disease’s lethality – the Infection Fatality Rate – for each age group. Here as well, the IFR is much lower in the younger group. And here lies the crux of the problem: rate of vaccination and IFR are not independent variables; both are linked to age.

Let’s assume that our vaccine has an absolute efficacy of 90%, and for simplicity, this efficacy does not change with age. The number of deaths in the unvaccinated population is:

Deaths unvaccinated = round(unvaccinated * IFR)

The “round” function is to avoid half-dead people. The number of deaths in the vaccinated population is:

Deaths vaccinated = round(vaccinated * IFR * 0.1)

where 0.1 = (100 – efficacy)/100

Now that we have the number of deaths in each of our populations, vaccinated or not, we can calculate the death rates, i.e. deaths/population, and compute the efficacy as:

(death rate unvaccinated – death rate vaccinated)/(death rate unvaccinated)*100

Unsurprisingly, the efficacy for all age groups is 90%. The 100% for the <20s comes from the fact that 0.04 deaths rounds to 0.

HOWEVER, if we merge all the age groups together, the efficacy completely disappears! Not only that, it seems that the vaccine actually increases the death rate!!! Being unvaccinated presents an efficacy of 32% against death!
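This whole construction can be reproduced in a few lines of R. The numbers below are invented for the sketch (two age groups only, more skewed than the post’s table); only the 90% efficacy comes from the text:

```r
# Two toy age groups; population sizes, vaccination rates, and IFRs are
# invented for this sketch, chosen so that age drives both variables
pop      <- c(young = 1e6, old = 1e6)
vax_rate <- c(young = 0.10, old = 0.95)  # vaccination much higher in the old
ifr      <- c(young = 1e-4, old = 1e-2)  # IFR much higher in the old
efficacy <- 0.9                          # 90% absolute efficacy, from the text

vaccinated   <- pop * vax_rate
unvaccinated <- pop - vaccinated

deaths_unvax <- round(unvaccinated * ifr)
deaths_vax   <- round(vaccinated * ifr * (1 - efficacy))

# Efficacy within each age group: 90% in both strata
per_group <- (deaths_unvax / unvaccinated - deaths_vax / vaccinated) /
             (deaths_unvax / unvaccinated) * 100

# Efficacy computed on the pooled population: the sign flips
pooled_unvax <- sum(deaths_unvax) / sum(unvaccinated)
pooled_vax   <- sum(deaths_vax)   / sum(vaccinated)
pooled       <- (pooled_unvax - pooled_vax) / pooled_unvax * 100
```

With these numbers, per_group is 90% in both strata while pooled is negative: aggregated naively, the vaccine appears to increase the death rate, which is exactly the artefact described above.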

This is, of course, an artefact (we know that; we created the dataset with an actual vaccine efficacy of 90%!). This example used vaccine efficacy. However, Simpson’s paradox awaits the casual data analyst around every corner. Confounding variables must be tracked down before doing any statistical analysis, and populations must be stratified accordingly.

Variability structure to assess dataset quality – the case of COVID-19 deaths

By Nicolas Gambardella

There are many discussions on classical and social media about the quality of datasets reporting deaths by COVID-19. Of course, depending on the density of healthcare systems and the reporting structures, the reported toll will represent a certain proportion of actual deaths (60% in Mexico, 30% in Russia, between 10 and 20% in India, according to the health authorities of these countries). Moreover, most countries maintain two tallies, one based on deaths within a certain period of a positive test for infection by SARS-CoV-2, and one based on death certificates mentioning COVID-19 as the cause of death. That said, both factors should affect the numbers proportionally and do not result from deliberate intervention. Now, can we detect whether datasets have been tampered with or even entirely made up?

One way to do so is to look at how the variability evolves over time and with absolute numbers. Below, I used the dataset from Our World in Data (as of 11 September 2021) to look at the reported COVID-19 death tolls for a specific set of countries. In most countries, the main source of variability is the reporting system; as such, it should be proportional to the daily deaths (basically, a fraction of the reports come late). On top of that, we should find an intrinsic variability, which should increase as the square root of the daily deaths. So, the relative variability should be higher outside the waves.
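As an illustration, here is a minimal base-R sketch of the rolling statistics used in this analysis (a 7-day window; the real plots use the Our World in Data daily death series):

```r
# Rolling 7-day statistic: returns NA until a full window is available
roll7 <- function(x, f) {
  sapply(seq_along(x), function(i) if (i < 7) NA_real_ else f(x[(i - 6):i]))
}

# For a vector of daily deaths, compute the 7-day mean, standard deviation,
# and coefficient of variation (sd / mean)
cv_profile <- function(deaths) {
  m <- roll7(deaths, mean)
  s <- roll7(deaths, sd)
  data.frame(mean7 = m, sd7 = s, cv = s / m)
}
```

Plotting sd7 against mean7 gives the linear relationship discussed below; its slope is the coefficient of variation.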

First, let’s look at the datasets from countries with well-developed and accurate healthcare systems. Below are plotted the standard deviation of the daily death count over seven days against the daily amount of deaths (averaged over seven days) in the United Kingdom, Brazil, France, and the United States.

Although we see that the variability of… the variability is larger in the USA and France, there is a clear linear relationship between the absolute daily number of deaths and its standard deviation. The Pearson correlation coefficient for the UK is 0.97 (Brazil = 0.93, France = 0.85, USA = 0.77). If we combine the four datasets, we can see that the relationship is remarkably similar in those four countries. The slope of the curve, representing the coefficient of variation, i.e., the standard deviation divided by the mean, throughout the scale, is: UK = 0.35, Brazil = 0.35, France = 0.48, USA = 0.26.

Some countries exhibit a different coefficient of variation, meaning a higher reproducibility of reporting. Iran’s reported deaths always looked very smooth to me. Indeed, the CV is 0.078, which would indicate a whopping 4.5-fold more precise reporting. Although I am certain that Iran’s healthcare system is excellent, this figure looks suspiciously low.

Where it becomes interesting is when the linear relationship is lost. Turkey’s daily death reports are also very smooth. However, the linearity between variability and death count is now mostly lost, with a standard deviation that remains almost constant no matter the absolute number of deaths. If I had to guess, I would say that the data is massaged, albeit by people who did not really think about the reasons underpinning the variability and the structure it should have to look natural.

And finally, we reach Russia. From the Russian statistics agency itself, we know that the government’s official death toll bears no relationship whatsoever to reality. What is interesting is that the people producing the daily reports went further than the Turkish ones and did not even try to produce realistic-looking data. On the contrary, the variability was smoothed out even more for the highest absolute death tolls, generating a ridiculous bridge-shaped curve.

Was this always the case? How did the coefficient of variation evolve since the beginning of the pandemic? Looking again at the UK and Brazil, we can see that the average CV stays pretty much steady over time, with increased variability between the big waves. We see nicely that the CV peaks and troughs alternate between Brazil and the UK, corresponding to the offset between their waves.

The situation is a bit different for Turkey and Russia. The Turkish dataset shows a CV collapsing after the first six months of the pandemic. And indeed, the daily death reporting between October 2020 and March 2021 is ridiculously smooth. However, it seems someone decided that it was a bit too much and started to add some noise (which was, unfortunately for them, not adequately scaled).

Russia followed the opposite path. While during the pandemic’s initial months the CV was on par with those of Western datasets, all that quickly stopped, and the CV collapsed. This trend culminated with the current preposterous death tolls, between 780 and 800 deaths every single day for the past two months. The Russian government is basically showing the world the numerical finger.

Scientists, do not make assumptions about your audience!

This is a post I could have written thirty years ago. The tendency of scientists (or any specialists, really) to write texts assuming a similar level of background knowledge in their audience has always been a curse. However, with the advent of open access and open data, the consequences have become more serious. Recently, in what is probably one of the worst communication exercises of the COVID-19 pandemic, the CDC published an online message ominously entitled:

“Lab Alert: Changes to CDC RT-PCR for SARS-CoV-2 Testing”

Of course, this text meant to target a particular audience, as specified on the web page:

“Audience: Individuals Performing COVID-19 Testing”

However, the text was accessible to everyone; including many people who could not properly understand it. What did this message say?

“After December 31, 2021, CDC will withdraw the request to the U.S. Food and Drug Administration (FDA) for Emergency Use Authorization (EUA) of the CDC 2019-Novel Coronavirus (2019-nCoV) Real-Time RT-PCR Diagnostic Panel, the assay first introduced in February 2020 for detection of SARS-CoV-2 only. CDC is providing this advance notice for clinical laboratories to have adequate time to select and implement one of the many FDA-authorized alternatives.”

This sent people already questioning the tests into overdrive. “We’ve always told you. PCR tests do not work. This entire pandemic is a lie. We’ve been termed conspiracy theorists, but we were right all this time.” The CDC message is currently circulated all over the social networks to demonstrate their point.

Of course, this is not at all what the CDC meant. The explanation comes in the subsequent paragraph.

“In preparation for this change, CDC recommends clinical laboratories and testing sites that have been using the CDC 2019-nCoV RT-PCR assay select and begin their transition to another FDA-authorized COVID-19 test. CDC encourages laboratories to consider adoption of a multiplexed method that can facilitate detection and differentiation of SARS-CoV-2 and influenza viruses. Such assays can facilitate continued testing for both influenza and SARS-CoV-2 and can save both time and resources as we head into influenza season.”

The CDC really means that rather than using separate tests to detect SARS-CoV-2 and influenza virus infections, the labs should use a single test that detects both simultaneously, hence the name “multiplex”.

I have to confess that it took me a couple of readings to properly understand what they meant. What did the CDC do wrong?

First, calling those messages “Lab Alert”. For any regular citizen fed by Stephen King’s The Stand and movies like Contagion, the words “Lab Alert” mean “Pay attention, this is an apocalypse-class message”. What about “New recommendation” or “Lab communication”?

Second, the CDC should not have assumed that everyone knew what the “CDC 2019-nCoV RT-PCR assay” was. Out there, people understood that the CDC was talking about all the RT-PCR assays meant to detect the presence of SARS-CoV-2, not just the specific test previously recommended by the CDC*.

Third, the authors should have clarified that “the many FDA-authorized alternatives” included other PCR tests, and the message was not meant to say that the CDC recommended ditching the RT-PCR tests altogether.

Finally, they should have clarified what a “multiplexed method” was. I received messages from people who believed a “multiplexed method” was an alternative to a PCR test, while it is just a PCR that detects several things simultaneously (in this example SARS-CoV-2 and flu viruses).

In conclusion, you can, of course, and should, think about your intended audience. However, you should not neglect the unintended audiences. This is more important than you think and not restricted to general communications. Whether a research article or a grant application, whatever scientific piece you write will reach three audience types. 

  • The first comprises the tiny circle sharing the same knowledge background, typically reviewers (if the editors do their job properly…). 
  • The second will be made up of the population at large, who will not understand a word, and frankly, are not interested in whatever you are babbling about.
  • The third is the dangerous one. It is made of people who have a certain scientific background, sufficient to broadly understand the context of your text, but who lack the advanced knowledge needed to precisely grasp your idea, its novelty, and its consequences. These people will read your text and believe they understood your points. The risk is that they did not. Misunderstanding your point might be worse than not understanding it.

It is always good to get your texts read by someone belonging to this third population before submitting them to journals or funding agencies.

*There is actually another very interesting story related to this topic when, at the beginning of the pandemic, many labs proposed to use their own PCR tests but could not because only the CDC-recommended test could be used, delaying the implementation of mass testing by many weeks.


Ages, vaccination and infections

By Nicolas Gambardella

How many times these days do we see the following comment on social media: “Most covid-19 cases are now in vaccinated people. This is the proof that vaccines don’t work”.

Not quite.

It all depends on the relative populations of vaccinated versus unvaccinated. In a previous post, I presented a summary of vaccine effectiveness on different SARS-CoV-2 variants. Each figure represented the global effectiveness. However, vaccination rates depend on age since most countries started to vaccinate the elderly first. So let’s see if we can be more precise.

Public Health England recently published its latest report on SARS-CoV-2 variants of concern and variants under investigation in England. It contains the details of infections by identified variant in vaccinated and unvaccinated people. Let’s focus on the Delta variant.

Whaaaat? In people over 50 years of age, only 976 cases in the unvaccinated, while 3953 people with one dose and 3546 fully vaccinated people were infected! Surely this vaccine does not offer any protection, right?

Let’s see if we can compute the vaccine effectiveness, shall we? For that, we first need the vaccination rate per age group. Fortunately, Public Health England publishes this every week. Since the table above reports cases up to June 21, we will use the vaccination data published on June 24, which includes vaccinations up to June 20. Of course, not all the Delta cases had appeared by June 20. However, most of them appeared in the past few months. Moreover, administration of the second dose has plateaued in the elderly population.

Then, we need to know how many people belong to each of those age groups. For that, we can use the 2020 population predicted by the Office for National Statistics based on the 2018 figures (the age pyramid shows percentages for each year, but we can download the actual numbers for each 5-year age group).

We can now compute for each age group how many people had two doses, only one dose, or are still unvaccinated (I sum up males and females).

Age      1 dose      2 doses     Unvaccinated
0-17      56230       56584      14033150
18-24    605991      726010       4318151
25-29    837303      728416       2924493
30-34   1587417      847650       2100138
35-39   1786373     1003080       1628239
40-44   1702290     1268632       1127652
45-49   1583028     1838131        890384
50-54    548632     3378858        690501
55-59    396658     3567857        549185
60-64    202651     3272462        387171
65-69     95785     2998587        268383
70-74     59031     3118291        191577
75-79     41180     2259096        111907
80+       84457     3159787        164372
Total   9587027    28223441      29385301

These numbers show 24118034 people over 50: 21754937 with two doses and 2363096 unvaccinated, tenfold more fully vaccinated people! Thus, the 3546 and 976 cases represent 0.0163% and 0.0413% of the respective populations. In other words, full vaccination offers 60.5% protection against the Delta variant.
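For readers who want to reproduce the arithmetic, here is a minimal Python sketch of the calculation (the variable names are mine; the numbers are the case counts from the PHE Delta table and the sums of the 50+ rows of the table above):

```python
# Delta cases in people aged 50+, from the PHE report
cases_vaccinated = 3546      # fully vaccinated (two doses)
cases_unvaccinated = 976     # unvaccinated

# Populations aged 50+, summed from the table above
pop_vaccinated = 21754937    # people with two doses
pop_unvaccinated = 2363096   # people still unvaccinated

# Attack rate: fraction of each population that was infected
attack_vaccinated = cases_vaccinated / pop_vaccinated        # ~0.0163 %
attack_unvaccinated = cases_unvaccinated / pop_unvaccinated  # ~0.0413 %

# Crude vaccine effectiveness = 1 - relative risk
effectiveness = 1 - attack_vaccinated / attack_unvaccinated
print(f"Effectiveness: {effectiveness:.1%}")  # Effectiveness: 60.5%
```

Plugging the under-50 case counts and populations into the same formula, one minus the relative risk, gives the 70.8% figure quoted in the next paragraph.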

The same calculation on the under-50s shows even better protection, at 70.8% (this, again, shows that we must vaccinate young people if we want to protect the older ones and get rid of this virus).

The better the vaccine coverage, the more cases will be observed in the vaccinated population. This does not mean the vaccine is not effective!

On vaccines and variants

By Nicolas Gambardella

Since the development of the first vaccines against SARS-CoV-2, I have been collecting data on their efficacy. This efficacy is continuously challenged by the appearance of variant viruses, i.e., novel strains carrying a characteristic group of mutations. With so many vaccines and so many variants, it becomes difficult to stay up to date. This problem is compounded by the abundance of publications presenting different types of evaluations. So, while keeping track of all the values and their confidence intervals is very important, I thought it would be good to have a simplified overview of the current situation.

The figure below represents the overall efficacy of the main vaccines against the main variants as visual percentages. The blue dots represent protected people who would have been infected without vaccination. The grey dots represent {vaccine, variant} pairs for which not enough data are available. These numbers represent protection from infection, not protection from disease or death (for which protection is probably higher). Moreover, they are obtained after the recommended dosing protocol for each vaccine. NB: in some cases, “Wuhan” means “none of the variants below”.

These data are the best and most reliable estimates as I write this post (updated 12 November 2021). I privileged real-world data over clinical trials, directly measured efficacy over efficacy inferred from neutralisation assays (where the serum of vaccinated people is used in vitro on viruses or recombinant proteins), and independent data over data provided by vaccine manufacturers. I omitted some authorised vaccines because of data scarcity (and low usage). Some of the data used to make the figure are known for their “peculiarity” and have been criticised. However, nothing better exists. Hopefully, these plots will become more accurate as more studies are published.

On vaccines and variants

By Nicolas Gambardella

Since the development of the first vaccines against SARS-CoV-2, I have gathered data about their efficacy. Unfortunately, this efficacy is continuously challenged by the appearance of variant viruses, i.e., novel strains carrying a bunch of mutations. With so many vaccines and so many variants, it becomes difficult to keep track of the data. This is compounded by the abundance of publications presenting different types of evaluations. So, while keeping track of all the values and their confidence intervals is very important, I thought it would be nice to have a single overview of where we stand.

The figure below represents the overall efficacy of the main vaccines against the main variants as visual percentages. The blue dots represent protected people who would have been infected without the vaccines. Grey dots represent {vaccine, variant} pairs for which not enough data are available. This figure represents protection from infection, not protection from disease or death (for which protection is likely higher). The figures are those achieved after the recommended dosing protocol for each vaccine. NB: in some plots, “Wuhan” means “none of the variants listed below”.

These numbers are the best and most reliable estimates as I write this post (updated 02 October 2021). I privileged real-world data over clinical trials, directly measured efficacy over efficacy inferred from neutralisation assays (where the serum of vaccinated individuals is used in vitro with viruses or recombinant proteins), and independent data over data provided by vaccine manufacturers. I omitted some authorised vaccines because of data scarcity (and low usage). Some of the data used to plot the graph are known to present peculiarities and have raised issues. However, nothing better is available. Hopefully, these plots will become more accurate as more studies are published.

How should we translate “evidence-based medicine”?

By Nicolas Gambardella

Today, let us tackle a topical question, and one dear to my heart: what is called in English “evidence-based medicine”. How should this expression be translated into French?

First of all, what are we talking about? Since time immemorial, medicine has been an art, and physicians have been artisans. In other words, after initial training with mentors, a physician refines their knowledge and thinking on the basis of professional experience. The drawbacks of this approach need no elaboration. Things began to change in the 19th century with Claude Bernard's « médecine expérimentale », and then in the 20th with the arrival of molecular pharmacology and the acceleration of knowledge in human biology. Medicine's transition from an art to a science was completed half a century ago with the generalisation of controlled clinical trials, in which one tries to eliminate personal arbitrariness and to assess the validity of observations using statistics, often sophisticated ones. The recent advent of high-throughput molecular data has added an aspect of precision and personalisation to this “evidence-based medicine”.

Which brings us to the use of a false friend. “Evidence” is the word used in English-speaking courts for what French courts call « preuve ». In science, however, « preuve » translates as “proof”. The latter meaning is much stronger than the former (one could, incidentally, discuss at length the different status of « preuves » in French-speaking courts and of “evidence” in English-speaking ones). Andrew Wiles provided the proof of Fermat's conjecture (which should therefore be called Wiles' theorem…). Fermat's theorem holds: for strictly positive integers x, y and z, there is no n > 2 such that xⁿ + yⁿ = zⁿ. This result is true, and will remain so forever.

By contrast, the results of a biological experiment or a clinical trial give us a much more nuanced picture. First, results come with a confidence level. If the p-value is 0.05 (a value often used in medical statistics), it means there is a 5% chance that the result is due to chance (it is a bit more complicated than that, but that is not the subject of this post). Moreover, different results could be obtained with another cohort showing other characteristics, whether obvious (age, sex, health status) or more subtle (a different proportion of key haplotypes between the control and treated groups). Hence the existence of meta-analyses, which make it possible to reconcile several clinical trials.

The result of a clinical trial is very respectable and should be a reference in the absence of information to the contrary (and of particular situations such as the patient's circumstances, the availability and price of treatments, etc.). But it is not a « preuve ». I therefore object to translating “evidence-based medicine” as « médecine fondée sur les preuves », even though it is the most widely used rendering. In my view, it is a bad anglicism.

While we are on the subject of anglicisms, let us dispose of « basé sur » right away. The Académie française tells us:

« Current usage agrees on employing Baser sur in the military domain and reserving it for that use: Des troupes ont été basées sur la frontière. One should therefore avoid the figurative use, a transposition of the English based on that has spread improperly, and prefer synonyms such as Fonder, Établir or Asseoir. »

How, then, should we translate “evidence”? One could, like Wikipedia, focus on the factual nature of the result and use « médecine fondée sur les faits ». But here again, the result and the conclusion are being conflated. The result of the clinical trial, namely that a treatment administered according to a certain regimen to a given cohort probably led, with a probability greater than 0.95, to an average improvement of X%, with 95% of the measurements falling within a given interval around X, is a fact. The conclusion, namely that the treatment leads to an improvement of X%, is not.

To « fait », I would therefore prefer « données » (data), which is, as it happens, the noun most frequently used after « preuve ». In the end, the practitioner will use these data, together with data from other trials and from longitudinal surveillance (cohorts or personal experience), the patient's circumstances, and the geographical, temporal and financial context, to decide on the course of action.

And there is no need to add an adjective to reintroduce proof through the back door, as one often sees with « médecine fondée sur les données probantes ». If by « données probantes » one simply means data that can be trusted, we fall into tautology: if a datum is not reliable, why take it into account at all?

Evidence-based medicine = Médecine fondée sur les données