Reports or Newspapers – The Two Sides of Healthcare Priorities

[This article was first published on English – R-blog, and kindly contributed to R-bloggers.]


Both the World Health Organization’s statistical profile of Qatar and the much more detailed Annual Health Report of the Department of Epidemiology and Medical Statistics of the State of Qatar show beyond the shadow of a doubt that cardiovascular diseases, diabetes, hypertension, obesity and other metabolic/noncommunicable diseases are the main causes of mortality in the country.[1] The share of lifestyle-related diseases in morbidity is so high that research on diabetes was confirmed as a national healthcare Grand Challenge in the 2014 edition of the Qatar National Research Strategy, while campaigns encouraging the population to adopt a more active lifestyle are regularly run through almost every media and outreach channel.

The objective of this blog post is to check whether, and to what extent, the main focus of the Health section of The Peninsula, a major English-language newspaper in Qatar, reflects this national healthcare concern.


To build the dataset, I collected the articles published in the Health section of the online newspaper. Overall, the dataset comprises 446 articles published between 14 August 2015 and 30 April 2017. The shortest article has only 4 sentences; the longest has 128. A typical article looks like this:

“new york: people who are deprived of sleep regularly are likely to have a weak immune system, a study has found. the findings showed that chronic short sleep shuts down programmes involved in immune response of circulating white blood cells. seven or more hours of sleep is recommended for optimal health, thus the immune system functions best when it gets enough sleep, the researchers said. the results are consistent with studies that show when sleep deprived people are given a vaccine, there is a lower antibody response and if you expose sleep deprived people to a rhinovirus they are more likely to get the virus, said lead author nathaniel watson, university of washington in seattle. for the study, published in the journal sleep, the team took blood samples from 11 pairs of identical twins with different sleep patterns and discovered that the twin with shorter sleep duration had a depressed immune system, compared with his or her sibling. they used identical twins because genetics account for 31 to 55 per cent of sleep duration and behaviour and environment accounts for the remainder. modern society, with its control of light, omnipresent technology and countless competing interests for time, along with the zeitgeist de-emphasising sleep’s importance, has resulted in the widespread deprioritisation of sleep, the researchers noted.”


Once stop words are removed from the dataset, the list of most common uniGrams indicates that the Health section of the newspaper is more oriented towards presenting and discussing the outcome of medical research than towards the national health priorities. The words study, researchers, found, university, research, journal and published are all among the 25 most used uniGrams. Together they are used 3107 times across the 446 articles, i.e. 3.304% of all (‘cleaned’) words. In comparison, words like heart, diabetes, activity, obesity, pressure, lifestyle, cardiovascular, hypertension and sport are used only 943 times in total, approx. 3.3 times less often than the academia-related series of words!
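As an illustration of the counting step described above, here is a minimal Python sketch. It is not the R code behind this post: the stop-word list, the tokenizer and the toy articles are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the post does not say which list it used).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "that", "has", "have", "who"}

def clean_tokens(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stop_words]

def unigram_counts(articles, stop_words=STOP_WORDS):
    """Count every cleaned word across all articles."""
    counts = Counter()
    for article in articles:
        counts.update(clean_tokens(article, stop_words))
    return counts

def share_of_terms(counts, terms):
    """Percentage of all cleaned words that belong to the given set of terms."""
    total = sum(counts.values())
    hits = sum(counts[t] for t in terms)
    return 100 * hits / total if total else 0.0

# Hypothetical two-article corpus, for illustration only.
toy_articles = [
    "A study has found that sleep affects the immune system.",
    "Researchers published the study in the journal Sleep.",
]
counts = unigram_counts(toy_articles)
research_share = share_of_terms(counts, {"study", "researchers", "journal"})
print(counts.most_common(5))
print(round(research_share, 1))
```

Applied to the full corpus, the same kind of share computation is what gives the 3.304% quoted above; the ratio between the two word groups is simply 3107 / 943, or about 3.3.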

The distribution of bi- and triGrams shows a slightly different picture. Although groups of words such as study published, study found, lead author, study published journal, study researchers, published journal and paper published journal still rank in the top 15, the list now also shows a substantial presence of topics that mention or evoke the national healthcare priority. Phrases like blood pressure, physical activity, heart disease, heart failure, blood sugar, olive oil, weight loss, type 2, 2 diabetes, heart attack, risk heart, fatty acids, cardiovascular disease, weight gain, fruits vegetables and heart rate, which mostly (but not exclusively) refer to noncommunicable metabolic diseases, are in the top 50 biGrams as well, and account for 0.61% of all biGrams.[2]
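Phrases such as risk heart or fruits vegetables suggest that the n-grams were built after stop words had been removed. The Python sketch below follows that assumption; the stop-word list and toy sentences are hypothetical, and this is not the original R code.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the post).
STOP_WORDS = {"the", "was", "in", "of", "with", "a", "and"}

def clean_tokens(text):
    """Lowercase, tokenize, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(articles, n):
    """Count n-grams article by article so that no n-gram spans two articles."""
    counts = Counter()
    for article in articles:
        counts.update(ngrams(clean_tokens(article), n))
    return counts

# Hypothetical corpus, for illustration only.
toy_articles = [
    "The study was published in the journal Sleep.",
    "Risk of heart disease rises with high blood pressure.",
]
bigrams = ngram_counts(toy_articles, 2)
trigrams = ngram_counts(toy_articles, 3)

# Share of all biGrams taken up by a set of target phrases, in percent.
targets = {"heart disease", "blood pressure"}
target_share = 100 * sum(bigrams[p] for p in targets) / sum(bigrams.values())
print(bigrams.most_common(3))
print(round(target_share, 1))
```

Removing stop words first is what turns "risk of heart disease" into the biGram risk heart; building n-grams on the raw text would give different phrases and different shares.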

Yet the more ‘targeted’ words, bi- and triGrams that can be semantically connected to typical topics of noncommunicable metabolic diseases appear in a limited number of articles only. This confirms the weak alignment between the editors’ interests on the one hand and the national health priorities on the other. With a few exceptions (exercise, diabetes), phrases such as obesity, overweight, blood pressure, physical activity, heart disease, blood sugar, weight loss, cardiovascular disease, weight gain, body mass index, cardiovascular risk factor, risk cardiovascular disease and blood glucose each appear in no more than about 11% of the articles.

Phrase                        Number of articles   % of all articles
exercise                      69                   15.5
diabetes                      67                   15.0
obesity                       49                   11.0
heart disease                 43                   9.6
blood pressure                32                   7.2
physical activity             28                   6.3
overweight                    24                   5.4
cardiovascular disease        24                   5.4
weight loss                   17                   3.8
blood sugar                   14                   3.1
body mass index               14                   3.1
weight gain                   11                   2.5
risk cardiovascular disease   9                    2.0
blood glucose                 7                    1.6
cardiovascular risk factor    4                    0.9
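The percentages in the table are the number of articles containing a phrase divided by the 446 articles of the corpus. A minimal Python sketch of that per-article count follows. The matching rule (a contiguous run of stop-word-cleaned tokens), the stop-word list and the toy articles are assumptions, not the original R code.

```python
import re

# Tiny illustrative stop-word list (an assumption, not the list used in the post).
STOP_WORDS = {"the", "of", "a", "and", "is", "to", "in", "for", "on"}

def clean_tokens(text):
    """Lowercase, tokenize, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]

def contains_phrase(tokens, phrase):
    """True if the words of the phrase occur consecutively in the token list."""
    words = phrase.split()
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))

def article_share(articles, phrases):
    """Map each phrase to (number of articles containing it, % of all articles)."""
    cleaned = [clean_tokens(a) for a in articles]
    result = {}
    for phrase in phrases:
        n_articles = sum(contains_phrase(tokens, phrase) for tokens in cleaned)
        result[phrase] = (n_articles, round(100 * n_articles / len(articles), 1))
    return result

# Hypothetical four-article corpus, for illustration only.
toy_articles = [
    "Exercise lowers the risk of cardiovascular disease.",
    "Regular exercise is good for blood pressure.",
    "A new study on sleep.",
    "Diabetes and blood pressure.",
]
shares = article_share(
    toy_articles, ["exercise", "blood pressure", "risk cardiovascular disease"]
)
print(shares)

# Sanity check against the first row of the table: 69 of 446 articles.
print(round(100 * 69 / 446, 1))
```

Each article is counted once per phrase no matter how often the phrase occurs in it, which is why these figures differ from the raw n-gram frequencies discussed earlier.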


Even the term frequency (tf) – inverse document frequency (idf) analysis, which measures how important a word is to one document within a set of documents, does not reveal any particular semantic backbone built upon topics related to metabolic diseases.
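For readers unfamiliar with tf-idf, the Python sketch below uses one common definition: tf is a term's count divided by the number of words in the document, and idf is the natural logarithm of the number of documents divided by the number of documents containing the term. Whether the original analysis used exactly this variant is an assumption, and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf-idf scores; docs maps a document id to its list of cleaned tokens."""
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[doc_id] = {
            # tf = n / total, idf = ln(n_docs / doc_freq)
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in counts.items()
        }
    return scores

# Hypothetical, already cleaned documents, for illustration only.
toy_docs = {
    "article_1": ["study", "sleep", "sleep"],
    "article_2": ["study", "diabetes"],
}
scores = tf_idf(toy_docs)
for doc_id, terms in scores.items():
    print(doc_id, sorted(terms.items(), key=lambda kv: -kv[1]))
```

A word that appears in every document, like study here, gets an idf of zero and therefore a tf-idf of zero, so the measure highlights words that are specific to individual articles.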

This quick text mining exercise, applied to articles published in the Health section of The Peninsula, a daily newspaper of Doha, Qatar, reveals that the editors’ focus overlaps only very partially with the most challenging national health problems.



[1] It is worth noting that in Qatar unintentional injuries such as, but not limited to, road or construction site injuries are also among the top three causes of death, especially for men.

[2] It is worth noting that cancer paradoxically ranks sixth in the list of most used uniGrams in the Health section of The Peninsula. It also appears in 1% of all biGrams and 1.5% of all triGrams, although it has a relatively low impact on morbidity in Qatar.

The dataset and complete R code of this post can be downloaded from this link.
