The importance of keyword-rich descriptions
There are nearly 20,000 services on G-Cloud. Suppliers have strewn their services with keywords designed to grab the attention of buyers. So what should buyers search for, and how does that vary by cloud service category?
Only selected parts of the suppliers’ content are indexed for searching: the service title, a 50-word summary, and bulleted features and benefits. So suppliers must cram in thoughtful, keyword-rich phrases to optimise their chances of success.
In this blog, I want to compare and contrast the most frequent keywords used by suppliers. I’ve selected four categories from the Cloud Hosting lot for this purpose:
- Compute & Application Hosting (C&AH)
- Object Storage
- Infrastructure & Platform Security (I&PS)
- Platform as a Service (PaaS)
Discarding distracting data
Services can belong to multiple categories as demonstrated in the Venn diagram below. For example, 53 (those at the heart of the plot) are aligned to all four categories. Comparing and contrasting the keywords for these would clearly be of little benefit. So I’m going to focus on those services around the periphery which are unique to each category, for example, the 323 for C&AH and so forth.
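As a rough illustration of this scoping step, the sketch below keeps only services that sit in exactly one category. The data frame, its column names and its values are hypothetical stand-ins, not the real G-Cloud data, and this is not necessarily how the original analysis did it.

```r
library(dplyr)

# Hypothetical data: one row per service/category pairing.
services <- data.frame(
  service_id = c("s1", "s1", "s2", "s3", "s4", "s4"),
  category   = c("C&AH", "PaaS", "C&AH", "Object Storage", "I&PS", "PaaS"),
  stringsAsFactors = FALSE
)

# Keep only services that appear under exactly one category.
unique_services <- services %>%
  group_by(service_id) %>%
  filter(n() == 1) %>%
  ungroup()

# Count the single-category services in each category.
unique_counts <- unique_services %>% count(category)
```

Here `s1` and `s4` are dropped because each appears in two categories, leaving `s2` and `s3` as the services unique to one category.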
Having defined the scope, we now need to do a bit of cleaning. The words are converted to lower case so that we get a truer count of each distinct word. Common stop words, such as “and” and “the”, are removed. Words which are category-neutral, such as “cloud” and “service”, as well as the names of the suppliers or services themselves, are also weeded out. This cleaning will enable us to home in on service characteristics.
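The cleaning steps might look something like the following minimal sketch. The example descriptions and both word lists are invented for illustration; the actual stop-word and category-neutral lists used in the analysis are not given here.

```r
library(dplyr)

# Hypothetical indexed text for two services.
descriptions <- data.frame(
  service_id = c("s1", "s2"),
  text = c("Scalable Virtual servers and the Cloud",
           "Secure data storage service hosted in the UK"),
  stringsAsFactors = FALSE
)

# Illustrative word lists only (assumptions, not the original lists).
stop_words    <- data.frame(word = c("and", "the", "in"), stringsAsFactors = FALSE)
neutral_words <- data.frame(word = c("cloud", "service"), stringsAsFactors = FALSE)

# Convert to lower case, then split on anything that is not a letter.
tokens <- strsplit(tolower(descriptions$text), "[^a-z]+")

# One row per word, with stop words and category-neutral words removed.
words <- data.frame(
  service_id = rep(descriptions$service_id, lengths(tokens)),
  word       = unlist(tokens),
  stringsAsFactors = FALSE
) %>%
  filter(word != "") %>%
  anti_join(stop_words, by = "word") %>%
  anti_join(neutral_words, by = "word")
```

Lower-casing before counting means that “Virtual” and “virtual” are treated as the same word, and `anti_join` drops every row whose word appears in the exclusion list.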
Visualisation of search terms
With that done, we could visualise the word frequency per category with a word cloud. The Compute & Application Hosting example below shows the most frequent words, with “uk”, “data”, “virtual”, “scale” and “security” figuring prominently.
However, whilst a word cloud is visually appealing, we need a better approach if we are to compare and contrast across categories. This facet-wrap plot shows the ten most frequent words in each category. The advantage here is that we can more easily see both common ground and points of distinction.
“Security” and “data” are among the top keywords for all four categories. In contrast, “API” and “integration” are distinctively important for Platform as a Service (PaaS). Similarly, “scale” and “virtual[isation]” are distinctively important for Compute & Application Hosting.
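A faceted plot of the top words per category could be sketched roughly as follows. The `words` data frame is a tiny hypothetical sample rather than the real cleaned data, and the styling is deliberately minimal.

```r
library(dplyr)
library(ggplot2)

# Hypothetical cleaned words, one row per word occurrence.
words <- data.frame(
  category = c("PaaS", "PaaS", "PaaS", "C&AH", "C&AH", "C&AH"),
  word     = c("api", "api", "security", "virtual", "security", "virtual"),
  stringsAsFactors = FALSE
)

# Count each word within its category and keep at most ten per category.
top_words <- words %>%
  count(category, word, name = "freq") %>%
  group_by(category) %>%
  arrange(desc(freq), .by_group = TRUE) %>%
  filter(row_number() <= 10) %>%
  ungroup()

# One horizontal bar chart per category; free scales let each panel
# show only its own words.
p <- ggplot(top_words, aes(x = reorder(word, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ category, scales = "free") +
  labs(x = NULL, y = "Frequency")
```

Printing `p` draws the panels side by side, which makes it easier to spot words shared across categories (such as “security” in this sample) and words that appear in only one.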
A more extensive analysis of this nature may help the G-Cloud team to identify inter-category dissimilarity and thus refine the service categorisation newly introduced in the ninth iteration of G-Cloud. It could also form the basis of guidance to buyers on the keywords to consider when preparing search terms for a given category.
R tools used
| Package | Functions |
|---|---|
| rvest | read_html; html_nodes; html_text |
| dplyr | select; arrange; filter; count; mutate; if_else; anti_join |
| ggplot2 | theme_set; geom_col; geom_text; coord_flip; facet_wrap |
R Development Core Team (2008). R: A language and environment for
statistical computing. R Foundation for Statistical Computing,
Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.
Contains public sector information licensed under the Open Government Licence v3.0.