In the previous post (based mostly on EDA principles) I highlighted the main features of 1,720 Venture Capital deals that took place in 2016 in 50 different countries. It is important to underline once again that the dataset I used is not a representative sample of the 2016 VC deals universe and should therefore not be considered an accurate reflection of what happened on the investment scene over the year 2016.
That being said, the exercise made it possible to draw some interesting conclusions (that apply to the dataset only) and to visualise them clearly:
- A large proportion of deals have a value below 7.5m US$ each;
- Compared to other industries, deals in pharma, healthcare and biotech show a more uniform distribution of value (proportionally fewer small deals and more large deals);
- There is a clear increase in deal value from one funding round to the next;
- Country rankings differ markedly depending on whether they are based on number of deals, total deal value or average value per deal;
- The difference in country rankings stems from country-level variation in the number of deals at each funding round;
- Many deals are closed by several investors, and several investors are involved in more than a single deal;
- The number of investors per deal is correlated to the funding round, not to the industry.
To build the dataset, I originally used info publicly circulated in press releases or web news. Therefore, in addition to facts & figures, the dataset comprises a substantial amount of text: 1,507 rows (out of 1,720) include statements such as:
“fourkites has raised USD 13 million in series.a funding. founded in 2013, fourkites leverages mobile, cloud and analytics software to provide real-time tracking of shipments all across north america. the company currently tracks more than 2.5 million electronic logging devices, with that number expected to increase rapidly toward the end of 2017, when a federal mandate requiring all commercial truckers to use such devices goes into effect. the company will use the funding to expand its product offerings and add at least 100 employees to its chicago headquarters over the next 12 months.”
Therefore, it could be useful to apply text mining and text analysis techniques to this corpus to see whether it is possible (and worthwhile) to extract valuable information from such an unstructured set of textual data.
First and foremost, I want to highlight that the vast majority of press clippings are quite brief, just a few sentences long (on average 4.6 sentences per clipping, median value = 4). In fact, 1,214 clippings out of 1,507 count no more than 5 sentences. Consequently, the total number of words per clipping is also quite low.
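As a rough illustration of how those per-clipping statistics are obtained (the original analysis was done in R; this Python sketch uses a naive sentence splitter and a single invented clipping, so the numbers are not the dataset's):

```python
import re
from statistics import mean, median

# Invented sample clipping, for illustration only.
clippings = [
    "fourkites has raised USD 13 million in series.a funding. founded in 2013, "
    "fourkites leverages mobile, cloud and analytics software. the company will "
    "use the funding to expand its product offerings.",
]

def sentence_count(text):
    # Naive splitter: a sentence ends with ., ! or ? followed by whitespace.
    return len([s for s in re.split(r"[.!?]+\s+", text.strip()) if s])

counts = [sentence_count(c) for c in clippings]
print(mean(counts), median(counts))
```

A real pipeline would use a proper sentence tokenizer, but the summary statistics are computed the same way.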
This first observation helps to set expectations. Since we know that the press releases in our dataset are quite short, and that VC deal announcements usually follow some kind of narrative pattern (which leaves little room for literary fantasy or imagination), we are immediately aware that we might face limitations in extracting sensible meaning with text mining techniques.
One more step must be completed before starting the text mining exercise: removing stop words from the dataset. Stop words can be defined as words that carry too little meaning to be useful in the analysis. These words are routinely filtered out from queries and analytical algorithms because they return unnecessary information. In addition to the usual English stop words, I also removed from the dataset a list of words that are so obvious in the context of VC deal press announcements that they add no specific information, such as, but not limited to, usd, funding, funded, based, million, founded, funds, financing, company, companies, raised, raising, raises, capital, round, venture, ventures, investment, investor, investors, business, businesses,…
It then becomes easy to create a visualization of the most common words in the dataset.
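A minimal sketch of the stop-word filtering and frequency counting described above (in Python for illustration; the actual analysis was done in R, and the STOP_WORDS set here is only a small excerpt of the full list):

```python
import re
from collections import Counter

# Excerpt of standard English stop words plus domain-specific terms that
# carry no information in VC press releases (abbreviated from the post).
STOP_WORDS = {
    "the", "a", "an", "in", "of", "to", "and", "has", "its", "will", "is",
    "usd", "funding", "funded", "million", "founded", "company", "companies",
    "raised", "raising", "raises", "capital", "round", "venture", "ventures",
    "investment", "investor", "investors", "business", "businesses",
}

def word_frequencies(texts):
    # Tokenize on lowercase letters/apostrophes and drop stop words.
    counter = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counter.update(t for t in tokens if t not in STOP_WORDS)
    return counter

freqs = word_frequencies(
    ["The company raised USD 13 million to expand its platform."]
)
print(freqs.most_common(3))
```

The resulting counter feeds directly into a bar chart or word cloud of the most common terms.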
This simple first step shows some interesting trends:
- it confirms that seed and series A investments are more frequent than other rounds of funding
- words related to the development of a technology-driven company are frequent (platform, technology, product, development)
- end users, customers, people are a central focus behind the need for investment
- ICT industry is predominant (through words such as software, cloud, data, app, digital, online, web)
- the company’s team seems to be a pillar (team, ceo, led)
- funding is needed for taking the company to the next level (see words such as expansion, development, growth, accelerate, plans)
- something that we could define as getting things done also appears to be amongst the priorities (see words such as build, real, operations, sales)
A further step down the same road of word frequency analysis is to look at the relevance of n-grams instead of single words only. It does not take rocket science to see how much meaningful information lies in them: bi-grams and tri-grams tell a lot about the main priorities of startups securing VC investment.
In terms of corporate functions or objectives, we see that product development, expansion of business operations, and acceleration of sales and marketing are the most quoted n-grams (through several wording variations). Additional investment primarily appears as an opportunity for the company to take its core activities and targets to the next level.
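Counting n-grams is a simple sliding-window operation over the token stream; a minimal Python sketch (the sample tokens are invented for illustration):

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of size n over the token list and join each window.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "expand its product offerings and accelerate sales and marketing".split()
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(2))
```

Applied to the whole corpus (after stop-word removal), the same counter surfaces the recurring bi-grams and tri-grams discussed above.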
Another exercise could compare the wording used in press releases across several industries. Although this could be applied to all combinations, let’s focus on two examples only, first comparing ICT to health, biotech, biomed press releases, then education, childcare to services.
1. ICT vs Health, biotech, biomed
Words such as platform, development, product, improve, engage are used with similar relative frequencies in both ICT and health, biotech, biomed press releases (they appear close to the dotted bisector line). Health, clinical, cancer, therapeutics are clearly words belonging to the health, biotech, biomed industry, whereas security, payment, ai or networks are more ICT-marked.
Correlation test (Pearson): 0.6097 | p-value: 0
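The comparison pairs each shared word's relative frequency in the two industries and computes a Pearson correlation over those pairs; a hand-rolled sketch in Python (the frequency values below are invented, not the dataset's):

```python
from math import sqrt

def pearson(xs, ys):
    # Plain Pearson correlation coefficient between two equal-length series.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical relative frequencies of four shared words in two industries.
ict = [0.012, 0.008, 0.004, 0.010]
health = [0.011, 0.002, 0.009, 0.010]
print(round(pearson(ict, health), 4))
```

A coefficient near 1 would mean the two industries use the shared vocabulary in almost the same proportions; the ~0.61 observed in the post indicates a moderate overlap.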
2. Education & childcare vs Services
Improve, expand, accelerate, build, platform are equally used. Employees, real, book, demand are more service-oriented words. Language, english, science, reach, children and students clearly are on the education side.
Correlation test (Pearson): 0.6121 | p-value: 0
Although sentiment analysis is a well-established field relying on proven approaches and methodologies, its application to the domain of finance is still at an earlier stage of development, mainly due to the lack of comprehensive sentiment lexicons specific to the finance industry. Thanks to the great work of Prof. Bill McDonald, who improved an existing Master Dictionary by adding words and their sentiments from the finance and accounting industry, it is at least possible to start from an existing lexicon.
The first, and perhaps most important, information that can be extracted from this dataset lies in the quite descriptive and neutral wording of the VC deals press releases overall. Out of 68,777 (cleaned) unigrams, only 6,043 express a sentiment (positive or negative). Therefore, 91.2% of all unigrams in the dataset do not carry any sentiment value. There are, however, enough words carrying sentiment value spread over the press releases to allow a sentiment quantification and classification of most of the press releases: 1,402 press releases (out of 1,720) use sentiment loaded words.
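The quantification works by matching each unigram against the sentiment lexicon and counting hits; a toy Python sketch, with an invented four-word lexicon standing in for the full finance dictionary:

```python
# Toy lexicon in the spirit of a finance sentiment dictionary;
# these four entries are illustrative, not the real dictionary.
LEXICON = {
    "growth": "positive",
    "innovative": "positive",
    "disease": "negative",
    "risk": "negative",
}

def sentiment_share(unigrams):
    # Fraction of unigrams that carry a sentiment according to the lexicon.
    tagged = [w for w in unigrams if w in LEXICON]
    return len(tagged) / len(unigrams)

words = ["platform", "growth", "disease", "customers", "product"]
print(sentiment_share(words))  # 2 of 5 words carry sentiment
```

Run over the real corpus with the full lexicon, this is the computation behind the 8.8% of sentiment-bearing unigrams reported above.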
There is no real surprise in the distribution of sentiment values: positive sentiment in VC deal press clippings wins over negative wording. A VC deal press release usually announces good news: a company got money to support its growth and development, and in most cases the core business of the company is described as aiming to bring to market a product or service that will solve some kind of problem.
Still, the expression of sentiment is not evenly distributed across industries. There is proportionally more negative wording in press releases announcing VC deals in the health-related industries (health, biotech, biomed and pharma) than in any other industry. This is quite easy to explain: although these press releases announce the positive news of additional funding, they also often mention negatively connoted words such as disease, cancer, chronic, pain, symptoms, stress, anxiety, death, trauma, injuries, virus and so on.
ANOVA test of sentiment value ~ industry:
- F Value = 2.7911408
- Pr(>F) = 0.0045794
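The ANOVA above tests whether mean sentiment value differs across industries; a self-contained Python sketch of the one-way F statistic, run on invented sentiment scores (not the dataset's):

```python
def one_way_anova_f(groups):
    # One-way ANOVA F statistic: between-group variance over
    # within-group variance.
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    ss_within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups for v in g
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical sentiment scores for three industries.
scores = [[0.2, 0.3, 0.25], [0.1, 0.15, 0.12], [0.3, 0.35, 0.4]]
print(round(one_way_anova_f(scores), 3))
```

With the F value in hand, the p-value (Pr(>F)) comes from the F distribution with (k-1, n-k) degrees of freedom, which in practice one would take from a statistics library.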
As a final note, I would like to warn against the risk of easy misinterpretation. It is worth noting that the ICT industry shows a higher proportion of negative wording as well. The reason, however, does not lie entirely in the industry itself, but is partly due to the lexicon. In the lexicon that I use (which was built by merging several lexicons from different sources), the word cloud is associated with a negative sentiment. The sentiment analysis algorithm does not distinguish between cloud-based IT solutions, products or services, and a sky (or a mood) saturated with grey clouds! When we remove the negative value associated with the word cloud from the sentiment lexicon, the negative sentiment score for ICT looks quite different…
| Case | Negative score of ICT press releases (%) |
|------|------------------------------------------|
| word cloud considered as negative | 24.35 |
| word cloud removed from lexicon | 20.68 |
It took removing just one single word from the lexicon to change the sentiment score of a whole industry by 3.67 percentage points. Beware sentiments…!
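The mechanics of that shift can be sketched in a few lines: recompute the negative share with and without cloud in the negative word set (the word counts below are invented for illustration, not the corpus counts):

```python
from collections import Counter

def negative_share(word_counts, negative_words):
    # Percentage of word occurrences tagged negative by the lexicon.
    total = sum(word_counts.values())
    negative = sum(c for w, c in word_counts.items() if w in negative_words)
    return 100 * negative / total

# Hypothetical ICT word counts; "cloud" dominates the negative tally.
ict_counts = Counter({"cloud": 40, "risk": 20, "platform": 140, "software": 100})
with_cloud = negative_share(ict_counts, {"cloud", "risk"})
without_cloud = negative_share(ict_counts, {"risk"})
print(round(with_cloud, 2), round(without_cloud, 2))
```

Because a frequent word contributes its full count to the tally, a single mis-connoted high-frequency term can move an industry's score by several points, which is exactly what happened with cloud.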
The contribution of single words to sentiment can easily be measured and understood by visualising the most common negative and positive words in the press releases we analyse. It is now clear why removing the sentiment connotation from the words cloud or led (lead) would deeply impact the overall distribution of sentiments. The case of the word drones is also interesting. In our dataset, it is used as a descriptive, neutral word for a remote-controlled pilotless aircraft; but the lexicon connotes it negatively because it refers to its other meaning, a continuous low humming sound. Hence, listing the most common negative and positive words can help to iteratively clean or fine-tune the lexicon.
This is just an appetizer, actually. This post has barely scratched the surface of corpus analysis. Text mining could be taken many steps further with the analysis of word and document frequency, topic modeling, and the quantification of word relationships. Rather than a blog post, it would become a long, long case study. More to come, then. Maybe.
The dataset and complete R code of both posts can be downloaded from this link (2MB).
For this post, I owe a debt of gratitude to Julia Silge’s Text Mining with R book, which I have followed and applied to this case study (it goes without saying that the merits are hers whereas the mistakes are mine!).