rcrunchbase – An API Interface to CrunchBase

February 10, 2015

(This article was first published on Revolutions, and kindly contributed to R-bloggers)

James Peruvankal
Sr. Program Manager, Revolution Analytics

Information about technology business ecosystems is valuable to established companies and startups alike. Fortunately, CrunchBase, the world's most comprehensive dataset of startup activity, captures quite a bit of this information. Founded in 2007 by Mike Arrington, CrunchBase began as a simple crowd-sourced database to track startups covered on TechCrunch. Today, you'll find about 650K profiles of people and companies, maintained by tens of thousands of contributors. Venture capital firms have willingly shared this information so that others could benefit. The data is accessible to everyone through an API and to researchers as a downloadable workbook.

rcrunchbase is an R client to the CrunchBase API developed by Tarak Shah of UC Berkeley. It has several helpful functions that aim to create a compositional query flow. As much as possible, complex queries can be built up from simple requests. The intent is to have rcrunchbase handle the messy stuff while you focus on getting the data you want.

As an example, let us explore the relationships between companies through their founding teams. The "PayPal Mafia" is a group of former PayPal employees and founders who have since founded and developed additional technology companies such as Tesla Motors, LinkedIn, Palantir Technologies, SpaceX, YouTube, Yelp, and Yammer. You can read about the PayPal Mafia on Wikipedia and in the San Jose Mercury News.

Let's find out more about the PayPal Mafia from CrunchBase. To get started, you will first need to sign up for an API key for CrunchBase access and then install the rcrunchbase package.

The following code lists the current and past PayPal team members who are in CrunchBase.

library(rcrunchbase)

# Start by looking up the node details of a company
pp <- crunchbase_get_details("organization/paypal")
# Pull the collections corresponding to the company's "current team" and "past team"
crunchbase_expand_section(pp, c("current_team", "past_team"))

The output is a list of 230 people, available as the downloadable file Pp_team.

These functions can be combined in diverse ways, resulting in a much richer and more expressive approach to the API. To take full advantage of their compositional nature, it's useful to have a "piping" operator that passes the result of one function as the input to the next. For example, one could find the list of companies that PayPal's current and past team members have invested in:

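pp_invests <- crunchbase_expand_section(pp, c("current_team", "past_team")) %>%
    crunchbase_get_details %>%
    # Assumed final step (this line is missing from the source): expand each person's "investments" section
    crunchbase_expand_section("investments")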

The result (available as the downloadable file Pp) is a list of over 300 companies! That's a huge impact by the PayPal Mafia.
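
To make the piping idea concrete without needing an API key, here is a minimal sketch using the %>% operator from the magrittr package. The lookup table and the two helper functions (get_team, get_investments) are hypothetical stand-ins invented for illustration; they are not part of rcrunchbase and do not call the CrunchBase API. The sketch only shows that a piped chain is equivalent to the nested call.

```r
# Toy illustration of composing functions with magrittr's pipe.
# All names and data below are hypothetical stand-ins; nothing here
# touches the CrunchBase API.
library(magrittr)

# Hypothetical lookup table: person -> companies they invested in
investments <- list(
  alice = c("Acme", "Globex"),
  bob   = c("Globex", "Initech")
)

# Stand-in for "expand a company into its team members"
get_team <- function(company) c("alice", "bob")

# Stand-in for "expand each member into their investments"
get_investments <- function(people) {
  unique(unlist(investments[people], use.names = FALSE))
}

# Nested form: read from the inside out
nested <- get_investments(get_team("example-co"))

# Piped form: read left to right, one step per stage
piped <- "example-co" %>% get_team %>% get_investments

identical(nested, piped)
```

The piped form reads in the same order as the steps are described, which is why it suits multi-stage queries like the one above.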

The CrunchBase database is a graph database, and using the API may be time-consuming. CrunchBase also publishes an XLS workbook with information on companies, funding rounds, and acquisitions (available to academics and CrunchBase venture partners). More about that in another blog post.

What interesting questions do you have with such rich data on the startup ecosystems? Please comment below…
