Data Science on Blockchain with R, Part III: Helium-based IoT is taking over the world
How big is the people’s network?
By Thomas de Marchin and Milana Filatenkova
Thomas is Senior Data Scientist at Pharmalex. He is passionate about the incredible possibility that blockchain technology offers to make the world a better place. You can contact him on Linkedin or Twitter.
Milana is Data Scientist at Pharmalex. She is passionate about the power of analytical tools to discover the truth about the world around us and guide decision making. You can contact her on Linkedin.
What is the Blockchain: A blockchain is a growing list of records, called blocks, that are linked together using cryptography. It is used for recording transactions, tracking assets, and building trust between participating parties. Primarily known for its use in Bitcoin and other cryptocurrencies, blockchain is now used in many domains, including supply chain, healthcare, logistics and identity management. Some blockchains are public and can be accessed by everyone, while others are private. Hundreds of blockchains exist, each with its own specifications and applications: Bitcoin, Ethereum, Tezos…
What is Helium: Helium is a decentralized wireless infrastructure. It is a blockchain that leverages a decentralized global network of hotspots. A hotspot is a sort of modem with an antenna that provides long-range connectivity (it can reach 200 times farther than conventional Wi-Fi!) between wireless “internet of things” (IoT) devices. These devices can be environmental sensors that monitor air quality or serve agricultural purposes, localisation sensors that track bike fleets… Explore the ecosystem here. People are incentivized to install hotspots and become part of the network by earning Helium tokens, which can be bought and sold like any other cryptocurrency. To learn more about Helium, read this excellent article.
What is R: R language is widely used among statisticians and data miners for developing data analysis software.
This is the third article in a series on interacting with blockchains using R. Part I focused on some basic concepts related to blockchain, including how to read blockchain data. Part II focused on how to track NFT transaction data and visualise them. If you haven’t read these articles, I strongly encourage you to do so to get familiar with the tools and terminology we use in this third article: Part I and Part II.
Helium is an amazing project. Unlike traditional blockchain-related projects, it is not just about finance but has real-world applications. It is intended to help solve problems outside the crypto world, which is awesome! In the past, deploying a communication infrastructure was only possible for big companies. Thanks to the blockchain, this is now accessible to collectives of individuals.
While a lot of content is available about the coverage aspect of Helium and how to position your antenna correctly to maximize your revenue, little is available about the real use of the network by connected devices, and this is what we would like to address here. In this article, we examine a current snapshot of the Helium blockchain by answering the following questions:
- How big is the Helium network?
- Where are the hotspots located?
- Are they actively utilized, i.e. are they used to transfer data with connected devices?
We will analyse all historical data, from the first block of the blockchain up to the latest. We will generate some statistics and put the emphasis on visualisation. I believe there is nothing better than a good graph to communicate a message.
To fetch the data, there are several possibilities:
- Set up an ETL: This is the most flexible way of fetching data, as you can choose how you manage the database. It can be tricky though, as you would need (1) to set up a server, (2) to have a lot of space on your hard drives (several TB for a database loaded and running) and (3) to have hard drives fast enough to catch up with the blockchain (blocks are constantly added at a fast pace). On the topic, see this, this and this.
- Use the API: Easy, but you are limited in the number of rows you can download. Given the size of the blockchain, this will only represent a few days. See this.
- Download data from the Dewi ETL project: Thanks to Dewi, there is a dedicated ETL server up and running. An interface (metabase) is also available to navigate and manipulate the data. It is possible to extract the data from the interface but it is limited to 10⁶ rows. Alternatively, the team put CSV extracts in 10k/50k-block increments, and this is what we are going to use here! Data are available here.
When you work with big datasets, things can get very slow. Here are two tricks to speed them up a bit:
- Work with packages and functions designed to handle large datasets. To read the data, we use fread from the data.table package. It is much faster than read.table and takes care of decompressing files automatically. For data management operations, data.table is also much faster than tidyverse, but I find the code written with the latter much easier to read. That is why we use the tidyverse approach by default and switch to data.table only when it struggles.
- Keep only the data you need, to save memory. Discard any data you won’t use, such as columns with unimportant attributes, and delete heavy objects you no longer need.
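The two tricks above can be sketched as follows. This is a minimal illustration on a toy file, not the code used for the analysis: the file and the column names (address, owner, location, nonce) are hypothetical stand-ins for a real chain extract.

```r
library(data.table)

# Toy stand-in for a chain extract; file and column names are hypothetical.
tmp <- tempfile(fileext = ".csv")
fwrite(
  data.table(
    address  = c("hs_a", "hs_b", "hs_c"),
    owner    = c("w_1", "w_1", "w_2"),
    location = c("loc_1", "loc_2", "loc_3"),
    nonce    = 1:3  # a column we do not need downstream
  ),
  tmp
)

# Trick 1: read with fread, and only the columns needed for the analysis.
hotspots <- fread(tmp, select = c("address", "owner", "location"))

# Trick 2: delete heavy intermediate objects once they are no longer needed.
big_tmp <- rnorm(1e6)
rm(big_tmp)
invisible(gc())

unlink(tmp)
```

Selecting columns at read time means the unwanted columns never make it into the returned table, which matters more than dropping them afterwards when the file is several GB.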
The code below reads the chain data about the hotspots and performs some data management. We use the H3 package to convert Uber’s H3 indices into latitude/longitude. H3 is a geospatial indexing system using a hierarchical hexagonal grid. H3 supports sixteen resolutions, and each finer resolution has cells with roughly one seventh the area of the coarser resolution. Helium uses resolution 8. To give you an idea, at this resolution, the earth is covered by 691,776,122 hexagons (see here).
This is what the hotspot dataset looks like. We have the address of the hotspot, the address of the owner (an owner is a Helium wallet to which several hotspots can be linked), the date the hotspot was first seen on the network and its location on the globe.
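The cell count quoted for resolution 8 can be checked with a few lines of base R. The sketch below uses the closed-form count of H3 cells per resolution, 2 + 120 × 7^res; the function name h3_cell_count is ours, chosen for illustration.

```r
# Number of H3 cells at a given resolution (0 to 15).
# Resolution 0 has 122 base cells and each step multiplies the count by about 7.
h3_cell_count <- function(res) {
  stopifnot(res >= 0, res <= 15)
  2 + 120 * 7^res
}

format(h3_cell_count(8), big.mark = ",")
# "691,776,122"

# Ratio between two consecutive resolutions: very close to 7.
h3_cell_count(9) / h3_cell_count(8)
```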
Table 1 shows a few descriptive statistics on the hotspot dataset.
Statistics and visualisation
The first statistic we calculate characterises how many hotspots people own. Since there are a lot of owners, showing all the combinations is not possible. Plotting a histogram of the distribution is not an option either, as it is extremely skewed (there is an owner with about 2000 hotspots!). Therefore, we chose to bin the number of hotspots into categories (Table 2). We see that most owners (about 80%) own a single hotspot, but some own hundreds.
There are more than 500k hotspots in the world, which is a lot. These hotspots didn’t appear in one day. In Figure 1, we visualize the growth of the network in terms of how many hotspots were added over time, using a cumulative plot. We see three phases: (1) a slow linear increase, (2) an exponential increase in the middle of 2021, followed by (3) a fast linear increase. In my opinion, the exponential phase could have continued further but saturated because of the limited hotspot supply caused by the worldwide chip shortage that followed the Covid pandemic. To give you an idea, there was a 6-month lag between my hotspot order and its delivery.
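The binning step can be sketched in base R as below. The toy data, the column name owner and the bin edges are assumptions made for illustration; they are not necessarily the categories used in Table 2.

```r
# Toy data: one row per hotspot, with a hypothetical `owner` column.
hotspots <- data.frame(
  owner = c("w1", "w2", rep("w3", 3), rep("w4", 12), rep("w5", 150))
)

# Number of hotspots per owner.
n_per_owner <- table(hotspots$owner)

# Bin the counts into ordered categories (intervals are closed on the right).
bins <- cut(
  as.integer(n_per_owner),
  breaks = c(0, 1, 5, 10, 100, Inf),
  labels = c("1", "2-5", "6-10", "11-100", ">100")
)

# Number and percentage of owners in each bin.
owners_per_bin <- table(bins)
pct_per_bin <- round(100 * prop.table(owners_per_bin), 1)
print(cbind(owners = owners_per_bin, percent = pct_per_bin))
```

Binning keeps the long tail visible (the ">100" row) without letting a handful of very large owners flatten the rest of the distribution.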
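A cumulative growth curve of this kind boils down to counting new hotspots per day and taking a running sum. Here is a minimal base R sketch on toy data; the column name first_seen is hypothetical.

```r
# Toy data: the date each hotspot was first seen on the network.
hotspots <- data.frame(
  first_seen = as.Date(c("2019-07-31", "2019-08-02", "2019-08-02",
                         "2021-06-10", "2021-06-10", "2021-06-11"))
)

# New hotspots per day.
daily <- as.data.frame(table(date = hotspots$first_seen), responseName = "n_new")
daily$date <- as.Date(as.character(daily$date))
daily <- daily[order(daily$date), ]

# Running total of hotspots on the network.
daily$n_cumulative <- cumsum(daily$n_new)
print(daily)
```

Calling `plot(daily$date, daily$n_cumulative, type = "s")` on the result draws the step curve; the same data frame can be passed to ggplot2 for a nicer figure.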
Since we have the geographic information for Helium hotspots, we can visualize where they are located. We start by creating an empty world map on which we overlay the hotspot data. Plotting all the individual hotspots on a map would be too much (there are more than 500k of them); the data are easier to interpret when summarised. Here, we chose to cluster the hotspots into hexagons using a function found on the web (function here) and then plot them using the geom_hex function from ggplot2 (Figure 2).
We can see that most hotspots are located in North America, Europe and Asia, mostly in big cities. There are practically no hotspots in Africa or Russia, and very few in South America. Surprisingly, we see a few hotspots in the middle of the ocean. It could be either a data issue or simply cheating: sadly, people have found ways to increase their rewards by spoofing their hotspot’s location.
In addition to visualisation, it is always useful to provide some numbers. Below, we summarise the proportion of hotspots per continent. For this, we leverage the rworldmap package with a custom function from here, which maps a longitude/latitude pair to the name of the continent/country it belongs to. Table 3 shows that nearly half the hotspots are located in North America, followed by Europe with 30% and then Asia with 16%. Note the Undefined group, which probably refers to hotspots located either in the middle of the ocean or along continental borders. Note also the four hotspots in… Antarctica.
3. Network usage
Now that we understand how the existing hotspots are distributed across the planet and among owners, it would be interesting to find out whether they are actively used by connected devices, and how often. To answer this question, let us download the full history of data transfers. This is a huge dataset (3GB).
On Helium, you only pay for the data you use. Every 24 bytes sent in an uplink or downlink packet cost 1 Data Credit (DC) = $0.00001. To get an idea of how much the network is used, we can look at it from two perspectives: (1) check the volume of data exchanged and (2) check how often the hotspots have been involved in data transfer with connected devices.
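To make the pricing concrete, here is a small sketch that converts packet sizes into Data Credits and US dollars using the two figures above. One assumption is ours: that each packet is billed in whole DC, rounded up. The function name packet_cost is hypothetical.

```r
BYTES_PER_DC <- 24       # from the text: 24 bytes cost 1 DC
USD_PER_DC   <- 0.00001  # from the text: 1 DC = $0.00001

# Assumption: each packet is billed in whole Data Credits, rounded up.
packet_cost <- function(bytes) {
  dc <- ceiling(bytes / BYTES_PER_DC)
  data.frame(bytes = bytes, dc = dc, usd = dc * USD_PER_DC)
}

packet_cost(c(10, 24, 25, 240))

# A sensor sending one 24-byte packet every 10 minutes for a year.
packets_per_year <- 6 * 24 * 365
yearly_usd <- packets_per_year * packet_cost(24)$usd
```

Under these assumptions, such a sensor sends 52,560 packets a year and costs roughly $0.53, which illustrates why the volume of data is small even when the network is busy.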
This is what the transaction dataset looks like. For each transaction, we have the block number, the address of the hotspot, the number of bytes transferred, the date, and the location of the hotspot.
Table 4 shows a few descriptive statistics on the transaction dataset as well as the volume of data exchanged so far. Clearly, the amount of data exchanged between hotspots and connected devices is small: it is about as much as the data volume created by my smartphone in recent years. This metric does not seem to be a good indication of Helium usage. Indeed, the network is not intended to transfer huge volumes of data but rather to transfer data across long distances at a low price. Below, we will look at the second metric, which is more appropriate for quantifying Helium usage.
Another interesting fact: the first transaction occurred on 2020-05-15, while the first hotspot appeared on the network on 2019-07-31. This means there was a delay of about 9.5 months between the appearance of the first hotspot and the first transaction. There are two possible reasons: (1) my initial guess was that a critical number of hotspots was needed to convince connected-device manufacturers to work with the network, and (2) data transfer was free in the beginning and DC transactions were only activated in April 2020 (more here).
Statistics and visualisation
To determine how often the hotspots have been involved in data transfer with connected devices, we can also analyse the total number of transactions. This is another metric of Helium usage. Each data transfer between a hotspot and a connected device corresponds to one transaction on the blockchain and one row in our dataset.
To summarise the evolution of this metric, we calculate the cumulative sum of the number of transactions per date and then stratify it by continent. Overall, Figure 3 is very similar to Figure 1 above: a slow linear increase followed by an exponential increase, which is finally followed by a fast linear increase. The only difference is the glitch in November 2021, which is due to a major outage of the blockchain (here). Surprisingly, we see that despite having about 16% of the hotspots (Table 3), Asia does not seem to be as active in terms of data transfer as North America and Europe.
This is confirmed by the distribution of the total number of transactions per continent: Asia represents only 3% of the total.
We can also look at where the 10 most active hotspots are located. Note that we use the data.table syntax here instead of dplyr. As mentioned above, the dplyr syntax is preferred for its readability, but in this case data.table takes only 2 seconds while dplyr is much slower. We see that the most active hotspots are located in France, the US and Canada.
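The two summaries discussed here, the running total per continent and each continent's share of all transactions, can be sketched in base R as follows. The toy data and column names are hypothetical; each row stands for one transaction.

```r
# Toy data: one row per transaction.
tx <- data.frame(
  date      = as.Date(c("2020-05-15", "2020-05-15", "2020-05-16",
                        "2020-05-16", "2020-05-17", "2020-05-17")),
  continent = c("North America", "Europe", "North America",
                "North America", "Europe", "Asia")
)

# Number of transactions per date and continent.
daily <- aggregate(n ~ date + continent, data = transform(tx, n = 1L), FUN = sum)
daily <- daily[order(daily$continent, daily$date), ]

# Running total within each continent.
daily$n_cumulative <- ave(daily$n, daily$continent, FUN = cumsum)

# Share of all transactions per continent, in percent.
share <- round(100 * prop.table(table(tx$continent)), 1)

print(daily)
print(share)
```

Sorting by continent and date before calling `ave` matters: `cumsum` runs in row order within each group, so unsorted dates would give a meaningless running total.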
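A ranking of this kind is a one-liner in data.table. The sketch below runs on toy data with hypothetical column names and keeps the top 2 rather than the top 10; it illustrates the syntax and is not the exact code behind the result above.

```r
library(data.table)

# Toy data: one row per transaction.
tx <- data.table(
  hotspot = c("hs_a", "hs_b", "hs_a", "hs_c", "hs_a", "hs_b"),
  bytes   = c(24, 48, 24, 12, 96, 24)
)

top_n <- 2

# Count rows per hotspot, sort by decreasing count, keep the first top_n.
top_hotspots <- head(tx[, .(n_tx = .N), by = hotspot][order(-n_tx)], top_n)
print(top_hotspots)
```

The chained brackets read left to right: group and count, then order, then truncate.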
We can also calculate the proportion of hotspots involved in transactions and the median number of transactions per hotspot.
The median number of transactions per hotspot (excluding hotspots which didn’t participate in any transaction) is 42, and 40.48% of hotspots have not participated in any transaction so far. We cannot really say that all hotspots are being exploited… Not yet! The network is still in its infancy and has a lot of spare capacity.
Let us again visualise the number of transactions on the world map. We bin the data using the same makeHexData function and overlay the map with the number of data transactions. This time, we create a longitudinal animation using the gganimate package (Figure 4). Although direct comparison with Figure 3 is difficult since we have an additional dimension here (the color refers to the number of transactions), the message is similar. We see that transactions mainly occur in North America before mid-2020, followed by a strong wave in Europe and Asia. Hardly any transactions have occurred in South America and Africa.
To add a bit of visual perspective, we can also turn the plot into 3D using the awesome rayshader package. We focus on two countries: (1) the US, as it is the country with the largest number of hotspots and transactions, and (2) Belgium, which is my home country. As we intend to generate a static plot instead of an animation this time, we re-bin the data into hexagons. Note that it is possible to animate this 3D plot, but it takes a lot of computing time and fine-tuning (see this).
Figure 5 shows the US map. We see that transactions are homogeneously distributed across the country, although the peaks of activity (note that the legend is logarithmic!) are located around big cities (New York, Los Angeles, San Francisco, Miami).
Figure 6 shows a map of Belgium. Here the pattern is different, as transactions are not homogeneously distributed across the country. Most transactions happen in the upper part of the country, which is consistent with the lower part being sparsely populated (the Ardennes region).
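Both numbers come from joining the list of all hotspots with the per-hotspot transaction counts, so that hotspots without any transaction are counted as zero. A minimal base R sketch on toy data, with hypothetical column names:

```r
# Toy data: the full list of hotspots and one row per transaction.
hotspots <- data.frame(address = c("hs_a", "hs_b", "hs_c", "hs_d", "hs_e"))
tx <- data.frame(hotspot = c("hs_a", "hs_a", "hs_a", "hs_b", "hs_c", "hs_c"))

# Transactions per hotspot, including hotspots that never appear in `tx`.
n_tx <- tabulate(match(tx$hotspot, hotspots$address), nbins = nrow(hotspots))

# Percentage of hotspots with no transaction at all.
pct_inactive <- 100 * mean(n_tx == 0)

# Median number of transactions among active hotspots only.
median_active <- median(n_tx[n_tx > 0])
```

Counting from the transaction table alone would silently drop the inactive hotspots, which is exactly the group this statistic is about.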
I hope you enjoyed reading this article and now have a better understanding of what the Helium network is and how it has evolved over the past years. Here, we have shown some techniques to summarise and visualise the network growth in terms of infrastructure (hotspots) and data usage (transactions). We have analysed spatio-temporal data and plotted them using dedicated R packages.
We look forward to receiving your feedback and ideas regarding blockchain topics that deserve to be covered in our next post. If you wish to continue learning about chain data analysis using R, please follow me on Medium, Linkedin and/or Twitter so you get alerted when a new article is released. Thank you for reading, and feel free to reach out to us if you have questions or comments.
I’d like to thank the Dewi team and Helium Discord community (@ediewald, @bigdavekers, @jamiedubs, #data-analysis) for their support and for providing the data.
If you wish to help us continue researching and writing about data science on blockchain, don’t hesitate to make a donation to our Ethereum (0xf5fC137E7428519969a52c710d64406038319169), Tezos (tz1ffZLHbu9adcobxmd411ufBDcVgrW14mBd) or Helium (13wfiNFC7NrxHR8wZNbu8CYcJdzTsNtiQ8ZwYW8VscNtzjskjBc) wallets.
All figures are from authors unless otherwise stated.
Data Science on Blockchain with R. Part III: Helium based IoT is taking the world was originally published in Towards Data Science on Medium.