Generate image captions with the Computer Vision API

[This article was first published on Revolutions, and kindly contributed to R-bloggers].

The Azure Computer Vision API can extract all sorts of interesting information from images — tags describing objects found in the images, locations of detected faces, and more — but today I want to play around with just one: caption generation. I was inspired by @picdescbot on Twitter, which selects random images from Wikimedia Commons and generates a caption using the API. The results are sometimes impressive, sometimes funny, and sometimes bizarre. On the bizarre side, recurring motifs include “lush green field”, and “a clock tower on a building”, but the bot doesn't discriminate when the confidence score returned by the API is low, so I wanted to take a look at that as well.

I'll use R to interface with the API, using the code provided below. You can use the code too, but you'll need an Azure login first. If you don't have one, you can sign up for a free Azure account (or Azure for Students, which doesn't require a credit card), and get some free credits to boot. We won't be using any credits in this case, though, as the Computer Vision API has a free pricing tier: it's limited to 5,000 calls a month and 20 calls per minute, but that's more than sufficient for our needs.

To generate a key for the Computer Vision API, visit the Azure Portal and click “Create a Resource”. Select “AI + Cognitive Services” and then “Computer Vision API”. Choose a name, your subscription (yours will be different), a data center (choose a region close to you), the pricing tier (F0 is the free tier), and create a new resource group to hold your keys:

[Screenshot: Vision API resource creation form in the Azure Portal]

It'll take just a moment to set things up, but once that's done, select “Overview” to display the API endpoint, and “Keys” to display the API keys. (Two keys are generated, but you will only need to use Key 1.)

Now launch R, load a couple of packages we'll need, and save the endpoint and API key into R objects as shown below:

vision_api_endpoint <- ""
vision_api_key <- "7f1f01ac24064abd80970f41a90237e7"

(Your endpoint may differ depending on the region you chose, and your API key will definitely be different: that one's invalid.) With those two pieces of information, we're all set to go.
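
If you'd rather not paste a real key into a script, one option is to read both values from environment variables instead. This is only a sketch of that idea: the function name and the variable names VISION_API_ENDPOINT and VISION_API_KEY are my own assumptions, not anything Azure defines.

```r
# Hypothetical helper: read the endpoint and key from environment variables
# (for example set in ~/.Renviron). The variable names are assumptions.
get_vision_credentials <- function() {
  endpoint <- Sys.getenv("VISION_API_ENDPOINT", unset = "")
  key <- Sys.getenv("VISION_API_KEY", unset = "")
  if (!nzchar(endpoint) || !nzchar(key)) {
    stop("Set VISION_API_ENDPOINT and VISION_API_KEY first.", call. = FALSE)
  }
  # Drop any trailing slash so that paths can be appended with paste0().
  list(endpoint = sub("/+$", "", endpoint), key = key)
}

# Usage, once the two variables are set:
# creds <- get_vision_credentials()
# vision_api_endpoint <- creds$endpoint
# vision_api_key <- creds$key
```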

Step 1 is to generate the URL of an image from Wikimedia Commons. This can be done with the Wikimedia API, and it's easy once you know how. (It took me quite a while and a lot of trial and error to figure out that API, though.) The simple R function below will query the Wikimedia Commons API and return the URL of a random image file. It also checks that the image meets the requirements of the Computer Vision API, and throws an error if not.
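
A function along these lines might look like the sketch below. Treat it as an illustration rather than a definitive implementation: the helper names (random_image_query, extract_image) and the attribute names (dims, description) are my own choices, it assumes the httr and jsonlite packages are installed, and the checks reflect my understanding of the Computer Vision API limits (a supported image type, under 4 MB, at least 50 x 50 pixels).

```r
# Sketch only: helper and attribute names are assumptions.
commons_api <- "https://commons.wikimedia.org/w/api.php"

# Query parameters asking for one random page from the File namespace (6),
# with its URL, size, MIME type and extended metadata.
random_image_query <- function() {
  list(
    action = "query", format = "json",
    generator = "random", grnnamespace = 6, grnlimit = 1,
    prop = "imageinfo", iiprop = "url|size|mime|extmetadata"
  )
}

# Turn a parsed API response into a URL with attributes, or stop with an
# error if the file does not meet the Computer Vision API input limits.
extract_image <- function(parsed) {
  info <- parsed$query$pages[[1]]$imageinfo[[1]]
  # BMP is also accepted by the Computer Vision API but is not checked here.
  supported <- c("image/jpeg", "image/png", "image/gif")
  if (!isTRUE(info$mime %in% supported)) {
    stop("Unsupported image type: ", info$mime, call. = FALSE)
  }
  if (info$size >= 4 * 1024^2) stop("Image is 4 MB or larger.", call. = FALSE)
  if (info$width < 50 || info$height < 50) {
    stop("Image is smaller than 50 x 50 pixels.", call. = FALSE)
  }
  desc <- info$extmetadata$ImageDescription$value
  structure(
    info$url,
    dims = c(w = info$width, h = info$height),
    # Commons descriptions often contain HTML, so strip the tags.
    description = if (is.null(desc)) NA_character_ else gsub("<[^>]+>", "", desc)
  )
}

random_image <- function() {
  resp <- httr::GET(commons_api, query = random_image_query())
  httr::stop_for_status(resp)
  extract_image(httr::content(resp, as = "parsed", type = "application/json"))
}
```

Keeping the parsing and checking in extract_image(), separate from the network call, makes it possible to try the checks on a hand-built list without touching the API.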

You can now call the function to generate the URL of a random image. (If it throws an error, just try again.) It also returns, as attributes, the dimensions of the image and the description from the Wikimedia Commons page, which will be interesting to compare to the caption generated by the Computer Vision API.

> random_image()
[1] ""
   w    h 
3072 2304 
[1] "Villa Malva i Ramlösa brunnspark i Helsingborg."

Step 2 is to use the URL generated by that function as input to the Computer Vision API with the second R function below. It uses the global vision_api_endpoint and vision_api_key objects you defined earlier to call the Computer Vision API, requesting the Description (caption). It will also try to identify celebrities and famous landmarks, if it finds them in the image (for example, the generated caption for this image is "TOM CRUISE wearing a suit and tie").
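
As a rough sketch of what such a function could look like (again, not a definitive implementation): the version below assumes that vision_api_endpoint ends with the API version path, so that "/analyze" can be appended to it, that httr and jsonlite are installed, and that the image URL may carry a "description" attribute. The helper names top_caption and format_result are hypothetical.

```r
# Sketch only: helper names and the "description" attribute are assumptions.

# Pull the top caption out of a parsed "analyze" response.
top_caption <- function(parsed) {
  captions <- parsed$description$captions
  if (length(captions) == 0) {
    return(list(text = NA_character_, confidence = NA_real_))
  }
  list(text = captions[[1]]$text, confidence = captions[[1]]$confidence)
}

# Format one result as a few lines of text.
format_result <- function(image_url, caption, wikimedia_description = NULL) {
  c(
    sprintf("image from %s", as.character(image_url)),
    if (!is.null(wikimedia_description))
      sprintf("Wikimedia description: %s", wikimedia_description),
    sprintf("Computer Vision API caption (confidence: %.1f%%): %s",
            100 * caption$confidence, caption$text)
  )
}

image_caption <- function(image_url) {
  resp <- httr::POST(
    paste0(vision_api_endpoint, "/analyze"),
    query = list(visualFeatures = "Description",
                 details = "Celebrities,Landmarks"),
    httr::add_headers(`Ocp-Apim-Subscription-Key` = vision_api_key),
    body = list(url = as.character(image_url)),
    encode = "json"
  )
  httr::stop_for_status(resp)
  parsed <- httr::content(resp, as = "parsed", type = "application/json")
  writeLines(format_result(image_url, top_caption(parsed),
                           attr(image_url, "description")))
  invisible(parsed)
}
```

The full parsed response is returned invisibly, so it can be inspected for tags, celebrities, or landmarks if the one-line caption isn't enough.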

That function prints the URL, the caption generated by the Computer Vision API, and its confidence score (a value between 0 and 1), along with the Wikimedia Commons description for comparison. Let's give it a go. In each case I simply ran the two lines below.

> u <- random_image()
> image_caption(u)

Here are a few results:

image from
Wikimedia description: Gerfalke (Falco rusticolus) in Westgrönland
Computer Vision API caption (confidence: 92.9%): a bird that is standing in the grass

image from
Wikimedia description: Vista de la casa de la Hacienda Mozanga en 1995.
Computer Vision API caption (confidence: 90.2%): a house with trees in the background

image from
Wikimedia description: Dj Israeli, en un concierto.
Computer Vision API caption (confidence: 44.3%): a man riding on the back of a bicycle

That last example is a good lesson not to trust the captions when the confidence score is low! Let us know about any interesting, funny or strange captions you get in the comments below.
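
One simple way to act on that lesson is to discard captions whose confidence falls below some cutoff. The helper below is a hypothetical sketch, and the 0.8 default is an arbitrary illustrative threshold, not a value recommended by the API.

```r
# Hypothetical helper: keep a caption only if its confidence clears a threshold.
# The 0.8 default is an illustrative choice, not an API recommendation.
trusted_caption <- function(text, confidence, min_confidence = 0.8) {
  ifelse(!is.na(confidence) & confidence >= min_confidence, text, NA_character_)
}

# The three results shown above
captions <- c("a bird that is standing in the grass",
              "a house with trees in the background",
              "a man riding on the back of a bicycle")
confidence <- c(0.929, 0.902, 0.443)
trusted_caption(captions, confidence)
```

With that cutoff, the first two captions are kept and the bicycle caption comes back as NA.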
