Finding Economic Articles with Data: Boosting the App with GPT-3 Embeddings

[This article was first published on Economics and R - R posts, and kindly contributed to R-bloggers].

If you seek empirical economic studies that can be replicated, try my Shiny-based search app:

Over the last couple of days, I was quite excited to explore the fantastic OpenAI API. First, I wrote mygpt, a package for customizable ChatGPT addins in RStudio. Then I used OpenAI’s text embeddings to improve my search app. You can now press the button with the symbol ≈ next to an article to find similar articles:

To implement this feature, I first used the newest OpenAI text embedding model “text-embedding-ada-002” to generate text embeddings for 8000+ abstracts. An embedding of an abstract is a vector of 1536 real numbers that represents information about its content. The API is easy to use from R, and there are also R packages that wrap the OpenAI API, like rgpt3 or, partly, gpt4r.
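As a rough sketch of how such a request can look from plain R with httr (the helper names below are my own for illustration, not from the post or from any wrapper package):

```r
# Build the JSON body for OpenAI's embeddings endpoint.
build_embedding_body <- function(text) {
  # Model name as used in the post; "input" may be a single string
  # or a list of strings.
  list(model = "text-embedding-ada-002", input = text)
}

# Send one abstract to the API and return the embedding vector.
get_embedding <- function(text, api_key = Sys.getenv("OPENAI_API_KEY")) {
  resp <- httr::POST(
    "https://api.openai.com/v1/embeddings",
    httr::add_headers(Authorization = paste("Bearer", api_key)),
    body = build_embedding_body(text),
    encode = "json"
  )
  # Extract the numeric vector (1536 numbers for ada-002) from the response
  unlist(httr::content(resp)$data[[1]]$embedding)
}

# emb <- get_embedding("We estimate the effect of ...")
# length(emb)  # should be 1536
```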

Although the OpenAI API costs money after a free trial, it seems quite affordable. Generating embeddings for my 8000+ abstracts cost just about 0.50 USD.

With an embedding for each abstract, one can compute the similarity of two abstracts via their cosine similarity (which is quite similar to a correlation). Since the embeddings are normalized to unit length, one can actually compute all pairwise similarities at once using a simple matrix product.
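The matrix-product trick can be illustrated with a few lines of R (using small random vectors here instead of the real 1536-dimensional embeddings):

```r
# Toy example: 5 "articles" with 8-dimensional embeddings.
set.seed(1)
emb_mat <- matrix(rnorm(5 * 8), nrow = 5)

# Normalize each row to unit length; OpenAI's ada-002 embeddings
# already come normalized, so this step mimics that property.
emb_mat <- emb_mat / sqrt(rowSums(emb_mat^2))

# For unit vectors, the dot product equals the cosine similarity,
# so one matrix product yields the full n x n similarity matrix.
sim <- emb_mat %*% t(emb_mat)

# Each article is maximally similar to itself (diagonal of ones).
all.equal(unname(diag(sim)), rep(1, 5))  # TRUE
```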

After computing the similarities, I stored for each abstract the 200 closest abstracts from other articles. The resulting file is fairly large and could slow down the app if it were always loaded on start-up. To maintain performance, I store the results using the great fst package, which allows loading only the selected rows once a user presses the button to find similar articles.
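A minimal sketch of that pattern with fst, assuming the neighbors are stored in fixed blocks of 200 rows per article (the column names and layout here are my own illustration, not the app's actual schema):

```r
library(fst)

# Toy neighbor table: k = 200 nearest articles per article.
n_articles <- 3; k <- 200
neighbors <- data.frame(
  article_id = rep(seq_len(n_articles), each = k),
  similar_id = sample(1000, n_articles * k, replace = TRUE),
  similarity = runif(n_articles * k)
)
path <- tempfile(fileext = ".fst")
write_fst(neighbors, path)

# With a fixed block of k rows per article, the row range for article i
# can be computed directly, so only those rows are read from disk.
read_neighbors <- function(i, path, k = 200) {
  read_fst(path, from = (i - 1) * k + 1, to = i * k)
}

nb2 <- read_neighbors(2, path)
nrow(nb2)  # 200
```

The point of fst here is random row access: `read_fst(path, from, to)` reads only the requested slice instead of deserializing the whole file, which keeps the app's start-up and per-click cost small.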

Overall, I think that OpenAI’s embeddings do quite a good job of finding similar articles. But just try it out yourself!

Looking ahead, I am excited to experiment with OpenAI’s fine-tuning and classification. Currently, I am considering how well a fine-tuned GPT-3 model can predict from the abstract of an article whether the automatic Stata reproduction rate is above some threshold like 90%. A high reproduction rate indicates that probably most data sets are available. Perhaps, from the topic described in the abstract, GPT-3 can already make a good guess about data availability.

