
LLM-assisted Summarization of Abstracts And Bluesky Post via R

[This article was first published on r on Everyday Is A School Day, and kindly contributed to R-bloggers].

How do we identify relevant articles in our domains? This project uses example journal RSS feeds with abstracts, uses LLMs to extract points of interest, and shares insights on Bluesky—stimulating curiosity.

Motivations

There are countless excellent articles published daily, but how do we identify which ones are relevant to our domains? How can we extract the essence of an article—specifically, its abstract? Let’s build something that uses Bluesky as a forum to stimulate curiosity and learning.

The plan is to leverage RSS feeds provided by journals (which hopefully include abstracts) and then use those abstracts with an LLM to capture the essence. This could trigger curiosity in several ways: perhaps something seems unusual or intriguing, prompting us to verify if the LLM interpreted it correctly. Maybe we’ve already read the abstract and article, but the LLM identified something we missed.

Following the summary of the summary generation 🤣, how can we utilize Bluesky as a platform to share these insights? We’re going to explore how to implement all of these capabilities in R.

Objectives:

Get ID RSS Feed

library(tidyverse)
library(tidyRSS)

url <- "https://academic.oup.com/rss/site_5269/3135.xml"

(df <- tidyfeed(url))

## # A tibble: 46 × 13
##    feed_title       feed_link feed_description feed_language feed_pub_date      
##    <chr>            <chr>     <chr>            <chr>         <dttm>             
##  1 Clinical Infect… http://a… "\n    "         en-us         2025-02-24 00:00:00
##  2 Clinical Infect… http://a… "\n    "         en-us         2025-02-24 00:00:00
##  3 Clinical Infect… http://a… "\n    "         en-us         2025-02-24 00:00:00
##  4 Clinical Infect… http://a… "\n    "         en-us         2025-02-24 00:00:00
##  5 Clinical Infect… http://a… "\n    "         en-us         2025-02-24 00:00:00
##  6 Clinical Infect… http://a… "\n    "         en-us         2025-02-24 00:00:00
##  7 Clinical Infect… http://a… "\n    "         en-us         2025-02-24 00:00:00
##  8 Clinical Infect… http://a… "\n    "         en-us         2025-02-24 00:00:00
##  9 Clinical Infect… http://a… "\n    "         en-us         2025-02-24 00:00:00
## 10 Clinical Infect… http://a… "\n    "         en-us         2025-02-24 00:00:00
## # ℹ 36 more rows
## # ℹ 8 more variables: feed_last_build_date <dttm>, feed_generator <chr>,
## #   item_title <chr>, item_link <chr>, item_description <chr>,
## #   item_pub_date <dttm>, item_guid <chr>, item_category <list>

It’s that simple! Two lines of code. I really appreciate that Oxford University Press includes the abstract in its RSS feed. This is very helpful, since we can send it straight to an LLM for summarization.

To automate this, we need to compare each newly pulled feed against the data we pulled previously. Let’s assume we already have old data that has been saved in an rda file. I’m going to simulate the old data by removing a random 50% of df, like so, and create a column sent with the value 1, my way of marking that these have already been posted on Bluesky.

Get old data

df_old <- df |>
  slice_sample(prop = 0.5) |>
  mutate(sent = 1)

In reality, though, we’d load it from a data file like so.

(load("old_data.rda")) #assuming it's saved under df_old 

## [1] "df_old"

Also note that there is no LLM summary in our mock df_old.
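On the very first scheduled run there is no old_data.rda yet, so load() would fail. Here is a minimal sketch of one way to handle that, assuming the file stores a dataframe named df_old as above. The helper name load_old_data and its template argument are hypothetical, not part of the original workflow.

```r
# Hypothetical helper: return the saved df_old if the file exists,
# otherwise an empty dataframe with the same columns plus `sent`.
load_old_data <- function(path, template) {
  if (file.exists(path)) {
    env <- new.env()
    load(path, envir = env)   # loads df_old into a private environment
    return(env$df_old)
  }
  empty <- template[0, , drop = FALSE]  # zero rows, same columns as the feed
  empty$sent <- numeric(0)
  empty
}

# Usage with a toy feed (assumed data, for illustration only)
toy_feed <- data.frame(item_link = c("a", "b"), stringsAsFactors = FALSE)
toy_old <- load_old_data(tempfile(fileext = ".rda"), toy_feed)
nrow(toy_old)
```

With an empty df_old, the later anti_join simply treats every feed item as new.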

Match with new data

Let’s assume that df, the newly captured RSS feed, is our new data. We want to anti_join it against the old data so that only the items that are not already in our old data are returned.

df_new <- df |>
  anti_join(df_old, by = "item_link") |>
  mutate(sent = 0)
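To make the anti_join behaviour concrete, here is a small sketch on toy data. The tibbles feed and seen are made up for illustration; only the join logic mirrors the code above.

```r
library(dplyr)

# Toy stand-ins for the feed and the previously stored items (assumed data)
feed <- tibble(item_link = c("a", "b", "c"), item_title = c("A", "B", "C"))
seen <- tibble(item_link = c("a", "c"), sent = 1)

# Keep only feed rows whose item_link does NOT appear in `seen`
new_items <- feed |>
  anti_join(seen, by = "item_link") |>
  mutate(sent = 0)

new_items$item_link
```

Only the link "b" survives, because "a" and "c" are already in the stored data.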

df_new[10,"item_description"] |> pull()

## [1] "AbstractBackgroundBlastomycosis is an environmentally acquired fungal infection that can result in severe pulmonary illness and high hospitalization rates. In 2023, a blastomycosis outbreak was detected among workers at a paper mill in Delta County, Michigan.MethodsWe included patients with clinical and laboratory evidence of blastomycosis who had spent ≥40 hours in Delta County since 1 September 2022 and had illness onset 1 December 2022–1 July 2023. We assessed epidemiological and clinical features of patients and evaluated factors associated with hospitalization. We performed whole-genome sequencing to characterize genetic relatedness of clinical isolates from 8 patients.ResultsIn total, 131 patients were identified; all had worked at or visited the mill. Sixteen patients (12%) were hospitalized; 1 died. Compared with nonhospitalized patients, more hospitalized patients had diabetes (P = .03) and urine antigen titers above the lower limit of quantification (P &lt; .001). Hospitalized patients were also more likely to have had ≥1 healthcare visits before receiving a blastomycosis diagnostic test (P = .02) and to have been treated with antibiotics prior to antifungal prescription (P = .001). All sequenced isolates were identified as Blastomyces gilchristii and clustered into a distinct outbreak cluster.ConclusionsThis was the largest documented blastomycosis outbreak in the United States. Epidemiologic evidence indicated exposures occurred at or near the mill, and genomic findings suggested a common exposure source. Patients with diabetes may have increased risk of hospitalization, and elevated urine antigen titers could indicate greater disease severity. Early suspicion of blastomycosis may prompt earlier diagnosis and treatment, potentially reducing unnecessary antibiotic prescriptions and improving patient outcomes."

As you can see from row 10 of the data, the CID RSS feed does include the abstract! 🙌 Lucky for us!
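Notice that the section headings are fused to the text ("AbstractBackgroundBlastomycosis", "Michigan.MethodsWe") and that an HTML entity (&lt;) shows up. The LLM seems to cope, but if you want tidier input, here is a heuristic sketch in base R. The function name clean_abstract and the list of headings are my assumptions based on this one abstract; other journals may structure theirs differently.

```r
# Hypothetical helper: tidy an abstract pulled from the RSS feed.
clean_abstract <- function(x) {
  x <- sub("^Abstract", "", x)
  # Decode a few common HTML entities (ampersand last to avoid double decoding)
  x <- gsub("&lt;", "<", x, fixed = TRUE)
  x <- gsub("&gt;", ">", x, fixed = TRUE)
  x <- gsub("&amp;", "&", x, fixed = TRUE)
  # A heading at the start or right after a period, fused to a capitalized word
  x <- gsub("(^|\\.)(Background|Methods|Results|Conclusions)(?=[A-Z])",
            "\\1 \\2: ", x, perl = TRUE)
  trimws(gsub("\\s+", " ", x))
}

# Usage on a shortened, made-up abstract in the same shape as the feed text
clean_abstract("AbstractBackgroundFungal infection.MethodsWe included patients.")
```

This is only pattern matching, so a sentence that genuinely starts with "Results" followed by a capital letter would be reformatted as well.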

Summarize via LLM

## install openai virtualenv
## Run this if it's your first time by uncommenting it
## reticulate::virtualenv_install(envname = "openai", packages = c("openai","atproto"))
reticulate::use_virtualenv("openai")
library(reticulate)

openai <- import("openai")
OpenAI <- openai$OpenAI

client <- OpenAI(api_key = 'YOUR API KEY') ## change this to yours

response <- client$chat$completions$create(
  model = "gpt-4o-mini",
  messages = list(dict(
    role = "system",
    content = "You are a summarizer. Please summarize the following abstract. 
    Include statistics.Use emoji to shorten characters. The summary must not 
    exceed 200 characters—if you reach 200 characters, stop immediately. 
    Do not add any extra commentary or exceed the 200-character limit."
  ), dict(
    role = "user",
    content = df_new[10,"item_description"] |> pull()
  )
  ),
  temperature = 0
)

(summary <- response$choices[[1]]$message$content)

## [1] "26,233 adults studied (median age 71). 60.9% treated on day 0; \n30-day mortality: 7.5% (day 0), 8.5% (day 1), 10.2% (days 2-5). Delay increases \ndeath risk (OR 1.14 & 1.40). 🚨💊"

This is where you can really customize to your liking. What matters enough to you in an abstract that you would click the journal link and read more? For me, it’s the statistics, hence I specified that in the prompt. Because Bluesky only allows 300 graphemes (visible characters) per post, I experimented with limits of 200, 250, and 300 and found 200 to be the safest, so that we don’t go over and end up with an error. For future improvement, it’d be good to create a function that checks this and either re-runs the prompt with an adjustment or threads the post.
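As a starting point for that check, here is a minimal sketch using stringi, which counts grapheme clusters rather than bytes or code points. The function names and the choice to truncate with an ellipsis are my assumptions; re-running the prompt or threading would be alternatives.

```r
library(stringi)

# Count user-perceived characters (grapheme clusters)
count_graphemes <- function(x) {
  stri_count_boundaries(x, type = "character")
}

# TRUE if the text fits within the post limit
fits_bluesky <- function(x, limit = 300L) {
  count_graphemes(x) <= limit
}

# Fallback: cut the text to `limit` graphemes, ending with an ellipsis
truncate_graphemes <- function(x, limit = 300L) {
  if (count_graphemes(x) <= limit) return(x)
  pieces <- stri_split_boundaries(x, type = "character")[[1]]
  paste0(paste(pieces[seq_len(limit - 1L)], collapse = ""), "\u2026")
}

# Usage on a deliberately long string
too_long <- strrep("a", 400)
fits_bluesky(too_long)
count_graphemes(truncate_graphemes(too_long))
```

In the pipeline, one could call fits_bluesky(summary) right after the LLM call and only fall back to truncate_graphemes(summary) when it returns FALSE.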

Post On Bluesky

Please note that this only demonstrates posting one specific article, row 10 of df_new. We can modify this code later so that it’s not so rigid. This is just to show how to send Bluesky posts via atproto. There is also an R package called bskyr which you can use for sending posts, though I’m not sure whether it can embed external links.

atproto <- import("atproto")
Client <- atproto$Client
models <- atproto$models
client_utils <- atproto$client_utils

## Login
bsky <- Client()
bsky$login("username", "password") #change to your username and app password

## Build Text
text <- client_utils$TextBuilder()
text$text(summary)

## Embed 
embed <- models$app$bsky$embed$external$Main(
      external = models$AppBskyEmbedExternal$External(
          uri = df_new[[10, 'item_link']],  # The main link
          title = df_new[[10, 'item_title']], # Title of the preview
          description = df_new[[10, 'item_description']]  # Description of the link
      ))

bsky$send_post(text = text, embed = embed)
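Both the OpenAI key and the Bluesky app password are typed directly into the script above, which is risky once the script is scheduled or shared. Here is a small sketch of reading them from environment variables instead (for example, set in .Renviron). The helper get_secret and the variable names OPENAI_API_KEY, BSKY_HANDLE and BSKY_APP_PASSWORD are assumptions for illustration.

```r
# Hypothetical helper: fetch a required environment variable or stop early
get_secret <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable ", name, " is not set", call. = FALSE)
  }
  value
}

# Intended usage (commented out because it needs real credentials):
# client <- OpenAI(api_key = get_secret("OPENAI_API_KEY"))
# bsky$login(get_secret("BSKY_HANDLE"), get_secret("BSKY_APP_PASSWORD"))
```

Failing early with a clear message is handy for a scheduled job, where a missing credential would otherwise surface as a less obvious API error.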

Save It And Post Later

You may have noticed that df_old has a column called sent. It lets us post titles that we have not yet posted: we assign 1 whenever we have sent a post, and 0 whenever there is a new title that we have not posted. In the end, we need to save the entire dataframe. Let’s put all the code together so it all makes sense. We’ll assume that old_data.rda already exists and that the virtualenv for openai and atproto is already set up.

### 1. Get Feed
library(tidyverse)
library(tidyRSS)

url <- "https://academic.oup.com/rss/site_5269/3135.xml"
df <- tidyfeed(url)

### 2. Get Old Data
load("old_data.rda") #df_old is our dataframe

### 3. Match feed and old data, to return only ones that are not in old data
df_new <- df |>
  anti_join(df_old, by = "item_link") |>
  mutate(sent = 0)

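### 4. Randomly pull a title that has not been sent previously
### (unsent titles may sit in the old data from earlier runs as well as in the new data)
topic_to_send <- bind_rows(df_old, df_new) |> 
  filter(sent == 0) |>
  slice_sample(n = 1)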

### 5. OpenAI LLM summarization
reticulate::use_virtualenv("openai")
library(reticulate)

openai <- import("openai")
OpenAI <- openai$OpenAI

client <- OpenAI(api_key = 'YOUR API KEY') ## change this to yours

response <- client$chat$completions$create(
  model = "gpt-4o-mini",
  messages = list(dict(
    role = "system",
    content = "You are a summarizer. Please summarize the following abstract. Include statistics.Use emoji to shorten characters. The summary must not exceed 200 characters—if you reach 200 characters, stop immediately. Do not add any extra commentary or exceed the 200-character limit."
  ), dict(
    role = "user",
    content = topic_to_send |> pull(item_description)
  )
  ),
  temperature = 0
)

summary <- response$choices[[1]]$message$content

### 6. Post on Bluesky
atproto <- import("atproto")
Client <- atproto$Client
models <- atproto$models
client_utils <- atproto$client_utils

## Login
bsky <- Client()
bsky$login("username", "password") #change to your username and app password

## Build Text
text <- client_utils$TextBuilder()
text$text(summary)

## Embed 
embed <- models$app$bsky$embed$external$Main(
      external = models$AppBskyEmbedExternal$External(
          uri = topic_to_send[[1, 'item_link']],  # The main link
          title = topic_to_send[[1, 'item_title']], # Title of the preview
          description = topic_to_send[[1, 'item_description']]  # Description of the link
      ))

## Post
bsky$send_post(text = text, embed = embed)

### 7. Combine old and new 
df_old <- rbind(df_old, df_new) |>
  mutate(sent = case_when(
    item_link == topic_to_send$item_link ~ 1,
    TRUE ~ sent
  ))

save(df_old, file = "old_data.rda")

Not the most elegant code, but you get the concept and that will do for now 🤣. You can then set up Task Scheduler or crontab to run it once a week, etc. The posting part can probably be split from the rest of the script so that screening new feed items and posting are separate. I’ll leave that to your creative tidy coding 🙌
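One possible way to start separating the pieces is to wrap the sent bookkeeping in two small functions. This is only a sketch on toy data; pick_unsent and mark_sent are hypothetical names, and the summarizing and posting steps are left as a comment.

```r
library(dplyr)

# Pick one random row that has not been posted yet (zero rows if none are left)
pick_unsent <- function(df) {
  df |>
    filter(sent == 0) |>
    slice_sample(n = 1)
}

# Flag the given link(s) as posted
mark_sent <- function(df, link) {
  df |>
    mutate(sent = if_else(item_link %in% link, 1, sent))
}

# Toy queue standing in for the combined old and new data (assumed data)
queue <- tibble(item_link = c("a", "b", "c"), sent = c(1, 0, 0))

topic <- pick_unsent(queue)
if (nrow(topic) == 1) {
  # ... summarize with the LLM and post to Bluesky here ...
  queue <- mark_sent(queue, topic$item_link)
}
queue
```

The nrow() check also covers the case where everything has already been posted, so a scheduled run can exit quietly instead of erroring.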

Limitation/Opportunities For Improvement

Final Thoughts

Acknowledgement

I really appreciate Jonathan Ryder and Joseph Marcus’ feedback on this project! They helped cross-check what was important to summarize and what wasn’t, and which summaries made sense and which didn’t. They also provided good links for ID journal monitoring. I really couldn’t have done it without their insight and oversight. Give them a follow on 🦋

Lessons Learnt
