Hugging Face 🤗, with a warm embrace, meet R ❤️


I’m delighted that R users have access to the incredible Hugging Face pre-trained models. In this demonstration, we walk through a straightforward example of how to use them for sentiment analysis on GPT-generated synthetic evaluation comments. Let’s go!

Interesting Problem 😎

What if you’re faced with a list of survey comments that you need to sift through? Apart from reading them one by one, is there a method that could potentially introduce a new perspective and expedite this process? Are there any models available for performing sentiment analysis?

Objectives:

Brief Intro to Transformers Python Module & Hugging Face

Transformers


In comes Transformers, which provides APIs and tools to easily download and train state-of-the-art pretrained models. Using pretrained models can reduce your compute costs and carbon footprint, and save you the time and resources required to train a model from scratch. These models support common tasks in different modalities, such as NLP, computer vision, audio, and multimodal tasks. Transformers also supports framework interoperability between PyTorch, TensorFlow, and JAX. Pretty cool, right!? But wait, this is a Python 🐍 API! No fear: as we’ve demonstrated before, R can use Python modules with ease through reticulate. Let’s code!


About Hugging Face 🤗

Hugging Face is a technology company specializing in natural language processing (NLP) and machine learning, best known for its Transformers library, an open-source collection of pre-trained models and tools that simplify the use of advanced NLP techniques. Established in 2016, the company has become a significant contributor to the field of AI, democratizing access to state-of-the-art models like BERT, GPT-2, and many others. Their platform allows developers, researchers, and businesses to easily implement complex NLP tasks such as sentiment analysis, text summarization, and machine translation. With a robust community of users contributing to its ecosystem, Hugging Face has become a go-to resource for those looking to harness the power of machine learning for language-based tasks.

Installing Transformers and Loading Module

library(reticulate)
library(tidyverse)
library(DT)

# install transformers
# py_install("transformers", pip = T) # remember to uncomment and do this first

# load transformers module 
transformer <- import("transformers")
autotoken <- transformer$AutoTokenizer
autoModelClass <- transformer$AutoModelForSequenceClassification

The R code above for loading transformers resembles the following Python:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

Load Reuters Dataset

# load data
df <- read_csv("reuters_headlines2.csv") |>
  head(10)

# extract the headlines section
df_list <- df |>
  pull(Headlines)

Load Pre-trained Model & Predict

If you go to the Hugging Face Models section, click on Text Classification, and sort by Most Likes, you’ll see the most popular text-classification models. Through the wisdom of the crowd, I think the top-liked pre-trained models are good ones to try out. Let’s give them a try!

Load model

tokenizer <- autotoken$from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model <- autoModelClass$from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

Let’s look at what the model predicts

model$config$id2label

## $`0`
## [1] "NEGATIVE"
## 
## $`1`
## [1] "POSITIVE"

Ahh, OK. For distilbert-base-uncased-finetuned-sst-2-english, the output is either NEGATIVE (index 0) or POSITIVE (index 1).
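
If you prefer to look a label up programmatically rather than read it off the printout, the same mapping can be indexed directly. A minimal sketch, assuming (as the output above suggests) that reticulate converts id2label into a named R list:

# hedged sketch: look up the label for a class index (names are "0" and "1")
model$config$id2label[["0"]]   # "NEGATIVE"
model$config$id2label[["1"]]   # "POSITIVE"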

Let’s feed our data into the tokenizer and see what’s inside.

inputs <- tokenizer(df_list, padding=TRUE, truncation=TRUE, return_tensors='pt') # pt stands for pytorch

inputs$data

## $input_ids
## tensor([[  101,  3956,  2000,  2224,  3424,  1011,  7404,  6627,  2000,  4675,
##          21887, 23350,  1005,  8841,  4099,  1005,   102,     0,     0],
##         [  101,  1057,  1012,  1055,  1012,  4259,  2457,  2000,  3319,  9531,
##           1997, 14316,  3860,  4277,   102,     0,     0,     0,     0],
##         [  101, 24547, 28637,  3619,  2000,  2485,  2055,  3263,  5324,  1999,
##           2142,  2163,   102,     0,     0,     0,     0,     0,     0],
##         [  101,  2762,  1005,  1055,  2482,  3422, 16168,  4520, 20075, 29227,
##           2000,  6366,  6206,  7937,  4007,  1024,  3189,   102,     0],
##         [  101,  2859,  2758,  1057,  1012,  1055,  1012,  2323,  2425,  7608,
##           2000,  2689, 11744,  1999,  6629,  5216,   102,     0,     0],
##         [  101, 21396,  6290,  4152,  2117,  6105,  4895, 23467,  2075,  7045,
##           7708,  2011,  1057,  1012,  1055,  1012, 17147,   102,     0],
##         [  101, 10321, 20202,  1999,  2148,  3792,  2000,  3789,  2006,  2586,
##           3989,  1024,  1059,  2015,  3501,   102,     0,     0,     0],
##         [  101,  3119,  1011,  7591, 15768,  2006, 14607,  2004, 12503, 21094,
##            102,     0,     0,     0,     0,     0,     0,     0,     0],
##         [  101,  4717,  2884,  4487,  3736,  9397, 25785,  2015,  2007,  2117,
##           1011,  4284,  3463,  1010, 17472,  3105,  7659,   102,     0],
##         [  101,  2317,  2160,  1005,  1055, 23524,  2758,  1005,  2093,  9326,
##           2017,  1005,  2128,  2041,  1005,  2005,  1062,  2618,   102]])
## 
## $attention_mask
## tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
##         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
##         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
##         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
##         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
##         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
##         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
##         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
##         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
##         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

Interesting! input_ids are the numerical representations of the tokens in your input sequence(s). The first value, 101, is the special token [CLS], which is often used as a sequence classifier in models like BERT. The attention_mask tensor indicates which positions in the input sequence should be attended to and which should not (usually padding positions). A 1 means the position is used in the attention mechanism, while a 0 usually marks padding or another value to be ignored.

Now let’s dive into the tokenization of the data

df_list[1:5]

inputs$data[[1]][0:4] # notice that Python indexing begins at 0

Running the two lines above side by side in the console shows the actual words and their tokens. It looks like token 2000 is “to”. Note that each sequence begins with 101 and ends with 102.

In transformer models like BERT, certain special tokens are often used to help the model understand the task it should perform. These special tokens are represented by special IDs. The 101 and 102 tokens are such special tokens, and they have particular meanings:

101 represents the [CLS] (classification) token. This is usually the first token in a sequence and is used for classification tasks. For tasks like sequence classification, the hidden state corresponding to this token is used as the aggregate sequence representation for classification.

102 represents the [SEP] (separator) token. This token is used to separate different segments in a sequence. For instance, if you’re inputting two sentences into BERT for a task like question-answering or natural language inference, the [SEP] token helps the model distinguish between the two sentences.
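
If you’d rather check these mappings yourself than eyeball the console output, the tokenizer exposes helper methods for converting between tokens and ids. A small hedged sketch:

# hedged sketch: inspect the subword tokens of the first headline
tokenizer$tokenize(df_list[1])

# and confirm the id of a single token (the console output suggests "to" is 2000)
tokenizer$convert_tokens_to_ids("to")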

As practice, what tokens are U.S.? Hover here for answer.

Let’s Check The Prediction

# reticulate cannot unpack keyword arguments with Python's ** syntax, so pass them individually
outputs <- model(inputs$input_ids, attention_mask=inputs$attention_mask)

outputs

## $logits
## tensor([[ 2.1572, -1.8241],
##         [-1.6254,  1.5929],
##         [ 1.3497, -1.1460],
##         [ 3.3878, -2.8804],
##         [ 3.8068, -3.1309],
##         [ 2.1719, -1.8269],
##         [ 1.6600, -1.5161],
##         [ 2.0822, -1.8792],
##         [ 4.2344, -3.4873],
##         [ 1.8456, -1.4874]], grad_fn=<AddmmBackward0>)

Ahh, these are logits. Also note that we cannot do model(**inputs) as in Python; we have to pass the parameters individually.
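
One possible workaround, not from the original post and offered only as a hedged sketch: since inputs$data comes back as a named R list (as printed earlier), do.call() can splice those names in as keyword arguments, which should be equivalent to Python’s ** unpacking:

# hedged alternative to Python's model(**inputs):
# splice the named list of tensors in as keyword arguments
outputs <- do.call(model, inputs$data)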

Load torch and change to probability

torch <- import("torch")
predictions <- torch$nn$functional$softmax(outputs$logits, dim=1L)

predictions

## tensor([[9.8168e-01, 1.8320e-02],
##         [3.8484e-02, 9.6152e-01],
##         [9.2384e-01, 7.6159e-02],
##         [9.9811e-01, 1.8920e-03],
##         [9.9903e-01, 9.6959e-04],
##         [9.8199e-01, 1.8008e-02],
##         [9.5992e-01, 4.0076e-02],
##         [9.8132e-01, 1.8679e-02],
##         [9.9956e-01, 4.4292e-04],
##         [9.6555e-01, 3.4454e-02]], grad_fn=<SoftmaxBackward0>)

Yes! They’re probabilities now. But how do we turn these tensors into a tibble?

# turn tensor to list
pred_table <- predictions$tolist()

# map list into dataframe
table <- map_dfr(pred_table, ~ tibble(positive = .[2], negative = .[1]))

datatable(table)

Awesome! Looks like at least the coding worked. Let’s combine the comments and the scores to check.

df |>
  head(10) |>
  select(Headlines) |>
  add_column(table) |>
  datatable()

Wow, most of the headlines are quite negative. 🤣 I’m not sure distilbert-base-uncased-finetuned-sst-2-english is the best pre-trained model for these data.

Let’s check out ProsusAI/finbert

tokenizer <- autotoken$from_pretrained("ProsusAI/finbert")
model <- autoModelClass$from_pretrained("ProsusAI/finbert")
inputs <- tokenizer(df_list, padding=TRUE, truncation=TRUE, return_tensors='pt')
outputs <- model(inputs$input_ids, attention_mask=inputs$attention_mask)
predictions <- torch$nn$functional$softmax(outputs$logits, dim=1L)
pred_table <- predictions$tolist()
table <- map_dfr(pred_table, ~ tibble(positive = .[1], negative = .[2], neutral = .[3]))

df |>
  select(Headlines) |>
  add_column(table) |>
  datatable()

I like the additional neutral option. This might actually be very helpful for our real problem, the evaluation comments.
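
One caution worth adding here (my note, not the original post’s): the column order in the map_dfr() call above assumes finbert’s label indices line up as positive, negative, neutral. It’s worth confirming that the same way we did for distilbert:

# hedged sanity check: confirm which index maps to which sentiment for finbert
model$config$id2label
# if the printed order differs from positive / negative / neutral,
# adjust the tibble() column mapping above accordingly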

Predict GPT-4 Generated Comments 🤖

First, Generate Data

Second, Use finBERT for Sentiment Analysis

eval_df <- read_csv("eval_comment.csv") |> 
  pull(comment)

inputs <- tokenizer(eval_df, padding=TRUE, truncation=TRUE, return_tensors='pt')
outputs <- model(inputs$input_ids, attention_mask=inputs$attention_mask)
predictions <- torch$nn$functional$softmax(outputs$logits, dim=1L)
pred_table <- predictions$tolist()
table <- map_dfr(pred_table, ~ tibble(positive = .[1], negative = .[2], neutral = .[3]))

df_final <- tibble(comment = eval_df) |>
  add_column(table) |>
  select(-negative) |>
  mutate(positive = positive + neutral) |>
  select(-neutral)

datatable(df_final)

Wow, not bad! If we set a threshold of 0.9 or more to screen out negative comments, we might do pretty well!
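
As a hedged follow-up (not part of the original workflow), applying such a threshold is a one-liner with dplyr, for example to pull out the comments that fall below it for manual review:

# hedged sketch: surface comments below the 0.9 combined positive/neutral score
df_final |>
  filter(positive < 0.9) |>
  datatable()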

Third, datatable with backgroundColor conditions for Aesthetics 📊

datatable(df_final, options = list(columnDefs = list(list(visible = FALSE, targets = 2)))) |>
  formatStyle(columns = "comment",
              backgroundColor = styleInterval(cuts =
                c(0.5, 0.95), values =
                c('#FF000033', '#FFA50033', '#0000FF33')
              ),
                valueColumns = "positive")

Notice that I set the upper cut at 0.95 to ensure all the negative comments are captured. That is, only comments with a positive score above 0.95 get a blue background, anything between 0.5 and 0.95 is orange, and anything below 0.5 is red.


We’re done!!! Now we know how to access Hugging Face pre-trained models through transformers! This opens up another realm of awesomeness!

Acknowledgement

  • This Colab notebook really helped me modify some of the code to make it work in R
  • Thanks to my brother Ken S’ng, who inspired me to explore Hugging Face with his previous Python script
  • Thanks to ChatGPT for generating the synthetic evaluation data!
  • Of course, last but not least, the wonderful open-source community of Hugging Face! 🤗

Lessons learnt

  • Markdown hover text can be achieved through [](## "")
  • Changing the alpha of a hex colour code can be done with a ChatGPT prompt.
  • There are tons of great pre-trained models on Hugging Face; I can’t wait to explore further!

