readtextgrid now uses C++ (and ChatGPT helped)
In this post, I announce the release of version 0.2.0 of the readtextgrid R package, describe the problem that the package solves, and share some thoughts on LLM-assisted programming.
Textgrids are a way to annotate audio data
Praat is a program for speech and acoustic analysis that has been around for over 30 years. It includes a scripting language for manipulating and analyzing data and for creating annotation workflows. Users can annotate intervals or points of time in a sound file using a textgrid object. Here is a screenshot of a textgrid in Praat:

There are three rows in the image, all sharing the same x axis (time):
- Amplitude waveform, showing intensity over time
- Spectrogram, showing how the intensity (color) at frequencies (y) changes over time. Red dots mark estimated formants (resonances) in the speech signal.
- Textgrid of text annotations for the recording
A user can edit the textgrid by adding or adjusting boundaries and
adding annotations, and Praat will save this data to a .TextGrid file.
Other programs can produce .TextGrid files: the textgrid pictured here
is the result of forced alignment, specifically by the Montreal Forced
Aligner. I told the program I said “library tidy verse library
b r m s”, and it looked up the pronunciations of those words and used an
acoustic model to estimate the time intervals of each word and each
speech sound. The aligner produced a .TextGrid file for this alignment.
These textgrids are the bread and butter of some of the research that we
do. For example, our article on speaking/articulation rate in children
involved over 30,000 single-sentence .wav files and .TextGrid files. We
used the alignments to determine the duration of time spent speaking, the
number of vowels in each utterance and hence the speaking rate in
syllables per second.
Reading these .TextGrid files into R was cumbersome, so I wrote and
released readtextgrid, an R package built around one
simple function:
library(tidyverse)
library(readtextgrid)

path_tg <- "_R/data/mfa-out/library-tidyverse-library-brms.TextGrid"
data_tg <- read_textgrid(path_tg)
data_tg
#> # A tibble: 43 × 10
#>    file       tier_num tier_name tier_type tier_xmin tier_xmax  xmin  xmax text 
#>    <chr>         <int> <chr>     <chr>         <dbl>     <dbl> <dbl> <dbl> <chr>
#>  1 library-t…        1 words     Interval…         0      3.60  0     0.08 ""   
#>  2 library-t…        1 words     Interval…         0      3.60  0.08  0.74 "lib…
#>  3 library-t…        1 words     Interval…         0      3.60  0.74  1.12 "tid…
#>  4 library-t…        1 words     Interval…         0      3.60  1.12  1.58 "ver…
#>  5 library-t…        1 words     Interval…         0      3.60  1.58  1.74 ""   
#>  6 library-t…        1 words     Interval…         0      3.60  1.74  2.46 "lib…
#>  7 library-t…        1 words     Interval…         0      3.60  2.46  2.72 "b"  
#>  8 library-t…        1 words     Interval…         0      3.60  2.72  2.9  "r"  
#>  9 library-t…        1 words     Interval…         0      3.60  2.9   3.04 "m"  
#> 10 library-t…        1 words     Interval…         0      3.60  3.04  3.46 "s"  
#> # ℹ 33 more rows
#> # ℹ 1 more variable: annotation_num <int>
The function returns a tidy tibble with one row per annotation. The filename is
stored as a column too so that we can lapply() over a directory of files.
Annotations are numbered so that we can group_by(text, annotation_num) and
have repeated words handled separately.
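For example, a whole directory of textgrids can be stacked into one tibble (a quick sketch; the folder path here is just the one used elsewhere in this post):

library(tidyverse)
library(readtextgrid)

# list every .TextGrid in the folder and stack them into one tibble;
# the `file` column keeps track of which row came from which file
paths <- list.files("_R/data/mfa-out", pattern = "\\.TextGrid$", full.names = TRUE)
data_all <- paths |>
  lapply(read_textgrid) |>
  bind_rows()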
With this textgrid in R, I can measure speaking rate, for example:
data_tg |>
filter(tier_name == "phones", text != "") |>
summarise(
speaking_time = sum(xmax - xmin),
# vowels have numbers to indicate degree of stress
num_vowels = sum(str_detect(text, "\\d"))
) |>
mutate(
syllables_per_sec = num_vowels / speaking_time
)
#> # A tibble: 1 × 3
#> speaking_time num_vowels syllables_per_sec
#> <dbl> <int> <dbl>
#> 1 3.22 13 4.04
Or annotate a spectrogram:
library(tidyverse)
library(ggplot2)
path_spectrogram <- "_R/data/mfa/library-tidyverse-library-brms.csv"
data_spectrogram <- readr::read_csv(path_spectrogram)
#> Rows: 249366 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (6): y, x, power, time, frequency, db
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data_spectrogram |>
mutate(
# reserve more of the color variation for intensities above 15 dB
db = ifelse(db < 15, 15, db)
) |>
ggplot() +
aes(x = time, y = frequency) +
geom_raster(aes(fill = db)) +
geom_text(
aes(label = text, x = (xmin + xmax) / 2),
data = data_tg |> filter(tier_name == "words"),
y = 6500,
vjust = 0
) +
geom_text(
aes(label = text, x = (xmin + xmax) / 2),
data = data_tg |> filter(tier_name == "phones"),
y = 6100,
vjust = 0,
size = 2
) +
ylim(c(NA, 6600)) +
theme_minimal() +
scale_fill_gradient(low = "white", high = "black") +
guides(fill = "none") +
labs(x = "time [s]", y = "frequency [Hz]")
Spectrogram of me saying ‘library tidyverse library brms’

I released the first version of the package in 2020. This package, notably for me, contains the first hex badge I ever made.
My original .TextGrid parser and its problem
Here is what the contents of the .TextGrid file look like. It’s not the whole
file but enough to give a sense of the structure:
path_tg |>
readLines() |>
head(26) |>
c("[... TRUNCATED ... ]") |>
writeLines()
#> File type = "ooTextFile"
#> Object class = "TextGrid"
#>
#> xmin = 0
#> xmax = 3.596009
#> tiers? <exists>
#> size = 2
#> item []:
#> item [1]:
#> class = "IntervalTier"
#> name = "words"
#> xmin = 0
#> xmax = 3.596009
#> intervals: size = 11
#> intervals [1]:
#> xmin = 0.0
#> xmax = 0.08
#> text = ""
#> intervals [2]:
#> xmin = 0.08
#> xmax = 0.74
#> text = "library"
#> intervals [3]:
#> xmin = 0.74
#> xmax = 1.12
#> text = "tidy"
#> [... TRUNCATED ... ]
The first 7 lines provide some metadata about the time range of the
audio and the number of tiers (size = 2). The file then writes out each
tier (item [n] lines) by first giving the class, name, time
duration and number of marks or intervals. Each mark or interval is
enumerated with time values xmin, xmax and text values.
Because nearly everything here follows a key = value syntax and
because sections are split from each other very neatly with item [n]:
or interval [n]: lines, I was able to write a simple parser using
regular expressions: Split the file into item [n] sections, split
those into interval [n] sections, and extract key-value pairs.
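In code, that idea looked roughly like the following (a simplified sketch, not the package’s actual legacy parser):

library(tidyverse)

lines <- readLines(path_tg)

# indices where each `item [n]:` tier section begins
item_starts <- grep("^\\s*item \\[\\d+\\]:", lines)

# key = value pairs such as `xmin = 0.08` or `text = "library"`
key_values <- stringr::str_match(lines, "^\\s*([a-z]+) = (.*)$")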
This easy approach came with limitations. First, the .TextGrid format is much more flexible than my parser assumed. For example, Praat also provides much less verbose “short” format textgrids, which read like a bare stream of time and text values:
path_tg_short <- "_R/data/mfa-out/library-tidyverse-library-brms-short.TextGrid"
path_tg_short |>
readLines() |>
head(26) |>
c("[... TRUNCATED ... ]") |>
writeLines()
#> File type = "ooTextFile"
#> Object class = "TextGrid"
#>
#> 0
#> 3.596009
#> <exists>
#> 2
#> "IntervalTier"
#> "words"
#> 0
#> 3.596009
#> 11
#> 0
#> 0.08
#> ""
#> 0.08
#> 0.74
#> "library"
#> 0.74
#> 1.12
#> "tidy"
#> 1.12
#> 1.58
#> "verse"
#> 1.58
#> 1.74
#> [... TRUNCATED ... ]
Everything is in the same order, but the annotations are gone. It turns
out that all of the helpful labels from before were actually comments
that get ignored. Everything that isn’t a number or a string in
double-quotes (or a <flag>) is a comment.
There are also other quirks (" escapement, ! comments, deviations
between the Praat description of the format and the behavior of
praat.exe). I have them documented as a kind of unofficial
specification in an article on the package website.
But my original regular-expression based parser could only handle the verbose long-format textgrids. I knew this. I put this in a GitHub issue in 2020. And this compatibility oversight was never a problem for me until I tried a new phonetics tool that defaulted to saving the textgrids in the short format. Now, readtextgrid could not in fact “read textgrid”.
The new R-based tokenizer
Josef Fruehwald, a linguist who has written lots of
acoustics/phonetics software, submitted a pull request to implement a
proper parser that I eventually rewrote to handle various edge cases and
undocumented behavior in the .TextGrid specification. I made an
adversarial .TextGrid file 😈 that could still be opened
by praat.exe but was meant to be difficult to parse. This was a fun
development loop: Make the file harder, update the parser to handle the
new feature, repeat.
Because the essential data in the file are just string tokens and number tokens, I needed to make a tokenizer: a piece of software that reads in characters, groups them into tokens, and figures out what kind of data the token represents. The initial R-based version of the tokenizer did the following:
- Read the file character by character
- Gather the characters for the current token and keep them when they form a valid string or number
- Shift between three states (in_string, in_strong_comment for ! comments, in_escaped_quote)
These three states determine how we interpret spaces, newlines, and "
characters. For example, a newline ends a ! comment but a newline can
appear in a string so it doesn’t end a string. Moreover, in a comment,
" is ignored, but in a string, it might be the end of the string or an
escaped quote (doubled double-quotes are used for " characters: the
string """a""" has the text "a").
But at a high level, the code was simple:
for (i in seq_along(all_char)) {
# { ... examine current character ... }
# { ... handle comment state ... }
# { ... collect token if we see whitespace and are not in a string ... }
# { ... handle string and escaped quote state ... }
}
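To make those states concrete, here is a toy tokenizer (a simplified sketch with shortened state names, not the package’s implementation):

# A toy version of the state machine. It only handles strings with ""
# escapes, ! comments, and bare number-ish tokens, which is enough to see
# how the states interact.
tokenize_sketch <- function(text) {
  chars <- strsplit(text, "")[[1]]
  tokens <- character(0)
  buffer <- character(0)
  in_string <- FALSE
  in_comment <- FALSE
  i <- 1
  while (i <= length(chars)) {
    ch <- chars[i]
    if (in_comment) {
      # a newline ends a ! comment; everything else is ignored
      if (ch == "\n") in_comment <- FALSE
    } else if (in_string) {
      if (ch == "\"") {
        # peek ahead: "" is an escaped quote, a lone " closes the string
        peek <- if (i < length(chars)) chars[i + 1] else ""
        if (peek == "\"") {
          buffer <- c(buffer, "\"")
          i <- i + 1
        } else {
          tokens <- c(tokens, paste0(buffer, collapse = ""))
          buffer <- character(0)
          in_string <- FALSE
        }
      } else {
        buffer <- c(buffer, ch)
      }
    } else if (ch == "\"") {
      in_string <- TRUE
    } else if (ch == "!") {
      in_comment <- TRUE
    } else if (grepl("[0-9.<-]", ch)) {
      # gobble a number or <flag> token up to the next whitespace
      j <- i
      while (j <= length(chars) && !grepl("[[:space:]]", chars[j])) j <- j + 1
      tokens <- c(tokens, paste0(chars[i:(j - 1)], collapse = ""))
      i <- j
      next
    }
    # any other character outside a string is comment text, so drop it
    i <- i + 1
  }
  tokens
}

tokenize_sketch("xmin = 0.08 ! a trailing comment\ntext = \"\"\"a\"\"\"")
#> [1] "0.08" "\"a\""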
The new character-by-character parser worked 🎉. It had conquered the adversarial example file, but there was still one more problem. It was slower than the original regular-expression parser!
tg_lines <- readLines(path_tg)

bench::mark(
  legacy = readtextgrid:::legacy_read_textgrid_lines(tg_lines),
  new_r = readtextgrid:::r_read_textgrid_lines(tg_lines)
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 legacy       75.1ms   76.1ms      13.1    6.81MB     19.7
#> 2 new_r        70.7ms   72.3ms      13.7  590.88KB     10.3
At this point, I asked ChatGPT for tips on speeding up the tokenizer.
Some thoughts about LLMs
the thing about (the current) chatgpt is that it writes like a fucking idiot with excellent grammar
— sarah jeong (@sarahjeong.bsky.social) July 6, 2025 at 7:20 PM
Now, let’s talk about large language models (LLMs). There’s a lot I could say about them.1 As a language scientist, I’ll start here: They know syntax. They know which words go together and can generate very plausible sequences of words. They do not know semantics however. They don’t have any firsthand knowledge or experience about what those sequences express. They can’t introspect about that knowledge or experience to see whether things “make sense”.2 They don’t care about the truth or falsity of statements. They just make plausible sequences of words.
Now, it turns out that if you learn how to make sequences of words from an Internet-sized corpus of text, then a lot of the plausible sequences you make will turn out to be true. If you read 10,000 cookbooks, you could probably provide a very classic recipe for scrambled eggs. But because you don’t know about sarcasm or can’t draw on your own experience of trying to not ingest non-food chemicals, you might suggest putting glue on a pizza.
So, as we use an LLM, we need to ask ourselves how much we care about
the truth or care about knowing or understanding things. That may sound
like a glib or weird statement: Shouldn’t we always care about the
truth? Well, sometimes we don’t. We just want some syntax; we want
boilerplate or templates to fill out.3
For example, I can ask an LLM to “write some unit tests for a function
round_to(xs, unit) that rounds a vector of values to an arbitrary
unit” and receive:
test_that("round_to() rounds to nearest multiple of unit", {
expect_equal(round_to(5, 2), 6)
expect_equal(round_to(4.9, 2), 4)
expect_equal(round_to(5.1, 2), 6)
expect_equal(round_to(c(1, 2, 3, 4), 2), c(2, 2, 4, 4))
})
These tests are not useful until I plug in the correct values for the expected output.
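For example, with one plausible definition of round_to() (my own guess at the function, not code from any package), the generated expectations are already off, because round() in R rounds halves to even:

round_to <- function(xs, unit) round(xs / unit) * unit

round_to(5, 2)
#> [1] 4
round_to(4.9, 2)
#> [1] 4
round_to(5.1, 2)
#> [1] 6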
In other cases, we don’t quite care about truth or comprehension because we can get external corroboration.4 When I ask ChatGPT for an obfuscated R script to make Pac-Man in ggplot2, I can run the code to see if it works without trying to decipher its syntax:
library(ggplot2)
ggplot()+
geom_polygon(aes(x,y),
data=within(data.frame(t(sapply(seq(a<-pi/9,2*pi-a,l<-4e2),
function(t)c(cos(t),sin(t))))),
{rbind(.,0,0,cos(a),sin(a))->df;x=df[,1];y=df[,2]}),
fill="#FF0",col=1)+
annotate("point",x=.35,y=.5,size=3)+
annotate("point",x=c(1.4,2,2.6),y=0,size=3)+
coord_equal(xlim=c(-1.2,3),ylim=c(-1.2,1.2))+
theme_void()
#> Error in eval(substitute(expr), e): object '.' not found
(Strangely, this is the case where a dot kills Pac-Man.)
Vibes are semantic vapor
When we abandon caring about truth or understanding things and just rely on external corroboration, we are in the realm of vibe coding. I like this term because of its insouciant honesty: Truth? Comprehension? We’re just going off the vibes. It would be a great help if we used the word more liberally. A YouTube video called “A vibe history of NES videogames”? No thanks.5
If we lean into vibes, we need to get better at external corroboration
and know our programming languages even better. R is a flexible
programming language and it does some things that “help” the user that
can lead to silent bugs. Famously, function arguments and $ will match
partial names.
# Look at the "Call:" in the output
lm(f = hp ~ cyl, d = mtcars)
#> 
#> Call:
#> lm(formula = hp ~ cyl, data = mtcars)
#> 
#> Coefficients:
#> (Intercept)          cyl  
#>      -51.05        31.96  

# There is no `m` column
all(mtcars$m == mtcars$mpg)
#> [1] TRUE
A student I work with was trying to compute sensitivity and specificity on weighted data. The LLM suggested the following:
# Make some weighted data using frequencies
data <- pROC::aSAH |>
  count(outcome, age, name = "weight")

# What the LLM did:
pROC::roc(data, "outcome", "age", weights = data$weight)
#> Setting levels: control = Good, case = Poor
#> Setting direction: controls < cases
#> 
#> Call:
#> roc.data.frame(data = data, response = "outcome", predictor = "age", weights = data$weight)
#> 
#> Data: age in 44 controls (outcome Good) < 30 cases (outcome Poor).
#> Area under the curve: 0.5947
This code runs without any problems. It’s wrong, but it runs. The problem
is that pROC::roc(...) supports variadic arguments (...):
# Note the dots
pROC:::roc |> formals() |> str()
#> Dotted pair list of 1
#> $ ...: symbol
pROC:::roc.data.frame |> formals() |> str()
#> Dotted pair list of 5
#> $ data : symbol
#> $ response : symbol
#> $ predictor: symbol
#> $ ret : language c("roc", "coords", "all_coords")
#> $ ... : symbol
Those ... are for forwarding arguments to other functions that roc()
might call internally. Unfortunately,
functions by default don’t check the contents of the ... to see if they
have unsupported arguments. Thus, bad arguments are ignored silently:
# method and weights are not real arguments
pROC::roc(data, "outcome", "age", method = fake, weights = fake)
#> Setting levels: control = Good, case = Poor
#> Setting direction: controls < cases
#> 
#> Call:
#> roc.data.frame(data = data, response = "outcome", predictor = "age", method = fake, weights = fake)
#> 
#> Data: age in 44 controls (outcome Good) < 30 cases (outcome Poor).
#> Area under the curve: 0.5947
The LLM hallucinated a weights argument, which is a plausible
argument,6 and the ... syntax behavior swallowed it up like
Pac-Man. It always comes back to Pac-Man. I ended up writing a
function
that could compute sens and spec on weighted data.
Unfortunately the space of LLM code errors and the space of human errors are not the same, making hard-won code review instincts misfire
— Eugene Vinitsky 🍒 (@eugenevinitsky.bsky.social) November 13, 2025 at 6:21 PM
As users, we can guard against the partial-matching problems with
options(warnPartialMatchArgs = TRUE, warnPartialMatchDollar = TRUE), and as
developers, we can catch swallowed ... arguments with
rlang::check_dots_used() and friends. But like I said
at the outset, external corroboration requires us to know even more
about the language in order to vibe safely.
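For reference, here is roughly what those guards look like in action (a sketch; roc_like() is a made-up stand-in, and the exact messages depend on your R and rlang versions):

# warn whenever a partial name match happens
options(
  warnPartialMatchArgs = TRUE,
  warnPartialMatchDollar = TRUE
)
all(mtcars$m == mtcars$mpg)
#> Warning in mtcars$m: partial match of 'm' to 'mpg'
#> [1] TRUE

# in a package function, complain when ... contains arguments nothing used
roc_like <- function(data, response, predictor, ...) {
  rlang::check_dots_used()
  # ... real work would go here ...
}
roc_like(data, "outcome", "age", weights = 1)
# rlang complains that `weights` was never used
# (an error in recent rlang versions, a warning in older ones)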
Syntax and semantics, again
In this mini-position statement on LLM assistance, the two principles I am trying to develop are:
- LLMs know text distributions very well. Use them to generate starter syntax.
- LLMs don’t understand anything. It’s all bullshit and vibes.
If we think of LLMs as syntax generators, we can imagine some pretty good use cases:
- Write unit tests for a function that does…
- Set up Roxygen docs for this function
- Create a function to simulate data for a model of rt ~ group + (1 | id)
- Write a Stan program to fit this model. (Mind your priors.)
- Spoiler alert: Convert this R loop into C++ code
Still, we need to be mindful of the semantic limitations and skeptical of the output. We should audit the results and make sure we comprehend them, or admit upfront that this code is running on vibes. In either case, we also need to be vigilant about bugs that could happen silently or bugs that a machine might make but a human wouldn’t (hallucinations).
One thing I worry about with LLM reliance is skill atrophy. If I keep using this bot as a crutch, then some of my skills will get weaker. Sam Mehr has a take I quite like that puts this concern upfront. LLMs are fine for code we don’t feel bothered to learn:
re AI, a PhD student mentioned sheepishly that they used chatgpt for advice on coding up an unusual element in javascript. Almost apologized
— samuel mehr (@mehr.nz) May 13, 2025 at 10:32 PM
I'm like no no no you're a psych PhD, not CS, this is exactly what LLMs are for! Doing a so-so job at things you just need done & don't care about learning!
I quite like programming and want to learn. I like to read the release
notes, dig into the documentation and
experiment
with new modeling features. At the same time, sometimes I just want a
bash script to unzip all .zip files in a directory. Time was, we would
find something from Stack Overflow to adapt for that problem. Now, we
ask ChatGPT for the code, look it over quick, test it and move on. That
seems fine. A metacognitive awareness about what is worth
learning and what problems are worth solving in a slower methodical way
is very useful for an LLM user.
Finally, to be clear—I can’t believe I need to make this disclaimer—we should always care about truth and accuracy when we write prose and publish it and put our name on it. Vibes are not scientific or scholarly. When I see emails or code documentation with immaculate formatting and perfect language, my bullshit sensor goes off and I worry that I need to read extra carefully because a smooth-talking robot is trying to pull a fast one on me. I don’t use LLMs for writing except for proofreading or requests for nitpicking. I have an instruction in ChatGPT that says not to revise anything I write unless it sneaks Magic: The Gathering card names into the output. (Alas, it generally ignores that diabolic edict of mine.)
AI assistance in readtextgrid
Because the old parser was outperforming the newer, more robust parser, I
asked ChatGPT for ways to make my textgrid parsing faster. For example,
one version of the loop collected characters in a vector and then
paste0()-ed them together. ChatGPT suggested that, since we were already
iterating over character indices, we use substring() to extract tokens
from the text instead. That worked, and it ran
faster, until it failed a unit test on a character wearing a diacritic.
After a few rounds of trying to improve the loop, I asked quite
bluntly: “How can we move the tokenize loop into Rcpp or cpp11 with the
viewest [sic] headaches possible”.
And it provided some very legible cpp11 code. I had never used C++ with
R before. To get started, I had to call on
usethis::use_cpp11()
to make the necessary boilerplate—you just need syntax sometimes—and
I had to troubleshoot the first couple versions of the function because
of errors. The cpp11
documentation is small in
a good way. It has examples of converting R code into C++ equivalents,
which is precisely the activity that I was up to.
What I liked about the ChatGPT output is how clear the translation was.
In the R version, part of the character processing loop is to peek ahead
to the next character to see whether " is an escaped quote "" or the
end of a string:
# ... in the character processing loop
# Start or close string mode if we see "
if (c_starts_string) {
# Check for "" escapes
peek_c <- all_char[i + 1]
if (peek_c == "\"" & in_string) {
in_escaped_quote <- TRUE
} else {
in_string <- !in_string
}
}
# ...
And here is the C++ version of the peek ahead code:
// ... helper functions ...
// Is this a UTF-8 continuation byte? (10xxxxxx)
auto is_cont = [](unsigned char b)->bool {
// Are the first two bits 10?
return (b & 0xC0) == 0x80;
};
// ... in the character processing loop ...
if (b == 0x22) { // '"'
// peek ahead to see if we have a double "" escapement
size_t j = i + 1;
    // We need the next character, not just the next byte, so we skip
    // continuation bytes.
while (j < nbytes && is_cont(static_cast<unsigned char>(src[j]))) ++j;
// Use `0x00` dummy character if we are at the end of the string
unsigned char nextb = (j < nbytes) ? static_cast<unsigned char>(src[j]) : 0x00;
if (in_string && nextb == 0x22) {
esc_next = true; // consume next '"' once
} else {
in_string = !in_string;
}
}
// ...
There is a logical correspondence between the lines that I wrote myself in R and the lines that the LLM provided for C++. The C++ version works at the level of bytes instead of characters, and that matters:
"é" |> nchar(type = "chars") #> [1] 1 "é" |> nchar(type = "bytes") #> [1] 2
But the C++ code makes sense to me. It looks plausible, right? Still,
plausible isn’t enough. I asked the LLM a lot of follow-up questions:
what does auto do, what is size_t doing, and so on. And I annotated
the C++ code with comments for my own understanding.
During my auditing, I went down a particular rabbithole to make sure I
understood how Unicode bytes get packed into UTF-8 sequences. I learned
how the character é for example has the codepoint (character number)
U+00E9 in Unicode, so it falls in the range of codepoints that need to
be split into two bytes. The scheme for two-byte
encoding is
character number -> character encoding
codepoint -> 00000yyy yyxxxxxx -> 110yyyyy 10xxxxxx -> UTF-8 bytes
00E9      -> 00000000 11101001 -> 11000011 10101001 -> c3 a9
Which we can check by hand:
bitchar_to_raw <- function(xs) {
xs |>
strsplit("") |>
lapply(function(x) as.integer(x) |> rev() |> packBits()) |>
unlist()
}
bitchar_to_raw(c("11000011", "10101001"))
#> [1] c3 a9
charToRaw("é")
#> [1] c3 a9
In the UTF-8 scheme, bytes that start with 10 are only the second,
third and fourth bytes in a character’s encoding—that is, only the
continuation bytes. Now, at this point, we can comprehend why the C++
code checks for continuation bytes and why that check looks at the first
two bits of each byte.
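We can replay that check from R as a quick sanity check (base R only, nothing from the package):

# the two bytes of "é" are c3 a9; only the second is a continuation byte
bytes <- as.integer(charToRaw("é"))
bitwAnd(bytes, 0xC0) == 0x80
#> [1] FALSE  TRUE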
Another rabbithole involved how to parse numbers. At first, the LLM
suggested I use one of R’s own C functions to handle it. That idea
seems really powerful to me—wait, I can tap into R’s own
routines?!—but R’s parser was a bit stricter than what I needed to
match praat.exe.
This new C++ based tokenizer yielded a huge performance gain:
bench::mark(
  legacy = readtextgrid:::legacy_read_textgrid_lines(tg_lines),
  new_r = readtextgrid:::r_read_textgrid_lines(tg_lines),
  new_cpp = readtextgrid::read_textgrid_lines(tg_lines)
)
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 legacy      65.22ms  68.77ms      13.8    6.49MB     4.59
#> 2 new_r       72.67ms   88.5ms      11.5  363.33KB     5.74
#> 3 new_cpp      3.12ms   3.64ms     272.    96.77KB     4.11
That’s roughly a 20x improvement! Now, I find myself wondering: What else could use a cpp11 speed boost?
One downside of adopting cpp11 is that the package now contains code that
needs to be compiled. As a result, I can’t just tell people to try the
development version of the package with
remotes::install_github()
unless they have a C++ toolchain set up.
CRAN builds binaries, so end users don’t face this issue when
installing the official released version of the package.
One workaround I adopted was relying on R-universe, which provides compiled versions of packages hosted on GitHub. The installation instructions then become:
install.packages(
"readtextgrid",
repos = c("https://tjmahr.r-universe.dev", "https://cloud.r-project.org")
)
You might have seen this pattern elsewhere. cmdstanr skips CRAN entirely and only uses R-universe.
Parting thoughts
An LLM helped me translate pokey R code into fast C++ code. The code is live now on CRAN, released in readtextgrid 0.2.0. I’m maybe kind of a C++ developer now? (Nah.)
This kind of code translation strikes me as an easy win for R developers: “I have my version that works right now, but I think it can go faster. Help me convert this to C++.” I took care to make sure I understood the output. The syntax came easy, but the semantics (comprehension and validation) took more time.
If I ask myself, could I have done this translation to C++ without an LLM? The answer is no, not in a reasonable timeframe, certainly not as fast as the two days it took me in this case. That’s a pretty undeniable boost.
1. Things I won’t talk about: Plagiarism, safety, energy use, hype, undercooked AI features making things slower and dumber, stupid people emboldened by how trivial AI makes everything seem—we won’t need programmers or doctors or historians or whatever is what someone with no expertise in programming, medicine, history, etc. would say—dumdums tearing down fences, creativity versus productivity, aesthetic homogenization or how I keep seeing the same comic style in YouTube thumbnails, nobody asked for slop, oh they did ask for slop, etc. ↩
2. There is something introspective about reasoning models which will break a prompt into steps and work through them. But still, I’m thinking about what the ground truth is in this reasoning. The statistical regularities of word patterns? ↩
3. I think there is a great “tradition”—not sure of the right word here—in learning programming and other tools where we start from a starter template or maybe small sample project and we experimentally tweak the code and iterate until it turns into the thing we want. It’s like scaffolding but at a less metaphorical level: Code that sets a foundation for self-directed learning. ↩
4. I asked ChatGPT for help making a shopping list for a small woodworking project, and it offered a cutting plan for the lumber. Sure, why not? It messed up the math with a plan that involved cutting off 74 inches of wood from a 6-foot piece of lumber. My external corroboration in this case was a scrap of wood. ↩
5. I am still immensely annoyed about a YouTube video that tried to tell me Abadox was a “controversial” NES game. Get out of here. Nobody talked about that game. Show me a newspaper clipping or something. ↩
6. Let’s count functions with weights arguments in some base R packages:

get_funcs_with_weights <- function(pkg) {
  ns <- asNamespace(pkg)
  ls(ns) |>
    lapply(get, envir = ns) |>
    setNames(ls(ns)) |>
    Filter(f = is.function) |>
    lapply(formals) |>
    Filter(f = function(x) "weights" %in% names(x)) |>
    names()
}

get_funcs_with_weights("stats")
#> [1] "density.default" "glm"             "glm.fit"         "lm"
#> [5] "loess"           "nls"             "ppr.default"     "ppr.formula"
#> [9] "predict.lm"      "predLoess"       "simpleLoess"

get_funcs_with_weights("mgcv")
#>  [1] "bam"             "bfgs"            "deriv.check"     "deriv.check5"
#>  [5] "efsud"           "efsudr"          "find.null.dev"   "gam"
#>  [9] "gam.fit3"        "gam.fit4"        "gam.fit5"        "gamm"
#> [13] "gammPQL"         "initial.spg"     "jagam"           "mgcv.find.theta"
#> [17] "mgcv.get.scale"  "newton"          "scasm"           "score.transect"
#> [21] "simplyFit"

get_funcs_with_weights("MASS")
#> [1] "glm.nb"      "glmmPQL"     "polr"        "rlm.default" "rlm.formula"
#> [6] "theta.md"    "theta.ml"    "theta.mm"

get_funcs_with_weights("nlme")
#>  [1] "gls"               "gnls"              "lme"
#>  [4] "lme.formula"       "lme.groupedData"   "lme.lmList"
#>  [7] "nlme"              "nlme.formula"      "nlme.nlsList"
#> [10] "plot.simulate.lme"