
You shall know a word by the company it keeps — so choose your prompts wisely

[This article was first published on Pablo Bernabeu, and kindly contributed to R-bloggers].
    In computational linguistics, word meanings are shaped by their contexts. As the British linguist John Rupert Firth put it in 1957, ‘You shall know a word by the company it keeps’ (see Brunila & LaViolette, 2022, for a re-examination of the intellectual history). It sounds almost like life advice, but Firth meant something technical: words that habitually appear alongside each other tend to share semantic territory. The adjective ‘good’, for instance, is far more likely to appear near ‘kind’, ‘genuine’, ‘fair’ and ‘quality’ than near ‘broken’ or ‘fraud’ – and a model that tracks those neighbours can learn what ‘good’ means without ever being told. The principle extends to polysemy: ‘bank’ means something entirely different in the company of ‘river’ and ‘fishing rod’ than in the company of ‘overdraft’ and ‘mortgage’. Context is everything.
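    Firth’s principle can be demonstrated in a few lines of R. The co-occurrence counts below are invented for illustration, but the logic is the real thing: represent each word by the company it keeps, then compare those company profiles with cosine similarity.

```r
# Toy co-occurrence matrix (invented counts): each row is a word, each
# column a context word, each cell how often the two appear together.
cooc <- rbind(
  good   = c(kind = 8, fair = 7, quality = 6, overdraft = 0, fraud = 1),
  kind   = c(kind = 0, fair = 6, quality = 4, overdraft = 0, fraud = 0),
  broken = c(kind = 1, fair = 0, quality = 2, overdraft = 3, fraud = 5)
)

# Cosine similarity between two co-occurrence vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

cosine(cooc["good", ], cooc["kind", ])    # ~0.75: similar company
cosine(cooc["good", ], cooc["broken", ])  # ~0.33: different company
```

    Even with made-up numbers, the words that keep similar company end up with similar vectors.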

    This deceptively simple insight is the bedrock on which generative AI was built. The earliest computational implementations of Firth’s principle – distributional semantic models such as Latent Semantic Analysis (LSA; Landauer & Dumais, 1997) and the Hyperspace Analogue to Language (Lund & Burgess, 1996) – were modest by today’s standards: a matrix of word co-occurrence counts, a few hundred latent dimensions and a vocabulary of perhaps tens of thousands of words. Yet even these pocket-sized models captured real-world structure with startling fidelity. Louwerse and Zwaan (2009) showed that the frequency with which city names co-occur in English text predicts their actual geographical distances: cities close together on a map tend to be mentioned together more often, and an LSA model trained on text alone can reconstruct approximate maps of the United States without ever seeing one. Louwerse (2011) extended this further, showing that text statistics encode not just geography but sensory properties, emotional associations and conceptual relationships across a wide range of domains. Indeed, distributional language statistics may track some sensorimotor properties of concepts (Bernabeu, 2022; Louwerse & Connell, 2011; cf. Xu et al., 2025), especially after fine-tuning on human sensorimotor ratings (Wu et al., 2026). In short, language does not merely label the world – it encodes its structure, and even a simple co-occurrence model can read that encoding back.

    We can see this for ourselves. The R code included below (click ‘Expand’ to view it) applies LSA – one of the simplest distributional models – to three text collections, projects the resulting word vectors into two dimensions via PCA (principal component analysis) and plots them. In brief, LSA builds a term-document matrix (a large table recording how often each word appears in each document), weights it with TF-IDF (term frequency–inverse document frequency, which highlights words distinctive to particular documents rather than ubiquitous everywhere) and then compresses it via truncated SVD (singular value decomposition, a form of dimensionality reduction). Each corpus is split into two groups: the most distinctive words per group (selected by the difference in mean TF-IDF weight between groups) are plotted in the group’s colour, while the most frequent shared words appear in purple. Words that co-occur in similar contexts cluster together; words from different domains drift apart.

    PCA works by finding new axes – principal components – that capture the maximum variance in the data. Each word receives a loading on each component: a number ranging from −1 to +1 that indicates how strongly that word contributes to that axis of variation (a gentle introduction to PCA in R is available in an earlier post on this blog). High absolute loadings on a component mean that the word is a strong marker of the distinction that component captures.
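    For readers who want to see loadings directly, here is a minimal sketch using R’s built-in prcomp() on the USArrests dataset (not one of this post’s corpora, just a convenient built-in):

```r
# Minimal PCA sketch: prcomp() returns the loadings (rotation) and the
# projected coordinates (x). scale. = TRUE standardises the variables first.
pca <- prcomp(USArrests, scale. = TRUE)
round(pca$rotation[, 1:2], 2)   # loadings: each variable's weight on PC1 and PC2
summary(pca)$importance["Proportion of Variance", 1:2]  # variance captured per component
```

    Because each column of the rotation matrix is a unit vector, every loading falls between −1 and +1, as described above.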

    How are the thematic groups decided? The code computes the mean TF-IDF weight of every word in each group of documents and then takes the difference. Words whose weight is much higher in group A than in group B are classified as distinctive to A, and vice versa. The top 15 words at each extreme become the coloured labels in the plot, while the most frequent words that do not belong to either extreme are labelled ‘Shared’. The grouping is therefore entirely data-driven: no human decides which words are ‘finance’ or ‘energy’ – the corpus statistics do. Above each plot, a table shows the mean loading of each thematic group on the first two principal components, with the highest positive loading per group highlighted in bold. A high absolute loading tells us that a given group of words is strongly aligned with that component – in other words, that the component captures precisely the distinction between those groups. When one group loads heavily on PC1 while another does not, the first principal component is essentially the axis that separates them.
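    The selection rule can be seen in miniature. The TF-IDF weights below are invented, but the computation mirrors the pipeline: average each word’s weight within each group of documents, then subtract.

```r
# Toy TF-IDF matrix (invented weights): rows = words, columns = documents.
tfidf <- cbind(docA1 = c(oil = 0.9, barrel = 0.8, shares = 0.0),
               docA2 = c(oil = 0.7, barrel = 0.6, shares = 0.1),
               docB1 = c(oil = 0.0, barrel = 0.1, shares = 0.8))

# Mean weight in group A minus mean weight in group B: positive = distinctive to A.
spec <- rowMeans(tfidf[, c("docA1", "docA2")]) - rowMeans(tfidf[, "docB1", drop = FALSE])
sort(spec, decreasing = TRUE)
# oil (0.80) and barrel (0.60) are distinctive to group A; shares (-0.75) to group B
```

    No human labels are involved: the group assignments fall out of the weights alone.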

    Reuters Newswire: Finance vs Energy

    The first corpus uses two classic newswire collections from the tm package (Feinerer et al., 2008): acq (50 Reuters articles on corporate acquisitions) and crude (20 articles on crude oil markets). Both have been standard NLP benchmarks since the 1980s (Lewis, 1997). The code builds a TF-IDF weighted term-document matrix, reduces it to a 20-dimensional LSA space via truncated SVD, and computes pairwise cosine similarities – a standard measure of how close two word vectors sit, on a scale from –1 (opposite) to +1 (identical) – using LSAfun::Cosine() (Günther et al., 2016). The PCA loadings table and word-vector plot below show the results.

    pkgs <- c("LSAfun", "tm", "ggplot2", "plotly")
    invisible(lapply(pkgs, function(p)
      if (!requireNamespace(p, quietly = TRUE)) install.packages(p)))
    library(LSAfun)
    library(tm)
    library(ggplot2)
    library(plotly)
    
    # --- Reusable helper: LSA + PCA plot ------------------------------------
    # Builds a TF-IDF term-document matrix, computes a truncated SVD,
    # selects the most distinctive and most shared words and projects them
    # to 2D via PCA *on the selected words only* for maximum spread.
    
    lsa_pipeline <- function(doc_list, labels, grp_a, grp_b,
                             lab_a, lab_b, colour_a, colour_b,
                             top_n = 15, n_shared = 10,
                             k = 20, min_docs = 4) {
      corp <- VCorpus(VectorSource(doc_list))
      corp <- tm_map(corp, content_transformer(tolower))
      corp <- tm_map(corp, removePunctuation)
      corp <- tm_map(corp, removeNumbers)
      corp <- tm_map(corp, removeWords, stopwords("en"))
      corp <- tm_map(corp, stripWhitespace)
      tdm  <- as.matrix(TermDocumentMatrix(corp,
                 control = list(weighting = weightTfIdf,
                                bounds = list(global = c(min_docs, Inf)))))
      k_use <- min(as.integer(k), nrow(tdm) - 1L, ncol(tdm) - 1L)
      sv    <- svd(tdm, nu = k_use, nv = k_use)
      wlsa  <- sv$u %*% diag(sv$d[1:k_use])
      rownames(wlsa) <- rownames(tdm)
      idx_a  <- which(labels == grp_a)
      idx_b  <- which(labels == grp_b)
      mean_a <- rowMeans(tdm[, idx_a, drop = FALSE])
      mean_b <- rowMeans(tdm[, idx_b, drop = FALSE])
      total  <- mean_a + mean_b
      spec   <- mean_a - mean_b           # positive = distinctive to A
      top_a  <- names(sort(spec, decreasing = TRUE))[1:top_n]
      top_b  <- names(sort(spec, decreasing = FALSE))[1:top_n]
      shared_pool <- setdiff(names(sort(total, decreasing = TRUE)),
                             c(top_a, top_b))
      shared <- head(shared_pool, n_shared)
      hl   <- unique(c(top_a, top_b, shared))
      hl   <- hl[hl %in% rownames(wlsa)]
      # PCA on the selected words only, for better spatial spread
      wlsa_hl <- wlsa[hl, , drop = FALSE]
      pca  <- prcomp(wlsa_hl, scale. = FALSE)
      cd   <- data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2],
                         word = rownames(wlsa_hl))
      cd$topic <- ifelse(cd$word %in% top_a & !cd$word %in% top_b, lab_a,
                  ifelse(cd$word %in% top_b & !cd$word %in% top_a, lab_b,
                         "Shared"))
      p <- ggplot(cd, aes(PC1, PC2, colour = topic,
                          text = paste0(word, " (", topic, ")"))) +
        geom_point(size = 0, alpha = 0) +
        scale_colour_manual(values = setNames(c(colour_a, colour_b, "#7B2D8E"),
                                              c(lab_a, lab_b, "Shared")),
                            guide = guide_legend(override.aes = list(size = 3, alpha = 1))) +
        labs(x = "Principal Component 1", y = "Principal Component 2",
             colour = NULL) +
        theme_minimal(base_size = 12) +
        theme(legend.position = "bottom",
              legend.margin   = margin(t = -5),
              axis.title.x    = element_text(margin = margin(t = 12)),
              axis.title.y    = element_text(margin = margin(r = 12)),
              plot.margin     = margin(0, 0, 0, 0))
    
      # Map each word to its group colour for label text
      col_map <- setNames(c(colour_a, colour_b, "#7B2D8E"),
                          c(lab_a, lab_b, "Shared"))
      cd$label_col <- col_map[cd$topic]
    
      # Trim spatial outliers so the dense cluster is readable.
      # Words beyond the IQR fence are dropped from the plot (not from LSA).
      q1  <- quantile(cd$PC1, 0.25); q3 <- quantile(cd$PC1, 0.75)
      iqr <- q3 - q1; fence <- 2.5
      keep <- cd$PC1 >= (q1 - fence * iqr) & cd$PC1 <= (q3 + fence * iqr)
      q1y <- quantile(cd$PC2, 0.25); q3y <- quantile(cd$PC2, 0.75)
      iqry <- q3y - q1y
      keep <- keep & cd$PC2 >= (q1y - fence * iqry) & cd$PC2 <= (q3y + fence * iqry)
      cd <- cd[keep, , drop = FALSE]
    
      pp <- ggplotly(p, tooltip = "text")
      # Hide all ggplot traces from plot AND legend
      for (j in seq_along(pp$x$data)) {   # 'j' avoids shadowing the k argument
        pp$x$data[[j]]$marker$size    <- 0.1
        pp$x$data[[j]]$marker$opacity <- 0
        pp$x$data[[j]]$showlegend     <- FALSE
      }
      # Constrain axes to the data range (with a small pad)
      pad_x <- diff(range(cd$PC1)) * 0.06
      pad_y <- diff(range(cd$PC2)) * 0.06
      # Add text traces per group (toggleable via legend)
      legend_groups <- c(lab_a, lab_b, "Shared")
      legend_cols   <- c(colour_a, colour_b, "#7B2D8E")
      offscreen_x <- max(cd$PC1) + pad_x * 50
      offscreen_y <- max(cd$PC2) + pad_y * 50
      for (i in seq_along(legend_groups)) {
        grp <- legend_groups[i]
        grp_data <- cd[cd$topic == grp, , drop = FALSE]
        if (nrow(grp_data) == 0) next
        # Text trace at actual positions (no legend entry)
        pp <- pp %>% add_trace(
          x = grp_data$PC1, y = grp_data$PC2,
          type = "scatter", mode = "text",
          text = grp_data$word,
          textfont = list(size = 11, color = legend_cols[i]),
          name = grp, legendgroup = grp, showlegend = FALSE,
          hoverinfo = "text",
          hovertext = paste0(grp_data$word, " (", grp, ")"),
          inherit = FALSE
        )
        # Legend-only marker trace (off-screen, linked via legendgroup)
        pp <- pp %>% add_trace(
          x = offscreen_x, y = offscreen_y, type = "scatter", mode = "markers",
          marker = list(size = 12, color = legend_cols[i], opacity = 1,
                        symbol = "circle"),
          name = grp, legendgroup = grp, showlegend = TRUE,
          hoverinfo = "skip", inherit = FALSE
        )
      }
      pp <- pp %>% layout(
        legend = list(orientation = "h", x = 1, xanchor = "right",
                      y = -0.12, tracegroupgap = 4, itemwidth = 30,
                      itemsizing = "constant",
                      font = list(size = 12),
                      bordercolor = "#CCCCCC", borderwidth = 1,
                      bgcolor = "#FAFAFA",
                      xpad = 4, ypad = 10),
        xaxis = list(title = list(text = "Principal Component 1",
                                  standoff = 8),
                     range = c(min(cd$PC1) - pad_x, max(cd$PC1) + pad_x)),
        yaxis = list(title = list(text = "Principal Component 2",
                                  standoff = 8),
                     range = c(min(cd$PC2) - pad_y, max(cd$PC2) + pad_y)),
        margin = list(b = 60)
      )
      list(plot = pp, lsa = wlsa, tdm = tdm, pca = pca, words = cd)
    }
    
    # --- 1. Reuters newswire ------------------------------------------------
    data(acq)
    data(crude)
    
    docs   <- c(lapply(acq, content), lapply(crude, content))
    labels <- c(rep("acq", length(acq)), rep("crude", length(crude)))
    
    res1 <- lsa_pipeline(docs, labels,
      grp_a = "acq", grp_b = "crude",
      lab_a = "Finance", lab_b = "Energy",
      colour_a = "#D55E00", colour_b = "#0072B2",
      min_docs = 4)
    
    # Cosine similarities in the 20-dimensional LSA space
    pairs <- list(
      c("oil", "barrel"), c("shares", "acquisition"),
      c("price", "barrel"), c("price", "shares"),
      c("shares", "oil"), c("acquisition", "barrel"))
    pairs <- Filter(function(p) all(p %in% rownames(res1$lsa)), pairs)
    sims  <- sapply(pairs, function(p)
      round(Cosine(p[1], p[2], tvectors = res1$lsa), 3))
    names(sims) <- sapply(pairs, paste, collapse = " ~ ")
    sims
    #>         oil ~ barrel shares ~ acquisition       price ~ barrel 
    #>                0.675                0.236                0.938 
    #>       price ~ shares         shares ~ oil acquisition ~ barrel 
    #>                0.129               -0.014               -0.043
    Table 1: Mean PCA Loadings on the First Two Components (Highest Positive Loading per Group in Bold, Excluding Shared)
    Group      PC1    PC2
    Energy    .439  -.321
    Finance  -.265   .240
    Shared    .172   .069

    Figure 1: Word Vectors from Reuters Newswire Articles (Finance vs Energy) Projected to Two Dimensions via PCA on a 20-Dimensional LSA Space. Finance terms (vermillion) cluster in a distinct region from energy terms (blue); shared vocabulary occupies intermediate positions. Select an area of the plot to zoom in; double-click to reset.

    The cosine similarities confirm what Figure 1 shows geometrically. Within-domain pairs cluster tightly – oil ~ barrel and price ~ barrel have high positive cosines because these words habitually appear together in oil-market dispatches – while cross-domain pairs like shares ~ oil and acquisition ~ barrel sit near zero: they simply never keep each other’s company. Notice, too, that price ~ shares is far lower than price ~ barrel. The same word, ‘price’, lands in a different region of the space depending on the context in which it predominantly occurs. Firth’s principle, made numerical.

    State of the Union: Pre-War vs Post-War

    From newswire to politics. The sotu package provides the full text of every US State of the Union address. Splitting at 1945 – the end of the Second World War – reveals how American political vocabulary has shifted over two centuries: from the constitutional and agrarian language of the early republic to the geopolitical and welfare-state vocabulary of the modern era. The loadings table and figure below present the results.

    if (!requireNamespace("sotu", quietly = TRUE)) install.packages("sotu")
    
    sotu_texts  <- sotu::sotu_text
    sotu_years  <- sotu::sotu_meta$year
    sotu_labels <- ifelse(sotu_years < 1945, "Pre-1945", "Post-1945")
    
    res2 <- lsa_pipeline(as.list(sotu_texts), sotu_labels,
      grp_a = "Pre-1945", grp_b = "Post-1945",
      lab_a = "Pre-1945", lab_b = "Post-1945",
      colour_a = "#E69F00", colour_b = "#009E73",
      min_docs = 5)
    Table 2: Mean PCA Loadings on the First Two Components (Highest Positive Loading per Group in Bold, Excluding Shared)
    Group        PC1    PC2
    Pre-1945   -.845  -.499
    Post-1945  -.492   .385
    Shared     -.379   .132

    Figure 2: Word Vectors from US State of the Union Addresses Projected to Two Dimensions, Split at 1945. Pre-war speeches (amber) feature constitutional and agrarian vocabulary; post-war speeches (green) shift to geopolitical and welfare-state terms. Select an area of the plot to zoom in; double-click to reset.

    The separation is striking. Table 2 reveals that both groups have negative mean loadings on PC1, so the first component does not cleanly separate them – it mainly captures variance shared across eras (general political vocabulary that appears throughout the full 200-year span). The real separation lives on PC2: pre-war words load negatively while post-war words load positively, confirming that the vertical axis in Figure 2 is the one that distinguishes the two eras. Pre-war presidents address ‘gentlemen’ (the formal salutation of a different era) and discuss ‘vessels’, ‘militia’, ‘commerce’ and ‘treasury’ – the vocabulary of a young republic preoccupied with trade, territorial expansion and the mechanics of governance. Modern presidents speak of ‘tonight’ (State of the Union addresses have been televised since the 1960s), ‘jobs’, ‘nuclear’ and ‘program’ – the vocabulary of a superpower managing a welfare state and a global military presence. Words like ‘congress’, ‘government’, ‘war’ and ‘people’ anchor both eras, sitting in the shared middle ground.

    IMDB Film Reviews: Positive vs Negative

    Now a harder test. The text2vec package includes 5,000 IMDB film reviews labelled as positive or negative – a classic sentiment-analysis benchmark. Unlike the two corpora above, the split here is not by topic but by evaluative tone. Both positive and negative reviews discuss films, characters, plots and acting; the difference lies in the adjectives and evaluative phrasing. This makes the separation task far harder for a simple co-occurrence model – and the result is instructive. The loadings table and figure below present the results.

    if (!requireNamespace("text2vec", quietly = TRUE)) install.packages("text2vec")
    
    data("movie_review", package = "text2vec")
    mv_labels <- ifelse(movie_review$sentiment == 1, "Positive", "Negative")
    
    res3 <- lsa_pipeline(as.list(movie_review$review), mv_labels,
      grp_a = "Positive", grp_b = "Negative",
      lab_a = "Positive", lab_b = "Negative",
      colour_a = "#009E73", colour_b = "#D55E00",
      min_docs = 50)
    Table 3: Mean PCA Loadings on the First Two Components (Highest Positive Loading per Group in Bold, Excluding Shared)
    Group      PC1    PC2
    Negative  .329  -.255
    Positive -.081   .192
    Shared    .103   .281

    Figure 3: Word Vectors from 5,000 IMDB Film Reviews (Positive vs Negative) Projected to Two Dimensions. Unlike the clean topic-based separations in the Reuters and SOTU corpora, the sentiment-based distinction is much muddier: positive and negative reviews share most of their vocabulary, and evaluative words overlap heavily. Select an area of the plot to zoom in; double-click to reset.

    What the Plots Capture – and What They Miss

    Taken together, Figures 1–3 illustrate both the power and the limits of distributional models. Figure 1 captures the real-world distinction between financial and energy markets with striking clarity: domain-specific vocabulary clusters tightly, and a polysemous word like ‘price’ lands in different positions depending on its dominant context – precisely the kind of structure that Louwerse and colleagues have documented at larger scale. Figure 2 captures genuine historical change: pre-war addresses use the vocabulary of a young republic (‘gentlemen’, ‘militia’, ‘vessels’); modern ones use the vocabulary of a superpower (‘tonight’, ‘jobs’, ‘nuclear’), reflecting two centuries of political evolution.

    Figure 3, however, reveals a clear limitation. Because both positive and negative reviews discuss the same subject – films – the topical vocabulary is largely shared, and the evaluative words that do separate them (‘excellent’ vs ‘worst’, for instance) form only a thin layer atop a large common vocabulary. A 20-dimensional LSA space simply lacks the resolution to untangle sentiment from topic. The model captures what people write about more easily than how they feel about it.

    These imprecisions are not accidental; they reflect a fundamental constraint: model capacity.

    From Toy Models to Titans

    The LSA spaces above used just 20 latent dimensions, trained on corpora of a few dozen to a few thousand documents. The vocabulary that survives the minimum-frequency filter numbers in the low thousands. Under these conditions, the model does a remarkable job of sorting finance from energy or 19th-century language from modern – but it lacks the capacity to encode the subtler distributional cues that distinguish evaluative tone, sarcasm or register.

    The history of distributional models is, in large part, a history of scale. As Connell and Lynott (2024) illustrate, growth in model size over the past three decades has been staggering. The LSA models of the late 1990s (Landauer & Dumais, 1997) had a few hundred latent dimensions and were trained on roughly 30,000 documents – already enough to pass synonym tests at near-human levels. Word2Vec (Mikolov et al., 2013) moved to shallow neural networks with a few million learnable parameters trained on billions of words. Then came the Transformer-based models, and the scale exploded: BERT (Devlin et al., 2019) had 340 million parameters, GPT-3 (Brown et al., 2020) reached 175 billion, and today’s largest models are estimated at well over a trillion parameters, trained on text corpora so vast they encompass a substantial fraction of everything ever written on the internet.

    The core principle has not changed: predict the next word on the basis of the company it keeps. What changed is capaciousness. A model with 20 dimensions and 10,000 words can distinguish finance from energy; a model with billions of parameters and trillions of training tokens can distinguish a Shakespearean sonnet from a legal brief, track the implications of a subordinate clause across a 3,000-word passage and generate fluent prose in dozens of languages. Generative AI was not built on a fundamentally new idea about language – it was built by scaling Firth’s old idea up by many orders of magnitude and combining it with a crucial algorithmic innovation.

    The Transformer Revolution

    That algorithmic innovation was the Transformer, introduced by Vaswani et al. in their 2017 paper ‘Attention Is All You Need’. Earlier language models relied on recurrent or convolutional neural networks, which processed words sequentially – reading a sentence one word at a time while trying to hold everything so far in memory. The approach worked, after a fashion, but it was slow and struggled with long-range dependencies.

    The Transformer replaced all of that with multi-head self-attention: a mechanism that lets the model weigh every word in a passage simultaneously, comparing each one directly with every other. In plain terms, attention allows the model to ask, for each word, ‘which other words here matter most for understanding me?’ The idea is simple but transformative. It outperformed existing models on translation and a host of other tasks – without any recurrence or convolution – and was far faster to train in parallel.
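    To make the mechanism concrete, here is single-head scaled dot-product attention, softmax(QKᵀ/√d)V, sketched in base R on tiny random matrices. This is a toy: real Transformers learn the query, key and value projections from data and run many heads in parallel.

```r
set.seed(1)
n <- 4; d <- 3                        # 4 tokens, 3-dimensional vectors
Q <- matrix(rnorm(n * d), n, d)       # queries: what each token is looking for
K <- matrix(rnorm(n * d), n, d)       # keys: what each token offers
V <- matrix(rnorm(n * d), n, d)       # values: the content to be mixed

scores  <- Q %*% t(K) / sqrt(d)       # every token scored against every other token
weights <- exp(scores) / rowSums(exp(scores))  # row-wise softmax: attention weights
output  <- weights %*% V              # each token becomes a weighted mix of all values
rowSums(weights)                      # each row sums to 1: a distribution over tokens
```

    Every token attends to every other token in a single matrix multiplication, which is exactly what makes the architecture parallelisable.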

    With Transformers in hand, NLP entered a new era. Large pretrained models like BERT (Devlin et al., 2019) and the GPT series (Brown et al., 2020) set successive benchmarks for language understanding and generation. The combination of the Transformer architecture with the massive scale described above – hundreds of billions of parameters trained on essentially the whole internet – is what made generative AI possible. From Firth’s insight about co-occurrence, through LSA’s matrix decompositions and Word2Vec’s neural embeddings, to the attention-powered behemoths of today, the thread is continuous: predict the next word on the basis of the company it keeps. But despite their extraordinary power, these models remain predictors of text, not infallible oracles of truth. The Transformer revolution made the storyteller more eloquent; it did not make the storyteller more honest.

    Fluency Is Not Truth

    Crucially, LLMs are optimised for fluency, not truth. They have no built-in fact-checking; they simply predict plausible continuations. As Radford et al. (2019) showed with GPT-2, the training objective is straightforward: learn to predict the next token in a sequence, given all preceding tokens. The loss function rewards fluent, likely text – but it never rewards the model for replying ‘I don’t know.’ Lin et al. (2022) demonstrated with their TruthfulQA benchmark that models frequently produce confident but false answers rather than admitting uncertainty, and that larger models can actually perform worse on truthfulness because they are better at reproducing convincing-sounding misinformation from their training data. The upshot: models tend to guess when unsure, and they guess with alarming confidence.
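    The objective can be caricatured with a bigram model, the simplest possible next-word predictor. It is nothing like an LLM in scale, but the incentive structure is the same: it always outputs its most probable continuation, and nothing in it can say ‘I don’t know’.

```r
txt   <- tolower("the cat sat on the mat and the cat slept")
words <- strsplit(txt, " ")[[1]]

# Count word-to-next-word transitions
bigrams <- table(head(words, -1), tail(words, -1))

# Conditional distribution over the next word, given 'the'
probs <- bigrams["the", ] / sum(bigrams["the", ])
probs[probs > 0]
# P(cat | the) = 2/3, P(mat | the) = 1/3: the model will confidently guess 'cat'
```

    Scale this up by twelve orders of magnitude and condition on whole passages rather than single words, and you have the core of an LLM’s training objective: likely text, not verified text.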

    Consider what this means in practice. Ask an LLM about a niche historical event, and it may cheerfully invent plausible-sounding details – dates, names, citations – that are entirely fabricated. Ask it about a scientific finding at the edges of its training data, and it may blend two real studies into one fictional hybrid, complete with a convincing journal name. This phenomenon, known as hallucination, is not a bug that will eventually be patched away; it is a structural feature of how these models work. Xu et al. (2024) demonstrated formally that if an LLM cannot reliably distinguish true from false statements in its training data, hallucinations are mathematically inevitable. The model’s very fluency becomes its greatest liability: it weaves a convincing narrative whether or not the underlying facts support it. In short, current LLMs are trained to be good storytellers, not guaranteed truth-tellers.

    Why Prompts Matter – and Why One Is Rarely Enough

    Because of this predictive nature, prompt engineering is essential. A vague or generic question will often yield a superficial, off‑target or simply wrong answer. One guide defines prompt engineering as ‘the art and science of designing and optimising prompts to guide AI models towards generating the desired responses’. That sounds rather grand, but in practice it often means something as prosaic as adding context, specifying a format, giving an example or two, and then refining iteratively until the output is actually useful.

    The sensitivity of LLMs to phrasing is remarkable – and, on first encounter, a little humbling. Asking ‘What are some criticisms of capitalism?’ and ‘What are the main drawbacks of market economies?’ can elicit strikingly different responses, even though the questions are conceptually near‑identical. Sclar et al. (2024) showed that even tiny changes – swapping a single word, reordering a clause, adding an explicit instruction to be concise – can dramatically alter what a model produces, with performance varying by up to 76 percentage points across prompt formats for the same task. Wang et al. (2024) found that well-engineered prompts can yield ‘ideal and stable answers’, but that different formulations can have very different effects on performance. Researchers testing LLMs rigorously often try dozens of prompt variants to achieve reliable output. A single query is, in most cases, simply not enough.

    There is also the matter of role, tone and constraints. Instructing the model to respond as a sceptical scientist, a sympathetic teacher or a meticulous copy‑editor changes its behaviour markedly. Asking it to respond in plain English, to avoid jargon, to stay under 150 words or to number its assumptions shapes the answer in ways a bare question never could. Each of these additions is, in Firth’s terms, part of the ‘company’ the prompt keeps – and consequently part of what determines the model’s response.

    What Good Prompting Looks Like

    The savvy user treats an LLM as a collaborator requiring careful, iterative guidance – not a search engine that delivers verdicts on demand. Provide as much background information as you can, and invite the model to ask you any questions before responding. Consider the model’s first reply a draft, not a conclusion. It is worth asking follow‑up questions, pushing back on suspect claims, requesting sources or alternative views, and rephrasing when the model goes off track. Ask it to explain its reasoning. Ask it to consider counter‑arguments. Ask it to flag what it is uncertain about. Each move draws more of the model’s latent capability to the surface.

    This iterative approach mirrors good intellectual practice more generally. A scientist does not run one experiment and publish; they replicate, vary conditions and triangulate across methods. A journalist does not accept a single source; they seek corroboration. A doctor does not diagnose on one symptom; they gather a fuller picture. Using an LLM well requires the same instinct: treat each exchange as one data point in an ongoing investigation, not as the final word.

    A Powerful Tool, Not an Oracle

    Using an LLM is a bit like navigating a foreign city without a map: you will stumble upon genuinely useful places, but you will also take wrong turns, end up in dead ends, and occasionally find yourself confidently heading in exactly the wrong direction. These models will often produce accurate information, because language genuinely encodes reality – words cluster around what they describe, and texts about geography, commodity markets or sensory properties track how the world actually works (Louwerse & Zwaan, 2009; Louwerse, 2011). But an LLM is not ontologically conducive to truth: hallucinations are not a bug to be patched but a mathematical inevitability of the architecture (Xu et al., 2024). The underlying mechanism is still word co-occurrence – Firth’s old principle, scaled up. Neither the brute force of massive training data (Connell & Lynott, 2024) nor the ingenious attention mechanisms of modern architectures (Vaswani et al., 2017) has yet tamed this heuristic machine into a reliable truth-teller.

    Good results take deliberate effort. A well-crafted prompt – with specific context, clear constraints, iterative refinement and healthy scepticism – does not transform the model into a truth engine. What it does is steer its predictions towards the regions of language that most faithfully reflect the world. Skip that effort, and arriving at the right destination becomes a matter of luck rather than design.

    Firth’s insight about words applies equally to prompts: you shall know an answer by the company the question keeps (He et al., 2024; Wang et al., 2024).

    References

    Bernabeu, P. (2022). Language and sensorimotor simulation in conceptual processing: Multilevel analysis and statistical power [Doctoral thesis, Lancaster University]. https://doi.org/10.17635/lancaster/thesis/1795

    Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., … Amodei, D. (2020). Language models are few-shot learners. In Advances in Neural Information Processing Systems (Vol. 33, pp. 1877–1901). https://doi.org/10.48550/arXiv.2005.14165

    Brunila, M., & LaViolette, J. (2022). What company do words keep? Revisiting the distributional semantics of J.R. Firth & Zellig Harris. Proceedings of NAACL 2022. https://doi.org/10.18653/v1/2022.naacl-main.327

    Connell, L., & Lynott, D. (2024). What can language models tell us about human cognition? Current Directions in Psychological Science. https://doi.org/10.1177/09637214241242746

    Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019 (pp. 4171–4186). Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423

    Feinerer, I., Hornik, K., & Meyer, D. (2008). Text mining infrastructure in R. Journal of Statistical Software, 25(5), 1–54. https://doi.org/10.18637/jss.v025.i05

    Firth, J. R. (1957). Studies in Linguistic Analysis. Basil Blackwell.

    Günther, F., Dudschig, C., & Kaup, B. (2016). LSAfun: An R package for computations based on Latent Semantic Analysis. Behavior Research Methods, 48(2), 409–421. https://doi.org/10.3758/s13428-015-0662-x

    He, J., Rungta, M., Koleczek, D., Sekhon, A., Wang, F. X., & Hasan, S. (2024). Does prompt formatting have any impact on LLM performance? arXiv. https://doi.org/10.48550/arXiv.2411.10541

    Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2), 211–240. https://doi.org/10.1037/0033-295X.104.2.211

    Lewis, D. D. (1997). Reuters-21578 text categorization test collection, distribution 1.0 [Dataset]. AT&T Bell Laboratories. http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html

    Lin, S., Hilton, J., & Evans, O. (2022). TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (pp. 3214–3252). https://doi.org/10.18653/v1/2022.acl-long.229

    Louwerse, M. M. (2011). Symbol interdependency in symbolic and embodied cognition. Topics in Cognitive Science, 3(2), 273–302. https://doi.org/10.1111/j.1756-8765.2010.01106.x

    Louwerse, M., & Connell, L. (2011). A taste of words: Linguistic context and perceptual simulation predict the modality of words. Cognitive Science, 35(2), 381–398. https://doi.org/10.1111/j.1551-6709.2010.01157.x

    Louwerse, M. M., & Zwaan, R. A. (2009). Language encodes geographical information. Cognitive Science, 33(1), 51–73. https://doi.org/10.1111/j.1551-6709.2008.01003.x

    Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2), 203–208. https://doi.org/10.3758/BF03204766

    Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv. https://doi.org/10.48550/arXiv.1301.3781

    Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog. https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

    Sclar, M., Choi, Y., Tsvetkov, Y., & Suhr, A. (2024). Quantifying language models’ sensitivity to spurious features in prompt design. In Proceedings of ICLR 2024. https://doi.org/10.48550/arXiv.2310.11324

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (Vol. 30). https://doi.org/10.48550/arXiv.1706.03762

    Wang, L., Chen, X., Deng, X., Wen, H., You, M., Liu, W., Li, Q., & Li, J. (2024). Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs. NPJ Digital Medicine, 7, Article 41. https://doi.org/10.1038/s41746-024-01029-4

    Wu, M., Conde, J., Reviriego, P., & Brysbaert, M. (2026). How does fine-tuning improve sensorimotor representations in large language models? arXiv. https://doi.org/10.48550/arXiv.2603.03313

    Xu, Z., Jain, S., & Kankanhalli, M. (2024). Hallucination is inevitable: An innate limitation of large language models. arXiv. https://doi.org/10.48550/arXiv.2401.11817

    Xu, Q., Peng, Y., Nastase, S. A., Chodorow, M., Wu, M., & Li, P. (2025). Large language models without grounding recover non-sensorimotor but not sensorimotor features of human concepts. Nature Human Behaviour, 9(9), 1871–1886. https://doi.org/10.1038/s41562-025-02203-8
