Introducing formatdown

[This article was first published on Layton R blog, and kindly contributed to R-bloggers.]
Summary

Convert the elements of a numerical vector or data frame column to character strings in which the numbers are formatted using powers-of-ten notation in scientific or engineering form and delimited for rendering as inline equations in an rmarkdown document.

Initial release of the formatdown R package providing tools for formatting output in rmarkdown or quarto markdown documents.

This first version has one function only, format_power(), for converting numbers to character strings formatted in powers-of-ten notation and delimited in $...$ for rendering as inline equations in .Rmd or .qmd output documents. It provides two powers-of-ten formatting options (scientific notation and engineering notation) with an option to omit powers-of-ten notation for a specified range of exponents.

To illustrate the different formats, I show in Table 1 the same number rendered using different formats, all with 4 significant digits.

The R code for the post is listed under the “R code” pointers. In the examples, I use data.table syntax for data manipulation, though the code can be translated into base R or dplyr syntax if desired.

R code
library("formatdown")
library("data.table")

x <- 4.567E-4                                   # value
x1 <- format_power(x, 4, omit_power = c(-6, 0)) # omit power-of-ten
x2 <- format_power(x, 4, format = "sci")        # scientific
x3 <- format_power(x, 4)                        # engineering

# render in markdown table below
Table 1: Rendering a number using different formats
Notation      Name   Value                        Rendered as
without       x1     "$0.0004567$"                0.0004567
scientific    x2     "$4.567\\times{10}^{-4}$"    4.567 × 10⁻⁴
engineering   x3     "$456.7\\times{10}^{-6}$"    456.7 × 10⁻⁶
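
The difference between the scientific and engineering rows comes down to how the exponent is chosen. Here is a minimal base-R sketch of that idea. It is not the formatdown source: power_sketch() is a hypothetical helper introduced only for illustration, it assumes finite, nonzero input, and it ignores edge cases such as a coefficient rounding up across a power of ten.

```r
# Hypothetical helper for illustration only; not the formatdown source.
# Assumes x is numeric, finite, and nonzero.
power_sketch <- function(x, digits = 3, format = c("engr", "sci")) {
  format <- match.arg(format)

  # Scientific notation: the exponent is the order of magnitude of x
  expo <- floor(log10(abs(x)))

  # Engineering notation: round the exponent down to a multiple of 3
  if (format == "engr") expo <- 3 * floor(expo / 3)

  # Coefficient with the requested number of significant digits,
  # keeping trailing zeros (flag = "#")
  coef <- signif(x / 10^expo, digits)
  coef_chr <- formatC(coef, digits = digits, format = "fg", flag = "#")

  # Delimit as inline math
  paste0("$", coef_chr, "\\times{10}^{", expo, "}$")
}

power_sketch(4.567E-4, 4, "sci")
power_sketch(4.567E-4, 4, "engr")
```

Under these assumptions, the two calls should reproduce the x2 and x3 strings in Table 1.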

Background

My first attempt to provide powers-of-ten formatting was in my 2016 package, docxtools. That implementation has several shortcomings.

I wrote its formatting function to accept a data frame as input, which entailed a lot of programming overhead to separate numerical from non-numerical variable classes and to reassemble them after the numerical columns were formatted. This could have been simplified with judicious use of lapply(), with which I was not sufficiently experienced at the time. I also failed to take advantage of formatC() in constructing the output.

With formatdown, my goal is to provide similar functionality but with more concise code, greater flexibility, and a more balanced approach to package dependencies.

Improvements

The primary design change is that the format_power() function operates on a numerical vector instead of a data frame. The benefits of this change are: 1) simpler code that should be easier to revise and maintain; 2) scalar values can be formatted for rendering inline; and 3) data frames can still be formatted, by column, using lapply().
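
The third benefit, formatting a data frame by column, can be sketched in base R as follows. To keep the sketch runnable without the package, formatC() stands in for format_power(); the small data frame is an assumption built from a few values of the density data shown later in this post.

```r
# Small stand-in data frame (a few values from the density data)
df <- data.frame(
  trial = c("a", "b", "c"),
  T_K   = c(294.05, 294.15, 294.65),
  p_Pa  = c(101100, 101000, 101100)
)

# Identify the numeric columns; non-numeric columns are left untouched
is_num <- vapply(df, is.numeric, logical(1))

# Format each numeric column in place.
# formatC() is only a stand-in here for format_power().
df[is_num] <- lapply(df[is_num], function(col) {
  formatC(col, digits = 3, format = "e")
})

str(df)
```

Because the formatter works on vectors, no code is needed to separate and reassemble column classes beyond the one logical index.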

To illustrate formatting a scalar value inline, the markup for Avogadro’s number (x = 6.0221E+23) in engineering format is given by,

    $N_A =$ `r format_power(x, digits = 5, format = "engr")`

which is rendered (in this output document) as $N_A = 602.21\times{10}^{21}$.

The second improvement is the addition of an option for scientific notation. For example, the markup for Avogadro’s number in scientific notation is given by,

    $N_A =$ `r format_power(x, digits = 5, format = "sci")`

which renders as $N_A = 6.0221\times{10}^{23}$.

The third improvement is the addition of an option for omitting powers-of-ten notation over a range of exponents. For example, the markup for x = 1.234E-4 in decimal notation (with the default three significant digits) is given by,

    $x =$ `r format_power(x = 1.234E-4, omit_power = c(-4, 0))`

which renders as $x = 0.000123$.
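
The range test behind this option can be sketched in a few lines of base R. This is an illustration under my own assumptions, not the package code: in_omit_range() is a hypothetical helper that assumes finite, nonzero input.

```r
# Hypothetical helper for illustration; not the formatdown source.
# Returns TRUE where the order of magnitude of x lies inside omit_power.
in_omit_range <- function(x, omit_power) {
  expo <- floor(log10(abs(x)))
  expo >= min(omit_power) & expo <= max(omit_power)
}

x <- c(1.234E-4, 6.0221E+23)
in_omit_range(x, omit_power = c(-4, 0))

# A value inside the range can be written in plain decimal form
# with three significant digits
formatC(signif(x[1], 3), digits = 3, format = "fg")
```

Only the first value falls inside the range, so only it would be written without a power of ten.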

A final (internal) improvement is a more balanced approach to package dependencies. With a tighter focus on what formatdown is to accomplish compared to docxtools, I have reduced the dependencies to checkmate, wrapr, and data.table.

The package vignette illustrates package usage in detail.

However, having successfully submitted the package to CRAN, I started working on this post and immediately (!) uncovered an issue that had not appeared while working on the package vignettes.

Delimiter issue

I wrote the package vignette using the rmarkdown::html_vignette output style per usual. All the formatted output rendered as expected in that document. I write this blog using quarto. As seen in the examples above, inline math is rendered as expected.

The issue arises when using knitr::kable() and kableExtra::kbl() to display data tables in this blog post. To illustrate, consider this data frame, included with formatdown (ideal gas properties of air at room temperature).

R code
density
         date  trial humidity    T_K   p_Pa     R  density
       <Date> <char>   <fctr>  <num>  <num> <int>    <num>
1: 2018-06-12      a      low 294.05 101100   287 1.197976
2: 2018-06-13      b     high 294.15 101000   287 1.196384
3: 2018-06-14      c   medium 294.65 101100   287 1.195536
4: 2018-06-15      d      low 293.35 101000   287 1.199647
5: 2018-06-16      e     high 293.85 101100   287 1.198791

Formatting the pressure column, the markup looks OK.

R code
DT <- copy(density)
DT$p_Pa <- format_power(DT$p_Pa, 4)
DT
         date  trial humidity    T_K                   p_Pa     R  density
       <Date> <char>   <fctr>  <num>                 <char> <int>    <num>
1: 2018-06-12      a      low 294.05 $101.1\\times{10}^{3}$   287 1.197976
2: 2018-06-13      b     high 294.15 $101.0\\times{10}^{3}$   287 1.196384
3: 2018-06-14      c   medium 294.65 $101.1\\times{10}^{3}$   287 1.195536
4: 2018-06-15      d      low 293.35 $101.0\\times{10}^{3}$   287 1.199647
5: 2018-06-16      e     high 293.85 $101.1\\times{10}^{3}$   287 1.198791

knitr::kable() yields the expected output with pressure formatted in engineering notation.

R code
knitr::kable(DT, align = "r")
date trial humidity T_K p_Pa R density
2018-06-12 a low 294.05 101.1×10³ 287 1.197976
2018-06-13 b high 294.15 101.0×10³ 287 1.196384
2018-06-14 c medium 294.65 101.1×10³ 287 1.195536
2018-06-15 d low 293.35 101.0×10³ 287 1.199647
2018-06-16 e high 293.85 101.1×10³ 287 1.198791

Problem

kableExtra::kbl() does not render the math markup as expected.

R code
kableExtra::kbl(DT, align = "r")
date trial humidity T_K p_Pa R density
2018-06-12 a low 294.05 $101.1\times{10}^{3}$ 287 1.197976
2018-06-13 b high 294.15 $101.0\times{10}^{3}$ 287 1.196384
2018-06-14 c medium 294.65 $101.1\times{10}^{3}$ 287 1.195536
2018-06-15 d low 293.35 $101.0\times{10}^{3}$ 287 1.199647
2018-06-16 e high 293.85 $101.1\times{10}^{3}$ 287 1.198791

In fact, having loaded kableExtra above, knitr::kable() now fails in the same way.

R code
knitr::kable(DT, align = "r")
date trial humidity T_K p_Pa R density
2018-06-12 a low 294.05 $101.1\times{10}^{3}$ 287 1.197976
2018-06-13 b high 294.15 $101.0\times{10}^{3}$ 287 1.196384
2018-06-14 c medium 294.65 $101.1\times{10}^{3}$ 287 1.195536
2018-06-15 d low 293.35 $101.0\times{10}^{3}$ 287 1.199647
2018-06-16 e high 293.85 $101.1\times{10}^{3}$ 287 1.198791

Solution

I found a suggestion from MathJax to replace the $ ... $ delimiters with \\( ... \\). I wrote a short function (below) to do that.

R code
# Substitute math delimiters: $...$ becomes \(...\)
sub_delim <- function(x) {
  x <- sub("\\$", "\\\\(", x) # first $ becomes \(
  x <- sub("\\$", "\\\\)", x) # second $ becomes \)
  x                           # return the result visibly
}

DT$p_Pa <- sub_delim(DT$p_Pa)
DT
         date  trial humidity    T_K                       p_Pa     R  density
       <Date> <char>   <fctr>  <num>                     <char> <int>    <num>
1: 2018-06-12      a      low 294.05 \\(101.1\\times{10}^{3}\\)   287 1.197976
2: 2018-06-13      b     high 294.15 \\(101.0\\times{10}^{3}\\)   287 1.196384
3: 2018-06-14      c   medium 294.65 \\(101.1\\times{10}^{3}\\)   287 1.195536
4: 2018-06-15      d      low 293.35 \\(101.0\\times{10}^{3}\\)   287 1.199647
5: 2018-06-16      e     high 293.85 \\(101.1\\times{10}^{3}\\)   287 1.198791
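
As an aside, the same substitution can be done in a single call with a capture group. This is an alternative sketch rather than what the package will use; swap_delim() is a hypothetical name. Anchoring the pattern to the start and end of the string means that entries not wrapped in $...$ pass through unchanged.

```r
# Alternative sketch: swap both delimiters in one call.
# The pattern matches a string that starts and ends with $ and
# captures everything in between as group 1.
swap_delim <- function(x) {
  sub("^\\$(.*)\\$$", "\\\\(\\1\\\\)", x)
}

swap_delim(c("$101.1\\times{10}^{3}$", "low"))
```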

knitr::kable() yields the expected output.

R code
knitr::kable(DT, align = "c")
date trial humidity T_K p_Pa R density
2018-06-12 a low 294.05 \(101.1\times{10}^{3}\) 287 1.197976
2018-06-13 b high 294.15 \(101.0\times{10}^{3}\) 287 1.196384
2018-06-14 c medium 294.65 \(101.1\times{10}^{3}\) 287 1.195536
2018-06-15 d low 293.35 \(101.0\times{10}^{3}\) 287 1.199647
2018-06-16 e high 293.85 \(101.1\times{10}^{3}\) 287 1.198791

kableExtra::kbl() yields the expected output.

R code
kableExtra::kbl(DT, align = "c")
date trial humidity T_K p_Pa R density
2018-06-12 a low 294.05 \(101.1\times{10}^{3}\) 287 1.197976
2018-06-13 b high 294.15 \(101.0\times{10}^{3}\) 287 1.196384
2018-06-14 c medium 294.65 \(101.1\times{10}^{3}\) 287 1.195536
2018-06-15 d low 293.35 \(101.0\times{10}^{3}\) 287 1.199647
2018-06-16 e high 293.85 \(101.1\times{10}^{3}\) 287 1.198791

I can use the features from kableExtra to print a pretty table.

R code
library("kableExtra")

var_names <- c("Date", "Trial", "Humidity", "Temperature", "Pressure", "Gas constant", "Density" )
var_units <- c("", "", "", "[K]", "[Pa]", "[J/(kg K)]", "[kg/m\\(^3\\)]")
var_align <- "r"

DT |> 
  kbl(align = var_align, col.names = var_units) |>
  column_spec(1:6, color = "black", background = "white") |>
  add_header_above(header = var_names, align = var_align, background = "#c7eae5", line_sep = 0) |>
  kable_paper(lightable_options = "basic", full_width = TRUE)
Table 2: Data frame displayed using kableExtra
Date        Trial  Humidity  Temperature  Pressure                 Gas constant  Density
                             [K]          [Pa]                     [J/(kg K)]    [kg/m\(^3\)]
2018-06-12  a      low       294.05       \(101.1\times{10}^{3}\)  287           1.197976
2018-06-13  b      high      294.15       \(101.0\times{10}^{3}\)  287           1.196384
2018-06-14  c      medium    294.65       \(101.1\times{10}^{3}\)  287           1.195536
2018-06-15  d      low       293.35       \(101.0\times{10}^{3}\)  287           1.199647
2018-06-16  e      high      293.85       \(101.1\times{10}^{3}\)  287           1.198791

Follow up

To address this issue, the next version of format_power() will include a new delim argument,

    format_power(x, digits, format, omit_power, delim)

that allows a user to set the math delimiters to $ ... $ or \\( ... \\) or even custom left and right markup to suit their environment.
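
One way such an argument might behave is sketched below. This is a guess at the design, not the released API: add_delim() is a hypothetical helper, and the accepted values are assumptions based on the three cases just described.

```r
# Hypothetical sketch of a delimiter option; the released argument may differ.
# `x` holds math markup without delimiters.
add_delim <- function(x, delim = "$") {
  if (length(delim) == 2) {
    # custom left and right markup
    left  <- delim[1]
    right <- delim[2]
  } else if (identical(delim, "$")) {
    left  <- "$"
    right <- "$"
  } else if (identical(delim, "\\(")) {
    left  <- "\\("
    right <- "\\)"
  } else {
    stop("unsupported delim")
  }
  paste0(left, x, right)
}

add_delim("101.1\\times{10}^{3}")                # $ ... $
add_delim("101.1\\times{10}^{3}", delim = "\\(") # \( ... \)
add_delim("101.1\\times{10}^{3}", delim = c("<<", ">>"))
```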

Fixed exponents

Preparing this post, I adapted a table of water properties from the hydraulics package to use as an example and discovered another, more subtle issue. First, I’ll construct the data frame.

R code
# Construct a table of water properties
temperature     <- seq(0, 45, 10) + 273.15
density         <- c(1000, 1000, 998, 996, 992)
specific_weight <- c(9809, 9807, 9793, 9768, 9734)
viscosity       <- c(173, 131, 102, 81.7, 67.0) * 1E-8
bulk_modulus    <- c(202, 210, 218, 225, 228) * 1E+7

water <- data.table(temperature, density, specific_weight, viscosity,  bulk_modulus)

water
   temperature density specific_weight viscosity bulk_modulus
         <num>   <num>           <num>     <num>        <num>
1:      273.15    1000            9809  1.73e-06     2.02e+09
2:      283.15    1000            9807  1.31e-06     2.10e+09
3:      293.15     998            9793  1.02e-06     2.18e+09
4:      303.15     996            9768  8.17e-07     2.25e+09
5:      313.15     992            9734  6.70e-07     2.28e+09

Problem

I format all the columns and change the delimiters as described earlier and display the result. The viscosity column reveals the problem.

R code
DT <- copy(water)

# 5 signif digits
cols_to_format <- c("temperature")
DT[, (cols_to_format) := lapply(.SD, function(x) format_power(x, 5)), .SDcols = cols_to_format]

# 4 signif digits
cols_to_format <- c("specific_weight")
DT[, (cols_to_format) := lapply(.SD, function(x) format_power(x, 4)), .SDcols = cols_to_format]

# 3 signif digits
cols_to_format <- c("viscosity", "bulk_modulus")
DT[, (cols_to_format) := lapply(.SD, function(x) format_power(x)), .SDcols = cols_to_format]

# 3 signif digits omit powers
cols_to_format <- c("density")
DT[, (cols_to_format) := lapply(.SD, function(x) format_power(x, omit_power = c(0, 3))), .SDcols = cols_to_format]

# change the delimiters
DT <- DT[, lapply(.SD, function(x) sub_delim(x))]

# Table 
DT |> 
  kbl(align = "cclrrrr") |>
  kable_paper(lightable_options = "basic", full_width = TRUE) |>
  row_spec(0, background = "#c7eae5") |>
  column_spec(1:5, color = "black", background = "white")
temperature density specific_weight viscosity bulk_modulus
\(273.15\) \(1000\) \(9.809\times{10}^{3}\) \(1.73\times{10}^{-6}\) \(2.02\times{10}^{9}\)
\(283.15\) \(1000\) \(9.807\times{10}^{3}\) \(1.31\times{10}^{-6}\) \(2.10\times{10}^{9}\)
\(293.15\) \(998\) \(9.793\times{10}^{3}\) \(1.02\times{10}^{-6}\) \(2.18\times{10}^{9}\)
\(303.15\) \(996\) \(9.768\times{10}^{3}\) \(817\times{10}^{-9}\) \(2.25\times{10}^{9}\)
\(313.15\) \(992\) \(9.734\times{10}^{3}\) \(670\times{10}^{-9}\) \(2.28\times{10}^{9}\)

The viscosity column displays three values using $10^{-6}$ and two using $10^{-9}$. Visually comparing the values in a column is easier if the powers of ten are identical. The table below illustrates the desired result, created by manually editing the two viscosity values.

R code
# Manually edit strings to illustrate
DT$viscosity[4] <- "\\(0.82\\times{10}^{-6}\\)"
DT$viscosity[5] <- "\\(0.67\\times{10}^{-6}\\)"

# Table 
DT |> 
  kbl(align = "cclrrrr") |>
  kable_paper(lightable_options = "basic", full_width = TRUE) |>
  row_spec(0, background = "#c7eae5") |>
  column_spec(1:5, color = "black", background = "white")
temperature density specific_weight viscosity bulk_modulus
\(273.15\) \(1000\) \(9.809\times{10}^{3}\) \(1.73\times{10}^{-6}\) \(2.02\times{10}^{9}\)
\(283.15\) \(1000\) \(9.807\times{10}^{3}\) \(1.31\times{10}^{-6}\) \(2.10\times{10}^{9}\)
\(293.15\) \(998\) \(9.793\times{10}^{3}\) \(1.02\times{10}^{-6}\) \(2.18\times{10}^{9}\)
\(303.15\) \(996\) \(9.768\times{10}^{3}\) \(0.82\times{10}^{-6}\) \(2.25\times{10}^{9}\)
\(313.15\) \(992\) \(9.734\times{10}^{3}\) \(0.67\times{10}^{-6}\) \(2.28\times{10}^{9}\)

This revision satisfies two conventions of tabulating empirical engineering information.

  1. Units.   With all values reported to the same power of ten, the units can all be interpreted in the same way. In this case, for example, the units of the viscosity coefficients (1.73, 1.31, etc.) are all micro-Pascal-seconds (μPa-s).

  2. Uncertainty.   In rewriting the two viscosity values, I changed from three significant digits to two decimal places, consistent with the assumption that empirical information is reported to the same level of uncertainty unless noted otherwise.

Potential revision

Add the water data to formatdown and the following functionality to format_power().

  1. A new argument (perhaps fixed_power) that automatically selects a fixed exponent for a numerical vector or permits the user to directly assign a fixed exponent.

     format_power(x, digits, format, omit_power, delim, fixed_power)

  2. In conjunction with the fixed power-of-ten, I would also round all numbers in a column to the same number of decimal places to address the uncertainty assumption. This could be a separate argument.
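
The two items above can be sketched together in base R. Nothing here exists in the package yet: fixed_power_sketch() and its decimals argument are hypothetical names, and the user supplies the exponent directly rather than having it selected automatically.

```r
# Hypothetical sketch of a fixed exponent; not part of the released package.
fixed_power_sketch <- function(x, fixed_power, decimals = 2) {
  # Scale every value by the same power of ten
  coef <- x / 10^fixed_power

  # Same number of decimal places for every entry
  coef_chr <- formatC(coef, digits = decimals, format = "f")

  # Delimit with \( ... \) as in the tables above
  paste0("\\(", coef_chr, "\\times{10}^{", fixed_power, "}\\)")
}

viscosity <- c(173, 131, 102, 81.7, 67.0) * 1E-8
fixed_power_sketch(viscosity, fixed_power = -6)
```

For the viscosity column, this should give the same strings as the manual edits, with the last two entries written as 0.82 and 0.67 times $10^{-6}$.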

Units

And now for something completely different!

Thinking about measurement units, I looked for relevant R packages and found units. With appropriate units, powers-of-ten notation can be practically eliminated. For example, a value of $2.02\times{10}^{9}$ Pa (the first bulk modulus entry in the water data) can be reported as 2.02 GPa.

To illustrate, I start with the basic water data,

R code
water
   temperature density specific_weight viscosity bulk_modulus
         <num>   <num>           <num>     <num>        <num>
1:      273.15    1000            9809  1.73e-06     2.02e+09
2:      283.15    1000            9807  1.31e-06     2.10e+09
3:      293.15     998            9793  1.02e-06     2.18e+09
4:      303.15     996            9768  8.17e-07     2.25e+09
5:      313.15     992            9734  6.70e-07     2.28e+09

With tools from the units package, I can define a symbol uP to represent micropoise (a non-SI viscosity unit equal to $10^{-7}$ Pa-s). I can also write a short function to convert the numbers from basic units to displayed units, for example, converting Pa to GPa (gigapascal) or Pa-s to μP (micropoise).
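
As a plain-arithmetic cross-check of these conversions, the factors can be applied by hand in base R. This sketch does not use the units package, and the variable names for the factors are my own. It relies on 1 poise being 0.1 Pa-s, so 1 micropoise is 1E-7 Pa-s.

```r
# Conversion factors (assumed names, standard values)
Pa_s_per_uP <- 1E-7   # 1 micropoise = 1e-7 Pa-s
Pa_per_GPa  <- 1E+9   # 1 gigapascal = 1e9 Pa

# Values from the water data
viscosity    <- c(173, 131, 102, 81.7, 67.0) * 1E-8  # Pa-s
bulk_modulus <- c(202, 210, 218, 225, 228) * 1E+7    # Pa

signif(viscosity / Pa_s_per_uP, 3)    # micropoise
signif(bulk_modulus / Pa_per_GPa, 3)  # gigapascal
```

The results can be compared with the viscosity and bulk_modulus columns that the units package produces in the converted table.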

R code
library("units")

# Define the uP units
install_unit("uP", "micropoise", "micropoise")

# Function to assign and convert units 
assign_units <- function(x, base_unit, display_unit) {
  
  # convert x to "Units" class in base units
  units(x) <- base_unit
  
  # convert from basic to display units
  units(x) <- as_units(display_unit)
  
  # return value
  x
}

Convert each column and output the results.

R code
# Apply to one variable at a time
DT <- copy(water)
DT$temperature     <- assign_units(DT$temperature, "K", "degree_C")
DT$density         <- assign_units(DT$density, "kg/m^3", "kg/m^3")
DT$specific_weight <- assign_units(DT$specific_weight, "N/m^3", "kN/m^3")
DT$viscosity       <- assign_units(DT$viscosity, "Pa*s", "uP")
DT$bulk_modulus    <- assign_units(DT$bulk_modulus, "Pa", "GPa")

# Output
DT |> 
  kbl(align = "r") |>
  kable_paper(lightable_options = "basic", full_width = TRUE) |>
  row_spec(0, background = "#c7eae5") |>
  column_spec(1:5, color = "black", background = "white") 
temperature density specific_weight viscosity bulk_modulus
0 [°C] 1000 [kg/m^3] 9.809 [kN/m^3] 17.30 [uP] 2.02 [GPa]
10 [°C] 1000 [kg/m^3] 9.807 [kN/m^3] 13.10 [uP] 2.10 [GPa]
20 [°C] 998 [kg/m^3] 9.793 [kN/m^3] 10.20 [uP] 2.18 [GPa]
30 [°C] 996 [kg/m^3] 9.768 [kN/m^3] 8.17 [uP] 2.25 [GPa]
40 [°C] 992 [kg/m^3] 9.734 [kN/m^3] 6.70 [uP] 2.28 [GPa]

The entries in the data frame are still numeric but are of the “Units” class, enabling math operations among values with compatible units. See the units website for details.

R code
str(DT)
Classes 'data.table' and 'data.frame':  5 obs. of  5 variables:
 $ temperature    : Units: [°C] num  0 10 20 30 40
 $ density        : Units: [kg/m^3] num  1000 1000 998 996 992
 $ specific_weight: Units: [kN/m^3] num  9.81 9.81 9.79 9.77 9.73
 $ viscosity      : Units: [uP] num  17.3 13.1 10.2 8.17 6.7
 $ bulk_modulus   : Units: [GPa] num  2.02 2.1 2.18 2.25 2.28
 - attr(*, ".internal.selfref")=<externalptr> 

If I were to refine this table further, I would report the numerical values without labels in each cell, moving the unit labels to a sub-header row. Possible future work.

Potential revision

Incorporate tools from the units package to create a new function (perhaps format_units()) that would convert basic units to display units that can substitute for powers-of-ten notation.

Closing

The new formatdown package formats numbers in powers-of-ten notation for inline math markup. A new argument is already in the works for managing the math delimiters. Potential new features include a fixed power-of-ten option as well as replacing powers-of-ten notation with deliberate manipulation of physical units.

Additional software credits

  • checkmate for internal function argument checks
  • wrapr for internal function authoring tools
  • units for managing units of physical quantities