
Introducing formatdown

[This article was first published on Layton R blog, and kindly contributed to R-bloggers.]
Summary

Convert the elements of a numerical vector or data frame column to character strings in which the numbers are formatted using powers-of-ten notation in scientific or engineering form and delimited for rendering as inline equations in an rmarkdown document.

This post announces the initial release of the formatdown R package, which provides tools for formatting output in rmarkdown or quarto markdown documents.

This first version has only one function, format_power(), which converts numbers to character strings formatted in powers-of-ten notation and delimited in $...$ for rendering as inline equations in .Rmd or .qmd output documents. It provides two powers-of-ten formatting options (scientific notation and engineering notation), with an option to omit powers-of-ten notation for a specified range of exponents.
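To make the difference between the two notations concrete, here is a minimal base R sketch of the underlying arithmetic. The helper power_of_ten() is hypothetical and is not how format_power() is implemented; it assumes non-zero input and, because it relies on paste0(), it drops trailing zeros (101.0 would print as 101).

```r
# Hypothetical helper for illustration only; it is not part of formatdown
power_of_ten <- function(x, digits = 4, format = c("engr", "sci")) {
  format <- match.arg(format)
  # Exponent of the leading digit, e.g., -4 for 4.567e-4
  exponent <- floor(log10(abs(x)))
  # Engineering notation rounds the exponent down to a multiple of three
  if (format == "engr") exponent <- 3 * floor(exponent / 3)
  # Coefficient rounded to the requested number of significant digits
  coefficient <- signif(x / 10^exponent, digits)
  # Wrap in $...$ so that markdown renders the string as inline math
  paste0("$", coefficient, "\\times{10}^{", exponent, "}$")
}

power_of_ten(4.567e-4, format = "sci")   # "$4.567\\times{10}^{-4}$"
power_of_ten(4.567e-4, format = "engr")  # "$456.7\\times{10}^{-6}$"
```

The only difference between the two forms is the choice of exponent: scientific notation uses the exponent of the leading digit, while engineering notation uses the nearest lower multiple of three.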

To illustrate the different formats, I show in Table 1 the same number rendered using different formats, all with 4 significant digits.

The R code for the post is listed under the “R code” pointers. In the examples, I use data.table syntax for data manipulation, though the code can be translated into base R or dplyr syntax if desired.

< details> < summary>R code
library("formatdown")
library("data.table")

x <- 4.567E-4                                   # value
x1 <- format_power(x, 4, omit_power = c(-6, 0)) # omit power-of-ten
x2 <- format_power(x, 4, format = "sci")        # scientific
x3 <- format_power(x, 4)                        # engineering

# render in markdown table below

Table 1: Rendering a number using different formats

Notation     Name  Value                        Rendered as
without      x1    "$0.0004567$"                \(0.0004567\)
scientific   x2    "$4.567\\times{10}^{-4}$"    \(4.567\times{10}^{-4}\)
engineering  x3    "$456.7\\times{10}^{-6}$"    \(456.7\times{10}^{-6}\)

Background

My first attempt to provide powers-of-ten formatting was in my 2016 package, docxtools. That implementation has several shortcomings.

I wrote its formatting function to accept a data frame as input, which entailed a lot of programming overhead to separate numerical from non-numerical variable classes and to reassemble them after the numerical columns were formatted. This could have been simplified with judicious use of lapply(), with which I was not sufficiently experienced at the time. I also failed to take advantage of formatC() in constructing the output.

With formatdown, my goal is to provide similar functionality but with more concise code, greater flexibility, and a more balanced approach to package dependencies.


Improvements

The primary design change is that the format_power() function operates on a numerical vector instead of a data frame. The benefits of this change are: 1) simpler code that should be easier to revise and maintain; 2) scalar values can be formatted for rendering inline; and 3) data frames can still be formatted, by column, using lapply().
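As a rough sketch of the third point, the pattern below formats only the numeric columns of a data frame, one column at a time, with lapply(). The formatter fmt() is a stand-in I made up so the example runs without formatdown; in practice format_power() would take its place.

```r
# Stand-in formatter for illustration; format_power() would be used in practice
fmt <- function(x) paste0("$", formatC(x, digits = 4, format = "g"), "$")

df <- data.frame(
  trial = c("a", "b"),
  T_K   = c(294.05, 294.15),
  p_Pa  = c(101100, 101000)
)

# Identify the numeric columns, then format only those, column by column
is_num <- vapply(df, is.numeric, logical(1))
df[is_num] <- lapply(df[is_num], fmt)

str(df)
```

Because the formatter sees one vector at a time, no code is needed to split the data frame by class and reassemble it afterwards.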

To illustrate formatting a scalar value inline, the markup for Avogadro’s number (x = 6.0221E+23) in engineering format is given by,

    $N_A =$ `r format_power(x, digits = 5, format = "engr")`

which is rendered (in this output document) as \(N_A = 602.21\times{10}^{21}\).

The second improvement is the addition of an option for scientific notation. For example, the markup for Avogadro’s number in scientific notation is given by,

    $N_A =$ `r format_power(x, digits = 5, format = "sci")`

which renders as \(N_A = 6.0221\times{10}^{23}\).

The third improvement is the addition of an option for omitting powers-of-ten notation over a range of exponents. For example, the markup for x = 1.234E-4 in decimal notation (with the default three significant digits) is given by,

    $x =$ `r format_power(x = 1.234E-4, omit_power = c(-4, 0))`

which renders as \(x = 0.000123\).
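The logic of omitting the power of ten can be sketched as a range check on the exponent. The helper decimal_or_power() below is hypothetical, works on a single non-zero number, and treats both bounds of the range as inclusive, which is consistent with the example above but is otherwise my assumption.

```r
# Hypothetical scalar helper for illustration; not the formatdown implementation
decimal_or_power <- function(x, digits, omit_power) {
  exponent <- floor(log10(abs(x)))
  if (exponent >= omit_power[1] && exponent <= omit_power[2]) {
    # Inside the range: plain decimal notation, no power of ten
    body <- format(signif(x, digits), scientific = FALSE)
  } else {
    # Outside the range: fall back to scientific notation
    body <- paste0(signif(x / 10^exponent, digits), "\\times{10}^{", exponent, "}")
  }
  paste0("$", body, "$")
}

decimal_or_power(1.234e-4, digits = 3, omit_power = c(-4, 0))  # "$0.000123$"
decimal_or_power(1.234e-7, digits = 3, omit_power = c(-4, 0))  # "$1.23\\times{10}^{-7}$"
```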

A final (internal) improvement is a more balanced approach to package dependencies. With a tighter focus on what formatdown is to accomplish compared to docxtools, I have reduced the dependencies to checkmate, wrapr, and data.table.

The package vignette illustrates package usage in detail.

However, having successfully submitted the package to CRAN, I started working on this post and immediately (!) uncovered an issue that had not appeared while working on the package vignettes.


Delimiter issue

I wrote the package vignette using the rmarkdown::html_vignette output style per usual. All the formatted output rendered as expected in that document. I write this blog using quarto. As seen in the examples above, inline math is rendered as expected.

The issue arises when using knitr::kable() and kableExtra::kbl() to display data tables in this blog post. To illustrate, consider this data frame, included with formatdown (ideal gas properties of air at room temperature).

< details> < summary>R code
density
         date  trial humidity    T_K   p_Pa     R  density
       <Date> <char>   <fctr>  <num>  <num> <int>    <num>
1: 2018-06-12      a      low 294.05 101100   287 1.197976
2: 2018-06-13      b     high 294.15 101000   287 1.196384
3: 2018-06-14      c   medium 294.65 101100   287 1.195536
4: 2018-06-15      d      low 293.35 101000   287 1.199647
5: 2018-06-16      e     high 293.85 101100   287 1.198791

Formatting the pressure column, the markup looks OK.

< details> < summary>R code
DT <- copy(density)
DT$p_Pa <- format_power(DT$p_Pa, 4)
DT
         date  trial humidity    T_K                   p_Pa     R  density
       <Date> <char>   <fctr>  <num>                 <char> <int>    <num>
1: 2018-06-12      a      low 294.05 $101.1\\times{10}^{3}$   287 1.197976
2: 2018-06-13      b     high 294.15 $101.0\\times{10}^{3}$   287 1.196384
3: 2018-06-14      c   medium 294.65 $101.1\\times{10}^{3}$   287 1.195536
4: 2018-06-15      d      low 293.35 $101.0\\times{10}^{3}$   287 1.199647
5: 2018-06-16      e     high 293.85 $101.1\\times{10}^{3}$   287 1.198791

knitr::kable() yields the expected output with pressure formatted in engineering notation.

< details> < summary>R code
knitr::kable(DT, align = "r")
date trial humidity T_K p_Pa R density
2018-06-12 a low 294.05 \(101.1\times{10}^{3}\) 287 1.197976
2018-06-13 b high 294.15 \(101.0\times{10}^{3}\) 287 1.196384
2018-06-14 c medium 294.65 \(101.1\times{10}^{3}\) 287 1.195536
2018-06-15 d low 293.35 \(101.0\times{10}^{3}\) 287 1.199647
2018-06-16 e high 293.85 \(101.1\times{10}^{3}\) 287 1.198791

Problem

kableExtra::kbl() does not render the math markup as expected.

< details> < summary>R code
kableExtra::kbl(DT, align = "r")
date trial humidity T_K p_Pa R density
2018-06-12 a low 294.05 $101.1\times{10}^{3}$ 287 1.197976
2018-06-13 b high 294.15 $101.0\times{10}^{3}$ 287 1.196384
2018-06-14 c medium 294.65 $101.1\times{10}^{3}$ 287 1.195536
2018-06-15 d low 293.35 $101.0\times{10}^{3}$ 287 1.199647
2018-06-16 e high 293.85 $101.1\times{10}^{3}$ 287 1.198791

In fact, having loaded kableExtra above, knitr::kable() now fails in the same way.

< details> < summary>R code
knitr::kable(DT, align = "r")
date trial humidity T_K p_Pa R density
2018-06-12 a low 294.05 $101.1\times{10}^{3}$ 287 1.197976
2018-06-13 b high 294.15 $101.0\times{10}^{3}$ 287 1.196384
2018-06-14 c medium 294.65 $101.1\times{10}^{3}$ 287 1.195536
2018-06-15 d low 293.35 $101.0\times{10}^{3}$ 287 1.199647
2018-06-16 e high 293.85 $101.1\times{10}^{3}$ 287 1.198791

Solution

I found a suggestion from MathJax to replace the $ ... $ delimiters with \( ... \), written as "\\(" and "\\)" in an R string. I wrote a short function (below) to do that.

< details> < summary>R code
# Substitute math delimiters: "$...$" becomes "\(...\)"
sub_delim <- function(x) {
  x <- sub("\\$", "\\\\(", x) # first $ becomes \(
  x <- sub("\\$", "\\\\)", x) # second $ becomes \)
  x                           # return the modified vector visibly
}

DT$p_Pa <- sub_delim(DT$p_Pa)
DT
         date  trial humidity    T_K                       p_Pa     R  density
       <Date> <char>   <fctr>  <num>                     <char> <int>    <num>
1: 2018-06-12      a      low 294.05 \\(101.1\\times{10}^{3}\\)   287 1.197976
2: 2018-06-13      b     high 294.15 \\(101.0\\times{10}^{3}\\)   287 1.196384
3: 2018-06-14      c   medium 294.65 \\(101.1\\times{10}^{3}\\)   287 1.195536
4: 2018-06-15      d      low 293.35 \\(101.0\\times{10}^{3}\\)   287 1.199647
5: 2018-06-16      e     high 293.85 \\(101.1\\times{10}^{3}\\)   287 1.198791
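As a variation on the same idea, both delimiters can be swapped in one pass with a single regular expression and a back-reference. The name sub_delim_regex() is my own, and the sketch assumes each string is either wholly wrapped in one $...$ pair or contains no math at all; strings that do not match are returned unchanged.

```r
# Alternative sketch: swap both delimiters with one regular expression
sub_delim_regex <- function(x) {
  # ^\$ matches the opening $, (.*) captures the math, \$$ matches the closing $
  sub("^\\$(.*)\\$$", "\\\\(\\1\\\\)", x)
}

sub_delim_regex("$101.1\\times{10}^{3}$")  # "\\(101.1\\times{10}^{3}\\)"
sub_delim_regex("low")                     # "low"
```

Anchoring the pattern at both ends means a stray $ inside ordinary text is left alone, which is a little more conservative than replacing the first two $ characters found.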

knitr::kable() yields the expected output.

< details> < summary>R code
knitr::kable(DT, align = "c")
date trial humidity T_K p_Pa R density
2018-06-12 a low 294.05 \(101.1\times{10}^{3}\) 287 1.197976
2018-06-13 b high 294.15 \(101.0\times{10}^{3}\) 287 1.196384
2018-06-14 c medium 294.65 \(101.1\times{10}^{3}\) 287 1.195536
2018-06-15 d low 293.35 \(101.0\times{10}^{3}\) 287 1.199647
2018-06-16 e high 293.85 \(101.1\times{10}^{3}\) 287 1.198791

kableExtra::kbl() yields the expected output.

< details> < summary>R code
kableExtra::kbl(DT, align = "c")
date trial humidity T_K p_Pa R density
2018-06-12 a low 294.05 \(101.1\times{10}^{3}\) 287 1.197976
2018-06-13 b high 294.15 \(101.0\times{10}^{3}\) 287 1.196384
2018-06-14 c medium 294.65 \(101.1\times{10}^{3}\) 287 1.195536
2018-06-15 d low 293.35 \(101.0\times{10}^{3}\) 287 1.199647
2018-06-16 e high 293.85 \(101.1\times{10}^{3}\) 287 1.198791

I can use the features from kableExtra to print a pretty table.

< details> < summary>R code
library("kableExtra")

var_names <- c("Date", "Trial", "Humidity", "Temperature", "Pressure", "Gas constant", "Density" )
var_units <- c("", "", "", "[K]", "[Pa]", "[J/(kg K)]", "[kg/m\\(^3\\)]")
var_align <- "r"

DT |> 
  kbl(align = var_align, col.names = var_units) |>
  column_spec(1:6, color = "black", background = "white") |>
  add_header_above(header = var_names, align = var_align, background = "#c7eae5", line_sep = 0) |>
  kable_paper(lightable_options = "basic", full_width = TRUE)

Table 2: Data frame displayed using kableExtra

Date        Trial  Humidity  Temperature  Pressure                 Gas constant  Density
                             [K]          [Pa]                     [J/(kg K)]    [kg/m\(^3\)]
2018-06-12  a      low       294.05       \(101.1\times{10}^{3}\)  287           1.197976
2018-06-13  b      high      294.15       \(101.0\times{10}^{3}\)  287           1.196384
2018-06-14  c      medium    294.65       \(101.1\times{10}^{3}\)  287           1.195536
2018-06-15  d      low       293.35       \(101.0\times{10}^{3}\)  287           1.199647
2018-06-16  e      high      293.85       \(101.1\times{10}^{3}\)  287           1.198791

Follow up

To address this issue, the next version of format_power() will include a new delim argument,

    format_power(x, digits, format, omit_power, delim)

that allows a user to set the math delimiters to $ ... $ or \\( ... \\) or even custom left and right markup to suit their environment.
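One way such an argument might behave is sketched below. The function wrap_math() and the values it accepts are assumptions made for illustration and should not be read as the interface that the next version of formatdown will actually provide.

```r
# Hypothetical sketch; the name wrap_math() and its behavior are assumptions,
# not the actual formatdown interface
wrap_math <- function(x, delim = "$") {
  if (length(delim) == 2) {
    # Custom left and right markup supplied by the user
    left  <- delim[1]
    right <- delim[2]
  } else if (identical(delim, "$")) {
    left  <- "$"
    right <- "$"
  } else if (identical(delim, "\\(")) {
    left  <- "\\("
    right <- "\\)"
  } else {
    stop("unrecognized delim")
  }
  paste0(left, x, right)
}

wrap_math("101.1\\times{10}^{3}")                 # "$101.1\\times{10}^{3}$"
wrap_math("101.1\\times{10}^{3}", delim = "\\(")  # "\\(101.1\\times{10}^{3}\\)"
```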


Fixed exponents

Preparing this post, I adapted a table of water properties from the hydraulics package to use as an example and discovered another, more subtle issue. First, I’ll construct the data frame.

< details> < summary>R code
# Construct a table of water properties
temperature     <- seq(0, 45, 10) + 273.15
density         <- c(1000, 1000, 998, 996, 992)
specific_weight <- c(9809, 9807, 9793, 9768, 9734)
viscosity       <- c(173, 131, 102, 81.7, 67.0) * 1E-8
bulk_modulus    <- c(202, 210, 218, 225, 228) * 1E+7

water <- data.table(temperature, density, specific_weight, viscosity,  bulk_modulus)

water
   temperature density specific_weight viscosity bulk_modulus
         <num>   <num>           <num>     <num>        <num>
1:      273.15    1000            9809  1.73e-06     2.02e+09
2:      283.15    1000            9807  1.31e-06     2.10e+09
3:      293.15     998            9793  1.02e-06     2.18e+09
4:      303.15     996            9768  8.17e-07     2.25e+09
5:      313.15     992            9734  6.70e-07     2.28e+09

Problem

I format all the columns and change the delimiters as described earlier and display the result. The viscosity column reveals the problem.

< details> < summary>R code
DT <- copy(water)

# 5 signif digits
cols_to_format <- c("temperature")
DT[, (cols_to_format) := lapply(.SD, function(x) format_power(x, 5)), .SDcols = cols_to_format]

# 4 signif digits
cols_to_format <- c("specific_weight")
DT[, (cols_to_format) := lapply(.SD, function(x) format_power(x, 4)), .SDcols = cols_to_format]

# 3 signif digits
cols_to_format <- c("viscosity", "bulk_modulus")
DT[, (cols_to_format) := lapply(.SD, function(x) format_power(x)), .SDcols = cols_to_format]

# 3 signif digits omit powers
cols_to_format <- c("density")
DT[, (cols_to_format) := lapply(.SD, function(x) format_power(x, omit_power = c(0, 3))), .SDcols = cols_to_format]

# change the delimiters
DT <- DT[, lapply(.SD, function(x) sub_delim(x))]

# Table 
DT |> 
  kbl(align = "cclrrrr") |>
  kable_paper(lightable_options = "basic", full_width = TRUE) |>
  row_spec(0, background = "#c7eae5") |>
  column_spec(1:5, color = "black", background = "white")
temperature density specific_weight viscosity bulk_modulus
\(273.15\) \(1000\) \(9.809\times{10}^{3}\) \(1.73\times{10}^{-6}\) \(2.02\times{10}^{9}\)
\(283.15\) \(1000\) \(9.807\times{10}^{3}\) \(1.31\times{10}^{-6}\) \(2.10\times{10}^{9}\)
\(293.15\) \(998\) \(9.793\times{10}^{3}\) \(1.02\times{10}^{-6}\) \(2.18\times{10}^{9}\)
\(303.15\) \(996\) \(9.768\times{10}^{3}\) \(817\times{10}^{-9}\) \(2.25\times{10}^{9}\)
\(313.15\) \(992\) \(9.734\times{10}^{3}\) \(670\times{10}^{-9}\) \(2.28\times{10}^{9}\)

The viscosity column displays three values using \(10^{-6}\) and two using \(10^{-9}\). Visually comparing the values in a column is easier if the powers of ten are identical. The table below illustrates the desired result, created by manually editing the two viscosity values.

< details> < summary>R code
# Manually edit strings to illustrate
DT$viscosity[4] <- "\\(0.82\\times{10}^{-6}\\)"
DT$viscosity[5] <- "\\(0.67\\times{10}^{-6}\\)"

# Table 
DT |> 
  kbl(align = "cclrrrr") |>
  kable_paper(lightable_options = "basic", full_width = TRUE) |>
  row_spec(0, background = "#c7eae5") |>
  column_spec(1:5, color = "black", background = "white")
temperature density specific_weight viscosity bulk_modulus
\(273.15\) \(1000\) \(9.809\times{10}^{3}\) \(1.73\times{10}^{-6}\) \(2.02\times{10}^{9}\)
\(283.15\) \(1000\) \(9.807\times{10}^{3}\) \(1.31\times{10}^{-6}\) \(2.10\times{10}^{9}\)
\(293.15\) \(998\) \(9.793\times{10}^{3}\) \(1.02\times{10}^{-6}\) \(2.18\times{10}^{9}\)
\(303.15\) \(996\) \(9.768\times{10}^{3}\) \(0.82\times{10}^{-6}\) \(2.25\times{10}^{9}\)
\(313.15\) \(992\) \(9.734\times{10}^{3}\) \(0.67\times{10}^{-6}\) \(2.28\times{10}^{9}\)

This revision satisfies two conventions of tabulating empirical engineering information.

  1. Units.   With all values reported to the same power of ten, the units can all be interpreted in the same way. In this case, for example, the units of the viscosity coefficients (1.73, 1.31, etc.) are all micropascal-seconds (μPa-s).

  2. Uncertainty.   In rewriting the two viscosity values, I changed from three significant digits to two decimal places, consistent with the assumption that empirical information is reported to the same level of uncertainty unless noted otherwise.


Potential revision

Add the water data to formatdown and the following functionality to format_power().

  1. A new argument (perhaps fixed_power) that automatically selects a fixed exponent for a numerical vector or permits the user to directly assign a fixed exponent.

     format_power(x, digits, format, omit_power, delim, fixed_power)

  2. In conjunction with the fixed power-of-ten, I would also round all numbers in a column to the same number of decimal places to address the uncertainty assumption. This could be a separate argument.
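A rough base R sketch of both ideas follows. The function format_fixed() and its decimals argument are hypothetical; fixed_power borrows the tentative argument name mentioned above. When no exponent is supplied, the sketch picks the engineering exponent of the largest magnitude in the vector, which is only one of several reasonable rules.

```r
# Hypothetical sketch of a fixed exponent with uniform decimal places
format_fixed <- function(x, fixed_power = NULL, decimals = 2) {
  if (is.null(fixed_power)) {
    # Assumed rule: engineering exponent of the largest magnitude in x
    fixed_power <- 3 * floor(log10(max(abs(x))) / 3)
  }
  # Same exponent and same number of decimal places for every element
  coefficient <- formatC(x / 10^fixed_power, format = "f", digits = decimals)
  paste0("\\(", coefficient, "\\times{10}^{", fixed_power, "}\\)")
}

viscosity <- c(173, 131, 102, 81.7, 67.0) * 1E-8
format_fixed(viscosity)
```

For the viscosity values, the last two elements come out as "\\(0.82\\times{10}^{-6}\\)" and "\\(0.67\\times{10}^{-6}\\)", the same strings I typed by hand in the table above.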


Units

And now for something completely different!

Thinking about measurement units, I looked for relevant R packages and found units. With appropriate units, powers-of-ten notation can be practically eliminated. For example, a pressure reading on the order of \(10^{9}\) Pa can be reported in GPa.

To illustrate, I start with the basic water data,

< details> < summary>R code
water
   temperature density specific_weight viscosity bulk_modulus
         <num>   <num>           <num>     <num>        <num>
1:      273.15    1000            9809  1.73e-06     2.02e+09
2:      283.15    1000            9807  1.31e-06     2.10e+09
3:      293.15     998            9793  1.02e-06     2.18e+09
4:      303.15     996            9768  8.17e-07     2.25e+09
5:      313.15     992            9734  6.70e-07     2.28e+09

With tools from the units package, I can define a symbol uP to represent micropoise (a non-SI viscosity unit equal to \(10^{-7}\) Pa-s). And I can write a short function to convert the numbers from basic units to displayed units, for example, converting Pa to GPa (gigapascal) or Pa-s to μP (micropoise).
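The conversion factors involved can be checked with plain arithmetic before bringing in the units package. This sketch uses only base R, and the variable names ending in _uP and _GPa are my own.

```r
# Water properties in base SI units, copied from the table above
viscosity    <- c(173, 131, 102, 81.7, 67.0) * 1E-8  # Pa-s
bulk_modulus <- c(202, 210, 218, 225, 228) * 1E+7    # Pa

# 1 micropoise = 1e-6 poise and 1 poise = 0.1 Pa-s, so 1 uP = 1e-7 Pa-s
viscosity_uP <- viscosity / 1e-7

# 1 GPa = 1e9 Pa
bulk_modulus_GPa <- bulk_modulus / 1e9

signif(viscosity_uP, 3)
signif(bulk_modulus_GPa, 3)
```

With these display units the coefficients fall between 1 and 1000, so no power of ten is needed in the table.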

< details> < summary>R code
library("units")

# Define the uP units
install_unit("uP", "micropoise", "micropoise")

# Function to assign and convert units 
assign_units <- function(x, base_unit, display_unit) {
  
  # convert x to "Units" class in base units
  units(x) <- base_unit
  
  # convert from basic to display units
  units(x) <- as_units(display_unit)
  
  # return value
  x
}

Convert each column and output the results.

< details> < summary>R code
# Apply to one variable at a time
DT <- copy(water)
DT$temperature     <- assign_units(DT$temperature, "K", "degree_C")
DT$density         <- assign_units(DT$density, "kg/m^3", "kg/m^3")
DT$specific_weight <- assign_units(DT$specific_weight, "N/m^3", "kN/m^3")
DT$viscosity       <- assign_units(DT$viscosity, "Pa*s", "uP")
DT$bulk_modulus    <- assign_units(DT$bulk_modulus, "Pa", "GPa")

# Output
DT |> 
  kbl(align = "r") |>
  kable_paper(lightable_options = "basic", full_width = TRUE) |>
  row_spec(0, background = "#c7eae5") |>
  column_spec(1:5, color = "black", background = "white") 
temperature density specific_weight viscosity bulk_modulus
0 [°C] 1000 [kg/m^3] 9.809 [kN/m^3] 17.30 [uP] 2.02 [GPa]
10 [°C] 1000 [kg/m^3] 9.807 [kN/m^3] 13.10 [uP] 2.10 [GPa]
20 [°C] 998 [kg/m^3] 9.793 [kN/m^3] 10.20 [uP] 2.18 [GPa]
30 [°C] 996 [kg/m^3] 9.768 [kN/m^3] 8.17 [uP] 2.25 [GPa]
40 [°C] 992 [kg/m^3] 9.734 [kN/m^3] 6.70 [uP] 2.28 [GPa]

The entries in the data frame are still numeric but are of the “Units” class, enabling math operations among values with compatible units. See the units website for details.

< details> < summary>R code
str(DT)
Classes 'data.table' and 'data.frame':  5 obs. of  5 variables:
 $ temperature    : Units: [°C] num  0 10 20 30 40
 $ density        : Units: [kg/m^3] num  1000 1000 998 996 992
 $ specific_weight: Units: [kN/m^3] num  9.81 9.81 9.79 9.77 9.73
 $ viscosity      : Units: [uP] num  17.3 13.1 10.2 8.17 6.7
 $ bulk_modulus   : Units: [GPa] num  2.02 2.1 2.18 2.25 2.28
 - attr(*, ".internal.selfref")=<externalptr> 

If I were to refine this table further, I would report the numerical values without labels in each cell, moving the unit labels to a sub-header row. Possible future work.


Potential revision

Incorporate tools from the units package to create a new function (perhaps format_units()) that would convert basic units to display units that can substitute for powers-of-ten notation.


Closing

The new formatdown package formats numbers in powers-of-ten notation for inline math markup. A new argument is already in the works for managing the math delimiters. Potential new features include a fixed power-of-ten option as well as replacing powers-of-ten notation with deliberate manipulation of physical units.


Additional software credits

  • checkmate for internal function argument checks
  • wrapr for internal function authoring tools
  • units for managing units of physical quantities
