Overview: Sorting Version Codes
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Problem statement
Different versions of software are traditionally described using version codes. These codes contain usually numerical values and subsequent versions increment these codes, allowing them to be sorted. The standard is to use “semantic versioning”, where codes are in the form of “major.minor.patch”; the bigger the update, the more important number is incremented.
However, not all software obey semantic versioning standard.
Suppose we have a vector of version codes:
codes <- c("1.2", "1.10", "1.1")
We’d expect them to be ordered as 1.1, 1.2, 1.10. If you try to order version codes using base sort() functionality…
sort(codes) ## [1] "1.1" "1.10" "1.2"
…oh. Where did something go wrong?
This is what’s called “alphabetical sorting”. It checks characters one by one, including digits. “1” is before “2”, so “10” goes before “2”, no matter the value. Thus, it cannot compare numbers, if they have a different number of digits.
Numerous software uses the following trick to have it sort correctly: it pads the numbers with leading 0’s until a certain length. For example, your old camera could name photos like 00034.jpg, 00164.jpg, etc. But this is not the approach to use here. We need a different one.
Sorting version codes requires using something called “natural sort”, where numbers are detected and ordered according to their value (with letters still sorted alphabetically, thus placing e.g. “a99” before “b51”). This is a more intuitive sorting for humans than pure alphabetical sort.
Solutions
Several packages allow natural sorting of character vectors. There’s even a base R solution! However, not all natural sorting algorithms are adapted to handle version codes, which (usually) have a particular structure with dots and/or dashes. Let’s have a look at the options.
Using base R
There’s a facility in base R allowing the user to handle (numeric) version codes. While there is no simple “sort these strings like they were version codes” function, you can wrap your vector in numeric_version() and use associated methods, like comparison operators (<, ==, >=…), as well as sort() method for this class. This is what it looks like:
sort(numeric_version(codes)) ## [1] '1.1' '1.2' '1.10'
There are two main problems with this approach, however. But first, let’s quote the documentation:
Numeric versions are sequences of one or more non-negative integers, usually (…) represented as character strings with the elements of the sequence concatenated and separated by single . or - characters.
This means that version codes cannot have any string components or non-standard separators. As described in the semantic versioning standard, letters are allowed in pre-release identifiers and build metadata, so the base solution cannot even handle all valid cases of semantic versioning, much less the variety of version codes that don’t fit these rules.
Using gtools
gtools is a collection of functions to help with simple tasks when writing R packages. It has a mixedsort() function that detects embedded numbers in strings. However…
gtools::mixedsort(codes) ## [1] "1.10" "1.1" "1.2"
…it is not suited to sorting version codes. The use case here is different, it’s detecting numbers like these in “Aspirin 50mg” and “Aspirin 100mg” (examples taken from the documentation). Dots in version codes are treated as decimal separators, that’s why "1.10" and "1.1" are considered equal.
Using naturalsort
naturalsort isn’t written specifically for sorting version codes either, but…
naturalsort::naturalsort(codes) ## [1] "1.1" "1.2" "1.10"
…it gets the job done. The difference lies in not treating dots as decimal separators (in fact, naturalsort doesn’t recognize decimals at all). Instead, the code detects numbers and non-numbers, then splits the strings into continuous pieces of numbers and non-numbers. If we were to For example, "1.10-a" would be split like that:
split_like_naturalsort("1.10-a")
## [1] "1"  "."  "10" "-a"
These pieces are then used to sort version codes either alphabetically or by value, depending on whether it’s a number or not. Accidentally, this works well for almost all version codes. The only case where it’s working a little weird is when there’s an inconsistency in separators:
weird_codes <- c("1.3-6", "1.3-2", "1.3.4")
naturalsort::naturalsort(weird_codes)
## [1] "1.3-2" "1.3-6" "1.3.4"
I like that naturalsort has decreasing and na.last parameters, implemented just like in base sort(). There’s also a naturalfactor() function that could be useful for long vectors of repeated version codes.
naturalsort isn’t developed since 2016, but this shouldn’t be considered a problem, since the package is in an already stable state. There’s one pet peeve of mine about this package and it’s the implicit coercion of parameters. Let’s say we accidentally made this call:
naturalsort::naturalsort(codes,
                         decreasing = c("true", "dat"))
## [1] "1.10" "1.2"  "1.1"
Turns out c("true", "dat") was coerced to logical, then the first element was taken. It should raise an error saying that the arguments are wrong, but – instead – it tries to work at all costs, over-interpreting the parameters. A similar thing would happen with a vector of logical values.
Using stringr/stringi
You’ve probably already installed stringr and/or stringi if you’ve been using R for a while. Both these packages can do way, way more to strings than just some natural sorting, but we’ll focus on just that. There’s a parameter called numeric in stri_sort(), where we can pass TRUE to sort digits numerically:
# There's analogous stringr::str_sort() function stringi::stri_sort(codes, numeric = TRUE) ## [1] "1.1" "1.2" "1.10"
The overall implementation seems to base on the same idea of separating numbers and non-numbers, and it has the same behavior on inconsistent separators:
stringi::stri_sort(weird_codes, numeric = TRUE) ## [1] "1.3-2" "1.3-6" "1.3.4"
There are decreasing and na_last parameters just like in naturalsort() and base sort(); and again, there’s silent argument coercion I despise. At least I can give half a point for raising a warning when only the first element is used.
Using versionsort
versionsort is our solution to this problem. I’ve implemented, documented, tested, and submitted it to CRAN within 24 hours when working on implementing a feature in deepdep; obviously, it has changed a bit since then.
There’s a ver_sort() function that’s of the main interest here:
versionsort::ver_sort(codes) ## [1] "1.1" "1.2" "1.10"
It works a little differently than other sorts I’ve mentioned before, as it splits the string on separators first (which are sequences of anything that is not a number or a letter), only then separating numbers and non-numbers (i.e. letters). All separators are equal, which gives a little more predictable behavior with inconsistent separators:
versionsort::ver_sort(weird_codes) ## [1] "1.3-2" "1.3.4" "1.3-6"
versionsort is still a Work-In-Progress, which means that it lacks decreasing and na_last parameters, but this whole analysis serves partially to define requirements for future development (more in soon-to-be-published versionsort Dev Plan).
There’s one convenient utility versionsort has, though – the pair of functions named ver_latest() and ver_oldest(). They return max() and min() for version codes, respectively, without the need for sorting the whole vector:
versionsort::ver_latest(codes) ## [1] "1.10" versionsort::ver_oldest(codes) ## [1] "1.1"
Stay tuned for the new, improved versionsort!
Semantic version codes
All these solutions have one thing in common – they don’t detect semantic version codes. There are, however, R packages with this functionality; I’ve found two, described below.
Using semver
semver is the most used R package for handling semantic version codes. It relies heavily on C++ semver implementation under the hood, but it results in a wide range of functionalities: there are many methods for parsed semantic codes, sort() included:
semver_codes <- semver::parse_version(
  c("1.6.0-dev", "1.5.10", "1.5.1", "1.6.0")
)
sort(semver_codes)
## [1] Maj: 1 Min: 5 Pat: 1
## 
## [2] Maj: 1 Min: 5 Pat: 10
## 
## [3] Maj: 1 Min: 6 Pat: 0 Pre: dev
## 
## [4] Maj: 1 Min: 6 Pat: 0
Operators as well:
semver_codes[1:3] < semver_codes[2:4] ## [1] FALSE FALSE TRUE
We can use max() and min() too:
max(semver_codes) ## Maj: 1 Min: 6 Pat: 0
This package has the most functionality of them all…
semver::parse_version("1.0-10")
## Error in parse_ptr(version): invalid character encountered: -
…but is nonetheless limited to semantic version codes.
Using semverutils
Sorting numeric codes is not possible in semverutils. You can only check if a version code is higher than others by calling:
v_code <- semverutils::semVer$new(c("1.6.0-dev"))
v_code$higherThanAll(c("1.5.10", "1.6.0", "1.5.1"))
## [1] TRUE
Unfortunately, the answer isn’t even correct, since "1.6.0-dev" precedes "1.6.0" due to being a pre-release version.
Conclusion
There are a few ways to sort version codes. If your version codes are simple and well-behaved, base::numeric_version() should be enough. If you need to handle semantic version codes, use semver. Otherwise, there is no best solution, but versionsort is the closest you can get, especially after the next big update is published.
Missed a package? Described one incorrectly? Contact us on Twitter or make an issue on Github.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
