Oh, I’m sure it’s probably nothing


Language interoperability and different ways of enabling "polyglot" workflows seem to have taken center stage in the data world recently:

  • Apache Arrow promises a language-independent memory format for interoperability
  • RStudio announced its rebranding as Posit to cement its place as a leader in language-agnostic data tooling
  • RStudio simultaneously announced Quarto as an interoperable alternative to RMarkdown which treats python, Julia, and JS as first-class citizens
  • dbt has released its beta of python models to extend its previously SQL-focused paradigm

As a general matter, these are all exciting advances with great potential to aid different workflows when used judiciously. However, they also raise the question: what cognitive burdens do we alleviate and which do we add when our projects begin to leverage multiple languages?

In this post, I recap a quick case study of how incautious null handling risks data analysis validity. Then, taking a step back, I compare how R, python, and SQL behave differently when confronted with null values and the implications for analysts switching between languages.

A summary of these different behaviors is provided below:

|                         | R                | python             | SQL      |
|-------------------------|------------------|--------------------|----------|
| Column Aggregation      | NA               | np: NA / pd: Value | Value    |
| Row-wise Transformation | NA               | NA                 | NA       |
| Joining                 | Match by default | Match              | No match |
| Filtering               | No match         | Match              | No match |

Case Study

Before comparing different languages, let's walk through a brief case study to see all the ways that "lurking" nulls can surprise a junior analyst in any one language and to observe a few different "contours" of the problem space.

Consider two tables in a retailer’s database. The spend table reports total sales by month and store identifier (null if online).

#>   STORE_ID MONTH AMT_SPEND
#> 1        1     1 100.72247
#> 2        2     1  99.14809
#> 3       NA     1 100.13906
#> 4        1     2  99.06847
#> 5        2     2 100.45664
#> 6       NA     2  99.58250
#> 7        1     3  99.21077
#> 8        2     3 101.70533
#> 9       NA     3        NA

Similarly, the returns table reports returned sales at the same grain.

#>   STORE_ID MONTH AMT_RETURN
#> 1        1     1         NA
#> 2        2     1  10.044820
#> 3       NA     1  10.144207
#> 4        1     2  10.172516
#> 5        2     2  10.219208
#> 6       NA     2  10.095349
#> 7        1     3   9.974947
#> 8        2     3  10.072654
#> 9       NA     3   9.953221

In both cases, nulls are used in the 'AMT_*' fields to denote zeros for the respective month x store_id combinations.
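For readers who want to follow along in R, here is a minimal sketch that reconstructs both tables as data frames (the values are simply copied from the printouts above; the original tables were presumably simulated):

# reconstruct the example tables from the printed values above
spend <- data.frame(
  STORE_ID  = rep(c(1, 2, NA), times = 3),
  MONTH     = rep(1:3, each = 3),
  AMT_SPEND = c(100.72247, 99.14809, 100.13906,
                 99.06847, 100.45664, 99.58250,
                 99.21077, 101.70533, NA)
)

returns <- data.frame(
  STORE_ID   = rep(c(1, 2, NA), times = 3),
  MONTH      = rep(1:3, each = 3),
  AMT_RETURN = c(NA,        10.044820, 10.144207,
                 10.172516, 10.219208, 10.095349,
                  9.974947, 10.072654,  9.953221)
)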

To calculate something as simple as the average gross spend per store across months, an analyst might attempt to write:

select 
  store_id, 
  avg(amt_spend)
from spend
group by 1
order by 1


| store_id | avg(amt_spend) |
|---------:|---------------:|
| NA       | 99.86078       |
| 1        | 99.66723       |
| 2        | 100.43669      |

However, because SQL silently drops nulls in column aggregations, the online spend is not appropriately “penalized” for its lack of March spend. The averages across all three stores look nearly equal.

Not only is this answer "wrong", it can also be thought of as fundamentally changing the computand (a word I just made up: in statistics, we talk about estimands as "the conceptual thing we are trying to estimate with an estimator". Here, we aren't estimating anything – just computing. But there's still a conceptual "thing we are trying to measure", and in this case it's our tools and not our methods that are imposing assumptions on it) to one that answers a fundamentally different question:

Instead of measuring "average monthly spend in Q1 by store", we're measuring "average monthly spend in Q1 by store, conditional on there being spend".

To obtain the correct result, one would write:

select 
  store_id, 
  -- wrong answers
  avg(amt_spend) as wrong1,  
  sum(amt_spend) / count(amt_spend) as wrong2,
  -- right answers
  sum(amt_spend) / count(1) as right1,
  avg(coalesce(amt_spend, 0)) as right2
from spend
group by 1
order by 1


| store_id | wrong1    | wrong2    | right1    | right2    |
|---------:|----------:|----------:|----------:|----------:|
| NA       | 99.86078  | 99.86078  | 66.57385  | 66.57385  |
| 1        | 99.66723  | 99.66723  | 99.66723  | 99.66723  |
| 2        | 100.43669 | 100.43669 | 100.43669 | 100.43669 |
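For comparison, a rough dplyr sketch of the same query (assuming the spend data frame from earlier) makes the wrong/right distinction explicit through na.rm and coalesce:

library(dplyr)

spend %>%
  group_by(STORE_ID) %>%
  summarize(
    wrong  = mean(AMT_SPEND, na.rm = TRUE),       # drops the null month, like SQL's avg()
    right1 = sum(AMT_SPEND, na.rm = TRUE) / n(),  # divide by all months, null or not
    right2 = mean(coalesce(AMT_SPEND, 0))         # treat a null as zero spend
  )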

With a better understanding of gross sales, the analyst might next proceed to compute net sales.

This first requires joining the spend and returns tables. Naively, they might attempt:

select 
  spend.*,
  returns.amt_return
from 
  spend
  inner join
  returns 
  on
  spend.store_id = returns.store_id and
  spend.month = returns.month


| STORE_ID | MONTH | AMT_SPEND | amt_return |
|---------:|------:|----------:|-----------:|
| 1        | 1     | 100.72247 | NA         |
| 2        | 1     | 99.14809  | 10.044820  |
| 1        | 2     | 99.06847  | 10.172516  |
| 2        | 2     | 100.45664 | 10.219207  |
| 1        | 3     | 99.21077  | 9.974947   |
| 2        | 3     | 101.70533 | 10.072654  |

However, this once again fails. Why? Although SQL handled nulls "permissively" when aggregating a column, it takes a stricter stance when making the comparison spend.store_id = returns.store_id in the join clause: SQL doesn't recognize different nulls as equal. To the extent that null means "I dunno" versus "this field is not relevant to this observation", it's reasonable that SQL should find it hard to decide whether two "I dunno"s are equal.

Once again, this isn’t a “random” or inconsequential error. Continuing to use this corrupted dataset changes the computand from “net sales by month” to “net sales by month at physical retail locations”.

To remedy this, we can force store_id to take on a value:

select
  spend.*,
  returns.amt_return
from 
  spend
  inner join
  returns 
  on
  coalesce(spend.store_id, 999) = coalesce(returns.store_id, 999) and
  spend.month = returns.month


| STORE_ID | MONTH | AMT_SPEND | amt_return |
|---------:|------:|----------:|-----------:|
| 1        | 1     | 100.72247 | NA         |
| 2        | 1     | 99.14809  | 10.044820  |
| NA       | 1     | 100.13906 | 10.144207  |
| 1        | 2     | 99.06847  | 10.172516  |
| 2        | 2     | 100.45664 | 10.219207  |
| NA       | 2     | 99.58250  | 10.095349  |
| 1        | 3     | 99.21077  | 9.974947   |
| 2        | 3     | 101.70533 | 10.072654  |
| NA       | 3     | NA        | 9.953221   |

And next we proceed with computing sales by month net of returns across all stores:

select
  spend.month, 
  sum(amt_spend - amt_return) as net_spend
from 
  spend
  inner join
  returns 
  on
  coalesce(spend.store_id, 999) = coalesce(returns.store_id, 999) and
  spend.month = returns.month
group by 1
order by 1


| month | net_spend |
|------:|----------:|
| 1     | 179.0981  |
| 2     | 268.6205  |
| 3     | 180.8685  |

However, by now, you should not be surprised that this result is also incorrect. If we inspect the sequence of computations, we realize that SQL is also stricter in its null handling in row-wise computations than in column-wise aggregations. The subtraction of amt_spend and amt_return obliterates the total when either is null. So, we fail to include the gross spend at Store 1 in January simply because there were no returns (and vice versa for Internet sales in March).

select
  spend.month, 
  spend.store_id,
  amt_spend,
  amt_return,
  amt_spend - amt_return as net_spend
from 
  spend
  inner join
  returns 
  on
  coalesce(spend.store_id, 999) = coalesce(returns.store_id, 999) and
  spend.month = returns.month


| month | store_id | amt_spend | amt_return | net_spend |
|------:|---------:|----------:|-----------:|----------:|
| 1     | 1        | 100.72247 | NA         | NA        |
| 1     | 2        | 99.14809  | 10.044820  | 89.10327  |
| 1     | NA       | 100.13906 | 10.144207  | 89.99485  |
| 2     | 1        | 99.06847  | 10.172516  | 88.89595  |
| 2     | 2        | 100.45664 | 10.219207  | 90.23743  |
| 2     | NA       | 99.58250  | 10.095349  | 89.48715  |
| 3     | 1        | 99.21077  | 9.974947   | 89.23582  |
| 3     | 2        | 101.70533 | 10.072654  | 91.63268  |
| 3     | NA       | NA        | 9.953221   | NA        |

A few ways to get the correct answer are shown below:

select
  spend.month, 
  sum(coalesce(amt_spend,0) - coalesce(amt_return,0)) as right1,
  sum(amt_spend) - sum(amt_return) as right2
from 
  spend
  inner join
  returns 
  on
  coalesce(spend.store_id, 999) = coalesce(returns.store_id, 999) and
  spend.month = returns.month
group by 1
order by 1


| month | right1   | right2   |
|------:|---------:|---------:|
| 1     | 279.8206 | 279.8206 |
| 2     | 268.6205 | 268.6205 |
| 3     | 170.9153 | 170.9153 |
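Pulling the pieces together, a hedged dplyr sketch of the full net-spend calculation might look like the following (dplyr's inner_join matches nulls by default, so no coalesce is needed on the keys, only on the amounts):

library(dplyr)

spend %>%
  inner_join(returns, by = c("STORE_ID", "MONTH")) %>%  # NA store_ids match by default
  group_by(MONTH) %>%
  summarize(net_spend = sum(coalesce(AMT_SPEND, 0) - coalesce(AMT_RETURN, 0)))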

Observations

The preceding example hopefully illustrates a few points:

  • Nulls can cause issues in the most basic of analyses
  • Beyond causing random or marginal errors, null handling changes the questions being answered
  • Even within a language, null handling may feel inconsistent (w.r.t. strictness) across different operations

So, with that, let’s compare languages!

Comparison

Below, we compare how R, SQL, and python handle column aggregation, rowwise transformation, joining, and filtering.

Aggregation

SQL, as we saw before, simply ignores nulls in aggregation functions.

select 
  sum(x) as sum_x, 
  sum(if(x is null,1,0)) as n_null_x
from tbl


| sum_x | n_null_x |
|------:|---------:|
| 3     | 1        |

Built by and for statisticians, R is scandalized at the very idea of attempting to do math with null columns. For aggregation functions, it returns NA as a form of protest should any entry of the vector provided be null. (This can be overridden with the na.rm parameter.)

x <- c(1,2,NA)
sum(x)
#> [1] NA
df <- data.frame(x = x)
dplyr::summarize(df, x = sum(x))
| x  |
|---:|
| NA |
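As a quick illustration of that override:

sum(x, na.rm = TRUE)
#> [1] 3
dplyr::summarize(df, x = sum(x, na.rm = TRUE))  # x = 3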

When it comes to python, well, it depends. Base and numpy operations act more like R, whereas pandas aggregation acts more like SQL.

import pandas as pd
import numpy as np
x = [1,2,np.nan]
y = [3,4,5]
df = pd.DataFrame({'x':x,'y':y})
sum(x)

#> nan

np.sum(x)

#> nan

df.agg({'x': ['sum']})

#>        x
#> sum  3.0

Transformation

No language permits it: a row-wise transformation involving a null returns a null in R, python, and SQL alike.

df <- data.frame(x = c(1,2,NA), y = 3:5)
dplyr::mutate(df, z = x-y)
| x  | y | z  |
|---:|--:|---:|
| 1  | 3 | -2 |
| 2  | 4 | -2 |
| NA | 5 | NA |

df$z <- df$x - df$y  # base R arithmetic behaves the same way
df

| x  | y | z  |
|---:|--:|---:|
| 1  | 3 | -2 |
| 2  | 4 | -2 |
| NA | 5 | NA |
select *, x-y as z
from tbl


| x  | y | z  |
|---:|--:|---:|
| 1  | 3 | -2 |
| 2  | 4 | -2 |
| NA | 5 | NA |
np.array(x) - np.array(y)

#> array([-2., -2., nan])

df.assign(z = lambda d: d.x - d.y)

#>      x  y    z
#> 0  1.0  3 -2.0
#> 1  2.0  4 -2.0
#> 2  NaN  5  NaN
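If the intent is to treat a missing value as zero, that guard has to be applied before the arithmetic. A minimal dplyr sketch, using the same df defined in the R example above:

dplyr::mutate(df, z = dplyr::coalesce(x, 0) - dplyr::coalesce(y, 0))
#>    x y  z
#> 1  1 3 -2
#> 2  2 4 -2
#> 3 NA 5 -5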

Joining

As we saw in the case study, SQL does not match on nulls.

Consider tbl1 and tbl2 as shown below:

select * from tbl1


| A | B  | X    |
|--:|:---|:-----|
| 1 | NA | TRUE |
select * from tbl2


| A | B  | Y     |
|--:|:---|:------|
| 1 | NA | FALSE |

Attempts to join return no results:

select tbl1.*, tbl2.Y 
from 
  tbl1 inner join tbl2 
  on 
  tbl1.A = tbl2.A and 
  tbl1.B = tbl2.B


| A | B | X | Y |
|---|---|---|---|

In contrast, the default behavior for base R's merge and dplyr's joins is to match on nulls. (Although either behavior can be altered with the incomparables or na_matches arguments, respectively, as sketched after the output below.)

df1 <- data.frame(A = 1, B = NA, X = TRUE)
df2 <- data.frame(A = 1, B = NA, Y = FALSE)
merge(df1, df2, by = c("A", "B"))
dplyr::inner_join(df1, df2, by = c("A", "B"))
| A | B  | X    | Y     |
|--:|:---|:-----|:------|
| 1 | NA | TRUE | FALSE |

| A | B  | X    | Y     |
|--:|:---|:-----|:------|
| 1 | NA | TRUE | FALSE |
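To get SQL-like behavior in R instead, the na_matches argument mentioned above can be flipped explicitly. A quick sketch:

dplyr::inner_join(df1, df2, by = c("A", "B"), na_matches = "never")
# returns zero rows, mirroring the SQL join result above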

Similarly, pandas also matches on nulls for joining.

import numpy as np
import pandas as pd
df1 = pd.DataFrame([[1, np.nan, True]], columns = ['A','B','X'])
df2 = pd.DataFrame([[1, np.nan, False]], columns = ['A','B','Y'])
pd.merge(df1, df2)

#>    A   B     X      Y
#> 0  1 NaN  True  False

R and python's behavior here may seem the most surprising. One might expect joining to work the same way that raw logical evaluation does. However, neither language "likes" null comparison in its raw form; instead, the default behavior is intentionally altered in these higher-level joining functions.

NA == NA
#> [1] NA
np.nan == np.nan

#> False
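One plausible hint at why: base R's merge builds on match(), and match-based comparison, unlike ==, does treat two NAs as equal:

NA %in% NA  # %in% is defined via match(), which matches NA to NA by default
#> [1] TRUE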

Filtering

SQL doesn’t recognize

Neither R nor dplyr give data back, but do so in different ways

pandas does recognize

Using the same tbl1 shown above, we can also confirm that SQL proactively drops nulls in where clauses where they cannot be readily compared to non-null values. This seems quite consistent with its behavior in the joining case.

select A, B, X 
from tbl1 
where B != 1


| A | B | X |
|---|---|---|

Both base R and dplyr paradigms follow suit here.

df1 <- data.frame(A = 1, B = NA, X = TRUE)
df1[df1$B != 1,]
|    | A  | B  | X  |
|:---|:---|:---|:---|
| NA | NA | NA | NA |

dplyr::filter(df1, B != 1)

| A | B | X |
|---|---|---|

However, bucking the trend, multiple approaches to subsetting pandas data will not drop nulls in filtering comparisons.

df1 = pd.DataFrame([[1, np.nan, True]], columns = ['A','B','X'])
df1[df1.B != 1]

#>    A   B     X
#> 0  1 NaN  True

df1.query('B != 1')

#>    A   B     X
#> 0  1 NaN  True
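Whichever language is in use, the safest habit is to state the intent for nulls explicitly rather than rely on defaults. A minimal sketch in R, using the df1 data frame from the base R example above:

# keep the null rows on purpose (the pandas-like result)
df1[is.na(df1$B) | df1$B != 1, ]
dplyr::filter(df1, is.na(B) | B != 1)

# or drop them cleanly, without the all-NA row that `[` produces
df1[which(df1$B != 1), ]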

Conclusion

In data computation and analysis, the devil is often in the details. It's not breaking news that low-level details like the careful handling of null values can jeopardize the resulting analyses. However, as analysts take on increasingly complex tasks and use a plethora of different tools, it's more important than ever for both data producers and consumers to consider the choices they are making in encoding and handling these values across the stack.
