.I in data.table

[This article was first published on Data By John, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

In this post I’m using a small extract from the SIMD2020 dataset to figure out what the special operator .I does.

Files and code are on github if you’re interested

# files and code : https://github.com/johnmackintosh/DT_dot_I
DT <- fread("highdata.csv")
lookup <- fread("https://raw.githubusercontent.com/johnmackintosh/ph_lookups/main/dz_intzone_cp_dz_names.csv")

To join these together, I could set a key on Data_Zone and DataZone respectively. However, that would reorder the data, so, for now, I will do the join “by hand”:

DT <- lookup[DT, on = .(DataZone = Data_Zone)]

I’ll maintain the existing order by keying on the overall SIMD rank

setkey(DT, SIMD2020v2_Rank)

I also update some column names

         old = c("SIMD2020v2_Income_Domain_Rank",

         new = c("Income", "Employment", 
                 "Health",   "Education",
                 "Access",  "Crime", 
                 "Housing", "areaname"))

Which row has the highest income ranking?

DT[,.I[Income == max(Income)]]
# [1] 437

The same for the Health ranking, but returning a data.table or vector respectively

# returns data.table
DT[,.I[Health == max(Health)],.SD] 

#      V1
#   <int>
# 1:   424

# returns single element vector
DT[,.I[Health == max(Health)],.SD]$V1
# [1] 424

Return the maximum health ranking by area

DT[,.I[Health == max(Health)], areaname]

                         areaname    V1
                           <char> <int>
 1:                     Inverness   417
 2:                Bute and Cowal   414
 3:                     Caithness   358
 4:                     East Ross   364
 5:        Helensburgh and Lomond   424
 6:          Nairn and Nairnshire   388
 7:      Oban, Lorn and the Isles   363
 8: Mid-Argyll, Kintyre and Islay   295
 9:                      Lochaber   252
10:                      Mid Ross   389
11:                    Sutherland   220
12:  Skye, Lochalsh and West Ross   271
13:       Badenoch and Strathspey   333

Remember - these are row indices, not the actual values

Extract multiple indices

   .(min_health = .I[Health == min(Health)], 
     max_health = .I[Health == max(Health)]),

                        areaname min_health max_health
                           <char>      <int>      <int>
 1:                     Inverness          1        417
 2:                Bute and Cowal          3        414
 3:                     Caithness          4        358
 4:                     East Ross          5        364
 5:        Helensburgh and Lomond          7        424
 6:          Nairn and Nairnshire         12        388
 7:      Oban, Lorn and the Isles         14        363
 8: Mid-Argyll, Kintyre and Islay         21        295
 9:                      Lochaber         36        252
10:                      Mid Ross         37        389
11:                    Sutherland         82        220
12:  Skye, Lochalsh and West Ross         93        271
13:       Badenoch and Strathspey        276        333

What happens if we do something in i, then .I? Specifically, filter for the Inverness area, then find the min and max Health rankings

DT[ areaname == "Inverness",  .(min_health = .I[Health == min(Health)], # still 1
                                max_health = .I[Health == max(Health)])] # now 94

We return a new index for the maximum value which is now row 94, having previously been row 417

If I set the key on DataZone, then the indices change

setkey(DT, DataZone)

DT[,.I[Income == max(Income)]] # now 108, was 437

DT[,.I[Health == max(Health)],.SD]  # now 95, was 424

We can use .I to answer the question here relating to selecting rows based on conditions

We can do it all in data.table 🙂

# recreate the example data

id <- c(rep(102,9),rep(103,5),rep(104,4))
status <- rep(c('single','relationship','relationship','single','single','single'),3)
status <- factor(status, levels = c("single" ,"relationship"), ordered = TRUE)
# setting as factor for display reasons
age <- c(17:19,22,23,26,28,32,33,21:25,21:24)
DT <- data.table(id, status, age)

# this is the desired output
newdata <- DT[c(1,2,7,8,13,14,18),]

Create a new data.table by finding the indices based on the conditions (single and highest age, relationship and lowest age) and use rleid to only pick out where the status changes, for each id.

I also create a rownumber variable.

DT2 <- DT[DT[,.I[status == "single" & age == max(age) |  
                   status == "relationship" & age == min(age)], 
          ][,rn := rleid(status),id

You will see what appears to be a random $V1 at the end of this line:

DT[DT[,.I[status == "single" & age == max(age) |  
                   status == "relationship" & age == min(age)], 

If I take that out, I get the following error:

Error: When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.

This confirms that what we’re doing here is a self join of DT to itself based on the row indices returned by the conditions we specify.

When the $V1 is returned, we get the following results:

     id       status   age
   <num>        <ord> <num>
1:   102       single    17
2:   102 relationship    18
3:   102       single    28
4:   102 relationship    32
5:   103       single    24
6:   103 relationship    25
7:   104 relationship    21
8:   104       single    24

This gives us far fewer rows than our original dataset, but still more than our desired results.

The row indexed by DT2[,.I[(rn == 1 & status == "relationship")]], returns the row index 7. This is the one we need to remove, because we don’t want to include results where the first status is “relationship”.

This is an anti join, joining DT2 to itself, minus the row indexed above.
I also update the rownumber for each id, so that each status increments appropriately.

DT2 <- DT2[!DT2[,.I[(rn == 1 & status == "relationship")]]
           ][,rn := seq(.N), .(id, status)][]

This returns the following:

     id       status   age    rn
   <num>        <ord> <num> <int>
1:   102       single    17     1
2:   102 relationship    18     1
3:   102       single    28     2
4:   102 relationship    32     2
5:   103       single    24     1
6:   103 relationship    25     1
7:   104       single    24     1

Which matches the desired data perfectly

all.equal(DT2[,1:3], newdata)
# [1] TRUE

Finally, use dcast to pivot_wider, as per OP’s request

      id ~   rn + status, 
      value.var = "age")

Key: <id>
      id 1_single 1_relationship 2_single 2_relationship
   <num>    <num>          <num>    <num>          <num>
1:   102       17             18       28             32
2:   103       24             25       NA             NA
3:   104       24             NA       NA             NA

Key takeaway: .I returns indices, and can be used in circumstances where you might otherwise use which to identify rows meeting certain conditions. You can use it to join a data.table to itself, or anti_join to itself, to return your desired results.

To leave a comment for the author, please follow the link and comment on their blog: Data By John.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)