10 Tips And Tricks For Data Scientists Vol.7

[This article was first published on R – Predictive Hacks, and kindly contributed to R-bloggers].

We have started a series of articles on tips and tricks for data scientists (mainly in Python and R).

Python

1.Differences Between Numpy Arrays and Python Lists

There are some differences between Numpy Arrays and Python Lists. We will provide some examples of algebraic operators.

‘+’ Operator

import numpy as np
 
alist = [1, 2, 3, 4, 5]   # Define a python list. It looks like an np array
narray = np.array([1, 2, 3, 4]) # Define a numpy array
 
print(narray + narray)
print(alist + alist)

Output:

[2 4 6 8]
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]

Note that the ‘+’ operator on NumPy arrays performs an element-wise addition, while the same operator on Python lists concatenates the lists. Keeping this difference in mind can save many headaches.

‘*’ Operator

The same holds for the product operator, *. In the first case we scale the vector, while in the second case we concatenate the same list three times.

print(narray * 3)
print(alist * 3)

Output:

[ 3  6  9 12]
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
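
If you need the opposite behaviour in either case, the following is a minimal sketch (the variable names elementwise_list, elementwise_array and joined are illustrative and not part of the original post). A comprehension over zip adds two lists element by element, and np.concatenate joins two arrays end to end.

```python
import numpy as np

alist = [1, 2, 3, 4, 5]
narray = np.array([1, 2, 3, 4])

# Element-wise addition on plain lists needs an explicit comprehension
elementwise_list = [x + y for x, y in zip(alist, alist)]
print(elementwise_list)

# Converting the list to an array first gives the NumPy behaviour
elementwise_array = np.array(alist) + np.array(alist)
print(elementwise_array)

# Joining arrays end to end requires np.concatenate rather than '+'
joined = np.concatenate([narray, narray])
print(joined)
```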

2.The dot product between numpy arrays

The dot product (also called the scalar product or inner product) between two vectors \(\vec a\) and \(\vec b\) of the same size is defined as:
\(\vec a \cdot \vec b = \sum_{i=1}^{n} a_i b_i\)

The dot product takes two vectors and returns a single number.

nparray1 = np.array([0, 1, 2, 3]) # Define an array
nparray2 = np.array([4, 5, 6, 7]) # Define an array

flavor1 = np.dot(nparray1, nparray2) # Recommended way
print(flavor1)

flavor2 = np.sum(nparray1 * nparray2) # Ok way
print(flavor2)

flavor3 = nparray1 @ nparray2         # Geeks way
print(flavor3)

# As you never should do:             # Noobs way
flavor4 = 0
for a, b in zip(nparray1, nparray2):
    flavor4 += a * b
    
print(flavor4)

Output:

38
38
38
38
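
The definition requires both vectors to have the same size. As a small sketch of what happens otherwise (the arrays short and long_ are hypothetical and not from the original post), np.dot raises a ValueError for one-dimensional arrays of different lengths:

```python
import numpy as np

long_ = np.array([0, 1, 2, 3])   # four elements
short = np.array([4, 5, 6])      # three elements

mismatch_error = None
try:
    np.dot(long_, short)
except ValueError as err:
    # NumPy refuses to multiply vectors of different lengths
    mismatch_error = err
    print('Cannot compute the dot product:', err)
```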

3.Get the mean and sum by rows or columns of numpy arrays

Another common operation on matrices is the sum by rows or columns. The axis parameter controls the direction of the operation:

  • axis=0 means to sum the elements of each column together.
  • axis=1 means to sum the elements of each row together.
nparray2 = np.array([[1, -1], [2, -2], [3, -3]]) # Define a 3 x 2 matrix. 
 
sumByCols = np.sum(nparray2, axis=0) # Get the sum for each column. Returns 2 elements
sumByRows = np.sum(nparray2, axis=1) # get the sum for each row. Returns 3 elements
 
print('Sum by columns: ')
print(sumByCols)
print('Sum by rows:')
print(sumByRows)

Output:

Sum by columns: 
[ 6 -6]
Sum by rows:
[0 0 0]
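
The mean works the same way. The following is a minimal sketch using np.mean with the same axis convention; the names matrix, mean_by_cols and mean_by_rows are illustrative.

```python
import numpy as np

# Same 3 x 2 matrix as in the sum example
matrix = np.array([[1, -1], [2, -2], [3, -3]])

mean_by_cols = np.mean(matrix, axis=0)  # one value per column (2 elements)
mean_by_rows = np.mean(matrix, axis=1)  # one value per row (3 elements)

print('Mean by columns:')
print(mean_by_cols)
print('Mean by rows:')
print(mean_by_rows)
```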

4.How To Get The Index In For Loops In Python

Assume that you have a list and apart from the list element, you want to get the position of the element, i.e. the index. For this case, the enumerate function will do the trick. For example:

mylist = [10,30,100]
 
for index, element in enumerate(mylist):
    print(index, element)

Output:

0 10
1 30
2 100
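
If you prefer the positions to start from 1 instead of 0, enumerate accepts an optional start argument. A short sketch (the name pairs is illustrative):

```python
mylist = [10, 30, 100]

# start=1 shifts the counter; the elements themselves are unchanged
pairs = list(enumerate(mylist, start=1))

for position, element in pairs:
    print(position, element)
```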

5.Number Of Decimal Digits In f-Strings

If we want to display a specific number of decimal digits in f-strings, we can use the format specifier f'{v:.2f}', where v is our variable. First, without any format specifier:

import numpy as np
my_random = np.random.rand(10)
 
for i, n in enumerate(my_random):
    print(f'The {i} number is {n}') 

Now, if we pass the :.2f we will get 2 decimal places. For example:

for i, n in enumerate(my_random):
    print(f'The {i} number is {n:.2f}') 
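
Because the random numbers differ on every run, a small sketch with fixed values may make the effect easier to see. The list values and the names raw and rounded are illustrative and not from the original post.

```python
values = [0.5, 3.14159, 2.0]

# Without a format specifier, each number is printed with its full representation
raw = [f'The {i} number is {v}' for i, v in enumerate(values)]

# With :.2f, each number is rounded and padded to exactly two decimal places
rounded = [f'The {i} number is {v:.2f}' for i, v in enumerate(values)]

for line in raw + rounded:
    print(line)
```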

6.How To Generate Sequences In Python

We can generate sequences in Python using the range function, but it supports integers only. Of course, we can generate a series of integers and then divide the elements by a number, like 10, but let’s discuss some more straightforward ways using numpy.

arange

We can generate a series using the numpy arange as follows:

import numpy as np
 
# start point = 0 , end point (not incl) is 1.05 and step is 0.05
np.arange(0, 1.05, 0.05)

Output:

array([0.  , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,
       0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ])

linspace

With numpy linspace we can specify the starting point, the ending point and the number of points to be returned, and it computes the step size for us.

np.linspace(0,1, 21)

Output:

array([0.  , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,
       0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ])
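
For comparison, here is a minimal sketch of the integer-based approach mentioned at the start of this tip, next to linspace with and without the end point. The names from_range, with_endpoint and without_endpoint are illustrative. The NumPy documentation notes that with a non-integer step it is often better to use linspace than arange, since the number of points is then stated explicitly.

```python
import numpy as np

# Plain Python: generate integers with range and divide each one
from_range = [i / 20 for i in range(21)]

# linspace includes the end point by default
with_endpoint = np.linspace(0, 1, 21)

# endpoint=False excludes it, similar to how arange treats its stop value
without_endpoint = np.linspace(0, 1, 20, endpoint=False)

print(from_range[:3], from_range[-1])
print(with_endpoint[-1], without_endpoint[-1])
```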

7.How To Remove New Line Feeds `\n` From Strings In Python

New lines can take the form \n or \r or both. We can remove them from the beginning and end of a string using the strip() method.

'this is a string\n'.strip()

Output:

'this is a string'

Note that there are also lstrip() and rstrip() for stripping whitespace from the left and right side of the text only. We can also be more specific about which characters to strip by typing:

'this is a string\n'.strip("\n")
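
Keep in mind that strip() only touches the two ends of the string. The following sketch, with a hypothetical text value, shows the difference between stripping and replacing line breaks that sit in the middle of the text:

```python
text = '\r\nfirst line\nsecond line\r\n'

# strip() removes whitespace from both ends but keeps the inner '\n'
stripped = text.strip()

# rstrip() with explicit characters only cleans the right-hand side
right_only = text.rstrip('\r\n')

# replace() is needed to get rid of line breaks inside the string
no_breaks = text.replace('\r', '').replace('\n', ' ').strip()

print(repr(stripped))
print(repr(right_only))
print(repr(no_breaks))
```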

R

8.How To Rename And Relevel Factors

Factors are a “special” data structure in R. We are going to provide some examples of how we can rename and relevel factors. For the next examples, we will work with the following data:

df<-data.frame(ID=c(1:10), Gender=factor(c("M","M","M","","F","F","M","","F","F" )), 
           AgeGroup=factor(c("[60+]", "[26-35]", "[NA]", "[36-45]", "[46-60]", "[26-35]", "[NA]", "[18-25]", "[26-35]", "[26-35]")))

Output:

> df
   ID Gender AgeGroup
1   1      M    [60+]
2   2      M  [26-35]
3   3      M     [NA]
4   4         [36-45]
5   5      F  [46-60]
6   6      F  [26-35]
7   7      M     [NA]
8   8         [18-25]
9   9      F  [26-35]
10 10      F  [26-35]

Rename Factors

Let’s say that we want to convert the empty string of Gender to “U”, for Unknown.

levels(df$Gender)[levels(df$Gender)==""] ="U"

Let’s say that we want to merge the age groups. For instance, the new categories will be “[18-35]”, “[35+]” and “[NA]”.

levels(df$AgeGroup)[levels(df$AgeGroup)=="[18-25]"] = "[18-35]"
levels(df$AgeGroup)[levels(df$AgeGroup)=="[26-35]"] = "[18-35]"
 
levels(df$AgeGroup)[levels(df$AgeGroup)=="[36-45]"] = "[35+]"
levels(df$AgeGroup)[levels(df$AgeGroup)=="[46-60]"] = "[35+]"
levels(df$AgeGroup)[levels(df$AgeGroup)=="[60+]"] = "[35+]"

Notice that we could have done this in one step (applied to the original six levels instead of the step-by-step renaming above), but it is risky because the levels may be in a different order than we expect.

levels(df$AgeGroup)<-c("[18-35]","[18-35]","[35+]","[35+]","[35+]", "[NA]")
 

By applying the changes we mentioned before, we get the following data.

> df
   ID Gender AgeGroup
1   1      M    [35+]
2   2      M  [18-35]
3   3      M     [NA]
4   4      U    [35+]
5   5      F    [35+]
6   6      F  [18-35]
7   7      M     [NA]
8   8      U  [18-35]
9   9      F  [18-35]
10 10      F  [18-35]

Relevel Factors

Let’s say that we want the “[NA]” age group to appear first:

df$AgeGroup<-factor(df$AgeGroup, c("[NA]", "[18-35]" ,"[35+]"))

Another way to change the order is to use relevel() to make a particular level first in the list (this will not work for ordered factors). Let’s say that we want the ‘F’ Gender first.

df$Gender<-relevel(df$Gender, "F")

By applying these changes, we can see how the factors have changed level.

> str(df)
'data.frame':	10 obs. of  3 variables:
 $ ID      : int  1 2 3 4 5 6 7 8 9 10
 $ Gender  : Factor w/ 3 levels "F","U","M": 3 3 3 2 1 1 3 2 1 1
 $ AgeGroup: Factor w/ 3 levels "[NA]","[18-35]",..: 3 2 1 3 3 2 1 2 2 2

9.How To Impute Missing Values In R

In real-world data, it is quite common to deal with missing values (known as NAs). When we need to impute the missing values, the most common approaches are:

  • Numerical Data: Impute Missing Values with mean or median
  • Categorical Data: Impute Missing Values with mode

Let’s give an example of how we can impute dynamically depending on the data type.

library(tidyverse)
 
df<-tibble(id=seq(1,10), ColumnA=c(10,9,8,7,NA,NA,20,15,12,NA), 
           ColumnB=factor(c("A","B","A","A","","B","A","B","","A")),
           ColumnC=factor(c("","BB","CC","BB","BB","CC","AA","BB","","AA")),
           ColumnD=c(NA,20,18,22,18,17,19,NA,17,23)
           )
 
df

We get:

# A tibble: 10 x 5
      id ColumnA ColumnB ColumnC ColumnD
   <int>   <dbl> <fct>   <fct>     <dbl>
 1     1      10 "A"     ""           NA
 2     2       9 "B"     "BB"         20
 3     3       8 "A"     "CC"         18
 4     4       7 "A"     "BB"         22
 5     5      NA ""      "BB"         18
 6     6      NA "B"     "CC"         17
 7     7      20 "A"     "AA"         19
 8     8      15 "B"     "BB"         NA
 9     9      12 ""      ""           17
10    10      NA "A"     "AA"         23

For the categorical variables, we are going to apply a “mode” function, which we have to build ourselves since base R does not provide one for the statistical mode.

getmode <- function(v){
  v=v[nchar(as.character(v))>0]
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

Now that we have the “mode” function, we are ready to impute the missing values of a data frame depending on the data type of each column. If the column is numeric we impute it with the mean, otherwise with the mode. Notice that our script works with the column names as strings, and the dplyr package requires a special notation (!!cols := on the left-hand side and !!rlang::sym(cols) on the right-hand side) for selecting columns dynamically.

for (cols in colnames(df)) {
  if (cols %in% names(df[,sapply(df, is.numeric)])) {
    df<-df%>%mutate(!!cols := replace(!!rlang::sym(cols), is.na(!!rlang::sym(cols)), mean(!!rlang::sym(cols), na.rm=TRUE)))
     
  }
  else {
     
    df<-df%>%mutate(!!cols := replace(!!rlang::sym(cols), !!rlang::sym(cols)=="", getmode(!!rlang::sym(cols))))
     
  }
}
 
df

Voilà! The missing values have been imputed!

> df
# A tibble: 10 x 5
      id ColumnA ColumnB ColumnC ColumnD
               
 1     1    10   A       BB         19.2
 2     2     9   B       BB         20  
 3     3     8   A       CC         18  
 4     4     7   A       BB         22  
 5     5    11.6 A       BB         18  
 6     6    11.6 B       CC         17  
 7     7    20   A       AA         19  
 8     8    15   B       BB         19.2
 9     9    12   A       BB         17  
10    10    11.6 A       AA         23 

10.How To Assign Values Based On Multiple Conditions Of Different Columns

In the previous post, we showed how we can assign values in Pandas Data Frames based on multiple conditions of different columns.

Again we will work with the famous titanic dataset and our scenario is the following:

  • If the Age is NA and Pclass=1 then the Age=40
  • If the Age is NA and Pclass=2 then the Age=30
  • If the Age is NA and Pclass=3 then the Age=25
  • Else the Age will remain as is

Load the Data

library(dplyr)

url = 'https://gist.githubusercontent.com/michhar/2dfd2de0d4f8727f873422c5d959fff5/raw/ff414a1bcfcba32481e4d4e8db578e55872a2ca1/titanic.csv'

df = read.csv(url, sep="\t")

Use of case_when function of dplyr

For this task, we will use the case_when function of dplyr as follows:

df<-df%>%mutate(New_Column = case_when(
  is.na(Age) & Pclass==1 ~ 40,
  is.na(Age) & Pclass==2 ~ 30,
  is.na(Age) & Pclass==3 ~ 25,
  TRUE~Age
  ))

Let’s have a look at the Age, Pclass and the New_Column that we created.

df%>%select(Age, Pclass, New_Column)

As we can see we get the expected results.

      Age Pclass New_Column
1   22.00      3      22.00
2   38.00      1      38.00
3   26.00      3      26.00
4   35.00      1      35.00
5   35.00      3      35.00
6      NA      3      25.00
