10 Tips And Tricks For Data Scientists Vol.7
We have started a series of articles on tips and tricks for data scientists (mainly in Python and R).
Python
1. Differences Between NumPy Arrays and Python Lists
There are some differences between NumPy arrays and Python lists. We will provide some examples of algebraic operators.
‘+’ Operator
import numpy as np
alist = [1, 2, 3, 4, 5]  # Define a python list. It looks like an np array
narray = np.array([1, 2, 3, 4])  # Define a numpy array
print(narray + narray)
print(alist + alist)
Output:
[2 4 6 8]
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
Note that the ‘+’ operator on NumPy arrays performs an element-wise addition, while the same operation on Python lists results in a list concatenation. Be careful while coding. Knowing this can save many headaches.
‘*’ Operator
The same applies to the product operator, *. In the first case, we scale the vector, while in the second case, we concatenate the same list three times.
print(narray * 3)
print(alist * 3)
Output:
[ 3  6  9 12]
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
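If the data has to stay in plain Python lists, element-wise results can still be obtained with a list comprehension. The following is a minimal sketch; the names elementwise_sum and scaled are illustrative and not part of the original examples.

```python
alist = [1, 2, 3, 4, 5]

# Element-wise addition of two lists: pair the items up with zip
elementwise_sum = [a + b for a, b in zip(alist, alist)]

# Scale every item by 3 instead of repeating the list three times
scaled = [3 * a for a in alist]

print(elementwise_sum)  # [2, 4, 6, 8, 10]
print(scaled)           # [3, 6, 9, 12, 15]
```

For anything beyond small lists, converting to a NumPy array first is usually the simpler choice.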
2. The dot product between numpy arrays
The dot product or scalar product or inner product between two vectors \(\vec a\) and \(\vec b\) of the same size is defined as:
\(\vec a \cdot \vec b = \sum_{i=1}^{n} a_i b_i\)
The dot product takes two vectors and returns a single number.
nparray1 = np.array([0, 1, 2, 3])  # Define an array
nparray2 = np.array([4, 5, 6, 7])  # Define an array
flavor1 = np.dot(nparray1, nparray2)  # Recommended way
print(flavor1)
flavor2 = np.sum(nparray1 * nparray2)  # Ok way
print(flavor2)
flavor3 = nparray1 @ nparray2  # Geeks way
print(flavor3)
# As you never should do:
# Noobs way
flavor4 = 0
for a, b in zip(nparray1, nparray2):
    flavor4 += a * b
print(flavor4)
Output:
38
38
38
38
3. Get the mean and sum by rows or columns of numpy arrays
Another common operation on matrices is the sum by rows or columns. The axis parameter controls the direction of the operation:
- axis=0 means to sum the elements of each column together.
- axis=1 means to sum the elements of each row together.
nparray2 = np.array([[1, -1], [2, -2], [3, -3]])  # Define a 3 x 2 matrix.
sumByCols = np.sum(nparray2, axis=0)  # Get the sum for each column. Returns 2 elements
sumByRows = np.sum(nparray2, axis=1)  # Get the sum for each row. Returns 3 elements
print('Sum by columns: ')
print(sumByCols)
print('Sum by rows:')
print(sumByRows)
Output:
Sum by columns:
[ 6 -6]
Sum by rows:
[0 0 0]
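The mean works the same way as the sum. A minimal sketch on the same 3 x 2 matrix, using np.mean with the axis parameter (the variable names are illustrative):

```python
import numpy as np

nparray2 = np.array([[1, -1], [2, -2], [3, -3]])  # same 3 x 2 matrix as above

mean_by_cols = np.mean(nparray2, axis=0)  # one mean per column -> 2 elements
mean_by_rows = np.mean(nparray2, axis=1)  # one mean per row -> 3 elements

print('Mean by columns:')
print(mean_by_cols)
print('Mean by rows:')
print(mean_by_rows)
```

Here the column means are 2 and -2, and every row mean is 0, since each row contains a value and its negative.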
4. How To Get The Index In For Loops In Python
Assume that you have a list and, apart from each element, you also want its position, i.e. the index. In this case, the enumerate function will do the trick. For example:
mylist = [10, 30, 100]
for index, element in enumerate(mylist):
    print(index, element)
Output:
0 10
1 30
2 100
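The counter does not have to begin at 0. As a small sketch, the start argument of enumerate shifts the first index, which is handy for human-readable numbering (the names numbered and position are illustrative):

```python
mylist = [10, 30, 100]

# start=1 makes the counter begin at 1 instead of the default 0
numbered = list(enumerate(mylist, start=1))

for position, element in numbered:
    print(position, element)
```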
5. Number Of Decimal Digits In f-Strings
If we want to show a specific number of decimal digits in f-strings, we can use f'{v:.2f}', where v is our variable. For example:
import numpy as np
my_random = np.random.rand(10)
for i, n in enumerate(my_random):
    print(f'The {i} number is {n}')
Now, if we pass the :.2f format specifier, we will get 2 decimal places. For example:
for i, n in enumerate(my_random):
    print(f'The {i} number is {n:.2f}')
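The same format specification supports a few other common variations. The sketch below uses a fixed, made-up value rather than random numbers so that the results are reproducible:

```python
value = 1234.5678  # made-up number for illustration

two_decimals = f'{value:.2f}'     # fixed number of decimal digits
with_separator = f'{value:,.2f}'  # thousands separator plus two decimals
as_percent = f'{0.256:.1%}'       # multiply by 100 and append a percent sign

print(two_decimals)    # 1234.57
print(with_separator)  # 1,234.57
print(as_percent)      # 25.6%
```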
6. How To Generate Sequences In Python
We can generate sequences in Python using the range function, but it supports integers only. Of course, we can generate a series of integers and then divide the elements by a number, like 10, but let’s discuss some more straightforward ways using numpy.
arange
We can generate a series using the numpy arange as follows:
import numpy as np
# start point is 0, end point (not included) is 1.05 and step is 0.05
np.arange(0, 1.05, 0.05)
Output:
array([0. , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. ])
linspace
With numpy linspace we can specify the starting point, the ending point and the number of points to be returned, and it computes the step size for us.
np.linspace(0, 1, 21)
Output:
array([0. , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1. ])
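As a quick check, the three approaches mentioned in this tip can be compared side by side. This is a minimal sketch; the comparison uses np.allclose, which tests equality within a tolerance, because floating-point steps can differ by tiny rounding errors:

```python
import numpy as np

# Integers from range divided by a constant (the workaround mentioned above)
from_range = [i / 20 for i in range(21)]

from_arange = np.arange(0, 1.05, 0.05)
from_linspace = np.linspace(0, 1, 21)

# All three should contain 21 points between 0 and 1
print(len(from_range), len(from_arange), len(from_linspace))

# Compare with a tolerance rather than with ==
print(np.allclose(from_arange, from_linspace))
print(np.allclose(from_range, from_linspace))
```

When the number of points matters more than the exact step, linspace tends to be the safer choice, since the length of an arange result with a non-integer step depends on floating-point rounding.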
7. How To Remove New Line Feeds `\n` From Strings In Python
The new line feeds can be of the form \n or \r or both. We can remove them from the ends of a string using the strip() method.
'this is a string\n'.strip()
Output:
'this is a string'
Note that there are also the lstrip() and rstrip() methods for stripping whitespace from the left or the right side only. We can also be more specific by typing:
'this is a string\n'.strip("\n")
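Keep in mind that strip() only touches the beginning and the end of a string. For line breaks in the middle of the text, one option is to split on line boundaries and join the pieces again. A minimal sketch with a made-up string:

```python
text = 'first line\r\nsecond line\n'  # made-up example text

stripped = text.strip()                  # only the trailing '\n' is removed
no_breaks = ' '.join(text.splitlines())  # every line break becomes a space

print(repr(stripped))   # 'first line\r\nsecond line'
print(repr(no_breaks))  # 'first line second line'
```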
R
8. How To Rename And Relevel Factors
Factors are a “special” data structure in R. We are going to provide some examples of how we can rename and relevel the factors. For the next examples, we will work with the following data:
df<-data.frame(ID=c(1:10),
               Gender=factor(c("M","M","M","","F","F","M","","F","F")),
               AgeGroup=factor(c("[60+]", "[26-35]", "[NA]", "[36-45]", "[46-60]", "[26-35]", "[NA]", "[18-25]", "[26-35]", "[26-35]")))
Output:
> df
   ID Gender AgeGroup
1   1      M    [60+]
2   2      M  [26-35]
3   3      M     [NA]
4   4         [36-45]
5   5      F  [46-60]
6   6      F  [26-35]
7   7      M     [NA]
8   8         [18-25]
9   9      F  [26-35]
10 10      F  [26-35]
Rename Factors
Let’s say that we want to convert the empty string of Gender to “U”, for Unknown.
levels(df$Gender)[levels(df$Gender)==""] = "U"
Let’s say that we want to merge the age groups. For instance, the new categories will be “[18-35]”, “[35+]”, “[NA]”.
levels(df$AgeGroup)[levels(df$AgeGroup)=="[18-25]"] = "[18-35]"
levels(df$AgeGroup)[levels(df$AgeGroup)=="[26-35]"] = "[18-35]"
levels(df$AgeGroup)[levels(df$AgeGroup)=="[36-45]"] = "[35+]"
levels(df$AgeGroup)[levels(df$AgeGroup)=="[46-60]"] = "[35+]"
levels(df$AgeGroup)[levels(df$AgeGroup)=="[60+]"] = "[35+]"
Notice that we could have done it all at once, but this is risky because the levels may be in a different order than the one we expect.
levels(df$AgeGroup)<-c("[18-35]","[18-35]","[35+]","[35+]","[35+]", "[NA]")
By applying the changes we mentioned before, we get the following data.
> df
   ID Gender AgeGroup
1   1      M    [35+]
2   2      M  [18-35]
3   3      M     [NA]
4   4      U    [35+]
5   5      F    [35+]
6   6      F  [18-35]
7   7      M     [NA]
8   8      U  [18-35]
9   9      F  [18-35]
10 10      F  [18-35]
Relevel Factors
Let’s say that we want the “[NA]” age group to appear first
df$AgeGroup<-factor(df$AgeGroup, c("[NA]", "[18-35]" ,"[35+]"))
Another way to change the order is to use relevel() to make a particular level first in the list. (This will not work for ordered factors.) Let’s say that we want the ‘F’ Gender first.
df$Gender<-relevel(df$Gender, "F")
By applying these changes, we can see how the factors have changed level.
> str(df)
'data.frame':	10 obs. of  3 variables:
 $ ID      : int  1 2 3 4 5 6 7 8 9 10
 $ Gender  : Factor w/ 3 levels "F","U","M": 3 3 3 2 1 1 3 2 1 1
 $ AgeGroup: Factor w/ 3 levels "[NA]","[18-35]",..: 3 2 1 3 3 2 1 2 2 2
9. How To Impute Missing Values In R
In real-world data, it is quite common to deal with Missing Values (known as NAs). Sometimes we need to impute the missing values, and the most common approaches are:
- Numerical Data: Impute Missing Values with mean or median
- Categorical Data: Impute Missing Values with mode
Let’s give an example of how we can impute dynamically depending on the data type.
library(tidyverse)
df<-tibble(id=seq(1,10),
           ColumnA=c(10,9,8,7,NA,NA,20,15,12,NA),
           ColumnB=factor(c("A","B","A","A","","B","A","B","","A")),
           ColumnC=factor(c("","BB","CC","BB","BB","CC","AA","BB","","AA")),
           ColumnD=c(NA,20,18,22,18,17,19,NA,17,23)
)
df
We get:
# A tibble: 10 x 5
      id ColumnA ColumnB ColumnC ColumnD
   <int>   <dbl> <fct>   <fct>     <dbl>
 1     1      10 "A"     ""           NA
 2     2       9 "B"     "BB"         20
 3     3       8 "A"     "CC"         18
 4     4       7 "A"     "BB"         22
 5     5      NA ""      "BB"         18
 6     6      NA "B"     "CC"         17
 7     7      20 "A"     "AA"         19
 8     8      15 "B"     "BB"         NA
 9     9      12 ""      ""           17
10    10      NA "A"     "AA"         23
For the Categorical Variables, we are going to apply a “mode” function, which we have to build ourselves since base R does not provide one for the statistical mode.
getmode <- function(v){
  # Ignore empty strings, which mark the missing values here
  v=v[nchar(as.character(v))>0]
  uniqv <- unique(v)
  # Return the most frequent value
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
Now that we have the “mode” function, we are ready to impute the missing values of a data frame depending on the data type of the columns. Thus, if the column data type is “numeric”, we will impute it with the “mean”, otherwise with the “mode”. Notice that in our script we loop over the column names, and the “dplyr” package requires a special notation (!!cols := on the left and !!rlang::sym(cols) on the right) for selecting the column names dynamically.
for (cols in colnames(df)) {
  if (cols %in% names(df[,sapply(df, is.numeric)])) {
    # Numeric column: replace NA with the column mean
    df<-df%>%mutate(!!cols := replace(!!rlang::sym(cols), is.na(!!rlang::sym(cols)), mean(!!rlang::sym(cols), na.rm=TRUE)))
  } else {
    # Categorical column: replace empty strings with the mode
    df<-df%>%mutate(!!cols := replace(!!rlang::sym(cols), !!rlang::sym(cols)=="", getmode(!!rlang::sym(cols))))
  }
}
df
Voilà! The missing values have been imputed!
> df
# A tibble: 10 x 5
      id ColumnA ColumnB ColumnC ColumnD
   <dbl>   <dbl> <fct>   <fct>     <dbl>
 1     1    10   A       BB         19.2
 2     2     9   B       BB         20
 3     3     8   A       CC         18
 4     4     7   A       BB         22
 5     5    11.6 A       BB         18
 6     6    11.6 B       CC         17
 7     7    20   A       AA         19
 8     8    15   B       BB         19.2
 9     9    12   A       BB         17
10    10    11.6 A       AA         23
10. How To Assign Values Based On Multiple Conditions Of Different Columns
In the previous post, we showed how we can assign values in Pandas Data Frames based on multiple conditions of different columns.
Again we will work with the famous titanic dataset and our scenario is the following:
- If the Age is NA and Pclass=1 then the Age=40
- If the Age is NA and Pclass=2 then the Age=30
- If the Age is NA and Pclass=3 then the Age=25
- Else the Age will remain as is
Load the Data
library(dplyr)
url = 'https://gist.githubusercontent.com/michhar/2dfd2de0d4f8727f873422c5d959fff5/raw/ff414a1bcfcba32481e4d4e8db578e55872a2ca1/titanic.csv'
df = read.csv(url, sep="\t")
Use of case_when function of dplyr
For this task, we will use the case_when function of dplyr as follows:
df<-df%>%mutate(New_Column = case_when(
  is.na(Age) & Pclass==1 ~ 40,
  is.na(Age) & Pclass==2 ~ 30,
  is.na(Age) & Pclass==3 ~ 25,
  TRUE~Age
))
Let’s have a look at the Age, Pclass and the New_Column that we created.
df%>%select(Age, Pclass, New_Column)
As we can see we get the expected results.
    Age Pclass New_Column
1 22.00      3      22.00
2 38.00      1      38.00
3 26.00      3      26.00
4 35.00      1      35.00
5 35.00      3      35.00
6    NA      3      25.00
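For readers following the Python half of this series, the same rule can be sketched with NumPy's np.select, which picks a value from the first matching condition and falls back to a default otherwise. The arrays below are made-up values for illustration only, not rows from the titanic file:

```python
import numpy as np

# Made-up values for illustration only; not taken from the titanic data
age = np.array([22.0, np.nan, np.nan, np.nan, 35.0])
pclass = np.array([3, 1, 2, 3, 1])

missing = np.isnan(age)

new_column = np.select(
    [missing & (pclass == 1), missing & (pclass == 2), missing & (pclass == 3)],
    [40.0, 30.0, 25.0],
    default=age,  # keep the original age when no condition matches
)

print(new_column)
```

With these inputs, the three missing ages become 40, 30 and 25 according to their class, while the known ages 22 and 35 are left unchanged.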