# 10 Tips And Tricks For Data Scientists Vol.7

Originally published on **R – Predictive Hacks** and kindly contributed to R-bloggers.

We have started a series of articles on tips and tricks for data scientists (mainly in **Python** and **R**).

## Python

**1. Differences Between NumPy Arrays and Python Lists**

NumPy arrays and Python lists look alike but behave differently under arithmetic. We will provide some examples with algebraic operators.

**‘+’ Operator**

```python
import numpy as np

alist = [1, 2, 3, 4, 5]          # Define a Python list. It looks like an np array
narray = np.array([1, 2, 3, 4])  # Define a NumPy array

print(narray + narray)
print(alist + alist)
```

**Output:**

```
[2 4 6 8]
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
```

Note that the **‘+’** operator on NumPy arrays performs an element-wise addition, while the same operation on Python lists results in a list concatenation. Be careful while coding. Knowing this can save many headaches.

**‘*’ Operator**

The same holds for the product operator, `*`. In the first case (the NumPy array) we scale the vector, while in the second case (the Python list) we concatenate the same list three times.

```python
print(narray * 3)
print(alist * 3)
```

**Output:**

```
[ 3  6  9 12]
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
```
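If element-wise arithmetic is needed on plain Python lists and converting to NumPy is not an option, a list comprehension is one way to get it. The sketch below is only an illustration; the names `added` and `scaled` are ours and do not appear in the examples above.

```python
alist = [1, 2, 3, 4, 5]  # the same Python list as above

# Element-wise addition: pair the elements up with zip and add each pair
added = [x + y for x, y in zip(alist, alist)]

# Element-wise scaling: multiply every element by 3
scaled = [3 * x for x in alist]

print(added)   # [2, 4, 6, 8, 10]
print(scaled)  # [3, 6, 9, 12, 15]
```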

**2. The Dot Product Between NumPy Arrays**

The dot product or scalar product or inner product between two vectors \(\vec a\) and \(\vec b\) of the same size is defined as:

\(\vec a \cdot \vec b = \sum_{i=1}^{n} a_i b_i\)

The dot product takes two vectors and returns a single number.

```python
import numpy as np

nparray1 = np.array([0, 1, 2, 3])  # Define an array
nparray2 = np.array([4, 5, 6, 7])  # Define an array

flavor1 = np.dot(nparray1, nparray2)  # Recommended way
print(flavor1)

flavor2 = np.sum(nparray1 * nparray2)  # Ok way
print(flavor2)

flavor3 = nparray1 @ nparray2  # Geeks way
print(flavor3)

# As you never should do: the noobs way, an explicit Python loop
flavor4 = 0
for a, b in zip(nparray1, nparray2):
    flavor4 += a * b
print(flavor4)
```

**Output:**

```
38
38
38
38
```
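As a check by hand, plugging the two arrays above, \(\vec a = (0, 1, 2, 3)\) and \(\vec b = (4, 5, 6, 7)\), into the definition gives the same value:

\(\vec a \cdot \vec b = 0 \cdot 4 + 1 \cdot 5 + 2 \cdot 6 + 3 \cdot 7 = 0 + 5 + 12 + 21 = 38\)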

**3. Get The Mean And Sum By Rows Or Columns Of NumPy Arrays**

Another general operation performed on matrices is the sum by rows or columns. The **axis** parameter controls the direction of the operation:

- **axis=0** means to sum the elements of each column together.
- **axis=1** means to sum the elements of each row together.

```python
import numpy as np

nparray2 = np.array([[1, -1], [2, -2], [3, -3]])  # Define a 3 x 2 matrix.

sumByCols = np.sum(nparray2, axis=0)  # Get the sum for each column. Returns 2 elements
sumByRows = np.sum(nparray2, axis=1)  # Get the sum for each row. Returns 3 elements

print('Sum by columns: ')
print(sumByCols)
print('Sum by rows:')
print(sumByRows)
```

**Output:**

```
Sum by columns: 
[ 6 -6]
Sum by rows:
[0 0 0]
```
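The same `axis` convention applies to the mean. A minimal sketch using `np.mean` on the same matrix (the variable names `meanByCols` and `meanByRows` are ours, chosen to mirror the sum example):

```python
import numpy as np

nparray2 = np.array([[1, -1], [2, -2], [3, -3]])  # the same 3 x 2 matrix

meanByCols = np.mean(nparray2, axis=0)  # mean of each column, 2 elements
meanByRows = np.mean(nparray2, axis=1)  # mean of each row, 3 elements

print('Mean by columns:')
print(meanByCols)
print('Mean by rows:')
print(meanByRows)
```

Each column averages to 2 and -2 respectively, and every row averages to 0, since each row holds a number and its negative.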

**4. How To Get The Index In For Loops In Python**

Assume that you have a list and, apart from each element, you also want its position, i.e. the index. In this case, the **enumerate** function will do the trick. For example:

```python
mylist = [10, 30, 100]

for index, element in enumerate(mylist):
    print(index, element)
```

**Output:**

```
0 10
1 30
2 100
```
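If the count should begin at 1 rather than 0, `enumerate` accepts an optional `start` argument. A small sketch (the names `position` and `pairs` are ours):

```python
mylist = [10, 30, 100]

# start=1 makes the counter begin at 1 instead of 0
for position, element in enumerate(mylist, start=1):
    print(position, element)

# Collect the (position, element) pairs to inspect them in one go
pairs = list(enumerate(mylist, start=1))
print(pairs)  # [(1, 10), (2, 30), (3, 100)]
```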

**5. Number Of Decimal Digits In f-Strings**

If we want to show a specific number of decimal digits in f-strings, we can use `f'{v:.2f}'`, where **v** is our variable. For example, without a format specifier the numbers are printed at full precision:

```python
import numpy as np

my_random = np.random.rand(10)

for i, n in enumerate(my_random):
    print(f'The {i} number is {n}')
```

Now, if we add `:.2f`, we get 2 decimal places. For example:

```python
for i, n in enumerate(my_random):
    print(f'The {i} number is {n:.2f}')
```
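The part after the colon is a standard Python format specifier, so other layouts work the same way. The sketch below uses fixed values instead of random ones so that the results are reproducible; the variable names are ours.

```python
v = 1234.56789  # a fixed value so the output is reproducible

two_decimals = f'{v:.2f}'     # fixed-point, 2 decimal places
with_commas = f'{v:,.1f}'     # thousands separator, 1 decimal place
as_percent = f'{0.2567:.1%}'  # percentage with 1 decimal place

print(two_decimals)  # 1234.57
print(with_commas)   # 1,234.6
print(as_percent)    # 25.7%
```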

**6. How To Generate Sequences In Python**

We can generate sequences in Python using the `range` function, but it supports integers only. Of course, we can generate a series of integers and then divide the elements by a number, like 10, but let’s discuss some more straightforward ways using NumPy.

**arange**

We can generate a series using NumPy's `arange` as follows:

```python
import numpy as np

# start point = 0, end point (not included) is 1.05 and step is 0.05
np.arange(0, 1.05, 0.05)
```

**Output:**

```
array([0.  , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,
       0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ])
```

**linspace**

With NumPy's `linspace` we specify the starting point, the ending point (included by default) and the number of points to return, and the step size is computed for us.

```python
np.linspace(0, 1, 21)
```

**Output:**

```
array([0.  , 0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 ,
       0.55, 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95, 1.  ])
```
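For comparison, the plain-Python approach mentioned at the start of this tip (generate integers with `range`, then divide) could be sketched as follows. The name `seq` and the choice of dividing by 20 to get a step of 0.05 are ours.

```python
# 21 integers 0..20, each divided by 20, give 0, 0.05, ..., 1
seq = [i / 20 for i in range(21)]

print(len(seq))         # 21
print(seq[0], seq[-1])  # 0.0 1.0
```

Because the end points come from exact integer division (0/20 and 20/20), they are exact, which avoids having to pick an end value slightly past the target as in the `arange` call above.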

**7. How To Remove New Line Feeds `\n` From Strings In Python**

New line feeds can be of the form `\n` or `\r` or both. We can remove them from the ends of a string using the `strip()` method.

```python
'this is a string\n'.strip()
```

**Output:**

```
'this is a string'
```

Note that there are also `lstrip()` and `rstrip()` for stripping whitespace from the left or the right side only. We can also be more specific about which characters to strip by passing them as an argument:

```python
'this is a string\n'.strip("\n")
```
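Keep in mind that `strip()` only works on the two ends of a string. If line feeds also appear in the middle, one option is to split on line boundaries and join the pieces back together. A small sketch, with a made-up sample string and variable names of our own:

```python
text = 'first line\r\nsecond line\n'

# strip() only touches the two ends, so the inner '\r\n' survives
ends_only = text.strip()

# splitlines() splits on '\n', '\r' and '\r\n' (among other line
# boundaries); joining the pieces with a space drops them all
one_line = ' '.join(text.splitlines())

print(repr(ends_only))  # 'first line\r\nsecond line'
print(repr(one_line))   # 'first line second line'
```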

## R

**8. How To Rename And Relevel Factors**

Factors are a “special” data structure in R. We are going to provide some examples of how to rename and relevel them. For the next examples, we will work with the following data:

```r
df <- data.frame(
  ID = c(1:10),
  Gender = factor(c("M", "M", "M", "", "F", "F", "M", "", "F", "F")),
  AgeGroup = factor(c("[60+]", "[26-35]", "[NA]", "[36-45]", "[46-60]",
                      "[26-35]", "[NA]", "[18-25]", "[26-35]", "[26-35]"))
)
```

**Output:**

```
> df
   ID Gender AgeGroup
1   1      M    [60+]
2   2      M  [26-35]
3   3      M     [NA]
4   4         [36-45]
5   5      F  [46-60]
6   6      F  [26-35]
7   7      M     [NA]
8   8         [18-25]
9   9      F  [26-35]
10 10      F  [26-35]
```

**Rename Factors**

Let’s say that we want to convert the empty string of Gender to **“U”**, for Unknown.

```r
levels(df$Gender)[levels(df$Gender) == ""] <- "U"
```

Let’s say that we want to merge the age groups. For instance, the new categories will be **“[18-35]”, “[35+]”, “[NA]”**:

```r
levels(df$AgeGroup)[levels(df$AgeGroup) == "[18-25]"] <- "[18-35]"
levels(df$AgeGroup)[levels(df$AgeGroup) == "[26-35]"] <- "[18-35]"
levels(df$AgeGroup)[levels(df$AgeGroup) == "[36-45]"] <- "[35+]"
levels(df$AgeGroup)[levels(df$AgeGroup) == "[46-60]"] <- "[35+]"
levels(df$AgeGroup)[levels(df$AgeGroup) == "[60+]"] <- "[35+]"
```

Notice that we could have done it all at once, as shown next, but this is risky because the levels may be in a different order than we expect.

```r
levels(df$AgeGroup) <- c("[18-35]", "[18-35]", "[35+]", "[35+]", "[35+]", "[NA]")
```

By applying the changes we mentioned before, we get the following data.

```
> df
   ID Gender AgeGroup
1   1      M    [35+]
2   2      M  [18-35]
3   3      M     [NA]
4   4      U    [35+]
5   5      F    [35+]
6   6      F  [18-35]
7   7      M     [NA]
8   8      U  [18-35]
9   9      F  [18-35]
10 10      F  [18-35]
```

**Relevel Factors**

Let’s say that we want the **“[NA]”** age group to appear first:

```r
df$AgeGroup <- factor(df$AgeGroup, c("[NA]", "[18-35]", "[35+]"))
```

Another way to change the order is to use `relevel()` to make a particular level first in the list. (This will not work for ordered factors.) Let’s say that we want the ‘F’ Gender first.

```r
df$Gender <- relevel(df$Gender, "F")
```

By applying these changes, we can see how the order of the factor levels has changed.

```
> str(df)
'data.frame':	10 obs. of  3 variables:
 $ ID      : int  1 2 3 4 5 6 7 8 9 10
 $ Gender  : Factor w/ 3 levels "F","U","M": 3 3 3 2 1 1 3 2 1 1
 $ AgeGroup: Factor w/ 3 levels "[NA]","[18-35]",..: 3 2 1 3 3 2 1 2 2 2
```

**9. How To Impute Missing Values In R**

In real-world data, it is quite common to deal with missing values (known as NAs). Sometimes we need to impute them, and the most common approaches are:

- **Numerical Data**: Impute missing values with the **mean** or **median**
- **Categorical Data**: Impute missing values with the **mode**

Let’s give an example of how we can impute dynamically depending on the data type.

```r
library(tidyverse)

df <- tibble(
  id = seq(1, 10),
  ColumnA = c(10, 9, 8, 7, NA, NA, 20, 15, 12, NA),
  ColumnB = factor(c("A", "B", "A", "A", "", "B", "A", "B", "", "A")),
  ColumnC = factor(c("", "BB", "CC", "BB", "BB", "CC", "AA", "BB", "", "AA")),
  ColumnD = c(NA, 20, 18, 22, 18, 17, 19, NA, 17, 23)
)
df
```

We get:

```
# A tibble: 10 x 5
      id ColumnA ColumnB ColumnC ColumnD
 1     1      10 "A"     ""           NA
 2     2       9 "B"     "BB"         20
 3     3       8 "A"     "CC"         18
 4     4       7 "A"     "BB"         22
 5     5      NA ""      "BB"         18
 6     6      NA "B"     "CC"         17
 7     7      20 "A"     "AA"         19
 8     8      15 "B"     "BB"         NA
 9     9      12 ""      ""           17
10    10      NA "A"     "AA"         23
```

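For the categorical variables, we are going to apply a “mode” function, which we have to build ourselves since R does not provide a statistical mode function. Note that in this data the missing categorical values are encoded as empty strings, so the function ignores them.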

```r
getmode <- function(v) {
  v <- v[nchar(as.character(v)) > 0]  # drop empty strings
  uniqv <- unique(v)
  # most frequent value (the first one wins on ties)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
```

Now that we have the “mode” function, we are ready to impute the missing values of a data frame depending on the data type of the columns. If the column data type is numeric we impute it with the **mean**, otherwise with the **mode**. Notice that the script loops over the column names, and `dplyr` requires a special notation (`!!cols :=` on the left-hand side and `!!rlang::sym(cols)` on the right) to refer to columns dynamically by name.

```r
for (cols in colnames(df)) {
  if (cols %in% names(df[, sapply(df, is.numeric)])) {
    # numeric column: replace NA with the column mean
    df <- df %>%
      mutate(!!cols := replace(!!rlang::sym(cols),
                               is.na(!!rlang::sym(cols)),
                               mean(!!rlang::sym(cols), na.rm = TRUE)))
  } else {
    # non-numeric column: replace empty strings with the mode
    df <- df %>%
      mutate(!!cols := replace(!!rlang::sym(cols),
                               !!rlang::sym(cols) == "",
                               getmode(!!rlang::sym(cols))))
  }
}
df
```

**Voilà**! The missing values have been imputed!

```
> df
# A tibble: 10 x 5
      id ColumnA ColumnB ColumnC ColumnD
 1     1    10   A       BB         19.2
 2     2     9   B       BB         20
 3     3     8   A       CC         18
 4     4     7   A       BB         22
 5     5    11.6 A       BB         18
 6     6    11.6 B       CC         17
 7     7    20   A       AA         19
 8     8    15   B       BB         19.2
 9     9    12   A       BB         17
10    10    11.6 A       AA         23
```

**10. How To Assign Values Based On Multiple Conditions Of Different Columns**

In the previous post, we showed how we can assign values in Pandas Data Frames based on multiple conditions of different columns.

Again we will work with the famous `titanic` dataset and our scenario is the following:

- If the `Age` is `NA` and `Pclass`=1 then the Age=40
- If the `Age` is `NA` and `Pclass`=2 then the Age=30
- If the `Age` is `NA` and `Pclass`=3 then the Age=25
- Else the `Age` will remain as is

**Load the Data**

```r
library(dplyr)

url <- 'https://gist.githubusercontent.com/michhar/2dfd2de0d4f8727f873422c5d959fff5/raw/ff414a1bcfcba32481e4d4e8db578e55872a2ca1/titanic.csv'
df <- read.csv(url, sep = "\t")
```

**Use of case_when function of dplyr**

For this task, we will use the `case_when` function of `dplyr` as follows:

```r
df <- df %>% mutate(New_Column = case_when(
  is.na(Age) & Pclass == 1 ~ 40,
  is.na(Age) & Pclass == 2 ~ 30,
  is.na(Age) & Pclass == 3 ~ 25,
  TRUE ~ Age
))
```

Let’s have a look at the Age, Pclass and the New_Column that we created.

```r
df %>% select(Age, Pclass, New_Column)
```

As we can see, we get the expected results: row 6 has a missing `Age` and `Pclass` 3, so its `New_Column` value is 25.

```
    Age Pclass New_Column
1 22.00      3      22.00
2 38.00      1      38.00
3 26.00      3      26.00
4 35.00      1      35.00
5 35.00      3      35.00
6    NA      3      25.00
```
