Build a data frame from vectors

[This article was first published on Quantargo Blog, and kindly contributed to R-bloggers].

Tabular data is the most common format used by data scientists. In R, tables are represented by data frames, which can be inspected by printing them to the console.

  • Understand why data frames are important
  • Interpret console output created by a data frame
  • Create a new data frame using the data.frame() function
  • Define vectors to be used for single columns
  • Specify names of data frame columns
data.frame(___ = ___, 
           ___ = ___, 
           ...)

Introduction to Data Frames

In data analysis and statistics, tabular data is the most important data structure. It appears in many common formats such as Excel files, comma-separated values (CSV) files and databases. R treats tabular data as a first-class citizen of the language through data frames, which allow users to easily read and manipulate tabular data.

Let’s take a look at a data frame object named Davis, from the package carData, which includes height and weight measurements for 200 men and women:

Davis
  sex weight height repwt repht
1   M     77    182    77   180
2   F     58    161    51   159
3   F     53    161    54   158
 [ reached 'max' / getOption("max.print") -- omitted 197 rows ]

From the printed output we can see that the data frame has 200 rows (3 printed, 197 omitted) and 5 columns. Each row describes one person through attributes that correspond to the columns sex, weight, height, reported weight (repwt) and reported height (repht).

For example, the first row in the table describes a male who weighs 77 kg and is 182 cm tall. His reported values are very close: 77 kg and 180 cm, respectively.

The rows in a data frame are further identified by row names on the left, which are simply the row numbers by default. In the case of the Davis dataset above, the row names range from 1 to 200.
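
The dimensions, column names and row names read off the printed output can also be queried directly. The following is a minimal sketch: to stay self-contained it rebuilds only the three printed rows of Davis by hand instead of loading carData, and the variable name davis_head is made up for this illustration.

```r
# Rebuild the three printed rows of Davis by hand (illustration only;
# the real data set has 200 rows and lives in the carData package).
davis_head <- data.frame(
  sex    = c("M", "F", "F"),
  weight = c(77, 58, 53),
  height = c(182, 161, 161),
  repwt  = c(77, 51, 54),
  repht  = c(180, 159, 158)
)

nrow(davis_head)      # number of rows
ncol(davis_head)      # number of columns
names(davis_head)     # column (attribute) names
rownames(davis_head)  # default row names: the row numbers, stored as text
```

Applied to the full Davis object, the same calls should report the 200 rows and 5 columns described above.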

Quiz: Data Frame Output

      rank discipline yrs.since.phd yrs.service  sex salary
1     Prof          B            19          18 Male 139750
2     Prof          B            20          16 Male 173200
3 AsstProf          B             4           3 Male  79750
 [ reached 'max' / getOption("max.print") -- omitted 394 rows ]

The data frame above shows the nine-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S.

Which answers about the data frame printed above are correct?
  • The data frame has 3 rows.
  • The data frame has 394 rows.
  • The data frame has 397 rows.
  • The data frame has 6 attributes.
  • The attribute names contain Prof and AsstProf.

Quiz: Data Frame Output (2)

      rank discipline yrs.since.phd yrs.service  sex salary
1     Prof          B            19          18 Male 139750
2     Prof          B            20          16 Male 173200
3 AsstProf          B             4           3 Male  79750
 [ reached 'max' / getOption("max.print") -- omitted 394 rows ]

The data frame above shows the nine-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S.

Which answers about the first three faculty members are correct?
  • All three are male.
  • The salaries of all three members are about the same.
  • The Professor in row three is most probably the oldest.
  • All shown professors are from the same discipline.
  • The highest salary amongst the three Professors is $139,750.

Creating Data Frames

data.frame(___ = ___, 
           ___ = ___, 
           ...)

Data frames hold tabular data in columns, also called attributes. Each column is represented by a vector, and different columns can have different data types such as numbers or characters. The data.frame() function constructs a data frame object by combining vectors into a table. To form a table, the vectors are required to have equal lengths. A data frame can therefore be seen as a collection of equal-length vectors connected together to form a table.
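
The equal-length requirement can be checked with a small experiment. This is a sketch with made-up vectors x, y and z; tryCatch() is used only so that the error message can be captured instead of stopping the script.

```r
x <- c(1, 2, 3)
y <- c("a", "b", "c")
z <- c(TRUE, FALSE)  # one element short

# Vectors of equal length combine without problems.
ok <- data.frame(x = x, y = y)
nrow(ok)

# Vectors whose lengths do not fit together make data.frame() stop with an error.
msg <- tryCatch(
  {
    data.frame(x = x, z = z)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg
```

Note that data.frame() does recycle a shorter vector when its length divides the longest one exactly (a single value, for example), so the rule is best read as "the lengths must fit together".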

Let’s create our first data frame describing four persons with their ids, names and an indicator of whether they are female. Each of these attributes is created from a vector of a different data type (numeric, character and logical). The attributes are finally combined into a table using the data.frame() function:

data.frame(
  c(1, 2, 3, 4),
  c("Louisa", "Jonathan", "Luigi", "Rachel"),
  c(TRUE, FALSE, FALSE, TRUE)
)
  c.1..2..3..4. c..Louisa....Jonathan....Luigi....Rachel..
1             1                                     Louisa
2             2                                   Jonathan
3             3                                      Luigi
4             4                                     Rachel
  c.TRUE..FALSE..FALSE..TRUE.
1                        TRUE
2                       FALSE
3                       FALSE
4                        TRUE

The resulting data frame stores the values of each vector in a different column. It has four rows and three columns. However, the column names printed on the first line seem to consist of the column values separated by dots, which is a very strange naming scheme!
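
The odd names are not random. As far as the base R documentation describes it, data.frame() falls back to the text of each unnamed argument and turns it into a syntactically valid name, replacing characters such as parentheses, commas and spaces with dots. The sketch below reproduces this for the first column with make.names(); the variable name unnamed is made up for the illustration.

```r
# A one-column data frame built from an unnamed vector
unnamed <- data.frame(c(1, 2, 3, 4))
names(unnamed)

# make.names() applies the same clean-up to the text of the expression
make.names("c(1, 2, 3, 4)")
```

Both calls should return the name seen in the printed output above, c.1..2..3..4.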

Column names can be passed to data.frame() as argument names preceding the column vectors. To improve the column naming of the previous data frame we can write:

data.frame(
  id = c(1, 2, 3, 4),
  name = c("Louisa", "Jonathan", "Luigi", "Rachel"),
  female = c(TRUE, FALSE, FALSE, TRUE)
)
  id     name female
1  1   Louisa   TRUE
2  2 Jonathan  FALSE
3  3    Luigi  FALSE
4  4   Rachel   TRUE

The resulting data frame includes the column names needed to see the actual meaning of the different columns.
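
The column vectors do not have to be written inline. A sketch of the same table built from vectors defined beforehand follows. The variable name persons is made up here, and stringsAsFactors = FALSE is an addition that keeps the name column a character vector on R versions before 4.0, where the default differed.

```r
# Define one vector per column first
id     <- c(1, 2, 3, 4)
name   <- c("Louisa", "Jonathan", "Luigi", "Rachel")
female <- c(TRUE, FALSE, FALSE, TRUE)

# Combine the vectors; the argument names become the column names
persons <- data.frame(
  id = id,
  name = name,
  female = female,
  stringsAsFactors = FALSE
)

names(persons)          # the column names given as argument names
sapply(persons, class)  # data type of each column
persons$name            # a single column is again a vector
```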

Exercise: Creating Your First Data Frame

weekday    temperature  hot
Monday     28           FALSE
Tuesday    31           TRUE
Wednesday  25           FALSE

Let’s create a data frame as shown above using the data.frame() function. The resulting data frame should consist of the three columns weekday, temperature and hot:

  1. The first column named weekday contains the weekday names "Monday", "Tuesday", "Wednesday".
  2. The second column named temperature contains the temperatures (in degrees Celsius) as 28, 31, 25.
  3. The third column named hot contains the logical values FALSE, TRUE, FALSE.

Store the final data frame in the variable temp and print its output to the console.
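
One possible solution is given here as a sketch that follows the three steps above. The argument stringsAsFactors = FALSE is an optional addition, not part of the exercise, so that weekday stays a character vector on R versions before 4.0.

```r
# Build the data frame column by column and store it in temp
temp <- data.frame(
  weekday = c("Monday", "Tuesday", "Wednesday"),
  temperature = c(28, 31, 25),
  hot = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Print the data frame to the console
temp
```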


Quiz: Which statements are true about this data frame?

price <- c(28, 31, 25)
data.frame(
  weekday = c("Monday", "Tuesday", "Wednesday", "Thursday"),
  price = price,
  expensive = price > 30
)
Which statements are true about the data frame above?
  • The data.frame() function will fail because the column expensive is not a vector.
  • The data.frame() function will not fail.
  • The data.frame() function fails because the lengths of the vectors are different.
  • The command would work if weekday had the values c("Monday", "Tuesday", "Wednesday").

Build a data frame from vectors is an excerpt from the course Introduction to R, which is available for free at https://www.quantargo.com
