vector and tibble in R
A data structure is an orderly, efficient way to store and retrieve data. While it may feel a bit abstract if you're new to coding, think of these structures as the "nouns" in your programming language (with data wrangling serving as the "verbs").
R provides many data structures. Fortunately, nine times out of ten, you'll only need to worry about two: vector and tibble.
A vector is a one-dimensional array, like a row of seats in a movie theater that only allows one type of audience member, be they numeric, character, or logical. We create vectors with the c() function. For example:
numbers <- c(1, 2, 3, 4, 5)  # Create a vector of numbers from 1 to 5.
numbers                      # Display the numbers vector.
[1] 1 2 3 4 5
What does c() stand for?
c() means "combine." Yes, it's a slightly cryptic name for what you'll soon be typing all the time, but you'll get used to it.
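Because c() combines its arguments, it can also join existing vectors into one longer vector. A minimal sketch (the names first_half, second_half, and all_numbers are made up for illustration):

```r
first_half  <- c(1, 2, 3)
second_half <- c(4, 5)

# c() flattens its arguments into a single vector
all_numbers <- c(first_half, second_half)
all_numbers          # [1] 1 2 3 4 5
length(all_numbers)  # [1] 5
```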
Vectors form the backbone of R scripts. They're how you can store the number of penguins wandering around Antarctica, keep track of class attendance, or tally cars on the highway. A key detail: everything in a vector must share the same type. Some common types are:
- numeric (like 3.14 or 42)
- character (like "Hello" or "World")
- logical (TRUE or FALSE)
- factor (categorical data)
Below, we make a few different vectors to illustrate these types:
numbers <- c(1.1, 2.2, 3.3, 4.4, 5.5)
characters <- c("a", "b", "c")
logicals <- c(TRUE, FALSE, TRUE)
factors <- factor(c("a", "b", "c"))
R automatically assigns the data type for each vector, which can save time but also lead to (unwanted) surprises. For instance, sometimes 1 appears as a character rather than a number, which can introduce annoying bugs. Avoid headaches by checking data types with class(). For example:
class(numbers)
[1] "numeric"
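To see how a number can end up as a character, here is a small sketch of what happens when types are mixed in a single vector. The names mixed and digits_as_text are hypothetical, chosen only for this illustration:

```r
# Mixing types in one vector forces R to pick a single common type
mixed <- c(1, "two", TRUE)
mixed         # [1] "1"    "two"  "TRUE"
class(mixed)  # [1] "character"

# Numbers stored as text can be converted back explicitly
digits_as_text <- c("1", "2", "3")
class(digits_as_text)       # [1] "character"
as.numeric(digits_as_text)  # [1] 1 2 3
```

Because a vector can hold only one type, R quietly converts every element to the most flexible type present (here, character) rather than raising an error.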
A horror story (not in R but in Excel): the auto-convert feature mangles gene names like SEPT4 into 4-Sept, silently changing the data type from character to datetime. This has affected a ton of genetics papers. So, always be aware of your data types!
We can also extract data from a vector using square brackets []. For example:
numbers[1]     # Extract the first element of the numbers vector.
characters[2]  # Extract the second element of the characters vector.
[1] 1.1
[1] "b"
Note that the index starts from 1, not 0. This is a common source of confusion for those coming from languages like Python or C. I personally prefer the 1-based index, as this is more intuitive.
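Square brackets accept more than a single position. The sketch below re-creates the numbers vector from earlier so that it runs on its own, and shows a few common indexing patterns:

```r
numbers <- c(1.1, 2.2, 3.3, 4.4, 5.5)

numbers[c(1, 3)]      # first and third elements:    1.1 3.3
numbers[2:4]          # second through fourth:       2.2 3.3 4.4
numbers[-1]           # everything except the first: 2.2 3.3 4.4 5.5
numbers[numbers > 3]  # elements greater than 3:     3.3 4.4 5.5
```

One more trap for Python users: a negative index in R drops that position instead of counting from the end.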
A tibble (brought to you by the tidyverse) is essentially a set of vectors bound together in columns, like a multi-row, multi-column theater, where each column houses a single data type.
As a simple example:
library(tidyverse)  # Load the tidyverse package, as tibble is from it.
# Create a tibble with three columns of different types.
tibble(
  x = c(1, 2, 3),
  y = c("a", "b", "c"),
  z = c(TRUE, FALSE, TRUE)
)
There is another, equivalent way to write a tibble row by row, using tribble(). For example:
tribble(
~x, ~y,
1, "a",
2, "b",
3, "c"
)
This is perfect when typing in smaller datasets by hand.
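As a quick check of that equivalence, the sketch below builds the same two columns both ways and compares them. The names by_column and by_row are made up for illustration, and it loads only the tibble package (part of the tidyverse) to keep the example light:

```r
library(tibble)  # tibble() and tribble() both come from the tibble package

# Column by column
by_column <- tibble(
  x = c(1, 2, 3),
  y = c("a", "b", "c")
)

# Row by row
by_row <- tribble(
  ~x, ~y,
  1, "a",
  2, "b",
  3, "c"
)

# Both approaches produce the same columns
identical(by_column$x, by_row$x)  # TRUE
identical(by_column$y, by_row$y)  # TRUE
```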
Tibbles come with plenty of benefits over base R's classic data frame, particularly in how they track and preserve data types. You'll notice each column explicitly labeled as dbl (double) or chr (character), which can save you from those frustrating mysteries where your numeric data gets disguised as text.
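The sketch below illustrates one concrete difference in how types are preserved, using two small made-up tables named tb and df. It assumes only the tibble package and base R:

```r
library(tibble)

tb <- tibble(x = c(1, 2, 3), y = c("a", "b", "c"))
df <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))

# Each column is an ordinary vector with its own type
sapply(tb, class)  # x is "numeric", y is "character"

# Selecting one column with [ , ] keeps a tibble a tibble,
# while a base data frame drops down to a plain vector
class(tb[, "x"])   # "tbl_df" "tbl" "data.frame"
class(df[, "x"])   # "numeric"
```

The tibble behavior is more predictable: the result of a subsetting operation with [ ] is always another tibble, whatever the number of columns selected.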
Tibbles also encourage a neat and consistent data layout, called Tidy Data, which is a godsend when you're dealing with other people's messy spreadsheets. Quoting Hadley Wickham:
Tidy datasets are easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.
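To make the definition concrete, here is a small sketch contrasting an untidy and a tidy layout of the same information. The counts are invented purely for illustration, and the column names count_2007, count_2008, year, and count are hypothetical:

```r
library(tibble)

# Untidy: the year variable is hidden inside two column names
untidy <- tribble(
  ~species, ~count_2007, ~count_2008,
  "Adelie",          10,          12,
  "Gentoo",           8,           9
)

# Tidy: each variable is a column, each observation is a row
tidy <- tribble(
  ~species, ~year, ~count,
  "Adelie",  2007,     10,
  "Adelie",  2008,     12,
  "Gentoo",  2007,      8,
  "Gentoo",  2008,      9
)
```

Both tables hold the same four counts, but only the tidy version lets you refer to year as a variable in its own right.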
We will learn more about working with tibbles in the next sections, but for now, let's move on to our example dataset.
Create a tibble with the following columns:
- name with your name and surname
- age with your age
- is_student with a logical value
penguins
We'll soon learn how to import data (another comedic ordeal in itself), but for now, let's borrow the penguins dataset from the palmerpenguins package. It covers three penguin species (Adelie, Chinstrap, and Gentoo) and gives us some useful variables to play with. Who doesn't love penguins?
First things first, let us load the penguins dataset.
# If you have not installed the palmerpenguins package, install it by uncommenting the next line.
# install.packages("palmerpenguins")
library(palmerpenguins)           # Load the palmerpenguins package.
data(package = "palmerpenguins")  # List the datasets shipped with the package, including penguins.
As a first rule, always, always take a look at your data. The simplest way is to print it:
penguins  # Print the penguins dataset.
This prints only the first ten rows of the dataset rather than flooding your console (another reason to love tibble over data frame in base R).
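Printing is not the only way to take a first look. The sketch below uses a few base R functions on the penguins dataset and assumes the palmerpenguins package is installed:

```r
library(palmerpenguins)

dim(penguins)      # number of rows and columns: 344 8
names(penguins)    # column names
head(penguins, 3)  # first three rows only
str(penguins)      # compact overview of each column's type
```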
You may notice some NA values in the dataset. These are missing values, which are common in real-world datasets. We'll learn how to handle them later.
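As a small preview, here is a sketch of how NA behaves in a plain vector. The vector flipper_lengths and its values are made up for illustration and are not taken from the penguins dataset:

```r
flipper_lengths <- c(181, NA, 195, 193)  # made-up values with one missing entry

is.na(flipper_lengths)       # FALSE  TRUE FALSE FALSE
sum(is.na(flipper_lengths))  # 1

# Most summaries return NA unless told to skip missing values
mean(flipper_lengths)                # NA
mean(flipper_lengths, na.rm = TRUE)  # 189.6667
```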
If you'd like a better view of your dataset, try the skim() function from the skimr package:
pacman::p_load(skimr)  # Load the skimr package (pacman installs it first if needed).
skim(penguins)         # Skim the penguins dataset.
Name | penguins |
Number of rows | 344 |
Number of columns | 8 |
_______________________ | |
Column type frequency: | |
factor | 3 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
species | 0 | 1.00 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
island | 0 | 1.00 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
sex | 11 | 0.97 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
year | 0 | 1.00 | 2008.03 | 0.82 | 2007.0 | 2007.00 | 2008.00 | 2009.0 | 2009.0 | ▇▁▇▁▇ |
Think of skim() as your quick backstage pass, telling you how many rows, columns, missing values, and data types you're dealing with. Trust me, a few seconds spent peeking at your data can save hours of confusion down the line.
iris is a widely used dataset of iris flower measurements that ships with R by default. Take a look at the iris dataset using the skim() function.