This is a practice data analysis in R that compares the Palmer penguin species and their features.
The first step is to set up the environment for the analysis by loading the tidyverse and palmerpenguins packages.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(palmerpenguins)
Now let’s get an overview of the ‘penguins’ dataset using the str() and head() functions. These show the column names and data types, along with the first few values and rows.
str(penguins)
## tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
## $ year : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
head(penguins)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex
## <fct> <fct> <dbl> <dbl> <int> <int> <fct>
## 1 Adelie Torge… 39.1 18.7 181 3750 male
## 2 Adelie Torge… 39.5 17.4 186 3800 fema…
## 3 Adelie Torge… 40.3 18 195 3250 fema…
## 4 Adelie Torge… NA NA NA NA <NA>
## 5 Adelie Torge… 36.7 19.3 193 3450 fema…
## 6 Adelie Torge… 39.3 20.6 190 3650 male
## # … with 1 more variable: year <int>
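The str() output already shows some NA entries, so it can help to know how many values are missing in each column before summarizing. The following is a minimal sketch, assuming dplyr 1.0.0 or later for across(); the name na_counts is only illustrative.

```r
library(dplyr)
library(palmerpenguins)

# Count the missing values in every column of the dataset.
# na_counts is a hypothetical name chosen for this sketch.
na_counts <- penguins %>%
  summarize(across(everything(), ~ sum(is.na(.x))))

na_counts
```

Columns with a non-zero count are the ones affected by any later NA handling.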
Now that we have seen an overview of the dataset, we can work on getting some insights on the features of the penguin species.
Let’s determine the maximum, minimum, and average flipper length and body mass for each species of penguin.
Let’s analyze the flipper lengths first for each species:
flipperlength_summary <- penguins %>%
  group_by(species) %>%
  # drop_na() removes rows with a missing value in any column
  drop_na() %>%
  summarize(
    max = max(flipper_length_mm),
    min = min(flipper_length_mm),
    avg = mean(flipper_length_mm)
  )
flipperlength_summary
## # A tibble: 3 × 4
## species max min avg
## <fct> <int> <int> <dbl>
## 1 Adelie 210 172 190.
## 2 Chinstrap 212 178 196.
## 3 Gentoo 231 203 217.
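Called without column arguments, drop_na() removes every row that has an NA in any column, so penguins with an unknown sex are left out of the summary even when their flipper length was measured. As a hedged alternative, the sketch below keeps those rows and skips only the missing measurements with na.rm = TRUE. The names flipper_na_rm and n_measured are illustrative and not part of the original analysis.

```r
library(dplyr)
library(palmerpenguins)

# Summarize flipper length per species, ignoring only the rows
# where the flipper length itself is missing.
flipper_na_rm <- penguins %>%
  group_by(species) %>%
  summarize(
    n_measured = sum(!is.na(flipper_length_mm)),  # penguins with a measurement
    max = max(flipper_length_mm, na.rm = TRUE),
    min = min(flipper_length_mm, na.rm = TRUE),
    avg = mean(flipper_length_mm, na.rm = TRUE)
  )

flipper_na_rm
```

Because more rows are kept, the averages may differ slightly from those in the table above.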
Next, let’s analyze the body mass:
bodymass_summary <- penguins %>%
  group_by(species) %>%
  # drop_na() removes rows with a missing value in any column
  drop_na() %>%
  summarize(
    max = max(body_mass_g),
    min = min(body_mass_g),
    avg = mean(body_mass_g)
  )
bodymass_summary
## # A tibble: 3 × 4
## species max min avg
## <fct> <int> <int> <dbl>
## 1 Adelie 4775 2850 3706.
## 2 Chinstrap 4800 2700 3733.
## 3 Gentoo 6300 3950 5092.
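The flipper length and body mass summaries follow the same steps, so they could also be computed in a single pass. The sketch below is one possible way to do that, assuming dplyr 1.0.0 or later for across(). The name combined_summary and the column naming pattern are choices made for this example only.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Compute max, min and mean for both features in one summarize() call.
# The .names pattern yields columns such as flipper_length_mm_max.
combined_summary <- penguins %>%
  drop_na() %>%
  group_by(species) %>%
  summarize(
    across(
      c(flipper_length_mm, body_mass_g),
      list(max = max, min = min, avg = mean),
      .names = "{.col}_{.fn}"
    )
  )

combined_summary
```

This gives one row per species with six summary columns, which makes it harder to print the wrong table by accident.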
Now that we have analyzed the flipper length and body mass of each species, let’s visualize and compare the two features using scatter plots.
First, let’s create a plot that compares flipper length and body mass for the entire dataset, regardless of species.
ggplot(data = penguins) + geom_point(mapping = aes(x = body_mass_g, y = flipper_length_mm))
## Warning: Removed 2 rows containing missing values (geom_point).
The plot suggests a positive relation between body mass and flipper length: heavier penguins tend to have longer flippers. Let’s check the trend by adding a smoothed trend line.
ggplot(data = penguins, mapping = aes(x = body_mass_g, y = flipper_length_mm)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
The smoothed line shows that there is a relationship between the features, but not an exactly linear one. Now let’s compare how flipper length and body mass are related within each species.
ggplot(data = penguins) + geom_point(mapping = aes(x = body_mass_g, y = flipper_length_mm, color = species))
## Warning: Removed 2 rows containing missing values (geom_point).
From this plot we can see that the Gentoo species clearly stands out from the other two species, with longer flippers and higher body mass.
Let’s look at the species separately in multiple plots:
ggplot(data = penguins) + geom_point(mapping = aes(x = body_mass_g, y = flipper_length_mm, color = species)) + facet_wrap(~species)
## Warning: Removed 2 rows containing missing values (geom_point).
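To put a number on what the faceted plots show, one option is the Pearson correlation between body mass and flipper length within each species. This is only a sketch: the correlation captures linear association alone, and the names species_cor, n and correlation are illustrative.

```r
library(dplyr)
library(palmerpenguins)

# Pearson correlation between body mass and flipper length per species.
# use = "complete.obs" skips penguins where either measurement is missing.
species_cor <- penguins %>%
  group_by(species) %>%
  summarize(
    n = sum(!is.na(body_mass_g) & !is.na(flipper_length_mm)),
    correlation = cor(body_mass_g, flipper_length_mm, use = "complete.obs")
  )

species_cor
```

Values closer to 1 indicate a tighter positive linear relation between the two features within that species.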
From this analysis of the provided data sample, we can infer that Gentoo is the largest of the three species and that Adelie is, on average, the smallest, although the Adelie and Chinstrap ranges overlap. More data and further analysis are needed to gain more insight into this relation and to determine whether there is a direct relation between the features of the penguin species.