Machine Learning Penguins
What is a penguin’s favourite type of lettuce?
Iceberg!
🐧 Introduction To Applied Python: Penguins Analysis 🐧 ¶
Hello!
Today, we will be evaluating the Palmer Penguins data set collected by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.
Using visual and mathematical analysis alongside machine learning models, we can create a predictive model that will be able to evaluate the species of a penguin based on limited information.
Outlined below are the steps we will take on our computational analysis journey!
- Importing Data and Modules
- Exploratory Analysis
- Cleaning and Splitting of Data
- Modeling of Machine Learning
- Visualization and Testing of Machine Learning Models
Initial Set Up: Data and Package Importing ¶
As always, we must import the libraries and modules necessary for visualization and analysis.
The following imports will help us produce and troubleshoot our programs. While you may not see some of them used right now, it's imperative to import everything into your environment at the start to keep things organized and avoid bugs!
import matplotlib.pyplot as plt
import random
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import cross_val_score
from sklearn import tree, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
#read in data
url = 'https://philchodrow.github.io/PIC16A/datasets/palmer_penguins.csv'
penguins = pd.read_csv(url)
#taking a sneak-peek at the data
penguins.head(3)
| | studyName | Sample Number | Species | Region | Island | Stage | Individual ID | Clutch Completion | Date Egg | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | Comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PAL0708 | 1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A1 | Yes | 11/11/07 | 39.1 | 18.7 | 181.0 | 3750.0 | MALE | NaN | NaN | Not enough blood for isotopes. |
| 1 | PAL0708 | 2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A2 | Yes | 11/11/07 | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE | 8.94956 | -24.69454 | NaN |
| 2 | PAL0708 | 3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A1 | Yes | 11/16/07 | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE | 8.36821 | -25.33302 | NaN |
Looking at the first few rows of the dataframe allows us to take a quick look at what type of data we are working with.
| Variable Name | Description |
|---|---|
| 'studyName' | Name of the study the sample was collected under |
| 'Sample Number' | Sample number within the study |
| 'Species' | Penguin species of the sample |
| 'Region' | Region where the sample was collected |
| 'Island' | Island where the sample lives |
| 'Stage' | Stage of life of the sample |
| 'Individual ID' | ID of the sample |
| 'Clutch Completion' | Whether the clutch was completed (Yes/No) |
| 'Date Egg' | Date the egg was observed |
| 'Culmen Length (mm)' | Length of the culmen (the upper ridge of the bill) |
| 'Culmen Depth (mm)' | Depth of the culmen |
| 'Flipper Length (mm)' | Flipper length |
| 'Body Mass (g)' | Body mass |
| 'Sex' | MALE/FEMALE |
| 'Delta 15 N (o/oo)' | Nitrogen isotope ratio from blood samples |
| 'Delta 13 C (o/oo)' | Carbon isotope ratio from blood samples |
| 'Comments' | Extra comments |
Exploratory Analysis ¶
As seen in the glimpse of the dataset, there is a large amount of data to parse through so that we can find the columns relevant to our analysis ¶
In order to prepare our data for visual analysis and machine learning, we must first undergo exploratory analysis to understand the penguins dataset better and evaluate what types of data we should be moving forward with.
In this instance, there are two goals we need to achieve in exploratory analysis:
- Mathematically evaluate quantitative data by qualitative features
- Visually evaluate data
#isolate columns of interest from the penguins dataset
explore = penguins[["Species","Island","Culmen Depth (mm)","Body Mass (g)","Sex",
"Culmen Length (mm)","Flipper Length (mm)","Delta 15 N (o/oo)",
"Delta 13 C (o/oo)"]]
#functions should always have a docstring, so that when the help method is invoked on your custom functions,
#it returns informative documentation on what the function does.
def infomaker(df):
    """
    Function will take the penguins dataframe and make a new dataframe containing the means of the numerical columns
    """
    info = df[["Culmen Depth (mm)","Body Mass (g)","Culmen Length (mm)","Flipper Length (mm)","Delta 15 N (o/oo)","Delta 13 C (o/oo)"]].mean()
    return info
#apply infomaker function on groups of the explore dataframe based on "Species","Island","Sex"
show = explore.groupby(["Species","Island","Sex"]).apply(infomaker)
display(show)
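As an aside, the same summary table can be produced without a helper function by grouping and then averaging only the numeric measurement columns directly. Here is a minimal sketch of that alternative (numeric_cols is just an illustrative name):
#sketch: group by the qualitative features, then average only the numeric measurement columns
numeric_cols = ["Culmen Depth (mm)", "Body Mass (g)", "Culmen Length (mm)",
                "Flipper Length (mm)", "Delta 15 N (o/oo)", "Delta 13 C (o/oo)"]
display(explore.groupby(["Species", "Island", "Sex"])[numeric_cols].mean())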
fig, ax = plt.subplots(1)
fig.set_size_inches(10,10)
plt.title("Culmen Depth and Body Mass by Species and Island")
sns.scatterplot(
x = "Culmen Depth (mm)",
y = "Body Mass (g)",
hue = "Species", #hue will automtically parse the different unique ID's in the category
style = "Island", #style will automatically parse the different unique ID's in the category
palette = "deep",
s = 200,
alpha = 0.7,
data = explore)
plt.legend(loc="lower right", bbox_to_anchor=(1.5, 0))
plt.tight_layout()
plt.show()
fig, ax = plt.subplots(1)
fig.set_size_inches(14,8)
plt.title("Spread of Species on Islands")
plt.ylabel("Density")
sns.histplot(data=penguins,
x="Island",
hue = "Species",
multiple = "dodge",
palette = "pastel",
shrink = 0.7)
plt.tight_layout()
plt.show()
| Species | Island | Sex | Culmen Depth (mm) | Body Mass (g) | Culmen Length (mm) | Flipper Length (mm) | Delta 15 N (o/oo) | Delta 13 C (o/oo) |
|---|---|---|---|---|---|---|---|---|
| Adelie Penguin (Pygoscelis adeliae) | Biscoe | FEMALE | 17.704545 | 3369.318182 | 37.359091 | 187.181818 | 8.774242 | -25.920176 |
| Adelie Penguin (Pygoscelis adeliae) | Biscoe | MALE | 19.036364 | 4050.000000 | 40.590909 | 190.409091 | 8.872945 | -25.917227 |
| Adelie Penguin (Pygoscelis adeliae) | Dream | FEMALE | 17.618519 | 3344.444444 | 36.911111 | 187.851852 | 8.914803 | -25.736636 |
| Adelie Penguin (Pygoscelis adeliae) | Dream | MALE | 18.839286 | 4045.535714 | 40.071429 | 191.928571 | 8.984427 | -25.759120 |
| Adelie Penguin (Pygoscelis adeliae) | Torgersen | FEMALE | 17.550000 | 3395.833333 | 37.554167 | 188.291667 | 8.663160 | -25.738735 |
| Adelie Penguin (Pygoscelis adeliae) | Torgersen | MALE | 19.391304 | 4034.782609 | 40.586957 | 194.913043 | 8.919919 | -25.835347 |
| Chinstrap penguin (Pygoscelis antarctica) | Dream | FEMALE | 17.588235 | 3527.205882 | 46.573529 | 191.735294 | 9.250962 | -24.565405 |
| Chinstrap penguin (Pygoscelis antarctica) | Dream | MALE | 19.252941 | 3938.970588 | 51.094118 | 199.911765 | 9.464535 | -24.527679 |
| Gentoo penguin (Pygoscelis papua) | Biscoe | . | 15.700000 | 4875.000000 | 44.500000 | 217.000000 | 8.041110 | -26.184440 |
| Gentoo penguin (Pygoscelis papua) | Biscoe | FEMALE | 14.237931 | 4679.741379 | 45.563793 | 212.706897 | 8.193405 | -26.197205 |
| Gentoo penguin (Pygoscelis papua) | Biscoe | MALE | 15.718033 | 5484.836066 | 49.473770 | 221.540984 | 8.303429 | -26.170608 |
Summary from Exploratory Analysis ¶
The summary table of mean values shows that the Delta 15 N and Delta 13 C values do not differ dramatically across species and sexes. On the other hand, the other four features show promising differences between species: the Gentoo penguins are noticeably larger in body mass and flipper length than the Adelies and Chinstraps, though their culmens are shallower.
From the scatterplot, we can see that when we plot the species on the two axes of body mass and culmen depth, the distinction between Gentoo and the other two species is very apparent; however, the distinction between the Adelie and Chinstrap species is much less apparent within the parameters we chose.
From the histogram, it can be seen that Adelie is a generalist species, in that they can be found on all three islands (Torgersen, Biscoe, and Dream). The Chinstrap and Gentoo species are considered to be specialist species in that each of the species can be found on only one island. The Chinstrap species can only be found on the Dream island, while the Gentoo species can only be found on the Biscoe island. Therefore, an unknown penguin's island is integral for our model to identify its species ID.
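If you want to confirm this island pattern numerically as well as visually, a quick cross-tabulation of island against species (a minimal sketch using the penguins dataframe from above) shows the same thing:
#sketch: count how many penguins of each species were sampled on each island
pd.crosstab(penguins["Island"], penguins["Species"])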
Feel free to return to this section and evaluate different sets of variables to see what other relationships you can find!
Data Cleaning ¶
Before we use the data from the penguins dataset, we need to clean the dataset. In order to avoid polluting the test set and potentially removing valuable data points, we will be splitting the data into 70/30 train and test sets before cleaning.
Note. It is common practice (practically standard) to use train-test-split in machine learning, as it is a simple but powerful way to ensure that your model is evaluated on data it never saw during training.
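The split itself happens a few cells below. One optional refinement, shown here only as a hedged sketch (not what we use), is to stratify the split by species so that the 70/30 train and test sets keep roughly the same species proportions, and to fix a random_state for reproducibility. The names train_strat/test_strat and the random_state value are arbitrary choices for illustration.
#optional sketch (not used below): a stratified, reproducible 70/30 split
train_strat, test_strat = train_test_split(penguins, test_size=0.3,
                                            stratify=penguins["Species"],
                                            random_state=42)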
def convert(df_original):
"""
This function will take in the penguins dataframe and select specified columns to create a new dataframe.
Then, it will be cleansed of any N/A values, have its qualitative features transformed by LabelEncoder,
and finally be split into the predictor and target variables
"""
df = df_original.copy()
#isolate columns found relevant from exploratory analysis
df = df[["Species", "Island","Culmen Depth (mm)","Body Mass (g)", "Culmen Length (mm)","Flipper Length (mm)"]]
df = df.dropna()
#LabelEncoder will change labels into numerical values in alphabetical order (of labels)
le = preprocessing.LabelEncoder()
df['Species'] = le.fit_transform(df['Species'])
#Adelie = 0, Chinstrap = 1, Gentoo = 2
df['Island'] = le.fit_transform(df['Island'])
#Biscoe Island = 0, Dream = 1, Torgersen = 2
#splitting data
X = df.drop(['Species'], axis = 1)
y = df['Species']
return(X, y)
train, test = train_test_split(penguins, test_size = .3)
train_X, train_y = convert(train)
test_X, test_y = convert(test)
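One thing worth noting: convert fits a fresh LabelEncoder on whichever dataframe it receives, so the numeric codes for the train and test splits only stay consistent because both splits contain every species and every island (true for this dataset after a 70/30 split, but worth verifying). Here is a minimal sketch for printing the alphabetical mapping that LabelEncoder produces (le_check is just an illustrative name):
#sketch: inspect the species -> code mapping produced by LabelEncoder
le_check = preprocessing.LabelEncoder()
le_check.fit(penguins["Species"])
print(dict(zip(le_check.classes_, le_check.transform(le_check.classes_))))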
Now that we have split and cleaned our data, let's look at the training data that we will feed into our machine learning models.
train_X.head()
| | Island | Culmen Depth (mm) | Body Mass (g) | Culmen Length (mm) | Flipper Length (mm) |
|---|---|---|---|---|---|
| 164 | 1 | 17.3 | 3700.0 | 47.0 | 185.0 |
| 242 | 0 | 14.5 | 4400.0 | 46.5 | 213.0 |
| 179 | 1 | 19.0 | 3800.0 | 49.5 | 200.0 |
| 162 | 1 | 17.8 | 3800.0 | 46.6 | 193.0 |
| 2 | 2 | 18.0 | 3250.0 | 40.3 | 195.0 |
train_y.head()
164    1
242    2
179    1
162    1
2      0
Name: Species, dtype: int32
As seen above, the first 5 rows of train_X now contain only relevant, numerical data, with the species ID of the samples dropped. train_y shares the same index as train_X but contains only the target data: species ID.
Modeling ¶
For our modeling component, we will use the following two ML classifiers from the sklearn module:
- Logistic Regression
- Support Vector Machines
These two machine learning classifiers will enable us to predict the species of a penguin based on some marked characteristics of the penguin.
Note. There are tons of different machine learning classifiers in the sklearn module, as well as many that can be found outside the module! Feel free to explore and try out different classifiers to see which perform the best!
Feature Selection ¶
Before we can begin fitting our machine learning models, we must first select the features that will give us the highest accuracy. As a limit, we will ultimately feed the machine learning model 1 qualitative and 2 quantitative features.
The list down below contains all of the feature combinations we will test on the machine learning classifiers to determine which combination yields the highest accuracy.
Note: For our qualitative feature I have explicitly chosen "Island", due to the prior exploratory analysis. Among the qualitative features, there was a noticeable pattern where the island a sample lives on was highly correlated with its species. Unlike sex, which is ideally split roughly 50/50 in every species, I made the assumption that the island would have more predictive power for species ID. As with many of the sections before - feel free to use whatever features interest you!
combos = [
["Island", "Body Mass (g)", "Flipper Length (mm)"],
["Island", "Body Mass (g)", "Culmen Depth (mm)"],
["Island", "Body Mass (g)", "Culmen Length (mm)"],
["Island", "Culmen Depth (mm)", "Flipper Length (mm)"],
["Island", "Culmen Depth (mm)", "Culmen Length (mm)"],
["Island", "Culmen Length (mm)", "Flipper Length (mm)"],
]
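If you would rather not type the list out by hand, the same six combinations (possibly in a different order) can be generated programmatically; a small sketch using itertools, with "Island" fixed as the qualitative feature:
#sketch: pair every two quantitative features with the "Island" feature
from itertools import combinations
quant = ["Body Mass (g)", "Culmen Depth (mm)", "Culmen Length (mm)", "Flipper Length (mm)"]
combos_alt = [["Island"] + list(pair) for pair in combinations(quant, 2)]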
To find what combinations fit best with our models, we must first create our ML classifiers
LR = LogisticRegression()
vecml = SVC()
Next, we must create a function that will evaluate the best combo for our classifiers
def cross_val_checker(c):
    """
    This function takes advantage of a validation method called cross-validation, which
    provides a more robust evaluation of an ML classifier's accuracy on each feature
    combination in the combos list.
    The two returned values are the highest accuracy and the combination that yielded it.
    """
    best_perm = -np.inf
    best_cols = None
    #iterate over the feature combinations
    for cols in combos:
        #mean 5-fold cross-validation score for this combination
        score = cross_val_score(c, train_X[cols], train_y, cv=5).mean()
        #update statement for the best score/column combination so far
        if score > best_perm:
            best_perm = score
            best_cols = cols
    return best_perm, best_cols
Before we check the columns, we will use a nifty function to suppress the ConvergenceWarnings that would otherwise appear while we evaluate the columns.
simplefilter("ignore", category=ConvergenceWarning)
Now that we have created a function that will select the best features for us, we can use it on our ML classifiers to find the best-fitting combination for each.
LR_best_perm, LR_best_cols = cross_val_checker(LR)
vecml_best_perm, vecml_best_cols = cross_val_checker(vecml)
print(LR_best_cols, vecml_best_cols)
['Island', 'Culmen Depth (mm)', 'Culmen Length (mm)'] ['Island', 'Culmen Depth (mm)', 'Culmen Length (mm)']
From our cross-validation method, we found the combination
['Island', 'Culmen Depth (mm)', 'Culmen Length (mm)']
to yield the most accurate score out of all the possible combinations for both the Logistic Regression model and the Support Vector Machine model.
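If you are curious how close the runner-up combinations were, here is a small sketch that prints the mean cross-validation score for every combination (using the same cross_val_score call as in cross_val_checker):
#sketch: mean 5-fold CV accuracy of the Logistic Regression classifier for each combination
for cols in combos:
    score = cross_val_score(LR, train_X[cols], train_y, cv=5).mean()
    print(round(score, 3), cols)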
Now that we have chosen our features for our machine learning models, we can start fitting and testing our models!
Logistic Regression Classifier ¶
Now that we've done all the preparation work, let's get cracking and see our hard work come to fruition!
# best permutation score and corresponding columns that produced the score
LR_best_perm, LR_best_cols
(0.9749113475177305, ['Island', 'Culmen Depth (mm)', 'Culmen Length (mm)'])
Given this, we can now use LR_best_cols as our features to train the classifier on.
To find the best version of our classifier, we will use GridSearchCV from sklearn, which will evaluate a param_grid containing the parameters we want to compare within the LR classifier. In this way, we can optimize the classifier to give us the best output possible.
Using GridSearchCV, we can find out which parameters give us the highest accuracy.
from sklearn.model_selection import GridSearchCV
#param_grid contains a dictionary of parameters we want to evaluate that exist in LR
param_grid = [
{
'solver' : ['newton-cg','lbfgs','liblinear','sag','saga'],
'max_iter' : [100, 1000, 5000]
}
]
LR_clf = GridSearchCV(LR, param_grid = param_grid, cv=3, scoring='accuracy')
LR_clf.fit(train_X[LR_best_cols], train_y)
LR_clf.best_params_
{'max_iter': 100, 'solver': 'newton-cg'}
Now that we know which parameters can give us higher accuracy, we can now fit and test our data using our classifier.
#use the parameters found by GridSearchCV
LR_best = LogisticRegression(max_iter = 100, solver = 'newton-cg')
LR_fit = LR_best.fit(train_X[LR_best_cols], train_y)
LR_best.score(test_X[LR_best_cols],test_y)
0.9805825242718447
With all parameters optimized for highest accuracy, our machine learning model was able to predict the test set with a 98% accuracy!
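Accuracy alone can hide class-specific weaknesses, so if you also want per-species precision and recall, here is a short sketch using sklearn's classification_report (the target_names follow the alphabetical LabelEncoder mapping from earlier):
#sketch: per-species precision/recall/F1 for the fitted Logistic Regression model
from sklearn.metrics import classification_report
pred_y = LR_best.predict(test_X[LR_best_cols])
print(classification_report(test_y, pred_y, target_names=["Adelie", "Chinstrap", "Gentoo"]))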
Support Vector Classifier ¶
Lets see the performance on our second machine learning classifier!
# best permutation score and corresponding columns that produced the score
vecml_best_perm, vecml_best_cols
(0.8827127659574469, ['Island', 'Culmen Depth (mm)', 'Culmen Length (mm)'])
Like LR, we can go through the same workflow to evaluate which parameters give us the highest accuracy.
param_grid = [
{
'kernel' : ["linear","poly","rbf","sigmoid"],
'gamma' : ['scale','auto'],
'max_iter' : [1000, 5000, 10000, -1]
}
]
clf = GridSearchCV(vecml, param_grid = param_grid, cv=5, scoring='accuracy')
clf.fit(train_X[vecml_best_cols], train_y)
clf.best_params_
{'gamma': 'scale', 'kernel': 'linear', 'max_iter': 1000}
Given these parameters:
'gamma': 'scale', 'kernel': 'linear', 'max_iter': 1000
we can now fit and test our Support Vector Classification!
vecml_best = SVC(gamma = "scale", kernel = "linear", max_iter = 1000)
veclml_bestfitted = vecml_best.fit(train_X[vecml_best_cols],train_y)
veclml_bestfitted.score(test_X[vecml_best_cols],test_y)
1.0
Using the same protocol as for the LR model, we were able to get 100% accuracy from the SVC predictive model!
Testing on Unforeseen Data ¶
A confusion matrix is an n×n matrix that lets us see how our model's predictions compare against the "real" target variable.
#implementing confusion matrix
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
#Logistic Regression Confusion Matrix
LR_pred_y = LR_best.predict(test_X[LR_best_cols])
#call the confusion matrix
LR_cm = confusion_matrix(test_y, LR_pred_y, labels=LR_fit.classes_)
#display the confusion matrix
LR_disp = ConfusionMatrixDisplay(confusion_matrix=LR_cm, display_labels=LR_fit.classes_)
LR_disp.plot(cmap=plt.cm.magma)
plt.title("LR")
plt.show()
#Support Vector Classification Confusion Matrix
SVC_pred_y = veclml_bestfitted.predict(test_X[vecml_best_cols])
SVC_cm = confusion_matrix(test_y, SVC_pred_y, labels=veclml_bestfitted.classes_)
SVC_disp = ConfusionMatrixDisplay(confusion_matrix=SVC_cm, display_labels=veclml_bestfitted.classes_)
SVC_disp.plot(cmap=plt.cm.magma)
plt.title("SVC")
plt.show()
Take some time to digest these matrices; you will find that it's a really nice visual way to evaluate your machine learning classifier!
In the above confusion matrices, we can see how the predictions compare to the actual data and whether the predictions were correct. Likewise, the confusion matrices enable us to see where the false positives/negatives occur for the species types. Because we have three categories, the matrix is 3x3. The diagonal from the top left corner to the bottom right represents the predictions that are correct.
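Because the diagonal holds the correct predictions, you can also pull a quick per-species recall straight out of the matrix; a minimal sketch using the LR confusion matrix from above:
#sketch: per-species recall = correct predictions / actual count of that species
per_species_recall = LR_cm.diagonal() / LR_cm.sum(axis=1)
print(dict(zip(["Adelie", "Chinstrap", "Gentoo"], per_species_recall.round(3))))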
Decision Regions ¶
Decision region plots are 2D regions that display the predicted penguin species based on our models¶
Buckle up, here comes a complicated section! In order to display our decision regions accurately, we will create a function that can take any classifier and output its decision regions. Because we are comparing three features, one of which is our qualitative measure, we split the data by island and create three decision region graphs, one per island.
def plot_regions(c,X,y, num_features):
'''
This function takes in 4 user-given parameters:
c, the ML classifier; X, a dataframe of predictor variables; y, a dataframe of target variables;
and num_features, the number of islands to plot (one subplot is drawn per island)
This function will then fit the data, graph the data, and graph
the ML classifier's decision regions on top of the data,
showing its accuracy and boundaries
'''
#fit our classifier that is fed into the function
c.fit(X,y)
#for labeling
penguin_species = ["Adelie", "Chinstrap", "Gentoo"]
islands = ["Biscoe", "Dream", "Torgersen"]
#for contourf lines & plotting
levels = [-1,0,1,2]
fig, ax = plt.subplots(1, num_features, figsize = (18, 5))
color_map = ["b", "g", "r"]
for i in range(num_features):
#island is 0,1,2 ("Biscoe", "Dream", "Torgersen")
island = (X["Island"] == i)
#x0,x1 store the depth/length of the specific island
x0 = X[X["Island"] == i]["Culmen Depth (mm)"]
x1 = X[X["Island"] == i]["Culmen Length (mm)"]
#grid x/y contains many digits of the least/greatest values of
#the species's depth/length
grid_x=np.linspace(x0.min(),x0.max(),501)
grid_y=np.linspace(x1.min(),x1.max(),501)
xx,yy=np.meshgrid(grid_x,grid_y)
#flatten out xx and yy into 1D arrays
XX=xx.ravel()
YY=yy.ravel()
#ZZ is a dummy var that mirrors the island chosen
ZZ = np.ones(XX.size)*i
#we will then use the data on the fitted classifier to predict on every single point of the grid
p=c.predict(np.c_[ZZ,XX,YY])
#reshape predict array to match grid array
p=p.reshape(xx.shape)
ax[i].contourf(xx,yy,p,levels = levels, cmap="jet",alpha=.2)
for j in range(len(penguin_species)): #loop over the three species classes
#loc stores the target variable dataframe that matches the island
#we are looking at
loc = (y[X["Island"] == i] == j)
ax[i].scatter(x0[loc], x1[loc], c=color_map[j], alpha = .75, label = penguin_species[j])
ax[i].set(xlabel = "Culmen Depth (mm)", ylabel = "Culmen Length (mm)", title = "Island: " + islands[i])
#Add legend, title to the figure
plt.legend(loc="best", bbox_to_anchor= (1.5, 0.35), fontsize = 15, markerscale = 1.25, title = "Species ID")
fig.suptitle("Decision Regions for Model: " + str(c))
plt.tight_layout()
To include more data points, we will use the entirety of our predictor and target variables to give a more descriptive decision region.
#Calling decision region creation for LR
X,y = convert(penguins)
X = X[LR_best_cols]
plot_regions(LR_best, X, y, 3)
#Calling decision region creation for SVC
X,y = convert(penguins)
X = X[vecml_best_cols]
plot_regions(vecml_best, X, y, 3)
While the decision regions look extremely similar for both the LR and SVC, we can see that there are slight differences in the way that LR predicts data and how the SVC predicts data. Mostly, there is a notable difference in the decision regions for Island: Biscoe and Island: Dream.
LR contains all three penguin predictions, as noted by the red, green, and blue regions in the background of the scatter. SVC does not have this pattern, showing only two colors. Nevertheless, we can see that both classifiers have linear decision boundaries, unlike other classifiers such as neural networks (MLPs) and random forest classifiers.
Additionally, we can see that on the Torgersen island, both models have their general line in the same location, but a slight difference makes it such that the LR classifier would have included an extra Adelie penguin in the Chinstrap region.
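Since plot_regions accepts any classifier, you can see what a non-linear boundary looks like by passing in a different model. Here is a small sketch using the DecisionTreeClassifier from the tree module we imported at the start (decision trees produce axis-aligned, rectangular regions):
#sketch: decision regions for a decision tree, whose boundaries are not single straight lines
X, y = convert(penguins)
X = X[LR_best_cols]
plot_regions(tree.DecisionTreeClassifier(), X, y, 3)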
Pat yourself on the back! That was not a simple task. Let's wrap up.
Discussion & Wrap Up ¶
Let's recap.
The accuracies of both the Logistic Regression model and the Support Vector Classification are above 95% after parameter optimization. This means the models are fairly accurate in predicting the species of the penguins based on the optimized features fit to the machine learning models.
We can visualize this accuracy in the confusion matrices that we built, where both models made similar prediction mistakes. Nevertheless, the mistakes were very minor in the grand scheme of things.
Even though the accuracy is high, we must still consider that a classifier that is too accurate has the potential to be overfitted. In the case of new, unforeseen data, the classifiers we created may not be flexible enough and may inaccurately predict the species of the samples more often than a less overfit classifier would.
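A quick, rough check for overfitting is to compare training accuracy against test accuracy: a large gap suggests the model is memorizing the training set. Here is a minimal sketch that refits fresh copies of both models on the training split only (the originals were refit on the full dataset inside plot_regions); LR_check and SVC_check are just illustrative names.
#sketch: a large train/test accuracy gap would suggest overfitting
LR_check = LogisticRegression(max_iter=100, solver='newton-cg').fit(train_X[LR_best_cols], train_y)
print("LR  train:", round(LR_check.score(train_X[LR_best_cols], train_y), 3),
      "test:", round(LR_check.score(test_X[LR_best_cols], test_y), 3))
SVC_check = SVC(gamma="scale", kernel="linear", max_iter=1000).fit(train_X[vecml_best_cols], train_y)
print("SVC train:", round(SVC_check.score(train_X[vecml_best_cols], train_y), 3),
      "test:", round(SVC_check.score(test_X[vecml_best_cols], test_y), 3))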
The decision regions give us another visual representation of the possible mistakes our classifiers can make during prediction. As seen in the regions above, there are certain outliers in the data that can be mispredicted. One example is the Biscoe decision region produced by the Logistic Regression model: the decision region was not able to filter out the Chinstrap species from its predictions for this island, even though the exploratory analysis showed that no Chinstraps live there.
Likewise, the SVC model was not without mistakes: in the Dream decision region, the SVC classifier mispredicted some Chinstrap penguins to be Adelie penguins. As we have discussed before, this inflexibility may be due to the fact that our models are overfitted.
I hope that you were able to get a glimpse of how Python can be used in the real world, by real scientists, to do all sorts of work! Good job on making it to the end.