__author__ = "Chris Tran"
__email__ = "tranduckhanh96@gmail.com"
__website__ = "chriskhanhtran.github.io"
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
%matplotlib inline
In this project I am going to utilize Principal Components Analysis (PCA) to transform the breast cancer dataset and then use the Support Vector Machine model to predict whether a patient has breast cancer.
I have taken a great Machine Learning course by Jose Portilla on Udemy and now I want to apply what I have learnt so far to perform a comprehensive exploratory data analysis on this data and to predict breast cancer based on feature variables.
The Breast Cancer Wisconsin (Diagnostic) Data Set is obtained from UCI Machine Learning Repository. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
The dataset is also available in the Scikit Learn library. I will use Scikit Learn to import the dataset and explore its attributes.
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
These are the elements of the dataset:
cancer.keys()
The dataset has 569 instances and 30 numeric variables:
print(cancer.DESCR[27:3130])
df_features = pd.DataFrame(cancer.data, columns = cancer.feature_names)
df_features.info()
As the data is clean and has no missing values, the cleaning step can be skipped.
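As a quick optional check (the info() output above already suggests this), we can count missing values across all feature columns:
# Count missing values across all feature columns (should be 0)
print(df_features.isnull().sum().sum())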
Target variable:
cancer.target_names
df_target = pd.DataFrame(cancer.target, columns=['target'])
df_target['target'].value_counts()
According to the dataset's description, the distribution of the target variable is 212 malignant and 357 benign. Thus, 'benign' and 'malignant' are encoded as 1 and 0, respectively.
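To confirm this encoding, we can pair each target name with its count (a quick optional check; target_names is ordered by label, so index 0 is 'malignant' and index 1 is 'benign'):
# Pair each class name with its count; label 0 -> 'malignant', label 1 -> 'benign'
print(dict(zip(cancer.target_names, np.bincount(cancer.target))))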
Let's merge the features and the target variable together:
df = pd.concat([df_features, df_target], axis=1)
For the purpose of exploratory data analysis, I will transform the target variable into text.
df['target'] = df['target'].apply(lambda x: "Benign" if x == 1 else "Malignant")
df.head(5)
df.describe()
# Set style
sns.set_style('darkgrid')
df['target'].value_counts()
plt.figure(figsize=(8, 6))
sns.countplot(x='target', data=df)
plt.xlabel("Diagnosis")
plt.title("Count Plot of Diagnosis")
Now I will take a look at the distribution of each feature and see how it differs between 'benign' and 'malignant'. To see the distributions of multiple variables, we can use a violin plot, a swarm plot or a box plot. Let's try each of these plots.
To visualize distributions of multiple features in one figure, first I need to standardize the data:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df_features)
features_scaled = scaler.transform(df_features)
features_scaled = pd.DataFrame(data=features_scaled,
                               columns=df_features.columns)
df_scaled = pd.concat([features_scaled, df['target']], axis=1)
Then I "unpivot" the dataframe from wide format to long format using pd.melt function:
df_scaled_melt = pd.melt(df_scaled, id_vars='target',
                         var_name='features', value_name='value')
df_scaled_melt.head(3)
There are 30 features so I will create a violin plot, a swarm plot and a box plot for each batch of 10 features:
def violin_plot(features, name):
    """
    This function creates violin plots of the features given in the argument.
    """
    # Create query
    query = ''
    for x in features:
        query += "features == '" + str(x) + "' or "
    query = query[0:-4]
    # Create data for visualization
    data = df_scaled_melt.query(query)
    # Plot figure
    plt.figure(figsize=(12, 6))
    sns.violinplot(x='features',
                   y='value',
                   hue='target',
                   data=data,
                   split=True,
                   inner="quart")
    plt.xticks(rotation=45)
    plt.title(name)
    plt.xlabel("Features")
    plt.ylabel("Standardized Value")
def swarm_plot(features, name):
    """
    This function creates swarm plots of the features given in the argument.
    """
    # Create query
    query = ''
    for x in features:
        query += "features == '" + str(x) + "' or "
    query = query[0:-4]
    # Create data for visualization
    data = df_scaled_melt.query(query)
    # Plot figure
    plt.figure(figsize=(12, 6))
    sns.swarmplot(x='features', y='value', hue='target', data=data)
    plt.xticks(rotation=45)
    plt.title(name)
    plt.xlabel("Features")
    plt.ylabel("Standardized Value")
def box_plot(features, name):
    """
    This function creates box plots of the features given in the argument.
    """
    # Create query
    query = ''
    for x in features:
        query += "features == '" + str(x) + "' or "
    query = query[0:-4]
    # Create data for visualization
    data = df_scaled_melt.query(query)
    # Plot figure
    plt.figure(figsize=(12, 6))
    sns.boxplot(x='features', y='value', hue='target', data=data)
    plt.xticks(rotation=45)
    plt.title(name)
    plt.xlabel("Features")
    plt.ylabel("Standardized Value")
violin_plot(df.columns[0:10], "Violin Plot of the First 10 Features")
swarm_plot(df.columns[10:20], "Swarm Plot of the Next 10 Features")
box_plot(df.columns[20:30], "Box Plot of the Last 10 Features")
The violin plot is very effective for comparing the distributions of different variables. The classification becomes clear in the swarm plot. Finally, the box plots are useful for comparing medians and detecting outliers.
From the above plots we can draw some insights from the data:
- The distributions of some features are clearly separated between the two diagnoses, so they should be useful for classification: mean radius, mean area, mean concave points, worst radius, worst perimeter, worst area and worst concave points.
- Other features have very similar distributions for both diagnoses, e.g. mean smoothness, mean symmetry, mean fractal dimension and smoothness error. These features are weak in classifying the data.
- Some features appear highly correlated with each other, e.g. mean perimeter vs. mean area, mean concavity vs. mean concave points, and worst symmetry vs. worst fractal dimension. We should not include all of these highly correlated variables in our predicting model.
As discussed above, some dependent variables in the dataset might be highly correlated with each other. Let's explore the correlation of the three example pairs above.
def correlation(var):
    """
    1. Print correlation
    2. Create jointplot
    """
    # Print correlation
    print("Correlation: ", df[[var[0], var[1]]].corr().iloc[1, 0])
    # Create jointplot
    sns.jointplot(x=var[0], y=var[1], data=df, kind='reg', height=6)
correlation(['mean perimeter', 'mean area'])
correlation(['mean concavity', 'mean concave points'])
correlation(['worst symmetry', 'worst fractal dimension'])
Two of the three example pairs are indeed highly correlated. Let's create a heat map to see the overall picture of correlation.
# Create correlation matrix of the 30 numeric features
corr_mat = df_features.corr()
# Create mask to hide the upper triangle
mask = np.zeros_like(corr_mat, dtype=bool)
mask[np.triu_indices_from(mask, k=1)] = True
# Plot heatmap
plt.figure(figsize=(15, 10))
sns.heatmap(corr_mat, annot=True, fmt='.1f',
            cmap='RdBu_r', vmin=-1, vmax=1,
            mask=mask)
From the heat map, we can see that many variables in the dataset are highly correlated. Which variables have a correlation greater than 0.8?
plt.figure(figsize=(15, 10))
sns.heatmap(corr_mat[corr_mat > 0.8], annot=True,
            fmt='.1f', cmap=sns.cubehelix_palette(200), mask=mask)
Well, we have some work to do with feature selection.
I will use Univariate Feature Selection (sklearn.feature_selection.SelectKBest) to choose the k = 5 features with the highest scores. I choose 5 because from the heat map I can see roughly 5 groups of highly correlated features.
from sklearn.feature_selection import SelectKBest, chi2
feature_selection = SelectKBest(chi2, k=5)
feature_selection.fit(df_features, df_target['target'])
selected_features = df_features.columns[feature_selection.get_support()]
print("The five selected features are: ", list(selected_features))
X = pd.DataFrame(feature_selection.transform(df_features),
                 columns=selected_features)
X.head()
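To see why these five features stand out, we can also look at the chi-squared score of each feature (an optional check; higher scores indicate a stronger dependence between the feature and the target):
# Chi-squared score of every feature, highest first
chi2_scores = pd.Series(feature_selection.scores_, index=df_features.columns)
print(chi2_scores.sort_values(ascending=False).head(10))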
Let's create a pairplot to see how different these features are in 'malignant' and in 'benign'.
sns.pairplot(pd.concat([X, df['target']], axis=1), hue='target')
from sklearn.model_selection import train_test_split
y = df_target['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=200)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
from sklearn.metrics import confusion_matrix, classification_report
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\n")
print("Classification Report:\n", classification_report(y_test, y_pred))
The accuracy rate is approximately 97%. The model makes only 5 wrong predictions out of 188. Our chosen features are pretty good at identifying cancer.
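For completeness, the accuracy can also be computed directly (a small optional check):
from sklearn.metrics import accuracy_score
# Overall fraction of correct predictions on the test set
print("Accuracy:", accuracy_score(y_test, y_pred))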
from sklearn.decomposition import PCA
Principal component analysis (PCA) performs linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space.
PCA transforms the data into features that explain the most variance in the data.
For better performance of PCA, we first need to scale the data so that each feature has unit variance. I have already done this step in the EDA section.
features_scaled.head(5)
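As a quick optional check, each scaled feature should now have a mean close to 0 and a standard deviation close to 1:
# Mean and standard deviation of each standardized feature
features_scaled.describe().loc[['mean', 'std']].round(2)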
X_scaled = features_scaled
It's difficult to visualize high-dimensional data like our original data. I will use PCA to find the two principal components and visualize the data in this two-dimensional space.
pca = PCA(n_components=2)
pca.fit(X_scaled)
X_pca = pca.transform(X_scaled)
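Before visualizing, we can check how much of the total variance these two components capture (an optional check):
# Fraction of the dataset's variance explained by each principal component
print(pca.explained_variance_ratio_)
print("Total variance explained:", pca.explained_variance_ratio_.sum())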
Let's visualize our data based on these two principal components:
plt.figure(figsize=(8, 8))
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=df['target'])
plt.title("PCA")
plt.xlabel("First Principal Component")
plt.ylabel("Second Principal Component")
We can use the two principal components to clearly separate our data between Malignant and Benign.
X = X_pca
y = df_target['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
from sklearn.svm import SVC
After transforming the data into two principal components, I will use an SVM model to predict cancer.
GridSearch
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf']}
grid = GridSearchCV(SVC(), param_grid, refit=True, cv=5)
grid.fit(X_train, y_train)
grid.best_params_
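We can also look at the best cross-validation score found during the search (an optional check):
# Mean cross-validated accuracy of the best parameter combination
print("Best CV accuracy:", grid.best_score_)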
y_pred = grid.predict(X_test)
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\n")
print("Classification Report:\n",classification_report(y_test, y_pred))
The accuracy rate is 95%. This model makes a few more wrong predictions than the Random Forest model. However, with PCA we reduced the number of dimensions in our data from 30 to 2.
In the first part of this project, I performed exploratory data analysis to better understand each of the 30 original features and how they might be associated with cancer.
Next, I selected the 5 best features for my model using univariate feature selection and trained a Random Forest classifier. The accuracy rate of this model is 97%.
In addition, I used PCA to find the two principal components and created a visualization based on these two variables. The visualization shows that with only two variables we can clearly separate the data between cancer and no cancer. Finally, I performed a Support Vector Machine model to predict cancer based on the PCA-transformed data. The accuracy rate of this model is 95%.
In fact, this dataset is quite easy for machine learning models to classify. However, my purpose in doing this project is to learn how to mine data by exploring each feature, select features for my model, and apply various machine learning models.
I hope you enjoy this project. If you have any questions, please feel free to contact me. Thanks for reading!
Reference:
Python for Data Science and Machine Learning Course by Jose Portilla on Udemy