import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve

# Generate synthetic classification data
np.random.seed(42)
X = np.random.rand(100, 1)
y = (X > 0.5).astype(int).flatten()

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Predict probabilities on the test set
y_proba = logreg.predict_proba(X_test)[:, 1]
Introduction
Classification, a cornerstone of machine learning, assigns each input to one of a set of discrete classes based on its features. From determining whether an email is spam to diagnosing diseases, classification algorithms play a pivotal role in automating decision-making processes.
Types of Classification Algorithms
There are several classification algorithms, each suited to different types of problems:
Logistic Regression: Ideal for binary classification tasks.
Decision Trees: Effective for both binary and multiclass classification.
Support Vector Machines (SVM): Robust for linear and nonlinear classification.
The listing at the top of this post implements a simple classification model using logistic regression in Python.
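As a rough sketch of how the three algorithms listed above could be compared side by side, the snippet below fits each one on the same synthetic data used in this post and prints its test accuracy. The `models` dictionary and the `accuracies` name are illustrative choices, not part of the original code, and every hyperparameter is left at the scikit-learn default rather than tuned.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Same recipe as the post: one feature, label is 1 when the feature exceeds 0.5
np.random.seed(42)
X = np.random.rand(100, 1)
y = (X > 0.5).astype(int).flatten()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Illustrative comparison set (assumption: library defaults, no tuning)
models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SVM (RBF kernel)": SVC(),
}

# Fit each model and record its accuracy on the held-out test set
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = model.score(X_test, y_test)
    print(f"{name}: test accuracy = {accuracies[name]:.2f}")
```

Accuracy on a single 20-sample test split is a noisy estimate, so treat the numbers as a quick sanity check rather than a ranking of the algorithms.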
Data Visualization
I use the following colors in all of my blog's data visualizations:
# Define specific colors (same as CSS from quarto vapor theme)
background = '#1b133a'
pink = '#ea39b8'
purple = '#6f42c1'
blue = '#32fbe2'
Receiver Operating Characteristic (ROC) Curve
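ROC curves visualize the trade-off between true positive rate (sensitivity) and false positive rate as the decision threshold varies. The area under the ROC curve (AUC-ROC) summarizes that trade-off in a single number: 0.5 corresponds to random guessing and 1.0 to perfect separation of the classes.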
import matplotlib.pyplot as plt

# Visualize the ROC curve
fpr, tpr, _ = roc_curve(y_test, y_proba)
fig, ax = plt.subplots()
ax.plot(fpr, tpr, color=blue, lw=2, label='ROC curve')
ax.plot([0, 1], [0, 1], color=purple, lw=2, linestyle='--', label='Random Guess')
ax.set_xlabel('False Positive Rate', color=blue)
ax.set_ylabel('True Positive Rate', color=blue)
ax.set_title('Receiver Operating Characteristic (ROC) Curve', color=purple)
ax.legend(loc='lower right')
ax.tick_params(axis='x', colors=blue)
ax.tick_params(axis='y', colors=blue)
ax.set_facecolor(pink)
fig.set_facecolor(background)
plt.show()
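To show what goes into a ROC curve and its AUC, here is a minimal pure-Python sketch that sweeps the decision threshold over a tiny hand-made example. The helper names `roc_points` and `auc_trapezoid` and the four toy scores are assumptions made for illustration. This is not how scikit-learn implements `roc_curve`, only the same idea written out longhand.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists with one point per distinct threshold, highest threshold first."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for threshold in sorted(set(scores), reverse=True):
        # A sample is predicted positive when its score reaches the threshold
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return [p[0] for p in points], [p[1] for p in points]


def auc_trapezoid(fpr, tpr):
    """Area under the piecewise-linear curve through the (fpr, tpr) points."""
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
    return area


# Toy labels and scores (illustrative, not produced by the model in this post)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr = roc_points(y_true, scores)
auc = auc_trapezoid(fpr, tpr)
print(fpr)  # [0.0, 0.0, 0.5, 0.5, 1.0]
print(tpr)  # [0.0, 0.5, 0.5, 1.0, 1.0]
print(auc)  # 0.75
```

For the model in this post, the same quantity is available directly from scikit-learn as `roc_auc_score(y_test, y_proba)`, using the function already imported at the top.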
Precision-Recall (PR) Curve
PR curves show the trade-off between precision and recall as the decision threshold varies. They are particularly valuable on imbalanced datasets, where the positive class is rare.
# Visualize the Precision-Recall curve
precision, recall, _ = precision_recall_curve(y_test, y_proba)
fig, ax = plt.subplots()
ax.plot(recall, precision, color=blue, lw=2, label='Precision-Recall curve')
ax.set_xlabel('Recall (Sensitivity)', color=blue)
ax.set_ylabel('Precision', color=blue)
ax.set_title('Precision-Recall Curve', color=purple)
ax.legend(loc='lower left')
ax.tick_params(axis='x', colors=blue)
ax.tick_params(axis='y', colors=blue)
ax.set_facecolor(pink)
fig.set_facecolor(background)
plt.show()
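Each point on a PR curve is just a precision and recall pair computed at one threshold. The sketch below works that out by hand on a small imbalanced toy set (2 positives among 10 samples). The helper `precision_recall_at` and the scores are hypothetical, and returning a precision of 1.0 when nothing is predicted positive is a convention chosen here for simplicity.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when samples with score >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    # Convention assumed here: precision is 1.0 when there are no positive predictions
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Imbalanced toy data (illustrative, not produced by the model in this post)
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05, 0.05]

# A strict threshold favors precision, a lenient one favors recall
strict = precision_recall_at(y_true, scores, 0.8)
lenient = precision_recall_at(y_true, scores, 0.35)
print("threshold 0.80:", strict)
print("threshold 0.35:", lenient)
```

With the strict threshold only the top-scoring sample is flagged, so precision is perfect but half the positives are missed. Lowering the threshold recovers both positives at the cost of one false alarm.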
Confusion Matrix
The confusion matrix provides a detailed understanding of a classification model’s performance, breaking down predictions into true positives, true negatives, false positives, and false negatives.
from sklearn.metrics import confusion_matrix

# Generate predictions
y_pred = logreg.predict(X_test)

# Calculate confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Visualize the confusion matrix
fig, ax = plt.subplots()
cax = ax.imshow(cm, interpolation='nearest', cmap=plt.cm.cool)
ax.set_title('Confusion Matrix', color=purple)
plt.colorbar(cax)
classes = ['Class 0', 'Class 1']
tick_marks = np.arange(len(classes))
ax.set_xticks(tick_marks)
ax.set_yticks(tick_marks)
ax.set_xticklabels(classes, rotation=45, color=blue)
ax.set_yticklabels(classes, color=blue)
ax.set_xlabel('Predicted label', color=blue)
ax.set_ylabel('True label', color=blue)
fig.set_facecolor(background)
plt.show()
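The four cells of a binary confusion matrix are enough to derive the usual summary metrics. The sketch below assumes scikit-learn's layout for labels 0 and 1 (rows are true labels, columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]`). The counts are made up for illustration and are not results from the model in this post.

```python
# Illustrative counts in scikit-learn's layout: [[TN, FP], [FN, TP]]
cm = [[50, 10],
      [5, 35]]

(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

accuracy = (tp + tn) / total          # share of all predictions that are correct
precision = tp / (tp + fp)            # share of predicted positives that are real
recall = tp / (tp + fn)               # share of real positives that are found (sensitivity)
specificity = tn / (tn + fp)          # share of real negatives that are correctly rejected
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"specificity={specificity:.3f} f1={f1:.3f}")
```

With the `cm` array computed in the post, the same unpacking can be written as `tn, fp, fn, tp = cm.ravel()`.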
In conclusion, classification in machine learning is a powerful tool for automating decision-making processes. By implementing and evaluating classification models, we gain insight into their performance through tools like ROC curves, PR curves, and confusion matrices. These visualizations show where a model is strong and where it is weak, which supports informed decisions in real-world applications.