Machine Learning Methods#
Module 2: Data Preparation and Validation Techniques#
Instructor: Farhad Pourkamali#
Overview#
Data Loading, Understanding, and Filtering (https://youtu.be/FvlAINfSLyg)
Understanding data structure (input features, target variables, and types of data)
Descriptive statistics
Data visualization for understanding distributions and relationships (e.g., histograms, scatter plots, correlation matrices)
Feature Engineering (https://youtu.be/CFA877KhPl0)
Feature scaling (standardization, normalization)
Feature transformation (e.g., polynomial features)
Data Splitting and Cross-Validation (https://youtu.be/brqCqB3xfX8)
Train-Test Split
Validation Set (Role of a validation set in model tuning)
K-Fold Cross-Validation (Description and implementation)
Leave-One-Out Cross-Validation (LOOCV)
Pipelines for Automated Preprocessing (https://youtu.be/1sm0ZcVZK8c)
Integrating feature engineering and predictive modeling in sklearn pipelines
1. Data Loading, Understanding, and Filtering#
pandas.read_csv is a function that loads CSV (Comma-Separated Values) files into a Pandas DataFrame for analysis and manipulation. Here is the general syntax for pandas.read_csv (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html):
pandas.read_csv(
    filepath_or_buffer,   # Filepath or URL to the CSV file
    sep=',',              # Delimiter (default is ',')
    header='infer',       # Row number to use as header (default is the first row)
    index_col=None,       # Column(s) to use as the index
    usecols=None,         # Columns to read (list of column names or indices)
)
import numpy as np
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/farhad-pourkamali/MATH4388Online/refs/heads/main/data/Auto.csv")
df
The row index of a DataFrame can be accessed using the .index attribute, and the column names can be accessed using the .columns attribute.
# Check the index
print("Index Rows:")
print(df.index)
# Check column names
print("Column Names:")
print(df.columns)
Large datasets can have thousands of rows and columns.
Displaying the entire DataFrame (df) floods the output, making it hard to read or find relevant information.
df.head() returns only the first 5 rows by default.
df.head()
The
.info()
method in Pandas is used to provide a concise summary of a DataFrame, giving key information about the structure and contents of the data. It is especially useful for getting an overview of large datasets.
df.info() # Do you see any problems?
df["horsepower"].iloc[0]
type(df["horsepower"].iloc[0])
# convert a column with str values to floats
# Convert non-numeric values (e.g., 'abc', '?', 'N/A') to NaN
df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')
df.head()
df.info()
# Count the number of NaN values
print(df['horsepower'].isnull().sum())
# Remove rows with at least one NaN
df_cleaned = df.dropna()
df_cleaned.info()
# Drop the column in-place
df_cleaned.drop(columns=["name"], inplace=True)
df_cleaned
The
.describe()
method provides a statistical summary of numerical columns in a DataFrame.
df_cleaned.describe()
import matplotlib.pyplot as plt
# customization
plt.rc('font', size=14)
plt.rc('axes', labelsize=14, titlesize=14)
plt.rc('legend', fontsize=14)
plt.rc('xtick', labelsize=10)
plt.rc('ytick', labelsize=10)
# histogram
df_cleaned.hist(bins=10, figsize=(12, 8))
plt.tight_layout()
plt.show()
Seaborn (https://seaborn.pydata.org/index.html) is a Python library built on top of Matplotlib, designed to simplify the creation of complex and attractive statistical plots with minimal code.
Seaborn works seamlessly with Pandas DataFrames.
import seaborn as sns
sns.pairplot(df_cleaned, diag_kind='hist', corner=True)
plt.show()
For two vectors \(\mathbf{x}\) and \(\mathbf{y}\) (or columns of a DataFrame), the Pearson correlation coefficient \(\rho\) is given by
\[
\rho = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2}\;\sqrt{\sum_{i}(y_i - \bar{y})^2}}
\]
A quick numerical check of this formula appears after the list below.
\(\rho=1\): Perfect positive linear relationship
\(\rho=-1\): Perfect negative linear relationship
\(\rho=0\): No linear relationship
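As that quick check, \(\rho\) can be computed directly with NumPy and compared against Pandas' built-in .corr() method; this is a minimal sketch using the horsepower and mpg columns (the column choice is only for illustration).
# Compute the Pearson correlation by hand and compare with Pandas
hp = df_cleaned["horsepower"].to_numpy()
mpg = df_cleaned["mpg"].to_numpy()
# Numerator: sum of products of the centered vectors; denominator: product of their norms
rho_manual = np.sum((hp - hp.mean()) * (mpg - mpg.mean())) / (
    np.sqrt(np.sum((hp - hp.mean()) ** 2)) * np.sqrt(np.sum((mpg - mpg.mean()) ** 2))
)
print("Manual: ", rho_manual)
print("Pandas: ", df_cleaned["horsepower"].corr(df_cleaned["mpg"]))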
# Compute the correlation matrix
correlation_matrix = df_cleaned.corr()
# Plot the heatmap
plt.figure(figsize=(8, 5))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Matrix")
plt.show()
# Use value_counts with normalize=True to calculate percentages
percentages = df_cleaned['origin'].value_counts(normalize=True) * 100
# Display the percentages
print("\nPercentages of each group:")
print(percentages)
# Use value_counts with normalize=True to calculate percentages
percentages = df_cleaned['cylinders'].value_counts(normalize=True) * 100
# Display the percentages
print("\nPercentages of each group:")
print(percentages)
Next, we write a function that keeps cars with 4, 6, and 8 cylinders, selects appropriate columns for input features and output labels, and returns them as DataFrames.
def filter_and_select_features(data):
    """
    Filters the dataset for cars with 4, 6, and 8 cylinders,
    selects input features and output labels, and returns them as DataFrames.

    Parameters:
        data (pd.DataFrame): The input DataFrame containing car data.

    Returns:
        X (pd.DataFrame): DataFrame containing input features.
        y (pd.DataFrame): DataFrame containing the output labels (mpg).
    """
    # Step 1: Filter rows for cars with 4, 6, and 8 cylinders
    filtered_data = data[data['cylinders'].isin([4, 6, 8])]

    # Step 2: Select columns for input features
    input_columns = ["horsepower", "weight"]
    X = filtered_data[input_columns]

    # Step 3: Select the "mpg" column as the output label
    y = filtered_data[["mpg"]]

    return X, y
X, y = filter_and_select_features(df_cleaned)
X.head()
y.head()
# 3D Plot
fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(111, projection='3d')
# Extract data
x1 = X["horsepower"]
x2 = X["weight"]
x3 = y["mpg"]
# Create scatter plot
ax.scatter(x1, x2, x3, c='r', marker='o')
# Set labels
ax.set_xlabel("horsepower")
ax.set_ylabel("weight")
ax.set_zlabel("mpg")
ax.view_init(elev=20, azim=5)
plt.tight_layout()
plt.show()
Hence, the regression problem can be viewed as learning a function \(f\) such that
\[
\text{mpg} \approx f(\text{horsepower}, \text{weight})
\]
Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated
Makes it difficult to assess the individual effect of each predictor on the dependent variable
Reduces model interpretability and reliability
Here are several possible ways to address multicollinearity
Remove one or more of the highly correlated predictors
Combine correlated features into a single feature using aggregation or transformation
Apply penalties to large coefficients, reducing the impact of multicollinearity (see the Ridge regression sketch after this list)
Use models that handle multicollinearity well (e.g., Decision Trees)
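As an illustration of the penalty-based approach, the following minimal sketch (an addition, not part of the original flow) fits Ridge regression, an L2-penalized linear model, alongside ordinary least squares on the two correlated predictors; the alpha value is an arbitrary choice for demonstration.
# The two predictors are strongly correlated ...
print("Correlation between horsepower and weight:", X["horsepower"].corr(X["weight"]))
# ... so an L2 penalty (Ridge) can be used to stabilize the coefficient estimates
from sklearn.linear_model import LinearRegression, Ridge
ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # alpha sets the penalty strength (illustrative value); features are typically scaled first (next section)
print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)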
2. Feature Engineering#
When features are in different ranges (e.g., weight and horsepower), machine learning methods can behave poorly due to the dominance of larger-scaled features.
Techniques to address scale issues:
Standardization: Rescales features to have a mean of 0 and standard deviation of 1. Useful for algorithms assuming Gaussian-like distributions or when feature magnitudes vary greatly.
Normalization: Rescales features to be in the range \([0, 1]\). Useful for algorithms sensitive to magnitude but not distribution.
Equations (\(\mu\) represents the mean and \(\sigma\) is the standard deviation):
\[
\text{Standardization: } x' = \frac{x - \mu}{\sigma}, \qquad \text{Normalization: } x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
\]
The code below demonstrates standardization; a normalization sketch with MinMaxScaler follows it.
from sklearn.preprocessing import StandardScaler
# Standardization
scaler = StandardScaler()
standardized = scaler.fit_transform(X)
X = pd.DataFrame(standardized, columns=X.columns)
X
# Can we verify?
print("means:\n", X.mean(axis=0))
print("standard deviations:\n", X.std(axis=0, ddof=0))  # StandardScaler uses the population std (ddof=0)
Polynomial Transformation#
Polynomial transformation increases the dimensionality of your feature space by adding higher-degree combinations of features.
This can help capture complex, nonlinear relationships between the features and the target variable.
Given two features, \(x_1\) (horsepower) and \(x_2\) (weight), a polynomial transformation of degree \(d\) creates new features by including all combinations of \(x_1\) and \(x_2\) up to the given degree.
Degree 1: \(x_1\) and \(x_2\)
Degree 2: \(x_1\), \(x_2\), \(x_1^2\), \(x_2^2\), \(x_1x_2\)
Note that we can include “1”, which corresponds to the bias term (intercept)
from sklearn.preprocessing import PolynomialFeatures
# Polynomial transformation of degree 2
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(X)
# Create a DataFrame for better readability
feature_names = poly.get_feature_names_out(input_features=X.columns)
poly_X = pd.DataFrame(poly_features, columns=feature_names)
poly_X
Hence, the regression model takes the following form:
\[
\text{mpg} \approx \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_1 x_2 + \theta_5 x_2^2
\]
3. Data Splitting and Cross-Validation#
Train-Test Split: This is a crucial step in machine learning where you divide your data set into two parts:
Training set: Used to train the machine learning model.
Testing set: Used to evaluate the trained model’s performance on unseen data.
Random Sampling:
Each data point has an equal chance of being assigned to either the training or testing set.
Simple and often sufficient for large, well-balanced data sets.
Using train_test_split: You can easily split your data using the train_test_split function from the sklearn.model_selection module.
Specify the desired test_size (e.g., 0.25 for a 25% test set) and set a random_state (e.g., random_state=42) to ensure that you get the same split every time you run your code.
Your model's performance might vary significantly depending on the specific random split generated by the chosen random_state.
If you only evaluate with one split, you might get an overly optimistic or pessimistic estimate of how well your model generalizes (a short demonstration follows the split below).
X
y
from sklearn.model_selection import train_test_split
# Random Sampling
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.25,
random_state=42)
X_train
y_train
X_test
y_test
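To see the split dependency mentioned above, the following sketch (an addition to the original notebook) fits the same linear model on several different random splits and reports the test \(R^2\) for each; the seed values are arbitrary.
# The test score changes with the random split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
for seed in [0, 1, 2, 3, 4]:  # arbitrary random_state values
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    print(f"random_state={seed}: test R^2 = {r2_score(y_te, model.predict(X_te)):.3f}")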
Validation Set#
Machine learning models have hyperparameters that need to be adjusted for optimal performance. For example, consider the polynomial transformation of the input features (degree 1 vs. degree 2):
\[
\text{mpg} \approx \theta_0 + \theta_1 x_1 + \theta_2 x_2
\quad \text{or} \quad
\text{mpg} \approx \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_1 x_2 + \theta_5 x_2^2
\]
If you only have train and test sets, you might end up overfitting your model to the test set during hyperparameter tuning.
This happens because you are repeatedly evaluating on the test set and indirectly “learning” its characteristics.
A separate validation or development set prevents this.

# Assuming you already have this:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# Now, split the training data again to create a validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.3, random_state=42  # 30% of the training data becomes the validation set
)
# You now have:
# - X_train, y_train: Training set
# - X_val, y_val: Validation set
# - X_test, y_test: Test set
print(X_train.shape, X_val.shape, X_test.shape)
Cross-Validation#
Challenges of Train-Test Split
Limited Data: Splitting data into three sets (train, validation, test) can significantly reduce the amount of data available for training, especially with smaller data sets.
Split Dependency: Results can vary depending on the random split of training and validation data.
What is Cross-Validation (CV)?
No Separate Validation Set: CV eliminates the need for a dedicated validation set.
K-fold CV:
Divide the data into K equal-sized folds.
Train the model K times, each time using K-1 folds for training and the remaining fold for validation.
Average the performance across all K folds to get a more robust performance estimate.
A test set should still be held out for final evaluation, but the validation or development set is no longer needed when doing CV.
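To make the K-fold procedure concrete, here is a minimal sketch that performs 5-fold CV by hand with KFold on the current training set; the cross_val_score function introduced below automates exactly these steps.
# 5-fold cross-validation done manually with KFold
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, val_idx in kf.split(X_train):
    # K-1 folds for training, the remaining fold for validation
    model = LinearRegression().fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
    fold_scores.append(r2_score(y_train.iloc[val_idx], model.predict(X_train.iloc[val_idx])))
print("Fold scores:", np.round(fold_scores, 3))
print("Average:", np.mean(fold_scores))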

Benefits of CV:
More Effective Data Usage: Utilizes more data for training compared to having a separate validation set.
Reduced Split Dependency: Less sensitive to how the data is split, providing a more reliable performance estimate.
Leave-One-Out Cross-Validation (LOOCV) is a special case of K-fold cross-validation where K is equal to the number of data points in your data set.
Each fold consists of a single data point.
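LOOCV can be run by passing a LeaveOneOut splitter as the cv argument of the cross_val_score helper described next; the sketch below uses mean absolute error because \(R^2\) is not defined on a single held-out point (the metric choice is an assumption made for illustration).
# Leave-one-out cross-validation: one fold per training point
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.linear_model import LinearRegression
loo_scores = cross_val_score(LinearRegression(), X_train, y_train,
                             scoring='neg_mean_absolute_error', cv=LeaveOneOut())
print("Number of folds (= number of training points):", len(loo_scores))
print("Average MAE:", -loo_scores.mean())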
The cross_val_score function from sklearn.model_selection helps you evaluate your model's performance using K-fold cross-validation.
Syntax:
cross_val_score(estimator, X, y=None, scoring=None, cv=None)
Parameters:
estimator: The model or object used to fit the data (e.g., LinearRegression()).
X: The data to fit. Can be, for example, a list or an array.
y: The target variable to try to predict. (Optional for unsupervised learning)
scoring: A string or a callable to evaluate the predictions on the test set. (e.g., ‘accuracy’, ‘f1’, ‘roc_auc’)
cv: Determines the cross-validation splitting strategy (e.g., an integer for the number of folds)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=20)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
# Create a linear regression model
model = LinearRegression()
# Perform 5-fold cross-validation on the training data
scores = cross_val_score(model, X_train, y_train, scoring='r2', cv=5)
# Print the scores for each fold
print("Cross-validation scores:", scores)
# Calculate and print the average score
print("Average cross-validation score:", scores.mean())
# Fit on full training data
model = LinearRegression()
model.fit(X_train, y_train)
# Evaluate using the test data
y_pred = model.predict(X_test)
from sklearn.metrics import r2_score
print("final score:", r2_score(y_test, y_pred))
import matplotlib.pyplot as plt
import numpy as np
plt.figure(figsize=(5, 4))
plt.scatter(y_test, y_pred, alpha=0.5)
# Add the 45-degree line
min_value = min(y_test.to_numpy().min(), y_pred.min())
max_value = max(y_test.to_numpy().max(), y_pred.max())
plt.plot([min_value, max_value], [min_value, max_value], 'k--', lw=2)  # Dashed line
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.grid(True)
plt.show()
In the following, we will use
cross_val_score
to find the optimal degree of the polynomial transformation.
# Find the optimal degree using cross-validation
best_degree = 1  # Start with degree 1
best_cv_score = -np.inf

for degree in range(1, 11):  # Check degrees from 1 to 10
    # Create polynomial features for the current degree
    poly = PolynomialFeatures(degree=degree)
    X_train_poly = poly.fit_transform(X_train)

    model = LinearRegression()

    # Perform 5-fold cross-validation
    cv_scores = cross_val_score(model, X_train_poly, y_train, scoring='r2', cv=5)

    # Calculate the average
    mean_cv_score = np.mean(cv_scores)

    if mean_cv_score > best_cv_score:
        best_cv_score = mean_cv_score
        best_degree = degree
print(f"Best degree: {best_degree}")
best_cv_score
# Fit on full training data
poly = PolynomialFeatures(degree=best_degree)
X_train_poly = poly.fit_transform(X_train)
model = LinearRegression()
model.fit(X_train_poly, y_train)
# Evaluate using the test data
y_pred = model.predict(poly.transform(X_test))
from sklearn.metrics import r2_score
print("final score:", r2_score(y_test, y_pred))
In sklearn, cross_validate is more versatile than cross_val_score, allowing evaluation of multiple metrics, access to training scores, and collection of additional information (e.g., fit time and score time).
from sklearn.model_selection import cross_validate
# Create polynomial features for the current degree
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
model = LinearRegression()
# Perform 5-fold cross-validation
results = cross_validate(model, X_train_poly, y_train, scoring='r2', cv=5)
print(results)
Choose cross_val_score for simplicity and cross_validate for flexibility and more detailed insights; the sketch below illustrates the latter.
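To illustrate that flexibility, the following sketch (an addition to the original) passes a list of scoring metrics and requests training scores as well.
# Multiple metrics and training scores with cross_validate
results = cross_validate(model, X_train_poly, y_train,
                         scoring=['r2', 'neg_mean_absolute_error'],
                         cv=5,
                         return_train_score=True)
print("Validation R^2:", results['test_r2'])
print("Training R^2:  ", results['train_r2'])
print("Validation MAE (negated):", results['test_neg_mean_absolute_error'])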
4. Pipelines for Automated Preprocessing#
A pipeline is a way to chain together multiple data transformations and a final estimator (i.e., a machine learning model) into a single, cohesive object.

It simplifies your code and ensures that the same steps are applied consistently during both training and prediction.
Syntax
Pipeline(steps)
Parameters:
steps: List of (name, transform) tuples (each implementing fit/transform) that are chained in order, with the last object an estimator.
from sklearn.pipeline import Pipeline
df = pd.read_csv("https://raw.githubusercontent.com/farhad-pourkamali/MATH4388Online/refs/heads/main/data/Auto.csv")
df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')
df_cleaned = df.dropna()
# Select input features and outputs
X, y = filter_and_select_features(df_cleaned)
X
# Create the pipeline
pipeline = Pipeline([
('scaler', StandardScaler()), # Feature scaling
('poly', PolynomialFeatures(degree=2)), # Polynomial features
('model', LinearRegression()) # Linear regression model
])
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2,
random_state=20)
# Fit the pipeline (all steps together)
pipeline.fit(X_train, y_train)
# Make predictions
y_pred = pipeline.predict(X_test)
print("final score:", r2_score(y_test, y_pred))
Recommended Reading#
Chapter 2 of Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow: ageron/handson-ml3