Data Preprocessing Guide for Beginners in ML


Before machine learning (ML) models can generate predictions or insights, the raw data must first be cleaned, organized, and transformed into a suitable format for the model. This process is known as data preprocessing. It is the foundation of every successful ML project. It ensures that the model learns from high-quality, consistent, and well-structured input rather than noisy, incomplete, or biased information.

data-preprocessing-article-thumbnail

In this hands-on guide, we’ll walk through how to transform a raw Kindle eBook dataset from Kaggle into machine learning-ready data using Google Colab, a free cloud-based environment that allows you to write and execute Python code directly in your browser. Through a series of step-by-step demonstrations, you’ll learn how to inspect, clean, and engineer features that will later improve model accuracy and interpretability.

Data preprocessing is where most of the real work and real impact happen. In fact, many data scientists spend up to 80% of their project time cleaning and preparing data. By mastering these techniques, you gain control over the quality of your models and ensure that your results reflect genuine patterns rather than artifacts of messy data.

By the end of this guide, you’ll understand:

  • Why preprocessing is essential before training an ML model.
  • How to perform each key step, beginning with identifying and understanding the problem, followed by handling missing values and scaling features.
  • How to build a reproducible preprocessing pipeline in Google Colab.

Whether you’re a beginner exploring data science or an intermediate practitioner looking to refine your workflow, this tutorial will give you a practical foundation for turning raw data into reliable, ML-ready insights.

Identify the Problem

Before jumping into data preprocessing, let’s first identify the machine learning problem we are trying to solve. Data preprocessing is not a one-size-fits-all process: the steps you take depend heavily on your project’s objectives, the nature of your data, and the insights you hope to uncover.

In this tutorial, we’re working with a Kindle eBook dataset that you can download by clicking the link below. To make this data useful for machine learning, we first need to clarify our end goal. In this case, our goal is to preprocess the data for a binary classification model that predicts whether a Kindle eBook is likely to become a bestseller or not.

Kindle eBook dataset

By clearly defining the task as a binary classification problem (bestseller vs. non-bestseller), we can:

  • Determine which preprocessing techniques to apply, such as encoding categorical features, scaling numerical values, and balancing class distribution.
  • Align our feature engineering process with the target outcome.
  • Ensure that our final dataset is structured and meaningful for model training.

By taking the time to define the problem first, you avoid unnecessary work and ensure that your preprocessing pipeline aligns directly with your modeling goals.

Google Colab Setup

To preprocess and analyze our dataset efficiently, we’ll use Google Colab. It provides preconfigured access to common Python data science libraries and GPU/TPU acceleration, making it an ideal tool for beginners and professionals alike.

You can access Colab directly through your browser at colab.research.google.com. If you have a Google account, you can access it via Google Drive → New → More → Google Colaboratory.

Once opened, create a new notebook by clicking this button.

data-preprocessing-article-create-colab-notebook

Next, rename it to ‘kindle_processing.ipynb’.

data-preprocessing-article-colab-notebook-rename

Although Colab comes with most common libraries preinstalled, it’s good practice to confirm and install any missing ones. Paste this code into a cell to install any of the required libraries that are missing.

!pip install pandas numpy matplotlib seaborn scikit-learn imbalanced-learn

To run the code, click the play button on the upper left of the cell. The code will then execute and display the output if there are any logs or print statements.

data-preprocessing-article-colab-run-cell-button

To add a new code cell, hover just below an existing cell and two buttons will appear. Clicking the ‘+ Code’ button will add a new code cell below it.

data-preprocessing-article-colab-add-code-cell

Next, import the libraries we’ll use throughout this tutorial:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from google.colab import files
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler  # optional, for class balancing (not used in this guide)

Upload the dataset to Colab and use pandas to read the CSV with this code.

uploaded = files.upload()
file_name = list(uploaded.keys())[0]
df = pd.read_csv(file_name)

Once the cell is running, it will wait for you to upload the CSV file, so click the ‘Choose Files’ button to open the file chooser and select the dataset file.

data-preprocessing-article-colab-file-upload

Understanding the Dataset

Now, let’s explore the dataset and verify that Colab successfully loaded it. 

Understanding the dataset is a crucial first step before any preprocessing or modeling work. By examining its structure, data types, and content, you gain insights into how the data is organized, what kinds of values it contains, and where potential issues—such as missing values, duplicates, or inconsistent formats—might occur.

This code will print the shape of our dataset. It will tell us how many rows (records) and columns (features) we have.

# Display the shape of the dataset
print("Dataset Shape:")
print(df.shape)

data-preprocessing-article-dataset-shape

This code will print a concise summary of the dataset. The output shows that the dataset has 84,086 rows and 16 columns. It lists each column’s name, data type (like numbers, text, or dates), and shows the count of non-null (not missing) values per column. It also shows that the dataset uses about 8.7 MB of memory.

# Display concise summary of the DataFrame
print("Dataset Info:")
print(df.info())

data-preprocessing-article-dataset-info

This code will display the first five rows of the dataset. It will help us visualize the structure of the raw data.

# Display the first few rows
print("First 5 Rows:")
display(df.head())

data-preprocessing-article-dataset-head

This code will check for null values and print the count per column. As you can see in the output, the columns author, soldBy, and publishedDate have null values. We’ll need this later when handling missing data.

# Check for null values and display the count per column
print(df.isnull().sum())

data-preprocessing-null-vlaues-check
At this point, your Google Colab environment is ready, and the dataset has been successfully loaded and inspected. You now have an overview of the dataset’s size, structure, and completeness.
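Since our end goal is a binary classification model, it is also worth checking numeric summary statistics, duplicate rows, and the balance of the target class at this stage. Here is a minimal optional sketch; the isBestSeller column comes from the dataset itself and is used as the target later in this guide.

# Optional extra checks: summary statistics, duplicates, and target class balance
print(df.describe())
print("Duplicate rows:", df.duplicated().sum())
print(df['isBestSeller'].value_counts(normalize=True))

If the target turns out to be heavily imbalanced (far fewer bestsellers than non-bestsellers), that is useful to know before the stratified split and any class-balancing step later on.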

Handling Missing Data

Missing data is one of the common challenges in data preprocessing. Whether due to human error, system limitations, or incomplete records, missing values can significantly impact the performance and reliability of machine learning models. Understanding why missing data matters and how to handle it properly is essential to ensure your dataset represents accurate and consistent information.
 
Machine learning algorithms typically require complete and consistent data to function effectively. When values are missing, several problems occur:
  • Bias: Models may learn inaccurate patterns if missingness isn’t random.
  • Data loss: Removing rows or columns can reduce the dataset size and diversity.
  • Model errors: Many algorithms (like linear regression or SVM) can’t handle NaN values and will fail during training.
  • Skewed statistics: Mean, median, or correlation results may be distorted if missing values are ignored.

By addressing missing data early, we reduce the risk of training a model on incomplete or misleading information.
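Before deciding on a strategy, it also helps to quantify how much of each column is missing. The following is a small sketch that is not part of the main workflow; the commented-out median imputation is just an example pattern for numeric columns, which this guide does not need.

# Share of missing values per column (0.0 to 1.0), largest first
print(df.isnull().mean().sort_values(ascending=False))

# Example pattern only (not used in this guide): fill a numeric column with its median
# df['some_numeric_column'] = df['some_numeric_column'].fillna(df['some_numeric_column'].median())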

In reference to the code that displayed missing data in the previous section, we will now address the columns with missing values using the following approach:

This code replaces the missing author and soldBy entries with the value 'Unknown'. Doing this ensures that all text data remains complete and can be properly processed later during encoding and scaling steps.

# Fill missing values
# For 'author' and 'soldBy', you could fill with a placeholder like 'Unknown'
df['author'] = df['author'].fillna('Unknown')
df['soldBy'] = df['soldBy'].fillna('Unknown')

We drop rows with missing publishedDate values because the publication date is a key feature that cannot be accurately imputed without introducing bias or incorrect information. Since the number of missing entries is relatively small, dropping these rows is a safer option that preserves data quality.

# For 'publishedDate', convert to datetime and then drop rows with NaT (missing) values
df['publishedDate'] = pd.to_datetime(df['publishedDate'], errors='coerce')
df.dropna(subset=['publishedDate'], inplace=True)

Finally, let’s confirm that no missing values remain:

print("nNull values count per column after handling:")
print(df.isnull().sum())
print("nDataset Info:")
print(df.info())
data-preprocessing-article-null-values-cleaned

Dropping Irrelevant Columns

After we addressed the missing values in our dataset, it’s time to remove irrelevant columns that may introduce unnecessary noise to our data. Not all features in a dataset contribute meaningfully to a machine learning model. Some columns may contain identifiers, redundant information, or attributes that have little to no predictive value. Including these irrelevant columns can increase noise, slow down computation, and even reduce the model accuracy by introducing misleading patterns.

Before building our model, it’s important to identify and remove these non-essential features. Generally, we consider dropping a column if it meets one or more of the following criteria:

  • Identifiers: Columns like id, url, or asin (Amazon Standard Identification Number) serve as unique labels but have no predictive relationship with the target variable.
  • Duplicates or redundant data: If two columns convey the same or overlapping information, keeping both adds unnecessary complexity.
  • High missing values: Columns with excessive missing data are often less useful than those with complete information.
  • Irrelevant metadata: Attributes like image URLs, hyperlinks, or system-generated tags rarely help in prediction.
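If you prefer to check these criteria programmatically rather than by eye, a quick heuristic sketch like the one below can flag candidates. It is only a starting point; domain judgment still decides what to drop.

# Heuristic scan: mostly-missing, constant, or identifier-like columns
for col in df.columns:
    missing_ratio = df[col].isnull().mean()
    n_unique = df[col].nunique()
    if missing_ratio > 0.5 or n_unique <= 1 or n_unique == len(df):
        print(f"Possible drop candidate: {col} (missing={missing_ratio:.0%}, unique={n_unique})")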

Let’s apply this to our Kindle eBook Dataset.

In this code, we remove columns such as url, image, and asin because they act as identifiers or metadata that don’t contribute to predicting whether a book will become a bestseller. By doing this, we simplify the dataset, reduce noise, and make the preprocessing pipeline more efficient. From this stage onward, the df_cleaned variable will serve as our primary dataset, replacing the original df variable for all subsequent steps.

# Columns identified as irrelevant for predicting bestseller status
irrelevant_columns = ['asin', 'imgUrl', 'productURL', 'category_id']
# Drop the specified columns from the DataFrame
df_cleaned = df.drop(columns=irrelevant_columns)

To confirm the dataset structure after dropping these columns, we can recheck the shape and column names:

print("Shape of DataFrame after dropping irrelevant columns:")
print(df_cleaned.shape)
print("nFirst 5 rows of DataFrame after dropping irrelevant columns:")
display(df_cleaned.head())
data-preprocessing-article-dropped-columns
By removing irrelevant or redundant columns, we reduce dimensionality and ensure that only meaningful features remain for model training. This step not only improves computational efficiency but also enhances the model’s ability to focus on features that truly influence the prediction of a Kindle eBook’s bestseller potential.

Feature Engineering

After dropping irrelevant columns, we can now engineer features that help our model identify patterns more effectively and generate more accurate predictions. Feature engineering is often considered the “art” of machine learning because the quality of features often has a greater impact on model performance than the choice of algorithm or hyperparameter tuning.

It involves transforming raw data into meaningful variables during data preprocessing that better represent the underlying problem, improving a model’s ability to learn and generalize to unseen data.

Feature engineering also improves model interpretability by creating clearer, more meaningful inputs instead of relying on raw, unprocessed data. Some benefits of feature engineering are the following:

  • Maximizes Model Performance: It extracts the hidden predictive signal from raw data, allowing models to learn more effectively and achieve higher accuracy.
  • Encodes Domain Expertise: It translates your real-world knowledge of the subject (e.g., book market dynamics) into quantifiable features that the algorithm can utilize.
  • Normalizes and Cleans Data: It handles inherent data issues by transforming skewed distributions, managing outliers, and converting messy text/date fields into numerical formats models require.
  • Improves Interpretability: It makes model predictions easier to understand and explain by using clear, meaningful inputs instead of complicated raw data.

In this step, we will use our domain knowledge to create new, highly informative variables from existing ones.

In this code, we convert the publishedDate column into a datetime type so we can extract the year, month, and day into separate columns. We then drop the original publishedDate column to avoid redundancy.

# Convert 'publishedDate' to datetime if it's not already
df_cleaned['publishedDate'] = pd.to_datetime(df_cleaned['publishedDate'], errors='coerce')
# Extract year, month, and day from 'publishedDate'
df_cleaned['publishedYear'] = df_cleaned['publishedDate'].dt.year
df_cleaned['publishedMonth'] = df_cleaned['publishedDate'].dt.month
df_cleaned['publishedDay'] = df_cleaned['publishedDate'].dt.day
# Drop the original 'publishedDate' column as its features have been extracted
df_cleaned = df_cleaned.drop(columns=['publishedDate'])

This code calculates the length of the title and author strings. These features provide simple but useful measures of a book’s characteristics. For instance, title length may serve as a proxy for genre or marketing style, while author length can hint at collaborations or brand recognition — subtle signals the model can learn from.

# Calculate length of 'title' and 'author'
df_cleaned['title_length'] = df_cleaned['title'].apply(lambda x: len(x) if isinstance(x, str) else 0)
df_cleaned['author_length'] = df_cleaned['author'].apply(lambda x: len(x) if isinstance(x, str) else 0)

The next code block calculates the interaction of stars and reviews. Multiplying them together gives a combined score that is a much stronger indicator of a book’s true market reception than either factor alone. This allows the model to differentiate between a book with 5 stars and 10 reviews versus one with 4.5 stars and 10,000 reviews.

# Create an interaction term between 'stars' and 'reviews'
df_cleaned['stars_reviews_interaction'] = df_cleaned['stars'] * df_cleaned['reviews']

Now let’s display the dataset after we engineered new features.

print("DataFrame after feature engineering and dropping publishedDate:")
display(df_cleaned.head())
data-preprocessing-article-feature-engineering
 
Through feature engineering, we transformed raw data into richer, more informative features that capture time-based, text-based, and interaction patterns. These engineered variables enhance the model’s ability to detect meaningful relationships in the data, setting a strong foundation for accurate prediction of whether a Kindle eBook will become a bestseller. However, keep in mind that feature engineering is not a required step and can be skipped for simple machine learning problems. Make sure you understand the problem well enough to judge whether your model can benefit from this technique.
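As an optional extension that this guide does not use further, the earlier point about skewed distributions can be handled with a simple log transform, for example on review counts, which are typically right-skewed:

import numpy as np

# Optional: log-transform 'reviews'; log1p handles books with zero reviews safely
df_cleaned['log_reviews'] = np.log1p(df_cleaned['reviews'])

If you add a feature like this, remember to include it in the list of numeric columns that the preprocessing pipeline scales later.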

Train & Test Split

It is essential to evaluate how a machine learning model performs on unseen data. This is achieved by splitting the dataset into two parts: a training set and a test set.

The training set is used to teach the model the patterns and relationships within the data, while the test set is kept aside to evaluate the model’s performance on data it has never seen before. This ensures the model’s results are generalizable and not just memorized from the training data, a problem known as overfitting.

Splitting the dataset allows for an unbiased assessment of the model’s accuracy, precision, and other metrics. A common split ratio is 80% for training and 20% for testing, though this may vary depending on dataset size and complexity.

For this guide, we will split our Kindle eBook data into an 80% training set and a 20% test set.

This code sets the isBestSeller column as our target variable, or ‘y’. We then drop the target column from the dataset to form our features, or ‘X’. Notice that we also drop the title and author columns, since we have already engineered them into title_length and author_length; keeping the raw text would only add noise to our data.

# If column exists, set target properly
target_col = 'isBestSeller' 
# Features (X) and Target (y) 
X = df_cleaned.drop(columns=['title', 'author', target_col])
y = df_cleaned[target_col]

Now this code will split our dataset into an 80% train set and a 20% test set. The random_state parameter ensures reproducibility of results, and the stratify parameter maintains the class distribution between the train and test sets.

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

Let’s now display the size of train and test sets.

# Shape of train-test X
print("X train shape:", X_train.shape)
print("X test shape:", X_test.shape)
#Shape of train-test y
print("y train shape:", y_train.shape)
print("y test shape:", y_test.shape)
data-preprocessing-article-final-processed-sets
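Because we imported Counter earlier, we can also quickly confirm that stratification kept the bestseller ratio similar in both splits. A small optional check:

# Verify that the class distribution is preserved in the train and test sets
print("Train class counts:", Counter(y_train))
print("Test class counts:", Counter(y_test))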

By splitting the data at this stage, we ensure that all subsequent preprocessing steps, such as encoding or scaling, are applied properly using only the training data. This prevents data leakage and ensures fair evaluation during model testing.

Preprocessing Pipeline

In machine learning, data preprocessing is often the most time-consuming part of the workflow. You can apply transformations one step at a time, but this ad-hoc approach can get messy. It also increases the chance of errors and makes the process harder to reproduce, especially when working with datasets that include numerical, categorical, and boolean features.

A preprocessing pipeline provides a structured framework to handle all transformations consistently. Using tools like Scikit-learn’s Pipeline and ColumnTransformer, we can define exactly how each feature type should be processed, for example:

  • Scale numeric features
  • Encode categorical variables
  • Pass boolean columns as-is

Benefits of a preprocessing pipeline include:

  • Consistency: Ensures the same transformations are applied to both training and test data.
  • Reproducibility: Makes your workflow easy to rerun or share with others.
  • Leakage prevention: Fit transformations on training data only, avoiding accidental use of test data information.
  • Clean, maintainable code: Reduces repetitive commands and keeps your notebook organized.
  • Extensibility: Pipelines can later include models, allowing end-to-end workflows from raw data to predictions.

Encoding

Most machine learning algorithms work with numerical data. However, many real-world datasets, like the Kindle eBook dataset we are using, contain categorical variables, such as author, soldBy, or category, which represent text labels rather than numbers. To make these values usable for model training, we need to encode them into numerical form.

Encoding categorical data is a critical preprocessing step because models interpret numeric values mathematically. Without encoding, algorithms cannot compute relationships or similarities between text-based categories.

Why encoding is important:

  • Model Compatibility: Algorithms like Logistic Regression, Support Vector Machines, and most neural networks require numerical input.
  • Improved performance: Encoding turns qualitative information into quantitative signals that can be learned.
  • Feature interpretability: Encoded values make it easier to analyze variable importance and relationships.

There are several techniques for converting categorical data into numerical form. The choice of method depends on the number and type of categories present. In this guide, One-Hot Encoding is used for nominal features like soldBy and category_name, while Boolean variables (e.g., isKindleUnlimited) are retained as binary 0/1 values.
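To make the idea concrete, here is a tiny standalone sketch on toy data (not the Kindle dataset) showing what OneHotEncoder produces for a categorical column; the seller names are made up for illustration, and get_feature_names_out requires scikit-learn 1.0 or newer.

# Toy example: one-hot encoding a small categorical column
toy = pd.DataFrame({'soldBy': ['Seller A', 'Seller B', 'Seller A']})
encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(toy[['soldBy']])
print(encoded.toarray())
print(encoder.get_feature_names_out())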

Scaling

After encoding, the next step is scaling the numerical features using techniques like StandardScaler. Scaling standardizes the range of continuous variables, ensuring that features with larger numeric values (such as reviews or price) do not dominate smaller-scaled ones (like stars). This step is especially important for distance-based algorithms or models sensitive to feature magnitude.

Why scaling is important:

  • Algorithm sensitivity: Many models, especially distance-based algorithms (e.g., K-Nearest Neighbors, Support Vector Machines) and gradient-based ones (e.g., Logistic Regression, Neural Networks) are sensitive to feature magnitude.
  • Equal feature contribution: Ensures that all features contribute proportionally to the model, preventing high-valued variables from overpowering smaller-scaled ones.
  • Faster convergence: Standardized features help gradient descent and other optimization processes converge more quickly and smoothly.
  • Improved performance: Properly scaled data leads to more stable learning, balanced weight updates, and higher predictive accuracy.

Together, encoding and scaling form a foundational stage in the data preprocessing pipeline, ensuring that all input features are consistent, comparable, and optimized for model training.
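Similarly, here is a minimal toy sketch of what StandardScaler does: it subtracts each column’s mean and divides by its standard deviation, so the scaled values have mean 0 and standard deviation 1 (the prices below are made up for illustration).

# Toy example: standard scaling, i.e., (value - mean) / standard deviation
toy_prices = pd.DataFrame({'price': [0.99, 4.99, 9.99, 199.99]})
print(StandardScaler().fit_transform(toy_prices))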

First, let’s identify the data types of the columns in our dataset. This code defines separate lists of column names, grouped by data type.

# Identify columns by type
categorical_cols = ['soldBy', 'category_name']
boolean_cols = ['isKindleUnlimited', 'isEditorsPick', 'isGoodReadsChoice']
numeric_cols = [
    'stars', 'reviews', 'price',
    'publishedYear', 'publishedMonth', 'publishedDay',
    'title_length', 'author_length', 'stars_reviews_interaction'
]

Next, we define how the columns will be transformed. This code creates a ColumnTransformer, a tool from Scikit-learn that applies different preprocessing steps to specific columns. In this case, it uses one-hot encoding for categorical features and standard scaling for numerical variables. These transformations are combined into a single, organized pipeline for easier processing.

# Define preprocessing for different column types
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore', drop='first'), categorical_cols), # Encode categorical
        ('num', StandardScaler(), numeric_cols), # Scale numeric
        ('bool', 'passthrough', boolean_cols) # Keep booleans as is (0/1)
    ],
    remainder='drop' # Drop any columns not listed above
)

After defining the preprocessor, we wrap it in a Pipeline and apply it to our data using this code. Make sure to use fit_transform() only on the training set (and plain transform() on the test set) to avoid data leakage.

#Full pipeline (preprocessing only, model can be attached later)
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])
# Fit the pipeline on training data
X_train_encoded = pipeline.fit_transform(X_train)
X_test_encoded = pipeline.transform(X_test)

Now let’s display the shape of the transformed data.

print("Processed train shape:", X_train_encoded.shape)
print("Processed test shape:", X_test_encoded.shape)
data-preprocessing-article-dataset-split
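If you want to see exactly which columns the fitted pipeline produced, which is useful later for model interpretation, recent scikit-learn versions (1.0+) expose them through get_feature_names_out(). A small sketch:

# Inspect the feature names generated by the ColumnTransformer
feature_names = pipeline.named_steps['preprocessor'].get_feature_names_out()
print(len(feature_names), "features after preprocessing")
print(feature_names[:10])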

Conclusion

Preprocessing is the foundation of any machine learning workflow. In this tutorial, we explored the Kindle eBook dataset and prepared it for modeling by: handling missing values, dropping irrelevant columns, engineering new features, encoding categorical variables, scaling numerical features, and building a reproducible pipeline.

Proper preprocessing ensures clean, consistent, and numerical data, prevents data leakage, improves model performance, and makes workflows reproducible and maintainable. Investing time in preprocessing sets the stage for accurate and reliable machine learning models.
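One practical way to keep the workflow reproducible is to persist the fitted pipeline so it can be reused on new data without repeating these steps. Here is a minimal sketch using joblib; the file name is just an example.

import joblib

# Save the fitted preprocessing pipeline for later reuse
joblib.dump(pipeline, 'kindle_preprocessing_pipeline.joblib')
# Later: pipeline = joblib.load('kindle_preprocessing_pipeline.joblib')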

What’s next?

With a clean, preprocessed dataset, the next step is to train and evaluate machine learning models. You can start with algorithms like Logistic Regression, Random Forest, or Gradient Boosting to predict whether a Kindle eBook will become a bestseller.
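As a starting point, here is a minimal sketch of how a first baseline model could be trained on the preprocessed arrays from this tutorial (Logistic Regression is used purely as an example). With an imbalanced target, also look beyond accuracy at metrics like precision, recall, or ROC AUC.

from sklearn.linear_model import LogisticRegression

# Train a simple baseline classifier on the preprocessed features
model = LogisticRegression(max_iter=1000)
model.fit(X_train_encoded, y_train)
print("Test accuracy:", model.score(X_test_encoded, y_test))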

Other possibilities include:

  • Feature selection or dimensionality reduction to improve performance.
  • Hyperparameter tuning to optimize model accuracy.
  • Model interpretation to understand which features most influence predictions.


Written by: Jaime Lucero

Jaime is a Bachelor of Science in Computer Science major in Data Science student at the University of Southeastern Philippines. His journey is driven by the goal of becoming a developer specializing in machine learning and AI-driven solutions that create meaningful impact.
