In recent years, in the ever-evolving landscape of technology, few buzzwords have captured the imagination of innovators and entrepreneurs as strongly as Machine Learning (ML) and Artificial Intelligence (AI). With the boom of OpenAI’s ChatGPT in November 2022, as well as the many models that followed suit, including major tech companies like Meta with their LLaMa model and Google’s PaLM and newly announced Gemini, ML and AI will most likely remain in the spotlight for months, years, and possibly decades to come.
In a previous article, The AI Revolution: An Overview of Generative AI and Foundation Models, we defined Machine Learning Models as “massive bundles of mathematical equations with parameters that change depending on the amount of training it does from a dataset.” But not all models are as complex by nature as GPT and the others. There are many other applications of ML outside these chatbots that don’t require an understanding of human language and, therefore, wouldn’t need anywhere near as much training time or resources. As we dive into the world of ML models, we will explore how these simpler algorithms have not only fueled academic research but also powered countless real-world applications.
In this article, we will discuss the mathematics behind some of the simplest models available and how the functionality of libraries that implement these models can be coded from scratch. Although these models are much more straightforward, the foundational concepts remain the same. Learning the intricacies of these models builds practical skills and serves as a strong foundation for more advanced study.
Regression
Regression, in the context of machine learning and statistics, is a predictive modeling technique used to analyze the relationship between a dependent (target) variable and one or more independent (predictor) variables. The primary goal of regression is to find a function or model that best describes the relationship between these variables. This model can then be used to predict the value of the target variable based on the values of the predictor variables.
The concept of regression serves as the foundation for both linear and logistic regression, which are two of the most commonly used types of regression analysis in machine learning and statistics. Both are supervised machine learning models, meaning there is a labeled training dataset for predicting outcomes. In this article, we will discuss how linear models can be used in real-life scenarios, and in the next article, we will discuss logistic regression and more complex statistical concepts with machine learning.
Consider a scenario of a medical practitioner who wishes to predict patient health based on various health metrics. For this practitioner, both types of regression can provide insight once applied.
Linear Regression
Linear regression is a specific instance of regression analysis where the relationship between the independent variables and the dependent variable is assumed to be linear. In simpler terms, “a change in one corresponds to a proportional change in the other.”
These relationships appear everywhere, including housing data (how “good” a house is versus its price), advertising spend versus revenue, and the age of a depreciating asset (like a vehicle) versus its value, among many others.
The fundamental question of linear regression is: if the input variable(s) take on the value(s) X, then, based on the data the model was trained on, what would be the numerical value of the output variable Y?
Mathematical Concepts
Building on the fundamental concept noted above, the mathematical approach to linear regression is to find the “equation of best fit” – an equation of a line in a plane (or of a hyperplane in higher-dimensional space) that minimizes its distances (also called residuals) to the actual data points.
Consider the medical practitioner example discussed earlier, and suppose that as a person’s age increases, their resting heart rate decreases roughly linearly. Given a dataset of people’s ages and their resting heart rates, and by plotting these values in a plane, the “equation of best fit” would be a line with a negative slope corresponding to the rate of this decrease. This line is determined by the existing data points, and by minimizing the values of the residuals between this line and the points, the “best fit” is attained.
Mathematically speaking, this equation is the slope-intercept form of a line that many are familiar with from high school algebra:

y = mx + b

However, for standardization with other resources, we can rename the coefficients as follows:

y = β0 + β1x

where
x is the independent variable,
β0 is the y-intercept, and
β1 is a coefficient (also the slope).
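To make this concrete, here is a minimal sketch (the coefficient values below are made up purely for illustration) of how a fitted line turns an input into a prediction, and how a residual measures the gap between a prediction and an observed value:

# Hypothetical coefficients of an already-fitted line y = β0 + β1x
beta_0 = 90.0   # made-up y-intercept
beta_1 = -0.2   # made-up slope

def predict(x):
    # Evaluate the fitted line at x
    return beta_0 + beta_1 * x

observed_y = 75.0                     # made-up observed value at x = 40
residual = observed_y - predict(40)   # 75.0 - 82.0 = -7.0
print(predict(40), residual)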
These coefficients (generally referred to as weights) are the values that dictate how much a feature of the data affects the prediction. Intuitively, a weight measures the impact of its feature on the output. For example, in predicting the depreciation of a vehicle, the mileage feature would carry a large weight, since an increase in mileage strongly affects the predicted value.
However, it is rarely the case that only one variable dictates the output. These cases fall under multivariate linear regression (having more than one independent variable that dictates the value of the dependent variable) and follow this general equation:

y = β0 + β1x1 + β2x2 + … + βnxn

where
x1 to xn are the independent variables;
β0 is the y-intercept; and,
β1 to βn are the coefficients.
For multivariate linear regression, there will be more coefficients involved. In the same depreciation example, the mileage, the cost spent on maintenance, and the years since purchase would each have different values for these weights.
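As a minimal sketch (all weight and feature values below are invented for illustration), a multivariate prediction is simply the intercept plus a dot product between the weight vector and the feature vector:

import numpy as np

# Hypothetical weights for the vehicle-depreciation example
beta_0 = 30000.0                          # made-up base value in dollars
betas = np.array([-0.05, -0.5, -1500.0])  # weights for mileage, maintenance cost, years since purchase

# One made-up vehicle: 60,000 miles, $2,000 of maintenance, 4 years old
features = np.array([60000.0, 2000.0, 4.0])

# y = β0 + β1x1 + β2x2 + ... + βnxn
prediction = beta_0 + np.dot(betas, features)
print(prediction)  # 30000 - 3000 - 1000 - 6000 = 20000.0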
However, multivariate linear regression requires heavier mathematical concepts, which will be thoroughly discussed in the next series of articles.
The formula for calculating the slope of the regression line is given as the following:

β1 = r(sy / sx)
where r is the Pearson Correlation Coefficient;
sy is the standard deviation of the y variable; and,
sx is the standard deviation of the x variable.
The value of the slope β1 relies on the Pearson Correlation Coefficient, which measures the linear correlation between two variables, indicating the strength and direction of their relationship on a scale from -1 (perfect negative correlation) to 1 (perfect positive correlation). A perfect correlation means the data points lie exactly on a straight line, so every change in the independent variable corresponds to an exactly predictable change in the dependent variable. For example, with a perfect positive correlation, every one-unit increase in x produces the same fixed increase in y.
The formula for the Pearson Correlation Coefficient is given as follows:

r = Σ(xi − x̄)(yi − ȳ) / √(Σ(xi − x̄)² × Σ(yi − ȳ)²)

where x̄ and ȳ are the means of the x and y values, respectively.
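To see the endpoints of this scale in action, here is a minimal sanity check (with made-up data) using the manual formula above: a perfectly linear increasing relationship yields r = 1, and a perfectly linear decreasing one yields r = -1.

import numpy as np

def pearson_r(X, Y):
    # Manual Pearson correlation, mirroring the formula above
    mean_X, mean_Y = np.mean(X), np.mean(Y)
    numerator = np.sum((X - mean_X) * (Y - mean_Y))
    denominator = np.sqrt(np.sum((X - mean_X) ** 2) * np.sum((Y - mean_Y) ** 2))
    return numerator / denominator

X = np.array([1.0, 2.0, 3.0, 4.0])
print(pearson_r(X, 2 * X + 5))   # 1.0  (perfect positive correlation)
print(pearson_r(X, -2 * X + 5))  # -1.0 (perfect negative correlation)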
Once the slope (β1) has been determined, the remaining value to calculate is the y-intercept β0, which dictates where our line with slope β1 sits among its family of parallel lines. The formula is as follows:

β0 = ȳ − β1x̄
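As a quick worked example with made-up points, take (1, 2), (2, 4), and (3, 6). Here x̄ = 2, ȳ = 4, sx = 1, sy = 2, and the points are perfectly linear, so r = 1. The formulas then give β1 = 1 × (2 / 1) = 2 and β0 = 4 − 2 × 2 = 0, recovering the line y = 2x exactly.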
With all the variables defined and formulas noted, let’s go back to our medical practitioner example. Suppose they are studying the relationship between a patient’s age and their resting heart rate. They have collected data from a group of patients, including their ages and their resting heart rates. They could use linear regression to predict a patient’s resting heart rate based on their age.
Code
The goal of this exercise is to apply the formulas above and develop a simple linear regression function that takes in two arrays as input data and develops the equation for the regression line between them. This code exercise will be done using Python 3.
Function Definition and Imports
For simplicity, the variable X will be the independent variable and Y will be the dependent variable. The numpy library will also prove useful here, but only for simplifying the coding procedure; it will not be used for the heavier mathematical computations.
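Since the snippets below rely on numpy for array math and on matplotlib for the final plot, these two imports are all that is needed:

import numpy as np
import matplotlib.pyplot as plt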
Dataset
Let’s start with the dataset. For this example, we can use the following synthetic data for age (X) and resting heart rate (Y):
X = np.array([64, 67, 20, 23, 23, 59, 29, 39, 41, 56, 43, 60, 44, 44, 32, 21, 58, 59, 43, 66])
Y = np.array([75.2, 67.6, 92.0, 84.4, 72.4, 60.2, 67.2, 76.2, 71.8, 57.8, 70.4, 58.8, 73.2, 90.2, 81.6, 88.8, 71.4, 71.2, 69.4, 80.8])
Mean
Calculating the means is straightforward, and saving them in their own variables will be useful for the computations later.
# Calculate the means of X and Y
mean_X = np.mean(X)
mean_Y = np.mean(Y)
Standard Deviation
The standard deviation measures how far apart the values of the dataset are. A higher standard deviation means the data is more spread out (farther from the mean).
The two main types of standard deviation are the population standard deviation, which characterizes the spread of data in an entire population, and the sample standard deviation, which estimates the spread of data in a sample from that population.
This example uses the formula for the sample standard deviation, denoted as s:

s = √(Σ(xi − x̄)² / (n − 1))

where n is the number of data points.
# Calculate the standard deviations of X and Y manually
std_X = np.sqrt(np.sum((X - mean_X) ** 2) / (len(X) - 1))
std_Y = np.sqrt(np.sum((Y - mean_Y) ** 2) / (len(Y) - 1))
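As an optional sanity check, numpy’s built-in np.std computes the population standard deviation by default; passing ddof=1 switches it to the sample formula, so it should match the manual computation above:

# Sanity check: ddof=1 yields the sample standard deviation
assert np.isclose(std_X, np.std(X, ddof=1))
assert np.isclose(std_Y, np.std(Y, ddof=1))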
Pearson Correlation Coefficient
Implementing the formula above can be simplified by using features of the numpy library.
# Calculate the Pearson correlation coefficient using the provided formula
numerator = np.sum((X - mean_X) * (Y - mean_Y))
denominator = np.sqrt(np.sum((X - mean_X) ** 2) * np.sum((Y - mean_Y) ** 2))
r = numerator / denominator
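As another optional sanity check, the manual result should match numpy’s built-in np.corrcoef, which returns the correlation matrix of the two arrays:

# Sanity check against numpy's built-in correlation matrix
assert np.isclose(r, np.corrcoef(X, Y)[0, 1])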
Beta values
# Calculate the slope (beta_1)
beta_1 = r * (std_Y / std_X)

# Calculate the intercept (beta_0)
beta_0 = mean_Y - beta_1 * mean_X
Complete Function
By combining all the snippets above, we develop the following function that can:
- Calculate the regression line given two sets of values (X and Y)
- Evaluate the metrics
- Plot the graph
X = np.array([64, 67, 20, 23, 23, 59, 29, 39, 41, 56, 43, 60, 44, 44, 32, 21, 58, 59, 43, 66])
Y = np.array([75.2, 67.6, 92.0, 84.4, 72.4, 60.2, 67.2, 76.2, 71.8, 57.8, 70.4, 58.8, 73.2, 90.2, 81.6, 88.8, 71.4, 71.2, 69.4, 80.8])

def my_linear_regression(X, Y):
    # Calculate the means of X and Y
    mean_X = np.mean(X)
    mean_Y = np.mean(Y)

    # Calculate the Pearson correlation coefficient using the given formula
    numerator = np.sum((X - mean_X) * (Y - mean_Y))
    denominator = np.sqrt(np.sum((X - mean_X) ** 2) * np.sum((Y - mean_Y) ** 2))
    r = numerator / denominator

    # Calculate the standard deviations of X and Y manually
    std_X = np.sqrt(np.sum((X - mean_X) ** 2) / (len(X) - 1))
    std_Y = np.sqrt(np.sum((Y - mean_Y) ** 2) / (len(Y) - 1))

    # Calculate the slope (beta_1)
    beta_1 = r * (std_Y / std_X)

    # Calculate the intercept (beta_0)
    beta_0 = mean_Y - beta_1 * mean_X

    # Calculate the standard error of the regression
    residuals = Y - (beta_0 + beta_1 * X)
    squared_residuals = residuals ** 2
    residual_sum_of_squares = np.sum(squared_residuals)
    degrees_of_freedom = len(X) - 2
    standard_error = np.sqrt(residual_sum_of_squares / degrees_of_freedom)

    # Print the calculated values
    print(f"Pearson Correlation Coefficient: {r}")
    print(f"Standard Deviation of X (manual): {std_X}")
    print(f"Standard Deviation of Y (manual): {std_Y}")
    print(f"Slope (beta_1): {beta_1}")
    print(f"Y-Intercept (beta_0): {beta_0}")
    print(f"Standard Error: {standard_error}")

    # Plot the data points
    plt.scatter(X, Y, color='blue', label='Data Points')

    # Plot the regression line
    line = beta_0 + beta_1 * X
    plt.plot(X, line, color='red', label='Regression Line')

    # Add labels, title, and legend
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.title('Linear Regression')
    plt.legend()

    # Show the plot
    plt.show()

# Example usage
my_linear_regression(X, Y)
Running this function with the values listed above prints the following metrics and displays the regression plot.
Pearson Correlation Coefficient: -0.30975165287436146
Standard Deviation of X (manual): 16.046888532639198
Standard Deviation of Y (manual): 10.881036134873417
Slope (beta_1): -0.21003566647249397
Y-Intercept (beta_0): 83.73002830834638
Standard Error: 10.629380793024694
And with that, we have created a simple linear regression function that finds the equation of best fit and can now be used for predictions.
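For instance, using the β values printed above, predicting is just a matter of evaluating the fitted line. Here is a quick sketch for a hypothetical 50-year-old patient:

# Predict the resting heart rate of a hypothetical 50-year-old patient
# using the coefficients computed above
beta_0 = 83.73002830834638
beta_1 = -0.21003566647249397

predicted_heart_rate = beta_0 + beta_1 * 50
print(predicted_heart_rate)  # roughly 73.2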
Final Remarks
The study of Machine Learning is crucial in today’s technology-driven world, as it lies at the heart of numerous advancements and applications in diverse fields ranging from healthcare to finance. By learning ML, individuals gain the ability to analyze vast datasets, uncover hidden patterns, and make data-driven decisions, leading to innovative solutions and improvements in various domains.
Although the current trend in tech revolves around these huge and powerful models, the engineers, developers, statisticians, and everyone else who contributed to building them would not have gotten this far if not for their strong foundational knowledge.
The next parts of this series will contain more information and deeper analyses, so stay tuned!
Thank you for taking the time to read this article. Happy learning!