Data Encoding and Standardization in Machine Learning

Study Notes

Label encoding assigns unique values to categories based on their order, suitable for ordinal data but potentially misleading for nominal data.
One-hot encoding is preferable for nominal categories, preventing the model from assuming an ordinal relationship and giving equal weight to each category.
Standardization is crucial for machine learning algorithms, ensuring features are centered around zero and have similar variance for efficient model convergence.
Standardization shifts the distribution of each attribute to have a mean of zero and a standard deviation of one, improving interpretability and model performance.
In linear regression, especially with regularization, standardization is essential for accurate coefficient interpretation and model convergence.
Splitting the data into training and testing sets is the initial step in training a linear regression model for performance evaluation.
Evaluation of the model's performance on the testing set involves using metrics such as Mean Squared Error (MSE) and R-squared to determine model fit.
R-squared (R^2) is a key metric for evaluating model fit, measuring the proportion of variance explained by the model compared to the total variance.
R-squared is calculated using the formula 1 - (SSres/SStot), where SSres is the Residual Sum of Squares and SStot is the Total Sum of Squares.
SSres measures the deviation of data points from the regression line, while SStot captures the total variance in the observed data.
A high R-squared value close to 1 indicates the model explains a large portion of the variance, while a value close to 0 signifies poor variance explanation.
R-squared is a gauge of the model's explanatory power, but a high value does not guarantee the model is the best fit for the data, requiring cautious interpretation.