2025-07-25: Feature Engineering with Shallow Features and Methods
Creating synthesized instances with SMOTE (from Figure 3 in Wongvorachan et al.)
Jason Brownlee gave the definition of feature engineering as follows: "feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data". In other words, feature engineering is manually designing what the input Xs should be.
Feature engineering is regarded as key to success in applied machine learning. "Much of the success of machine learning is actually success in engineering features that a learner can understand", as pointed out by Scott Locklin.
Feature engineering appears in many application scenarios, including fraud detection and prevention for loan applications, user behavior modeling in recommendation systems, and disease diagnosis or risk prediction. In a loan application fraud prevention program, data scientists can decide whether a user is reliable using features built from the user's basic information, credit history, and other records. A recommendation system can analyze a user's behavioral features, such as the items clicked in the past few months, positive or negative reactions, and user type, to infer the topics the user is most interested in.
A feature is an attribute useful for the modeling task, but not all attributes can be used as features. In general, for most industrial modeling, expert knowledge is important for feature creation. For example, in a loan application fraud prevention program, experiences from the risk control department will be very helpful. There are potentially hundreds of features based on a user's basic information, credit report, and assets, but not all of them will be used in modeling. Expert knowledge can help data scientists quickly perform feature construction and screening.
1. Feature Describing
This step provides a general understanding of the dataset. We explore the max, min, mean, and standard deviation of each feature; understand its central tendency, dispersion, and distribution; and find missing values, outliers, and duplicate values. This step serves as the preparatory work for the next steps.
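For example, with pandas we can summarize these statistics in a few lines (a minimal sketch, assuming the dataset has been loaded into a DataFrame named data):
data.describe()  # max, min, mean, std, and quartiles of each numeric feature
data.isnull().sum()  # number of missing values per feature
data.duplicated().sum()  # number of duplicated rows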
2. Feature Processing
The foundation of feature engineering is feature processing, which is time-consuming and directly associated with data quality. It includes operations such as data-cleaning, standardization, and resampling, and aims to transform raw data into a format suitable for model training.
2.1 Data-cleaning
Data-cleaning generally processes missing values, outliers, and inconsistencies to ensure the accuracy of data.
Some features may contain missing values because of the lack of observations. Missing values are typically processed in the following ways:
- Drop directly. We can choose to drop the whole sample (row) or the feature (column) containing the missing value.
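data = data.dropna(subset=['feature'])  # e.g., drop rows where 'feature' is missing
data = data.drop(columns=['feature'])  # or drop the feature column entirely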
- Fill with other values. We can fill the missing values with a constant such as 0, 9999, -9999, -99, etc.
data['feature'] = data['feature'].fillna(-99)  # fill missing values with a constant
- Or we can fill the missing values with the mean, mode, previous value, or next value.
data['feature'] = data['feature'].fillna(data['feature'].mean())  # fill with the mean
- Fill with interpolation
data['feature'] = data['feature'].interpolate()  # linear interpolation by default
- Fill with KNN
from fancyimpute import KNN
dataset = KNN(k=3).fit_transform(dataset)  # impute each missing value from its 3 nearest neighbors (older fancyimpute versions used .complete())
The most frequently used methods are dropping directly or filling with the mean.
Outliers are identified based on the interquartile range, or on the mean and standard deviation. In addition, points whose distance from most other points is greater than a certain threshold are considered outliers. The main distance measures used are absolute (Manhattan) distance, Euclidean distance, and Mahalanobis distance.
We need to process outliers to reduce noise and improve data quality. Typical strategies include: deleting outliers directly when they have a significant impact on the analysis results, treating them as missing values and filling them in with the methods described above, or keeping them when they are considered important.
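A minimal sketch of IQR-based outlier handling with pandas (assuming a numeric column named 'feature'):
q1 = data['feature'].quantile(0.25)
q3 = data['feature'].quantile(0.75)
iqr = q3 - q1
is_outlier = (data['feature'] < q1 - 1.5 * iqr) | (data['feature'] > q3 + 1.5 * iqr)
data = data[~is_outlier]  # here we simply drop the flagged rows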
Duplicate values refer to identical samples from different sources, which will waste storage space and reduce data processing efficiency. The most common way is to drop duplicates completely or partially based on experience.
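With pandas, for example, duplicates can be dropped completely or based on a subset of columns (the column name below is hypothetical):
data = data.drop_duplicates()  # drop rows identical in every column
data = data.drop_duplicates(subset=['user_id'])  # or deduplicate on selected columns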
2.2 Resampling
Class imbalance refers to the situation where the number of samples in different categories of the training set differs significantly. Machine learning methods generally assume that the numbers of positive and negative samples are close. However, in the real world we often observe class imbalance, sometimes in extreme forms: 2% of credit card accounts are fraudulent every year, online advertising conversion rates are in the range of 10^-3 to 10^-6, and so on. Class imbalance can bias a model's predictions towards the majority class and thus lower its predictive power.
We can mitigate class imbalance by oversampling the minority class or undersampling the majority class. When the original dataset is huge, undersampling is a good choice: it randomly deletes samples from the majority class until the number of samples in the two classes is equal. When the dataset is small, we prefer oversampling. One practice is to resample repeatedly from the minority class until its size equals that of the majority class, which carries a high risk of over-fitting. A better way is to use SMOTE (Synthetic Minority Over-sampling Technique), in which synthetic instances of the minority class are generated by interpolating feature vectors of neighboring instances, effectively increasing their representation in the training data. To be specific, SMOTE picks a sample point x in the minority class and randomly picks a point x' from its k nearest neighbors. The synthetic instance is then created by the formula x_new = x + (x' - x) * d, where d is a random number in the range [0, 1]. Three figures from Wongvorachan et al., shown below, demonstrate the three methods more intuitively.
Figure 1. Random oversampling (Figure 1 in Wongvorachan et al.)
Figure 2. Random undersampling (Figure 2 in Wongvorachan et al.)
Figure 3. SMOTE (Figure 3 in Wongvorachan et al.)
Table 1 shows the operating principle, advantages, and drawbacks of each resampling technique. These methods are commonly used, but we also need to emphasize their disadvantages: random oversampling increases the likelihood of overfitting, random undersampling keeps only part of the information in the original dataset, and SMOTE potentially introduces noise into the dataset.
Table 1. The comparison of resampling techniques (Table 1 in Wongvorachan et al.)
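A minimal sketch of the three resampling techniques with the imbalanced-learn library (assuming a feature matrix X and a label vector y):
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)  # duplicate minority samples
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)  # drop majority samples
X_smote, y_smote = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)  # synthesize minority samples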
A more straightforward way to mitigate class imbalance is class weights, which assign a weight to each class in the training set. If a class has a large number of samples, its weight is low; otherwise, its weight is high. There is no need to generate new samples with this method; we just adjust the weights in the loss function.
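For example, many scikit-learn estimators accept a class_weight parameter (a minimal sketch):
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(class_weight='balanced')  # weights inversely proportional to class frequencies
clf.fit(X, y)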
2.3 Feature Transformation
Different features have different scales and ranges. Eliminating scale differences between different features can put data on the same scale and make them numerically comparable.
StandardScaler transforms data into a distribution with a mean of 0 and a standard deviation of 1 by Z-score normalization. Similarly, MinMax scaling normalizes all features to be within 0 and 1. To be specific, StandardScaler obtains the mean and standard deviation of the training data, and then uses these statistics to perform Z-score normalization with the following formula:
x_scaled = (x - μ) / σ, where μ is the mean and σ is the standard deviation.
MinMaxScaler obtains the maximum and minimum values of the training data, and then transforms the data with the following formula:
x_scaled = (x - x_min) / (x_max - x_min).
Feature transformation has different impacts on different models. It has a large impact on SVMs (support vector machines) and k-NN (k-nearest neighbors), which rely on distances in Euclidean space, but little impact on tree models such as random forest or XGBoost.
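A minimal sketch with scikit-learn (assuming X_train and X_test are numeric feature matrices):
from sklearn.preprocessing import StandardScaler, MinMaxScaler
scaler = StandardScaler()  # or MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from the training data only
X_test_scaled = scaler.transform(X_test)  # reuse the training statistics on the test data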
With a broader definition of feature engineering, the generation of embeddings which represent latent features is also regarded as feature engineering. Latent features follow a different set of methodologies. In this article, we only focus on the narrow definition of feature engineering, where shallow features are selected based on expert knowledge, and data is processed with the methodology discussed above. In real-world practice, especially in industry, successful feature engineering is essential for the model's performance.
- Xin