Feature Selection - Feature Filtering
Too many features?
Features are the building blocks of models. Too many building blocks can be a bad thing, however. This problem is quite apparent when developing natural language processing (NLP) models. NLP features, in many cases, are composed of either word counts or normalized frequencies of words that occur in a corpus of text. If the modeled corpus contains 10,000 unique words, that can translate to 10,000 possible features. Adding counts of short sequences of adjacent words (n-grams) greatly increases the dimensionality of the feature space, often to the order of millions. When there is a deluge of features to train a model on, bad things happen.
- Model training times go through the roof.
- There is a risk of overfitting the training data.
- Prediction performance goes down.
- The model is harder to implement in production systems.
- The curse of dimensionality sets in.
While there is no hard and fast rule for how many features a model should have, the number of features typically scales with the amount of data available to be modeled. One general rule that seems to make sense in most situations is to have two to three orders of magnitude more training examples than features (e.g., roughly 10,000 to 100,000 examples for 100 features).
Reducing Features via Filtering Methods
Feature selection is an important part of machine learning processes. One of the most straightforward ways to reduce the number of features going into a model is to compare them directly to the target variable.
The general approach to feature filtering is as follows:
- The relationship of the potential predictors and response are measured using some kind of metric.
- The predictors are ranked from best to worst using the metrics from step 1.
- The $k$ best features are selected from the pool of potential predictors and used to develop the model.
Note that the determination of what value to set for $k$ depends on a number of factors. It may depend on the amount of data available. It may also depend on the training time of the model. One could also try using different cutoffs for the variable selection metrics used. In the end, it is up to the modeler to select the number of features that best achieves the modeler’s goals. These filtering techniques are only a starting point in the model development process.
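As a concrete illustration of this recipe, the sketch below uses scikit-learn's SelectKBest on synthetic data; the library choice, the data, and $k=5$ are assumptions made for the example, not requirements of the approach.

```python
# A minimal sketch of the rank-and-select recipe using scikit-learn.
# The data is synthetic and k=5 is arbitrary; both are assumptions for this example.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))                         # 50 candidate predictors
y = 3 * X[:, 0] - 2 * X[:, 7] + rng.normal(size=500)   # only two are informative

# Steps 1 and 2: score each predictor against the response and rank the scores.
# Step 3: keep only the k best-scoring predictors.
selector = SelectKBest(score_func=f_regression, k=5)
X_filtered = selector.fit_transform(X, y)

print(X_filtered.shape)                    # (500, 5)
print(selector.get_support(indices=True))  # column indices of the retained predictors
```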
There are a few different cases that one can encounter when measuring the relationship between a potential predictor and target variable. These are defined by whether the response or predictors are continuous or categorical. The following subsections attempt to cover which metrics are useful for these different cases.
I. Both Predictors and Response Are Continuous
If both the predictors and the response are continuous, Pearson’s correlation is a good place to start to gauge the relative importance of individual predictors. Pearson’s correlation is well known, and most practitioners already have a general sense of what it means, but a deeper dive into its formulation may provide an even better understanding of how it works.
Pearson’s correlation between predictor $x$ and response $y$ takes the following form:
$$ r_{x,y}=\frac{\mathrm{cov}(x,y)}{\sqrt{\mathrm{var}(x)\mathrm{var}(y)}} $$
Covariance, $\mathrm{cov}$, of $x$ and $y$ is a measure of the joint variability of two random variables. The expanded form of Pearson’s correlation reveals more about this metric:
$$ r_{x,y}=\frac{\sum_{i=1}^n{(x_{i}-\bar{x})(y_{i}-\bar{y})}}{\sqrt{\sum_{i=1}^n{(x_{i}-\bar{x})^2}}\sqrt{\sum_{i=1}^n{(y_{i}-\bar{y})^2}}} $$
Above, $\bar{x}$ represents the mean of variable $x$ and $\bar{y}$ represents the mean of the response variable. Covariance, in the numerator, is positive when the variables change in a similar fashion. For instance, if some data point, $(x_{i}, y_{i})$, has large values for both $x_{i}$ and $y_{i}$ relative to the means of each variable, then the contribution of that data point to the covariance will be largely positive. On the other hand, if data point $(x_{i}, y_{i})$ has a large value for $x_{i}$ and a small value for $y_{i}$ relative to each variable’s mean, then the contribution of that data point to the covariance will be largely negative. The denominator normalizes the covariance such that the metric is bounded to the interval $[-1,1]$.
Using Pearson’s correlation, potential input variables can be ranked according to the strength of their relationship with the response variable. To also consider variables that have a strong inverse relationship with the target, one can use the absolute value of the correlation coefficient, $|r|$.
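As a quick sketch of this ranking step (assuming pandas is available; the data frame, column names, and coefficients below are made up for illustration), predictors can be sorted by $|r|$:

```python
# Sketch: rank candidate predictors by the absolute value of their Pearson
# correlation with the response. Data and column names are illustrative only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "x1": rng.normal(size=200),
    "x2": rng.normal(size=200),
    "x3": rng.normal(size=200),
})
df["target"] = 2.0 * df["x1"] - 0.5 * df["x3"] + rng.normal(size=200)

# Correlate every predictor with the response and sort by |r|.
correlations = df.drop(columns="target").corrwith(df["target"])
ranking = correlations.abs().sort_values(ascending=False)
print(ranking)  # x1 should rank first, x3 second, x2 last
```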
Pearson’s correlation is “scale invariant,” which means that it yields the same result when the variables are linearly transformed. In other words, the metric computed for $x$ and $y$ will be the same if $x$ is changed to $ax$, $x+b$, or $ax+b$, as long as $a>0$. The same holds for making these transformations to $y$. Ultimately, this means that Pearson’s correlation is relatively robust to arbitrary scaling of the variables being tested.
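A small numerical check of this property (synthetic data; NumPy assumed):

```python
# Pearson's r is unchanged by a positive linear transformation of either variable.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=1000)
y = x + rng.normal(scale=0.5, size=1000)

r_original = np.corrcoef(x, y)[0, 1]
r_transformed = np.corrcoef(3.0 * x + 10.0, y)[0, 1]  # a = 3 > 0, b = 10

print(np.isclose(r_original, r_transformed))  # True
```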
There is one fairly significant limitation of Pearson’s correlation: it can only detect linear dependence of the target on the predictor. Even so, it can be used for non-linear relationships if the variable is transformed before testing (e.g., testing $x^2$ or $\log{x}$ instead of just $x$).
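The following sketch (synthetic data) shows a quadratic relationship that Pearson’s correlation misses on the raw predictor but recovers once the predictor is squared:

```python
# Pearson's r only detects linear dependence: for y = x**2 with x symmetric
# about zero, r(x, y) is near 0, but r(x**2, y) is near 1.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1.0, 1.0, size=2000)
y = x ** 2 + rng.normal(scale=0.05, size=2000)

print(np.corrcoef(x, y)[0, 1])       # close to 0 -- the raw predictor looks useless
print(np.corrcoef(x ** 2, y)[0, 1])  # close to 1 -- the transformed predictor is clearly useful
```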
II. Both Predictors and Response Are Categorical
When both the response and predictor are categorical, there is a cornucopia of useful metrics that can be applied. These include chi-squared testing, ANOVA, bi-normal separation, and many more. For the sake of brevity, only mutual information will be covered in detail.
Mutual information for discrete predictor $X$ and discrete response $Y$ is defined as
$$ I(X;Y)=\sum_{x,y}{P_{XY}(x,y)\log{\frac{P_{XY}(x,y)}{P_{X}(x)P_{Y}(y)}}}. $$
Mutual information describes the dependence between two random variables $X$ and $Y$. It quantifies how much information about $Y$ one gains by observing $X$ alone.
Mutual information is derived from entropy, an information-theoretic quantity. The entropy of a random variable $X$, denoted $H(X)$, describes the amount of information contained in $X$. In other words, entropy is a ruler by which one can measure how much information a variable carries. The entropy of $X$ is the expected value (or, more simply, the average) of the information content of $X$ and is written as
$$ H(X)=E[I(X)]=E[-\log_{2}(P(X))]. $$
Information content is simply the negative log probability of random variable $X$. A less abbreviated version of entropy is
$$ H(X)=-\sum_{i=1}^n{P(x_{i})\log_{2}{P(x_{i})}}. $$
Entropy can be better understood as how “surprising” the contents of a message are. For example, suppose that some process is used to communicate a finite set of messages, and each message $x_{i}$ has a probability of occurring, $P(x_{i})$. For a system in which there is only one message in the set, with $P(x_{1})=1$, the entropy is zero, meaning that the process has no intrinsic information value: whenever one receives a message, its content is already known. Now consider a message set with two messages that have equal probabilities of being sent, $P(x_{1})=0.5$ and $P(x_{2})=0.5$. The entropy of this scenario is $1$. Each message in the two-message system is more surprising than a message from the one-message set. Because the logarithm is base $2$, this means that communicating each instance of a message in the two-message system requires 1 bit of information.
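Both scenarios can be checked directly against the entropy formula; the snippet below is a small sketch using the hand-picked probabilities from the example:

```python
# Entropy (in bits) of the two message systems described above.
import numpy as np

def entropy(probabilities):
    """H(X) = -sum(p * log2(p)), skipping zero-probability messages."""
    p = np.asarray(probabilities, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

print(entropy([1.0]))       # 0.0 bits -- a single certain message carries no information
print(entropy([0.5, 0.5]))  # 1.0 bit  -- two equally likely messages
```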
As noted above, mutual information is derived from entropy. Rewriting mutual information as a function of the entropies of $X$ and $Y$,
$$ I(X;Y)=H(X)+H(Y)-H(X,Y), $$
reveals that the mutual information of $X$ and $Y$ is equal to the surprisingness of $X$, plus the surprisingness of $Y$, minus the combined surprisingness of $X$ and $Y$. In other words, mutual information is the intersection of the entropies of $X$ and $Y$. It is the amount of information that $X$ provides about $Y$, or vice versa.
Mutual information is a useful metric for categorical variables because it can handle a variable with any number of categories. For the sake of feature filtering, the number of messages a communication system has is equivalent to the number of categories a particular feature has. Each category of a variable has an observed number of occurrences, which can be converted into a probability and plugged into the mutual information formula.
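The sketch below estimates mutual information (in bits) directly from observed category counts; the helper function, feature names, and labels are made up for illustration:

```python
# Estimate mutual information I(X;Y), in bits, from observed category counts.
import numpy as np

def mutual_information(x, y):
    """I(X;Y) for two equal-length arrays of category labels."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))  # joint probability P(x, y)
            p_x = np.mean(x == xv)                 # marginal P(x)
            p_y = np.mean(y == yv)                 # marginal P(y)
            if p_xy > 0:
                mi += p_xy * np.log2(p_xy / (p_x * p_y))
    return mi

# A predictor perfectly aligned with the response vs. one that is independent of it.
response  = ["spam", "spam", "ham", "ham", "spam", "ham"]
feature_a = ["yes",  "yes",  "no",  "no",  "yes",  "no"]    # informative
feature_b = ["red",  "red",  "red", "red", "blue", "blue"]  # uninformative

print(mutual_information(feature_a, response))  # 1.0 bit -- fully determines the response
print(mutual_information(feature_b, response))  # 0.0 bits -- provides no information
```

If scikit-learn is available, sklearn.metrics.mutual_info_score computes the same quantity from two label arrays (it uses natural logarithms, so the result is reported in nats rather than bits).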
III. Continuous Predictors and Categorical Responses or Vice Versa
The easiest approach to this variable selection problem is to appropriately bin the predictors. This renders them as categorical variables, and the categorical-categorical metrics from section II above are there to save the day.
To bin continuous variables, a quick way to start is to simply divide the continuous variable into quartiles. This is especially applicable if the data is not excessively skewed. A modeler can also use domain knowledge to divide up continuous variables. For example, if one is binning credit scores, there are standard cutoffs that define bands such as ‘Fair’ (580-669) or ‘Exceptional’ (800 and above).
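A sketch of both approaches using pandas (the scores are simulated and the cut points follow the standard bands mentioned above):

```python
# Binning a continuous predictor so that categorical-categorical metrics apply.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
scores = pd.Series(rng.integers(300, 851, size=1000), name="credit_score")

# Option 1: quartile binning -- four equally populated buckets.
quartile_bins = pd.qcut(scores, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Option 2: domain-knowledge cut points (standard credit score bands).
band_edges = [300, 580, 670, 740, 800, 851]   # left-closed bins, e.g. 580-669 -> "Fair"
band_labels = ["Poor", "Fair", "Good", "Very Good", "Exceptional"]
banded = pd.cut(scores, bins=band_edges, labels=band_labels, right=False)

print(quartile_bins.value_counts())
print(banded.value_counts())
```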
If better preservation of the information in the continuous variable is desired, one can attempt more advanced binning techniques, such as choosing bin boundaries that minimize the information (entropy) lost in the binning process.
A Cautionary Note
Variable filtering does not always produce good results. One big thing that filter-based selection misses is the interaction between different predictors. In some situations, individual predictors are not very good on their own but become quite useful in combination with another predictor. The figure below, borrowed from Guyon and Elisseeff, demonstrates this type of situation well. The predictor variables are plotted on the $x$ and $y$ axes of the figure. The distribution plots show that there is little class separation (white vs. black) using either variable alone. Nevertheless, when looking at the two-dimensional representation of the data for the combination of variables, the boundary between the different classes is obvious.
One of the key challenges in looking at the interactions between sets of predictors is computational. When there are many predictors to choose from, the number of potential predictor combinations explodes. More advanced methods for selecting features beyond feature filtering try to strike a balance between a better picture of predictor-predictor relationships and computational complexity. These techniques, namely wrapper and embedded methods, look at the interaction between potential predictors and the response to select feature subsets.
Conclusion
This mini-review of predictor selection through feature filtering only scratches the surface of the metrics available for this task. The application of this approach, however, should be the same no matter which metric is chosen.
If you have any comments, I would love to hear them in the comment box below.