Dimensionality reduction is a machine learning and statistics technique used to reduce the number of input variables (features) in a dataset while preserving as much important information as possible.
Imagine you are trying to describe a car. You could list its weight, engine size, horsepower, top speed, number of doors, color, and fuel efficiency. However, many of these variables are related (e.g., engine size and horsepower). Dimensionality reduction helps you simplify this list to the most essential “signals” without losing the “story” the data is telling.
Why Use It?
In data science, we often face the “Curse of Dimensionality.” As the number of features increases, the volume of the space grows so fast that the data becomes sparse. This can lead to:
- Overfitting: The model learns noise rather than the actual pattern.
- Computational Cost: More features require more memory and processing power.
- Visualization Issues: Humans can easily visualize 2D or 3D data, but anything beyond that is nearly impossible to plot.
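The sparsity problem above can be seen directly with a small experiment: as the number of dimensions grows, points drawn uniformly from a unit cube drift farther apart. This is a minimal sketch assuming NumPy; the sample sizes and dimensions chosen are illustrative, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_nn_dist(dim, n=200):
    """Average nearest-neighbour distance for n uniform points in [0, 1]^dim."""
    pts = rng.random((n, dim))
    # Pairwise Euclidean distances via broadcasting.
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # ignore each point's distance to itself
    return d.min(axis=1).mean()

low, high = mean_nn_dist(2), mean_nn_dist(50)
```

With the same number of points, the average nearest-neighbour distance in 50 dimensions comes out far larger than in 2, which is exactly why models trained on many raw features see “empty” space around each sample.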
Common Techniques
Dimensionality reduction is generally split into two categories: Feature Selection (keeping a subset of the original variables) and Feature Extraction (creating new, smaller variables from the old ones).
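Feature selection can be as simple as keeping only the columns whose variance exceeds a threshold, since a near-constant column carries little signal. Here is a minimal NumPy sketch; the three synthetic columns and the 0.5 cut-off are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(scale=5.0, size=100),   # informative: high spread
    rng.normal(scale=0.01, size=100),  # near-constant: little information
    rng.normal(scale=3.0, size=100),   # informative
])

# Feature selection: keep original columns with variance above a threshold.
keep = X.var(axis=0) > 0.5  # 0.5 is an illustrative choice
X_selected = X[:, keep]     # surviving columns are unchanged originals
```

The key contrast with feature extraction is that the kept columns are untouched originals, whereas a technique like PCA would build entirely new columns out of weighted combinations of all of them.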
1. Principal Component Analysis (PCA)
PCA is the most popular linear technique. It transforms the data into a new coordinate system. The first “principal component” captures the maximum variance (spread) in the data.
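PCA can be written in a few lines of NumPy using the singular value decomposition of the centered data. This sketch uses a synthetic two-column dataset with strongly correlated columns (echoing the engine-size/horsepower example; the numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two correlated columns, like engine size vs. horsepower (synthetic).
X = rng.normal(size=(100, 2)) @ np.array([[3.0, 2.9], [0.3, -0.3]])

Xc = X - X.mean(axis=0)                       # center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)               # variance ratio per component
Z = Xc @ Vt[0]                                # project onto first component
```

Because the two columns move together, the first principal component alone captures the vast majority of the variance, so one derived feature can stand in for two correlated originals.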
2. t-Distributed Stochastic Neighbor Embedding (t-SNE)
t-SNE is a non-linear technique mainly used for visualization. It is excellent at taking high-dimensional data and projecting it into 2D or 3D while keeping similar data points close together.
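A typical t-SNE workflow uses scikit-learn's `TSNE` to squeeze high-dimensional data into two coordinates for plotting. This sketch assumes scikit-learn is available and uses its bundled digits dataset (64 pixel features per image) purely as an example:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)  # 64-dimensional digit images
X, y = X[:300], y[:300]              # subsample to keep the run fast

# perplexity roughly sets how many neighbours each point "attends" to.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
# emb has shape (300, 2): ready to scatter-plot, coloured by y.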
3. Linear Discriminant Analysis (LDA)
Unlike PCA, which focuses on variance, LDA focuses on maximizing separability between different classes. It is often used as a preprocessing step for classification tasks.
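Because LDA uses the class labels, it is a supervised reducer: `fit_transform` takes both `X` and `y`. A minimal sketch with scikit-learn's iris dataset (4 features, 3 classes, chosen here purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # 4 features, 3 classes

# LDA yields at most (n_classes - 1) axes, so 2 here.
lda = LinearDiscriminantAnalysis(n_components=2)
Z = lda.fit_transform(X, y)        # note: labels are required
```

The `n_classes - 1` cap is the practical difference from PCA, which can produce as many components as there are features regardless of labels.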
The Benefits at a Glance
| Benefit | Description |
| --- | --- |
| Data Compression | Reduces storage space and speeds up algorithms. |
| Noise Removal | By discarding dimensions with low variance, you often filter out “random” noise. |
| Better Visualization | Allows complex data to be plotted on a simple X-Y graph. |
| Simpler Models | Leads to models that are easier to interpret and less prone to errors. |
