One way to prevent noise from affecting your data-driven projects
Noise is difficult to treat, and every data scientist knows it.
The fact is that, as one dear friend of mine loves to say,
“The hardest part of getting what you want is figuring out what it is”
Indeed, we can’t specify what noise really is. As a physicist, I often find myself studying a dataset and trying to understand whether my data makes physical sense. When a clear pattern can’t be identified in a part of my data (or my signal), I tend to classify that part as “noise”. But this approach can be dangerous and misleading. Moreover, sometimes you just don’t know what you are talking about, or the problem is too complex for you to know what to expect before you actually see it.
So what do we do?
The first step is figuring out what kinds of noise can affect your data. Noise is usually classified by its colour (i.e., how its power depends on frequency in the Fourier spectrum). But I want to put myself in the worst situation.
Let’s pretend we know nothing about the noise sources of our system.
A safe assumption, then, is that our system is disturbed by Gaussian white noise: noise that lives everywhere in the frequency spectrum with roughly the same amplitude and follows a Gaussian distribution with zero mean.
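To make this concrete, here is a minimal sketch of what Gaussian white noise looks like numerically, using NumPy. The sample size and seed are arbitrary choices of mine, not from the article:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 4096
noise = rng.normal(loc=0.0, scale=1.0, size=n)  # Gaussian, mean 0

# White noise has (roughly) flat power at all frequencies, so the
# average power in the lower and upper halves of the Fourier
# spectrum should be comparable.
spectrum = np.abs(np.fft.rfft(noise)) ** 2
low = spectrum[1 : n // 4].mean()
high = spectrum[n // 4 :].mean()

print("sample mean:", noise.mean())
print("low/high power ratio:", low / high)
```

The sample mean hovers near zero and the power ratio near one, which is exactly the “same amplitude everywhere” property described above.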
Ok. Let’s play.
If you think about noise, I’m sure that the image that you picture in your mind is close to this one:
As already said, thought of in one dimension it is a signal that lives at all frequencies, even the highest ones. But your signal is typically band-limited. That means that if you have a way to separate out the highest frequencies of your signal, you can try to distinguish where the noise really lives.
You actually have such a way: it is called the wavelet transform.
I’m far from being a wavelet hero, and you can find plenty of information about them on your own. What we need to know to proceed is that wavelets can filter your signal much like a Fourier transform does, but using different basis functions. In particular, they do so at different levels: the first levels use a small-scale filter, thus investigating the high frequencies, while the last ones use a large-scale filter, thus capturing the lowest ones.
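A quick sketch of a multilevel wavelet decomposition with PyWavelets (assuming `pywt` is installed; the `db4` wavelet, the synthetic signal, and the level count are my illustrative choices, not the article’s):

```python
import numpy as np
import pywt

# Synthetic band-limited signal (slow sine) plus Gaussian white noise.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 1024)
signal = np.sin(2 * np.pi * 5 * t) + 0.3 * rng.normal(size=t.size)

# Multilevel DWT: coeffs = [cA_n, cD_n, ..., cD_1].
# The last entry (cD_1) holds the highest-frequency details,
# the first (cA_n) the lowest-frequency approximation.
coeffs = pywt.wavedec(signal, "db4", level=4)

for name, c in zip(["cA4", "cD4", "cD3", "cD2", "cD1"], coeffs):
    print(name, "->", len(c), "coefficients")
```

Notice how the first-level detail `cD1` is the longest array: it samples the finest (highest-frequency) scale, which is where white noise concentrates relative to a band-limited signal.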
Now let’s start coding.
Friends don’t lie 🙂
The example I’ve used is a daily climate time series from Kaggle: https://www.kaggle.com/sumanthvrao/daily-climate-time-series-data
The mean temperature has been used as an example.
Using the wavelet transform you get this scenario:
As I’ve said, the detail coefficients of the first levels are associated with high frequencies, while those of the last levels are associated with low ones. The first-level detail coefficients plus the approximation coefficients reconstruct the original signal. The first approximation coefficients may therefore serve as our reconstructed signal, since they amount to the original signal minus what we would like to call “noise”.
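The reconstruction step above can be sketched like this: keep the approximation, zero out the first detail, and invert the transform (again with `pywt`, `db4`, and a synthetic stand-in for the temperature series):

```python
import numpy as np
import pywt

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 1024)
original = np.sin(2 * np.pi * 5 * t) + 0.3 * rng.normal(size=t.size)

# Single-level DWT: cA1 is the approximation, cD1 the first detail.
cA1, cD1 = pywt.dwt(original, "db4")

# Reconstruct from the approximation only: the first detail (our
# candidate "noise") is replaced by zeros of the same length.
denoised = pywt.idwt(cA1, np.zeros_like(cD1), "db4")[: original.size]

residual = original - denoised  # this is what we are calling noise
print("residual std:", residual.std())
```

The residual is the high-frequency content that was thrown away; whether it really is noise is exactly what the next check is about.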
The “noise” has actually been filtered out, but we must not be too proud of ourselves.
In fact, an important check that you MUST do when dealing with noise is to verify whether the difference between your reconstructed signal and the original one is correlated with the original signal.
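The check itself is one line with NumPy. Here `original` and `denoised` are hypothetical stand-ins (a random walk and a moving-average smoother) just to make the snippet runnable; in the article they would be the temperature series and its wavelet reconstruction:

```python
import numpy as np

# Hypothetical stand-ins for the raw series and its reconstruction.
rng = np.random.default_rng(1)
original = np.cumsum(rng.normal(size=500)) + rng.normal(size=500)
denoised = np.convolve(original, np.ones(5) / 5, mode="same")

residual = original - denoised

# If the residual were pure noise, this correlation would be ~0.
corr = np.corrcoef(original, residual)[0, 1]
print(f"correlation(original, residual) = {corr:.3f}")
```

A residual that correlates with the signal means you have removed structure, not just noise.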
Hmm. 5% correlation is not that low. Can we do better than that?
As we said before, the noise we would like to treat is Gaussian. But let’s take a look at the histogram of the first detail coefficients:
1. Build the histogram:
2. Make it symmetric:
3. Plot the Gaussian fit:
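The three steps above could be sketched as follows. The `cD1` array here is a synthetic stand-in for the first-level detail coefficients (a Gaussian core plus a small heavy-tailed component), and the bin count is arbitrary:

```python
import numpy as np
from scipy.optimize import curve_fit

# Stand-in for the first-level detail coefficients of the series.
rng = np.random.default_rng(2)
cD1 = np.concatenate([rng.normal(0, 1.0, 2000), rng.normal(0, 4.0, 100)])

# 1. Build the histogram over a range symmetric around zero.
m = np.abs(cD1).max()
counts, edges = np.histogram(cD1, bins=61, range=(-m, m))
centers = (edges[:-1] + edges[1:]) / 2

# 2. Make it symmetric by averaging each bin with its mirror image.
sym_counts = (counts + counts[::-1]) / 2

# 3. Fit a zero-mean Gaussian to the symmetrized histogram.
def gauss(x, a, sigma):
    return a * np.exp(-(x ** 2) / (2 * sigma ** 2))

popt, _ = curve_fit(gauss, centers, sym_counts,
                    p0=[sym_counts.max(), cD1.std()])
sigma_fit = abs(popt[1])
print(f"fitted sigma = {sigma_fit:.2f}")
```

Plotting `gauss(centers, *popt)` against `sym_counts` (e.g. with matplotlib) reproduces the kind of figure discussed next.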
I know, it’s bad, but wait a minute.
If you look closely, the ups and downs in the core of the distribution may be considered statistical fluctuations. In the tails, on the other hand, past a certain point the distribution lies entirely above our Gaussian fit.
This is where we want to attack!
The strategy here is to flatten the core of the Gaussian and extract the important information found in the tails.
Let’s see if it works (SPOILER: it does.)
Since the Gaussian fit gives you the fitted sigma, it has been used to set the threshold: at a certain point of the Gaussian (fitted sigma * threshold), everything between this value and its symmetric counterpart is set to 0.
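A sketch of this core-thresholding step, again with `pywt` and a synthetic stand-in for the series; note that for simplicity the sigma here is just the detail coefficients’ standard deviation rather than the article’s fitted sigma:

```python
import numpy as np
import pywt

rng = np.random.default_rng(3)
t = np.linspace(0, 1, 1024)
original = np.sin(2 * np.pi * 5 * t) + 0.3 * rng.normal(size=t.size)

cA1, cD1 = pywt.dwt(original, "db4")

sigma = cD1.std()  # stand-in for the fitted sigma
threshold = 4      # the "TH" factor from the article

# Zero every detail coefficient inside the Gaussian core
# (|c| < threshold * sigma); keep only the tails, which carry
# the information that would otherwise be thrown away.
cD1_thresh = np.where(np.abs(cD1) < threshold * sigma, 0.0, cD1)

denoised = pywt.idwt(cA1, cD1_thresh, "db4")[: original.size]
print(np.count_nonzero(cD1_thresh), "detail coefficients kept")
```

With a high threshold almost the entire first detail is zeroed, which recovers the naive approximation-only reconstruction as a limiting case.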
Now, as already said, let’s take a look at the correlation, using different thresholds:
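Scanning candidate thresholds and keeping the one that minimizes the absolute correlation between residual and original could look like this (same synthetic stand-in as before; the grid of TH values is an arbitrary choice of mine):

```python
import numpy as np
import pywt

rng = np.random.default_rng(4)
t = np.linspace(0, 1, 1024)
original = np.sin(2 * np.pi * 5 * t) + 0.3 * rng.normal(size=t.size)

cA1, cD1 = pywt.dwt(original, "db4")
sigma = cD1.std()  # stand-in for the fitted sigma

best_th, best_corr = None, np.inf
for th in np.arange(0.5, 6.5, 0.5):
    # Zero the core of the detail distribution at this threshold.
    cD1_t = np.where(np.abs(cD1) < th * sigma, 0.0, cD1)
    denoised = pywt.idwt(cA1, cD1_t, "db4")[: original.size]
    residual = original - denoised
    corr = abs(np.corrcoef(original, residual)[0, 1])
    if corr < best_corr:
        best_th, best_corr = th, corr

print(f"optimal TH = {best_th}, |correlation| = {best_corr:.3f}")
```

On the article’s temperature data this scan is what singles out the value reported below.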
And this result is obtained with the optimal TH=4.
This refinement outclasses the naive wavelet method, yielding an error that is considerably less correlated with the signal.
The result is pretty good even in terms of RMSE. In fact, with this method the RMSE is pretty similar (just a little higher) to the RMSE obtained by simply using the first approximation coefficients (throwing away the entire first detail).
This method is not magic!
Just because it works surprisingly well here does not mean it can be applied to every dataset you have. Some data may not carry Gaussian white noise, but pink, red, or blue noise. Your data itself could be Gaussian-distributed, so that signal and noise become indistinguishable and this method is powerless. Your signal may not be band-limited. And the list goes on.
I just wanted to shine a light on a method that is really efficient when all these conditions hold, and that could be used as part of a more complex machine learning algorithm or data-driven process.