Sensible Sampling and Interpolation for Time Series Data

Geno Tech
4 min read · Jan 11, 2021

Raw data recorded whenever a change is detected in the stream captures reality more faithfully than time-based sampling. However, most comparative analyses of time series require sampling and interpolation, which limits the usefulness of raw data for extended analysis.

You might wonder how sampling can distort the realistic nature of the data (apart from the probable anomalies) and lead to false interpretations. Let's focus on our theme.

For this analysis I am using a few selected variables from water-quality datasets recorded at Baffle Creek and the Burnett River. The raw datasets are sampled at two different frequencies (10 and 30 minutes respectively). Although unequally sampled datasets are not ideal, they can demonstrate the impact of sampling on a comparative study.

First, I will combine the two datasets into a single dataset, without sampling, using the pandas merge function.

The two datasets reside in the “selected” folder and are merged on the “TIMESTAMP” field, preserving ascending order. The output is saved to a CSV file named “combined_buff_burnett.csv”.
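A minimal sketch of this merge, assuming hypothetical file names baffle_creek.csv and burnett_river.csv (the actual names in the repository may differ):

```python
import pandas as pd

# Hypothetical file names inside the "selected" folder;
# the actual names in the repository may differ.
baffle = pd.read_csv("selected/baffle_creek.csv", parse_dates=["TIMESTAMP"])
burnett = pd.read_csv("selected/burnett_river.csv", parse_dates=["TIMESTAMP"])

# An outer merge on TIMESTAMP keeps every record from both streams;
# sorting afterwards preserves the ascending time order.
combined = pd.merge(baffle, burnett, on="TIMESTAMP", how="outer",
                    suffixes=("_baffle", "_burnett"))
combined = combined.sort_values("TIMESTAMP")

combined.to_csv("combined_buff_burnett.csv", index=False)
```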

The Pearson Correlation Coefficient (PCC) was calculated for the combined dataset, and a heat-map of the results is given below.
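The original plotting code is not shown here, but a typical sketch with pandas and seaborn would be:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise Pearson correlation across the numeric columns.
pcc = combined.corr(method="pearson", numeric_only=True)

plt.figure(figsize=(10, 8))
sns.heatmap(pcc, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("PCC between variables in the two un-sampled datasets")
plt.tight_layout()
plt.show()
```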

Figure 1 : PCC between variables in the two un-sampled datasets

From the results, it is clear that correlations exist between variables within each dataset, but not between variables across the two datasets.

In the next step, I resampled both datasets to a 10-minute sampling rate, which ensures there is a record at 0, 10, 20, 30, 40 and 50 minutes past each clock hour.

The median of a rolling 10-minute window was calculated, and the generated values were resampled at each 10-minute mark within the clock hour. The process can leave empty values because of the lower frequency of the Burnett River dataset and because of missing records in the raw data. Cubic interpolation was used to fill those values.
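In pandas, the resampling and gap-filling might look like the sketch below; the exact rolling-median logic in the original notebooks may differ:

```python
# Index by timestamp so pandas can resample on time.
combined = combined.set_index("TIMESTAMP")

# Median within each 10-minute bin, aligned to 0, 10, 20, ... past the hour.
sampled_10 = combined.resample("10min").median()

# Fill the gaps left by the lower-frequency Burnett River stream and by
# missing raw records, using cubic interpolation (requires SciPy).
sampled_10 = sampled_10.interpolate(method="cubic")
```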

The Baffle Creek dataset has the higher frequency, with a 10-minute gap, so this choice ensures all of its data points are utilized. However, the level of interpolation required for the Burnett River dataset (originally a 30-minute sampling rate) is significantly larger.

The heat-map produced from the PCC of the dataset sampled at every 10 minutes is given below.

Figure 2 : PCC between variables after sampling both datasets at 10 minutes

In the final step, I resampled both datasets to a 30-minute sampling rate, which ensures there is a record at 0 and 30 minutes past each clock hour.

The median of a rolling 30-minute window was calculated, and the generated values were resampled at each 30-minute mark within the clock hour. Here, interpolation is only required for records missing from the raw datasets. However, two samples from the high-frequency Baffle Creek dataset go unused in each 30-minute window.
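The same sketch applies at the coarser rate, reusing the time-indexed combined frame from above:

```python
# Median within each 30-minute bin; only genuinely missing raw records
# now need cubic interpolation.
sampled_30 = combined.resample("30min").median()
sampled_30 = sampled_30.interpolate(method="cubic")
```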

The heat-map produced from the PCC of the dataset sampled at every 30 minutes is given below.

Figure 3 : PCC between variables after sampling both datasets at 30 minutes

Comparing Figure 1 with Figures 2 and 3, it is clear that sampling has generated positive and negative correlations that do not exist in the raw datasets. Furthermore, the 30-minute sampling process skips more data from the raw dataset, which amplifies the apparent correlations between variables across the two datasets.

Conclusion

Sampling, interpolation, or sampled data collection is essential for comparative studies, but using them without care tends to distort the data and provide misleading information to the audience.

I suggest adopting a trial-and-error approach, visually comparing the results before settling on the optimum sampling rate. At the same time, the approach followed for the rolling window and interpolation can be further fine-tuned, which will be discussed in another article.

Modern society is rapidly being shaped by the conclusions of data-driven scientific studies. Before believing those conclusions, it is important to consider how far the data was manipulated by the authors, especially in the context of time series data.

Ready to play with the data and code? :) Feel free to use the Spyder notebooks in the GitHub repository.
