tspDB API Reference

Prediction Index Management

create_pindex ()

This function creates a prediction index on an existing PostgreSQL time series table. Building an index enables predictive queries. see Predictive Queries Section. The table must have a time column chosen through the argument time_column, and one or multiple value columns determined by value_columns.

time_column Must be a column of integers(INT, or BIGINT) or timestamp(TIMESTAMP or TIMESTAMPTZ)

Required Arguments

Name Type Description
table_name text The name of the time series table to be indexed
time_column text The name of the time column to be indexed on. This is the column containing time values, which can be either integers or timestamps
value_columns text array The name of the values column(s) containing time series values to be indexed on.
index_name text The name of the prediction index. This is the index identifier used when executing predictive queries, deleting the index, .. etc.

Optional Arguments

Name Type Description
auto_update boolean Default: true. If true, update the prediction index when new rows are inserted to table_name
agg_interval numeric Default: NULL. Determines the aggregation interval in seconds if the time column is in timestamps. If NULL, it is set to the median of the time difference between the first 100 points.
normalize boolean Default: true. If true, normalize the data before fitting the prediction model. Recommended for indices on multiple time series.
k int Default: NULL. A parameter in the prediction algorithm. It represents the number of singular values to retain in the means prediction model. If NULL, then it is chosen automatically by selecting the $k^{th}$ top singular values that maintain 90% of the signal energy.
T int Default: 2,500,000 A parameter in the prediction algorithm. It denotes the number of entries in each submodel in the means prediction model. T>T_0.
T0 int Default: 100. The minimum number of entries (number of table rows $\times$ number of time series included in the index) for the prediction index to be built.
gamma numeric Default: 0.5. Must be in the interval (0,1], otherwise, it will default to 0.5. A parameter in the prediction algorithm, which determines the number of new entries relative to T after which the last sub-model is fully updated
var_direct boolean Default: true. A parameter in the prediction algorithm. Default value recommended. If true, The variance is estimated directly by subtracting the estimated mean from the observations and then estimating the underlying mean of the difference through matrix estimation. Otherwise, it is estimated through the estimation of the underlying mean of the square of the observations.
col_to_row_ratio int Default: 10. A parameter of the prediction model. It determines the ratio of no. columns to the number of rows in each sub-model's matrix.
L int Default: NULL. If not NULL, col_to_row_ratio is ignored and it determines the number of rows in each sub-model. L should be greater than 2 and less than floor(sqrt(T))
k_var int Default: NULL. A parameter in the prediction algorithm. It represents the number of singular values to retain in the variance prediction model. If NULL, it is chosen automatically by selecting the $k^{th}$ top singular values that maintain 90% of the signal energy. If 0, do not estimate variance.
timescale boolean Default: false. If true, assumes the time-series table is stored as TimescaleDB hypertable, which expedite updating/creating the pindex.

Sample Usage

1 - Create a prediction index pindex1on table measurements on time column time and value columns temprature.

select create_pindex('measurements','time','{"temprature"}','pindex1');


2 - Create a prediction index pindex2on table measurements on time column time and value columns temprature and humidity, but build it on the aggregated data with an agg_interval of 120 seconds

select create_pindex('measurements','time','{"temprature", "humidity"}','pindex2',
agg_interval => 120);


3 - Create a prediction index pindex3on table measurements on time column time and value columns temprature with T = 10000 and without estimating the variance.

select create_pindex('measurements','time','{"temprature", "humidity"}','pindex2',
T => 10000, k_var => 0);


list_pindices()

List all the predictive indices created in your database with some basic information about each index.

Returns

Name Type Description
index_name text The name of the prediction index.
value_columns text array The name of the indexed values column(s).
relation text The name of the indexed time series table.
time_column text The name of the time column indexed on.
initial_timestamp text The first timestamp/integer included in buliding the index.
last_timestamp text The last timestamp/integer included in buliding the index.
agg_interval numeric The aggregation interval in seconds (if the time columns is in timestamps)
uncertainty_quantification boolean true if the variance is estimated by the predicting index.

Sample Usage

To get a table of all prediction indices built in your database:

select * from list_pindices();


Or you can get a specific column for a specific index by writing an SQL query on list_pindices(). For example:

select agg_interval from list_pindices() where index_name = 'pindex1';


delete_pindex()

Delete the predictive index index_name from your database.

Required Arguments

Name Type Description
index_name text name of the predictive index to be deleted

Sample Usage

You can simply delete the created predictive index pindex_1 through

select delete_pindex('pindex_1');


pindices_stat()

Show some statistics regarding the predictive indices. These include in-sample accuracies for imputation and forecasting, out-of-sample accuracy (only when more than 2 sub-models are built), the number of points included in the model. Accuracies are in R2. R2 values typically range between 0 and 1. (One being the perfect accuracy)

Returns

Name Type Description
index_name text Prediction index name.
column_name text A name of a value column indexed by index_name.
number_of_observations numeric The number of datapoints used in the prediction model.
number_of_trained_models numeric The number of sub-models in the prediction model.
imputation_score numeric The in-sample imputation accuracy.
forecast_score numeric The in-sample forecasting accuracy.
test_forecast_score numeric The out-of-sample forecasting accuracy.

Sample Usage

To get a table of the statistics of all prediction indices built in your database:

select * from pindices_stat();


Similarly, you can get it for a specific column and index by

select * from pindices_stat() where index_name  = 'pindex1' and column_name = 'temprature';


update_pindex()

Update the prediction index index_name.

Note: if auto_update was set to true when the index was created, then it will be updated automatically.

Required Arguments

Name Type Description
index_name text Prediction index name.

Sample Usage

To update undex pindex_1:

select  update_pindex('pindex_1');


Predictive Queries

predict ()

Returns the predicted value along with the confidence interval for the value of value_column at time t using index_name

Required Arguments

Name Type Description
table_name text The name of the indexed time series table.
value_column text The name of the indexed value column to be predicted.
t int or Timestamp Index or timestamp indicating the queried time.
index_name text The name of the prediction index.

Optional Arguments

Name Type Description
uq boolean Default: true. If true, return upper and lower bound of the c% confidenc interval
uq_method text Defalut = Gaussian options: {Gaussian, Chebyshev} Uncertainty quantification method used to estimate the confidence interval
c Numeric Default: 95. confidence level for uncertainty quantification, c should be in the range (0,100).

Returns

Name Type Description
prediction numeric Predicted value
LB numeric Lower bound of prediction
UB numeric Upper bound of prediction

Sample Usage

Inline with the examples used in create_pindex() we can predict as follow: 1- To predict the temprature of table measurementsat time t = 100 using pindex1

select predict('measurements','temprature',100,'pindex1');


2- To predict the humidity of table measurementsat time t = 100 using pindex2 with a confidence interval of 98%:

select predict('measurements','humidity',100,'pindex2', c => 98);


predict()

This is the same predict() function, but returns the predicted value for a time range t1 to t2 using index_name along with the confidence interval for the value of value_column

Required Arguments

Name Type Description
table_name text The name of the indexed time series table.
value_column text The name of the indexed value column to be predicted.
t1 int or Timestamp Index or timestamp indicating the start of queried time (inclusive).
t2 int or Timestamp Index or timestamp indicating the end of queried time (inclusive).
index_name text The name of the prediction index.

Optional Arguments

Name Type Description
uq boolean Default: true. If true, return upper and lower bound of the c% confidenc interval
uq_method text Defalut = Gaussian options: {Gaussian, Chebyshev} Uncertainty quantification method used to estimate the confidence interval
c Numeric Default: 95. confidence level for uncertainty quantification, c should be in the range (0,100).

Returns

Name Type Description
prediction numeric Predicted value
LB numeric Lower bound of prediction
UB numeric Upper bound of prediction

Sample Usage

Inline with the examples used in create_pindex() we can predict as follow: 1- To predict the temprature of table measurementsat time range t = 100 to t =110 using pindex1

select predict('measurements','temprature',100,110,'pindex1');


2- To predict the humidity of table measurementsat time range t = 100 to t =110 using Chebyshev as the uncertainity quantification method:

select predict('measurements','humidity',100,110,'pindex2', uq_method => 'Chebyshev');


get_lowerbound()

This method estimated a prediction error lower bound of a discrete time series. As such, the first step is to discretize the given sequence. If the given time series is already discrete, you can ignore the discretization_method. The method then estimates an entropy rate of the time series via Lempel-Ziv compression, and returns the classification error lower bound. The stationarity of the time series is assumed. More detailed explanation is given below, and please visit https://github.com/saeyoung/tslb for more information.

1. Discretization

When the given time series comprises of continuous random variables, we need to discretize the sequence prior to apply this method. See discretization_method.

1. Linear Model

We build a linear model between the compression ratio and the entropy rate. To do so, we sample a number of sequences that are similar to the given time series: that has the same number of observations (sequence length; number_of_observations ) and the same number of alphabets (the number of unique cases that appear in the sequence; k ). samples denote the number of sampled sequences, and each sequence's distribution is known as we set the true distribution for them. Thus, we can obtain 1) the entropy rate from the true distribution and 2) the compression ratio of the sequence via Lempel-Ziv algorithm.

1. Regression

We obtain the compression ratio of the (discretized) time series of interest and calculate the entropy rate using the aforementioned linear model. Finally, we estimate the classification error lower bound using Fano's Inequality. The returned value is the probability of the best classification error.

Required Arguments

Name Type Description
table_name text The name of the time series table.
value_column text The name of the time series column for which a lower bound for prediction error will be calculated.
time_column text The name of the time column in the time series table. This is the column containing time values, which can be either integers or timestamps.

Optional Arguments

Name Type Description
number_of_observations int Default: 1000. Number of observations used to estimate the lower bound. The most recent observations will be selected.
samples int Defalut = 100. Number of samples to construct the linear model.
discretization_method text Defalut = quantization options: {quantization, change, None}. quantization: quantize the sequence into k equilength bins, from min to max. change: create three alphabets, 1) remains the same, 2) increased, and 3) decreased. None skip discretization, which should be used if the time series is already discrete.
k int Default: 3. Number of alphabets in the discretized sequence when discretization_method is set to quantization.

Returns

Name Type Description
lower bound numeric Prediction error lower bound. This is a predicted value for the classification error (probability) of the best estimator based on Fano's Inequality.

Sample Usage

Inline with the examples used before, we can estimate the prediction lower bound for the time series temperature in the table measurements as follow:


select get_lowerbound('measurements','temprature','time');



Similarly, we can change the number of observations used to estimate the lower bound to 2000 through the following query:


select get_lowerbound('measurements','temprature','time', number_of_observations => 2000);



We can convert the time series on a continuous scale into a ternary sequence: increased, decreased, and no change:


select get_lowerbound('measurements','temprature','time', discretization_method => 'change');



Another way to quantize the sequence is to divide the line between the observed minimum and the maximum into k=5 equilength intervals:


select get_lowerbound('measurements','temprature','time', discretization_method => 'quantization', k => 5);