Downsampling and data retention
InfluxDB can handle hundreds of thousands of data points per second. Working with that much data over a long period of time can create storage concerns. A natural solution is to downsample the data; keep the high precision raw data for only a limited time, and store the lower precision, summarized data for much longer or forever.
InfluxDB offers two features - continuous queries (CQ) and retention policies (RP) - that automate the process of downsampling data and expiring old data. This guide describes a practical use case for CQs and RPs and covers how to set up those features in InfluxDB databases.
A continuous query (CQ) is an InfluxQL query that runs automatically and
periodically within a database.
CQs require a function in the
SELECT clause and must include a
GROUP BY time() clause.
A retention policy (RP) is the part of InfluxDB data structure
that describes for how long InfluxDB keeps data.
InfluxDB compares your local server’s timestamp to the timestamps on your data
and deletes data that are older than the RP’s
A single database can have several RPs and RPs are unique per database.
This guide will not go into detail about the syntax for creating and managing CQs and RPs. If you’re new to both concepts, we recommend looking over the detailed CQ documentation and RP documentation.
This section uses fictional real-time data that track the number of food orders
to a restaurant via phone and via website at ten second intervals.
We will store those data in a
in the fields
name: orders ------------ time phone website 2016-05-10T23:18:00Z 10 30 2016-05-10T23:18:10Z 12 39 2016-05-10T23:18:20Z 11 56
Assume that, in the long run, we’re only interested in the average number of orders by phone and by website at 30 minute intervals. In the next steps, we use RPs and CQs to:
- Automatically aggregate the ten-second resolution data to 30-minute resolution data
- Automatically delete the raw, ten-second resolution data that are older than two hours
- Automatically delete the 30-minute resolution data that are older than 52 weeks
We perform the following steps before writing the data to the database
We do this before inserting any data because CQs only run against recent
data; that is, data with timestamps that are no older than
FOR clause of the CQ, or
now() minus the
GROUP BY time() interval if
the CQ has no
1. Create the database
> CREATE DATABASE "food_data"
2. Create a two-hour
DEFAULT retention policy
InfluxDB writes to the
DEFAULT retention policy if we do not supply an explicit RP when
writing a point to the database.
We make the
DEFAULT RP keep data for two hours, because we want InfluxDB to
automatically write the incoming ten-second resolution data to that RP.
CREATE RETENTION POLICY
statement to create a
> CREATE RETENTION POLICY "two_hours" ON "food_data" DURATION 2h REPLICATION 1 DEFAULT
That query creates an RP called
two_hours that exists in the database
two_hours keeps data for a
DURATION of two hours (
2h) and it’s the
RP for the database
The replication factor (
REPLICATION 1) is a required parameter set to 1 for single node instances. For multiple data nodes in a cluster, the replication factor must be evenly divisible into the number of data nodes. For example, a replication factor of 2 works with 2, 4, or 6 data nodes, and so on. A replication factor of 3 works with 3, 6, or 9 data nodes, and so on.
Note: When we created the
food_datadatabase in step 1, InfluxDB automatically generated an RP named
autogenand set it as the
DEFAULTRP for the database. The
autogenRP has an infinite retention period. With the query above, the RP
DEFAULTRP for the
3. Create a 52-week retention policy
Next we want to create another retention policy that keeps data for 52 weeks and is not the
DEFAULT retention policy (RP) for the database.
Ultimately, the 30-minute rollup data will be stored in this RP.
CREATE RETENTION POLICY
statement to create a non-
DEFAULT retention policy:
> CREATE RETENTION POLICY "a_year" ON "food_data" DURATION 52w REPLICATION 1
That query creates a retention policy (RP) called
a_year that exists in the database
a_year setting keeps data for a
DURATION of 52 weeks (
Leaving out the
DEFAULT argument ensures that
a_year is not the
RP for the database
That is, write and read operations against
food_data that do not specify an
RP will still go to the
two_hours RP (the
4. Create the continuous query
Now that we’ve set up our RPs, we want to create a continuous query (CQ) that will automatically and periodically downsample the ten-second resolution data to the 30-minute resolution, and then store those results in a different measurement with a different retention policy.
CREATE CONTINUOUS QUERY
statement to generate a CQ:
> CREATE CONTINUOUS QUERY "cq_30m" ON "food_data" BEGIN SELECT mean("website") AS "mean_website",mean("phone") AS "mean_phone" INTO "a_year"."downsampled_orders" FROM "orders" GROUP BY time(30m) END
That query creates a CQ called
cq_30m in the database
cq_30m tells InfluxDB to calculate the 30-minute average of the two fields
phone in the measurement
orders and in the
It also tells InfluxDB to write those results to the measurement
downsampled_orders in the retention policy
a_year with the field keys
InfluxDB will run this query every 30 minutes for the previous 30 minutes.
Note: Notice that we fully qualify (that is, we use the syntax
"<retention_policy>"."<measurement>") the measurement in the
INTOclause. InfluxDB requires that syntax to write data to an RP other than the
With the new CQ and two new RPs,
food_data is ready to start receiving data.
After writing data to our database and letting things run for a bit, we see
> SELECT * FROM "orders" LIMIT 5 name: orders --------- time phone website 2016-05-13T23:00:00Z 10 30 2016-05-13T23:00:10Z 12 39 2016-05-13T23:00:20Z 11 56 2016-05-13T23:00:30Z 8 34 2016-05-13T23:00:40Z 17 32 > SELECT * FROM "a_year"."downsampled_orders" LIMIT 5 name: downsampled_orders --------------------- time mean_phone mean_website 2016-05-13T15:00:00Z 12 23 2016-05-13T15:30:00Z 13 32 2016-05-13T16:00:00Z 19 21 2016-05-13T16:30:00Z 3 26 2016-05-13T17:00:00Z 4 23
The data in
orders are the raw, ten-second resolution data that reside in the
The data in
downsampled_orders are the aggregated, 30-minute resolution data
that are subject to the 52-week RP.
Notice that the first timestamps in
downsampled_orders are older than the first
This is because InfluxDB has already deleted data from
orders with timestamps
that are older than our local server’s timestamp minus two hours (assume we
SELECT queries at
InfluxDB will only start dropping data from
downsampled_orders after 52 weeks.
- Notice that we fully qualify (that is, we use the syntax
downsampled_ordersin the second
SELECTstatement. We must specify the RP in that query to
SELECTdata that reside in an RP other than the
- By default, InfluxDB checks to enforce an RP every 30 minutes.
ordersmay have data that are older than two hours. The rate at which InfluxDB checks to enforce an RP is a configurable setting, see Database Configuration.
Using a combination of RPs and CQs, we’ve successfully set up our database to automatically keep the high precision raw data for a limited time, create lower precision data, and store that lower precision data for a longer period of time. Now that you have a general understanding of how these features can work together, check out the detailed documentation on CQs and RPs to see all that they can do for you.
Support and feedback
Thank you for being part of our community! We welcome and encourage your feedback and bug reports for InfluxDB and this documentation. To find support, the following resources are available:
InfluxDB Cloud and InfluxDB Enterprise customers can contact InfluxData Support.