Scheduling Queries in BigQuery
It is possible to run queries on an automated schedule through the BigQuery Data Transfer service. This can be used, for example, to transform data that is dumped into BigQuery daily into a more convenient format for downstream services.
This is a step-by-step guide on how to schedule queries through the Google Console UI.
In our toy example, we receive daily dumps into a date-sharded BigQuery table, and we want to extract two id columns into a separate cumulative table, which will contain the entire set of user_ids present in our data. We write a query which processes one day's worth of data and appends to the cumulative table any user_ids it does not yet contain. We then schedule the query and back-fill it on historical data.
Feel free to also make use of the official documentation on scheduled queries provided by Google.
Navigate to the BigQuery Data Transfer API in the list of APIs, and click the Enable button. If you can see a Manage button instead, the service is already enabled.
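If you prefer the command line, the same API can also be enabled with gcloud; this is a sketch that assumes the gcloud CLI is installed and authenticated against your project:

```bash
# enable the BigQuery Data Transfer API for the current project
gcloud services enable bigquerydatatransfer.googleapis.com
```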
The account scheduling or running the query must have the following permissions:
bigquery.jobs.create
bigquery.datasets.update
bigquery.transfers.get
bigquery.transfers.update
The bigquery.admin role includes these permissions.
It is possible to create a custom role with these permissions and assign it to a (service) account so that it may schedule queries: go to the IAM & Admin Roles tab and click Create Role at the top. Name the new role appropriately, and add the aforementioned permissions. Then attach the role to the account on the IAM page by clicking Edit Principal, clicking Add Another Role, and selecting the newly created role.
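As a command-line sketch of the same setup (the role ID, project ID, and service-account name below are hypothetical placeholders):

```bash
# create a custom role with the four permissions listed above
gcloud iam roles create scheduledQueryRunner \
  --project=my-project \
  --title="Scheduled Query Runner" \
  --permissions=bigquery.jobs.create,bigquery.datasets.update,bigquery.transfers.get,bigquery.transfers.update

# attach the new role to the (service) account
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:scheduler@my-project.iam.gserviceaccount.com" \
  --role="projects/my-project/roles/scheduledQueryRunner"
```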
Navigate to the Scheduled Queries tab of the BigQuery service, and click on Create A Scheduled Query (or create a scheduled query in editor).
Write the query that you wish to be scheduled. You may run it to check whether the output is what you expect; this is separate from scheduling it, and it does not write the result into any destination (unless the query itself does so).
Below is a sample query which selects two id columns from a date-sharded table schedule_test_* on a particular day, but only those rows that are not already present in our destination table mapping_test:
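A sketch of such a query follows. The project and dataset names (my-project, my_dataset) and the second id column (device_id) are hypothetical placeholders; adapt them to your own schema:

```sql
SELECT DISTINCT s.user_id, s.device_id
FROM `my-project.my_dataset.schedule_test_*` AS s
-- read only the shard for the day before the scheduled run date
WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(@run_date, INTERVAL 1 DAY))
  -- keep only user_ids not already present in the cumulative table
  AND s.user_id NOT IN (
    SELECT m.user_id FROM `my-project.my_dataset.mapping_test` AS m
  )
```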
We subtract one day from @run_date because we want the query to process the data that was dumped the day before the date on which the query runs (so when it runs on 2022-10-15, it will process data in the table schedule_test_20221014). We may test this by replacing @run_date with DATE('2022-10-15') to test-run the query in the editor.
Click the Schedule button on the top bar to open a drop-down menu, and select Create new scheduled query.
Name the query appropriately.
Choose how often the query should be run, e.g. Daily.
Set a custom time at which the query will run. Pay attention to the timezone the time is specified in, and make sure the data the query should process is already available at the time it will be running.
Set the destination dataset and table where the query results will be written, and choose Append to table if the results should be appended (and the table is cumulative), or Overwrite table if the query results should replace the table.
In our case we named the query Cumulative user_id query, set it to run daily at 6 AM UTC, and append the results to the mapping_test table (since our query outputs only rows which are not already present in it).
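For reference, a roughly equivalent scheduled query can also be created with the bq command-line tool. This is only a sketch: the project and dataset names are the hypothetical placeholders from above, the query text is abbreviated, and the schedule string uses the Data Transfer Service's custom schedule syntax:

```bash
bq mk \
  --transfer_config \
  --project_id=my-project \
  --target_dataset=my_dataset \
  --display_name="Cumulative user_id query" \
  --data_source=scheduled_query \
  --schedule="every day 06:00" \
  --params='{
    "query": "SELECT DISTINCT user_id, device_id FROM ...",
    "destination_table_name_template": "mapping_test",
    "write_disposition": "WRITE_APPEND"
  }'
```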
You may schedule the query to run under a service account in the Advanced options, and you may opt to receive email notifications when the query fails under Notification options.
You may navigate to the Scheduled queries tab from the left-hand menu. You should see your newly scheduled query.
Clicking on it, you can check its individual runs under the Run History tab, and its configuration under the Configuration tab.
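The same information can also be retrieved from the command line; for example, assuming your transfer configurations live in the US location:

```bash
# list all scheduled queries (transfer configurations) in the project
bq ls --transfer_config --transfer_location=us
```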
Schedule a back-fill (run the query as if it had been scheduled in the past) by clicking the Schedule Backfill button in the top-right corner, then Run for a date range, and use the calendar to select the start and end date for the historical query runs.
After the operation succeeds and you refresh the page, you should see the historical query runs under the Run History tab.
After the runs succeed, you may check whether the destination table contains the data you expected.
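A quick sanity check on the destination table might look like the following (again using the hypothetical my-project.my_dataset names from above); the two counts should match if each user_id appears only once:

```sql
-- compare total rows against distinct user_ids in the cumulative table
SELECT COUNT(*) AS row_count, COUNT(DISTINCT user_id) AS distinct_user_ids
FROM `my-project.my_dataset.mapping_test`
```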
You may make use of @run_date and @run_time to insert the run date or time into the query. Unfortunately, the Google Console does not support running a query in the editor with these parameters (even though scheduling the query will work and it will run properly). You may either test the query using the command-line bq tool, or simply replace the special parameters with a sample date and time when test-running the query, then put the parameters back in when scheduling it.
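As a sketch of the bq approach, @run_date can be bound like an ordinary named query parameter (the project and dataset names are again placeholders, and the query is abbreviated):

```bash
bq query \
  --use_legacy_sql=false \
  --parameter='run_date:DATE:2022-10-15' \
  'SELECT DISTINCT user_id, device_id
   FROM `my-project.my_dataset.schedule_test_*`
   WHERE _TABLE_SUFFIX = FORMAT_DATE("%Y%m%d", DATE_SUB(@run_date, INTERVAL 1 DAY))'
```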