# Scheduling Queries in BigQuery

Queries can be run on a schedule through the BigQuery Data Transfer Service. This is useful, for example, for transforming data that is dumped into BigQuery daily into a more convenient format for downstream services.

This is a step-by-step guide on how to schedule queries through the Google Console UI.

In our toy example, we receive daily dumps into a date-sharded BigQuery table, and we want to extract two ID columns into a separate cumulative table that will contain the entire set of `user_id`s present in our data. We write a query that processes one day's worth of data and appends the `user_id`s that are not yet in the cumulative table. We then schedule the query and back-fill it on historical data.

#### Official Documentation

Feel free to make use of the official documentation provided by Google:

* [Enabling the BigQuery Data Transfer service](https://cloud.google.com/bigquery/docs/enable-transfer-service)
* [Scheduling Queries](https://cloud.google.com/bigquery/docs/scheduling-queries)
* [Formatting the query schedule](https://cloud.google.com/appengine/docs/flexible/scheduling-jobs-with-cron-yaml#formatting_the_schedule)

#### Enable the BigQuery Data Transfer service

Navigate to the [BigQuery Data Transfer API](https://console.cloud.google.com/apis/library/bigquerydatatransfer.googleapis.com) page in the API Library, and click the **Enable** button. If you see a **Manage** button instead, the service is already enabled.
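If you prefer the command line, the same API can be enabled with the `gcloud` CLI (this assumes the Cloud SDK is installed; `PROJECT_ID` is a placeholder for your project):

```shell
# Enable the BigQuery Data Transfer API for the given project
gcloud services enable bigquerydatatransfer.googleapis.com --project=PROJECT_ID
```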

#### Permissions

The account scheduling or running the query must have the following permissions:

* `bigquery.jobs.create`
* `bigquery.datasets.update`
* `bigquery.transfers.get`
* `bigquery.transfers.update`

The `bigquery.admin` role includes these permissions.

To let a (service) account schedule queries, you can create a custom role with these permissions and assign it to the account: go to the **Roles** tab of the [Google Console IAM & Admin](https://console.cloud.google.com/iam-admin/roles) service and click **Create Role** at the top.

Name the new role appropriately, and add the aforementioned permissions.

<figure><img src="https://3188841492-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpAGtOmjs5vysomqgbUw8%2Fuploads%2FhCG5uDtVsjOpabr2wR7T%2Fimage.png?alt=media&#x26;token=08e2ba99-a238-4093-b6aa-e8631233788d" alt=""><figcaption></figcaption></figure>
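As a command-line alternative, the same custom role can be created with `gcloud`; in this sketch the role ID and title are our own choices, and `PROJECT_ID` is a placeholder:

```shell
# Create a custom role carrying the four permissions needed for scheduled queries
gcloud iam roles create scheduledQueryRunner \
  --project=PROJECT_ID \
  --title="Scheduled Query Runner" \
  --permissions=bigquery.jobs.create,bigquery.datasets.update,bigquery.transfers.get,bigquery.transfers.update
```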

Then, attach the role to the account by going to the [main IAM page](https://console.cloud.google.com/iam-admin/iam), clicking **Edit Principal**, clicking **Add Another Role**, and selecting the newly created role.

<figure><img src="https://3188841492-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpAGtOmjs5vysomqgbUw8%2Fuploads%2FGBvubLU4ZiVr1dpAB1DL%2Fimage.png?alt=media&#x26;token=d0607f01-7b85-4c66-9454-52eca38867a6" alt=""><figcaption></figcaption></figure>

<figure><img src="https://3188841492-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpAGtOmjs5vysomqgbUw8%2Fuploads%2Fr7nMf2gfIJtqm9lE3DKz%2Fimage.png?alt=media&#x26;token=1890c8f5-0d9b-42a2-a94f-cd8257ff3f9c" alt=""><figcaption></figcaption></figure>
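The role binding can also be added from the command line; a sketch assuming a service account and the custom role ID `scheduledQueryRunner` from above (all names are placeholders):

```shell
# Bind the custom role to a service account at the project level
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:SA_NAME@PROJECT_ID.iam.gserviceaccount.com" \
  --role="projects/PROJECT_ID/roles/scheduledQueryRunner"
```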

#### Writing the query

Navigate to the **Scheduled Queries** tab of the BigQuery service and click **Create A Scheduled Query** (or create a scheduled query from the editor).

<figure><img src="https://3188841492-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpAGtOmjs5vysomqgbUw8%2Fuploads%2FIFRgxS22RNkcG3lIAdab%2Fimage.png?alt=media&#x26;token=c0376b91-dce9-47b2-929c-c655ef36dade" alt=""><figcaption></figcaption></figure>

Write the query that you wish to be scheduled. You may run it to check whether the output is what you expect; this is separate from scheduling it, and it does not write the result into any destination (unless the query itself does so).

You may make use of the [special parameters](https://cloud.google.com/bigquery/docs/scheduling-queries#available_parameters) `@run_date` and `@run_time` to insert the run date or time into the query. Unfortunately, the Google Console editor does not support running a query with these parameters (even though scheduling such a query works and it will run properly). You can test the query either with the command-line `bq` tool, or by temporarily replacing the special parameters with a sample date and time for the test run, then restoring them when scheduling.
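As a sketch of the `bq` approach, the `--parameter` flag supplies a value for a named query parameter in `name:type:value` form; the toy query below only checks the date arithmetic, so it needs no dataset access beyond an authenticated project:

```shell
# Supply a value for @run_date and verify the computed table suffix
bq query \
  --use_legacy_sql=false \
  --parameter='run_date:DATE:2022-10-15' \
  'SELECT FORMAT_DATE("%Y%m%d", DATE_SUB(@run_date, INTERVAL 1 DAY)) AS suffix'
```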

Below is a sample query that selects two ID columns from the date-sharded table `schedule_test_*` for a particular day, keeping only rows that are not already present in our destination table `mapping_test`:

```sql
WITH id_mapping AS (
  SELECT `one_id` AS `user_id`, `another_id` AS `real_id`
  FROM `custom_data`.`schedule_test_*`
  WHERE `_TABLE_SUFFIX` = FORMAT_DATE('%Y%m%d', DATE_SUB(@run_date, INTERVAL 1 DAY))
  GROUP BY `one_id`, `another_id`
)

SELECT `user_id`, `real_id`
FROM id_mapping
WHERE `user_id` IS NOT NULL

EXCEPT DISTINCT

SELECT `user_id`, `real_id`
FROM `custom_data.mapping_test`
```

We subtract one day from `@run_date` because we want the query to process the data dumped the day before the query runs (so when it runs on `2022-10-15`, it processes the table `schedule_test_20221014`). We may test this by replacing `@run_date` with `DATE('2022-10-15')` to test-run the query in the editor.
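The one-day shift can also be sanity-checked locally, for instance with GNU `date` (assumed available):

```shell
run_date="2022-10-15"
# Compute the _TABLE_SUFFIX of the previous day's shard
suffix="$(date -d "${run_date} -1 day" +%Y%m%d)"
echo "schedule_test_${suffix}"   # schedule_test_20221014
```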

#### Scheduling the query

Click the **Schedule** button on the top bar to open a drop-down menu, and select **Create new scheduled query**.

<figure><img src="https://3188841492-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpAGtOmjs5vysomqgbUw8%2Fuploads%2F5wPCdcZuZejp72g5iDAI%2Fimage.png?alt=media&#x26;token=41c624d1-a4b8-4e1f-bb7b-d9db8d14d27d" alt=""><figcaption></figcaption></figure>

Name the query appropriately.

Choose how often the query should be run, e.g. Daily.

Set a custom time when the query will be run. **Mind the timezone the time is specified in**, and make sure that **the data the query should process is available by the time it runs**.

Set the destination dataset and table where the query results will be written. Choose `Append to table` if the results should be appended (the table is cumulative), or `Overwrite table` if the results should replace the table's contents.

<figure><img src="https://3188841492-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpAGtOmjs5vysomqgbUw8%2Fuploads%2F2EPuKWHnP69xGcFBszOl%2Fimage.png?alt=media&#x26;token=581e8397-a91e-41b3-a693-c8b809f5a263" alt=""><figcaption></figcaption></figure>

In our case we named the query `Cumulative user_id query`, set it to run daily at 6AM UTC, and append the results to the `mapping_test` table (since our query outputs only rows which are not already present in it).

You may schedule the query to run under a service account in the **Advanced options**, and you may opt to receive email notifications when the query fails under **Notification options**.
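The same scheduled query can also be created from the command line. The sketch below follows the `bq mk --transfer_config` form from Google's scheduling-queries documentation; the project ID is a placeholder and the query body is abbreviated (paste the full query from above in its place):

```shell
# Create a scheduled query writing into custom_data.mapping_test, daily at 06:00 UTC
bq mk --transfer_config \
  --project_id=PROJECT_ID \
  --data_source=scheduled_query \
  --target_dataset=custom_data \
  --display_name='Cumulative user_id query' \
  --schedule='every day 06:00' \
  --params='{
    "query": "SELECT ...",
    "destination_table_name_template": "mapping_test",
    "write_disposition": "WRITE_APPEND"
  }'
```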

#### Managing the query & executing a back-fill

You may navigate to the **Scheduled queries** tab from the left-hand menu. You should see your newly scheduled query.

<figure><img src="https://3188841492-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpAGtOmjs5vysomqgbUw8%2Fuploads%2Famc1AzGTG1tiYTL1mXeF%2Fimage.png?alt=media&#x26;token=9c87e38d-a639-4ebf-8568-e61b0db5fa6d" alt=""><figcaption></figcaption></figure>

Click on it to check its individual runs under the **Run History** tab, and its configuration under the **Configuration** tab.

Schedule a back-fill (run the query as if it had been scheduled in the past) by clicking the **Schedule Backfill** button in the top-right corner, then **Run for a date range**, and use the calendar to select the start and end date for the historical query runs.

<figure><img src="https://3188841492-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpAGtOmjs5vysomqgbUw8%2Fuploads%2FhdYBgCFdzUp14wr89oyd%2Fimage.png?alt=media&#x26;token=74aea079-e6aa-4e62-a9dc-30d55ff7e656" alt=""><figcaption></figcaption></figure>

<figure><img src="https://3188841492-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpAGtOmjs5vysomqgbUw8%2Fuploads%2FsA9KxVlTg6oFThZKjVLd%2Fimage.png?alt=media&#x26;token=b2ba9f0d-3ae5-43ed-b9ef-8175eab24386" alt=""><figcaption></figcaption></figure>
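A back-fill can likewise be requested with the `bq` tool; a sketch, where the transfer-config resource name (project number, location, and config ID are placeholders here) can be found on the query's **Configuration** tab:

```shell
# Schedule back-fill runs for the given date range (timestamps in UTC)
bq mk --transfer_run \
  --start_time='2022-10-01T06:00:00Z' \
  --end_time='2022-10-15T06:00:00Z' \
  projects/PROJECT_NUMBER/locations/us/transferConfigs/CONFIG_ID
```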

Once the operation succeeds, refresh the page; you should see the historical query runs under the **Run History** tab.

<figure><img src="https://3188841492-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpAGtOmjs5vysomqgbUw8%2Fuploads%2FrYnh6UEOHbyOXkmShcyX%2Fimage.png?alt=media&#x26;token=04d1095d-c906-47f0-b1f7-213148ab3df4" alt=""><figcaption></figcaption></figure>

After the runs succeed, you may check whether the destination table contains the data you expected.
