- Similar to how a relational database organizes data into tables within a single database, BigQuery uses datasets to group tables. In a migration project where data from a database is moved to Google Cloud Platform's BigQuery, a dedicated dataset can be created to house all the migrated tables.
View :
- If the underlying table structure changes, views may need to be updated, which can add to maintenance overhead.
- Views are dependent on the underlying tables. If a table is dropped or significantly altered, the view will become invalid.
- Materialized views, which store the result of the query, require additional storage space.
- Overuse of views can lead to a complex database schema, making it difficult to understand and maintain.
you can find the tables name as table id for query in under details tab
- Imagine a sales table. You could partition it by "year" to create separate partitions for each year's data.
- Within each year's partition, you could then cluster the data by "customer ID." This would ensure that all sales records for a specific customer within that year are stored close together.
- When you query sales for a specific customer in a specific year, the system can quickly locate and retrieve the relevant data.
Types of partitioning referred from documentation Introduction to partitioned tables | BigQuery | Google Cloud
This section describes the different ways to partition a table.
Integer range partitioning
You can partition a table based on ranges of values in a specific INTEGER
column. To create an integer-range partitioned table, you provide:
- The partitioning column.
- The starting value for range partitioning (inclusive).
- The ending value for range partitioning (exclusive).
- The interval of each range within the partition.
For example, suppose you create an integer range partition with the following specification:
Argument | Value |
---|---|
column name | customer_ |
start | 0 |
end | 100 |
interval | 10 |
The table is partitioned on the customer_id
column into ranges of interval 10. The values 0 to 9 go into one partition, values 10 to 19 go into the next partition, etc., up to 99. Values outside this range go into a partition named __UNPARTITIONED__
. Any rows where customer_id
is NULL
go into a partition named __NULL__
.
For information about integer-range partitioned tables, see Create an integer-range partitioned table.
Time-unit column partitioning
You can partition a table on a DATE
,TIMESTAMP
, or DATETIME
column in the table. When you write data to the table, BigQuery automatically puts the data into the correct partition, based on the values in the column.
For TIMESTAMP
and DATETIME
columns, the partitions can have either hourly, daily, monthly, or yearly granularity. For DATE
columns, the partitions can have daily, monthly, or yearly granularity. Partition boundaries are based on UTC time.
For example, suppose that you partition a table on a DATETIME
column with monthly partitioning. If you insert the following values into the table, the rows are written to the following partitions:
Column value | Partition (monthly) |
---|---|
DATETIME( | 201901 |
DATETIME( | 201901 |
DATETIME( | 201904 |
In addition, two special partitions are created:
__NULL__
: Contains rows withNULL
values in the partitioning column.__UNPARTITIONED__
: Contains rows where the value of the partitioning column is earlier than 1960-01-01 or later than 2159-12-31.
For information about time-unit column-partitioned tables, see Create a time-unit column-partitioned table.
Ingestion time partitioning
When you create a table partitioned by ingestion time, BigQuery automatically assigns rows to partitions based on the time when BigQuery ingests the data. You can choose hourly, daily, monthly, or yearly granularity for the partitions. Partitions boundaries are based on UTC time.
If your data might reach the maximum number of partitions per table when using a finer time granularity, use a coarser granularity instead. For example, you can partition by month instead of day to reduce the number of partitions. You can also cluster the partition column to further improve performance.
An ingestion-time partitioned table has a pseudocolumn named _PARTITIONTIME
. The value of this column is the ingestion time for each row, truncated to the partition boundary (such as hourly or daily). For example, suppose that you create an ingestion-time partitioned table with hourly partitioning and send data at the following times:
Ingestion time | _ | Partition (hourly) |
---|---|---|
2021-05-07 17:22:00 | 2021-05-07 17:00:00 | 2021050717 |
2021-05-07 17:40:00 | 2021-05-07 17:00:00 | 2021050717 |
2021-05-07 18:31:00 | 2021-05-07 18:00:00 | 2021050718 |
Because the table in this example uses hourly partitioning, the value of _PARTITIONTIME
is truncated to an hour boundary. BigQuery uses this value to determine the correct partition for the data.
You can also write data to a specific partition. For example, you might want to load historical data or adjust for time zones. You can use any valid date between 0001-01-01 and 9999-12-31. However, DML statements cannot reference dates prior to 1970-01-01 or after 2159-12-31. For more information, see Write data to a specific partition.
Instead of using _PARTITIONTIME
, you can also use _PARTITIONDATE
. The _PARTITIONDATE
pseudocolumn contains the UTC date corresponding to the value in the _PARTITIONTIME
pseudocolumn.
Select daily, hourly, monthly, or yearly partitioning
When you partition a table by time-unit column or ingestion time, you choose whether the partitions have daily, hourly, monthly, or yearly granularity.
Daily partitioning is the default partitioning type. Daily partitioning is a good choice when your data is spread out over a wide range of dates, or if data is continuously added over time.
Choose hourly partitioning if your tables have a high volume of data that spans a short date range — typically less than six months of timestamp values. If you choose hourly partitioning, make sure the partition count stays within the partition limits.
Choose monthly or yearly partitioning if your tables have a relatively small amount of data for each day, but span a wide date range. This option is also recommended if your workflow requires frequently updating or adding rows that span a wide date range (for example, more than 500 dates). In these scenarios, use monthly or yearly partitioning along with clustering on the partitioning column to achieve the best performance. For more information, see Combining clustered and partitioning tables in this document.
Combining clustered and partitioned tables
You can combine table partitioning with table clustering to achieve finely grained sorting for further query optimization.
A clustered table contains clustered columns that sort data based on user-defined sort properties. Data within these clustered columns are sorted into storage blocks which are adaptively sized based on the size of the table. When you run a query that filters by the clustered column, BigQuery only scans the relevant blocks based on the clustered columns instead of the entire table or table partition. In a combined approach using both table partitioning and clustering, you first segment table data into partitions, then you cluster the data within each partition by the clustering columns.
When you create a table that is clustered and partitioned, you can achieve more finely grained sorting, as the following diagram shows:
Partitioning versus sharding
Table sharding is the practice of storing data in multiple tables, using a naming prefix such as [PREFIX]_YYYYMMDD
.
Partitioning is recommended over table sharding because partitioned tables perform better. With sharded tables, BigQuery must maintain a copy of the schema and metadata for each table. BigQuery might also need to verify permissions for each queried table. This practice also adds to query overhead and affects query performance.
If you previously created date-sharded tables, you can convert them into an ingestion-time partitioned table. For more information, see Convert date-sharded tables into ingestion-time partitioned tables.
Comments
Post a Comment