
BigQuery

  1. Similar to how a relational database organizes data into tables within a single database, BigQuery uses datasets to group tables. In a migration project where data from a database is moved to Google Cloud Platform's BigQuery, a dedicated dataset can be created to house all the migrated tables.
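
For instance, here is a minimal sketch of creating such a dedicated dataset with the google-cloud-bigquery Python client; the project and dataset names are hypothetical placeholders.

```python
from google.cloud import bigquery

# Placeholder project; the client picks up credentials from the environment.
client = bigquery.Client(project="my-migration-project")

# A dedicated dataset to hold all the migrated tables.
dataset = bigquery.Dataset("my-migration-project.migrated_data")
dataset.location = "US"  # choose the region that matches your source data

# exists_ok=True makes the call idempotent if the dataset already exists.
client.create_dataset(dataset, exists_ok=True)
```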



Views:

For example, two data scientists may need two different things from data that resides in the same table. In that case we can create a view for each requirement (writing the query once) and simply share the name of the view for them to use; a minimal sketch follows the list of disadvantages below.

Advantages: data abstraction, security, improved performance (in some cases)
Disadvantages:
  • Maintenance:
    • If the underlying table structure changes, views may need to be updated, which can add to maintenance overhead.
  • Dependency Issues:
    • Views are dependent on the underlying tables. If a table is dropped or significantly altered, the view will become invalid.
  • Storage (Materialized views):
    • Materialized views, which store the result of the query, require additional storage space.
  • Complexity:
    • Overuse of views can lead to a complex database schema, making it difficult to understand and maintain.
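
As referenced above, here is a minimal sketch of creating a view with the google-cloud-bigquery Python client; all project, dataset, table, and column names are hypothetical placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical base table shared by both data scientists.
base_table = "my-project.sales_ds.orders"

# One view per requirement; only the view name needs to be shared.
view = bigquery.Table("my-project.sales_ds.orders_emea_view")
view.view_query = f"""
    SELECT order_id, customer_id, amount
    FROM `{base_table}`
    WHERE region = 'EMEA'
"""
client.create_table(view, exists_ok=True)
```

Each data scientist then queries the view by name without needing to know or repeat the underlying SQL.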


  • You can find the table name (shown as the Table ID used in queries) under the table's Details tab.



  • Usually the full table name is the project name, followed by the dataset name, followed by the table name (project.dataset.table).
  • While creating a table, there are multiple source options:

  • Empty table --> a normal empty table
  • Google Cloud Storage --> the data is available in Cloud Storage buckets (or elsewhere in GCP) and we are going to migrate that data into the table
  • Upload --> creating the table from a local file
  • Amazon S3 --> for data from Amazon storage buckets
  • Azure Blob Storage --> for the Azure cloud
  • Existing table/view --> creating the table from an existing table or view

  • While creating the table there is an option called auto-detect schema; it automatically detects the schema needed for that table, and we can also specify whether it is a partitioned table or not while creating it.
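
A minimal sketch of both options through a Python load job, assuming a CSV export already sits in a Cloud Storage bucket; the bucket, table, and column names are placeholders. autodetect=True stands in for the console's auto-detect checkbox, and time_partitioning marks the table as partitioned at creation time.

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    autodetect=True,                              # auto-detect the schema
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,                          # skip the CSV header row
    time_partitioning=bigquery.TimePartitioning(  # create as a partitioned table
        type_=bigquery.TimePartitioningType.DAY,
        field="order_date",
    ),
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/orders/*.csv",
    "my-project.sales_ds.orders",
    job_config=job_config,
)
load_job.result()  # wait for the load to finish
```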



  • Just as we can write data to a GCP bucket using Spark, we can load data into BigQuery using PySpark.
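
A minimal PySpark sketch, assuming the spark-bigquery connector is available on the cluster (for example on Dataproc); the bucket and table names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load-to-bq").getOrCreate()

# Read raw data from a GCS bucket (placeholder path).
df = spark.read.option("header", True).csv("gs://my-bucket/raw/orders.csv")

# Write the DataFrame into a BigQuery table via the connector.
(df.write
   .format("bigquery")
   .option("table", "my-project.sales_ds.orders")
   .option("temporaryGcsBucket", "my-temp-bucket")  # staging bucket for the load
   .mode("append")
   .save())
```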



  • Make sure the schema details are the same as the source database schema, or else it throws errors.


  • We can create and write a BigQuery table from a Python program as well; we just have to use the google-cloud-bigquery library (from google.cloud import bigquery), create a client for working with BigQuery objects, and define the schema. Refer to the example in the documentation: Using schema auto-detection  |  BigQuery  |  Google Cloud
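
A minimal sketch of that flow with an explicit schema; names and values are placeholders, and the linked documentation covers the auto-detection variant.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Explicit schema, mirroring the source database columns (placeholders).
schema = [
    bigquery.SchemaField("customer_id", "INTEGER", mode="REQUIRED"),
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("signup_date", "DATE"),
]

table_id = "my-project.sales_ds.customers"
client.create_table(bigquery.Table(table_id, schema=schema), exists_ok=True)

# Write a few rows through a load job (batch, not streaming).
rows = [{"customer_id": 1, "name": "Asha", "signup_date": "2024-01-15"}]
client.load_table_from_json(rows, table_id).result()
```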



  • Billing: compute (querying) costs more than storing the tables, so be careful while writing queries and optimize them.


  • After a query runs, we can see how many bytes were processed and the duration under the job information.
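
One way to check the bytes before paying for a query is a dry run, which estimates the processed bytes without executing anything; the table name below is a placeholder.

```python
from google.cloud import bigquery

client = bigquery.Client()

# dry_run=True estimates cost; the query is never actually executed.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT customer_id, amount FROM `my-project.sales_ds.orders` "
    "WHERE order_date = '2024-01-15'",
    job_config=job_config,
)
print(f"This query would process {job.total_bytes_processed} bytes")
```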


  • Like Hive, BigQuery has optimization techniques:
  • data format 
  • clustering 
  • partitioning
  • How They Work Together:
    • Imagine a sales table. You could partition it by "year" to create separate partitions for each year's data.  
    • Within each year's partition, you could then cluster the data by "customer ID." This would ensure that all sales records for a specific customer within that year are stored close together.
    • When you query sales for a specific customer in a specific year, the system can quickly locate and retrieve the relevant data.
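
A minimal sketch of the sales example above, creating a table partitioned by year and clustered by customer ID through DDL issued from the Python client; project, dataset, and column names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

ddl = """
CREATE TABLE IF NOT EXISTS `my-project.sales_ds.sales`
(
  sale_date   DATE,
  customer_id INT64,
  amount      NUMERIC
)
PARTITION BY DATE_TRUNC(sale_date, YEAR)   -- one partition per year
CLUSTER BY customer_id                     -- sort by customer inside each partition
"""
client.query(ddl).result()
```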

    This section describes the different ways to partition a table.

    You can partition a table based on ranges of values in a specific INTEGER column. To create an integer-range partitioned table, you provide:

    • The partitioning column.
    • The starting value for range partitioning (inclusive).
    • The ending value for range partitioning (exclusive).
    • The interval of each range within the partition.

    For example, suppose you create an integer range partition with the following specification:

    Argument        Value
    column name     customer_id
    start           0
    end             100
    interval        10

    The table is partitioned on the customer_id column into ranges of interval 10. The values 0 to 9 go into one partition, values 10 to 19 go into the next partition, etc., up to 99. Values outside this range go into a partition named __UNPARTITIONED__. Any rows where customer_id is NULL go into a partition named __NULL__.

    For information about integer-range partitioned tables, see Create an integer-range partitioned table.
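    A minimal sketch of the same specification (start 0, end 100, interval 10) using the Python client's RangePartitioning; table and column names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "my-project.sales_ds.customers_by_range",
    schema=[
        bigquery.SchemaField("customer_id", "INTEGER"),
        bigquery.SchemaField("name", "STRING"),
    ],
)
# Ranges of 10: 0-9, 10-19, ..., 90-99; everything else lands in
# __UNPARTITIONED__, and NULLs land in __NULL__.
table.range_partitioning = bigquery.RangePartitioning(
    field="customer_id",
    range_=bigquery.PartitionRange(start=0, end=100, interval=10),
)
client.create_table(table, exists_ok=True)
```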

    You can partition a table on a DATE, TIMESTAMP, or DATETIME column in the table. When you write data to the table, BigQuery automatically puts the data into the correct partition, based on the values in the column.

    For TIMESTAMP and DATETIME columns, the partitions can have either hourly, daily, monthly, or yearly granularity. For DATE columns, the partitions can have daily, monthly, or yearly granularity. Partition boundaries are based on UTC time.

    For example, suppose that you partition a table on a DATETIME column with monthly partitioning. If you insert the following values into the table, the rows are written to the following partitions:

    Column value              Partition (monthly)
    DATETIME("2019-01-01")    201901
    DATETIME("2019-01-15")    201901
    DATETIME("2019-04-30")    201904

    In addition, two special partitions are created:

    • __NULL__: Contains rows with NULL values in the partitioning column.
    • __UNPARTITIONED__: Contains rows where the value of the partitioning column is earlier than 1960-01-01 or later than 2159-12-31.

    For information about time-unit column-partitioned tables, see Create a time-unit column-partitioned table.
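    A minimal sketch of the monthly DATETIME partitioning described above, assuming a reasonably recent google-cloud-bigquery release that supports monthly granularity; table and column names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "my-project.sales_ds.events_by_month",
    schema=[
        bigquery.SchemaField("event_ts", "DATETIME"),
        bigquery.SchemaField("payload", "STRING"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.MONTH,  # partitions like 201901, 201904
    field="event_ts",                           # partition on this column
)
client.create_table(table, exists_ok=True)
```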

    When you create a table partitioned by ingestion time, BigQuery automatically assigns rows to partitions based on the time when BigQuery ingests the data. You can choose hourly, daily, monthly, or yearly granularity for the partitions. Partition boundaries are based on UTC time.

    If your data might reach the maximum number of partitions per table when using a finer time granularity, use a coarser granularity instead. For example, you can partition by month instead of day to reduce the number of partitions. You can also cluster the partition column to further improve performance.

    An ingestion-time partitioned table has a pseudocolumn named _PARTITIONTIME. The value of this column is the ingestion time for each row, truncated to the partition boundary (such as hourly or daily). For example, suppose that you create an ingestion-time partitioned table with hourly partitioning and send data at the following times:

    Ingestion time         _PARTITIONTIME         Partition (hourly)
    2021-05-07 17:22:00    2021-05-07 17:00:00    2021050717
    2021-05-07 17:40:00    2021-05-07 17:00:00    2021050717
    2021-05-07 18:31:00    2021-05-07 18:00:00    2021050718

    Because the table in this example uses hourly partitioning, the value of _PARTITIONTIME is truncated to an hour boundary. BigQuery uses this value to determine the correct partition for the data.

    You can also write data to a specific partition. For example, you might want to load historical data or adjust for time zones. You can use any valid date between 0001-01-01 and 9999-12-31. However, DML statements cannot reference dates prior to 1970-01-01 or after 2159-12-31. For more information, see Write data to a specific partition.

    Instead of using _PARTITIONTIME, you can also use _PARTITIONDATE. The _PARTITIONDATE pseudocolumn contains the UTC date corresponding to the value in the _PARTITIONTIME pseudocolumn.
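    A minimal sketch of reading one hourly partition through the _PARTITIONTIME pseudocolumn; the table name is a placeholder.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Reads only the 17:00 UTC hourly partition of an ingestion-time
# partitioned table instead of scanning the whole table.
sql = """
SELECT *
FROM `my-project.sales_ds.ingest_events`
WHERE _PARTITIONTIME = TIMESTAMP("2021-05-07 17:00:00")
"""
for row in client.query(sql).result():
    print(dict(row))
```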

    When you partition a table by time-unit column or ingestion time, you choose whether the partitions have daily, hourly, monthly, or yearly granularity.

    • Daily partitioning is the default partitioning type. Daily partitioning is a good choice when your data is spread out over a wide range of dates, or if data is continuously added over time.

    • Choose hourly partitioning if your tables have a high volume of data that spans a short date range — typically less than six months of timestamp values. If you choose hourly partitioning, make sure the partition count stays within the partition limits.

    • Choose monthly or yearly partitioning if your tables have a relatively small amount of data for each day, but span a wide date range. This option is also recommended if your workflow requires frequently updating or adding rows that span a wide date range (for example, more than 500 dates). In these scenarios, use monthly or yearly partitioning along with clustering on the partitioning column to achieve the best performance. For more information, see Combining clustered and partitioned tables in this document.

    You can combine table partitioning with table clustering to achieve finely grained sorting for further query optimization.

    A clustered table contains clustered columns that sort data based on user-defined sort properties. Data within these clustered columns are sorted into storage blocks which are adaptively sized based on the size of the table. When you run a query that filters by the clustered column, BigQuery only scans the relevant blocks based on the clustered columns instead of the entire table or table partition. In a combined approach using both table partitioning and clustering, you first segment table data into partitions, then you cluster the data within each partition by the clustering columns.

    When you create a table that is clustered and partitioned, you can achieve more finely grained sorting, as the following diagram shows:

    Comparing tables that are not clustered or partitioned to tables that are clustered and partitioned.

    Table sharding is the practice of storing data in multiple tables, using a naming prefix such as [PREFIX]_YYYYMMDD.

    Partitioning is recommended over table sharding because partitioned tables perform better. With sharded tables, BigQuery must maintain a copy of the schema and metadata for each table. BigQuery might also need to verify permissions for each queried table. This practice also adds to query overhead and affects query performance.

    If you previously created date-sharded tables, you can convert them into an ingestion-time partitioned table. For more information, see Convert date-sharded tables into ingestion-time partitioned tables.

  • There is a partition filter option ("require partition filter") available in GCP that we can apply while creating partitioned tables; if it is not applied, BigQuery does not show any error when we query without a filter (WHERE clause). A snippet for enabling it follows the IAM notes below.
  • IAM permissions:
  • The BigQuery Data Editor role is needed to create tables in BigQuery.
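
Returning to the partition filter bullet above, here is a minimal sketch of enabling "require partition filter" on an existing partitioned table so that unfiltered queries are rejected instead of silently scanning every partition; the table name is a placeholder.

```python
from google.cloud import bigquery

client = bigquery.Client()

table = client.get_table("my-project.sales_ds.sales")
table.require_partition_filter = True
# Only the changed field needs to be sent in the update.
client.update_table(table, ["require_partition_filter"])
```

With this flag set, a query against the table must include a WHERE clause on the partitioning column, which keeps the bytes processed (and the bill) under control.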
