
BigQuery

  1. Similar to how a relational database organizes data into tables within a single database, BigQuery uses datasets to group tables. In a migration project where data from a database is moved to Google Cloud Platform's BigQuery, a dedicated dataset can be created to house all the migrated tables.
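
For instance, here is a minimal sketch of creating such a dedicated dataset with the google-cloud-bigquery Python client; the project and dataset names are hypothetical placeholders.

```python
from google.cloud import bigquery

# Placeholder project; the client picks up credentials from the environment.
client = bigquery.Client(project="my-migration-project")

# A dedicated dataset to hold all the migrated tables.
dataset = bigquery.Dataset("my-migration-project.migrated_data")
dataset.location = "US"  # choose the region that matches your source data

# exists_ok=True makes the call idempotent if the dataset already exists.
client.create_dataset(dataset, exists_ok=True)
```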



Views:

For example, two data scientists may need two different things from data that resides in the same table. In that case we can create a view for each requirement (writing the query once) and simply share the name of the view for them to use; a minimal sketch follows the list of disadvantages below.

Advantages: data abstraction, security, improved performance (in some cases)
Disadvantages:
  • Maintenance:
    • If the underlying table structure changes, views may need to be updated, which can add to maintenance overhead.
  • Dependency Issues:
    • Views are dependent on the underlying tables. If a table is dropped or significantly altered, the view will become invalid.
  • Storage (Materialized views):
    • Materialized views, which store the result of the query, require additional storage space.
  • Complexity:
    • Overuse of views can lead to a complex database schema, making it difficult to understand and maintain.
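
As referenced above, here is a minimal sketch of creating a view with the google-cloud-bigquery Python client; all project, dataset, table, and column names are hypothetical placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical base table shared by both data scientists.
base_table = "my-project.sales_ds.orders"

# One view per requirement; only the view name needs to be shared.
view = bigquery.Table("my-project.sales_ds.orders_emea_view")
view.view_query = f"""
    SELECT order_id, customer_id, amount
    FROM `{base_table}`
    WHERE region = 'EMEA'
"""
client.create_table(view, exists_ok=True)
```

Each data scientist then queries the view by name without needing to know or repeat the underlying SQL.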


  • You can find the table name (shown as the Table ID used in queries) under the table's Details tab.



  • Usually the full table name is the project name, followed by the dataset name, followed by the table name (project.dataset.table).
  • While creating a table, there are multiple source options:

  • Empty table --> a normal empty table
  • Google Cloud Storage --> the data is available in Cloud Storage buckets (or elsewhere in GCP) and we are going to migrate that data into the table
  • Upload --> creating the table from a local file
  • Amazon S3 --> for data from Amazon storage buckets
  • Azure Blob Storage --> for the Azure cloud
  • Existing table/view --> creating the table from an existing table or view

  • While creating the table there is an option called auto-detect schema; it automatically detects the schema needed for that table, and we can also specify whether it is a partitioned table or not while creating it.
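
A minimal sketch of both options through a Python load job, assuming a CSV export already sits in a Cloud Storage bucket; the bucket, table, and column names are placeholders. autodetect=True stands in for the console's auto-detect checkbox, and time_partitioning marks the table as partitioned at creation time.

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    autodetect=True,                              # auto-detect the schema
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,                          # skip the CSV header row
    time_partitioning=bigquery.TimePartitioning(  # create as a partitioned table
        type_=bigquery.TimePartitioningType.DAY,
        field="order_date",
    ),
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/orders/*.csv",
    "my-project.sales_ds.orders",
    job_config=job_config,
)
load_job.result()  # wait for the load to finish
```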



  • Just as we can write data to a GCP bucket using Spark, we can load data into BigQuery using PySpark.
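
A minimal PySpark sketch, assuming the spark-bigquery connector is available on the cluster (for example on Dataproc); the bucket and table names are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load-to-bq").getOrCreate()

# Read raw data from a GCS bucket (placeholder path).
df = spark.read.option("header", True).csv("gs://my-bucket/raw/orders.csv")

# Write the DataFrame into a BigQuery table via the connector.
(df.write
   .format("bigquery")
   .option("table", "my-project.sales_ds.orders")
   .option("temporaryGcsBucket", "my-temp-bucket")  # staging bucket for the load
   .mode("append")
   .save())
```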



  • Make sure the schema details are the same as the source database schema, or else it throws errors.


  • We can create and write a BigQuery table from a Python program as well; we just have to use the google-cloud-bigquery library (from google.cloud import bigquery), create a client for working with BigQuery objects, and define the schema. Refer to the example in the documentation: Using schema auto-detection  |  BigQuery  |  Google Cloud
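
A minimal sketch of that flow with an explicit schema; names and values are placeholders, and the linked documentation covers the auto-detection variant.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Explicit schema, mirroring the source database columns (placeholders).
schema = [
    bigquery.SchemaField("customer_id", "INTEGER", mode="REQUIRED"),
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("signup_date", "DATE"),
]

table_id = "my-project.sales_ds.customers"
client.create_table(bigquery.Table(table_id, schema=schema), exists_ok=True)

# Write a few rows through a load job (batch, not streaming).
rows = [{"customer_id": 1, "name": "Asha", "signup_date": "2024-01-15"}]
client.load_table_from_json(rows, table_id).result()
```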



  • Billing: compute (querying) costs more than storing the tables, so be careful while writing queries and optimize them.


  • After a query runs, we can see how many bytes were processed and the duration under the job information.
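
One way to check the bytes before paying for a query is a dry run, which estimates the processed bytes without executing anything; the table name below is a placeholder.

```python
from google.cloud import bigquery

client = bigquery.Client()

# dry_run=True estimates cost; the query is never actually executed.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(
    "SELECT customer_id, amount FROM `my-project.sales_ds.orders` "
    "WHERE order_date = '2024-01-15'",
    job_config=job_config,
)
print(f"This query would process {job.total_bytes_processed} bytes")
```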


  • Like Hive, BigQuery has optimization techniques:
  • data format 
  • clustering 
  • partitioning
  • How They Work Together:
    • Imagine a sales table. You could partition it by "year" to create separate partitions for each year's data.  
    • Within each year's partition, you could then cluster the data by "customer ID." This would ensure that all sales records for a specific customer within that year are stored close together.
    • When you query sales for a specific customer in a specific year, the system can quickly locate and retrieve the relevant data.
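
A minimal sketch of the sales example above, creating a table partitioned by year and clustered by customer ID through DDL issued from the Python client; project, dataset, and column names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

ddl = """
CREATE TABLE IF NOT EXISTS `my-project.sales_ds.sales`
(
  sale_date   DATE,
  customer_id INT64,
  amount      NUMERIC
)
PARTITION BY DATE_TRUNC(sale_date, YEAR)   -- one partition per year
CLUSTER BY customer_id                     -- sort by customer inside each partition
"""
client.query(ddl).result()
```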

    This section describes the different ways to partition a table.

    You can partition a table based on ranges of values in a specific INTEGER column. To create an integer-range partitioned table, you provide:

    • The partitioning column.
    • The starting value for range partitioning (inclusive).
    • The ending value for range partitioning (exclusive).
    • The interval of each range within the partition.

    For example, suppose you create an integer range partition with the following specification:

    Argument        Value
    column name     customer_id
    start           0
    end             100
    interval        10

    The table is partitioned on the customer_id column into ranges of interval 10. The values 0 to 9 go into one partition, values 10 to 19 go into the next partition, etc., up to 99. Values outside this range go into a partition named __UNPARTITIONED__. Any rows where customer_id is NULL go into a partition named __NULL__.

    For information about integer-range partitioned tables, see Create an integer-range partitioned table.
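    A minimal sketch of the same specification (start 0, end 100, interval 10) using the Python client's RangePartitioning; table and column names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "my-project.sales_ds.customers_by_range",
    schema=[
        bigquery.SchemaField("customer_id", "INTEGER"),
        bigquery.SchemaField("name", "STRING"),
    ],
)
# Ranges of 10: 0-9, 10-19, ..., 90-99; everything else lands in
# __UNPARTITIONED__, and NULLs land in __NULL__.
table.range_partitioning = bigquery.RangePartitioning(
    field="customer_id",
    range_=bigquery.PartitionRange(start=0, end=100, interval=10),
)
client.create_table(table, exists_ok=True)
```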

    You can partition a table on a DATE, TIMESTAMP, or DATETIME column in the table. When you write data to the table, BigQuery automatically puts the data into the correct partition, based on the values in the column.

    For TIMESTAMP and DATETIME columns, the partitions can have either hourly, daily, monthly, or yearly granularity. For DATE columns, the partitions can have daily, monthly, or yearly granularity. Partition boundaries are based on UTC time.

    For example, suppose that you partition a table on a DATETIME column with monthly partitioning. If you insert the following values into the table, the rows are written to the following partitions:

    Column value              Partition (monthly)
    DATETIME("2019-01-01")    201901
    DATETIME("2019-01-15")    201901
    DATETIME("2019-04-30")    201904

    In addition, two special partitions are created:

    • __NULL__: Contains rows with NULL values in the partitioning column.
    • __UNPARTITIONED__: Contains rows where the value of the partitioning column is earlier than 1960-01-01 or later than 2159-12-31.

    For information about time-unit column-partitioned tables, see Create a time-unit column-partitioned table.
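    A minimal sketch of the monthly DATETIME partitioning described above, assuming a reasonably recent google-cloud-bigquery release that supports monthly granularity; table and column names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "my-project.sales_ds.events_by_month",
    schema=[
        bigquery.SchemaField("event_ts", "DATETIME"),
        bigquery.SchemaField("payload", "STRING"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.MONTH,  # partitions like 201901, 201904
    field="event_ts",                           # partition on this column
)
client.create_table(table, exists_ok=True)
```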

    When you create a table partitioned by ingestion time, BigQuery automatically assigns rows to partitions based on the time when BigQuery ingests the data. You can choose hourly, daily, monthly, or yearly granularity for the partitions. Partition boundaries are based on UTC time.

    If your data might reach the maximum number of partitions per table when using a finer time granularity, use a coarser granularity instead. For example, you can partition by month instead of day to reduce the number of partitions. You can also cluster the partition column to further improve performance.

    An ingestion-time partitioned table has a pseudocolumn named _PARTITIONTIME. The value of this column is the ingestion time for each row, truncated to the partition boundary (such as hourly or daily). For example, suppose that you create an ingestion-time partitioned table with hourly partitioning and send data at the following times:

    Ingestion time         _PARTITIONTIME         Partition (hourly)
    2021-05-07 17:22:00    2021-05-07 17:00:00    2021050717
    2021-05-07 17:40:00    2021-05-07 17:00:00    2021050717
    2021-05-07 18:31:00    2021-05-07 18:00:00    2021050718

    Because the table in this example uses hourly partitioning, the value of _PARTITIONTIME is truncated to an hour boundary. BigQuery uses this value to determine the correct partition for the data.

    You can also write data to a specific partition. For example, you might want to load historical data or adjust for time zones. You can use any valid date between 0001-01-01 and 9999-12-31. However, DML statements cannot reference dates prior to 1970-01-01 or after 2159-12-31. For more information, see Write data to a specific partition.

    Instead of using _PARTITIONTIME, you can also use _PARTITIONDATE. The _PARTITIONDATE pseudocolumn contains the UTC date corresponding to the value in the _PARTITIONTIME pseudocolumn.
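    A minimal sketch of reading one hourly partition through the _PARTITIONTIME pseudocolumn; the table name is a placeholder.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Reads only the 17:00 UTC hourly partition of an ingestion-time
# partitioned table instead of scanning the whole table.
sql = """
SELECT *
FROM `my-project.sales_ds.ingest_events`
WHERE _PARTITIONTIME = TIMESTAMP("2021-05-07 17:00:00")
"""
for row in client.query(sql).result():
    print(dict(row))
```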

    When you partition a table by time-unit column or ingestion time, you choose whether the partitions have daily, hourly, monthly, or yearly granularity.

    • Daily partitioning is the default partitioning type. Daily partitioning is a good choice when your data is spread out over a wide range of dates, or if data is continuously added over time.

    • Choose hourly partitioning if your tables have a high volume of data that spans a short date range — typically less than six months of timestamp values. If you choose hourly partitioning, make sure the partition count stays within the partition limits.

    • Choose monthly or yearly partitioning if your tables have a relatively small amount of data for each day, but span a wide date range. This option is also recommended if your workflow requires frequently updating or adding rows that span a wide date range (for example, more than 500 dates). In these scenarios, use monthly or yearly partitioning along with clustering on the partitioning column to achieve the best performance. For more information, see Combining clustered and partitioned tables in this document.

    You can combine table partitioning with table clustering to achieve finely grained sorting for further query optimization.

    A clustered table contains clustered columns that sort data based on user-defined sort properties. Data within these clustered columns are sorted into storage blocks which are adaptively sized based on the size of the table. When you run a query that filters by the clustered column, BigQuery only scans the relevant blocks based on the clustered columns instead of the entire table or table partition. In a combined approach using both table partitioning and clustering, you first segment table data into partitions, then you cluster the data within each partition by the clustering columns.

    When you create a table that is clustered and partitioned, you can achieve more finely grained sorting, as the following diagram shows:

    Comparing tables that are not clustered or partitioned to tables that are clustered and partitioned.

    Table sharding is the practice of storing data in multiple tables, using a naming prefix such as [PREFIX]_YYYYMMDD.

    Partitioning is recommended over table sharding because partitioned tables perform better. With sharded tables, BigQuery must maintain a copy of the schema and metadata for each table. BigQuery might also need to verify permissions for each queried table. This practice also adds to query overhead and affects query performance.

    If you previously created date-sharded tables, you can convert them into an ingestion-time partitioned table. For more information, see Convert date-sharded tables into ingestion-time partitioned tables.

  • There is a partition filter option ("require partition filter") available in GCP that we can apply while creating partitioned tables; if it is not applied, BigQuery does not show any error when we query without a filter (WHERE clause). A snippet for enabling it follows the IAM notes below.
  • IAM permissions:
  • The BigQuery Data Editor role is needed to create tables in BigQuery.
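
Returning to the partition filter bullet above, here is a minimal sketch of enabling "require partition filter" on an existing partitioned table so that unfiltered queries are rejected instead of silently scanning every partition; the table name is a placeholder.

```python
from google.cloud import bigquery

client = bigquery.Client()

table = client.get_table("my-project.sales_ds.sales")
table.require_partition_filter = True
# Only the changed field needs to be sent in the update.
client.update_table(table, ["require_partition_filter"])
```

With this flag set, a query against the table must include a WHERE clause on the partitioning column, which keeps the bytes processed (and the bill) under control.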
