Databricks was built on top of Apache Spark. Spark itself is written in Java and Scala, and since we are going to use it with Python, we will be working with PySpark.
To make development easier and faster, Spark provides higher-level abstracted libraries called modules: Spark DataFrames, Spark SQL, Spark Streaming, MLlib, GraphX, and the Pandas API on Spark.
Azure Databricks is a unified, open analytics platform for building, deploying, sharing, and maintaining enterprise-grade data, analytics, and AI solutions at scale. The Databricks Data Intelligence Platform integrates with cloud storage and security in your cloud account and manages and deploys cloud infrastructure for you.
The programming language used in these Azure Databricks examples is Python (PySpark).
In Spark, you can read data from various source systems using the .read attribute of a SparkSession (the older SparkContext offers lower-level read methods, but SparkSession is the modern and recommended entry point).
For example, if you want to read a CSV file, you would typically do something like this:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("ReadFile").getOrCreate()
# Read a CSV file
df = spark.read.csv("path/to/your/file.csv")
# To specify options like header and delimiter:
df_with_options = spark.read.option("header", "true").option("delimiter", ",").csv("path/to/your/file.csv")
# You can do something similar for other file formats like JSON, Parquet, etc.
json_df = spark.read.json("path/to/your/file.json")
parquet_df = spark.read.parquet("path/to/your/file.parquet")
# Don't forget to stop the SparkSession when you're done
spark.stop()
The .read attribute provides access to DataFrameReader methods for reading data in different formats. You can also use .option() to configure various parameters specific to the data source.
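As a minimal sketch of that generic DataFrameReader form (the file path here is just a placeholder), the same CSV read can also be written with format() plus options and a final load():
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("GenericReader").getOrCreate()

# Generic DataFrameReader form: choose the format, set options, then load
df = (spark.read
      .format("csv")                  # could also be "json", "parquet", etc.
      .option("header", "true")       # first row contains column names
      .option("inferSchema", "true")  # let Spark sample the data to guess types
      .load("path/to/your/file.csv"))

df.printSchema()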
Read CSV File from Azure Data Lake Storage Account
CSV Source File Path : "abfss://working-labs@datalakestorageaccountname.dfs.core.windows.net/bronze/daily-pricing/csv"
JSON Target File Path : "abfss://working-labs@datalakestorageaccountname.dfs.core.windows.net/bronze/daily-pricing/json"
Spark Methods
- DataFrameReader: csv, option (header, separator), schema
- Spark Data Types: ArrayType, DoubleType, IntegerType, LongType, StringType, StructType, StructField
- DataFrameWriter: json, mode (overwrite, append)
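A minimal sketch of this lab, tying those pieces together: it reads the daily-pricing CSV from the bronze source path above with an explicit schema and writes it back out as JSON to the target path. The column names and types in the schema are assumptions for illustration only; replace them with the actual columns of your pricing file.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

spark = SparkSession.builder.appName("DailyPricingBronze").getOrCreate()

# Hypothetical schema -- the real daily-pricing columns may differ
pricing_schema = StructType([
    StructField("market_name", StringType(), True),
    StructField("product_name", StringType(), True),
    StructField("quantity", IntegerType(), True),
    StructField("price", DoubleType(), True)
])

csv_path = "abfss://working-labs@datalakestorageaccountname.dfs.core.windows.net/bronze/daily-pricing/csv"
json_path = "abfss://working-labs@datalakestorageaccountname.dfs.core.windows.net/bronze/daily-pricing/json"

# DataFrameReader: csv + option (header, delimiter) + schema
pricing_df = (spark.read
              .option("header", "true")
              .option("delimiter", ",")
              .schema(pricing_schema)
              .csv(csv_path))

# DataFrameWriter: json + mode (overwrite or append)
pricing_df.write.mode("overwrite").json(json_path)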
StructType
In Spark SQL, it is crucial to define the structure and data types of your DataFrames. StructType provides a way to explicitly specify the schema, which is important for data integrity, performance, and working with complex or unstructured data sources. Think of StructType as the blueprint or the schema definition for your tabular data in Spark. It tells Spark the names and types of the columns it should expect.
Key Characteristics of StructType:
- Ordered Collection of Fields: A StructType is an ordered collection of StructField objects. The order in which you define the fields matters for some operations and for how the data is interpreted.
- StructField Components: Each field within a StructType is defined by a StructField. A StructField has three main attributes, plus optional metadata:
  - name (String): The name of the column.
  - dataType (DataType): The data type of the column (e.g., StringType, IntegerType, BooleanType, DateType, another nested StructType, ArrayType, MapType, etc.).
  - nullable (Boolean): Indicates whether the column can contain null values (True) or not (False).
  - metadata (Map[String, Any], optional): A map to store extra information about the field.
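As a small sketch of these attributes in one place (the field name and metadata key below are made up for illustration), a StructField carries the name, data type, nullability, and optional metadata:
from pyspark.sql.types import StructField, StringType

# Hypothetical field: name, dataType, nullable, and extra metadata
price_date_field = StructField(
    "price_date",                          # name
    StringType(),                          # dataType
    True,                                  # nullable
    {"comment": "trading date as text"}    # metadata (optional)
)

print(price_date_field.name, price_date_field.dataType,
      price_date_field.nullable, price_date_field.metadata)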
How StructType is Used in Spark:
- Defining DataFrame Schemas: When creating DataFrames from sources that don't inherently provide schema information (like CSV or JSON files without schema inference, or from RDDs), you explicitly define the schema using a StructType.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), False),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True)
])

# Assuming you have an RDD called 'data_rdd'
df = spark.createDataFrame(data_rdd, schema)
df.printSchema()
- Schema Inference (Implicit): Spark can sometimes infer the schema of your data automatically, especially when reading structured files like Parquet or when you have header rows in CSV files. However, explicitly defining the schema with StructType is often recommended (a short comparison sketch follows this list) for:
  - Performance: Explicit schemas can sometimes lead to better performance as Spark doesn't need to sample the data to infer types.
  - Data Integrity: You can ensure the data types are interpreted correctly, preventing potential data type mismatches and errors.
  - Clarity and Maintainability: Explicit schemas make your code more readable and easier to understand.
- Working with Nested Data: StructType allows you to define schemas for nested data structures. A StructField can have its dataType set to another StructType, allowing you to represent hierarchical data.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

address_schema = StructType([
    StructField("street", StringType(), True),
    StructField("zipcode", StringType(), True)
])

person_schema = StructType([
    StructField("name", StringType(), False),
    StructField("age", IntegerType(), True),
    StructField("address", address_schema, True)
])

# ... create DataFrame with this schema ...
- Defining Complex Data Types: You can also use StructType in conjunction with other complex data types like ArrayType (for lists) and MapType (for key-value pairs) to create sophisticated schemas.
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, IntegerType

schema_with_array = StructType([
    StructField("name", StringType(), False),
    StructField("scores", ArrayType(IntegerType()), True)
])

# ... create DataFrame ...
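Here is the comparison sketch referenced in the Schema Inference item above: it contrasts an inferred schema with an explicit one for the same CSV read. The file path and column names are placeholders for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("SchemaInferenceVsExplicit").getOrCreate()

# Implicit: Spark samples the file to guess column types (an extra pass over the data)
inferred_df = (spark.read
               .option("header", "true")
               .option("inferSchema", "true")
               .csv("path/to/your/file.csv"))

# Explicit: no sampling pass, and the types are guaranteed to be what you declared
explicit_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
explicit_df = (spark.read
               .option("header", "true")
               .schema(explicit_schema)
               .csv("path/to/your/file.csv"))

inferred_df.printSchema()
explicit_df.printSchema()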