
Python - Pandas Library

Pandas is a Python library.

Pandas is used to analyze data.


A Pandas Series is like a column in a table.
import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar[0])  # output: 1

You can also name your own labels with the index argument:

myvar = pd.Series(a, index = ["x", "y", "z"])

Labels

If nothing else is specified, the values are labeled with their index number: the first value has index 0, the second value has index 1, and so on.

This label can be used to access a specified value.
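For example, with the custom labels set above, return the value labeled "y":

print(myvar["y"])  # output: 7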

You can also create a Series from a dictionary; the keys become the labels:

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories)

print(myvar)

output:

day1    420
day2    380
day3    390
dtype: int64
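To include only some of the items from the dictionary, pass the wanted keys as the index argument:

myvar = pd.Series(calories, index = ["day1", "day2"])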

DataFrames

Data sets in Pandas are usually two-dimensional tables, called DataFrames.

A Series is like a column; a DataFrame is the whole table.

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data)

print(df)

Pandas uses the loc attribute to return one or more specified rows:

print(df.loc[0])

  calories    420
  duration     50
  Name: 0, dtype: int64
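loc also accepts a list of indexes, in which case the result is a DataFrame instead of a Series:

print(df.loc[[0, 1]])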
Use to_string() to print the entire DataFrame.
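For example:

print(df.to_string())  # prints every row instead of a truncated view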

1. Importing pandas:

Python
import pandas as pd
import numpy as np # Often used with pandas

2. Creating DataFrames:

  • From a dictionary:
Python
data = {'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']}
df = pd.DataFrame(data)
  • From a list of lists:
Python
data = [[1, 'a'], [2, 'b'], [3, 'c']]
df = pd.DataFrame(data, columns=['col1', 'col2'])
  • From a CSV file:
Python
df = pd.read_csv('data.csv')
  • From an Excel file:
Python
df = pd.read_excel('data.xlsx')

3. Basic DataFrame Operations:

  • Viewing data:
Python
df.head()       # First 5 rows
df.tail()       # Last 5 rows
df.info()       # DataFrame info
df.describe()   # Summary statistics
df.shape        # (rows, columns)
df.columns      # Column names
df.index        # Index values
  • Selecting data:
Python
df['col1']       # Select column 'col1'
df[['col1', 'col2']] # Select multiple columns
df.loc[0]         # Select row by label (index)
df.iloc[0]        # Select row by integer position
df[df['col1'] > 1] # Boolean indexing (filtering)
  • Adding/removing columns:
Python
df['new_col'] = [4, 5, 6] # Add a new column
df.drop('col1', axis=1)    # Remove column 'col1' (returns a new DataFrame)
  • Adding/removing rows:
Python
df = pd.concat([df, pd.DataFrame([{'col1': 4, 'col2': 'd'}])], ignore_index=True) # add a row (DataFrame.append was removed in pandas 2.0)
df = df.drop(0) # remove row with index 0 (returns a new DataFrame)
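Putting the operations above together, a minimal sketch using the same df built from a dictionary in section 2:

Python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']})

print(df[df['col1'] > 1])     # filter: rows where col1 > 1
df['col3'] = df['col1'] * 10  # add a derived column
df = df.drop('col2', axis=1)  # drop a column, keeping the returned copy
print(df)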

4. Data Manipulation:

  • Sorting:
Python
df.sort_values(by='col1')
  • Grouping:
Python
df.groupby('col1').mean()
  • Applying functions:
Python
df['col1'].apply(lambda x: x * 2)
  • Handling missing values:
Python
df.isnull()       # Check for missing values
df.dropna()       # Remove rows with missing values
df.fillna(0)      # Fill missing values with 0
  • String operations (for string columns):
Python
df['col2'].str.upper()       # convert to upper case
df['col2'].str.contains('a') # Boolean Series: True where the string contains 'a'
  • Merging/Joining:
Python
pd.merge(df1, df2, on='common_col') # Merge DataFrames
pd.concat([df1,df2]) #combine dataframes vertically
df1.join(df2, how='left') # join DataFrames on their indexes (left join)
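A minimal merge sketch, reusing the df1/df2 and 'common_col' names from above with made-up data:

Python
import pandas as pd

df1 = pd.DataFrame({'common_col': [1, 2, 3], 'left_val': ['a', 'b', 'c']})
df2 = pd.DataFrame({'common_col': [2, 3, 4], 'right_val': ['x', 'y', 'z']})

merged = pd.merge(df1, df2, on='common_col')  # inner join by default
print(merged)  # only keys 2 and 3 survive, since they appear in both frames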

5. Time Series (if applicable):

  • Datetime conversion:
Python
df['date'] = pd.to_datetime(df['date'])
  • Resampling:
Python
df.resample('M', on='date').mean() #resample to monthly data.
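A small resampling sketch, assuming a 'date' column of daily values (the 'value' column is made up for illustration):

Python
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=90, freq='D'),
    'value': range(90),
})

monthly = df.resample('M', on='date').mean()  # one row per month-end ('ME' in newer pandas)
print(monthly)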

6. Saving Data:

  • To CSV:
Python
df.to_csv('output.csv', index=False)
  • To Excel:
Python
df.to_excel('output.xlsx', index=False)

Important Notes:

  • axis=0 refers to rows, and axis=1 refers to columns.
  • inplace=True modifies the DataFrame directly, without creating a copy.
  • Always check the pandas documentation for the most up-to-date information.
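To make the axis note concrete, a tiny sketch with a hypothetical two-column DataFrame:

Python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

print(df.sum(axis=0))  # down the rows: one total per column
print(df.sum(axis=1))  # across the columns: one total per row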
