
Python - Pandas lib

Pandas is a Python library.

Pandas is used to analyze data.


A Pandas Series is like a column in a table.
import pandas as pd

a = [1, 7, 2]

myvar = pd.Series(a)

print(myvar[0])  # output: 1

Or, with your own labels supplied through the index argument:

myvar = pd.Series(a, index=["x", "y", "z"])

Labels

If nothing else is specified, the values are labeled with their index number. The first value has index 0, the second value has index 1, and so on.

This label can be used to access a specified value.
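
As a minimal sketch of label-based access, the example below builds the same three-value Series twice, once with custom labels and once with the default ones. The variable names (values, labeled, default) are illustrative and not part of the original notes.

```python
import pandas as pd

values = [1, 7, 2]

# Custom labels replace the default 0, 1, 2
labeled = pd.Series(values, index=["x", "y", "z"])
y_value = labeled["y"]
print(y_value)  # 7

# Without an explicit index, the labels are the index numbers
default = pd.Series(values)
first_value = default[0]
print(first_value)  # 1
```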

A Series can also be created from a dictionary. The keys become the labels:

calories = {"day1": 420, "day2": 380, "day3": 390}

myvar = pd.Series(calories)

print(myvar)

Output:

day1    420
day2    380
day3    390
dtype: int64

DataFrames

Data sets in Pandas are usually multi-dimensional tables, called DataFrames.

A Series is like a column; a DataFrame is a whole table.

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

myvar = pd.DataFrame(data)

print(myvar)

Pandas uses the loc attribute to return one or more specified row(s):

print(myvar.loc[0])

Output:

  calories    420
  duration     50
  Name: 0, dtype: int64

Use to_string() to print the entire DataFrame.
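
A short sketch of both ideas, assuming the calories/duration data from above: a single label passed to loc gives back one row as a Series, while a list of labels gives back a DataFrame. The names first_row, first_two, and full_text are illustrative only.

```python
import pandas as pd

data = {
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
}
df = pd.DataFrame(data)

# A single label returns that row as a Series
first_row = df.loc[0]

# A list of labels returns a DataFrame containing those rows
first_two = df.loc[[0, 1]]

# to_string() renders every row and column as text, without truncation
full_text = df.to_string()
print(full_text)
```

With only three rows the difference is not visible, but on a large DataFrame print(df) abbreviates the middle rows while to_string() does not.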

1. Importing pandas:

Python
import pandas as pd
import numpy as np # Often used with pandas

2. Creating DataFrames:

  • From a dictionary:
Python
data = {'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']}
df = pd.DataFrame(data)
  • From a list of lists:
Python
data = [[1, 'a'], [2, 'b'], [3, 'c']]
df = pd.DataFrame(data, columns=['col1', 'col2'])
  • From a CSV file:
Python
df = pd.read_csv('data.csv')
  • From an Excel file:
Python
df = pd.read_excel('data.xlsx')

3. Basic DataFrame Operations:

  • Viewing data:
Python
df.head()       # First 5 rows
df.tail()       # Last 5 rows
df.info()       # DataFrame info
df.describe()   # Summary statistics
df.shape        # (rows, columns)
df.columns      # Column names
df.index        # Index values
  • Selecting data:
Python
df['col1']       # Select column 'col1'
df[['col1', 'col2']] # Select multiple columns
df.loc[0]         # Select row by label (index)
df.iloc[0]        # Select row by integer position
df[df['col1'] > 1] # Boolean indexing (filtering)
  • Adding/removing columns:
Python
df['new_col'] = [4, 5, 6] # Add a new column
df.drop('col1', axis=1)    # Return a copy without column 'col1' (df itself is unchanged)
  • Adding/removing rows:
Python
df = pd.concat([df, pd.DataFrame([{'col1': 4, 'col2': 'd'}])], ignore_index=True) # Add a row (DataFrame.append was removed in pandas 2.0)
df.drop(0) # Return a copy without the row labeled 0
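
The operations in this section can be tried together in one small sketch. It assumes a DataFrame with non-default row labels (10, 20, 30, chosen only for illustration) so that the difference between loc and iloc is visible, and it adds a row with pd.concat because DataFrame.append is no longer available in pandas 2.0 and later.

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [1, 2, 3], "col2": ["a", "b", "c"]},
    index=[10, 20, 30],  # hypothetical non-default row labels
)

by_label = df.loc[10]         # the row whose label is 10
by_position = df.iloc[0]      # the first row, whatever its label
filtered = df[df["col1"] > 1]  # rows where col1 is greater than 1

# drop() returns a new DataFrame; df itself still has col1
without_col1 = df.drop("col1", axis=1)

# Add a row by concatenating a one-row DataFrame
new_row = pd.DataFrame([{"col1": 4, "col2": "d"}])
extended = pd.concat([df, new_row], ignore_index=True)
print(extended)
```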

4. Data Manipulation:

  • Sorting:
Python
df.sort_values(by='col1')
  • Grouping:
Python
df.groupby('col1').mean(numeric_only=True) # Mean of the numeric columns per group
  • Applying functions:
Python
df['col1'].apply(lambda x: x * 2)
  • Handling missing values:
Python
df.isnull()       # Check for missing values
df.dropna()       # Remove rows with missing values
df.fillna(0)      # Fill missing values with 0
  • String operations (for string columns):
Python
df['col2'].str.upper() #convert to upper case.
df['col2'].str.contains('a') #boolean series if string contains a.
  • Merging/Joining:
Python
pd.merge(df1, df2, on='common_col') # Merge DataFrames
pd.concat([df1,df2]) #combine dataframes vertically
df1.join(df2, how='left') # Left join on the index (overlapping column names need lsuffix/rsuffix)
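
To see a few of these manipulations on concrete values, here is a small sketch. The tables scores and names and their columns are made up for illustration; they are not from the notes above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing score
scores = pd.DataFrame({
    "team": ["a", "a", "b", "b"],
    "score": [1.0, 3.0, np.nan, 4.0],
})

# Replace the missing score with 0, then average per team
filled = scores.fillna(0)
team_mean = filled.groupby("team")["score"].mean()
print(team_mean)

# A second hypothetical table sharing the 'team' column
names = pd.DataFrame({"team": ["a", "c"], "name": ["Alpha", "Gamma"]})

# Default (inner) merge keeps only teams present in both tables
inner = pd.merge(scores, names, on="team")

# how="left" keeps every row of scores; name is missing for team "b"
left_join = pd.merge(scores, names, on="team", how="left")
print(left_join)
```

The how argument is the main design choice when merging: an inner merge silently drops unmatched rows, while a left merge keeps them and fills the gaps with missing values.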

5. Time Series (if applicable):

  • Datetime conversion:
Python
df['date'] = pd.to_datetime(df['date'])
  • Resampling:
Python
df.resample('M', on='date').mean() # Resample to monthly means (pandas 2.2+ deprecates 'M' in favor of 'ME')
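
A minimal resampling sketch, assuming a hypothetical table of three dated readings. It uses the month-start alias "MS" rather than month-end, since that alias is spelled the same across pandas versions; each result is labeled with the first day of its month.

```python
import pandas as pd

# Hypothetical readings spanning two months
df = pd.DataFrame({
    "date": ["2024-01-10", "2024-01-20", "2024-02-05"],
    "value": [10.0, 20.0, 30.0],
})

# Convert the text dates into real datetime values
df["date"] = pd.to_datetime(df["date"])

# Group the rows by calendar month and average the 'value' column
monthly = df.resample("MS", on="date")["value"].mean()
print(monthly)
```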

6. Saving Data:

  • To CSV:
Python
df.to_csv('output.csv', index=False)
  • To Excel:
Python
df.to_excel('output.xlsx', index=False)
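
A quick way to check what to_csv produces, sketched here without touching the disk: when no path is given, to_csv returns the CSV text, and read_csv can read it back from an in-memory buffer. The variable names are illustrative.

```python
import io

import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})

# With no path argument, to_csv returns the CSV text instead of writing a file
csv_text = df.to_csv(index=False)
print(csv_text)

# read_csv accepts any file-like object, so the text can be read straight back
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

Passing index=False leaves the row labels out of the file, so reading it back does not produce an extra unnamed column.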

Important Notes:

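  • axis=0 refers to rows (the operation runs down each column), and axis=1 refers to columns (the operation runs across each row).
  • inplace=True modifies the DataFrame directly and returns None; without it, methods such as drop() return a new DataFrame and leave the original unchanged.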
  • Always check the pandas documentation for the most up-to-date information.
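
The first two notes can be illustrated with a short sketch, assuming a small all-numeric DataFrame made up for this purpose.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [10, 20, 30]})

# axis=0 collapses the rows: one sum per column
col_sums = df.sum(axis=0)

# axis=1 collapses the columns: one sum per row
row_sums = df.sum(axis=1)

# Without inplace, drop() returns a new DataFrame and df keeps col2
dropped = df.drop("col2", axis=1)

# With inplace=True, df itself is changed and the call returns None
result = df.drop("col2", axis=1, inplace=True)
print(df.columns.tolist())
```

Because an inplace call returns None, writing df = df.drop(..., inplace=True) would replace df with None, which is a common slip.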
