Mastering Data Analysis with Python Pandas
In the realm of data manipulation and analysis in Python, few tools are as indispensable as the Pandas library. Pandas provides powerful data structures and functions to efficiently handle and manipulate structured data, making it a favorite among data scientists, analysts, and programmers alike. In this tutorial, we will delve into the fundamentals of Pandas, exploring its core functionalities and demonstrating how to leverage its capabilities to manipulate, clean, and analyze data effectively.
Project Setup
We will set up Jupyter Notebook and Pandas inside a virtualenv.
Create Virtualenv
virtualenv venv
Activate Virtualenv
For Windows
venv\Scripts\activate
For Linux
source venv/bin/activate
Install Jupyter Notebook
pip install notebook
Set Up Kernel for venv
Install ipykernel
pip install ipykernel
Set venv kernel
python -m ipykernel install --user --name=venv
Install Pandas
pip install pandas
Run Jupyter Notebook
jupyter notebook
Do not forget to change the kernel in your Jupyter notebook to venv.
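All of the code examples below assume the standard import aliases, so run this once at the top of the notebook:
import pandas as pd
import numpy as np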
Basic Data Structures in Pandas
Series - It is a one-dimensional labeled array that can hold data of any type, such as integers, strings, or Python objects.
Syntax: Series(data, index=index)
Where,
data can be many different things: a Python dict, an ndarray, a scalar value (like 5).
index is a list of axis labels. If data is an ndarray, index must be the same length as data. If data is a scalar value, an index must be provided. The value will be repeated to match the length of index.
# Example 1
a = pd.Series([10, 20, 30, 40, 50]) # An Array
print(a)
#Example 2
b = pd.Series([10, 20, 30, 40, 50], index=["a", "b", "c", "d", "e"])
print(b)
# Example 3
c = pd.Series({"b":1, "a": 4, "c":7})
print(c)
# Example 4
d = pd.Series(5.0, index=["a", "b", "c", "d", "e"])
print(d)
# Example 5
e = pd.Series(5.0, index=["a", "b", "c", "d", "e"])
print("\nData Type:", e.dtype)
print("\nIndex:", e.index)
print("\nValues:", e.values)
print("\nGet Single Value:", e.get("c"))
DataFrame - It is a two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns.
Syntax: DataFrame(data, index=index)
Where data can be a dict of 1-D ndarrays, lists, dicts, or Series; a 2-D numpy.ndarray; a structured or record ndarray; a Series; or another DataFrame.
You can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and/or columns, they determine the index and/or columns of the resulting DataFrame.
# Example 1
data = {
'name': ['Sonam', 'Raj', 'Rohit', 'Rani'],
'age': [28, 24, 35, 32]
}
df = pd.DataFrame(data)
print(df)
# Example 2
data = {
'name': ['Sonam', 'Raj', 'Rohit', 'Rani'],
'age': [28, 24, 35, 32]
}
df = pd.DataFrame(data, index=["a", "b", "c", "d"])
print(df)
# Example 3
data = {
'name': ['Sonam', 'Raj', 'Rohit', 'Rani'],
'age': [28, 24, 35, 32]
}
df = pd.DataFrame(data, index=["a", "b", "c", "d"])
print("\nIndex:", df.index)
print("\nName:\n", df.name)
print("\nName:\n", df.age)
print("\nName:\n", df.name.get("a"))
# Example 4
data = {
"weight": pd.Series([10, 20, 30]),
"price": pd.Series([1.0, 2.0, 3.0, 4.0]),
}
df = pd.DataFrame(data)
print(df)
# Example 5
data = {
"weight": pd.Series([10, 20, 30], index=["a", "b", "c"]),
"price": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
}
df = pd.DataFrame(data)
print(df)
# Example 6
data = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
df = pd.DataFrame(data)
print(df)
# Example 7
data = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
df = pd.DataFrame(data, index=["1st", "2nd"])
print(df)
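The columns argument works the same way for column labels: it selects, and orders, the columns of the resulting DataFrame. A small sketch reusing the data above:
# Example 8
data = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
df = pd.DataFrame(data, index=["1st", "2nd"], columns=["a", "b"])
print(df)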
Inspecting Data or Viewing Data
info() - Provides a concise summary of a DataFrame, including the index dtype, column dtypes, non-null values, and memory usage.
isnull() - This function is used to detect missing values in a DataFrame or Series. It returns a DataFrame or Series of the same shape, with boolean values indicating whether each corresponding element is missing (i.e., True for missing values, False otherwise).
describe() - Generates descriptive statistics for numerical columns, including count, mean, standard deviation, min, max, and quartiles.
head() - Returns the first n rows of the DataFrame (default is 5).
tail() - Returns the last n rows of the DataFrame (default is 5).
duplicated() - This function is used to identify duplicate rows in a DataFrame. It returns a Series of boolean values where True indicates that a row is a duplicate of a previous row, based on certain criteria (default is to check all columns).
sort_index() - Sorts the DataFrame by its index.
sort_values() - Sorts the DataFrame by the values of one or more columns.
data = {
'name': [
'Sonam', 'Raj', 'Rohit', np.nan, 'Amit', 'Neha', 'Vikas', 'Pooja',
'Ajay', 'Raj', 'Suresh', 'Kiran', 'Anil', np.nan, 'Rahul',
'Sneha', 'Vivek', 'Raj', 'Arjun', 'Jyoti'
],
'age': [
28, 24, 35, 32, 27, 26, 30, 29, np.nan, 22, 33, 21, np.nan, 23, 25,
20, 36, 19, np.nan, 18
]
}
df = pd.DataFrame(data)
print("Data:\n", df)
#print(df.info())
#print("\nNull Info:\n", df.isnull())
#print("\nTotal Null Info:\n", df.isnull().sum())
#print("\nStatistical Summary:\n", df.describe())
#print("\nTop 5 Data:\n", df.head())
#print("\nTop 10 Data:\n", df.head(10))
#print("\nBottom 5 Data:\n", df.tail())
#print("\nBottom 10 Data:\n", df.tail(10))
#print("\nDuplicate Data:\n", df["name"].duplicated())
#print("\nTotal Duplicate Data:\n", df["name"].duplicated().sum())
#print("\nSort by Index:\n", df.sort_index())
#print("\nSort by Values:\n", df.sort_values('age'))
Column Selection or Indexing
loc - Accesses a group of rows and columns by labels or a boolean array.
iloc - Accesses a group of rows and columns by integer positions (indexes).
data = {
'name': [
'Sonam', 'Raj', 'Rohit', np.nan, 'Amit', 'Neha', 'Vikas', 'Pooja',
'Ajay', 'Raj', 'Suresh', 'Kiran', 'Anil', np.nan, 'Rahul',
'Sneha', 'Vivek', 'Raj', 'Arjun', 'Jyoti'
],
'age': [
28, 24, 35, 32, 27, 26, 30, 29, np.nan, 22, 33, 21, np.nan, 23, 25,
20, 36, 19, np.nan, 18
],
'roll': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111,
112, 113, 114, 115, 116, 117, 118, 119, 120],
'city': [
'Bokaro', 'Ranchi', 'Dhanbad', 'Deoghar', 'Giridih', 'Jamshedpur',
'Dumka', 'Delhi', 'Kolkata', 'Mumbai', 'Hazaribagh', 'Ramgarh',
'Gumla', 'Patna', 'Deoghar', 'Dhanbad', 'Chennai', 'Bokaro',
'Ahmedabad', 'Lucknow']
}
df = pd.DataFrame(data)
print("Display All Data:\n", df)
#print("\nSingle Column Data:\n", df["name"])
#print("\nSingle Column Data:\n", df["age"])
#print("\nMultiple Column Data:\n", df[["name", "city"]])
#print("\nSelect row by integer location\n", df.iloc[2])
#print("\nSelect row by Label\n", df.loc[2])
#print("\nSlice Row:\n", df[4:7])
#print("\nSelect Row by condition - True False:\n", df["age"]>25)
#print("\nDisplay Row by condition - True False:\n", df[df["age"]>25])
Column Insertion
insert() - Inserts a column into the DataFrame at a specified location.
data = {
'name': [
'Sonam', 'Raj', 'Rohit', np.nan, 'Amit', 'Neha', 'Vikas', 'Pooja',
'Ajay', 'Raj', 'Suresh', 'Kiran', 'Anil', np.nan, 'Rahul',
'Sneha', 'Vivek', 'Raj', 'Arjun', 'Jyoti'
],
'age': [
28, 24, 35, 32, 27, 26, 30, 29, np.nan, 22, 33, 21, np.nan, 23, 25,
20, 36, 19, np.nan, 18
],
'roll': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111,
112, 113, 114, 115, 116, 117, 118, 119, 120],
'city': [
'Bokaro', 'Ranchi', 'Dhanbad', 'Deoghar', 'Giridih', 'Jamshedpur',
'Dumka', 'Delhi', 'Kolkata', 'Mumbai', 'Hazaribagh', 'Ramgarh',
'Gumla', 'Patna', 'Deoghar', 'Dhanbad', 'Chennai', 'Bokaro',
'Ahmedabad', 'Lucknow']
}
df = pd.DataFrame(data)
print("Display All Data:\n", df)
# Insert or add a scalar value - the column gets inserted at the end
df["result"] = "Pass"
#print("\n Result Column Added:\n", df)
# Insert an ndarray or list - the length must match - the column gets inserted at the end
df["marks"] = [234, 453, 556, 676, 676, 453, 232, 234, 234, 123,
124, 567, 231, 867, 234, 566, 756, 678, 453, 354]
#print("\n Marks Column Added using Array:\n", df)
# insert() - inserts a column at a particular position
hobbies = ["cricket", "vgame", "reading", "singing", "chatting", "computer", "fighting",
"cricket", "vgame", "reading", "singing", "chatting", "computer", "fighting",
"cricket", "vgame", "reading", "singing", "chatting", "computer"]
df.insert(2, "hobbies", hobbies)
#print("Insert Column after 2nd position:\n", df)
# Calculate a new column from an existing column
df["fees"] = df["marks"] * 12
#print("\n Fees Column Added:\n", df)
Column Update
replace() - This function is used to replace values in a DataFrame or Series. It allows you to specify the values to be replaced and the values to replace them with. You can replace single values, lists of values, or use dictionaries for more complex mappings.
data = {
'name': ['Sonam', 'Raj', 'Rohit', 'Rani'],
'age': [28, 24, 35, 32],
'roll': [101, 102, 103, 104],
'marks':[111, 222, 333, 444],
'city':['bokaro', 'ranchi', 'dhanbad', 'deoghar'],
'gender':['F', 'M', 'M', 'F']
}
df = pd.DataFrame(data)
print("Display All Data:\n", df)
# Update all values in a column
df["gender"] = "F"
print("\nUpdate Gender Column:\n", df)
# Update a specific value in a column
df.loc[df['roll'] == 102, 'marks'] = 999
print("\nUpdate Raj marks :\n", df)
# Replace Values
print("\n Replace Values:\n", df.replace('deoghar', 'dumka'))
Column Deletion
pop() - Removes a specified column from the DataFrame and returns it.
drop() - Drops specified labels from rows or columns.
drop_duplicates() - This function removes duplicate rows from a DataFrame, keeping the first occurrence by default. It returns a DataFrame with the duplicates removed. You can customize which duplicates to drop using parameters like subset and keep.
data = {
'name': ['Sonam', 'Raj', 'Rohit', 'Raj'],
'age': [28, 24, 35, 32],
'roll': [101, 102, 103, 104],
'marks':[111, 222, 333, 444],
'city':['bokaro', 'ranchi', 'dhanbad', 'deoghar'],
'gender':['F', 'M', 'M', 'F']
}
df = pd.DataFrame(data)
print("Display All Data:\n", df)
# Delete single Column
#del df['gender']
#print("\nGender Column Deleted:\n", df)
# Pop single column
#print("\nPOP City:\n", df.pop('city'))
#print("\nData after POP City:\n", df)
# Delete multiple columns
#newdf = df.drop(columns=['name', 'roll'])
#print("\nDrop Name and Roll:\n", newdf)
# Delete Duplicate Values
#print("\n Delete Duplicates:\n", df.drop_duplicates('name'))
Filter
isin() - Checks whether each element in the DataFrame is contained in values.
data = {
'name': ['Sonam', 'Raj', 'Rohit', 'Rani'],
'age': [28, 24, 35, 32],
'roll': [101, 102, 103, 104],
'marks':[111, 222, 333, 444],
'city':['bokaro', 'ranchi', 'dhanbad', 'deoghar']
}
df = pd.DataFrame(data)
print("Display All Data:\n", df)
dt1 = df['age'].isin([28, 32])
print(dt1)
dt2 = df[df['age'].isin([28, 32])]
print(dt2)
Missing Data
isna() - Detects missing values, returning a DataFrame of the same shape with booleans indicating missing values.
fillna() - Fills missing values with a specified value or method.
dropna() - Removes missing values by dropping rows or columns with NaNs.
bfill() - This function performs backward filling of missing values in a DataFrame or Series, filling each missing value with the next valid value in the sequence.
ffill() - This function performs forward filling of missing values in a DataFrame or Series, filling each missing value with the previous valid value in the sequence.
data = {
'name': ['Sonam', 'Raj', 'Rohit', 'Rani'],
'age': [28, 24, np.nan, 32],
'roll': [101, 102, 103, 104],
'marks':[111, np.nan, 333, np.nan],
'city':['bokaro', 'ranchi', 'dhanbad', 'deoghar']
}
df = pd.DataFrame(data)
print("Display All Data:\n", df)
# Boolean Mask where values are nan
#print(df.isna())
#print(df.isna().sum())
# Fill some Value in Missing Data
#print(df.fillna(value=123))
# Drop missing data
#print(df.dropna())
# Backward Filling
#print(df.bfill())
# Forward Filling
#print(df.ffill())
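fillna() also accepts a dictionary so that each column gets its own fill value; a small sketch on the same DataFrame:
# Fill each column with its own value
print(df.fillna(value={'age': df['age'].mean(), 'marks': 0}))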
Stats
mean() - Computes the mean of DataFrame columns or rows.
agg() - Aggregates data using one or more operations over specified axis.
data = {
'name': ['Sonam', 'Raj', 'Rohit', 'Rani'],
'age': [28, 24, np.nan, 32],
'roll': [101, 102, 103, 104],
'marks':[111, 222, 333, np.nan],
'city':['bokaro', 'ranchi', 'dhanbad', 'deoghar']
}
df = pd.DataFrame(data)
print("Display All Data:\n", df)
# Calculate the mean of the 'age' column
#age_mean = df['age'].mean()
#print("\nMean age:", age_mean)
# Calculate the mean of the 'age' column including NaN values
#age_mean_with_nan = df['age'].mean(skipna=False)
#print("\nMean age including NaN:", age_mean_with_nan)
# Calculate the mean of 'age' and 'marks' columns
#selected_means = df[['age', 'marks']].mean()
#print("\nAge and Marks Mean:\n", selected_means)
# Apply a single aggregation function
#age_mean = df['age'].agg('mean')
#print("\nMean age:", age_mean)
# Apply multiple aggregation functions
#age_stats = df['age'].agg(['mean', 'min', 'max'])
#print("\nAge statistics:\n", age_stats)
# Apply different aggregation functions to different columns
custom_agg = df.agg({
'age': ['mean', 'min', 'max'],
'marks': 'sum',
'roll': 'count'
})
print("\nCustom aggregation:\n", custom_agg)
Transformation
transform() - Applies a function to a DataFrame or Series and returns a result with the same shape as the input.
data = {
'name': ['Sonam', 'Raj', 'Rohit', 'Rani'],
'age': [28, 24, np.nan, 32],
'roll': [101, 102, 103, 104],
'marks':[111, 222, 333, np.nan],
'city':['bokaro', 'ranchi', 'dhanbad', 'deoghar']
}
df = pd.DataFrame(data)
print("Display All Data:\n", df)
# Apply a single transformation function
#age_plus_one = df['age'].transform(lambda x: x + 1)
#print("\nAge plus one:\n", age_plus_one)
# Apply a built-in transformation function to a Series
#age_sqrt = df['age'].transform(np.sqrt)
#print("\nAge square root:\n", age_sqrt)
# Apply a single transformation function to multiple columns
#df_transformed = df[['age', 'marks']].transform(lambda x: x * 2)
#print("\nDataFrame transformed (age and marks doubled):\n", df_transformed)
# Apply different transformation functions to different columns
df_transformed_diff = df.transform({
'age': lambda x: x + 10,
'marks': lambda x: x - 100
})
print("\nDataFrame transformed with different functions:\n", df_transformed_diff)
Value Counts
value_counts() - Returns a Series containing counts of unique values.
data = {
'name': ['Sonam', 'Raj', 'Rohit', 'Rani', 'Sonam', 'Raj', 'Rohit', 'Rani', 'Sonam', 'Raj'],
'age': [28, 24, 35, 32, 28, 24, np.nan, 32, 28, 24],
'city': ['bokaro', 'ranchi', 'dhanbad', 'deoghar', 'ranchi', 'bokaro', 'dhanbad',
'deoghar', 'ranchi', 'ranchi']
}
df = pd.DataFrame(data)
print("Display All Data:\n", df)
# Use value_counts to count unique values in the 'city' column
city_counts = df['city'].value_counts()
print("\nCity counts:\n", city_counts)
# Use value_counts to count unique values in the 'name' column
name_counts = df['name'].value_counts()
print("\nName counts:\n", name_counts)
# Use value_counts to count unique values in the 'age' column including NaN values
age_counts = df['age'].value_counts(dropna=False)
print("\nAge counts including NaN:\n", age_counts)
# Use value_counts with sorting by index
city_counts_sorted = df['city'].value_counts().sort_index()
print("\nCity counts sorted by index:\n", city_counts_sorted)
String Methods
data = {
'name': ['Sonam', 'Raj', 'Rohit', 'Rani'],
'city': ['Bokaro', 'Ranchi', 'Dhanbad', 'Deoghar']
}
df = pd.DataFrame(data)
print("Display All Data:\n", df)
# Convert to lowercase
#print("\nLowercase cities:\n", df['city'].str.lower())
# Convert to Uppercase
#print("\nUppercase cities:\n", df['city'].str.upper())
# Length of String
#print("\nString Length cities:\n", df['city'].str.len())
# Extract first three characters
#print("\nCity substrings (first 3 characters):\n", df['city'].str[:3])
# Replace substrings
#print("\nCities with 'a' replaced by 'A':\n", df['city'].str.replace('a', 'A'))
Merge
merge() - Merges two DataFrames based on key columns or indexes.
# Create DataFrames
df1 = pd.DataFrame({
'id': [1, 2, 3, 4],
'name': ['Sonam', 'Raj', 'Sonal', 'Rajeev']
})
df2 = pd.DataFrame({
'id': [3, 4, 5, 6],
'score': [90, 80, 70, 60]
})
# Inner Join
inner_join = pd.merge(df1, df2, on='id')
print("Inner Join:\n", inner_join)
# Left Join
left_join = pd.merge(df1, df2, on='id', how='left')
print("\nLeft Join:\n", left_join)
# Right Join
right_join = pd.merge(df1, df2, on='id', how='right')
print("\nRight Join:\n", right_join)
# Outer Join
outer_join = pd.merge(df1, df2, on='id', how='outer')
print("\nOuter Join:\n", outer_join)
df3 = pd.DataFrame({
'id': [1, 2, 3, 4],
'name': ['Sonam', 'Raj', 'Sonal', 'Rajeev']
})
df4 = pd.DataFrame({
'stu_id': [3, 4, 5, 6],
'score': [90, 80, 70, 60]
})
# Inner Join
inner_join = pd.merge(df3, df4, left_on='id', right_on='stu_id')
print("Inner Join:\n", inner_join)
Concat
concat() - Concatenates DataFrames along a particular axis.
# Create DataFrames
df1 = pd.DataFrame({
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']
})
df2 = pd.DataFrame({
'A': ['A4', 'A5', 'A6', 'A7'],
'B': ['B4', 'B5', 'B6', 'B7']
})
df3 = pd.DataFrame({
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']
})
# Vertical Concatenation
vertical_concat = pd.concat([df1, df2])
print("Vertical Concatenation:\n", vertical_concat)
# Horizontal Concatenation
horizontal_concat = pd.concat([df1, df2], axis=1)
print("\nHorizontal Concatenation:\n", horizontal_concat)
# Concatenation with Different Columns
different_columns_concat = pd.concat([df1, df3], axis=1)
print("\nConcatenation with Different Columns:\n", different_columns_concat)
Grouping
groupby() - Groups DataFrame using a mapper or by a Series of columns and performs aggregate functions.
data = {
'team': ['A', 'A', 'B', 'B', 'C', 'C'],
'player': ['X', 'Y', 'X', 'Y', 'X', 'Y'],
'points': [10, 15, 10, 20, 10, 25]
}
df = pd.DataFrame(data)
# Display the DataFrame
print("Original DataFrame:\n", df)
# Group by 'team' and calculate the sum of 'points'
grouped_sum = df.groupby('team').sum()
print("\nGrouped by Team and summed:\n", grouped_sum)
# Group by 'team' and 'player' and calculate the sum of 'points'
grouped_multi = df.groupby(['team', 'player']).sum()
print("\nGrouped by Team and Player and summed:\n", grouped_multi)
# Group by 'team' and calculate various aggregations
grouped_agg = df.groupby('team').agg({
'points': ['sum', 'mean', 'max']
})
print("\nGrouped by Team with multiple aggregations:\n", grouped_agg)
# Calculate the mean points by team and assign it to each row
df['meanpoints'] = df.groupby('team')['points'].transform('mean')
print("\nMean Points by Team assigned to each row:\n", df)
# Iterate over groups
for name, group in df.groupby('team'):
    print(f"\nGroup: {name}")
    print(group)
Pivot Table
pivot_table() - Creates a pivot table from data in a DataFrame.
data = {
'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02'],
'City': ['Bokaro', 'Ranchi', 'Dhanbad', 'Dumka'],
'Sales': [200, 150, 300, 200],
'Expenses': [50, 60, 70, 80]
}
df = pd.DataFrame(data)
print("Original DataFrame:\n", df)
# Basic Pivot Table
pivot = pd.pivot_table(df, values='Sales', index='Date', columns='City', aggfunc='sum')
print("\nPivot Table:\n", pivot)
# Multiple Aggregation Functions
pivot_multi = pd.pivot_table(df, values=['Sales', 'Expenses'], index='Date', columns='City',
aggfunc={'Sales': 'sum', 'Expenses': 'mean'})
print("\nPivot Table with Multiple Aggregations:\n", pivot_multi)
# Adding Margins (Subtotals)
pivot_margins = pd.pivot_table(df, values='Sales', index='Date', columns='City',
aggfunc='sum', margins=True)
print("\nPivot Table with Margins:\n", pivot_margins)
# Filling Missing Values
pivot_fill = pd.pivot_table(df, values='Sales', index='Date', columns='City',
aggfunc='sum', fill_value=0)
print("\nPivot Table with Fill Value:\n", pivot_fill)
Map
map() - Applies a function to each element of a Series.
data = {
'name': ['Sonam', 'Raj', 'Rohit', 'Rani', 'Sonam', 'Raj', 'Rohit', 'Rani', 'Sonam', 'Raj'],
'age': [28, 24, 35, 32, 28, 24, np.nan, 32, 28, 24],
'city': ['bokaro', 'ranchi', 'dhanbad', 'deoghar', 'ranchi', 'bokaro', 'dhanbad',
'deoghar', 'ranchi', 'ranchi']
}
df = pd.DataFrame(data)
print("Display All Data:\n", df)
# Function to convert names to uppercase
def to_uppercase(name):
    return name.upper()
# Apply the to_uppercase function to the 'name' column using map()
df['name'] = df['name'].map(to_uppercase)
# Display the modified DataFrame
print("\nDataFrame after applying map() to 'name' column:")
print(df)
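map() also accepts a dictionary, which is handy for translating codes into labels; values missing from the dictionary become NaN. A small sketch on the same DataFrame (the state mapping here is only illustrative):
# Map each city to its state using a dictionary
city_to_state = {'bokaro': 'Jharkhand', 'ranchi': 'Jharkhand', 'dhanbad': 'Jharkhand', 'deoghar': 'Jharkhand'}
df['state'] = df['city'].map(city_to_state)
print("\nDataFrame after mapping cities to states:\n", df)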
Import and Export Files
CSV File
# Import Export CSV File
data = {
'name': ['Sonam', 'Raj', 'Rohit', 'Rani', 'Sonam', 'Raj', 'Rohit', 'Rani', 'Sonam', 'Raj'],
'age': [28, 24, 35, 32, 28, 24, np.nan, 32, 28, 24],
'city': ['bokaro', 'ranchi', 'dhanbad', 'deoghar', 'ranchi', 'bokaro', 'dhanbad',
'deoghar', 'ranchi', 'ranchi']
}
df = pd.DataFrame(data)
# Writing DataFrame to CSV file
# df.to_csv('sample.csv', index=False) # Without index
# Reading the CSV file back into a DataFrame
df_read = pd.read_csv('sample.csv')
print(df_read)
Excel File
# Import Export Excel File
data = {
'name': ['Sonam', 'Raj', 'Rohit', 'Rani', 'Sonam', 'Raj', 'Rohit', 'Rani', 'Sonam', 'Raj'],
'age': [28, 24, 35, 32, 28, 24, np.nan, 32, 28, 24],
'city': ['bokaro', 'ranchi', 'dhanbad', 'deoghar', 'ranchi', 'bokaro', 'dhanbad',
'deoghar', 'ranchi', 'ranchi']
}
df = pd.DataFrame(data)
# Writing DataFrame to Excel file
#df.to_excel('sample.xlsx', index=False) # Without index
# Reading the Excel file back into a DataFrame
df_read = pd.read_excel('sample.xlsx')
print(df_read)
JSON File
# Import Export JSON File
data = {
'name': ['Sonam', 'Raj', 'Rohit', 'Rani', 'Sonam', 'Raj', 'Rohit', 'Rani', 'Sonam', 'Raj'],
'age': [28, 24, 35, 32, 28, 24, np.nan, 32, 28, 24],
'city': ['bokaro', 'ranchi', 'dhanbad', 'deoghar', 'ranchi', 'bokaro', 'dhanbad',
'deoghar', 'ranchi', 'ranchi']
}
df = pd.DataFrame(data)
# Writing DataFrame to JSON file
#df.to_json("data.json")
# Reading the JSON file back into a DataFrame
df_read = pd.read_json('data.json')
print(df_read)