Import the NumPy and pandas Packages
In [2]:
import pandas as pd
import numpy as np
Task 1: Reading and Inspection
- ### Subtask 1.1: Import and read
Import and read the Melbourne housing dataset. Store it in a variable called data.
In [3]:
data = pd.read_csv('Melbourne.csv')
OrgData = data.copy()  # independent copy of the raw data, used later to measure how many rows were retained
data
Out[3]:
- ### Subtask 1.2: Inspect the dataframe
Inspect the dataframe’s columns, shape, variable types, etc.
In [4]:
data.shape
Out[4]:
In [5]:
data.info()
Task 2: Cleaning the Data
- ### Subtask 2.1: Inspect Null values
Find the number of null values in each column and in each row. Also find the percentage of null values in each column, rounded to two decimal places.
Write your code for column-wise null count here
In [6]:
data.isnull().sum(axis=0).sort_values(ascending=False)
Out[6]:
Write your code for row-wise null count here
In [7]:
data.isnull().sum(axis=1).sort_values(ascending=False)
Out[7]:
Find columns having at least one missing value
In [8]:
data.isnull().sum() > 0
Out[8]:
Find columns having at least one missing value using the any() function
In [9]:
d = data.isnull().any()
d
Out[9]:
In [10]:
d.index[d.values]
Out[10]:
By default, any() uses axis=0 and returns one result per column
In [11]:
data.isnull().any(axis=0)
Out[11]:
Check row-wise for missing values: True if a row has at least one, otherwise False
In [12]:
data.isnull().any(axis=1)
Out[12]:
In [13]:
data[data.isnull().any(axis=1)]
Out[13]:
Columns having all missing values
In [14]:
data.isnull().all(axis=0)
Out[14]:
Rows having all missing values
In [15]:
data.isnull().all(axis=1)
Out[15]:
In [16]:
data.isnull().all(axis=1).sum()
Out[16]:
Summing up the missing values (column-wise): calculate as a percentage
In [17]:
len(data)
Out[17]:
In [18]:
data.isnull().sum(axis=0).sort_values(ascending=False)/len(data)*100
Out[18]:
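The count and percentage steps above can be combined into one summary table. This is a minimal sketch on a made-up toy frame; the helper name `null_summary` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column null counts and percentages, worst columns first."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(2)
    summary = pd.DataFrame({"null_count": counts, "null_percent": percent})
    return summary.sort_values("null_count", ascending=False)

# Toy frame standing in for the Melbourne data (values are made up).
toy = pd.DataFrame({
    "Price": [1.0, np.nan, 3.0, 4.0],
    "Rooms": [2, 3, 3, 4],
    "Car": [np.nan, np.nan, 1.0, np.nan],
})

summary = null_summary(toy)
print(summary)
```

Calling `null_summary(data)` on the real frame would give the same two columns for every column of the dataset.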
Removing the three columns with the highest null value percentage
In [19]:
Col = data.isnull().sum(axis=0).sort_values(ascending=False).head(3).index.values
Col
Out[19]:
In [20]:
data = data.drop(Col,axis='columns')
data
Out[20]:
In [21]:
data.isnull().sum().sort_values(ascending=False)/len(data)*100
Out[21]:
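Dropping a fixed number of columns works here, but a percentage cutoff is a common alternative when the number of sparse columns is not known in advance. The sketch below assumes a 50% cutoff and uses a hypothetical helper `drop_sparse_columns` on made-up data; neither the cutoff nor the helper comes from the original notebook.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df: pd.DataFrame, max_null_percent: float) -> pd.DataFrame:
    """Drop every column whose share of nulls exceeds max_null_percent."""
    # isnull().mean() is the fraction of nulls per column.
    null_percent = df.isnull().mean() * 100
    to_drop = null_percent[null_percent > max_null_percent].index
    return df.drop(columns=to_drop)

# Toy frame; "Sparse" is an illustrative column name, values are made up.
toy = pd.DataFrame({
    "Price": [1.0, 2.0, 3.0, 4.0],
    "Sparse": [np.nan, np.nan, np.nan, 120.0],  # 75% null
    "Car": [1.0, np.nan, 2.0, 2.0],             # 25% null
})

trimmed = drop_sparse_columns(toy, max_null_percent=50)
print(list(trimmed.columns))
```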
Check the rows that have more than 5 missing values
In [22]:
data[data.isnull().sum(axis=1) > 5]
Out[22]:
Count the number of rows having >5 missing values
In [23]:
len(data[data.isnull().sum(axis=1) > 5])
Out[23]:
In [24]:
data[data.isnull().sum(axis=1) > 5].shape
Out[24]:
Calculate the percentage
In [25]:
len(data[data.isnull().sum(axis=1) > 5])/len(data)*100
Out[25]:
In [26]:
round(len(data[data.isnull().sum(axis=1) > 5])/len(data)*100,2)
Out[26]:
Retaining the rows having <=5 NaNs
In [27]:
data = data[data.isnull().sum(axis=1) <=5]
data
Out[27]:
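The same row filter can also be written with `dropna(thresh=...)`, which keeps rows that have at least `thresh` non-null values. A row with at most 5 NaNs has at least `n_columns - 5` non-null values, so the two forms should agree. The toy frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with 8 columns; values are made up.
toy = pd.DataFrame(np.ones((3, 8)), columns=list("abcdefgh"))
toy.iloc[0, :6] = np.nan  # row 0 has 6 missing values
toy.iloc[1, :5] = np.nan  # row 1 has 5 missing values

max_missing = 5

# Boolean-mask form, as used in the notebook.
mask_version = toy[toy.isnull().sum(axis=1) <= max_missing]

# Equivalent dropna form: require at least (n_columns - max_missing) non-null values.
thresh_version = toy.dropna(thresh=toy.shape[1] - max_missing)

print(mask_version.index.tolist())
```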
In [28]:
data.isnull().sum().sort_values(ascending=False)/len(data)*100
Out[28]:
Removing rows where Price is NaN
In [29]:
data = data[data.Price.notnull()]
data
Out[29]:
In [30]:
round(data.isnull().sum().sort_values(ascending=False)/len(data)*100,2)
Out[30]:
The Landsize column still has 9.83% NaN values; describe the column, then drop the rows where it is missing
In [31]:
data['Landsize'].describe()
Out[31]:
In [32]:
data = data[data.Landsize.notnull()]
data
Out[32]:
In [33]:
round(data.isnull().sum().sort_values(ascending=False)/len(data)*100,2)
Out[33]:
Describe Lattitude and Longtitude, then impute their missing values with the column means
In [34]:
data.loc[:,['Lattitude','Longtitude']].describe()
Out[34]:
In [35]:
data['Lattitude'].mean()
Out[35]:
In [36]:
data = data.fillna({'Lattitude': data['Lattitude'].mean()})  # assign the result back instead of filling a selection in place
In [37]:
data = data.fillna({'Longtitude': data['Longtitude'].mean()})  # assign the result back instead of filling a selection in place
In [38]:
round(data.isnull().sum().sort_values(ascending=False)/len(data)*100,2)
Out[38]:
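Mean imputation for both coordinate columns can be done in a single call by passing `fillna` a mapping from column name to fill value. Assigning the result to a new name, rather than relying on `inplace=True` on a selection, avoids filling a temporary object, which can leave the frame unchanged depending on the pandas version. The toy values below are made up.

```python
import numpy as np
import pandas as pd

# Toy frame using the dataset's column spellings; values are made up.
toy = pd.DataFrame({
    "Lattitude": [-37.8, np.nan, -37.6],
    "Longtitude": [145.0, 144.8, np.nan],
})

# Column means, computed with NaNs skipped (the pandas default).
means = toy[["Lattitude", "Longtitude"]].mean().to_dict()

# fillna with a dict fills each named column with its own value
# and returns a new frame; the original is left untouched.
filled = toy.fillna(means)

print(filled)
```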
Bathroom and Car are now the only columns left with missing values.
In [39]:
data.loc[:,['Bathroom','Car']].describe()
Out[39]:
In [40]:
data.groupby('Car').Car.count().sort_values(ascending=False)
Out[40]:
Another way to count, using value_counts() on a categorical column
In [41]:
data['Car'].astype('category').value_counts()
Out[41]:
In [42]:
data = data.fillna({'Car': 2})  # assign the result back instead of filling a selection in place
In [43]:
data.isnull().sum().sort_values(ascending=False)
Out[43]:
Now check Bathroom
In [44]:
data['Bathroom'].astype('category').value_counts()
Out[44]:
In [45]:
data = data.fillna({'Bathroom': 1})  # assign the result back instead of filling a selection in place
In [46]:
round(data.isnull().sum().sort_values(ascending=False)/len(data)*100,2)
Out[46]:
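The fill values 2 and 1 are hard-coded above. Assuming they were chosen as the most frequent value in each column (the cell outputs are not shown here, so this is an assumption), the mode can be computed from the data instead. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Toy frame; values are made up.
toy = pd.DataFrame({
    "Car": [2.0, 2.0, 1.0, np.nan],
    "Bathroom": [1.0, np.nan, 1.0, 2.0],
})

# Series.mode() ignores NaN by default and returns the most frequent
# value(s) in sorted order; take the first one.
modes = {col: toy[col].mode().iloc[0] for col in ["Car", "Bathroom"]}

filled = toy.fillna(modes)
print(modes)
```

If several values tie for most frequent, `mode()` returns all of them, so taking `.iloc[0]` picks the smallest; that tie-breaking rule is a choice made for this sketch.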
Check the percentage of rows retained
In [47]:
len(data)/len(OrgData)*100
Out[47]:
Now do the data analysis and apply a machine learning algorithm to predict the price of a house
In [48]:
data
Out[48]: