Saturday , December 21 2024

Data Cleaning Melbourne House Dataset

2 – Project Melboune House Data Cleaning

Import The Numpy and Pandas Packages

In [2]:
import pandas as pd
import numpy as np

Task 1: Reading and Inspection

  • ### Subtask 1.1: Import and read

Import and read the melboune house database. Store it in a variable called data.

In [3]:
data = pd.read_csv('Melbourne.csv')
OrgData = data
data
Out[3]:
Suburb Address Rooms Type Price Method SellerG Date Distance Postcode Bathroom Car Landsize BuildingArea YearBuilt CouncilArea Lattitude Longtitude Regionname Propertycount
0 Abbotsford 68 Studley St 2 h NaN SS Jellis 03-09-2016 2.5 3067.0 1.0 1.0 126.0 NaN NaN Yarra -37.80140 144.99580 Northern Metropolitan 4019.0
1 Abbotsford 85 Turner St 2 h 1480000.0 S Biggin 03-12-2016 2.5 3067.0 1.0 1.0 202.0 NaN NaN Yarra -37.79960 144.99840 Northern Metropolitan 4019.0
2 Abbotsford 25 Bloomburg St 2 h 1035000.0 S Biggin 04-02-2016 2.5 3067.0 1.0 0.0 156.0 79.0 1900.0 Yarra -37.80790 144.99340 Northern Metropolitan 4019.0
3 Abbotsford 18/659 Victoria St 3 u NaN VB Rounds 04-02-2016 2.5 3067.0 2.0 1.0 0.0 NaN NaN Yarra -37.81140 145.01160 Northern Metropolitan 4019.0
4 Abbotsford 5 Charles St 3 h 1465000.0 SP Biggin 04-03-2017 2.5 3067.0 2.0 0.0 134.0 150.0 1900.0 Yarra -37.80930 144.99440 Northern Metropolitan 4019.0
23542 Wyndham Vale 25 Clitheroe Dr 3 u NaN PN Harcourts 26-08-2017 27.2 3024.0 1.0 0.0 552.0 119.0 1990.0 NaN -37.90032 144.61839 Western Metropolitan 5262.0
23543 Wyndham Vale 19 Dalrymple Bvd 4 h NaN S hockingstuart 26-08-2017 27.2 3024.0 NaN NaN NaN NaN NaN NaN -37.87882 144.60184 Western Metropolitan 5262.0
23544 Yallambie 17 Amaroo Wy 4 h 1100000.0 S Buckingham 26-08-2017 12.7 3085.0 3.0 2.0 NaN NaN NaN NaN -37.72006 145.10547 Northern Metropolitan 1369.0
23545 Yarraville 6 Agnes St 4 h 1285000.0 SP Village 26-08-2017 6.3 3013.0 1.0 1.0 362.0 112.0 1920.0 NaN -37.81188 144.88449 Western Metropolitan 6543.0
23546 Yarraville 33 Freeman St 4 h 1050000.0 VB Village 26-08-2017 6.3 3013.0 2.0 2.0 NaN 139.0 1950.0 NaN -37.81829 144.87404 Western Metropolitan 6543.0

23547 rows × 21 columns

  • ### Subtask 1.2: Inspect the dataframe

Inspect the dataframe’s columns, shapes, variable types etc.

In [4]:
data.shape
Out[4]:
(23547, 21)
In [5]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23547 entries, 0 to 23546
Data columns (total 21 columns):
Suburb           23547 non-null object
Address          23547 non-null object
Rooms            23547 non-null int64
Type             23547 non-null object
Price            18396 non-null float64
Method           23547 non-null object
SellerG          23547 non-null object
Date             23547 non-null object
Distance         23546 non-null float64
Postcode         23546 non-null float64
Bedroom2         19066 non-null float64
Bathroom         19063 non-null float64
Car              18921 non-null float64
Landsize         17410 non-null float64
BuildingArea     10018 non-null float64
YearBuilt        11540 non-null float64
CouncilArea      15656 non-null object
Lattitude        19243 non-null float64
Longtitude       19243 non-null float64
Regionname       23546 non-null object
Propertycount    23546 non-null float64
dtypes: float64(12), int64(1), object(8)
memory usage: 3.8+ MB

Task 2: Cleaning the Data

  • ### Subtask 2.1: Inspect Null values

Find out the number of Null values in all the columns and rows. Also, find the percentage of Null values in each column. Round off the percentages upto two decimal places.

Write your code for column-wise null count here

In [6]:
data.isnull().sum(axis=0).sort_values(ascending=False)
Out[6]:
BuildingArea     13529
YearBuilt        12007
CouncilArea       7891
Landsize          6137
Price             5151
Car               4626
Bathroom          4484
Bedroom2          4481
Longtitude        4304
Lattitude         4304
Distance             1
Propertycount        1
Postcode             1
Regionname           1
Date                 0
SellerG              0
Method               0
Type                 0
Rooms                0
Address              0
Suburb               0
dtype: int64

Write your code for row-wise null count here

In [7]:
data.isnull().sum(axis=1).sort_values(ascending=False)
Out[7]:
18523    13
17100    10
16232    10
20968    10
6313     10
         ..
11553     0
5077      0
11551     0
11550     0
8551      0
Length: 23547, dtype: int64

Find Columns having at least one missing value

In [8]:
data.isnull().sum() >0
Out[8]:
Suburb           False
Address          False
Rooms            False
Type             False
Price             True
Method           False
SellerG          False
Date             False
Distance          True
Postcode          True
Bedroom2          True
Bathroom          True
Car               True
Landsize          True
BuildingArea      True
YearBuilt         True
CouncilArea       True
Lattitude         True
Longtitude        True
Regionname        True
Propertycount     True
dtype: bool

Find Columns having at least one missing value using any() function

In [9]:
d = data.isnull().any()
d
Out[9]:
Suburb           False
Address          False
Rooms            False
Type             False
Price             True
Method           False
SellerG          False
Date             False
Distance          True
Postcode          True
Bedroom2          True
Bathroom          True
Car               True
Landsize          True
BuildingArea      True
YearBuilt         True
CouncilArea       True
Lattitude         True
Longtitude        True
Regionname        True
Propertycount     True
dtype: bool
In [10]:
d.index[d.values]
Out[10]:
Index(['Price', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

By default any() operates on columns

In [11]:
data.isnull().any(axis=0)
Out[11]:
Suburb           False
Address          False
Rooms            False
Type             False
Price             True
Method           False
SellerG          False
Date             False
Distance          True
Postcode          True
Bedroom2          True
Bathroom          True
Car               True
Landsize          True
BuildingArea      True
YearBuilt         True
CouncilArea       True
Lattitude         True
Longtitude        True
Regionname        True
Propertycount     True
dtype: bool

Find the missing value row wise if any found return True/False

In [12]:
data.isnull().any(axis=1)
Out[12]:
0         True
1         True
2        False
3         True
4        False
         ...  
23542     True
23543     True
23544     True
23545     True
23546     True
Length: 23547, dtype: bool
In [13]:
data[data.isnull().any(axis=1)]
Out[13]:
Suburb Address Rooms Type Price Method SellerG Date Distance Postcode Bathroom Car Landsize BuildingArea YearBuilt CouncilArea Lattitude Longtitude Regionname Propertycount
0 Abbotsford 68 Studley St 2 h NaN SS Jellis 03-09-2016 2.5 3067.0 1.0 1.0 126.0 NaN NaN Yarra -37.80140 144.99580 Northern Metropolitan 4019.0
1 Abbotsford 85 Turner St 2 h 1480000.0 S Biggin 03-12-2016 2.5 3067.0 1.0 1.0 202.0 NaN NaN Yarra -37.79960 144.99840 Northern Metropolitan 4019.0
3 Abbotsford 18/659 Victoria St 3 u NaN VB Rounds 04-02-2016 2.5 3067.0 2.0 1.0 0.0 NaN NaN Yarra -37.81140 145.01160 Northern Metropolitan 4019.0
5 Abbotsford 40 Federation La 3 h 850000.0 PI Biggin 04-03-2017 2.5 3067.0 2.0 1.0 94.0 NaN NaN Yarra -37.79690 144.99690 Northern Metropolitan 4019.0
7 Abbotsford 16 Maugie St 4 h NaN SN Nelson 06-08-2016 2.5 3067.0 2.0 2.0 400.0 220.0 2006.0 Yarra -37.79650 144.99650 Northern Metropolitan 4019.0
23542 Wyndham Vale 25 Clitheroe Dr 3 u NaN PN Harcourts 26-08-2017 27.2 3024.0 1.0 0.0 552.0 119.0 1990.0 NaN -37.90032 144.61839 Western Metropolitan 5262.0
23543 Wyndham Vale 19 Dalrymple Bvd 4 h NaN S hockingstuart 26-08-2017 27.2 3024.0 NaN NaN NaN NaN NaN NaN -37.87882 144.60184 Western Metropolitan 5262.0
23544 Yallambie 17 Amaroo Wy 4 h 1100000.0 S Buckingham 26-08-2017 12.7 3085.0 3.0 2.0 NaN NaN NaN NaN -37.72006 145.10547 Northern Metropolitan 1369.0
23545 Yarraville 6 Agnes St 4 h 1285000.0 SP Village 26-08-2017 6.3 3013.0 1.0 1.0 362.0 112.0 1920.0 NaN -37.81188 144.88449 Western Metropolitan 6543.0
23546 Yarraville 33 Freeman St 4 h 1050000.0 VB Village 26-08-2017 6.3 3013.0 2.0 2.0 NaN 139.0 1950.0 NaN -37.81829 144.87404 Western Metropolitan 6543.0

17351 rows × 21 columns

Columns having all missing values

In [14]:
data.isnull().all(axis=0)
Out[14]:
Suburb           False
Address          False
Rooms            False
Type             False
Price            False
Method           False
SellerG          False
Date             False
Distance         False
Postcode         False
Bedroom2         False
Bathroom         False
Car              False
Landsize         False
BuildingArea     False
YearBuilt        False
CouncilArea      False
Lattitude        False
Longtitude       False
Regionname       False
Propertycount    False
dtype: bool

Rows having all missing values

In [15]:
data.isnull().all(axis=1)
Out[15]:
0        False
1        False
2        False
3        False
4        False
         ...  
23542    False
23543    False
23544    False
23545    False
23546    False
Length: 23547, dtype: bool
In [16]:
data.isnull().all(axis=1).sum()
Out[16]:
0

Summing up the missing values (column-wise) : Cal in %

In [17]:
len(data)
Out[17]:
23547
In [18]:
data.isnull().sum(axis=0).sort_values(ascending=False)/len(data)*100
Out[18]:
BuildingArea     57.455302
YearBuilt        50.991634
CouncilArea      33.511700
Landsize         26.062768
Price            21.875398
Car              19.645815
Bathroom         19.042766
Bedroom2         19.030025
Longtitude       18.278337
Lattitude        18.278337
Distance          0.004247
Propertycount     0.004247
Postcode          0.004247
Regionname        0.004247
Date              0.000000
SellerG           0.000000
Method            0.000000
Type              0.000000
Rooms             0.000000
Address           0.000000
Suburb            0.000000
dtype: float64

Removing the three columns where the max null value percentage

In [19]:
Col = data.isnull().sum(axis=0).sort_values(ascending=False).head(3).index.values
Col
Out[19]:
array(['BuildingArea', 'YearBuilt', 'CouncilArea'], dtype=object)
In [20]:
data = data.drop(Col,axis='columns')
data
Out[20]:
Suburb Address Rooms Type Price Method SellerG Date Distance Postcode Bedroom2 Bathroom Car Landsize Lattitude Longtitude Regionname Propertycount
0 Abbotsford 68 Studley St 2 h NaN SS Jellis 03-09-2016 2.5 3067.0 2.0 1.0 1.0 126.0 -37.80140 144.99580 Northern Metropolitan 4019.0
1 Abbotsford 85 Turner St 2 h 1480000.0 S Biggin 03-12-2016 2.5 3067.0 2.0 1.0 1.0 202.0 -37.79960 144.99840 Northern Metropolitan 4019.0
2 Abbotsford 25 Bloomburg St 2 h 1035000.0 S Biggin 04-02-2016 2.5 3067.0 2.0 1.0 0.0 156.0 -37.80790 144.99340 Northern Metropolitan 4019.0
3 Abbotsford 18/659 Victoria St 3 u NaN VB Rounds 04-02-2016 2.5 3067.0 3.0 2.0 1.0 0.0 -37.81140 145.01160 Northern Metropolitan 4019.0
4 Abbotsford 5 Charles St 3 h 1465000.0 SP Biggin 04-03-2017 2.5 3067.0 3.0 2.0 0.0 134.0 -37.80930 144.99440 Northern Metropolitan 4019.0
23542 Wyndham Vale 25 Clitheroe Dr 3 u NaN PN Harcourts 26-08-2017 27.2 3024.0 3.0 1.0 0.0 552.0 -37.90032 144.61839 Western Metropolitan 5262.0
23543 Wyndham Vale 19 Dalrymple Bvd 4 h NaN S hockingstuart 26-08-2017 27.2 3024.0 NaN NaN NaN NaN -37.87882 144.60184 Western Metropolitan 5262.0
23544 Yallambie 17 Amaroo Wy 4 h 1100000.0 S Buckingham 26-08-2017 12.7 3085.0 4.0 3.0 2.0 NaN -37.72006 145.10547 Northern Metropolitan 1369.0
23545 Yarraville 6 Agnes St 4 h 1285000.0 SP Village 26-08-2017 6.3 3013.0 4.0 1.0 1.0 362.0 -37.81188 144.88449 Western Metropolitan 6543.0
23546 Yarraville 33 Freeman St 4 h 1050000.0 VB Village 26-08-2017 6.3 3013.0 4.0 2.0 2.0 NaN -37.81829 144.87404 Western Metropolitan 6543.0

23547 rows × 18 columns

In [21]:
data.isnull().sum().sort_values(ascending=False)/len(data)*100
Out[21]:
Landsize         26.062768
Price            21.875398
Car              19.645815
Bathroom         19.042766
Bedroom2         19.030025
Longtitude       18.278337
Lattitude        18.278337
Propertycount     0.004247
Regionname        0.004247
Distance          0.004247
Postcode          0.004247
Date              0.000000
SellerG           0.000000
Method            0.000000
Type              0.000000
Rooms             0.000000
Address           0.000000
Suburb            0.000000
dtype: float64

Check the rows where you have more then 5 missing Values

In [22]:
data[data.isnull().sum(axis=1) > 5]
Out[22]:
Suburb Address Rooms Type Price Method SellerG Date Distance Postcode Bedroom2 Bathroom Car Landsize Lattitude Longtitude Regionname Propertycount
15 Abbotsford 217 Langridge St 3 h 1000000.0 S Jellis 08-10-2016 2.5 3067.0 NaN NaN NaN NaN NaN NaN Northern Metropolitan 4019.0
16 Abbotsford 18a Mollison St 2 t 745000.0 S Jellis 08-10-2016 2.5 3067.0 NaN NaN NaN NaN NaN NaN Northern Metropolitan 4019.0
19 Abbotsford 403/609 Victoria St 2 u 542000.0 S Dingle 08-10-2016 2.5 3067.0 NaN NaN NaN NaN NaN NaN Northern Metropolitan 4019.0
21 Abbotsford 25/84 Trenerry Cr 2 u 760000.0 SP Biggin 10-12-2016 2.5 3067.0 NaN NaN NaN NaN NaN NaN Northern Metropolitan 4019.0
22 Abbotsford 106/119 Turner St 1 u 481000.0 SP Purplebricks 10-12-2016 2.5 3067.0 NaN NaN NaN NaN NaN NaN Northern Metropolitan 4019.0
23495 Templestowe Lower 1/207 Manningham Rd 2 u 550000.0 PI Barry 26-08-2017 12.4 3107.0 NaN NaN NaN NaN NaN NaN Eastern Metropolitan 5420.0
23499 Thornbury 1/128 Dundas St 2 t 770000.0 PI McGrath 26-08-2017 7.0 3071.0 NaN NaN NaN NaN NaN NaN Northern Metropolitan 8870.0
23502 Thornbury 111 Pender St 2 t 858000.0 S Jellis 26-08-2017 7.0 3071.0 NaN NaN NaN NaN NaN NaN Northern Metropolitan 8870.0
23508 Toorak 21/1059 Malvern Rd 2 u 720000.0 VB Beller 26-08-2017 4.1 3142.0 NaN NaN NaN NaN NaN NaN Southern Metropolitan 7217.0
23531 Werribee 91 Latham St 4 h 540000.0 PI Triwest 26-08-2017 14.7 3030.0 NaN NaN NaN NaN NaN NaN Western Metropolitan 16166.0

4278 rows × 18 columns

Count the number of rows having >5 missing values

In [23]:
len(data[data.isnull().sum(axis=1) > 5])
Out[23]:
4278
In [24]:
data[data.isnull().sum(axis=1) > 5].shape
Out[24]:
(4278, 18)

Calculte the percentage

In [25]:
len(data[data.isnull().sum(axis=1) > 5])/len(data)*100
Out[25]:
18.16791948018856
In [26]:
round(len(data[data.isnull().sum(axis=1) > 5])/len(data)*100,2)
Out[26]:
18.17

Retaining the rows having <=5 NaNs

In [27]:
data = data[data.isnull().sum(axis=1) <=5]
data
Out[27]:
Suburb Address Rooms Type Price Method SellerG Date Distance Postcode Bedroom2 Bathroom Car Landsize Lattitude Longtitude Regionname Propertycount
0 Abbotsford 68 Studley St 2 h NaN SS Jellis 03-09-2016 2.5 3067.0 2.0 1.0 1.0 126.0 -37.80140 144.99580 Northern Metropolitan 4019.0
1 Abbotsford 85 Turner St 2 h 1480000.0 S Biggin 03-12-2016 2.5 3067.0 2.0 1.0 1.0 202.0 -37.79960 144.99840 Northern Metropolitan 4019.0
2 Abbotsford 25 Bloomburg St 2 h 1035000.0 S Biggin 04-02-2016 2.5 3067.0 2.0 1.0 0.0 156.0 -37.80790 144.99340 Northern Metropolitan 4019.0
3 Abbotsford 18/659 Victoria St 3 u NaN VB Rounds 04-02-2016 2.5 3067.0 3.0 2.0 1.0 0.0 -37.81140 145.01160 Northern Metropolitan 4019.0
4 Abbotsford 5 Charles St 3 h 1465000.0 SP Biggin 04-03-2017 2.5 3067.0 3.0 2.0 0.0 134.0 -37.80930 144.99440 Northern Metropolitan 4019.0
23542 Wyndham Vale 25 Clitheroe Dr 3 u NaN PN Harcourts 26-08-2017 27.2 3024.0 3.0 1.0 0.0 552.0 -37.90032 144.61839 Western Metropolitan 5262.0
23543 Wyndham Vale 19 Dalrymple Bvd 4 h NaN S hockingstuart 26-08-2017 27.2 3024.0 NaN NaN NaN NaN -37.87882 144.60184 Western Metropolitan 5262.0
23544 Yallambie 17 Amaroo Wy 4 h 1100000.0 S Buckingham 26-08-2017 12.7 3085.0 4.0 3.0 2.0 NaN -37.72006 145.10547 Northern Metropolitan 1369.0
23545 Yarraville 6 Agnes St 4 h 1285000.0 SP Village 26-08-2017 6.3 3013.0 4.0 1.0 1.0 362.0 -37.81188 144.88449 Western Metropolitan 6543.0
23546 Yarraville 33 Freeman St 4 h 1050000.0 VB Village 26-08-2017 6.3 3013.0 4.0 2.0 2.0 NaN -37.81829 144.87404 Western Metropolitan 6543.0

19269 rows × 18 columns

In [28]:
data.isnull().sum().sort_values(ascending=False)/len(data)*100
Out[28]:
Price            21.708444
Landsize          9.647621
Car               1.806010
Bathroom          1.069075
Bedroom2          1.053506
Longtitude        0.134932
Lattitude         0.134932
Propertycount     0.000000
Method            0.000000
Address           0.000000
Rooms             0.000000
Type              0.000000
Distance          0.000000
SellerG           0.000000
Date              0.000000
Regionname        0.000000
Postcode          0.000000
Suburb            0.000000
dtype: float64

Removing NAN Price rows

In [29]:
data = data[data.Price.notnull()]
data
Out[29]:
Suburb Address Rooms Type Price Method SellerG Date Distance Postcode Bedroom2 Bathroom Car Landsize Lattitude Longtitude Regionname Propertycount
1 Abbotsford 85 Turner St 2 h 1480000.0 S Biggin 03-12-2016 2.5 3067.0 2.0 1.0 1.0 202.0 -37.79960 144.99840 Northern Metropolitan 4019.0
2 Abbotsford 25 Bloomburg St 2 h 1035000.0 S Biggin 04-02-2016 2.5 3067.0 2.0 1.0 0.0 156.0 -37.80790 144.99340 Northern Metropolitan 4019.0
4 Abbotsford 5 Charles St 3 h 1465000.0 SP Biggin 04-03-2017 2.5 3067.0 3.0 2.0 0.0 134.0 -37.80930 144.99440 Northern Metropolitan 4019.0
5 Abbotsford 40 Federation La 3 h 850000.0 PI Biggin 04-03-2017 2.5 3067.0 3.0 2.0 1.0 94.0 -37.79690 144.99690 Northern Metropolitan 4019.0
6 Abbotsford 55a Park St 4 h 1600000.0 VB Nelson 04-06-2016 2.5 3067.0 3.0 1.0 2.0 120.0 -37.80720 144.99410 Northern Metropolitan 4019.0
23540 Williamstown 8/2 Thompson St 2 t 622500.0 SP Greg 26-08-2017 6.8 3016.0 2.0 2.0 1.0 NaN -37.86393 144.90484 Western Metropolitan 6380.0
23541 Williamstown 96 Verdon St 4 h 2500000.0 PI Sweeney 26-08-2017 6.8 3016.0 4.0 1.0 5.0 866.0 -37.85908 144.89299 Western Metropolitan 6380.0
23544 Yallambie 17 Amaroo Wy 4 h 1100000.0 S Buckingham 26-08-2017 12.7 3085.0 4.0 3.0 2.0 NaN -37.72006 145.10547 Northern Metropolitan 1369.0
23545 Yarraville 6 Agnes St 4 h 1285000.0 SP Village 26-08-2017 6.3 3013.0 4.0 1.0 1.0 362.0 -37.81188 144.88449 Western Metropolitan 6543.0
23546 Yarraville 33 Freeman St 4 h 1050000.0 VB Village 26-08-2017 6.3 3013.0 4.0 2.0 2.0 NaN -37.81829 144.87404 Western Metropolitan 6543.0

15086 rows × 18 columns

In [30]:
round(data.isnull().sum().sort_values(ascending=False)/len(data)*100,2)
Out[30]:
Landsize         9.83
Car              1.76
Bathroom         1.07
Bedroom2         1.05
Longtitude       0.15
Lattitude        0.15
Propertycount    0.00
Method           0.00
Address          0.00
Rooms            0.00
Type             0.00
Price            0.00
Distance         0.00
SellerG          0.00
Date             0.00
Regionname       0.00
Postcode         0.00
Suburb           0.00
dtype: float64

Still Landsize columns has 9.83% nan data, Now Describe the row

In [31]:
data['Landsize'].describe()
Out[31]:
count     13603.000000
mean        558.116371
std        3987.326586
min           0.000000
25%         176.500000
50%         440.000000
75%         651.000000
max      433014.000000
Name: Landsize, dtype: float64
In [32]:
data = data[data.Landsize.notnull()]
data
Out[32]:
Suburb Address Rooms Type Price Method SellerG Date Distance Postcode Bedroom2 Bathroom Car Landsize Lattitude Longtitude Regionname Propertycount
1 Abbotsford 85 Turner St 2 h 1480000.0 S Biggin 03-12-2016 2.5 3067.0 2.0 1.0 1.0 202.0 -37.79960 144.99840 Northern Metropolitan 4019.0
2 Abbotsford 25 Bloomburg St 2 h 1035000.0 S Biggin 04-02-2016 2.5 3067.0 2.0 1.0 0.0 156.0 -37.80790 144.99340 Northern Metropolitan 4019.0
4 Abbotsford 5 Charles St 3 h 1465000.0 SP Biggin 04-03-2017 2.5 3067.0 3.0 2.0 0.0 134.0 -37.80930 144.99440 Northern Metropolitan 4019.0
5 Abbotsford 40 Federation La 3 h 850000.0 PI Biggin 04-03-2017 2.5 3067.0 3.0 2.0 1.0 94.0 -37.79690 144.99690 Northern Metropolitan 4019.0
6 Abbotsford 55a Park St 4 h 1600000.0 VB Nelson 04-06-2016 2.5 3067.0 3.0 1.0 2.0 120.0 -37.80720 144.99410 Northern Metropolitan 4019.0
23537 Wheelers Hill 12 Strada Cr 4 h 1245000.0 S Barry 26-08-2017 16.7 3150.0 4.0 2.0 2.0 652.0 -37.90562 145.16761 South-Eastern Metropolitan 7392.0
23538 Williamstown 77 Merrett Dr 3 h 1031000.0 SP Williams 26-08-2017 6.8 3016.0 3.0 2.0 2.0 333.0 -37.85927 144.87904 Western Metropolitan 6380.0
23539 Williamstown 83 Power St 3 h 1170000.0 S Raine 26-08-2017 6.8 3016.0 3.0 2.0 4.0 436.0 -37.85274 144.88738 Western Metropolitan 6380.0
23541 Williamstown 96 Verdon St 4 h 2500000.0 PI Sweeney 26-08-2017 6.8 3016.0 4.0 1.0 5.0 866.0 -37.85908 144.89299 Western Metropolitan 6380.0
23545 Yarraville 6 Agnes St 4 h 1285000.0 SP Village 26-08-2017 6.3 3013.0 4.0 1.0 1.0 362.0 -37.81188 144.88449 Western Metropolitan 6543.0

13603 rows × 18 columns

In [33]:
round(data.isnull().sum().sort_values(ascending=False)/len(data)*100,2)
Out[33]:
Car              0.46
Longtitude       0.16
Lattitude        0.16
Bathroom         0.01
Propertycount    0.00
SellerG          0.00
Address          0.00
Rooms            0.00
Type             0.00
Price            0.00
Method           0.00
Distance         0.00
Date             0.00
Regionname       0.00
Postcode         0.00
Bedroom2         0.00
Landsize         0.00
Suburb           0.00
dtype: float64

Describe Lattitude and Longtitude and later Imputing Lattitude and Longtitude by Mean Values

In [34]:
data.loc[:,['Lattitude','Longtitude']].describe()
Out[34]:
Lattitude Longtitude
count 13581.000000 13581.000000
mean -37.809204 144.995221
std 0.079257 0.103913
min -38.182550 144.431810
25% -37.856820 144.929600
50% -37.802360 145.000100
75% -37.756400 145.058320
max -37.408530 145.526350
In [35]:
data['Lattitude'].mean()
Out[35]:
-37.809203506369194
In [36]:
data.loc[:,'Lattitude'].fillna(data['Lattitude'].mean(),inplace=True)
c:\users\karan\appdata\local\programs\python\python36\lib\site-packages\pandas\core\generic.py:6287: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._update_inplace(new_data)
In [37]:
data.loc[:,'Longtitude'].fillna(data['Longtitude'].mean(),inplace=True)
In [38]:
round(data.isnull().sum().sort_values(ascending=False)/len(data)*100,2)
Out[38]:
Car              0.46
Bathroom         0.01
Propertycount    0.00
Date             0.00
Address          0.00
Rooms            0.00
Type             0.00
Price            0.00
Method           0.00
SellerG          0.00
Distance         0.00
Regionname       0.00
Postcode         0.00
Bedroom2         0.00
Landsize         0.00
Lattitude        0.00
Longtitude       0.00
Suburb           0.00
dtype: float64

Now Still we are left with Bathroom and Car Parking.

In [39]:
data.loc[:,['Bathroom','Car']].describe()
Out[39]:
Bathroom Car
count 13602.000000 13540.000000
mean 1.534921 1.610414
std 0.691834 0.962244
min 0.000000 0.000000
25% 1.000000 1.000000
50% 1.000000 2.000000
75% 2.000000 2.000000
max 8.000000 10.000000
In [40]:
data.groupby('Car').Car.count().sort_values(ascending=False)
Out[40]:
Car
2.0     5606
1.0     5515
0.0     1026
3.0      748
4.0      507
5.0       63
6.0       54
8.0        9
7.0        8
10.0       3
9.0        1
Name: Car, dtype: int64

Another way to count using category method

In [41]:
data['Car'].astype('category').value_counts()
Out[41]:
2.0     5606
1.0     5515
0.0     1026
3.0      748
4.0      507
5.0       63
6.0       54
8.0        9
7.0        8
10.0       3
9.0        1
Name: Car, dtype: int64
In [42]:
data.loc[:,'Car'].fillna(2,inplace=True)
In [43]:
data.isnull().sum().sort_values(ascending=False)
Out[43]:
Bathroom         1
Propertycount    0
Date             0
Address          0
Rooms            0
Type             0
Price            0
Method           0
SellerG          0
Distance         0
Regionname       0
Postcode         0
Bedroom2         0
Car              0
Landsize         0
Lattitude        0
Longtitude       0
Suburb           0
dtype: int64

Now check for Bathroom

In [44]:
data['Bathroom'].astype('category').value_counts()
Out[44]:
1.0    7517
2.0    4987
3.0     921
4.0     106
0.0      34
5.0      28
6.0       5
8.0       2
7.0       2
Name: Bathroom, dtype: int64
In [45]:
data.loc[:,'Bathroom'].fillna(1,inplace=True)
In [46]:
round(data.isnull().sum().sort_values(ascending=False)/len(data)*100,2)
Out[46]:
Propertycount    0.0
Regionname       0.0
Address          0.0
Rooms            0.0
Type             0.0
Price            0.0
Method           0.0
SellerG          0.0
Date             0.0
Distance         0.0
Postcode         0.0
Bedroom2         0.0
Bathroom         0.0
Car              0.0
Landsize         0.0
Lattitude        0.0
Longtitude       0.0
Suburb           0.0
dtype: float64

Check No of Row Retain

In [47]:
len(data)/len(OrgData)*100
Out[47]:
57.769567248481756

Now Do The Data Analysis and Apply Machine Learning Algorithm To Predict Price of a House

In [48]:
data
Out[48]:
Suburb Address Rooms Type Price Method SellerG Date Distance Postcode Bedroom2 Bathroom Car Landsize Lattitude Longtitude Regionname Propertycount
1 Abbotsford 85 Turner St 2 h 1480000.0 S Biggin 03-12-2016 2.5 3067.0 2.0 1.0 1.0 202.0 -37.79960 144.99840 Northern Metropolitan 4019.0
2 Abbotsford 25 Bloomburg St 2 h 1035000.0 S Biggin 04-02-2016 2.5 3067.0 2.0 1.0 0.0 156.0 -37.80790 144.99340 Northern Metropolitan 4019.0
4 Abbotsford 5 Charles St 3 h 1465000.0 SP Biggin 04-03-2017 2.5 3067.0 3.0 2.0 0.0 134.0 -37.80930 144.99440 Northern Metropolitan 4019.0
5 Abbotsford 40 Federation La 3 h 850000.0 PI Biggin 04-03-2017 2.5 3067.0 3.0 2.0 1.0 94.0 -37.79690 144.99690 Northern Metropolitan 4019.0
6 Abbotsford 55a Park St 4 h 1600000.0 VB Nelson 04-06-2016 2.5 3067.0 3.0 1.0 2.0 120.0 -37.80720 144.99410 Northern Metropolitan 4019.0
23537 Wheelers Hill 12 Strada Cr 4 h 1245000.0 S Barry 26-08-2017 16.7 3150.0 4.0 2.0 2.0 652.0 -37.90562 145.16761 South-Eastern Metropolitan 7392.0
23538 Williamstown 77 Merrett Dr 3 h 1031000.0 SP Williams 26-08-2017 6.8 3016.0 3.0 2.0 2.0 333.0 -37.85927 144.87904 Western Metropolitan 6380.0
23539 Williamstown 83 Power St 3 h 1170000.0 S Raine 26-08-2017 6.8 3016.0 3.0 2.0 4.0 436.0 -37.85274 144.88738 Western Metropolitan 6380.0
23541 Williamstown 96 Verdon St 4 h 2500000.0 PI Sweeney 26-08-2017 6.8 3016.0 4.0 1.0 5.0 866.0 -37.85908 144.89299 Western Metropolitan 6380.0
23545 Yarraville 6 Agnes St 4 h 1285000.0 SP Village 26-08-2017 6.3 3013.0 4.0 1.0 1.0 362.0 -37.81188 144.88449 Western Metropolitan 6543.0

13603 rows × 18 columns

About Machine Learning

Check Also

Groupby in Pandas - Data Science Tutorials

Groupby in Pandas – Data Science Tutorials

14- Groupby Groupby in Pandas¶Pandas groupby is used for grouping the data according to the …

Leave a Reply

Your email address will not be published. Required fields are marked *