In [1]:
import pandas as pd
import numpy as np
In [2]:
data = pd.read_csv('mail_data.csv')
data
Out[2]:
| Category | Message |
---|---|---|
0 | ham | Go until jurong point, crazy.. Available only … |
1 | ham | Ok lar… Joking wif u oni… |
2 | spam | Free entry in 2 a wkly comp to win FA Cup fina… |
3 | ham | U dun say so early hor… U c already then say… |
4 | ham | Nah I don’t think he goes to usf, he lives aro… |
… | … | … |
5567 | spam | This is the 2nd time we have tried 2 contact u… |
5568 | ham | Will ü b going to esplanade fr home? |
5569 | ham | Pity, * was in mood for that. So…any other s… |
5570 | ham | The guy did some bitching but I acted like i’d… |
5571 | ham | Rofl. Its true to its name |
5572 rows × 2 columns
In [3]:
data.shape
Out[3]:
(5572, 2)
Steps:
- Data cleaning
- EDA
- Text Preprocessing
- Model Building
- Evaluation
- Improvement
- Website
- Deploy
1. Data Cleaning
In [4]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  5572 non-null   object
 1   Message   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB
In [5]:
data.isnull().sum()
Out[5]:
Category    0
Message     0
dtype: int64
There are no null values in the data
In [6]:
data.sample(5)
Out[6]:
| Category | Message |
---|---|---|
4914 | spam | Goal! Arsenal 4 (Henry, 7 v Liverpool 2 Henry … |
4912 | ham | Love that holiday Monday feeling even if I hav… |
3676 | ham | Whos this am in class:-) |
3331 | ham | Send me yetty’s number pls. |
4175 | ham | And pls pls drink plenty plenty water |
Renaming the columns
In [7]:
data.rename(columns={'Category': 'target', 'Message': 'text'}, inplace=True)
LabelEncoder
In [8]:
from sklearn.preprocessing import LabelEncoder
In [9]:
encoder = LabelEncoder()
In [10]:
data['target'] = encoder.fit_transform(data['target'])
In [11]:
data.head()
Out[11]:
| target | text |
---|---|---|
0 | 0 | Go until jurong point, crazy.. Available only … |
1 | 0 | Ok lar… Joking wif u oni… |
2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina… |
3 | 0 | U dun say so early hor… U c already then say… |
4 | 0 | Nah I don’t think he goes to usf, he lives aro… |
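As a sanity check on the encoding above, a minimal pure-Python sketch of the observable behaviour of `LabelEncoder` (an assumption about its contract, not sklearn source): classes are sorted alphabetically and mapped to integer codes, so 'ham' becomes 0 and 'spam' becomes 1, matching the table.

```python
# Sketch of LabelEncoder's observable behaviour: sorted classes -> codes.
labels = ['ham', 'ham', 'spam', 'ham', 'spam']
classes = sorted(set(labels))                    # ['ham', 'spam']
mapping = {c: i for i, c in enumerate(classes)}  # {'ham': 0, 'spam': 1}
encoded = [mapping[l] for l in labels]
print(encoded)  # [0, 0, 1, 0, 1]
```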
Check missing values
In [12]:
data.isnull().sum()
Out[12]:
target    0
text      0
dtype: int64
Check duplicate values
In [13]:
data.duplicated().sum()
Out[13]:
415
In [14]:
data = data.drop_duplicates(keep='first')
In [15]:
data.duplicated().sum()
Out[15]:
0
In [16]:
data.shape
Out[16]:
(5157, 2)
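`keep='first'` retains the first occurrence of each duplicated row and drops the rest, which is why 415 rows disappeared. A pure-Python sketch of the same idea on a list of messages:

```python
# drop_duplicates(keep='first'): the earliest copy of each duplicate survives.
msgs = ['hi', 'free prize', 'hi', 'ok', 'free prize']
deduped = list(dict.fromkeys(msgs))  # dicts preserve insertion order, so 'first' wins
print(deduped)  # ['hi', 'free prize', 'ok']
```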
2. EDA
In [17]:
data.head()
Out[17]:
| target | text |
---|---|---|
0 | 0 | Go until jurong point, crazy.. Available only … |
1 | 0 | Ok lar… Joking wif u oni… |
2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina… |
3 | 0 | U dun say so early hor… U c already then say… |
4 | 0 | Nah I don’t think he goes to usf, he lives aro… |
In [18]:
data['target'].value_counts()
Out[18]:
target
0    4516
1     641
Name: count, dtype: int64
In [19]:
import matplotlib.pyplot as plt
In [20]:
data['target'].value_counts().plot(kind='pie',autopct='%.2f')
plt.show()
The data is imbalanced
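The imbalance behind the pie chart can be quantified directly from the `value_counts` above: ham outnumbers spam roughly 7 to 1, which matters later when choosing evaluation metrics.

```python
# Class shares from the value_counts output above.
counts = {0: 4516, 1: 641}  # 0 = ham, 1 = spam
total = sum(counts.values())
shares = {k: round(v / total * 100, 2) for k, v in counts.items()}
print(shares)  # {0: 87.57, 1: 12.43}
```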
In [21]:
import nltk
In [22]:
nltk.download('punkt')
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Mehak\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Out[22]:
True
In [23]:
data['num_characters'] = data['text'].apply(len)  # character count per message
data['num_characters']
C:\Users\Mehak\AppData\Local\Temp\ipykernel_16948\2449063349.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Out[23]:
0       111
1        29
2       155
3        49
4        61
       ... 
5567    160
5568     36
5569     57
5570    125
5571     26
Name: num_characters, Length: 5157, dtype: int64
In [24]:
data.head()
Out[24]:
| target | text | num_characters |
---|---|---|---|
0 | 0 | Go until jurong point, crazy.. Available only … | 111 |
1 | 0 | Ok lar… Joking wif u oni… | 29 |
2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina… | 155 |
3 | 0 | U dun say so early hor… U c already then say… | 49 |
4 | 0 | Nah I don’t think he goes to usf, he lives aro… | 61 |
In [25]:
# num of words
data['text'].apply(lambda x: nltk.word_tokenize(x))  # word_tokenize splits each message into word tokens
Out[25]:
0       [Go, until, jurong, point, ,, crazy, .., Avail...
1                [Ok, lar, ..., Joking, wif, u, oni, ...]
2       [Free, entry, in, 2, a, wkly, comp, to, win, F...
3       [U, dun, say, so, early, hor, ..., U, c, alrea...
4       [Nah, I, do, n't, think, he, goes, to, usf, ,,...
                              ...                        
5567    [This, is, the, 2nd, time, we, have, tried, 2,...
5568         [Will, ü, b, going, to, esplanade, fr, home, ?]
5569    [Pity, ,, *, was, in, mood, for, that, ., So, ...
5570    [The, guy, did, some, bitching, but, I, acted,...
5571                      [Rofl, ., Its, true, to, its, name]
Name: text, Length: 5157, dtype: object
In [26]:
# num of words
data['num_words'] = data['text'].apply(lambda x: len(nltk.word_tokenize(x)))  # word-token count per message
data['num_words']
C:\Users\Mehak\AppData\Local\Temp\ipykernel_16948\1744167564.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Out[26]:
0       24
1        8
2       37
3       13
4       15
        ..
5567    35
5568     9
5569    15
5570    27
5571     7
Name: num_words, Length: 5157, dtype: int64
In [27]:
data.head()
Out[27]:
| target | text | num_characters | num_words |
---|---|---|---|---|
0 | 0 | Go until jurong point, crazy.. Available only … | 111 | 24 |
1 | 0 | Ok lar… Joking wif u oni… | 29 | 8 |
2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina… | 155 | 37 |
3 | 0 | U dun say so early hor… U c already then say… | 49 | 13 |
4 | 0 | Nah I don’t think he goes to usf, he lives aro… | 61 | 15 |
In [28]:
# num of sentences
data['text'].apply(lambda x: nltk.sent_tokenize(x))  # sent_tokenize splits each message into sentences
Out[28]:
0       [Go until jurong point, crazy.., Available onl...
1                        [Ok lar..., Joking wif u oni...]
2       [Free entry in 2 a wkly comp to win FA Cup fin...
3       [U dun say so early hor... U c already then sa...
4       [Nah I don't think he goes to usf, he lives ar...
                              ...                        
5567    [This is the 2nd time we have tried 2 contact ...
5568                [Will ü b going to esplanade fr home?]
5569    [Pity, * was in mood for that., So...any other...
5570    [The guy did some bitching but I acted like i'...
5571                         [Rofl., Its true to its name]
Name: text, Length: 5157, dtype: object
In [29]:
# num of sentences
data['num_sentences'] = data['text'].apply(lambda x: len(nltk.sent_tokenize(x)))  # sentence count per message
data['num_sentences']
C:\Users\Mehak\AppData\Local\Temp\ipykernel_16948\1791182575.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Out[29]:
0       2
1       2
2       2
3       1
4       1
       ..
5567    4
5568    1
5569    2
5570    1
5571    2
Name: num_sentences, Length: 5157, dtype: int64
In [30]:
data.head()
Out[30]:
| target | text | num_characters | num_words | num_sentences |
---|---|---|---|---|---|
0 | 0 | Go until jurong point, crazy.. Available only … | 111 | 24 | 2 |
1 | 0 | Ok lar… Joking wif u oni… | 29 | 8 | 2 |
2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina… | 155 | 37 | 2 |
3 | 0 | U dun say so early hor… U c already then say… | 49 | 13 | 1 |
4 | 0 | Nah I don’t think he goes to usf, he lives aro… | 61 | 15 | 1 |
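The repeated `SettingWithCopyWarning` appears because the new columns are added to a frame produced by `drop_duplicates`, which pandas treats as possibly being a view. A small self-contained sketch of two common fixes, chaining `.copy()` and `.assign()`; it uses plain string methods (a crude whitespace split, not `nltk`) so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({'text': ['Hi there. How are you?', 'Free prize!']})
df = df.drop_duplicates(keep='first').copy()     # .copy() breaks the view link
df = df.assign(                                  # .assign() also avoids the warning
    num_characters=df['text'].str.len(),
    num_words=df['text'].str.split().str.len(),  # whitespace split, not nltk tokens
)
print(df[['num_characters', 'num_words']].values.tolist())  # [[22, 5], [11, 2]]
```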
In [31]:
data[['num_characters','num_words','num_sentences']].describe()
Out[31]:
| num_characters | num_words | num_sentences |
---|---|---|---|
count | 5157.000000 | 5157.000000 | 5157.000000 |
mean | 79.103936 | 18.560403 | 1.969750 |
std | 58.382922 | 13.405970 | 1.455526 |
min | 2.000000 | 1.000000 | 1.000000 |
25% | 36.000000 | 9.000000 | 1.000000 |
50% | 61.000000 | 15.000000 | 1.000000 |
75% | 118.000000 | 26.000000 | 2.000000 |
max | 910.000000 | 220.000000 | 38.000000 |
In [32]:
data[data['target'] == 0][['num_characters','num_words','num_sentences']].describe()
Out[32]:
| num_characters | num_words | num_sentences |
---|---|---|---|
count | 4516.000000 | 4516.000000 | 4516.000000 |
mean | 70.869353 | 17.267715 | 1.827724 |
std | 56.708301 | 13.588065 | 1.394338 |
min | 2.000000 | 1.000000 | 1.000000 |
25% | 34.000000 | 8.000000 | 1.000000 |
50% | 53.000000 | 13.000000 | 1.000000 |
75% | 91.000000 | 22.000000 | 2.000000 |
max | 910.000000 | 220.000000 | 38.000000 |
In [33]:
#spam
data[data['target'] == 1][['num_characters','num_words','num_sentences']].describe()
Out[33]:
| num_characters | num_words | num_sentences |
---|---|---|---|
count | 641.000000 | 641.000000 | 641.000000 |
mean | 137.118565 | 27.667707 | 2.970359 |
std | 30.399707 | 7.103501 | 1.485575 |
min | 7.000000 | 2.000000 | 1.000000 |
25% | 130.000000 | 25.000000 | 2.000000 |
50% | 148.000000 | 29.000000 | 3.000000 |
75% | 157.000000 | 32.000000 | 4.000000 |
max | 223.000000 | 46.000000 | 9.000000 |
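The two filtered `.describe()` calls above compare ham and spam side by side; the same comparison can be done in one `groupby`. A toy frame (assumed values, not the real dataset) illustrates the pattern:

```python
import pandas as pd

# One groupby replaces two boolean-filtered describe()/mean() calls.
df = pd.DataFrame({'target': [0, 0, 1, 1],
                   'num_characters': [30, 50, 140, 160]})
means = df.groupby('target')['num_characters'].mean()
print(means.to_dict())  # {0: 40.0, 1: 150.0} - spam messages run longer
```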
In [34]:
import seaborn as sns
In [35]:
sns.histplot(data[data['target'] == 0]['num_characters'])
sns.histplot(data[data['target'] == 1]['num_characters'],color='red')
Out[35]:
<Axes: xlabel='num_characters', ylabel='Count'>
In [36]:
sns.histplot(data[data['target'] == 0]['num_words'])
sns.histplot(data[data['target'] == 1]['num_words'],color='red')
Out[36]:
<Axes: xlabel='num_words', ylabel='Count'>
In [37]:
sns.pairplot(data,hue='target')
Out[37]:
<seaborn.axisgrid.PairGrid at 0x19202cc30b0>
In [38]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 5157 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   target          5157 non-null   int32 
 1   text            5157 non-null   object
 2   num_characters  5157 non-null   int64 
 3   num_words       5157 non-null   int64 
 4   num_sentences   5157 non-null   int64 
dtypes: int32(1), int64(3), object(1)
memory usage: 221.6+ KB
In [39]:
# .corr() needs numeric columns, and 'text' is not numeric, so restrict it:
# data.corr(numeric_only=True)
# sns.heatmap(data.corr(numeric_only=True), annot=True)
3. Data Preprocessing
- Lower Case
- Tokenization
- Removing Special Characters
- Removing Stopwords and Punctuation
- Stemming
In [40]:
def transform_text(text):
text = text.lower()
text = nltk.word_tokenize(text)
y = []
for i in text:
if i.isalnum():
y.append(i)
return y
In [41]:
transform_text('Hi how Are You 20%% eg')
Out[41]:
['hi', 'how', 'are', 'you', '20', 'eg']
In [42]:
transform_text('Hi how Are You 20 eg')
Out[42]:
['hi', 'how', 'are', 'you', '20', 'eg']
In [43]:
data['text'][45]
Out[43]:
'No calls..messages..missed calls'
In [44]:
from nltk.corpus import stopwords
stopwords.words('english')
Out[44]:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
In [45]:
import string
string.punctuation
Out[45]:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
In [46]:
def transform_text(text):
text = text.lower()
text = nltk.word_tokenize(text)
y = []
for i in text:
if i.isalnum():
y.append(i)
text = y[:]
y.clear()
for i in text:
if i not in stopwords.words('english') and i not in string.punctuation:
y.append(i)
return y
In [47]:
transform_text('Hi how Are You Sham?')
Out[47]:
['hi', 'sham']
In [48]:
transform_text('Did you like to Travel?')
Out[48]:
['like', 'travel']
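One note on efficiency: `transform_text` calls `stopwords.words('english')` inside the loop, which rebuilds and linearly scans the list for every token. Building a `set` once makes each membership test O(1). A sketch with a stand-in stopword set so it runs without nltk:

```python
import string

STOPWORDS = {'did', 'you', 'to'}  # set(stopwords.words('english')) in practice
PUNCT = set(string.punctuation)   # same characters as string.punctuation
tokens = ['did', 'you', 'like', 'to', 'travel', '?']
kept = [t for t in tokens if t not in STOPWORDS and t not in PUNCT]
print(kept)  # ['like', 'travel']
```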
In [49]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
In [50]:
ps.stem('dancing')
Out[50]:
'danc'
In [51]:
ps.stem('loving')
Out[51]:
'love'
In [52]:
ps.stem('travelling')
Out[52]:
'travel'
In [53]:
def transform_text(text):
text = text.lower()
text = nltk.word_tokenize(text)
y = []
for i in text:
if i.isalnum():
y.append(i)
text = y[:]
y.clear()
for i in text:
if i not in stopwords.words('english') and i not in string.punctuation:
y.append(i)
text = y[:]
y.clear()
for i in text:
y.append(ps.stem(i))
return " ".join(y)
In [54]:
transform_text('Did you like to Travel?')
Out[54]:
'like travel'
In [55]:
transform_text('I loved to use some scenes of videos in faceboook')
Out[55]:
'love use scene video faceboook'
In [56]:
transform_text('Do you love Travelling?')
Out[56]:
'love travel'
In [57]:
transform_text('Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...')
Out[57]:
'go jurong point crazi avail bugi n great world la e buffet cine got amor wat'
In [58]:
data['text'][10]
Out[58]:
"I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today."
In [59]:
transform_text("I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.")
Out[59]:
'gon na home soon want talk stuff anymor tonight k cri enough today'
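The three-pass function above (filter alphanumerics, drop stopwords, stem) can be collapsed into one pass. This sketch uses stand-ins so it runs without nltk: `str.split()` replaces `word_tokenize` and a toy suffix-stripper replaces `PorterStemmer`, so its outputs only approximate the notebook's.

```python
import string

STOP = {'do', 'you', 'to'}                            # stand-in stopword set
stem = lambda w: w[:-3] if w.endswith('ing') else w   # toy stand-in stemmer

def transform_text_sketch(text):
    # lower-case, tokenize, keep alphanumerics, drop stopwords, stem - one pass
    tokens = (t.strip(string.punctuation) for t in text.lower().split())
    return ' '.join(stem(t) for t in tokens if t.isalnum() and t not in STOP)

print(transform_text_sketch('Do you love Travelling?'))  # 'love travell'
```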
Make a new column
In [60]:
data['transformed_text'] = data['text'].apply(transform_text)
data['transformed_text']
C:\Users\Mehak\AppData\Local\Temp\ipykernel_16948\3521414896.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Out[60]:
0       go jurong point crazi avail bugi n great world...
1                                   ok lar joke wif u oni
2       free entri 2 wkli comp win fa cup final tkt 21...
3                     u dun say earli hor u c alreadi say
4                    nah think goe usf live around though
                              ...                        
5567    2nd time tri 2 contact u pound prize 2 claim e...
5568                              ü b go esplanad fr home
5569                                    piti mood suggest
5570    guy bitch act like interest buy someth els nex...
5571                                       rofl true name
Name: transformed_text, Length: 5157, dtype: object
In [61]:
data.head()
Out[61]:
| target | text | num_characters | num_words | num_sentences | transformed_text |
---|---|---|---|---|---|---|
0 | 0 | Go until jurong point, crazy.. Available only … | 111 | 24 | 2 | go jurong point crazi avail bugi n great world… |
1 | 0 | Ok lar… Joking wif u oni… | 29 | 8 | 2 | ok lar joke wif u oni |
2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina… | 155 | 37 | 2 | free entri 2 wkli comp win fa cup final tkt 21… |
3 | 0 | U dun say so early hor… U c already then say… | 49 | 13 | 1 | u dun say earli hor u c alreadi say |
4 | 0 | Nah I don’t think he goes to usf, he lives aro… | 61 | 15 | 1 | nah think goe usf live around though |
Word cloud of spam messages
In [62]:
from wordcloud import WordCloud
wc = WordCloud(width = 500, height = 500, min_font_size = 10, background_color='white')
In [63]:
spam_wc = wc.generate(data[data['target'] == 1]['transformed_text'].str.cat(sep = ' '))
In [64]:
plt.figure(figsize=(15,6))
plt.imshow(spam_wc)
Out[64]:
<matplotlib.image.AxesImage at 0x19202ddae10>
In [65]:
ham_wc = wc.generate(data[data['target'] == 0]['transformed_text'].str.cat(sep = ' '))
In [66]:
plt.figure(figsize=(15,6))
plt.imshow(ham_wc)
Out[66]:
<matplotlib.image.AxesImage at 0x192065b7560>
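A word cloud encodes frequency as font size; `collections.Counter` gives the exact counts, which is a useful companion check. A sketch on a toy spam string (assumed tokens, not the real corpus):

```python
from collections import Counter

# Exact token frequencies, complementing the word cloud's visual ranking.
spam_text = 'free entri free win prize call free claim win'
top = Counter(spam_text.split()).most_common(3)
print(top)  # [('free', 3), ('win', 2), ('entri', 1)]
```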