pip install apyori
Collecting apyori
  Downloading apyori-1.1.2.tar.gz (8.6 kB)
Successfully built apyori
Installing collected packages: apyori
Successfully installed apyori-1.1.2
Note: you may need to restart the kernel to use updated packages.
Apriori¶
The Apriori algorithm is a popular algorithm used in data mining for association rule learning, the process of finding patterns, associations, or correlations among sets of items in large databases. It is mainly used to find frequent itemsets in transactional data, which can then be used to derive association rules. These rules help identify patterns and relationships between items in large datasets.
For example, in a retail setting, the Apriori algorithm might identify that customers who buy bread often also buy butter. This information can be used for inventory management, promotions, or recommendations.
The algorithm works by iteratively finding frequent itemsets (combinations of items that appear together in transactions) and then using these to generate association rules with metrics like support, confidence, and lift to measure their strength and relevance.
It’s most commonly used in market basket analysis, where retailers identify which products are frequently bought together.
How the Apriori Algorithm Works:¶
Frequent Itemset Generation: The algorithm identifies all itemsets (combinations of items) that occur frequently in the dataset. It starts with individual items and progressively builds larger itemsets, keeping only those that meet a minimum support threshold (the proportion of transactions that include the itemset).
Association Rule Generation: Once frequent itemsets are identified, the algorithm generates association rules, which suggest relationships between items. For example, if customers frequently buy milk and bread together, an association rule could be: “If a customer buys milk, they are likely to buy bread.”
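To make these two steps concrete, the sketch below walks through them in plain Python on a toy basket list. The transactions and helper names are illustrative only, and for brevity the sketch brute-forces candidate itemsets rather than applying Apriori's pruning of candidates whose subsets are infrequent:

from itertools import combinations

# Toy transactions; in practice these come from real basket data.
transactions = [{'Milk', 'Bread'}, {'Milk', 'Bread', 'Butter'},
                {'Bread', 'Butter'}, {'Milk', 'Butter'}]
min_support = 0.5
n = len(transactions)

def support(itemset):
    # Fraction of transactions containing every item in `itemset`.
    return sum(itemset <= t for t in transactions) / n

# Step 1: collect frequent itemsets, one size at a time.
items = sorted({i for t in transactions for i in t})
frequent = {}
for k in (1, 2):  # sizes 1 and 2 suffice for this toy example
    for combo in combinations(items, k):
        s = support(set(combo))
        if s >= min_support:
            frequent[frozenset(combo)] = s

# Step 2: derive association rules from the frequent 2-itemsets.
for itemset, s in frequent.items():
    if len(itemset) == 2:
        for a in itemset:
            b = next(iter(itemset - {a}))
            conf = s / support({a})
            print(f"{a} -> {b}: support={s:.2f}, confidence={conf:.2f}")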
Key Concepts:¶
Support: How often an itemset appears in the dataset. High support means an itemset appears frequently.
Confidence: Given that a certain itemset appears, the confidence of an association rule measures how often the rule is true (e.g., if “milk” is purchased, how often “bread” is also purchased).
Lift: The ratio of the observed support of an itemset to the expected support if the items were independent. A lift greater than 1 indicates a positive association.
Example in Market Basket Analysis:¶
- Data: Transactions in a grocery store.
- Frequent Itemset: {Milk, Bread} with high support.
- Association Rule: “If Milk, then Bread” with high confidence and lift.
The Apriori algorithm is a good choice when you want to find frequent patterns or relationships within large datasets.
In the Apriori algorithm, the key elements are Support, Confidence, and Lift. These are used to evaluate how strong or useful the generated association rules are. Let’s define these terms with formulas:
1. Support:¶
Support measures how frequently an itemset appears in the dataset.
$\text{Support}(A) = \frac{\text{Number of transactions containing } A}{\text{Total number of transactions}}$
- A: A particular itemset (e.g., {Milk, Bread})
- Support tells us how popular the itemset is in the dataset.
Example: If “Milk” appears in 200 out of 1000 transactions, then the support of {Milk} is:
$\text{Support}(\text{Milk}) = \frac{200}{1000} = 0.2$ (or 20%)
2. Confidence:¶
Confidence measures how often the rule “If A, then B” holds true, meaning how often item B is purchased when item A is purchased.
$\text{Confidence}(A \Rightarrow B) = \frac{\text{Support}(A \cup B)}{\text{Support}(A)}$
- A: The “antecedent” (e.g., Milk)
- B: The “consequent” (e.g., Bread)
- Confidence measures the likelihood of purchasing B given that A is purchased.
Example: Suppose {Milk} appears in 200 transactions and {Milk, Bread} appears together in 150 transactions. The confidence for the rule “If Milk, then Bread” is:
$\text{Confidence}(\text{Milk} \Rightarrow \text{Bread}) = \frac{150}{200} = 0.75$ (or 75%)
3. Lift:¶
Lift measures how much more likely item B is purchased when item A is purchased, compared to the likelihood of purchasing B independently. A lift greater than 1 indicates a positive association.
$\text{Lift}(A \Rightarrow B) = \frac{\text{Support}(A \cup B)}{\text{Support}(A) \times \text{Support}(B)}$
- Lift compares the actual occurrence of A and B together to what we would expect if they were independent.
Example: Suppose the support for {Bread} is 0.3 (30%), and {Milk, Bread} together is 0.15 (15%). The lift for the rule “If Milk, then Bread” is:
$\text{Lift}(\text{Milk} \Rightarrow \text{Bread}) = \frac{0.15}{0.2 \times 0.3} = \frac{0.15}{0.06} = 2.5$
A lift of 2.5 means that buying bread is 2.5 times more likely when milk is purchased, compared to buying bread independently.
Summary of Formulas:¶
Support: $\text{Support}(A) = \frac{\text{Transactions with } A}{\text{Total Transactions}}$
Confidence: $\text{Confidence}(A \Rightarrow B) = \frac{\text{Support}(A \cup B)}{\text{Support}(A)}$
Lift: $\text{Lift}(A \Rightarrow B) = \frac{\text{Support}(A \cup B)}{\text{Support}(A) \times \text{Support}(B)}$
These formulas are used to evaluate the strength of association rules discovered by the Apriori algorithm.
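As a quick sanity check, the counts from the running example can be plugged directly into these formulas (the numbers below are the hypothetical ones used above, not real data):

total = 1000   # hypothetical number of transactions
milk = 200     # transactions containing Milk
bread = 300    # transactions containing Bread
both = 150     # transactions containing both Milk and Bread

support_milk = milk / total            # 0.20
support_bread = bread / total          # 0.30
support_both = both / total            # 0.15

confidence = support_both / support_milk                 # 0.75
lift = support_both / (support_milk * support_bread)     # 2.5

print(f"confidence={confidence:.2f}, lift={lift:.2f}")   # confidence=0.75, lift=2.50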
L = [['A','B','C'], ['A','C'], ['A','D'], ['B','E','F']]  # four toy transactions
L
[['A', 'B', 'C'], ['A', 'C'], ['A', 'D'], ['B', 'E', 'F']]
Parameters of Apriori¶
In the context of the Apriori algorithm, these parameters define how the algorithm should find and filter association rules:
- L: The list of transactions to be analyzed.
- min_support=0.5: Minimum support threshold; itemsets must appear in at least 50% of transactions to be considered frequent.
- min_confidence=0.5: Minimum confidence threshold; rules must have at least 50% confidence to be considered strong.
- min_lift=1.1: Minimum lift threshold; rules must have a lift of at least 1.1 to be considered significant.
- min_length=2: Minimum length of itemsets or rules; itemsets or rules must have at least 2 items to be considered. (Note: apyori's documented length option is max_length, and it silently ignores unrecognized keyword arguments, so min_length may have no effect in practice.)
from apyori import apriori
rules = apriori(L, min_support=0.5, min_confidence=0.5, min_lift=1.1, min_length=2)
data = list(rules)
data
[RelationRecord(items=frozenset({'C', 'A'}), support=0.5, ordered_statistics=[OrderedStatistic(items_base=frozenset({'A'}), items_add=frozenset({'C'}), confidence=0.6666666666666666, lift=1.3333333333333333), OrderedStatistic(items_base=frozenset({'C'}), items_add=frozenset({'A'}), confidence=1.0, lift=1.3333333333333333)])]
data[0][2]  # the ordered_statistics list of the first RelationRecord
[OrderedStatistic(items_base=frozenset({'A'}), items_add=frozenset({'C'}), confidence=0.6666666666666666, lift=1.3333333333333333), OrderedStatistic(items_base=frozenset({'C'}), items_add=frozenset({'A'}), confidence=1.0, lift=1.3333333333333333)]
import pandas as pd
df = pd.DataFrame(data[0][2])
df
|   | items_base | items_add | confidence | lift |
|---|------------|-----------|------------|------|
| 0 | (A) | (C) | 0.666667 | 1.333333 |
| 1 | (C) | (A) | 1.000000 | 1.333333 |
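These numbers can be checked by hand against the formulas from the previous section (an illustrative verification, not part of the apyori API):

n = len(L)                                              # 4 transactions
support_A = sum('A' in t for t in L) / n                # 3/4 = 0.75
support_C = sum('C' in t for t in L) / n                # 2/4 = 0.50
support_AC = sum('A' in t and 'C' in t for t in L) / n  # 2/4 = 0.50

print(support_AC / support_A)                   # confidence(A -> C) = 0.666...
print(support_AC / (support_A * support_C))     # lift(A -> C) = 1.333...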
Grocery Store DataSet¶
1. Importing Libraries¶
import numpy as np
import pandas as pd
- NumPy: For numerical operations.
- Pandas: For data manipulation and analysis.
2. Loading the Dataset¶
data = pd.read_csv('GroceryStoreDataSet.csv', header=None)
data
|    | 0 | 1 | 2 | 3 |
|----|---|---|---|---|
| 0  | MILK | BREAD | BISCUIT | NaN |
| 1  | BREAD | MILK | BISCUIT | CORNFLAKES |
| 2  | BREAD | TEA | BOURNVITA | NaN |
| 3  | JAM | MAGGI | BREAD | MILK |
| 4  | MAGGI | TEA | BISCUIT | NaN |
| 5  | BREAD | TEA | BOURNVITA | NaN |
| 6  | MAGGI | TEA | CORNFLAKES | NaN |
| 7  | MAGGI | BREAD | TEA | BISCUIT |
| 8  | JAM | MAGGI | BREAD | TEA |
| 9  | BREAD | MILK | NaN | NaN |
| 10 | COFFEE | COCK | BISCUIT | CORNFLAKES |
| 11 | COFFEE | COCK | BISCUIT | CORNFLAKES |
| 12 | COFFEE | SUGER | BOURNVITA | NaN |
| 13 | BREAD | COFFEE | COCK | NaN |
| 14 | BREAD | SUGER | BISCUIT | NaN |
| 15 | COFFEE | SUGER | CORNFLAKES | NaN |
| 16 | BREAD | SUGER | BOURNVITA | NaN |
| 17 | BREAD | COFFEE | SUGER | NaN |
| 18 | BREAD | COFFEE | SUGER | NaN |
| 19 | TEA | MILK | COFFEE | CORNFLAKES |
- This line reads the grocery store dataset from a CSV file into a Pandas DataFrame named `data`. The `header=None` parameter indicates that the first row should not be treated as column headers.
3. Converting DataFrame to List¶
L = data.values.tolist()
L
[['MILK', 'BREAD', 'BISCUIT', nan], ['BREAD', 'MILK', 'BISCUIT', 'CORNFLAKES'], ['BREAD', 'TEA', 'BOURNVITA', nan], ['JAM', 'MAGGI', 'BREAD', 'MILK'], ['MAGGI', 'TEA', 'BISCUIT', nan], ['BREAD', 'TEA', 'BOURNVITA', nan], ['MAGGI', 'TEA', 'CORNFLAKES', nan], ['MAGGI', 'BREAD', 'TEA', 'BISCUIT'], ['JAM', 'MAGGI', 'BREAD', 'TEA'], ['BREAD', 'MILK', nan, nan], ['COFFEE', 'COCK', 'BISCUIT', 'CORNFLAKES'], ['COFFEE', 'COCK', 'BISCUIT', 'CORNFLAKES'], ['COFFEE', 'SUGER', 'BOURNVITA', nan], ['BREAD', 'COFFEE', 'COCK', nan], ['BREAD', 'SUGER', 'BISCUIT', nan], ['COFFEE', 'SUGER', 'CORNFLAKES', nan], ['BREAD', 'SUGER', 'BOURNVITA', nan], ['BREAD', 'COFFEE', 'SUGER', nan], ['BREAD', 'COFFEE', 'SUGER', nan], ['TEA', 'MILK', 'COFFEE', 'CORNFLAKES']]
- Converts the DataFrame into a list of lists (where each inner list represents a transaction).
4. Removing NaN Values from Transactions¶
L[0].remove(np.nan) # Example: removing NaN from the first transaction
- This line removes NaN values from the first transaction for illustration.
5. Counting NaN Values¶
L[9].count(np.nan) # Counts NaN values in the 10th transaction
2
- This line counts how many NaN values are present in the 10th transaction.
6. Counting NaN Values for Each Transaction¶
for items in L:
    c = items.count(np.nan)
    print(c)  # prints the count of NaN values for each transaction
0
0
1
0
1
1
1
0
0
2
0
0
1
1
1
1
1
1
1
0
- This loop iterates over each transaction in the list and prints the number of NaN values present.
7. Displaying Transactions with NaN Values¶
for items in L:
    if items.count(np.nan) > 0:
        print(items)  # prints transactions that contain NaN values
['BREAD', 'TEA', 'BOURNVITA', nan]
['MAGGI', 'TEA', 'BISCUIT', nan]
['BREAD', 'TEA', 'BOURNVITA', nan]
['MAGGI', 'TEA', 'CORNFLAKES', nan]
['BREAD', 'MILK', nan, nan]
['COFFEE', 'SUGER', 'BOURNVITA', nan]
['BREAD', 'COFFEE', 'COCK', nan]
['BREAD', 'SUGER', 'BISCUIT', nan]
['COFFEE', 'SUGER', 'CORNFLAKES', nan]
['BREAD', 'SUGER', 'BOURNVITA', nan]
['BREAD', 'COFFEE', 'SUGER', nan]
['BREAD', 'COFFEE', 'SUGER', nan]
- This loop checks each transaction for NaN values and prints any transactions that contain them.
8. Removing All NaN Values from Transactions¶
for items in L:
    if items.count(np.nan) > 0:
        for j in range(items.count(np.nan)):
            items.remove(np.nan)  # removes all NaN values from this transaction
- This nested loop iterates through each transaction and removes all NaN values found.
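As an illustrative alternative to the nested loop, the same cleanup can be done in a single pass by rebuilding each transaction and keeping only the string items (the NaN values are floats, so they are dropped):

# Illustrative alternative: keep only string items, dropping NaN floats.
L = [[item for item in row if isinstance(item, str)]
     for row in data.values.tolist()]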
9. Verifying the Result¶
L # Displays the modified list without NaN values
[['MILK', 'BREAD', 'BISCUIT'], ['BREAD', 'MILK', 'BISCUIT', 'CORNFLAKES'], ['BREAD', 'TEA', 'BOURNVITA'], ['JAM', 'MAGGI', 'BREAD', 'MILK'], ['MAGGI', 'TEA', 'BISCUIT'], ['BREAD', 'TEA', 'BOURNVITA'], ['MAGGI', 'TEA', 'CORNFLAKES'], ['MAGGI', 'BREAD', 'TEA', 'BISCUIT'], ['JAM', 'MAGGI', 'BREAD', 'TEA'], ['BREAD', 'MILK'], ['COFFEE', 'COCK', 'BISCUIT', 'CORNFLAKES'], ['COFFEE', 'COCK', 'BISCUIT', 'CORNFLAKES'], ['COFFEE', 'SUGER', 'BOURNVITA'], ['BREAD', 'COFFEE', 'COCK'], ['BREAD', 'SUGER', 'BISCUIT'], ['COFFEE', 'SUGER', 'CORNFLAKES'], ['BREAD', 'SUGER', 'BOURNVITA'], ['BREAD', 'COFFEE', 'SUGER'], ['BREAD', 'COFFEE', 'SUGER'], ['TEA', 'MILK', 'COFFEE', 'CORNFLAKES']]
- At this point, `L` should no longer contain any NaN values in any transaction.
10. Applying the Apriori Algorithm¶
from apyori import apriori
rules = apriori(L, min_support=0.20, min_confidence=0.35, min_lift=1, min_length=2)
- Here, the Apriori algorithm from the `apyori` library is applied to the cleaned list of transactions (`L`).
- `min_support=0.20`: Itemsets must appear in at least 20% of transactions to be considered.
- `min_confidence=0.35`: The rules must have at least 35% confidence.
- `min_lift=1`: The lift of the rule must be at least 1, indicating that the itemsets appear together at least as often as would be expected if the items were independent.
- `min_length=2`: The minimum length of the itemsets must be at least 2 (as noted earlier, apyori may silently ignore this parameter).
11. Converting Rules to List¶
data = list(rules) # Converts the rules into a list
- This line converts the generated rules into a list format for further analysis or display.
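To inspect the results in a readable form, the records can be unpacked using the RelationRecord and OrderedStatistic fields seen in the toy example above (a sketch; the formatting choices are illustrative):

for record in data:
    for stat in record.ordered_statistics:
        if not stat.items_base:
            continue  # skip rules with an empty antecedent
        lhs = ', '.join(stat.items_base)
        rhs = ', '.join(stat.items_add)
        print(f"{lhs} -> {rhs}: support={record.support:.2f}, "
              f"confidence={stat.confidence:.2f}, lift={stat.lift:.2f}")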
Conclusion¶
This notebook loads transaction data, preprocesses it by removing NaN values, and then applies the Apriori algorithm to extract association rules from the cleaned data. The resulting rules, together with their support, confidence, and lift, can be ranked and used to inform decisions such as product placement, promotions, or recommendations.