Zhu Lin's webpage
Excel vs. Machine Learning
Some background:
In this post, I compare traditional Excel-based methods with modern machine learning approaches.
A manufacturing company notices discrepancies between recorded and actual stock. The goal is to identify unusual stock movement patterns.
A. Classical Excel Techniques:
IF
, AVERAGE
, and STDEV
to calculate thresholds.Steps in Excel:
=AVERAGE(range)
) and standard deviation (=STDEV(range)
) of inventory usage for each product.Example Data:
Product ID | Month 1 Usage | Month 2 Usage | Month 3 Usage | Average Usage | Std Dev | Anomaly? |
---|---|---|---|---|---|---|
1001 | 500 | 510 | 2000 | 1000 | 866.03 | Yes |
1002 | 300 | 305 | 290 | 298.33 | 7.64 | No |
Limitations:
B. Modern ML Techniques:
Python Code:
import pandas as pd
from sklearn.ensemble import IsolationForest
# Sample Inventory Data
inventory_data = pd.DataFrame({
'product_id': [1001, 1002, 1003, 1004],
'month_1_usage': [500, 300, 100, 700],
'month_2_usage': [510, 305, 110, 720],
'month_3_usage': [2000, 290, 105, 730]
})
# Model Training
X = inventory_data[['month_1_usage', 'month_2_usage', 'month_3_usage']]
model = IsolationForest(random_state=42)
inventory_data['anomaly_score'] = model.fit_predict(X)
# Flagging Anomalies
inventory_data['is_anomaly'] = inventory_data['anomaly_score'] == -1
print(inventory_data)
Output:
Product ID | Month 1 Usage | Month 2 Usage | Month 3 Usage | Anomaly Score | Is Anomaly |
---|---|---|---|---|---|
1001 | 500 | 510 | 2000 | -1 | True |
1002 | 300 | 305 | 290 | 1 | False |
Benefits of ML:
An e-commerce company suspects fraudulent transactions in its general ledger.
A. Classical Excel Techniques:
IF
, AND
, OR
) to filter based on known risk patterns.Steps in Excel:
>10000
) or weekend dates (=TEXT(date, "ddd")
).Example Data:
Transaction ID | Amount | Date | Risky? |
---|---|---|---|
2001 | 15000 | 2024-11-12 | Yes |
2002 | 5000 | 2024-11-10 | No |
Limitations:
B. Modern ML Techniques:
Python Code:
from sklearn.cluster import KMeans
import numpy as np
# Sample Transaction Data
transaction_data = pd.DataFrame({
'transaction_id': [2001, 2002, 2003, 2004],
'amount': [15000, 5000, 100, 700],
'weekday': [2, 0, 3, 5], # Mon=0, ..., Sun=6
'hour_of_day': [14, 20, 11, 22]
})
# Clustering
X = transaction_data[['amount', 'weekday', 'hour_of_day']]
kmeans = KMeans(n_clusters=2, random_state=42)
transaction_data['cluster'] = kmeans.fit_predict(X)
# Flag Anomalous Cluster
transaction_data['is_anomaly'] = transaction_data['cluster'] == transaction_data['cluster'].value_counts().idxmin()
print(transaction_data)
Output:
Transaction ID | Amount | Weekday | Hour of Day | Cluster | Is Anomaly |
---|---|---|---|---|---|
2001 | 15000 | 2 | 14 | 1 | True |
2002 | 5000 | 0 | 20 | 0 | False |
Benefits of ML:
"Vendor A"
appears as "Vndr A"
, "Vendor_A"
, "VendorA"
).A. Classical Excel Techniques:
This technique relies on manual string matching and tools like Excel's VLOOKUP
or IF
formulas.
Example Workflow:
Manually Inspect Data: Identify inconsistencies visually.
Exact Matching: Use IF
or COUNTIF
to match strings exactly.
=IF(A1="Vendor A", "Match", "No Match")
Partial Matching: Use SEARCH
or FIND
for substring matching.
=IF(ISNUMBER(SEARCH("Vendor", A1)), "Potential Match", "No Match")
Limitations:
"Vndr_A"
will not match "Vendor A"
).B. Modern ML Techniques:
Python libraries like fuzzywuzzy
or rapidfuzz
can be used to group similar entries, automating the matching process while handling noise.
Python Code:
from rapidfuzz import process, fuzz
import pandas as pd
# Sample Data
data = pd.DataFrame({
'transaction_id': [1, 2, 3, 4, 5],
'vendor': ['Vendor A', 'Vndr A', 'Vendor_A', 'Vendor B', 'Vendoor C'],
'amount': [100, 200, 150, 300, 250]
})
# Define Reference List of Vendors
reference_vendors = ['Vendor A', 'Vendor B', 'Vendor C']
# Fuzzy Matching
data['matched_vendor'] = data['vendor'].apply(
lambda x: process.extractOne(x, reference_vendors, scorer=fuzz.ratio)[0]
)
print(data)
Output:
Transaction ID | Vendor | Amount | Matched Vendor |
---|---|---|---|
1 | Vendor A | 100 | Vendor A |
2 | Vndr A | 200 | Vendor A |
3 | Vendor_A | 150 | Vendor A |
4 | Vendor B | 300 | Vendor B |
5 | Vendoor C | 250 | Vendor C |
Advantages of this Approach:
Use Case Example: Combining Techniques:
Scenario: A retailer processes invoices from hundreds of suppliers. Many supplier names are misspelled or abbreviated, making it hard to reconcile payments.
Manual Without Fuzzy Matching | With Fuzzy Matching |
---|---|
Manually aligns and matches names. | Automates name reconciliation using fuzzy algorithms. |
Misses subtle variations (e.g., "Vendor" vs. "Vndr" ). |
Detects and matches all reasonable variations. |
Errors propagate into downstream analysis. | Pre-processed, consistent data improves accuracy. |
Excel Techniques:
ML Techniques: