Logo

Zhu Lin's webpage

Techniques

Excel vs. Machine Learning

Some background:

In this post, I compare traditional Excel-based methods with modern machine learning approaches.

Scenario 1: Inventory Ledger – Detecting Stock Anomalies

A manufacturing company notices discrepancies between recorded and actual stock. The goal is to identify unusual stock movement patterns.


A. Classical Excel Techniques:

Technique: Conditional Formatting and Threshold Filtering

  1. Use Excel functions like IF, AVERAGE, and STDEV to calculate thresholds.
  2. Highlight outliers using conditional formatting.

Steps in Excel:

Example Data:

Product ID Month 1 Usage Month 2 Usage Month 3 Usage Average Usage Std Dev Anomaly?
1001 500 510 2000 1000 866.03 Yes
1002 300 305 290 298.33 7.64 No

Limitations:


B. Modern ML Techniques:

Technique: Isolation Forest for Anomaly Detection

Python Code:

import pandas as pd
from sklearn.ensemble import IsolationForest

# Sample Inventory Data
inventory_data = pd.DataFrame({
    'product_id': [1001, 1002, 1003, 1004],
    'month_1_usage': [500, 300, 100, 700],
    'month_2_usage': [510, 305, 110, 720],
    'month_3_usage': [2000, 290, 105, 730]
})

# Model Training
X = inventory_data[['month_1_usage', 'month_2_usage', 'month_3_usage']]
model = IsolationForest(random_state=42)
inventory_data['anomaly_score'] = model.fit_predict(X)

# Flagging Anomalies
inventory_data['is_anomaly'] = inventory_data['anomaly_score'] == -1
print(inventory_data)

Output:

Product ID Month 1 Usage Month 2 Usage Month 3 Usage Anomaly Score Is Anomaly
1001 500 510 2000 -1 True
1002 300 305 290 1 False

Benefits of ML:


Scenario 2: Accounting Ledger – Detecting Suspicious Transactions

An e-commerce company suspects fraudulent transactions in its general ledger.


A. Classical Excel Techniques:

Technique: Rule-Based Filters

  1. Use logical Excel formulas (IF, AND, OR) to filter based on known risk patterns.
  2. Example Rules:
    • Transactions > $10,000.
    • Transactions on weekends or public holidays.
    • Payments split into multiple smaller amounts.

Steps in Excel:

Example Data:

Transaction ID Amount Date Risky?
2001 15000 2024-11-12 Yes
2002 5000 2024-11-10 No

Limitations:


B. Modern ML Techniques:

Technique: Clustering Transactions with K-Means

Python Code:

from sklearn.cluster import KMeans
import numpy as np

# Sample Transaction Data
transaction_data = pd.DataFrame({
    'transaction_id': [2001, 2002, 2003, 2004],
    'amount': [15000, 5000, 100, 700],
    'weekday': [2, 0, 3, 5],  # Mon=0, ..., Sun=6
    'hour_of_day': [14, 20, 11, 22]
})

# Clustering
X = transaction_data[['amount', 'weekday', 'hour_of_day']]
kmeans = KMeans(n_clusters=2, random_state=42)
transaction_data['cluster'] = kmeans.fit_predict(X)

# Flag Anomalous Cluster
transaction_data['is_anomaly'] = transaction_data['cluster'] == transaction_data['cluster'].value_counts().idxmin()
print(transaction_data)

Output:

Transaction ID Amount Weekday Hour of Day Cluster Is Anomaly
2001 15000 2 14 1 True
2002 5000 0 20 0 False

Benefits of ML:


Scenario 3: Entity Resolution – Matching Vendor Names


A. Classical Excel Techniques:

Technique: Manual Filtering and String Matching in Excel

This technique relies on manual string matching and tools like Excel's VLOOKUP or IF formulas.

Example Workflow:

  1. Manually Inspect Data: Identify inconsistencies visually.

  2. Exact Matching: Use IF or COUNTIF to match strings exactly.

    =IF(A1="Vendor A", "Match", "No Match")
    
  3. Partial Matching: Use SEARCH or FIND for substring matching.

    =IF(ISNUMBER(SEARCH("Vendor", A1)), "Potential Match", "No Match")
    

Limitations:


B. Modern ML Techniques:

Technique: Using Fuzzy String Matching and Pre-processing

Python libraries like fuzzywuzzy or rapidfuzz can be used to group similar entries, automating the matching process while handling noise.

Python Code:

from rapidfuzz import process, fuzz
import pandas as pd

# Sample Data
data = pd.DataFrame({
    'transaction_id': [1, 2, 3, 4, 5],
    'vendor': ['Vendor A', 'Vndr A', 'Vendor_A', 'Vendor B', 'Vendoor C'],
    'amount': [100, 200, 150, 300, 250]
})

# Define Reference List of Vendors
reference_vendors = ['Vendor A', 'Vendor B', 'Vendor C']

# Fuzzy Matching
data['matched_vendor'] = data['vendor'].apply(
    lambda x: process.extractOne(x, reference_vendors, scorer=fuzz.ratio)[0]
)

print(data)

Output:

Transaction ID Vendor Amount Matched Vendor
1 Vendor A 100 Vendor A
2 Vndr A 200 Vendor A
3 Vendor_A 150 Vendor A
4 Vendor B 300 Vendor B
5 Vendoor C 250 Vendor C

Advantages of this Approach:

  1. Scalability: Automates reconciliation for thousands of rows in seconds.
  2. Accuracy: Handles complex variations, typos, and abbreviations reliably.
  3. Customization: Adjust similarity thresholds and match logic dynamically.
  4. Integration: Pre-processed fuzzy data can feed into machine learning models for anomaly detection, clustering, or forecasting.

Use Case Example: Combining Techniques:

Scenario: A retailer processes invoices from hundreds of suppliers. Many supplier names are misspelled or abbreviated, making it hard to reconcile payments.

Manual Without Fuzzy Matching With Fuzzy Matching
Manually aligns and matches names. Automates name reconciliation using fuzzy algorithms.
Misses subtle variations (e.g., "Vendor" vs. "Vndr"). Detects and matches all reasonable variations.
Errors propagate into downstream analysis. Pre-processed, consistent data improves accuracy.

Key Takeaways

  1. Excel Techniques:

    • Rule-based and statistical approaches are effective for structured problems with clear thresholds.
    • Limited scalability and accuracy for complex or multivariate data.
    • Without fuzzy matching, manual reconciliation is time-consuming and error-prone.
  2. ML Techniques:

    • Handle multivariate and nonlinear relationships, finding hidden patterns.
    • Adaptable to large datasets and dynamic patterns.
    • Require basic programming knowledge but provide significant insights over traditional methods.
    • Fuzzy matching automates data pre-processing, enhancing accuracy and scalability.