Techniques

Excel vs. Machine Learning

Some background:

My career began in internal audit.
As an ICAEW Chartered Accountant, I developed a strong foundation in financial analysis and process improvement.
However, I soon realized that classical methods were insufficient to address the complexities and scale of modern business challenges.
My pivot to data science and AI has enabled me to leverage advanced analytics techniques to solve complex problems.

In this post, I compare traditional Excel-based methods with modern machine learning approaches.

Scenario 1: Inventory Ledger – Detecting Stock Anomalies

A manufacturing company notices discrepancies between recorded and actual stock. The goal is to identify unusual stock movement patterns.

A. Classical Excel Techniques:

Technique: Conditional Formatting and Threshold Filtering

Use Excel functions like IF, AVERAGE, and STDEV to calculate thresholds.
Highlight outliers using conditional formatting.

Steps in Excel:

Calculate the mean (=AVERAGE(range)) and standard deviation (=STDEV(range)) of inventory usage for each product.
Flag values that deviate beyond ±2 standard deviations.

Example Data:

Product ID	Month 1 Usage	Month 2 Usage	Month 3 Usage	Average Usage	Std Dev	Anomaly?
1001	500	510	2000	1000	866.03	Yes
1002	300	305	290	298.33	7.64	No

Limitations:

Only identifies anomalies based on fixed thresholds (±2 SD).
Ignores multivariate patterns (e.g., seasonality or correlations).

B. Modern ML Techniques:

Technique: Isolation Forest for Anomaly Detection

Python Code:

import pandas as pd
from sklearn.ensemble import IsolationForest

# Sample Inventory Data
inventory_data = pd.DataFrame({
    'product_id': [1001, 1002, 1003, 1004],
    'month_1_usage': [500, 300, 100, 700],
    'month_2_usage': [510, 305, 110, 720],
    'month_3_usage': [2000, 290, 105, 730]
})

# Model Training
X = inventory_data[['month_1_usage', 'month_2_usage', 'month_3_usage']]
model = IsolationForest(random_state=42)
inventory_data['anomaly_score'] = model.fit_predict(X)

# Flagging Anomalies
inventory_data['is_anomaly'] = inventory_data['anomaly_score'] == -1
print(inventory_data)

Output:

Product ID	Month 1 Usage	Month 2 Usage	Month 3 Usage	Anomaly Score	Is Anomaly
1001	500	510	2000	-1	True
1002	300	305	290	1	False

Benefits of ML:

Captures multivariate relationships between usage patterns across months.
Adapts to non-linear data distributions.

Scenario 2: Accounting Ledger – Detecting Suspicious Transactions

An e-commerce company suspects fraudulent transactions in its general ledger.

A. Classical Excel Techniques:

Technique: Rule-Based Filters

Use logical Excel formulas (IF, AND, OR) to filter based on known risk patterns.
Example Rules:
- Transactions > $10,000.
- Transactions on weekends or public holidays.
- Payments split into multiple smaller amounts.

Steps in Excel:

Apply filters for conditions like large values (>10000) or weekend dates (=TEXT(date, "ddd")).

Example Data:

Transaction ID	Amount	Date	Risky?
2001	15000	2024-11-12	Yes
2002	5000	2024-11-10	No

Limitations:

Requires pre-defined rules, missing unknown fraud patterns.
Inflexible for complex conditions.

B. Modern ML Techniques:

Technique: Clustering Transactions with K-Means

Python Code:

from sklearn.cluster import KMeans
import numpy as np

# Sample Transaction Data
transaction_data = pd.DataFrame({
    'transaction_id': [2001, 2002, 2003, 2004],
    'amount': [15000, 5000, 100, 700],
    'weekday': [2, 0, 3, 5],  # Mon=0, ..., Sun=6
    'hour_of_day': [14, 20, 11, 22]
})

# Clustering
X = transaction_data[['amount', 'weekday', 'hour_of_day']]
kmeans = KMeans(n_clusters=2, random_state=42)
transaction_data['cluster'] = kmeans.fit_predict(X)

# Flag Anomalous Cluster
transaction_data['is_anomaly'] = transaction_data['cluster'] == transaction_data['cluster'].value_counts().idxmin()
print(transaction_data)

Output:

Transaction ID	Amount	Weekday	Hour of Day	Cluster	Is Anomaly
2001	15000	2	14	1	True
2002	5000	0	20	0	False

Benefits of ML:

Automatically groups similar transactions without manual rule-setting.
Identifies outliers based on clustering of similar patterns.

Scenario 3: Entity Resolution – Matching Vendor Names

You are tasked with reconciling a dataset where transaction descriptions are inconsistent due to typos, abbreviations, and variations in format (e.g., "Vendor A" appears as "Vndr A", "Vendor_A", "VendorA").
Accurate reconciliation requires grouping similar descriptions to identify discrepancies or summarize totals by vendor.

A. Classical Excel Techniques:

Technique: Manual Filtering and String Matching in Excel

This technique relies on manual string matching and tools like Excel's VLOOKUP or IF formulas.

Example Workflow:

Manually Inspect Data: Identify inconsistencies visually.
Exact Matching: Use IF or COUNTIF to match strings exactly.
```
=IF(A1="Vendor A", "Match", "No Match")
```

Partial Matching: Use SEARCH or FIND for substring matching.

=IF(ISNUMBER(SEARCH("Vendor", A1)), "Potential Match", "No Match")

Limitations:

Misses non-obvious variations (e.g., "Vndr_A" will not match "Vendor A").
Time-consuming for large datasets.
Prone to human error and lacks scalability.

B. Modern ML Techniques:

Technique: Using Fuzzy String Matching and Pre-processing

Python libraries like fuzzywuzzy or rapidfuzz can be used to group similar entries, automating the matching process while handling noise.

Python Code:

from rapidfuzz import process, fuzz
import pandas as pd

# Sample Data
data = pd.DataFrame({
    'transaction_id': [1, 2, 3, 4, 5],
    'vendor': ['Vendor A', 'Vndr A', 'Vendor_A', 'Vendor B', 'Vendoor C'],
    'amount': [100, 200, 150, 300, 250]
})

# Define Reference List of Vendors
reference_vendors = ['Vendor A', 'Vendor B', 'Vendor C']

# Fuzzy Matching
data['matched_vendor'] = data['vendor'].apply(
    lambda x: process.extractOne(x, reference_vendors, scorer=fuzz.ratio)[0]
)

print(data)

Output:

Transaction ID	Vendor	Amount	Matched Vendor
1	Vendor A	100	Vendor A
2	Vndr A	200	Vendor A
3	Vendor_A	150	Vendor A
4	Vendor B	300	Vendor B
5	Vendoor C	250	Vendor C

Advantages of this Approach:

Scalability: Automates reconciliation for thousands of rows in seconds.
Accuracy: Handles complex variations, typos, and abbreviations reliably.
Customization: Adjust similarity thresholds and match logic dynamically.
Integration: Pre-processed fuzzy data can feed into machine learning models for anomaly detection, clustering, or forecasting.

Use Case Example: Combining Techniques:

Scenario: A retailer processes invoices from hundreds of suppliers. Many supplier names are misspelled or abbreviated, making it hard to reconcile payments.

Manual Without Fuzzy Matching	With Fuzzy Matching
Manually aligns and matches names.	Automates name reconciliation using fuzzy algorithms.
Misses subtle variations (e.g., `"Vendor"` vs. `"Vndr"`).	Detects and matches all reasonable variations.
Errors propagate into downstream analysis.	Pre-processed, consistent data improves accuracy.

Key Takeaways

Excel Techniques:
- Rule-based and statistical approaches are effective for structured problems with clear thresholds.
- Limited scalability and accuracy for complex or multivariate data.
- Without fuzzy matching, manual reconciliation is time-consuming and error-prone.
ML Techniques:
- Handle multivariate and nonlinear relationships, finding hidden patterns.
- Adaptable to large datasets and dynamic patterns.
- Require basic programming knowledge but provide significant insights over traditional methods.
- Fuzzy matching automates data pre-processing, enhancing accuracy and scalability.