
Sophos AI at Black Hat 2025: Smarter Anomaly Detection

Written by Adriana Aguilar | Aug 14, 2025 3:19:19 PM

In cybersecurity, anomaly detection promises to identify threats by flagging behavior that is “out of the ordinary.” The practical problem, as you’ve likely experienced, is the flood of false positives when trying to detect malicious command lines—noise, alert fatigue, and wasted time.

The research Sophos presented at Black Hat USA 2025 offers an interesting twist: don’t use anomaly detection as your sole “hunter”; use it as a source of uncommon benign data to better train your supervised classifiers. The result? Far fewer false positives and more focus on what’s truly malicious.

 

The Shift in Approach: From “Finding the Bad” to Better Understanding the Legitimate

 

The key idea is to feed (not replace) your supervised models with anomalous but benign commands. To do this, Sophos combined two components:

 

  1. Anomaly detection to locate uncommon commands.

  2. Automatic labeling with LLMs (using OpenAI’s o3-mini) to classify these anomalous commands as benign or malicious with high accuracy.

 

Counterintuitively, the success doesn’t hinge on anomaly detection finding the malicious items—it’s about identifying “benign rarity” to expand the model’s understanding of what is normal in your environment. This “catalog of complex benigns” is what drastically reduces false positives.

 

How It Was Done: Data, Features, and Two Scaling Strategies

 

During January 2025, the researchers processed more than 50 million command lines per day using two ingestion and “featurization” approaches:

 

Full-scale implementation (entire telemetry set)

 

  1. Infrastructure: Apache Spark + AWS SageMaker with auto-scaling.

  2. Manual feature engineering, focusing on:

     - Entropy (command complexity/randomness)
     - Character-level features (presence of tokens/special characters)
     - Token-level features (frequencies and significance in distributions)
     - Behavior checks (obfuscation indicators, data transfers, credential dumping, memory use, etc.)

 

Advantage: Complete coverage and high granularity.
Challenge: High computational cost.
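
As an illustration of the entropy feature mentioned above, a minimal sketch of Shannon entropy over a command line (the actual Sophos feature set is not public; the example commands are invented):

```python
import math
from collections import Counter

def shannon_entropy(command: str) -> float:
    """Shannon entropy in bits per character; higher values suggest
    randomness or obfuscation (e.g., encoded payloads)."""
    if not command:
        return 0.0
    counts = Counter(command)
    total = len(command)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A plain command typically scores lower than an obfuscated-looking one.
print(shannon_entropy("dir C:\\Users"))
print(shannon_entropy("powershell -enc aGVsbG8gd29ybGQhIQ=="))
```

In practice a feature like this would be one column among many; on its own it cannot separate benign from malicious, which is exactly why the research combines it with character-level, token-level, and behavioral signals.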

 

Reduced-scale implementation (~4M daily samples)

 

  1. Semantic embeddings with Jina Embeddings V2 (pre-trained on commands, scripts, and code).

  2. Feasible compute on SageMaker GPU and low-cost EC2 CPU.

  3. No manual feature engineering—semantic vectors capture complex relationships between commands.

 

Advantage: Much lower cost and simpler deployment.
Challenge: Sampling may take longer to capture full diversity.

 

Both approaches worked, offering options depending on budget and compute time needs.

 

Learn more: What is Sophos and how does it improve enterprise cybersecurity?

 

Anomaly Detection: Three Complementary Algorithms

 

After featurization, anomalies were identified using three unsupervised methods for robustness:

 

  1. Isolation Forest – isolates rare points by randomly partitioning feature space.

  2. Modified k-means – uses distance to the centroid to detect points far from common trends.

  3. PCA (Principal Component Analysis) – flags high reconstruction errors in projected subspace.

 

This ensemble avoids reliance on a single “rarity” criterion.
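
A minimal sketch of such a three-method ensemble on toy data, using scikit-learn (the actual Sophos implementation, thresholds, and "modified k-means" details are not public; this illustrates the general pattern only):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy feature matrix: 500 "normal" command-feature vectors plus 5 outliers.
X = np.vstack([rng.normal(0, 1, (500, 8)), rng.normal(6, 1, (5, 8))])

# 1. Isolation Forest: points that are easy to isolate score as anomalous.
iso = IsolationForest(random_state=0).fit(X)
iso_score = -iso.score_samples(X)              # higher = more anomalous

# 2. k-means distance: distance to the nearest cluster centroid.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
km_score = np.min(km.transform(X), axis=1)

# 3. PCA reconstruction error in a low-dimensional subspace.
pca = PCA(n_components=2).fit(X)
recon = pca.inverse_transform(pca.transform(X))
pca_score = np.linalg.norm(X - recon, axis=1)

# Combine: flag a point if it ranks in the top 2% on any single method.
def top_pct(scores, pct=2.0):
    return scores >= np.percentile(scores, 100 - pct)

flagged = top_pct(iso_score) | top_pct(km_score) | top_pct(pca_score)
print(f"{flagged.sum()} of {len(X)} command vectors flagged as anomalous")
```

Taking the union of the three detectors is one simple way to avoid depending on a single rarity criterion; an intersection or a score-averaging scheme would trade recall for precision.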

 

Figure 1: Cumulative distribution of command lines gathered per day over the test month using the full-scale method. The graph shows all command lines, deduplication by unique command line, and near-deduplication by cosine similarity of command line embeddings (Source: Sophos)

 

Avoiding Duplicates: Embeddings + Cosine Similarity

 

Many anomalies are near-identical variants (e.g., a parameter change). To prevent overweighting a pattern, they deduplicated candidates using embeddings (Jina) and cosine similarity, keeping only truly distinct anomalies before labeling.
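
A greedy near-deduplication of this kind can be sketched with plain cosine similarity (toy three-dimensional vectors stand in for real Jina command embeddings; the 0.95 threshold is an illustrative assumption):

```python
import numpy as np

def dedup_by_cosine(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Greedy near-deduplication: keep an item only if its cosine
    similarity to every already-kept item is below `threshold`."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(normed):
        if not kept or np.max(normed[kept] @ vec) < threshold:
            kept.append(i)
    return kept

# The second vector is a near-duplicate of the first; the third is distinct.
vecs = np.array([[1.0, 0.0, 0.0],
                 [0.99, 0.05, 0.0],
                 [0.0, 1.0, 0.0]])
print(dedup_by_cosine(vecs))   # keeps indices 0 and 2
```

On millions of candidates, a pairwise greedy pass like this would be replaced by approximate nearest-neighbor search, but the similarity criterion is the same.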

 

Automatic Labeling with LLM and Validation

 

The o3-mini reasoning LLM labeled each anomaly as benign or malicious. Manual validation later showed near-perfect benign accuracy for an entire week’s worth of data—enough to integrate benigns directly into training datasets with minimal human intervention.

Operational takeaway: You can expand your “good” dataset without hiring an army of analysts, and with statistical confidence.
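
A labeling step with these guardrails might look like the sketch below, assuming an OpenAI-compatible client; the prompt wording and the `parse_label` helper are illustrative inventions, not Sophos's actual pipeline:

```python
# Hypothetical prompt; the real research prompt is not public.
PROMPT_TEMPLATE = (
    "You are a security analyst. Classify the following command line as "
    "BENIGN or MALICIOUS. Answer with a single word.\n\nCommand: {cmd}"
)

def parse_label(llm_reply: str) -> str:
    """Map free-form model output onto the two allowed labels;
    anything ambiguous is routed to human review instead of guessed."""
    word = llm_reply.strip().split()[0].upper().rstrip(".") if llm_reply.strip() else ""
    if word in ("BENIGN", "MALICIOUS"):
        return word
    return "NEEDS_REVIEW"   # guardrail: the LLM only labels, never decides

def label_command(client, command: str, model: str = "o3-mini") -> str:
    """Send one anomalous command to the model and parse its verdict."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(cmd=command)}],
    )
    return parse_label(resp.choices[0].message.content)
```

Keeping the parsing strict (and sampling `NEEDS_REVIEW` plus a random slice of auto-labels for manual QA) is what lets the labeled benigns flow into training data with statistical confidence rather than blind trust.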

 

The Results: Less Noise, More Signal

 

Models were evaluated with two benchmarks:

 

  1. Time split test: three weeks post-training.

  2. Incident test AUC: analyst-labeled dataset (real investigations + active learning).

 

Baselines compared:

 

  1. RB (Regex Baseline): labels via simple regex rules.

  2. AB (Aggregated Baseline): regex + sandbox + client cases + telemetry (more mature pipeline).

 

Gains from adding “anomaly-derived benigns”:

 

  1. AB → AB + Full-scale: +0.2797 AUC in the incident benchmark (0.6138 → 0.8935).

  2. AB → AB + Reduced-scale: 0.8063 AUC (solid improvement at lower cost).

  3. RB → RB + Full-scale: 0.7072 → 0.7689.

  4. RB → RB + Reduced-scale: 0.7077 AUC (less impact, but maintains a high Time split score).

 

In every case, noise (false positives) dropped and useful detection increased.

 

Figure 2: Cumulative distribution of command lines gathered per day over the test month using the reduced-scale method. The reduced-scale curve plateaus more slowly because the sampled data is likely finding more local optima (Source: Sophos)

 

What This Means for You (Step-by-Step Adoption)

 

Prepare data and platform

 

  1. Centralize command telemetry (EDR, shells, scripts, remote tools).

  2. Ensure governance (retention, PII, data minimization).

  3. Decide compute capacity: Spark-style (full-scale) or batch embeddings (reduced-scale).

 

Orchestrate anomaly detection

 

  1. Run Isolation Forest + modified k-means + PCA.

  2. Set thresholds by environment/role (servers, VDI, CI/CD, jump hosts).

 

Deduplicate with embeddings

 

  1. Generate embeddings and filter by cosine similarity.

 

Label with LLM (set guardrails)

 

  1. LLM labels benign/malicious, never authorizes actions.

  2. Periodic QA via manual sampling to track drift.

 

Retrain supervised classifiers

 

  1. Add complex benigns to the dataset.

  2. Validate with Incident test AUC and Time split test.

  3. Track false positive rate by domain (OS, segment, shift, team).
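
The AUC validation step can be reproduced on your own held-out data with scikit-learn; a minimal sketch with invented toy labels and scores (1 = malicious, 0 = benign):

```python
from sklearn.metrics import roc_auc_score

# Hypothetical classifier scores on a small analyst-labeled incident set.
y_true   = [0, 0, 0, 1, 1, 0, 1, 1]
y_scores = [0.1, 0.3, 0.2, 0.8, 0.7, 0.6, 0.9, 0.4]

auc = roc_auc_score(y_true, y_scores)
print(f"Incident test AUC: {auc:.4f}")
```

Comparing this number before and after adding the anomaly-derived benigns, on both the incident set and a time-split set, mirrors the evaluation protocol described above.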

 

Continuous maintenance

 

  1. Monitor for drift (pattern changes).

  2. Set a retraining schedule (monthly/quarterly).

  3. Red team to stress-test edge cases.

 

Read more: Sophos NDR (Network Detection and Response)

 

Risks and How to Mitigate Them

 

  1. Blind dependence on the LLM: restrict its role to labeling, with recurring QA.

  2. Sensitive data: anonymize, segment, and apply least privilege in pipelines.

  3. Cost: for heavy workloads, start with reduced-scale embeddings.

  4. Environment drift: monitor changes (new tools, DevOps, golden images).

 

Conclusion: Anomaly Detection Didn’t “Fail”—It Had a Different Job

 

The key takeaway from the research is powerful: using anomalies to expand benign datasets—rather than blindly “guessing the bad”—changes the game. With that diverse benign data feeding your classifiers, false positives drop, SOC teams can breathe, and your staff can focus on the critical alerts.

At TecnetOne, as certified Sophos partners, we’re here to help you and your company stay ahead in technology with top-quality security services.