Threat Intelligence: Attack Chain Reconstruction
Authored kill-chain edges with provenance and temporal validity
What if the attack chain, the provenance of every link, and the time each link was valid lived in one database?
Cyber attacks are multi-stage operations. An advanced persistent threat follows the kill chain: initial access (phishing or exploit) leads to command-and-control, then lateral movement, then exfiltration. Each stage produces indicators of compromise (IOCs), such as malicious IPs, suspicious domains, file hashes, and behavioral patterns, which appear as isolated alerts in traditional tools.
Reconstructing the chain means asserting how the stages connect, and in threat intel those assertions are facts your analysts and feeds establish: this domain delivered that payload, that payload beaconed to this C2, this C2 was active in this window. Those are explicit, typed edges you author: DELIVERED, BEACONS_TO, MOVED_TO, EXFILTRATED_TO. Each carries provenance (the source feed, the analyst, the case id) and temporal validity (the window the link held), because C2 infrastructure rotates and an edge true last week may be stale today. You author these as you investigate and bulk-import them from your threat feeds.
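To make the shape of such an edge concrete, here is a minimal in-memory sketch. The class name `KillChainEdge`, its field names, and the analyst value are illustrative assumptions, not SwarnDB's actual schema; only the edge types and the provenance keys (feed, analyst, case id) come from the description above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


# Hypothetical record; field names are illustrative, not SwarnDB's schema.
@dataclass
class KillChainEdge:
    source: str                                     # IOC node id, e.g. a payload hash
    target: str                                     # IOC node id, e.g. a C2 IP
    edge_type: str                                  # DELIVERED, BEACONS_TO, MOVED_TO, EXFILTRATED_TO
    provenance: dict = field(default_factory=dict)  # feed, analyst, case id
    valid_from: Optional[datetime] = None           # None means no known start
    valid_to: Optional[datetime] = None             # None means still valid

    def valid_at(self, moment: datetime) -> bool:
        """Return True if the link held at `moment` (bounds inclusive)."""
        if self.valid_from is not None and moment < self.valid_from:
            return False
        if self.valid_to is not None and moment > self.valid_to:
            return False
        return True


beacon = KillChainEdge(
    source="hash:9f2c",
    target="ip:203.0.113.7",
    edge_type="BEACONS_TO",
    provenance={"feed": "misp", "analyst": "jdoe", "case": "INC-204"},
    valid_from=datetime(2024, 3, 1, tzinfo=timezone.utc),
    valid_to=datetime(2024, 3, 8, tzinfo=timezone.utc),
)

incident_time = datetime(2024, 3, 5, tzinfo=timezone.utc)
today = datetime(2024, 4, 1, tzinfo=timezone.utc)
print(beacon.valid_at(incident_time))  # True: the C2 was live during the incident
print(beacon.valid_at(today))          # False: the infrastructure has rotated
```

The same edge answers differently depending on the moment asked about, which is the point of storing a validity window instead of a single "is true" flag.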
IOCs are also embedded as vectors (the IOC's vector id is its graph node id), so the same store ranks by behavioral similarity and walks the authored kill-chain edges. Reconstruction is a hybrid query: seed by similarity to a detected IOC, traverse the typed attack-chain edges, and rank the surviving frontier. Traversal can be quality-aware (weight hops by edge confidence) and temporal (restrict a hop to edges valid at the time of the incident), both opt-in and off by default.
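The three steps of the hybrid query (seed, traverse, rank) can be sketched in plain Python over in-memory data. This is a toy illustration of the idea, not SwarnDB's implementation: the function `hybrid_query`, the tuple layout of the edges, and the two-dimensional vectors are all assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_query(vectors, edges, query_vec, edge_type, seed_k, rank_k):
    """vectors: {node_id: embedding}; edges: list of (source, edge_type, target)."""
    def by_similarity(node_id):
        return cosine(vectors[node_id], query_vec)

    # 1. Seed: the seed_k nodes most similar to the query embedding.
    seeds = set(sorted(vectors, key=by_similarity, reverse=True)[:seed_k])
    # 2. Traverse: follow outgoing edges of the requested type from every seed.
    frontier = {
        target
        for source, etype, target in edges
        if source in seeds and etype == edge_type and target in vectors
    }
    # 3. Rank: order the surviving frontier by similarity to the query.
    return sorted(frontier, key=by_similarity, reverse=True)[:rank_k]


# Node ids double as vector ids, as described above.
vectors = {
    "payload:a": [1.0, 0.0],
    "payload:b": [0.9, 0.1],
    "domain:x": [0.0, 1.0],
    "c2:1": [0.8, 0.2],
    "c2:2": [0.1, 0.9],
}
edges = [
    ("payload:a", "BEACONS_TO", "c2:1"),
    ("payload:b", "BEACONS_TO", "c2:2"),
    ("domain:x", "DELIVERED", "payload:a"),
]

result = hybrid_query(vectors, edges, [1.0, 0.05], "BEACONS_TO", seed_k=2, rank_k=10)
print(result)
```

Only nodes reachable over an authored `BEACONS_TO` edge survive step 2; similarity decides where the walk starts and how the survivors are ordered, not which links exist.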
The Traditional Approach
The fragmented stack most teams cobble together today.
- Authored links live in a graph tool; the IOC vectors live elsewhere
- No temporal validity: a rotated C2 edge stays 'true' long after it is stale
- Provenance is fragmented: hard to show where a link assertion came from
- Cross-category threats (phishing to C2 to exfil) span multiple systems
- SIEM correlation rules only catch known patterns, not novel chains
- Investigation time measured in hours, not milliseconds
The SwarnDB Approach
One database. Every capability built in.
Typed Kill-Chain Edges
DELIVERED, BEACONS_TO, MOVED_TO, and EXFILTRATED_TO are explicit, typed edges you author from investigations and feeds. Each carries provenance (source feed, analyst, case id) so every link in the chain is auditable.
Temporal and Quality-Aware Traversal
Edges carry temporal validity, so a hop can be restricted to links valid at the time of the incident. Hops can be weighted by edge confidence. Both are opt-in and off by default.
Hybrid Reconstruction Query
Seed by similarity to a detected IOC, traverse the typed attack-chain edges, then rank the surviving frontier. Reconstruct the chain from one indicator, with the evidence trail attached.
Clustering and Curation
K-means clustering groups related IOCs into campaigns. Verify and reject actions, together with an audit history on every edge, keep a defensible record of how the chain was assembled.
- Attack-chain links are authored typed edges with provenance, fully auditable
- Temporal validity on edges: stale C2 links can be excluded by time of incident
- Quality-aware traversal weights hops by edge confidence (opt-in)
- IOC vectors and kill-chain edges on one object, no second store
- Reconstruct the chain from one IOC, with the evidence behind every link
- Bulk-import attack-chain edges from threat feeds as CSV or JSONL
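Before a bulk import, a feed usually needs a validation pass. The sketch below parses a JSONL feed and separates usable edges from rejected lines. The line format (keys `source`, `target`, `edge_type`, `provenance`) is a hypothetical example; the format SwarnDB's importer actually expects is not specified here.

```python
import json

ALLOWED_TYPES = {"DELIVERED", "BEACONS_TO", "MOVED_TO", "EXFILTRATED_TO"}
REQUIRED_KEYS = ("source", "target", "edge_type")  # assumed feed schema


def parse_feed_jsonl(text):
    """Return (edges, rejects); rejects are (line_number, reason) pairs."""
    edges, rejects = [], []
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            rejects.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            rejects.append((lineno, "record is not a JSON object"))
            continue
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            rejects.append((lineno, f"missing {', '.join(missing)}"))
            continue
        if record["edge_type"] not in ALLOWED_TYPES:
            rejects.append((lineno, f"unknown edge type {record['edge_type']!r}"))
            continue
        edges.append(record)
    return edges, rejects


feed = "\n".join([
    '{"source": "domain:x", "target": "hash:9f2c", "edge_type": "DELIVERED", '
    '"provenance": {"feed": "misp", "case": "INC-204"}}',
    '{"source": "hash:9f2c", "target": "ip:203.0.113.7", "edge_type": "BEACONS_TO"}',
    '{"source": "hash:9f2c", "edge_type": "BEACONS_TO"}',
    '{"source": "a", "target": "b", "edge_type": "RELATED"}',
])

feed_edges, rejects = parse_feed_jsonl(feed)
print(len(feed_edges), "edges accepted")
for lineno, reason in rejects:
    print(f"line {lineno}: {reason}")
```

Keeping the rejected lines with their reasons, instead of silently dropping them, preserves the audit trail for the feed itself.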
01. Kill Chain Reconstruction
The kill chain is the blueprint of a cyber attack. Lockheed Martin's model defines seven stages: Reconnaissance, Weaponization, Delivery, Exploitation, Installation, Command & Control, and Actions on Objectives. A successful defense requires understanding the full chain, not just individual links.
Traditional detection catches individual stages. The email security gateway blocks a phishing email (Delivery). The endpoint detection tool flags a suspicious binary (Installation). The network monitor detects unusual outbound traffic (C2). The DLP system alerts on large data transfers (Exfiltration). That is four alerts in four tools, and unless an analyst manually correlates them, the organization does not know it is under a coordinated attack.
SwarnDB reconstructs the chain by walking the typed edges you authored. Start from any IOC, say, the suspicious outbound connection flagged by the network monitor. The connections between this IOC and the rest of the attack are explicit, typed edges: BEACONS_TO the C2, DELIVERED by a phishing domain, MOVED_TO another host. Each was asserted by an analyst or a feed, carries provenance, and carries the time window it was valid.
A hybrid query traces the chain. Seed by similarity to the detected IOC, then traverse the typed attack-chain edges across hops: BEACONS_TO reaches the C2 infrastructure, DELIVERED reaches the phishing domain and email artifacts, MOVED_TO reaches lateral-movement hosts, EXFILTRATED_TO reaches the data-theft endpoints. Rank the surviving frontier so the most relevant indicators surface first. Because edges carry temporal validity, you can restrict the traversal to links that were valid at the time of the incident, so a rotated C2 that is stale today does not pollute the chain.
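The effect of the two opt-in modes can be shown with a small in-memory walk. This is a sketch under stated assumptions, not SwarnDB's traversal: the function `walk_chain`, the edge tuple layout, and the scoring rule (a path's score is the product of its edge confidences, keeping the best path per node) are choices made for this example.

```python
from datetime import datetime, timezone


def walk_chain(edges, start, hop_types, at=None, use_confidence=False):
    """Follow hop_types in order from start; return {reached_node: path_score}.

    edges: (source, edge_type, target, confidence, valid_from, valid_to) tuples.
    Both filters are off by default, mirroring the opt-in behavior described.
    """
    frontier = {start: 1.0}
    for hop in hop_types:
        next_frontier = {}
        for source, etype, target, confidence, valid_from, valid_to in edges:
            if etype != hop or source not in frontier:
                continue
            # Temporal filter (opt-in): skip edges that did not hold at `at`.
            if at is not None and not (valid_from <= at <= valid_to):
                continue
            # Quality-aware scoring (opt-in): multiply confidences along the path.
            score = frontier[source] * (confidence if use_confidence else 1.0)
            next_frontier[target] = max(score, next_frontier.get(target, 0.0))
        frontier = next_frontier
    return frontier


def day(d):
    return datetime(2024, 3, d, tzinfo=timezone.utc)


edges = [
    ("domain:x", "DELIVERED", "hash:9f2c", 0.9, day(1), day(31)),
    ("hash:9f2c", "BEACONS_TO", "ip:203.0.113.7", 0.8, day(1), day(8)),
    # The C2 rotated on March 9: same payload, new address.
    ("hash:9f2c", "BEACONS_TO", "ip:198.51.100.9", 0.6, day(9), day(31)),
]
hops = ["DELIVERED", "BEACONS_TO"]

default_walk = walk_chain(edges, "domain:x", hops)
incident_walk = walk_chain(edges, "domain:x", hops, at=day(5), use_confidence=True)

print(sorted(default_walk))  # both C2 addresses, regardless of when they were live
print({node: round(score, 2) for node, score in incident_walk.items()})
```

With the temporal filter set to the incident date, only the C2 address that was live on March 5 remains in the chain, and its score reflects the confidence of both hops that led to it.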
The result is the kill chain reconstructed from a single mid-chain indicator, with a provenance record behind every link and a timeline that reflects when each link actually held. The analyst gets a defensible investigation package, not a disconnected alert.
Key insight: The kill chain is walked over typed edges an analyst or feed authored, each with provenance and a validity window. Temporal traversal excludes stale links.
# From one IOC, walk typed attack-chain edges to reconstruct the chain.
# Temporal traversal is opt-in (only edges valid at the incident time);
# it is not enabled in this snippet.
chain = (
    client.graph.query("threats")
    .vector_similar(detected_ioc_embedding, k=20)
    .traverse("BEACONS_TO", direction="outgoing")
    .vector_rank(detected_ioc_embedding, k=100)
    .return_nodes()
)
# Each hop follows an edge an analyst or feed authored, with provenance
# and a validity window. BEACONS_TO -> DELIVERED -> MOVED_TO -> EXFILTRATED_TO.
# Group the returned indicators by kill-chain stage.
kill_chain = {}
for node in chain.nodes:
    stage = node.metadata["category"]
    kill_chain.setdefault(stage, []).append(node)
# Phishing -> C2 -> Lateral Movement -> Exfiltration,
# from ONE indicator, with the evidence trail behind every link.
02. Threat Clustering
Security operations centers (SOCs) are drowning in alerts. A mid-size organization generates thousands of security events per day. Most are noise. The real signals, the alerts that indicate an actual attack, are scattered across the noise, often arriving hours apart with no obvious connection.
Manual triage is the current solution: analysts review alerts one by one, classify them, and try to identify which ones are related. This is expensive (senior analysts are scarce), slow (an investigation can take hours), and error-prone (related alerts are easy to miss when they arrive in different time windows or from different detection systems).
SwarnDB's clustering operation transforms this workflow. K-means clustering on threat indicator embeddings automatically groups related IOCs into campaign clusters. Indicators from the same attack campaign share behavioral characteristics (similar infrastructure patterns, timing signatures, or technical attributes), even when they appear across different categories and time windows.
The clustering is unsupervised: you don't need to define what a "campaign" looks like. You specify the expected number of distinct patterns (k), and SwarnDB groups indicators by behavioral similarity. Each cluster represents a distinct threat pattern: either a specific campaign, a known attack type, or a class of false positives. The SOC analyst reviews clusters, not individual alerts. A cluster of 40 indicators from a coordinated campaign becomes one investigation case, not 40 separate triage decisions.
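For readers who want to see what k-means does with indicator embeddings, here is a plain-Python sketch of Lloyd's algorithm on toy data. It is illustrative only: it uses squared Euclidean distance and a deterministic "first k points" initialization, and it makes no claim about the distance, initialization, or result format SwarnDB uses internally. The vectors and category labels are invented for the example.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns (centroids, assignments)."""
    # Deterministic init for the sketch: the first k points. Production code
    # would typically use k-means++ or several random restarts.
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assignments = [
            min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


# Toy behavioral embeddings with their alert categories.
points = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9], [0.95, 0.05]]
categories = ["phishing", "scan", "malware", "scan", "c2"]

centroids, assignments = kmeans(points, k=2)

# Collect the categories that ended up in each cluster.
by_cluster = {}
for cluster_id, category in zip(assignments, categories):
    by_cluster.setdefault(cluster_id, set()).add(category)

for cluster_id, cats in sorted(by_cluster.items()):
    label = "multi-category" if len(cats) > 1 else "single-category"
    print(cluster_id, sorted(cats), label)
```

The phishing, malware, and C2 indicators land in one cluster because their embeddings are close, even though they come from different alert categories; that cross-category grouping is the signal the analyst reviews.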
Key insight: Thousands of alerts become 10 campaign clusters. Multi-category clusters point to coordinated attacks. Single-category clusters are more likely isolated incidents or false positives. Triage in minutes, not hours.
# Group threat indicators into campaign clusters
clusters = client.math.cluster(
    "threats",
    k=10,  # Expected distinct threat patterns
)
# Each cluster is a potential campaign
high_risk_clusters = []
for cluster in clusters.results:
    categories = {v.metadata["category"] for v in cluster.vectors}
    timestamps = [v.metadata["timestamp"] for v in cluster.vectors]
    timespan = max(timestamps) - min(timestamps)
    severity = "HIGH" if len(categories) > 2 else "MEDIUM"
    if severity == "HIGH":
        high_risk_clusters.append(cluster)
    print(f"Campaign {cluster.id}:")
    print(f"  Indicators: {len(cluster.vectors)}")
    print(f"  Categories: {', '.join(sorted(categories))}")
    print(f"  Timespan: {timespan}")
    print(f"  Severity: {severity}")
# Multi-category clusters = coordinated attacks
# Single-category clusters = isolated incidents or false positives
# Drill into a suspicious campaign: walk typed edges from the seed
if high_risk_clusters:
    campaign = high_risk_clusters[0]
    full_chain = (
        client.graph.query("threats")
        .vector_similar(campaign.centroid, k=20)
        .traverse("BEACONS_TO", direction="outgoing")
        .vector_rank(campaign.centroid, k=100)
        .return_nodes()
    )
03. Cross-Category Correlation
The hardest problem in threat intelligence is connecting indicators across categories. A phishing domain (delivery) and a file hash (installation) are different types of artifacts in different systems. A C2 IP and a data-transfer pattern (exfiltration) look nothing alike to a rule engine. Yet they are part of the same attack.
In SwarnDB those cross-category links are typed edges that span categories: a DELIVERED edge from a phishing domain to a payload file hash, a BEACONS_TO edge from that payload to a C2 IP, an EXFILTRATED_TO edge from a compromised host to a destination. An analyst or a feed asserts each one, with provenance, so the correlation is a recorded fact, not a guess. The edge type carries the meaning ("this domain delivered that payload"), which a similarity score never could.
Similarity still helps, in its proper role: it surfaces candidate links to review and ranks the frontier. When a new IOC arrives, seed by similarity to find the closest known indicators, then traverse the authored cross-category edges to pull in the connected domains, hashes, C2 IPs, and exfiltration endpoints, and rank what survives. The query returns the relationship network and the provenance behind it, so the analyst sees not just that two indicators are related but who asserted the link and when it was valid.
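The "surfaces candidate links to review" role can be sketched as follows. This is a hypothetical helper, not a SwarnDB API: the function `candidate_links`, the `pending_review` status, and the 0.8 threshold are assumptions chosen for the example. The key property is that similarity only proposes; it never authors an edge.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def candidate_links(new_id, new_vec, vectors, authored_pairs, threshold=0.8):
    """Suggest untyped links for analyst review, most similar first."""
    candidates = []
    for node_id, vec in vectors.items():
        if (new_id, node_id) in authored_pairs or (node_id, new_id) in authored_pairs:
            continue  # already a recorded fact, nothing to review
        score = cosine(new_vec, vec)
        if score >= threshold:  # threshold is an arbitrary illustrative cutoff
            candidates.append({
                "source": new_id,
                "target": node_id,
                "similarity": round(score, 3),
                "status": "pending_review",  # an analyst decides type and validity
            })
    return sorted(candidates, key=lambda c: c["similarity"], reverse=True)


known = {
    "hash:9f2c": [0.9, 0.1],
    "ip:203.0.113.7": [0.85, 0.2],
    "domain:old": [0.0, 1.0],
}
authored = {("domain:new", "hash:9f2c")}  # this link is already asserted

candidates = candidate_links("domain:new", [1.0, 0.1], known, authored)
for candidate in candidates:
    print(candidate["target"], candidate["status"])
```

The candidate carries no edge type on purpose: deciding whether the relationship is DELIVERED, BEACONS_TO, or nothing at all is the analyst's assertion, recorded with provenance once made.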
Key insight: Cross-category links are authored typed edges with provenance, not similarity guesses. Similarity surfaces candidates and ranks; the edge type carries the meaning.
# New IOC: seed by similarity, then traverse authored cross-category edges.
enriched = (
    client.graph.query("threats")
    .vector_similar(new_ioc_embedding, k=20)
    .traverse("DELIVERED", direction="outgoing")
    .vector_rank(new_ioc_embedding, k=20)
    .return_nodes()
)
# A phishing domain pulls in, via typed edges with provenance:
# - DELIVERED -> payload file hashes
# - BEACONS_TO -> C2 IPs
# - EXFILTRATED_TO -> data-theft endpoints
for node in enriched.nodes:
    print(node.metadata["indicator"], node.metadata["category"])
# The edge TYPE carries the meaning ("delivered", "beacons to"),
# and each edge records who asserted it and when it was valid.
04. Real-Time Enrichment
Speed matters in incident response. When a new IOC is detected (a suspicious domain in DNS logs, an unknown file hash in endpoint telemetry, an unusual outbound connection), the analyst needs context immediately. What is this indicator? Is it known? What campaign is it associated with? What other indicators should we look for?
Traditional enrichment involves querying multiple systems: check the threat intel platform for known associations, search the SIEM for related alerts, query the graph database for relationship mapping. Each query takes seconds, requires different interfaces, and returns results in different formats. Assembling a complete picture takes minutes to hours.
SwarnDB provides real-time enrichment in one query. The new IOC is embedded, similarity finds the closest known indicators, and a traversal over the authored attack-chain edges returns the connected indicators with their provenance: who asserted each link and the window it was valid. The analyst sees the IOC's place in known campaigns, not just a list of look-alikes.
If the IOC matches a known indicator, the traversal provides the full campaign context: the connected indicators, the stages of the kill chain, the associated artifacts, each link auditable. If the IOC is novel (no close matches and no authored edges yet), similarity still gives a starting point: the nearest known indicators and categories, so the analyst has somewhere to begin instead of a blank page, and can author new edges as the investigation establishes them.
The response time is measured in milliseconds. Detection to full contextual enrichment happens in about the time a traditional system takes to complete a single database query, with a provenance trail behind every asserted link.
Key insight: Seed by similarity, traverse authored edges, rank the frontier, all in one query. Every connected indicator carries the provenance of how it was linked.
# Real-time IOC enrichment: milliseconds, not minutes
# New suspicious domain detected in DNS logs
new_ioc_embedding = embed_indicator(suspicious_domain)
# Seed by similarity, traverse authored edges, rank the frontier.
enrichment = (
    client.graph.query("threats")
    .vector_similar(new_ioc_embedding, k=10)
    .traverse("DELIVERED", direction="outgoing")
    .vector_rank(new_ioc_embedding, k=10)
    .return_nodes()
)
if enrichment.nodes:
    # Known context: connected indicators with provenance and validity.
    for node in enrichment.nodes:
        print(node.metadata["indicator"], node.metadata["category"])
else:
    # Novel IOC: fall back to nearest known indicators as a starting point.
    nearest = client.search.query("threats", vector=new_ioc_embedding, k=10)
    print("NOVEL IOC: nearest known indicator provides a starting point")
# Continuous monitoring: enrich every new IOC as it arrives, one query each.
SwarnDB vs Traditional Stack
A side-by-side look at the traditional approach versus SwarnDB.
| Capability | Traditional Stack | SwarnDB |
|---|---|---|
| Kill Chain | Manual correlation in a separate graph tool | Walk typed edges from one IOC |
| Link provenance | Fragmented across systems | Per-edge audit record |
| Temporal validity | None: stale links stay 'true' | Time-filtered hops (opt-in) |
| Edge confidence | Not modeled | Quality-aware traversal (opt-in) |
| Stack | SIEM + Threat Intel + Graph tools | One database, one auditable view |
The Code
Everything above, in a few lines of Python.
from swarndb import SwarnDBClient

client = SwarnDBClient(host="localhost", port=50051)
# Hybrid mode: IOC vectors and a typed attack-chain graph in one store.
client.collections.create(
    "threats", dimension=384, distance_metric="cosine", mode="hybrid"
)
# Placeholders supplied by your pipeline: domain_id, payload_id, feed_edges,
# detected_ioc_embedding, edge_id.
# Author kill-chain edges from investigations / feeds, with provenance.
client.graph.put_edge(
    "threats",
    source=domain_id,
    target=payload_id,
    edge_type="DELIVERED",
    provenance={"feed": "misp", "case": "INC-204"},
)
client.graph.bulk_import_edges("threats", feed_edges, format="jsonl")
# Reconstruct the chain: seed by similarity, traverse typed edges, rank.
chain = (
    client.graph.query("threats")
    .vector_similar(detected_ioc_embedding, k=20)
    .traverse("BEACONS_TO", direction="outgoing")
    .vector_rank(detected_ioc_embedding, k=100)
    .return_nodes()
)
# Group indicators into campaigns (vector math).
clusters = client.math.cluster("threats", k=10)
# Curate the chain: verify, reject, audit history.
client.graph.verify_edge("threats", edge_id)
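The curation step (verify, reject, audit history) can be pictured as an append-only log attached to each edge. The sketch below is a hypothetical in-memory model of that idea; the class `EdgeAudit`, its method names, and the analyst names are invented for illustration and do not describe SwarnDB's curation API.

```python
from datetime import datetime, timezone


class EdgeAudit:
    """Append-only curation history for one edge (hypothetical, in-memory)."""

    def __init__(self, edge_id):
        self.edge_id = edge_id
        self.history = []  # entries are only ever appended, never edited

    def _record(self, action, analyst, note):
        self.history.append({
            "action": action,
            "analyst": analyst,
            "note": note,
            "at": datetime.now(timezone.utc),
        })

    def verify(self, analyst, note=""):
        self._record("verified", analyst, note)

    def reject(self, analyst, note=""):
        self._record("rejected", analyst, note)

    @property
    def status(self):
        """Latest decision wins; 'unreviewed' until someone acts."""
        return self.history[-1]["action"] if self.history else "unreviewed"


audit = EdgeAudit("edge-001")
audit.verify("jdoe", "beacon confirmed in proxy logs")
audit.reject("asmith", "C2 rotated; link no longer observed")

print(audit.status)
for entry in audit.history:
    print(entry["action"], entry["analyst"], entry["note"])
```

Because a later rejection does not erase the earlier verification, the history shows how the assessment of a link changed over the course of the investigation.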