Pharmaceutical

Drug Discovery: Molecular Similarity

Vector-only similarity, plus authored shared-scaffold edges

What if structural similarity search and a curated scaffold graph shared one store, each doing the job it is actually good at?

Drug discovery has two distinct questions, and SwarnDB answers them with two distinct tools. Be explicit about which is which, because conflating them is exactly the trap.

The first question is structural similarity: "what is near this compound in chemical space?" That is genuinely a vector problem. Store molecular fingerprints as vectors, and nearest-neighbor search returns structurally similar molecules in milliseconds. The same vector space supports SLERP interpolation between two known actives to propose candidates in the space between them, centroid computation to find a representative molecule, and ghost detection to surface orphan structures. None of this needs a graph; it is vector math over fingerprints, and that is where it belongs.

The second question is membership: "which compounds belong to the same scaffold family or series?" That is a curated fact, not a similarity score. A scaffold class is defined by a known core motif from your chemistry, so it is an authored, typed edge: SHARES_SCAFFOLD, SAME_SERIES, BIOISOSTERE_OF, each carrying provenance back to the curation or assay that established it. You bulk-import these from your compound registry. The graph holds the relationships a chemist asserts; the vector space holds the similarity the geometry measures. Both live on one object, because the fingerprint vector's id is the molecule's graph node id.

The Traditional Approach

The fragmented stack most teams cobble together today.

  • PubChem/ChEMBL: chemical databases for known compound lookup
  • RDKit: molecular fingerprint generation and similarity computation
  • A separate graph store: curated scaffold families and SAR series
  • Computational chemistry pipeline: docking, ADMET prediction, scoring
  • Manual scaffold analysis stitched to the similarity tooling by hand

  • Similarity search and curated scaffold relationships live in separate systems
  • Exploring between known compounds requires specialized computational chemistry code
  • Scaffold membership is curated in one tool, similarity computed in another
  • No single object carrying both a fingerprint and its asserted family edges
  • Similarity search is batch-oriented, not real-time
  • Orphan molecules (unique structures) are invisible to standard screening

The SwarnDB Approach

One database. Every capability built in.

Fingerprint Similarity (Vector-Only)

Store molecular fingerprints as vectors; nearest-neighbor search returns structurally similar molecules in milliseconds. This part is purely vector-based and needs no graph.

SLERP, Centroid, Ghost (Vector Math)

SLERP interpolates between two known actives to propose candidates between them. Centroid finds a representative molecule. Ghost detection surfaces orphan structures. All server-side vector math over fingerprints.

Authored Scaffold Edges

SHARES_SCAFFOLD, SAME_SERIES, and BIOISOSTERE_OF are typed edges you author from your chemistry, with provenance. Scaffold-family membership is a curated assertion, not a similarity score, so it is an explicit edge.

One Object, Two Lenses

The fingerprint vector's id is the molecule's graph node id. The same molecule answers similarity queries by geometry and family queries by its authored edges, in one store.

  • Structural similarity search is vector-only, exactly where vectors excel
  • SLERP, centroid, and ghost detection run server-side over fingerprints
  • Scaffold families are authored typed edges with provenance, not inferred
  • Similarity and curated relationships on one object, no second store
  • Bulk-import curated scaffold and series edges from CSV or JSONL
  • Real-time search, not batch Tanimoto screening

01. Fingerprint Similarity (Vector-Only)

The foundation of computational drug discovery is molecular fingerprinting. Each compound is reduced to a numerical descriptor, a high-dimensional vector that encodes structural features, functional groups, pharmacophore patterns, and physicochemical properties. Two molecules with similar fingerprints tend to have similar biological activity (the similarity principle, the bedrock of medicinal chemistry).

This part is purely vector-based, and it should be. Store fingerprints as vectors, and nearest-neighbor search returns the structurally closest molecules in milliseconds, not the minutes or hours that batch Tanimoto screening takes over large databases. No graph is involved or needed: "what is near this compound in chemical space?" is exactly the question a vector index answers, and SwarnDB answers it fast and accurately by default.

Keep this clear. Structural similarity is geometry, measured over fingerprints. It is not a relationship a chemist asserted. The curated relationships (which compounds belong to the same scaffold family or SAR series) are a separate, authored layer covered in the next sections. Do not read a similarity neighborhood as a scaffold family; they are different facts, answered by different tools in the same store.
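One practical consequence of moving from batch Tanimoto screening to a cosine-metric vector index is that the two scores are not the same number. The sketch below is a plain-Python illustration, not SwarnDB code; the function names and the toy 8-bit fingerprints are assumptions made for the example. It computes both measures on the same pair of bit vectors.

```python
import math


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two equal-length 0/1 bit lists."""
    both = sum(x & y for x, y in zip(a, b))    # bits set in both
    either = sum(x | y for x, y in zip(a, b))  # bits set in at least one
    return both / either if either else 0.0


def cosine(a, b):
    """Cosine similarity of two equal-length numeric lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy fingerprints (hypothetical): 4 bits set in each, 3 of them shared.
fp_a = [1, 1, 1, 0, 0, 0, 1, 0]
fp_b = [1, 1, 0, 1, 0, 0, 1, 0]

print(tanimoto(fp_a, fp_b))  # 0.6
print(cosine(fp_a, fp_b))    # 0.75
```

On bit vectors the cosine score is never lower than the Tanimoto score for the same pair, so a cutoff tuned for Tanimoto screening should be re-checked before it is reused as a cosine threshold.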

Key insight: Structural similarity is geometry over fingerprints, vector-only and graph-free. A similarity neighborhood is not a scaffold family; that is the authored layer.

chemical_search.py
from swarndb import SwarnDBClient

client = SwarnDBClient(host="localhost", port=50051)

# Hybrid mode: fingerprint vectors plus an authored scaffold graph.
client.collections.create(
    "molecules", dimension=2048, distance_metric="cosine", mode="hybrid"
)

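# Store molecular fingerprints. The insert id is the molecule's node id.
# `library` is assumed to be an iterable of dicts with the keys used below,
# loaded elsewhere (for example from your compound registry).
for compound in library: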
    client.vectors.insert("molecules",
        vector=compound["fingerprint"],  # e.g. Morgan fingerprint
        metadata={"name": compound["name"], "smiles": compound["smiles"],
                  "activity": compound["ic50"], "scaffold": compound["scaffold_class"]}
    )

# Structural similarity search - purely vector, no graph
similar = client.search.query("molecules",
    vector=lead_compound_fingerprint, k=20
)
# The 20 structurally closest molecules. Geometry, not asserted family.

02. Novel Compound Generation

This is SwarnDB's most distinctive capability for drug discovery. SLERP (Spherical Linear Interpolation) generates new vectors along the geodesic path between two known vectors on a hypersphere. In molecular terms: given two known active compounds, SLERP generates fingerprints that represent molecules "between" them in chemical space.

Consider two compounds: Compound A is a potent kinase inhibitor with excellent selectivity but poor solubility. Compound B is a moderately potent kinase inhibitor with good solubility but less selectivity. The chemical space between them potentially contains compounds that balance both properties: moderate potency with moderate solubility.

SLERP at t=0.0 returns Compound A's fingerprint exactly. At t=1.0, Compound B's fingerprint. At t=0.5, the midpoint, a novel fingerprint that represents a hypothetical molecule halfway between the two known actives in chemical space. At t=0.3, a fingerprint closer to Compound A (biased toward potency). At t=0.7, closer to Compound B (biased toward solubility).

These interpolated vectors are points in fingerprint space, not structures: an interpolated vector is generally real-valued even when the stored fingerprints are bit vectors, and nothing guarantees that a valid molecule corresponds to it. To get back to chemistry, reverse-map them to candidate structures using fingerprint-to-structure decoders, or match them against large virtual libraries to find the closest real compounds. The result: candidates a medicinal chemist might not have proposed, because they sit in the uncharted territory between known molecules.

You can generate an entire series by sweeping t from 0.0 to 1.0 in small increments, producing a "chemical path" through space that you can screen computationally. This is systematic exploration of the space between known actives, something traditional toolchains do not offer without specialized code.
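For readers who want to see the arithmetic behind the sweep, here is a minimal plain-Python sketch of the standard SLERP formula. It is not SwarnDB's implementation; the helper names and the two toy 4-bit fingerprints are assumptions made for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    if n == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / n for x in v]


def slerp(a, b, t):
    """Spherical linear interpolation between the directions of a and b."""
    a, b = normalize(a), normalize(b)
    # Clamp to guard against rounding pushing the dot product outside [-1, 1].
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)  # angle between the two unit vectors
    sin_omega = math.sin(omega)
    if sin_omega < 1e-9:
        # Nearly parallel vectors: fall back to renormalized linear
        # interpolation. (Exactly opposite vectors have no unique path.)
        return normalize([(1 - t) * x + t * y for x, y in zip(a, b)])
    wa = math.sin((1 - t) * omega) / sin_omega
    wb = math.sin(t * omega) / sin_omega
    return [wa * x + wb * y for x, y in zip(a, b)]


# Toy fingerprints (hypothetical) for two known actives.
active_a = [1, 1, 0, 0]
active_b = [0, 1, 1, 0]

# Sweep t to produce a path of interpolated vectors between the actives.
path = [slerp(active_a, active_b, t) for t in (0.2, 0.35, 0.5, 0.65, 0.8)]
```

Every vector on the path has unit length and real-valued components, which is why each one still has to be matched against real compounds or decoded before it means anything chemically.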

Key insight: SLERP at t=0.5 generates a fingerprint-space vector that need not match any stored compound, a starting point for finding novel candidates in the uncharted space between two known actives.

slerp_generation.py
# SLERP: generate novel candidates between two known actives
# t=0.0 → Compound A, t=1.0 → Compound B
# t=0.5 → novel midpoint in chemical space

midpoint = client.math.slerp("molecules",
    vector_id_a=compound_a_id,
    vector_id_b=compound_b_id,
    t=0.5  # Midpoint in chemical space
)
# midpoint.vector is an interpolated vector, not a stored compound's fingerprint

# Generate a series along the chemical path
candidates = []
for t in [0.2, 0.35, 0.5, 0.65, 0.8]:
    candidate = client.math.slerp("molecules",
        vector_id_a=compound_a_id,
        vector_id_b=compound_b_id,
        t=t
    )
    candidates.append(candidate)

# Find the closest real compound to each candidate
for candidate in candidates:
    match = client.search.query("molecules",
        vector=candidate.vector, k=1
    )
    # The nearest known compound to this novel fingerprint

03. Molecular Families

Medicinal chemistry revolves around scaffold classes, families of molecules that share a core structural motif. SAR (Structure-Activity Relationship) studies systematically modify a scaffold to understand which changes improve potency, selectivity, or drug-like properties. Mapping scaffold relationships is foundational work, but it's traditionally manual: a chemist examines compounds, identifies the core scaffold, and draws the relationship tree.

Scaffold-family membership is a curated fact, so it is an authored, typed edge. A chemist (or your registry) defines a scaffold class by its core motif; you write that as SHARES_SCAFFOLD edges, SAME_SERIES for a SAR series, and BIOISOSTERE_OF for functionally related scaffolds, each carrying provenance back to the curation that established it. Bulk-import them from your compound registry. This is the relationship layer, and it is asserted, not inferred from fingerprint similarity.
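As a rough illustration of preparing registry rows for bulk import, the sketch below parses a CSV export into edge records with provenance attached. The column names, the sample rows, and the record shape are all assumptions made for the example; the actual file layout SwarnDB expects is not specified here.

```python
import csv
import io

# Hypothetical registry export; the column names are illustrative only.
SCAFFOLD_CSV = """source,target,edge_type,provenance_source,scaffold
mol-001,mol-002,SHARES_SCAFFOLD,registry,pyrimidine
mol-002,mol-003,SAME_SERIES,curation,pyrimidine
mol-004,mol-001,BIOISOSTERE_OF,registry,pyridine
"""

ALLOWED_EDGE_TYPES = {"SHARES_SCAFFOLD", "SAME_SERIES", "BIOISOSTERE_OF"}


def parse_edge_rows(text):
    """Turn CSV text into edge dicts, rejecting unknown edge types."""
    edges = []
    for row in csv.DictReader(io.StringIO(text)):
        if row["edge_type"] not in ALLOWED_EDGE_TYPES:
            raise ValueError(f"unknown edge type: {row['edge_type']}")
        edges.append({
            "source": row["source"],
            "target": row["target"],
            "edge_type": row["edge_type"],
            # Provenance travels with the edge, so each link stays traceable.
            "provenance": {
                "source": row["provenance_source"],
                "scaffold": row["scaffold"],
            },
        })
    return edges


scaffold_edges = parse_edge_rows(SCAFFOLD_CSV)
```

Validating edge types before import keeps a typo in the registry export from silently creating a new relationship type.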

Centroid computation then operates over a curated family. Gather the molecules a SHARES_SCAFFOLD edge set defines, compute the centroid of their fingerprints, and you have the "average molecule" of that family. Search for real compounds near the centroid to find the most representative one. Here the graph supplies the membership and the vector math supplies the summary.
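The centroid step is simple enough to show in plain Python. This is an illustrative sketch rather than SwarnDB's server-side code; the molecule ids and 4-bit fingerprints are made up, and family membership is assumed to have already been read from the authored edges.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length numeric lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def most_representative(family):
    """Return the id of the family member closest to the family centroid."""
    center = centroid(list(family.values()))
    return max(family, key=lambda mol_id: cosine(family[mol_id], center))


# Hypothetical curated family: molecule id -> toy fingerprint.
family = {
    "mol-1": [1, 1, 0, 0],
    "mol-2": [1, 1, 1, 0],
    "mol-3": [1, 1, 1, 1],
}

print(most_representative(family))  # mol-2
```

The centroid itself is an average, not a compound, which is why the last step is always a search for the real molecule nearest to it.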

Scaffold-network exploration is a hybrid query: seed by fingerprint similarity, traverse the authored SHARES_SCAFFOLD and BIOISOSTERE_OF edges to reach the curated family and its bioisosteric neighbors, then rank the surviving frontier by structural similarity. Structure comes from the edges you authored; relevance comes from the geometry. The result is explainable, because every family link traces back to its curation provenance.
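The three stages of that hybrid query (seed, traverse, rank) can be mimicked over in-memory dictionaries. The sketch below is a simplified stand-in, not the SwarnDB query engine: the function name, the one-hop traversal, the data layout, and the choice to return only the traversed frontier are all assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length numeric lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def scaffold_network(fingerprints, edges, query, seed_k, edge_types, rank_k):
    """Seed by similarity, walk one hop of authored edges, rank the frontier."""
    def score(mol_id):
        return cosine(fingerprints[mol_id], query)

    # 1. Seed: the seed_k molecules closest to the query fingerprint.
    seeds = sorted(fingerprints, key=score, reverse=True)[:seed_k]
    # 2. Traverse: follow only edges whose type was asked for.
    frontier = set()
    for seed in seeds:
        for edge_type, target in edges.get(seed, []):
            if edge_type in edge_types:
                frontier.add(target)
    # 3. Rank: order the frontier by similarity to the query.
    return sorted(frontier, key=score, reverse=True)[:rank_k]


# Hypothetical molecules and authored outgoing edges.
fingerprints = {
    "lead-like": [1, 1, 1, 0],
    "family-a": [1, 1, 0, 0],
    "family-b": [1, 0, 0, 1],
    "bioisostere": [0, 1, 1, 1],
    "unrelated": [0, 0, 0, 1],
}
edges = {
    "lead-like": [
        ("SHARES_SCAFFOLD", "family-a"),
        ("SHARES_SCAFFOLD", "family-b"),
        ("BIOISOSTERE_OF", "bioisostere"),
    ],
}

result = scaffold_network(
    fingerprints, edges, query=[1, 1, 1, 0], seed_k=1,
    edge_types={"SHARES_SCAFFOLD", "BIOISOSTERE_OF"}, rank_k=2,
)
print(result)  # ['family-a', 'bioisostere']
```

Note that "unrelated" can never appear in the result however the ranking goes, because no authored edge reaches it: membership is decided by the edges, and similarity only orders what survives.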

Key insight: Scaffold-family membership is an authored, typed edge with provenance, not inferred from similarity. The graph supplies membership; vector math supplies the summary.

molecular_families.py
# Scaffold-family membership is an authored, typed edge (with provenance).
client.graph.put_edge("molecules", source=mol_a, target=mol_b,
                      edge_type="SHARES_SCAFFOLD",
                      provenance={"source": "registry", "scaffold": "pyrimidine"})
client.graph.bulk_import_edges("molecules", scaffold_rows, format="csv")

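# Representative molecule of a curated family (graph membership + vector math).
# pyrimidine_family_ids: ids of the molecules linked by the family's
# SHARES_SCAFFOLD edges, gathered beforehand.
centroid = client.math.centroid("molecules",
    vector_ids=pyrimidine_family_ids
)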
representative = client.search.query("molecules",
    vector=centroid.vector, k=1
)

# Scaffold-network exploration: seed by similarity, walk authored edges, rank.
# Only SHARES_SCAFFOLD is walked here; BIOISOSTERE_OF edges are not followed.
network = (
    client.graph.query("molecules")
    .vector_similar(lead_fingerprint, k=20)
    .traverse("SHARES_SCAFFOLD", direction="outgoing")
    .vector_rank(lead_fingerprint, k=100)
    .return_nodes()
)
# Family membership from authored edges; ranking from fingerprint geometry.

04. Orphan Detection

Every compound library contains orphans, molecules that don't structurally resemble anything else in the collection. In a traditional screening campaign, these compounds are invisible. They don't match any known scaffold class, they don't appear in SAR series, and they're easily overlooked during manual review.

But orphans are interesting. A molecule with no structural relatives in a library might represent a genuinely novel scaffold, a chemical starting point that no one has explored. Alternatively, it might be a data quality issue: a malformed fingerprint, a misclassified compound, or a curation error. Either way, orphans deserve attention.

SwarnDB's ghost detection finds them automatically. Ghost vectors are vectors with no genuine neighbors; they exist in the high-dimensional space but don't cluster with any other vectors. For molecular libraries, ghosts are compounds whose fingerprints are isolated: their maximum similarity to any other compound falls below the threshold. This is purely vector-based, over fingerprints.

The output is a list of molecular ghosts ranked by isolation score. The most isolated compounds, those with the lowest maximum similarity to any neighbor, are the most structurally unique. These are candidates for two very different follow-up actions: if the data is clean, investigate them as novel scaffolds worth exploring. If the data might be dirty, review them for quality issues.

This kind of analysis typically requires a dedicated cheminformatics workflow: compute all-against-all Tanimoto scores, take the maximum per compound (its similarity to its closest neighbor), then rank and filter on that value. In SwarnDB, it's a single function call.
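To make the manual workflow concrete, here is a brute-force plain-Python sketch of it. It is not how SwarnDB implements ghost detection; the function name, the toy library, and the use of cosine rather than Tanimoto as the score are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length numeric lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_ghosts(fingerprints, threshold):
    """Return (id, max_similarity) pairs for isolated molecules.

    A molecule counts as a ghost when its similarity to its closest
    neighbor is below the threshold. Most isolated first.
    """
    ghosts = []
    for mol_id, fp in fingerprints.items():
        # All-against-all comparison: O(n^2) over the whole library.
        max_sim = max(
            (cosine(fp, other) for other_id, other in fingerprints.items()
             if other_id != mol_id),
            default=0.0,
        )
        if max_sim < threshold:
            ghosts.append((mol_id, max_sim))
    return sorted(ghosts, key=lambda pair: pair[1])


# Hypothetical library: molecule id -> toy 6-bit fingerprint.
library = {
    "a": [1, 1, 1, 0, 0, 0],
    "b": [1, 1, 0, 0, 0, 0],
    "c": [0, 1, 1, 0, 0, 0],
    "edge": [0, 0, 1, 1, 0, 0],
    "orphan": [0, 0, 0, 0, 1, 1],
}

print(find_ghosts(library, threshold=0.3))  # [('orphan', 0.0)]
```

The quadratic loop is the cost the paragraph above refers to: every compound is compared with every other one, which is what makes this expensive on a large library.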

Key insight: Ghost detection finds molecules with no structural relatives: potentially novel scaffolds worth exploring, or data errors worth fixing. One function call either way.

orphan_detection.py
# Find orphan molecules - novel scaffolds or data issues
ghosts = client.math.detect_ghosts("molecules",
    threshold=0.3
)

# Ghost molecules: structurally isolated from the library
for ghost in ghosts.results:
    print(f"Orphan: {ghost.metadata['name']}")
    print(f"  SMILES: {ghost.metadata['smiles']}")
    print(f"  Max similarity to any neighbor: {ghost.max_similarity}")
    print(f"  Potential novel scaffold: {ghost.max_similarity < 0.2}")

# Separate novel scaffolds from data quality issues.
# Assumes a "source" metadata field was stored with each molecule at insert time.
novel_scaffolds = [g for g in ghosts.results
    if g.metadata.get("source") == "validated"]
review_needed = [g for g in ghosts.results
    if g.metadata.get("source") != "validated"]

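# For novel scaffolds: generate SLERP candidates
# to explore the space around them
for scaffold in novel_scaffolds:
    # k=2 because a stored vector's closest match is normally itself
    nearest = client.search.query("molecules",
        vector=scaffold.vector, k=2
    )
    neighbor = next((r for r in nearest.results if r.id != scaffold.id), None)
    if neighbor is None:
        continue  # nothing else in the collection to interpolate toward
    # Interpolate between the orphan and its nearest neighbor
    exploration = client.math.slerp("molecules",
        vector_id_a=scaffold.id,
        vector_id_b=neighbor.id,
        t=0.3  # Stay close to the novel scaffold
    )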

SwarnDB vs Traditional Stack

A side-by-side look at the traditional approach versus SwarnDB.

Capability | Traditional Stack | SwarnDB
Similarity Search | Custom RDKit Tanimoto screening | Vector search, milliseconds (vector-only)
Novel Compounds | Manual medicinal chemistry design | SLERP interpolation between known actives
Scaffold Families | Curated in a separate graph tool | Authored typed edges, same object
Representative Molecule | Not automated | Centroid over a curated family
Orphan Detection | All-vs-all screening (expensive) | Ghost detection, one call

Key Metrics

  • Similarity: Vector-Only
  • Interpolation: SLERP
  • Scaffold Families: Typed Edges
  • Family Source: Curation + Provenance
  • Orphan Detection: Ghost Vectors
  • Search Speed: Milliseconds

The Code

Everything above, in a few lines of Python.

drug_discovery.py
from swarndb import SwarnDBClient

client = SwarnDBClient(host="localhost", port=50051)

# Hybrid mode: fingerprint vectors plus an authored scaffold graph.
client.collections.create(
    "molecules", dimension=2048, distance_metric="cosine", mode="hybrid"
)

# Store fingerprints. The insert id is the molecule's graph node id.
client.vectors.insert("molecules",
    vector=compound_fingerprint,
    metadata={"name": "Compound A", "smiles": smiles_str}
)

# Vector-only: structural similarity, SLERP, ghost detection (over fingerprints).
candidate = client.math.slerp("molecules",
    vector_id_a=compound_a, vector_id_b=compound_b, t=0.5
)
ghosts = client.math.detect_ghosts("molecules", threshold=0.3)

# Authored layer: scaffold-family membership as typed edges, with provenance.
client.graph.put_edge("molecules", source=mol_a, target=mol_b,
                      edge_type="SHARES_SCAFFOLD",
                      provenance={"source": "registry", "scaffold": "pyrimidine"})

# Scaffold-network exploration: seed by similarity, walk authored edges, rank.
# Only SHARES_SCAFFOLD is walked here; BIOISOSTERE_OF edges are not followed.
network = (
    client.graph.query("molecules")
    .vector_similar(lead_fingerprint, k=20)
    .traverse("SHARES_SCAFFOLD", direction="outgoing")
    .vector_rank(lead_fingerprint, k=100)
    .return_nodes()
)

Try it yourself

Clone the repo, spin up SwarnDB, and run this use case in minutes.

View on GitHub