
Architecture

Overview

Minerva operates as two distinct pipelines: an ingestion pipeline that processes scientific papers and builds the knowledge graph, and a query pipeline that answers questions using graph + textbook retrieval. Both are served through a FastAPI backend (api.py).

Ingestion pipeline (diagram): Retrieval (arXiv/OpenAlex) → Extraction (GPT-4o-mini) → Verification (GPT-4o-mini) → Critic Layer (Python) → TTD-DR (FAISS+LLM) → Neo4j Write

Query pipeline (diagram): Question (natural language input) → Cypher generation (GPT-4o) → Neo4j, with FAISS TextbookKB search in parallel (fallback if 0 results) → GPT-4o-mini synthesis → Answer

Ingestion Pipeline

The pipeline is triggered via POST /search/start and runs as a FastAPI BackgroundTask. Progress is polled via GET /search/progress/{job_id}. The pipeline has six sequential stages: retrieval (Stage 1), followed by five per-paper stages (Stages 2 to 6).

Stage 1 — Paper Retrieval (retrieval/)

Papers are fetched from three sources simultaneously via asyncio.gather:

| Source | Module | What it returns |
|---|---|---|
| arXiv | arxiv_source.py | Preprints with full abstracts |
| OpenAlex | openalex_source.py | Open academic graph; broad coverage |
| CrossRef | crossref_source.py | DOI metadata; strong for journal papers |

After fetching, three post-processing steps run:

1. Relevance filter (is_relevant) — drops papers whose titles don't contain enough query keywords. For queries with ≤ 2 keywords, all must appear; for longer queries, more than half must match. Stop words (the, a, of, ...) are excluded before counting.

2. Deduplication (deduplicate) — uses DOI as primary key, MD5(title+year) as fallback. If a duplicate is found with a better abstract, the better one wins.

3. Unpaywall enrichment — for papers without abstracts that have a DOI, the Unpaywall API is queried for an open-access PDF URL. If found, PyMuPDF (fitz) downloads and extracts the first 3,000 characters of text as the abstract.
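The relevance rule from step 1 can be sketched as follows. This is a minimal illustration rather than the project's code: the stop-word list, the tokenizer, whole-word matching, and the `is_relevant(title, query)` signature are all assumptions.

```python
import re

# Assumed stop-word list; the real list is not shown in this document.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "for", "and", "to", "with"}


def keywords(text: str) -> list[str]:
    """Lowercase, split on non-alphanumerics, drop stop words and repeats."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return list(dict.fromkeys(t for t in tokens if t not in STOP_WORDS))


def is_relevant(title: str, query: str) -> bool:
    """Apply the keyword-count rule: all keywords for short queries, a majority otherwise."""
    query_kws = keywords(query)
    if not query_kws:
        return True  # assumption: with no keywords there is nothing to filter on
    title_kws = set(keywords(title))
    matched = sum(1 for kw in query_kws if kw in title_kws)
    if len(query_kws) <= 2:
        return matched == len(query_kws)   # short query: every keyword must appear
    return matched > len(query_kws) / 2    # longer query: more than half must match
```

For example, the query "band gap of silicon carbide" reduces to four keywords, so a title needs at least three of them to pass.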

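```python
import asyncio

# Parallel fetch: all three sources at once.
# With return_exceptions=True a failing source yields its exception object
# in place of a result list, instead of raising out of gather().
arxiv, openalex, crossref = await asyncio.gather(
    search_arxiv(query, max_per_source),
    search_openalex(query, max_per_source),
    search_crossref(query, max_per_source),
    return_exceptions=True,
)
```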
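The deduplication step (step 2 above) might look roughly like the sketch below. Papers are modeled as plain dicts with `doi`, `title`, `year`, and `abstract` keys, and "better abstract" is taken to mean "longer abstract"; both choices are assumptions, as is the normalization applied before hashing.

```python
import hashlib


def dedup_key(paper: dict) -> str:
    """DOI when present, otherwise MD5 of title+year (normalization is assumed)."""
    doi = (paper.get("doi") or "").strip().lower()
    if doi:
        return "doi:" + doi
    raw = f"{(paper.get('title') or '').strip().lower()}{paper.get('year') or ''}"
    return "md5:" + hashlib.md5(raw.encode("utf-8")).hexdigest()


def deduplicate(papers: list[dict]) -> list[dict]:
    """Keep one paper per key, preferring the copy with the longer abstract."""
    best: dict[str, dict] = {}
    for paper in papers:
        key = dedup_key(paper)
        current = best.get(key)
        if current is None or len(paper.get("abstract") or "") > len(current.get("abstract") or ""):
            best[key] = paper
    return list(best.values())
```

Because the fallback key is built from title and year only, two distinct papers that share both would collapse into one; the DOI path avoids that whenever a DOI is available.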

Stage 2 — Entity & Relation Extraction (agents/extraction_agent.py)

Model: gpt-4o-mini
Output format: response_format={"type": "json_object"} — structured JSON, no markdown fences
Retry logic: 3 attempts with 10s sleep on 429 rate limit errors
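The retry policy above can be sketched as a small wrapper. `RateLimitError` here is a local stand-in for whatever exception the client raises on HTTP 429, and `call_with_retry` is a hypothetical helper name; neither is taken from the project.

```python
import time


class RateLimitError(Exception):
    """Stand-in for the client's HTTP 429 exception (assumed name)."""


def call_with_retry(fn, attempts=3, delay_s=10.0, sleep=time.sleep):
    """Call fn(); on a rate-limit error wait delay_s and retry, up to `attempts` calls."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == attempts:
                raise  # out of attempts: surface the error to the caller
            sleep(delay_s)
```

The `sleep` parameter is injectable only so that the wait can be stubbed out in tests.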

7 entity types extracted:

| Type | Description | Valid Example | Invalid Example |
|---|---|---|---|
| Material | Named chemical substance or compound | NbN, MgB2, copper nanowire | semiconductor, device |
| Property | Measurable characteristic (name only, no value) | superconductivity, band gap | good performance, high quality |
| Value | Numerical measurement with unit | 1.107 eV, 39 K, 207 GPa | high, several, improved |
| Application | Specific real-world use case | single-photon detection, blue LED | research, applications |
| Method | Fabrication/synthesis technique (noun, not verb) | CVD, sputtering, annealing | deposited, grown, fabricated |
| Element | Periodic table symbol only (1-2 chars) | Nb, Si, Au | niobium, silicon (full names) |
| Formula | Exact chemical formula | MgB2, NbN, Al2O3 | alloy, compound |
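To make the Element and Value rows concrete, here is a rough shape check for those two types. The regular expressions and function names are illustrative assumptions; the document does not say how the extraction prompt or later stages enforce these rules, and the Element pattern checks only the shape of a symbol, not membership in the periodic table.

```python
import re

# One uppercase letter, optionally followed by one lowercase letter (shape only).
ELEMENT_RE = re.compile(r"[A-Z][a-z]?")
# A number followed by a unit that starts with a letter or a common unit symbol.
VALUE_RE = re.compile(r"[-+]?\d+(?:\.\d+)?\s*[A-Za-zµΩ°%]\S*")


def looks_like_element(text: str) -> bool:
    """True for strings shaped like a 1-2 character element symbol."""
    return ELEMENT_RE.fullmatch(text) is not None


def looks_like_value(text: str) -> bool:
    """True for strings shaped like '<number> <unit>'."""
    return VALUE_RE.fullmatch(text.strip()) is not None
```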

6 relation types:

```text
(Material)-[:HAS_PROPERTY]->(Property)       # intrinsic material property
(Material)-[:HAS_VALUE]->(Value)             # numerical measurement + unit
(Material)-[:USED_IN]->(Application)         # real-world application
(Material)-[:SYNTHESIZED_BY]->(Method)       # fabrication method
(Material)-[:HAS_ELEMENT]->(Element)         # constituent element
(Material)-[:HAS_FORMULA]->(Formula)         # chemical formula
```

The HAS_VALUE relation stores three fields on the edge: value, unit, and evidence (the source sentence).


Stage 3 — Verification Agent (agents/verification_agent.py)

Model: gpt-4o-mini

A second LLM pass checks every extracted relation against the original abstract. Its main job is semantic type classification for HAS_PROPERTY relations:

| Semantic class | Decision | Examples |
|---|---|---|
| material_property | ACCEPT | superconductivity, resistivity, band gap, critical temperature |
| device_metric | REJECT | response time, detection efficiency, dark count rate |
| measurement | REJECT | I-V curve, resistance vs temperature |
| phenomenon | REJECT | quantum phase slips, Andreev reflection |
| process | REJECT | thermally activated, flux creep |
| geometry | REJECT | film thickness, wire width |
| operating_condition | REJECT | wavelength of operation, voltage bias |

The result is an accepted_set of (source, target, relation_type) tuples. The extraction result is filtered against this set. If the verifier runs successfully but accepts nothing, all relations for that paper are dropped.
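The filtering step described above amounts to a set-membership test. The sketch below assumes relations are available as `(source, target, relation_type)` records; the `Relation` type and `filter_relations` name are hypothetical.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    source: str
    target: str
    relation_type: str


def filter_relations(relations: list[Relation], accepted_set: set[tuple]) -> list[Relation]:
    """Keep only relations whose (source, target, relation_type) tuple was accepted."""
    return [
        r for r in relations
        if (r.source, r.target, r.relation_type) in accepted_set
    ]
```

With an empty `accepted_set` the result is an empty list, which matches the behavior described for a verifier that runs successfully but accepts nothing.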


Stage 4 — Critic Layer (critical_layer/schema_validator.py)

Pure Python — no LLM, no API calls. 7 deterministic rules run synchronously before TTD-DR. See Critic Layer for the full rule list and implementation.


Stage 5 — TTD-DR (agents/ttd_dr.py)

Model: gpt-4o-mini (temperature=0)
Concurrency: asyncio.Semaphore(3) — max 3 parallel OpenAI calls per paper

Each relation is converted to a natural language claim, verified against TextbookKB via FAISS, and judged. Only SUPPORTED claims proceed to Neo4j. See TTD-DR for details.
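The concurrency limit can be illustrated with a semaphore-guarded fan-out. In this sketch `judge` is a placeholder for the coroutine that retrieves evidence and asks the model for a verdict; its name, its signature, and the idea that it returns the verdict as a string are assumptions.

```python
import asyncio


async def judge_claims(claims, judge, max_parallel=3):
    """Judge all claims with at most `max_parallel` judge() calls in flight."""
    semaphore = asyncio.Semaphore(max_parallel)

    async def guarded(claim):
        async with semaphore:  # blocks while max_parallel calls are already running
            return await judge(claim)

    verdicts = await asyncio.gather(*(guarded(c) for c in claims))
    # Only claims judged SUPPORTED continue to the graph write.
    return [c for c, v in zip(claims, verdicts) if v == "SUPPORTED"]
```

All tasks are scheduled at once, but the semaphore ensures that only three of them are inside `judge` at any moment.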


Stage 6 — Neo4j Write (graph/graph_builder.py)

Pure Python, no LLM. Verified nodes and relations are written using Cypher MERGE, which matches an existing node or relation before creating one, so repeated writes do not create duplicates. Unique constraints exist for: Material, Property, Application, Method, Element, Formula.

Paper nodes store full metadata:

```cypher
MERGE (p:Paper {title: $title})
SET p.doi = $doi, p.url = $url, p.year = $year, p.abstract = $abstract
```

Material nodes store key properties directly on the node for fast lookup without traversals:

```text
Material {
    name, notes,
    band_gap_eV, electron_mobility_m2_Vs, hole_mobility_m2_Vs,
    lattice_parameter_nm, atom_radius_nm, density_kg_m3, atom_density_m3,
    molecular_weight, intrinsic_carrier_density_300K, crystal_structure
}
```


Query Pipeline

Served at POST /graph/ask. Takes a natural language question and returns a synthesized answer.

Step 1 — Cypher Generation

Model: gpt-4o (more capable than mini — Cypher quality matters here)

The CYPHER_PROMPT includes:

- Full graph schema (all node types, all relation types, Value edge properties)
- Material node direct properties (for fast m.band_gap_eV lookups)
- SemiconductorConcept node fields (c.name, c.formula, c.description)
- 3 named query patterns (A: material lookup, B: concept-first, C: material+concept)
- Keyword → SemiconductorConcept name mapping table:

| Keywords | Concept name |
|---|---|
| linear density, [111] | Linear Density [111] |
| planar density, (111) | Planar Density (111) |
| conduction band, promotion, probability | Electron Promotion Probability |
| band gap, characterize | Band Gap from Conductivity |
| vacancy, Schottky | Vacancy Density |
| doping, dopant, ppb, atomic percent | Atomic Percent Dopant |
| transistor, collector current | Transistor Collector Current |
| photon, wavelength, LED | Photon Wavelength |

Step 2 — Neo4j Execution

The Cypher runs against the graph. Results are cleaned (None values removed, duplicates collapsed) and capped at 10 rows for the answer prompt.
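The cleanup described above could be sketched like this, assuming each result row is a dict of column name to value. The helper name, the choice to drop rows that become empty, and applying the cap after deduplication are assumptions.

```python
def clean_rows(rows: list[dict], limit: int = 10) -> list[dict]:
    """Drop None values, collapse duplicate rows, and cap the result at `limit` rows."""
    seen = set()
    cleaned = []
    for row in rows:
        compact = {k: v for k, v in row.items() if v is not None}
        if not compact:
            continue  # assumption: a row with no remaining values is not useful
        # repr() makes unhashable values (lists, dicts) usable in the dedup key.
        key = tuple(sorted((k, repr(v)) for k, v in compact.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(compact)
    return cleaned[:limit]
```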

Step 3 — TextbookKB Fallback (Parallel)

FAISS semantic search runs in parallel with Neo4j — not after it. If Neo4j returns 0 results, TextbookKB provides the answer context. The _detect_question_type() method classifies the question to guide answer synthesis:

| Question type | Trigger keywords |
|---|---|
| semiconductor_calculation | band gap, mobility, carrier, silicon, doping, ppb, probability... |
| properties | property, characteristic, feature |
| applications | application, used in, use case |
| methods | method, synthesize, fabricate, deposit |
| general | (fallback) |
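Step 3 can be sketched in two parts: a keyword classifier and a parallel retrieval call. The classifier uses only the trigger keywords visible in the table above (the semiconductor_calculation list is truncated in the source), and it assumes substring matching and table order as priority. `retrieve_context`, `run_graph_query`, and `search_textbook` are hypothetical names for illustration.

```python
import asyncio

# Visible trigger keywords only; the first list is truncated ("...") in the source.
QUESTION_TYPES = [
    ("semiconductor_calculation",
     ["band gap", "mobility", "carrier", "silicon", "doping", "ppb", "probability"]),
    ("properties", ["property", "characteristic", "feature"]),
    ("applications", ["application", "used in", "use case"]),
    ("methods", ["method", "synthesize", "fabricate", "deposit"]),
]


def detect_question_type(question: str) -> str:
    """Return the first type whose trigger keyword occurs in the question."""
    text = question.lower()
    for question_type, triggers in QUESTION_TYPES:
        if any(trigger in text for trigger in triggers):
            return question_type
    return "general"


async def retrieve_context(question, run_graph_query, search_textbook):
    """Run graph and textbook retrieval concurrently, then flag the fallback case."""
    graph_rows, passages = await asyncio.gather(
        run_graph_query(question),
        search_textbook(question),
    )
    return {
        "question_type": detect_question_type(question),
        "graph_rows": graph_rows,
        "textbook_passages": passages,
        "textbook_only": len(graph_rows) == 0,  # graph returned 0 results
    }
```

Starting both lookups together means the textbook passages are already available when the graph returns nothing, so the fallback adds no extra latency.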

Step 4 — Answer Synthesis

Model: gpt-4o-mini

The ANSWER_PROMPT enforces:

- English-only responses
- Use BOTH graph results AND textbook knowledge
- For calculation questions: extract formula → compute step by step → give specific numerical answer
- 3–5 sentence maximum
- Never invent facts


API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /search/start | POST | Start ingestion pipeline. Body: {query, max_per_source}. Returns job_id. |
| /search/progress/{job_id} | GET | Poll progress. Includes per-paper status, log, entity/relation counts. |
| /graph/ask | POST | GraphRAG Q&A. Body: {question}. Returns answer + Cypher + raw results. |
| /graph/stats | GET | Neo4j node/relation counts by type. |
| /graph/nodes | GET | List up to 200 nodes with name and type. |
| / | GET | Health check. |