Installation
System Requirements
| Component | Requirement | Notes |
|---|---|---|
| Python | 3.10+ | |
| Neo4j | 5.x | Local Docker or AuraDB cloud |
| RAM | 4 GB+ | FAISS index loads ~200 MB into memory |
| Disk | 2 GB+ | PDF textbooks + FAISS index |
| OpenAI API | Required | GPT-4o-mini + GPT-4o |
| Groq / Cerebras | Optional | Benchmark comparison only |
Python Dependencies
pip install openai \
neo4j \
faiss-cpu \
sentence-transformers \
fastapi \
uvicorn \
python-dotenv \
httpx \
aiohttp \
pymupdf \
pypdf
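After installing, a quick way to confirm that everything is importable is sketched below. Several packages are imported under a different name than the one passed to pip (for example `faiss-cpu` is imported as `faiss`, `python-dotenv` as `dotenv`, and `pymupdf` as `fitz`). The helper name `missing_packages` is hypothetical and not part of this project.

```python
from importlib.util import find_spec

# pip distribution name -> top-level import name.
# The two differ for several of the packages listed above.
IMPORT_NAMES = {
    "openai": "openai",
    "neo4j": "neo4j",
    "faiss-cpu": "faiss",
    "sentence-transformers": "sentence_transformers",
    "fastapi": "fastapi",
    "uvicorn": "uvicorn",
    "python-dotenv": "dotenv",
    "httpx": "httpx",
    "aiohttp": "aiohttp",
    "pymupdf": "fitz",
    "pypdf": "pypdf",
}


def missing_packages(import_names=IMPORT_NAMES):
    """Return the pip names whose import module cannot be located."""
    return sorted(
        pip_name
        for pip_name, module in import_names.items()
        if find_spec(module) is None
    )


# Usage: an empty list means every dependency can be imported.
# print(missing_packages())
```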
For the benchmark evaluation only:
Environment Variables
# ── OpenAI (required for extraction, TTD-DR, and GraphRAG) ──
OPENAI_API_KEY=sk-...
# ── Neo4j ──
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_password
# ── Unpaywall (for PDF enrichment) ──
# Uses your email as identifier — required by Unpaywall ToS
UNPAYWALL_EMAIL=your@email.com
# ── Optional — benchmark only ──
GROQ_API_KEY=gsk_...
CEREBRAS_API_KEY=csk_...
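A minimal sketch of a startup check for these variables follows. The split into required and optional keys is an assumption based on the comments above (only the OpenAI key is explicitly marked required), and `check_env` is a hypothetical helper, not a function from this codebase.

```python
import os

# Assumption: the OpenAI and Neo4j settings are needed for normal operation;
# the remaining keys only enable PDF enrichment or benchmark runs.
REQUIRED = ("OPENAI_API_KEY", "NEO4J_URI", "NEO4J_USER", "NEO4J_PASSWORD")
OPTIONAL = ("UNPAYWALL_EMAIL", "GROQ_API_KEY", "CEREBRAS_API_KEY")


def check_env(env=None):
    """Return (missing_required, unset_optional) for the given mapping."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED if not env.get(key)]
    unset_optional = [key for key in OPTIONAL if not env.get(key)]
    return missing, unset_optional


# Usage, with python-dotenv reading a local .env file first:
#   from dotenv import load_dotenv
#   load_dotenv()
#   missing, unset_optional = check_env()
#   if missing:
#       raise SystemExit(f"Missing required variables: {missing}")
```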
Neo4j Setup
Browser UI: http://localhost:7474
Download from neo4j.com/download. Create a project, add a local DBMS (version 5.x), set a password, start it.
For AuraDB (cloud):
- Sign up at neo4j.com/cloud
- Create a free instance
- Copy the Bolt URL (neo4j+s://...) and set it as NEO4J_URI
- Set NEO4J_USER=neo4j and the password from the instance details
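Whichever setup is used, the connection can be verified with the official `neo4j` Python driver. The sketch below reads the variables from the Environment Variables section; the fallback defaults and the helper names `neo4j_settings` and `check_connection` are assumptions for illustration.

```python
import os


def neo4j_settings(env=None):
    """Read connection settings, falling back to the local defaults shown above."""
    env = os.environ if env is None else env
    return (
        env.get("NEO4J_URI", "bolt://localhost:7687"),
        env.get("NEO4J_USER", "neo4j"),
        env.get("NEO4J_PASSWORD", ""),
    )


def check_connection(env=None):
    """Open a driver and verify that the server is reachable and accepts the login."""
    from neo4j import GraphDatabase  # imported lazily so the settings helper has no dependency

    uri, user, password = neo4j_settings(env)
    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        driver.verify_connectivity()  # raises if the server is unreachable or auth fails
    return uri


# Usage:
#   print("Connected to", check_connection())
```

The same code works for a local instance (`bolt://`) and for AuraDB (`neo4j+s://`), since the driver selects the transport from the URI scheme.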
TextbookKB Index
On first startup, if data/textbook_index/textbook.index is missing, the system builds it automatically. You can also build manually:
What happens:
- Reads all PDFs from data/year{1..4}_*/
- Extracts text page by page with pypdf
- Chunks text (800 chars, 100 overlap)
- Embeds with all-MiniLM-L6-v2 (batch size 256)
- Builds faiss.IndexFlatL2(384)
- Saves data/textbook_index/textbook.index and data/textbook_index/chunks.json
Time: ~2–5 minutes for 9 textbooks (~29,500 chunks)
Index size on disk: ~45 MB (FAISS binary + JSON metadata)
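The chunking and indexing steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual implementation: the function names `chunk_text` and `build_index` are hypothetical, and the exact chunk boundary handling in `textbook_kb.py` may differ.

```python
def chunk_text(text, size=800, overlap=100):
    """Split text into windows of `size` characters that share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


def build_index(chunks, model_name="all-MiniLM-L6-v2", batch_size=256):
    """Embed the chunks and add them to a flat L2 FAISS index."""
    # Heavy dependencies are imported lazily so chunk_text stays usable on its own.
    import faiss
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    vectors = model.encode(chunks, batch_size=batch_size, convert_to_numpy=True)
    vectors = np.asarray(vectors, dtype="float32")  # FAISS expects float32
    index = faiss.IndexFlatL2(vectors.shape[1])  # 384 dimensions for this model
    index.add(vectors)
    return index


# Usage:
#   index = build_index(chunk_text(page_text))
#   faiss.write_index(index, "data/textbook_index/textbook.index")
```

With a step of 700 characters (800 minus the 100-character overlap), each chunk repeats the tail of the previous one, so a sentence cut at a boundary still appears whole in one of the two neighbouring chunks.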
Adding more textbooks
Place new PDFs in the appropriate data/year*/ folder, delete data/textbook_index/, and restart. The index rebuilds automatically.
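For scripted setups, deleting the index directory can be done from Python as well. A small sketch, assuming the default location named above; `clear_index` is a hypothetical helper and not part of the project.

```python
import shutil
from pathlib import Path


def clear_index(index_dir="data/textbook_index"):
    """Delete the index directory so the next startup rebuilds it.

    Returns True if a directory was removed, False if there was nothing to delete.
    """
    path = Path(index_dir)
    if path.is_dir():
        shutil.rmtree(path)
        return True
    return False


# Usage: run once after copying new PDFs into data/year*/, then restart the app.
#   clear_index()
```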
Project Structure
scientific-literature-knowledge-graph/
│
├── main.py (api.py) # FastAPI entry point — pipeline orchestration
├── eval_fillblank.py # Benchmark evaluator (8 models, 200 questions)
├── benchmark_fillblank_200.json # The 200-question benchmark dataset
│
├── agents/
│ ├── extraction_agent.py # GPT-4o-mini entity/relation extraction
│ ├── verification_agent.py # GPT-4o-mini semantic relation verifier
│ ├── ttd_dr.py # TTD-DR claim verification (FAISS + LLM)
│ ├── graph_reasoning_agent.py # GraphRAG query pipeline (GPT-4o)
│ ├── textbook_kb.py # FAISS TextbookKB + search API
│ ├── schema_validator.py # 7-rule Critic Layer (pure Python)
│ └── query_expansion_agent.py # (unused in v2 — query goes direct)
│
├── critical_layer/
│ ├── textbook_kb.py # Mirror of agents/textbook_kb.py
│ └── schema_validator.py # Mirror of agents/schema_validator.py
│
├── graph/
│ └── graph_builder.py # Neo4j MERGE write layer
│
├── retrieval/
│ ├── retrieval_manager.py # Parallel multi-source search + Unpaywall
│ ├── arxiv_source.py
│ ├── openalex_source.py
│ └── crossref_source.py
│
└── data/
  ├── year1_temel/ # Introductory textbook PDFs
  ├── year2_orta/ # Intermediate textbook PDFs
  ├── year3_ileri/ # Advanced textbook PDFs
  ├── year4_uzman/ # Expert textbook PDFs
  └── textbook_index/
    ├── textbook.index # FAISS binary index
    └── chunks.json # Chunk metadata (text, source, page, level)
Neo4j Schema
After ingestion, the graph contains these node and relationship types:
// Node types
(:Material) // e.g. GaN, MgB2, copper
(:Property) // e.g. superconductivity, band gap
(:Value) // e.g. "3.4 eV", "39 K"
(:Application) // e.g. single-photon detection
(:Method) // e.g. sputtering, CVD
(:Element) // e.g. Nb, Si (periodic table symbols)
(:Formula) // e.g. NbN, MgB2 (chemical formula)
(:SemiconductorConcept) // formulas + worked examples
(:Paper) // source paper with abstract
// Relation types
(Material)-[:HAS_PROPERTY]->(Property)
(Material)-[:HAS_VALUE]->(Value)
(Material)-[:USED_IN]->(Application)
(Material)-[:SYNTHESIZED_BY]->(Method)
(Material)-[:HAS_ELEMENT]->(Element)
(Material)-[:HAS_FORMULA]->(Formula)
(Material)-[:USES_CONCEPT]->(SemiconductorConcept)
(Paper)-[:PAPER_MENTIONS]->(any node)
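To illustrate how a MERGE-based write against this schema might look, here is a sketch that builds a parameterized Cypher statement for one Material relation. It is not the code in `graph_builder.py`; the helper name `merge_relation_query` and the use of `name` as the key property on every target label are assumptions.

```python
# Relation type -> target node label, taken from the schema above.
RELATION_TARGETS = {
    "HAS_PROPERTY": "Property",
    "HAS_VALUE": "Value",
    "USED_IN": "Application",
    "SYNTHESIZED_BY": "Method",
    "HAS_ELEMENT": "Element",
    "HAS_FORMULA": "Formula",
    "USES_CONCEPT": "SemiconductorConcept",
}


def merge_relation_query(rel_type):
    """Build an idempotent Cypher statement linking a Material to a target node."""
    target_label = RELATION_TARGETS.get(rel_type)
    if target_label is None:
        raise ValueError(f"Unknown relation type: {rel_type!r}")
    # Labels and relation types cannot be passed as Cypher parameters, so they are
    # interpolated only after the whitelist check above; values stay parameterized.
    return (
        "MERGE (m:Material {name: $material}) "
        f"MERGE (t:{target_label} {{name: $target}}) "
        f"MERGE (m)-[:{rel_type}]->(t)"
    )


# Usage with an open neo4j session:
#   session.run(merge_relation_query("HAS_PROPERTY"), material="GaN", target="band gap")
```

Because every clause is a MERGE, running the same statement twice leaves a single pair of nodes and a single relationship.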
Unique constraints are created automatically on startup for: Material.name, Property.name, Application.name, Method.name, Element.name, Formula.name.
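As a rough sketch, constraints of this kind can be expressed in Neo4j 5 Cypher and generated from the label list. The constraint names below are hypothetical, and the statements the application actually issues may be worded differently.

```python
# Labels that receive a uniqueness constraint on `name`, as listed above.
CONSTRAINED_LABELS = ("Material", "Property", "Application", "Method", "Element", "Formula")


def constraint_statements(labels=CONSTRAINED_LABELS, prop="name"):
    """Return one idempotent CREATE CONSTRAINT statement per label (Neo4j 5 syntax)."""
    return [
        f"CREATE CONSTRAINT {label.lower()}_{prop}_unique IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
        for label in labels
    ]


# Usage with an open neo4j session:
#   for statement in constraint_statements():
#       session.run(statement)
```

The `IF NOT EXISTS` clause makes the statements safe to run on every startup.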