# RAG

(26/02/18) Although I still primarily use the Lunr XML search feature, I occasionally use the AI helper, and I want to get rid of the remote API calls. I'm adding this page to document the process.

## Overview

Build a fully local RAG system that ingests ~313 markdown files from the Docusaurus site, stores them in ChromaDB with Ollama embeddings, and provides a web UI for semantic search and Q&A.

## Stack

- Python 3 with FastAPI for the web server
- Ollama with `nomic-embed-text` for embeddings
- ChromaDB for vector storage (persisted to disk)
- Simple HTML/JS frontend served by FastAPI

## Directory Structure

```sh
daw_til/
└── rag/
    ├── ingest.py          # Markdown parser + ChromaDB ingestion
    ├── server.py          # FastAPI web server + query endpoint
    ├── requirements.txt   # Python dependencies
    ├── templates/
    │   └── index.html     # Web UI (search box + results)
    └── chroma_db/         # ChromaDB persistence (gitignored)
```
+
|
|
|
|
|
+## Implementation Steps
|
|
|
|
|
+1. Create rag/requirements.txt
|
|
|
|
|
+```sh
|
|
|
|
|
+chromadb
|
|
|
|
|
+fastapi
|
|
|
|
|
+uvicorn[standard]
|
|
|
|
|
+ollama
|
|
|
|
|
+python-frontmatter
|
|
|
|
|
+```
|
|
|
|
|
+
|
|
|
|
|
+2. Create ```rag/ingest.py``` โ Markdown Ingestion Pipeline
|
|
|
|
|
+- Walk docs/, notes/, lists/, posts/ directories
|
|
|
|
|
+- Parse each .md file:
|
|
|
|
|
+ - Extract YAML frontmatter (posts have title, slug, description, tags)
|
|
|
|
|
+ - For docs/notes/lists: derive title from first # heading
|
|
|
|
|
+ - Extract date from post filenames (YYYY-MM-DD pattern)
|
|
|
|
|
+ - Categorize by directory: docs, notes, lists, posts
|
|
|
|
|
+- Chunking strategy:
|
|
|
|
|
+ - Split on markdown headers (##, ###) to keep semantic sections intact
|
|
|
|
|
+ - For sections exceeding ~1000 chars, further split on paragraphs
|
|
|
|
|
+ - Each chunk gets metadata: source (file path), category, section (header path), title, tags (if post)
|
|
|
|
|
+- Generate embeddings via Ollama (nomic-embed-text)
|
|
|
|
|
+- Upsert into ChromaDB collection with metadata
|
|
|
|
|
+- Print progress (file count, chunk count, timing)
|
|
|
|
|
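The chunking strategy can be sketched roughly like this; it is a standalone illustration, not the actual `ingest.py` code, with `max_chars` mirroring the ~1000-char limit:

```python
import re

def chunk_markdown(text, max_chars=1000):
    """Split markdown into chunks on ##/### headers, then split
    oversized sections further on paragraph boundaries."""
    # Lookahead split keeps each header attached to its own section
    sections = re.split(r"(?m)^(?=#{2,3} )", text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        # Oversized section: accumulate paragraphs up to max_chars
        current = ""
        for para in section.split("\n\n"):
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append(current)
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        if current:
            chunks.append(current)
    return chunks
```

Splitting on headers first means a chunk's text stays aligned with the `section` breadcrumb stored in its metadata.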
+3. Create ```rag/server.py``` โ FastAPI Query Server
|
|
|
|
|
+- ```GET /``` โ serve the HTML UI
|
|
|
|
|
+- ```POST /query``` โ accept a search query, embed it with Ollama, query ChromaDB for top-k results (default 8), return JSON with chunks + metadata + distances
|
|
|
|
|
+- ```GET /stats``` โ return collection stats (doc count, chunk count)
|
|
|
|
|
+- CORS enabled for local dev
|
|
|
|
|
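The core of `server.py` can be sketched as below. The `chroma_db` path matches the directory structure above, but the collection name `docs` is an assumption, and the UI/CORS wiring is omitted:

```python
import chromadb
import ollama
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection("docs")  # name is illustrative

class Query(BaseModel):
    q: str
    k: int = 8  # top-k results, default 8

@app.post("/query")
def query(body: Query):
    # Embed the query with the same model used at ingest time,
    # then run a vector search against the stored chunks
    emb = ollama.embeddings(model="nomic-embed-text", prompt=body.q)["embedding"]
    res = collection.query(query_embeddings=[emb], n_results=body.k)
    return {
        "documents": res["documents"][0],
        "metadatas": res["metadatas"][0],
        "distances": res["distances"][0],
    }

@app.get("/stats")
def stats():
    return {"chunks": collection.count()}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="127.0.0.1", port=8808)
</imports>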
+4. Create ```rag/templates/index.html``` โ Web UI
|
|
|
|
|
+- Clean, minimal search interface
|
|
|
|
|
+- Search box with submit button
|
|
|
|
|
+- Results displayed as cards showing:
|
|
|
|
|
+ - Matched text snippet
|
|
|
|
|
+ - Source file path (clickable link to Docusaurus page)
|
|
|
|
|
+ - Category badge (docs/notes/lists/posts)
|
|
|
|
|
+ - Relevance score
|
|
|
|
|
+- Loading spinner during search
|
|
|
|
|
+5. Add ```rag/``` to ```.gitignore```
|
|
|
|
|
+Add ```rag/chroma_db/``` to gitignore to keep vector data out of git
|
|
|
|
|
+
|
|
|
|
|
+## Metadata Schema (ChromaDB)
|
|
|
|
|
+Each chunk stored with:
|
|
|
|
|
+```js
|
|
|
|
|
+{
|
|
|
|
|
+ "source": "docs/lang/JavaScript.md", # relative file path
|
|
|
|
|
+ "category": "docs", # docs|notes|lists|posts
|
|
|
|
|
+ "title": "JavaScript", # doc/post title
|
|
|
|
|
+ "section": "Arrays > Map and Filter", # header breadcrumb
|
|
|
|
|
+ "tags": "tech", # comma-separated (posts only)
|
|
|
|
|
+ "date": "2025-03-03", # posts only
|
|
|
|
|
+ "url": "/til/docs/lang/JavaScript" # reconstructed Docusaurus URL
|
|
|
|
|
+}
|
|
|
|
|
+```
|
|
|
|
|
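The path-derived fields can be reconstructed along these lines; the `/til` base path comes from the example above, while the exact Docusaurus routing for posts is an assumption:

```python
import re
from pathlib import PurePosixPath

def chunk_metadata(source: str) -> dict:
    """Derive category, date, and a Docusaurus URL from a relative file path."""
    path = PurePosixPath(source)
    category = path.parts[0]  # docs | notes | lists | posts
    meta = {"source": source, "category": category}
    # Posts encode their date in the filename: YYYY-MM-DD-slug.md
    m = re.match(r"(\d{4}-\d{2}-\d{2})-", path.stem)
    if category == "posts" and m:
        meta["date"] = m.group(1)
    # Docusaurus URL: drop the .md extension, keep the directory structure
    meta["url"] = f"/til/{path.with_suffix('')}"
    return meta
```

Ingestion would merge this dict with the frontmatter-derived fields (`title`, `tags`) and the `section` breadcrumb before upserting.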
+
|
|
|
|
|
+## Prerequisites
|
|
|
|
|
+- Ollama installed and running with nomic-embed-text model pulled
|
|
|
|
|
+- Python 3.10+
|
|
|
|
|
+
|
|
|
|
|
+## Usage
|
|
|
|
|
+```sh
|
|
|
|
|
+cd rag
|
|
|
|
|
+pip install -r requirements.txt
|
|
|
|
|
+ollama pull nomic-embed-text
|
|
|
|
|
+
|
|
|
|
|
+# Ingest all markdown files (run once, re-run to rebuild)
|
|
|
|
|
+python ingest.py
|
|
|
|
|
+
|
|
|
|
|
+# Start the web UI
|
|
|
|
|
+python server.py
|
|
|
|
|
+# Open http://localhost:8808
|
|
|
|
|
+```
|
|
|
|
|
+
|
|
|
|
|
+## Verification
|
|
|
|
|
+1. Run ```python ingest.py``` โ should report ~300+ files processed and ~1000+ chunks created
|
|
|
|
|
+2. Run ```python server.py``` โ should start on port 8808
|
|
|
|
|
+3. Open browser to ```http://localhost:8808```
|
|
|
|
|
+4. Search for "PostgreSQL" โ should return relevant chunks from ```docs/db/PostgreSQL.md```
|
|
|
|
|
+5. Search for "Docker" โ should return chunks from docs/server/Docker.md
|
|
|
|
|
+6. Search for a conceptual query like "how to set up SSL certificates" โ should return relevant Let's Encrypt / Apache / Nginx docs
|