windhamdavid 6 days ago
commit 4d6d7d1776
3 changed files with 108 additions and 0 deletions
  1. notes/index.md (+1 -0)
  2. notes/work/projects/ai-rag.md (+106 -0)
  3. sidebarsnotes.js (+1 -0)

+ 1 - 0
notes/index.md

@@ -7,6 +7,7 @@ slug: /
 
 
 ## Log
 
 
+- 26/02/18 - 🤖 [projects/RAG](/notes/work/projects/ai-rag)
 - 26/01/02 - 🎨 [art/paint](/notes/art/paint)
 - 25/08/16 - 🚰 [house/bath](/notes/house/bath)
 - 25/04/26 - 🎹 [music/music](/notes/music/)

+ 106 - 0
notes/work/projects/ai-rag.md

@@ -0,0 +1,106 @@
+# RAG
+
+(26/02/18) Although I still primarily use the Lunr XML search feature, I occasionally use the AI helper, and I want to get rid of its remote API calls. I'm adding this page to document the process.
+
+## Overview
+Build a fully local RAG system that ingests ~313 markdown files from the Docusaurus site, stores them in ChromaDB with Ollama embeddings, and provides a web UI for semantic search and Q&A.
+
+## Stack
+- Python 3 with FastAPI for the web server
+- Ollama with nomic-embed-text for embeddings
+- ChromaDB for vector storage (persisted to disk)
+- Simple HTML/JS frontend served by FastAPI
+
+## Directory Structure
+
+```sh
+daw_til/
+└── rag/
+    ├── ingest.py          # Markdown parser + ChromaDB ingestion
+    ├── server.py          # FastAPI web server + query endpoint
+    ├── requirements.txt   # Python dependencies
+    ├── templates/
+    │   └── index.html     # Web UI (search box + results)
+    └── chroma_db/         # ChromaDB persistence (gitignored)
+```
+
+## Implementation Steps
+1. Create `rag/requirements.txt`
+```sh
+chromadb
+fastapi
+uvicorn[standard]
+ollama
+python-frontmatter
+```
+
+2. Create `rag/ingest.py` – Markdown Ingestion Pipeline (sketched below)
+- Walk docs/, notes/, lists/, posts/ directories
+- Parse each .md file:
+  - Extract YAML frontmatter (posts have title, slug, description, tags)
+  - For docs/notes/lists: derive title from first # heading
+  - Extract date from post filenames (YYYY-MM-DD pattern)
+  - Categorize by directory: docs, notes, lists, posts
+- Chunking strategy:
+  - Split on markdown headers (##, ###) to keep semantic sections intact
+  - For sections exceeding ~1000 chars, further split on paragraphs
+  - Each chunk gets metadata: source (file path), category, section (header path), title, tags (if post)
+- Generate embeddings via Ollama (nomic-embed-text)
+- Upsert into ChromaDB collection with metadata
+- Print progress (file count, chunk count, timing)
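+
+A minimal sketch of the ingestion core, assuming the `chromadb`, `ollama`, and `python-frontmatter` packages from `requirements.txt`; the collection name, glob, and metadata shown are placeholders, not the final layout:
+
+```py
+import pathlib
+import re
+
+import chromadb
+import frontmatter  # python-frontmatter
+import ollama
+
+MAX_CHARS = 1000  # sections longer than this get re-split on paragraphs
+
+def chunk_markdown(text: str):
+    """Split on ## / ### headers, then on paragraphs for oversized sections."""
+    for section in re.split(r"(?m)^(?=#{2,3} )", text):
+        if len(section) <= MAX_CHARS:
+            if section.strip():
+                yield section.strip()
+        else:
+            for para in section.split("\n\n"):
+                if para.strip():
+                    yield para.strip()
+
+client = chromadb.PersistentClient(path="chroma_db")
+collection = client.get_or_create_collection("til")
+
+for path in pathlib.Path("..").glob("notes/**/*.md"):  # repeat for docs/, lists/, posts/
+    post = frontmatter.load(path)
+    for i, chunk in enumerate(chunk_markdown(post.content)):
+        emb = ollama.embeddings(model="nomic-embed-text", prompt=chunk)["embedding"]
+        collection.upsert(
+            ids=[f"{path}:{i}"],
+            embeddings=[emb],
+            documents=[chunk],
+            metadatas=[{"source": str(path), "category": "notes",
+                        "title": post.get("title", path.stem)}],
+        )
+```
+
+The header-breadcrumb `section` field and the post-date extraction are omitted here for brevity; they would be added to the metadata dict per chunk.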
+3. Create `rag/server.py` – FastAPI Query Server (see the sketch below)
+- `GET /` – serve the HTML UI
+- `POST /query` – accept a search query, embed it with Ollama, query ChromaDB for the top-k results (default 8), and return JSON with chunks, metadata, and distances
+- `GET /stats` – return collection stats (doc count, chunk count)
+- CORS enabled for local dev
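+
+A sketch of the server under the same assumptions (same collection name and embedding model as the ingest sketch); the response shape is illustrative, not a fixed API:
+
+```py
+import chromadb
+import ollama
+import uvicorn
+from fastapi import FastAPI
+from fastapi.middleware.cors import CORSMiddleware
+from pydantic import BaseModel
+
+app = FastAPI()
+app.add_middleware(CORSMiddleware, allow_origins=["*"])  # local dev only
+
+collection = chromadb.PersistentClient(path="chroma_db").get_or_create_collection("til")
+
+class Query(BaseModel):
+    q: str
+    k: int = 8  # top-k results
+
+@app.post("/query")
+def query(body: Query):
+    emb = ollama.embeddings(model="nomic-embed-text", prompt=body.q)["embedding"]
+    res = collection.query(query_embeddings=[emb], n_results=body.k)
+    # Chroma returns lists-of-lists (one inner list per query embedding)
+    return {"documents": res["documents"][0],
+            "metadatas": res["metadatas"][0],
+            "distances": res["distances"][0]}
+
+@app.get("/stats")
+def stats():
+    return {"chunks": collection.count()}
+
+if __name__ == "__main__":
+    uvicorn.run(app, host="127.0.0.1", port=8808)
+```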
+4. Create `rag/templates/index.html` – Web UI
+- Clean, minimal search interface
+- Search box with submit button
+- Results displayed as cards showing:
+  - Matched text snippet
+  - Source file path (clickable link to Docusaurus page)
+  - Category badge (docs/notes/lists/posts)
+  - Relevance score
+- Loading spinner during search
+5. Add `rag/chroma_db/` to `.gitignore`
+Keeps the vector data out of git
+
+## Metadata Schema (ChromaDB)
+Each chunk is stored with:
+```py
+{
+    "source": "docs/lang/JavaScript.md",  # relative file path
+    "category": "docs",                    # docs|notes|lists|posts
+    "title": "JavaScript",                 # doc/post title
+    "section": "Arrays > Map and Filter",  # header breadcrumb
+    "tags": "tech",                        # comma-separated (posts only)
+    "date": "2025-03-03",                  # posts only
+    "url": "/til/docs/lang/JavaScript"     # reconstructed Docusaurus URL
+}
+```
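+
+One payoff of this schema: category scoping falls out of Chroma's `where` filter. A hedged example against the same assumed collection name:
+
+```py
+import chromadb
+import ollama
+
+collection = chromadb.PersistentClient(path="chroma_db").get_or_create_collection("til")
+emb = ollama.embeddings(model="nomic-embed-text", prompt="ssl certificates")["embedding"]
+
+# `where` matches against the metadata fields above; here, posts only
+res = collection.query(query_embeddings=[emb], n_results=8,
+                       where={"category": "posts"})
+for meta, dist in zip(res["metadatas"][0], res["distances"][0]):
+    print(f"{dist:.3f}  {meta['url']}  ({meta['section']})")
+```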
+
+## Prerequisites
+- Ollama installed and running, with the `nomic-embed-text` model pulled
+- Python 3.10+
+
+## Usage
+```sh
+cd rag
+pip install -r requirements.txt
+ollama pull nomic-embed-text
+
+# Ingest all markdown files (run once, re-run to rebuild)
+python ingest.py
+
+# Start the web UI
+python server.py
+# Open http://localhost:8808
+```
+
+## Verification
+1. Run `python ingest.py` – it should report roughly 300 files processed and 1000+ chunks created
+2. Run `python server.py` – it should start on port 8808
+3. Open a browser to `http://localhost:8808`
+4. Search for "PostgreSQL" – should return relevant chunks from `docs/db/PostgreSQL.md`
+5. Search for "Docker" – should return chunks from `docs/server/Docker.md`
+6. Search for a conceptual query like "how to set up SSL certificates" – should return relevant Let's Encrypt / Apache / Nginx docs
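+
+The same checks can be scripted; a stdlib-only smoke test against the endpoints assumed in the server sketch above:
+
+```py
+import json
+import urllib.request
+
+BASE = "http://localhost:8808"
+
+def post_query(q: str, k: int = 8):
+    req = urllib.request.Request(
+        f"{BASE}/query",
+        data=json.dumps({"q": q, "k": k}).encode(),
+        headers={"Content-Type": "application/json"},
+    )
+    with urllib.request.urlopen(req) as resp:
+        return json.load(resp)
+
+print(json.load(urllib.request.urlopen(f"{BASE}/stats")))
+for q in ["PostgreSQL", "Docker", "how to set up SSL certificates"]:
+    hits = post_query(q)
+    print(q, "->", [m["source"] for m in hits["metadatas"]])
+```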

+ 1 - 0
sidebarsnotes.js

@@ -164,6 +164,7 @@ module.exports = {
           },
           items: [
             'work/projects/ai',
+            'work/projects/ai-rag',
             'work/projects/game',
             'work/projects/gzet',
             'work/projects/ham',