🎉 Initial commit!

2026-05-02 10:22:56 -07:00 · 2026-05-02 10:22:56 -07:00 · f12cc7aaf8
commit f12cc7aaf8
6 changed files with 541 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@ -0,0 +1,7 @@
+.env*
+.editorconfig
+*.vim
+.venv/
+models/
+documents/
+*.sqlite
--- a/32
+++ b/32
@ -0,0 +1,32 @@
+The Clear BSD License
+
+Copyright (c) 2026 Brian Hayes
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted (subject to the limitations in the disclaimer
+below) provided that the following conditions are met:
+
+     * Redistributions of source code must retain the above copyright notice,
+     this list of conditions and the following disclaimer.
+
+     * Redistributions in binary form must reproduce the above copyright
+     notice, this list of conditions and the following disclaimer in the
+     documentation and/or other materials provided with the distribution.
+
+     * Neither the name of the copyright holder nor the names of its
+     contributors may be used to endorse or promote products derived from this
+     software without specific prior written permission.
+
+NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE GRANTED BY
+THIS LICENSE. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
+CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
+CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
+IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGE.
--- a/README.md
+++ b/README.md
@ -0,0 +1,130 @@
+# Minimal RAG PDF Reader
+
+## Introduction
+
+This repository contains a beginner implementation of a very basic
+Retrieval-Augmented Generation (RAG) pipeline for PDFs. It is meant as a simple
+exercise with RAG, but it also demonstrates my attempt at a minimal RAG
+implementation that runs locally without calling any external API at runtime.
+
+## Setup
+
+This document is mainly meant for personal use, so there will not be extensive
+explanation or instruction for how to set up this repository. Those familiar
+with git and Python should be well versed in these procedures.
+
+**Cloning the repo:**
+
+```sh
+git clone <this_url> && \
+cd minimal_rag_pdf
+```
+
+**Starting the environment:**
+
+```sh
+python -m venv .venv && \
+source .venv/bin/activate
+```
+
+**Upgrading pip:**
+
+```sh
+python -m pip install --upgrade pip
+```
+
+**Fixing a CUDA/GCC version mismatch:**
+
+Building `llama-cpp-python` with CUDA support can fail when the system GCC
+version is not one the installed CUDA toolkit supports. Please refer to their
+[documentation](https://llama-cpp-python.readthedocs.io/en/latest/).
+
+The following environment variables and installation flags are what got it
+working on my personal machine. Note that, depending on your system, the
+compile can take a while:
+
+```sh
+CC=gcc-14 CXX=g++-14 \
+CUDACXX=/opt/cuda/bin/nvcc \
+CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/gcc-14" \
+python -m pip install llama-cpp-python --no-cache-dir --force-reinstall
+```
+
+**Installing requirements:**
+
+```sh
+python -m pip install -r requirements.txt
+```
+
+**Environment variables:**
+
+```sh
+cp env.sample .env
+```
+
+Note that if you don't use the exact same LLM, embedding model, and PDF that I
+used, this application will not work until you change the corresponding values
+in the `.env` file.
+
+## Downloading the models
+
+You can use more powerful models than the ones I used if you so choose, but if
+you just want to run what I tried, the instructions are below. Please note
+that I have a very low-end GPU and a low-end CPU, so I could only use LLMs with
+a small parameter count.
+
+Head over to [HuggingFace](https://huggingface.co/) and download the
+[bartowski/Qwen2.5-Coder-7B-Instruct-abliterated-GGUF](https://huggingface.co/bartowski/Qwen2.5-Coder-7B-Instruct-abliterated-GGUF/blob/main/Qwen2.5-Coder-7B-Instruct-abliterated-Q4_K_L.gguf)
+LLM model, and the
+[CompendiumLabs/bge-small-en-v1.5-q8_0](https://huggingface.co/CompendiumLabs/bge-small-en-v1.5-gguf/blob/main/bge-small-en-v1.5-q8_0.gguf)
+embedding model.
+
+Note that these models should be placed in the `models` folder. If it doesn't
+exist, go ahead and make it:
+
+```sh
+mkdir models
+```
+
+And if you used a different LLM model and/or embedding model, make sure to
+change the name(s) in the `.env` file.
+
+## Finding PDFs
+
+I made this script just as a novelty, and currently it only reads a single PDF
+as data for the RAG. To replicate exactly what I did, use the Linux Essentials
+Study Guide from
+[LPI](https://learning.lpi.org/en/learning-materials/010-160/). Any PDF that you
+want to use should be placed in the `documents` folder. Again, if it doesn't
+exist, go ahead and make it:
+
+```sh
+mkdir documents
+```
+
+And if you used a different PDF document, make sure to change the name in the
+`.env` file.
+
+## Running the application
+
+```sh
+python main.py
+```
+
+The first time you run the application, it populates the SQLite DB with the
+chunk embeddings, so just let it do its thing. Once the database is populated,
+later runs skip that step and should start much faster (especially with GPU
+acceleration).
+
+## Notes/Disclaimer
+
+It's worth noting this is a very basic RAG application. It uses
+[sqlite-vec](https://www.sqlite.ai/sqlite-vector) instead of ChromaDB just as an
+exploration of alternatives. It doesn't use LangChain or LlamaIndex or
+bring in a bunch of APIs. It does use [llama.cpp](https://llama-cpp.com/)
+via [llama-cpp-python](https://llama-cpp-python.readthedocs.io/en/latest/) to
+load the LLM and embedding models, as well as
+[pypdf](https://pypdf.readthedocs.io/en/stable/) to read the PDF file.
+
+This project is not meant to be utilized in any commercial way, but is purely
+educational in purpose.
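The lookup over "vectorized embeddings" that the README mentions comes down to nearest-neighbor search: the question is embedded, and the stored chunks whose vectors are most similar to it are returned. The following is a minimal pure-Python sketch of that idea, assuming toy 3-dimensional vectors in place of real model output. `unit` and `top_k_by_cosine` are hypothetical helpers for illustration; this is not how sqlite-vec works internally.

```python
import math


def unit(vec):
    """Scale a vector to length 1 (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k_by_cosine(query_vec, stored, k=2):
    """Return the ids of the k stored vectors most similar to the query.

    `stored` maps a chunk id to its embedding. After normalization, the
    dot product of two vectors equals their cosine similarity.
    """
    q = unit(query_vec)
    scored = []
    for chunk_id, vec in stored.items():
        similarity = sum(a * b for a, b in zip(q, unit(vec)))
        scored.append((similarity, chunk_id))
    # Highest similarity first.
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]


# Toy data: hypothetical embeddings, not output of a real model.
stored = {
    0: [1.0, 0.0, 0.0],
    1: [0.9, 0.1, 0.0],
    2: [0.0, 0.0, 1.0],
}
print(top_k_by_cosine([1.0, 0.05, 0.0], stored))  # prints [0, 1]
```

A brute-force scan like this is fine for a single PDF's worth of chunks; a vector store mainly saves re-embedding the document on every run.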
--- a/env.sample
+++ b/env.sample
@ -0,0 +1,3 @@
+LLM_MODEL="Qwen2.5-Coder-7B-Instruct-abliterated-Q4_K_L.gguf"
+EMBEDDING_MODEL="bge-small-en-v1.5-q8_0.gguf"
+PDF_DOCUMENT="LPI-Learning-Material-010-160-en.pdf"
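Because `main.py` builds its file paths directly from these three variables, a missing or mistyped value only surfaces later as a model or PDF loading error. A small pre-flight check can make that failure clearer. This is a sketch under the folder layout the README describes (`models/` and `documents/`); `REQUIRED` and `check_config` are hypothetical names, not part of the commit.

```python
import os
from pathlib import Path

# Variable name -> folder its file is expected in (layout from the README).
REQUIRED = {
    "LLM_MODEL": "models",
    "EMBEDDING_MODEL": "models",
    "PDF_DOCUMENT": "documents",
}


def check_config(env, root="."):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name, folder in REQUIRED.items():
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
            continue
        path = Path(root) / folder / value
        if not path.is_file():
            problems.append(f"{name}: {path} does not exist")
    return problems


# Report anything that is missing in the current environment.
for problem in check_config(os.environ):
    print(problem)
```

In the application this would run right after `load_dotenv()`, passing `os.environ` as shown, so the values from `.env` are already loaded.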
--- a/main.py
+++ b/main.py
@ -0,0 +1,286 @@
+import os
+import re
+import sqlite3
+import sys
+
+import numpy as np
+import sqlite_vec
+from dotenv import load_dotenv
+from llama_cpp import Llama
+from pypdf import PdfReader
+from sqlite_vec import serialize_float32
+
+load_dotenv()
+
+DEBUG = False
+LLM_MODEL = os.getenv("LLM_MODEL")
+EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL")
+PDF_DOCUMENT = os.getenv("PDF_DOCUMENT")
+
+os.makedirs("./models", exist_ok=True)
+os.makedirs("./documents", exist_ok=True)
+
+conn = sqlite3.connect("./vector_db.sqlite")
+conn.enable_load_extension(True)
+sqlite_vec.load(conn)
+
+
+llm = Llama(
+    model_path=f"./models/{LLM_MODEL}",
+    n_gpu_layers=-1,  # Offload all layers to the GPU; set to 0 for CPU only
+    n_ctx=6096,  # Context window size in tokens
+    verbose=False,
+    log_level="error",
+    # seed = 1337, # Uncomment to set a specific seed
+    #  temperature=0.2,
+    #  repeat_penalty=1.15,
+    #  top_p=0.9,
+    #  top_k=40,
+)
+
+
+_embedding_model = None
+
+
+def get_embedding_model():
+    global _embedding_model
+    if _embedding_model is None:
+        print("Loading embedding model...")
+        _embedding_model = Llama(
+            model_path=f"./models/{EMBEDDING_MODEL}",
+            embedding=True,
+            verbose=False,
+            log_level="error",
+        )
+    return _embedding_model
+
+
+def init_db(dim: int):
+    conn.execute("PRAGMA journal_mode=WAL;")
+
+    conn.execute(f"""
+                 CREATE VIRTUAL TABLE IF NOT EXISTS chunks USING vec0(
+                     id INTEGER,
+                     embedding float[{dim}],
+                     text TEXT
+                 );
+                 """)
+
+
+def load_pdf(path):
+    reader = PdfReader(path)
+    text = ""
+    empty_pages = 0
+
+    for page in reader.pages:
+        page_text = page.extract_text()
+
+        if not page_text:
+            empty_pages += 1
+            continue
+
+        text += page_text + "\n"
+
+    print(f"Empty pages: {empty_pages}/{len(reader.pages)}")
+    print(f"Total extracted chars: {len(text)}")
+
+    return text
+
+
+def chunk_text(text, max_chars=1200):
+    paragraphs = text.split("\n")
+    chunks = []
+    current = ""
+
+    for p in paragraphs:
+        # Flush only non-blank buffers so no empty chunk is stored.
+        if current.strip() and len(current) + len(p) >= max_chars:
+            chunks.append(current.strip())
+            current = ""
+        current += p + "\n"
+
+    if current.strip():
+        chunks.append(current.strip())
+
+    return chunks
+
+
+def normalize(vec):
+    v = np.array(vec, dtype=np.float32)
+    return (v / max(float(np.linalg.norm(v)), 1e-12)).tolist()
+
+
+def embed_chunks(chunks, batch_size=1):
+    all_embeddings = []
+
+    model = get_embedding_model()
+
+    for i in range(0, len(chunks), batch_size):
+        batch = chunks[i : i + batch_size]
+
+        result = model.create_embedding(batch)
+        batch_embeddings = [normalize(e["embedding"]) for e in result["data"]]
+
+        all_embeddings.extend(batch_embeddings)
+
+        print(f"Embedded {i + len(batch)}/{len(chunks)}")
+
+    return all_embeddings
+
+
+def store_embeddings(chunks, embeddings):
+    dim = len(embeddings[0])
+    init_db(dim)
+
+    for i, (chunk, emb) in enumerate(zip(chunks, embeddings)):
+        conn.execute(
+            "INSERT INTO chunks (id, embedding, text) VALUES (?, ?, ?)",
+            (i, serialize_float32(emb), chunk),
+        )
+
+    conn.commit()
+
+
+def tokenize(text):
+    # TODO: put this in a config file or something
+    stop_words = {
+        "the",
+        "is",
+        "a",
+        "an",
+        "who",
+        "what",
+        "when",
+        "where",
+        "why",
+        "how",
+        "and",
+        "or",
+        "to",
+        "of",
+        "in",
+        "on",
+        "for",
+        "with",
+        "as",
+        "by",
+    }
+    words = set(re.findall(r"\b\w+\b", text.lower()))
+    return {w for w in words if w not in stop_words}
+
+
+def keyword_score(query, text):
+    q = tokenize(query)
+    t = tokenize(text)
+
+    if not q:
+        return 0
+
+    overlap = q & t
+
+    score = len(overlap) / len(q)
+
+    if query.lower() in text.lower():
+        score += 1.0
+
+    return score
+
+
+def query(question, top_k=3, initial_k=10):
+    model = get_embedding_model()
+
+    query_embedding = normalize(
+        model.create_embedding([question])["data"][0]["embedding"]
+    )
+
+    rows = conn.execute(
+        """
+        SELECT id, text FROM chunks WHERE embedding MATCH ? AND k = ?
+        """,
+        (serialize_float32(query_embedding), initial_k),  # type: ignore
+    ).fetchall()
+
+    scored = []
+    for cid, text in rows:
+        score = keyword_score(question, text)
+        scored.append((cid, text, score))
+
+    scored.sort(key=lambda x: x[2], reverse=True)
+
+    if DEBUG:
+        print("\n--- RETRIEVAL DEBUG ---")
+        for cid, text, s in scored[:5]:
+            print(f"[{cid}] score={s:.2f} | {text[:120]}\n")
+
+    return [(cid, text) for cid, text, _ in scored[:top_k]]
+
+
+def ask_llm(context_chunks, question):
+    context = "\n\n".join(f"[{cid}] {text}" for cid, text in context_chunks)
+
+    prompt = f"""You are a precise assistant.
+
+    Use ONLY the provided context to answer.
+
+    Cite sources at the end of your sentences using bracket IDs.
+
+    If unsure, say "I don't know based on the provided context."
+
+    Context:
+    {context}
+
+    Question:
+    {question}
+
+    Answer:"""
+
+    stream = llm(
+        prompt, max_tokens=200, stop=["</s>", "<|end|>", "Question:"], stream=True
+    )
+
+    print("\nANSWER:\n")
+
+    for chunk in stream:
+        token = chunk["choices"][0]["text"]  # type: ignore
+        print(token, end="", flush=True)
+
+    print()
+
+
+def main():
+    print("Loading DB...")
+
+    exists = conn.execute(
+        "SELECT name FROM sqlite_master WHERE type='table' AND name='chunks'"
+    ).fetchone()
+
+    count = 0
+    if exists:
+        count = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
+
+    if count == 0:
+        print("No data found. Ingesting PDF...")
+
+        text = load_pdf(f"./documents/{PDF_DOCUMENT}")
+        chunks = chunk_text(text)
+        embeddings = embed_chunks(chunks)
+        store_embeddings(chunks, embeddings)
+
+    print("\nRAG is ready. Ask questions (type 'exit' to quit)")
+
+    while True:
+        print()
+
+        question = input("Question: ").strip()
+
+        if question.lower() in ["exit", "quit"]:
+            break
+
+        results = query(question)
+        ask_llm(results, question)
+
+
+if __name__ == "__main__":
+    main()
+    print("Goodbye!")
+    sys.exit()
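In `query()` above, the vector search only selects the `initial_k` candidates; the final order comes from `keyword_score` alone, so how close each chunk was to the question plays no part in the ranking. One possible variation is to blend both signals. The sketch below assumes each candidate carries a distance where smaller is better (sqlite-vec KNN queries expose a `distance` column that could supply it). `blend_scores`, the `alpha` weight, and the toy candidates are hypothetical and not part of the commit.

```python
def blend_scores(candidates, alpha=0.5):
    """Rank candidates by a weighted mix of vector and keyword relevance.

    `candidates` is a list of (chunk_id, distance, keyword_score) tuples,
    where a smaller distance and a larger keyword score are both better.
    `alpha` is the weight given to the vector signal (0.0 to 1.0).
    """
    if not candidates:
        return []

    distances = [d for _, d, _ in candidates]
    keywords = [s for _, _, s in candidates]
    d_min, d_max = min(distances), max(distances)
    k_max = max(keywords)

    ranked = []
    for chunk_id, distance, keyword in candidates:
        # Map distance to [0, 1] where 1 means closest; all-equal maps to 1.
        if d_max > d_min:
            vector_part = (d_max - distance) / (d_max - d_min)
        else:
            vector_part = 1.0
        # Scale keyword scores to [0, 1]; all-zero scores stay at 0.
        keyword_part = keyword / k_max if k_max > 0 else 0.0
        blended = alpha * vector_part + (1 - alpha) * keyword_part
        ranked.append((blended, chunk_id))

    # Highest blended score first; ties fall back to the smaller chunk id.
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return [chunk_id for _, chunk_id in ranked]


# Toy candidates: (chunk_id, distance, keyword_score); values are made up.
candidates = [(7, 0.20, 0.0), (3, 0.35, 1.0), (9, 0.50, 0.5)]
print(blend_scores(candidates, alpha=0.5))  # prints [3, 7, 9]
```

Setting `alpha=0.0` reproduces a keyword-only ordering like the one in the commit, while `alpha=1.0` trusts the vector search alone; the right balance would need testing against real questions.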
--- a/requirements.txt
+++ b/requirements.txt
@ -0,0 +1,83 @@
+annotated-doc==0.0.4
+annotated-types==0.7.0
+anyio==4.13.0
+attrs==26.1.0
+bcrypt==5.0.0
+build==1.5.0
+certifi==2026.4.22
+charset-normalizer==3.4.7
+click==8.3.3
+diskcache==5.6.3
+durationpy==0.10
+fastapi==0.136.1
+filelock==3.29.0
+flatbuffers==25.12.19
+fsspec==2026.4.0
+googleapis-common-protos==1.74.0
+grpcio==1.80.0
+h11==0.16.0
+hf-xet==1.4.3
+httpcore==1.0.9
+httptools==0.7.1
+httpx==0.28.1
+huggingface_hub==1.13.0
+idna==3.13
+importlib_metadata==8.7.1
+importlib_resources==7.1.0
+Jinja2==3.1.6
+jsonschema==4.26.0
+jsonschema-specifications==2025.9.1
+kubernetes==35.0.0
+llama_cpp_python==0.3.21
+markdown-it-py==4.0.0
+MarkupSafe==3.0.3
+mdurl==0.1.2
+mmh3==5.2.1
+numpy==2.4.4
+oauthlib==3.3.1
+onnxruntime==1.25.1
+opentelemetry-api==1.41.1
+opentelemetry-exporter-otlp-proto-common==1.41.1
+opentelemetry-exporter-otlp-proto-grpc==1.41.1
+opentelemetry-proto==1.41.1
+opentelemetry-sdk==1.41.1
+opentelemetry-semantic-conventions==0.62b1
+orjson==3.11.8
+overrides==7.7.0
+packaging==26.2
+protobuf==6.33.6
+pybase64==1.4.3
+pydantic==2.13.3
+pydantic-settings==2.14.0
+pydantic_core==2.46.3
+Pygments==2.20.0
+pypdf==6.10.2
+PyPika==0.51.1
+pyproject_hooks==1.2.0
+python-dateutil==2.9.0.post0
+python-dotenv==1.2.2
+PyYAML==6.0.3
+referencing==0.37.0
+requests==2.33.1
+requests-oauthlib==2.0.0
+rich==15.0.0
+rpds-py==0.30.0
+shellingham==1.5.4
+six==1.17.0
+sqlite-vec==0.1.9
+sse-starlette==3.4.1
+starlette==1.0.0
+starlette-context==0.3.6
+tenacity==9.1.4
+tokenizers==0.23.1
+tqdm==4.67.3
+typer==0.25.1
+typing-inspection==0.4.2
+typing_extensions==4.15.0
+urllib3==2.6.3
+uvicorn==0.46.0
+uvloop==0.22.1
+watchfiles==1.1.1
+websocket-client==1.9.0
+websockets==16.0
+zipp==3.23.1