🎉 Initial commit!

2026-05-02 10:22:56 -07:00 · 2026-05-02 10:22:56 -07:00 · f12cc7aaf8
commit f12cc7aaf8
6 changed files with 541 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@ -0,0 +1,7 @@
 .env*
 .editorconfig
 *.vim
 .venv/
 models/
 documents/
 *.sqlite
--- a/32
+++ b/32
@ -0,0 +1,32 @@
 The Clear BSD License
 Copyright (c) 2026 Brian Hayes
 All rights reserved.
 Redistribution and use in source and binary forms, with or without
 modification, are permitted (subject to the limitations in the disclaimer
 below) provided that the following conditions are met:
     * Redistributions of source code must retain the above copyright notice,
     this list of conditions and the following disclaimer.
     * Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.
     * Neither the name of the copyright holder nor the names of its
     contributors may be used to endorse or promote products derived from this
     software without specific prior written permission.
 NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE GRANTED BY
 THIS LICENSE. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
 CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
 PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
 CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
 EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
 PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
 BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
 IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
 ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 POSSIBILITY OF SUCH DAMAGE.
--- a/README.md
+++ b/README.md
@ -0,0 +1,130 @@
 # Minimal RAG PDF Reader
 ## Introduction
 This repository contains a beginner implementation of a very basic
 Retrieval-Augmented Generation (RAG) pipeline for PDFs. It is meant as a simple
 exercise with RAG, and it also demonstrates my attempt at a minimal RAG
 implementation that runs locally, without calling external APIs during execution.
 ## Setup
 This document is mainly meant for personal use, so there are no extensive
 explanations or instructions for how to set up this repository. Those familiar
 with git and Python should be well versed in these procedures.
 **Cloning the repo:**
 ```sh
 git clone <this_url> && \
 cd minimal_rag_pdf
 ```
 **Starting the environment:**
 ```sh
 python -m venv .venv && \
 source .venv/bin/activate
 ```
 **Upgrading pip:**
 ```sh
 python -m pip install --upgrade pip
 ```
 **Resolving a CUDA/GCC version mismatch:**
 The compiler versions can mismatch when building `llama-cpp-python` on different
 systems. Please refer to its
 [documentation](https://llama-cpp-python.readthedocs.io/en/latest/).
 The following environment variables and installation flags are what got it
 working on my personal machine. Note that, depending on your system, compiling
 can take a while:
 ```sh
 CC=gcc-14 CXX=g++-14 \
 CUDACXX=/opt/cuda/bin/nvcc \
 CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/gcc-14" \
 python -m pip install llama-cpp-python --no-cache-dir --force-reinstall
 ```
 **Installing requirements:**
 ```sh
 python -m pip install -r requirements.txt
 ```
 **Environment variables:**
 ```sh
 cp env.sample .env
 ```
 Note that if you don't use the exact same LLM, embedding model, and PDF that I
 used, this application will not work until you change the corresponding file
 names in the `.env` file.
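As a rough illustration of how these three settings are used: `main.py` reads `LLM_MODEL`, `EMBEDDING_MODEL`, and `PDF_DOCUMENT` from the environment and joins them onto `./models/` and `./documents/`. The sketch below checks up front that each variable is set and points at an existing file. The helper name `resolve_paths` and the `REQUIRED` mapping are hypothetical and not part of this repository; only the standard library is used.

```python
import os
from pathlib import Path

# Variable name -> folder the file is expected in (mirrors the layout in this README).
REQUIRED = {
    "LLM_MODEL": "models",
    "EMBEDDING_MODEL": "models",
    "PDF_DOCUMENT": "documents",
}


def resolve_paths(env=None, root="."):
    """Return {variable: Path}; raise if a variable is unset or its file is missing.

    Hypothetical helper for illustration only; main.py does not define it.
    """
    env = os.environ if env is None else env
    paths = {}
    problems = []
    for name, folder in REQUIRED.items():
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
            continue
        path = Path(root) / folder / value
        if not path.is_file():
            problems.append(f"{name} points to a missing file: {path}")
            continue
        paths[name] = path
    if problems:
        raise RuntimeError("; ".join(problems))
    return paths
```

Calling `resolve_paths()` after `load_dotenv()` would surface a wrong file name as one readable error instead of a failure deep inside model loading.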
 ## Downloading the models
 You can use more powerful models than the ones I used, but if you just want to
 run what I tried, the instructions are below. Please note that I have a very
 low-end GPU and CPU, so I could only use LLMs with a low parameter count.
 Head over to [Hugging Face](https://huggingface.co/) and download the
 [bartowski/Qwen2.5-Coder-7B-Instruct-abliterated-GGUF](https://huggingface.co/bartowski/Qwen2.5-Coder-7B-Instruct-abliterated-GGUF/blob/main/Qwen2.5-Coder-7B-Instruct-abliterated-Q4_K_L.gguf)
 LLM, and the
 [CompendiumLabs/bge-small-en-v1.5-q8_0](https://huggingface.co/CompendiumLabs/bge-small-en-v1.5-gguf/blob/main/bge-small-en-v1.5-q8_0.gguf)
 embedding model.
 Note that these models should be placed in the `models` folder. If it doesn't
 exist, go ahead and make it:
 ```sh
 mkdir models
 ```
 If you used a different LLM and/or embedding model, make sure to change the
 file name(s) in the `.env` file.
 ## Finding PDFs
 I made this script just as a novelty, and currently it only reads a single PDF
 as data for the RAG. If you want to replicate what I did exactly, I ended up
 feeding the RAG the Linux Essentials Study Guide from
 [LPI](https://learning.lpi.org/en/learning-materials/010-160/). Any PDF that you
 do want to use should be placed in the `documents` folder. Again, if it doesn't
 exist, go ahead and make it:
 ```sh
 mkdir documents
 ```
 And if you used a different PDF document, make sure to change the name in the
 `.env` file.
 ## Running the application
 ```sh
 python main.py
 ```
 The first time you run the application, it populates the SQLite database with
 the embedding vectors, so just let it do its thing. After that initial ingest,
 startup should be much faster (especially with GPU acceleration).
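The ingest-once behaviour can be sketched with the standard `sqlite3` module alone. This is a simplified stand-in: it uses a plain table rather than the `vec0` virtual table that `main.py` creates through sqlite-vec, and `needs_ingest` is a hypothetical name.

```python
import sqlite3


def needs_ingest(conn, table="chunks"):
    """Return True when the table is missing or empty, i.e. the PDF must be ingested."""
    exists = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name=?", (table,)
    ).fetchone()
    if not exists:
        return True
    # Table names cannot be bound as parameters; `table` is trusted here.
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return count == 0


conn = sqlite3.connect(":memory:")
first_run = needs_ingest(conn)  # No table yet, so ingestion is needed

conn.execute("CREATE TABLE chunks (id INTEGER, text TEXT)")
conn.execute("INSERT INTO chunks VALUES (0, 'example chunk')")
conn.commit()
second_run = needs_ingest(conn)  # Data is already stored

print(first_run, second_run)  # True False
```

Because the check only looks at whether rows exist, switching to a different PDF or embedding model does not trigger a new ingest by itself. Deleting `vector_db.sqlite` (and any `-wal`/`-shm` files next to it) before the next run should force one.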
 ## Notes/Disclaimer
 It's worth noting that this is a very basic RAG application. It uses
 [sqlite-vec](https://github.com/asg017/sqlite-vec) instead of ChromaDB just as an
 exploration of alternatives. It doesn't use LangChain or LlamaIndex, and it
 doesn't pull in a bunch of APIs. It does use [llama.cpp](https://llama-cpp.com/)
 via [llama-cpp-python](https://llama-cpp-python.readthedocs.io/en/latest/) to
 bring in the LLM and embedding models, as well as
 [pypdf](https://pypdf.readthedocs.io/en/stable/) to read the PDF file.
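To make the retrieval step more concrete, here is a small pure-Python sketch of the two-stage idea in `main.py`: pick the nearest chunks by similarity of normalized embedding vectors, then re-rank those candidates by keyword overlap with the question. The vectors are made-up toy values and `retrieve` is an illustrative name. The real script gets its embeddings from llama-cpp-python and its nearest-neighbour search from sqlite-vec.

```python
import math
import re

STOP_WORDS = {"the", "is", "a", "an", "what", "how", "and", "or", "to", "of", "in"}


def normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def tokenize(text):
    """Lowercased set of words with a few stop words removed."""
    return {w for w in re.findall(r"\b\w+\b", text.lower()) if w not in STOP_WORDS}


def keyword_score(question, text):
    """Fraction of question keywords that also occur in the chunk."""
    q = tokenize(question)
    return len(q & tokenize(text)) / len(q) if q else 0.0


def retrieve(question, question_vec, chunks, top_k=2, initial_k=3):
    """chunks is a list of (id, text, vector) tuples."""
    qv = normalize(question_vec)
    # Stage 1: nearest neighbours by dot product of unit vectors (cosine similarity).
    candidates = sorted(
        chunks,
        key=lambda c: sum(a * b for a, b in zip(qv, normalize(c[2]))),
        reverse=True,
    )[:initial_k]
    # Stage 2: re-rank by keyword overlap; the sort is stable, so ties keep
    # their similarity order.
    reranked = sorted(
        candidates, key=lambda c: keyword_score(question, c[1]), reverse=True
    )
    return [(cid, text) for cid, text, _ in reranked[:top_k]]


toy_chunks = [
    (0, "The ls command lists directory contents.", [0.9, 0.1, 0.0]),
    (1, "Use chmod to change file permissions.", [0.1, 0.9, 0.0]),
    (2, "Directories can be created with mkdir.", [0.8, 0.2, 0.1]),
    (3, "The kernel manages hardware resources.", [0.0, 0.1, 0.9]),
]
results = retrieve("What command lists directory contents?", [1.0, 0.0, 0.0], toy_chunks)
print([cid for cid, _ in results])  # [0, 2]
```

On unit-length vectors, ranking by Euclidean distance and ranking by cosine similarity give the same order, which is one reason to normalize embeddings before storing them.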
 This project is not meant to be utilized in any commercial way, but is purely
 educational in purpose.
--- a/env.sample
+++ b/env.sample
@ -0,0 +1,3 @@
 LLM_MODEL="Qwen2.5-Coder-7B-Instruct-abliterated-Q4_K_L.gguf"
 EMBEDDING_MODEL="bge-small-en-v1.5-q8_0.gguf"
 PDF_DOCUMENT="LPI-Learning-Material-010-160-en.pdf"
--- a/main.py
+++ b/main.py
@ -0,0 +1,286 @@
 import os
 import re
 import sqlite3
 import sys
 import numpy as np
 import sqlite_vec
 from dotenv import load_dotenv
 from llama_cpp import Llama
 from pypdf import PdfReader
 from sqlite_vec import serialize_float32
 load_dotenv()
 DEBUG = False
 LLM_MODEL = os.getenv("LLM_MODEL")
 EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL")
 PDF_DOCUMENT = os.getenv("PDF_DOCUMENT")
 os.makedirs("./models", exist_ok=True)
 os.makedirs("./documents", exist_ok=True)
 conn = sqlite3.connect("./vector_db.sqlite")
 conn.enable_load_extension(True)
 sqlite_vec.load(conn)
 llm = Llama(
    model_path=f"./models/{LLM_MODEL}",
    n_gpu_layers=-1,  # Offload all layers to the GPU; set to 0 for CPU only
    n_ctx=6096,  # Context window size in tokens
    verbose=False,
    log_level="error",
    # seed=1337,  # Uncomment to set a specific seed
    # Sampling options are per-call arguments, e.g. llm(prompt, temperature=0.2):
    #  temperature=0.2,
    #  repeat_penalty=1.15,
    #  top_p=0.9,
    #  top_k=40,
 )
 _embedding_model = None
 def get_embedding_model():
    global _embedding_model
    if _embedding_model is None:
        print("Loading embedding model...")
        _embedding_model = Llama(
            model_path=f"./models/{EMBEDDING_MODEL}",
            embedding=True,
            verbose=False,
            log_level="error",
        )
    return _embedding_model
 def init_db(dim: int):
    conn.execute("PRAGMA journal_mode=WAL;")
    conn.execute(f"""
                 CREATE VIRTUAL TABLE IF NOT EXISTS chunks USING vec0(
                     id INTEGER,
                     embedding float[{dim}],
                     text TEXT
                 );
                 """)
 def load_pdf(path):
    reader = PdfReader(path)
    text = ""
    empty_pages = 0
    for page in reader.pages:
        page_text = page.extract_text()
        if not page_text:
            empty_pages += 1
            continue
        text += page_text + "\n"
    print(f"Empty pages: {empty_pages}/{len(reader.pages)}")
    print(f"Total extracted chars: {len(text)}")
    return text
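 def chunk_text(text, max_chars=1200):
    # Greedily pack lines into chunks of roughly max_chars characters.
    lines = text.split("\n")
    chunks = []
    current = ""
    for line in lines:
        if len(current) + len(line) < max_chars:
            current += line + "\n"
        else:
            # Skip empty chunks (e.g. when the very first line is too long).
            if current.strip():
                chunks.append(current.strip())
            current = line + "\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks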
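 def normalize(vec):
    # Scale to unit length; an all-zero vector is returned unchanged.
    v = np.array(vec, dtype=np.float32)
    norm = np.linalg.norm(v)
    return (v / norm).tolist() if norm > 0 else v.tolist()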
 def embed_chunks(chunks, batch_size=1):
    all_embeddings = []
    model = get_embedding_model()
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i : i + batch_size]
        result = model.create_embedding(batch)
        batch_embeddings = [normalize(e["embedding"]) for e in result["data"]]
        all_embeddings.extend(batch_embeddings)
        print(f"Embedded {i + len(batch)}/{len(chunks)}")
    return all_embeddings
 def store_embeddings(chunks, embeddings):
    dim = len(embeddings[0])
    init_db(dim)
    for i, (chunk, emb) in enumerate(zip(chunks, embeddings)):
        conn.execute(
            "INSERT INTO chunks (id, embedding, text) VALUES (?, ?, ?)",
            (i, serialize_float32(emb), chunk),
        )
    conn.commit()
 def tokenize(text):
    # TODO: put this in a config file or something
    stop_words = {
        "the",
        "is",
        "a",
        "an",
        "who",
        "what",
        "when",
        "where",
        "why",
        "how",
        "and",
        "or",
        "to",
        "of",
        "in",
        "on",
        "for",
        "with",
        "as",
        "by",
    }
    words = set(re.findall(r"\b\w+\b", text.lower()))
    return {w for w in words if w not in stop_words}
 def keyword_score(query, text):
    q = tokenize(query)
    t = tokenize(text)
    if not q:
        return 0
    overlap = q & t
    score = len(overlap) / len(q)
    if query.lower() in text.lower():
        score += 1.0
    return score
 def query(question, top_k=3, initial_k=10):
    model = get_embedding_model()
    query_embedding = normalize(
        model.create_embedding([question])["data"][0]["embedding"]
    )
    rows = conn.execute(
        """
        SELECT id, text FROM chunks
        WHERE embedding MATCH ? AND k = ?
        ORDER BY distance
        """,
        (serialize_float32(query_embedding), initial_k),  # type: ignore
    ).fetchall()
    scored = []
    for cid, text in rows:
        score = keyword_score(question, text)
        scored.append((cid, text, score))
    scored.sort(key=lambda x: x[2], reverse=True)
    if DEBUG:
        print("\n--- RETRIEVAL DEBUG ---")
        for cid, text, s in scored[:5]:
            print(f"[{cid}] score={s:.2f} | {text[:120]}\n")
    return [(cid, text) for cid, text, _ in scored[:top_k]]
 def ask_llm(context_chunks, question):
    context = "\n\n".join(f"[{cid}] {text}" for cid, text in context_chunks)
    prompt = f"""You are a precise assistant.
    Use ONLY the provided context to answer.
    Cite sources at the end of your sentences using bracket IDs.
    If unsure, say "I don't know based on the provided context."
    Context:
    {context}
    Question:
    {question}
    Answer:"""
    stream = llm(
        prompt, max_tokens=200, stop=["</s>", "<|end|>", "Question:"], stream=True
    )
    print("\nANSWER:\n")
    for chunk in stream:
        token = chunk["choices"][0]["text"]  # type: ignore
        print(token, end="", flush=True)
    print()
 def main():
    print("Loading DB...")
    exists = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name='chunks'"
    ).fetchone()
    count = 0
    if exists:
        count = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
    if count == 0:
        print("No data found. Ingesting PDF...")
        text = load_pdf(f"./documents/{PDF_DOCUMENT}")
        chunks = chunk_text(text)
        embeddings = embed_chunks(chunks)
        store_embeddings(chunks, embeddings)
    print("\nRAG is ready. Ask questions (type 'exit' to quit)")
    while True:
        print()
        question = input("Question: ").strip()
        if question.lower() in ["exit", "quit"]:
            break
        results = query(question)
        ask_llm(results, question)
 if __name__ == "__main__":
    main()
    print("Goodbye!")
    sys.exit()
--- a/requirements.txt
+++ b/requirements.txt
@ -0,0 +1,83 @@
 annotated-doc==0.0.4
 annotated-types==0.7.0
 anyio==4.13.0
 attrs==26.1.0
 bcrypt==5.0.0
 build==1.5.0
 certifi==2026.4.22
 charset-normalizer==3.4.7
 click==8.3.3
 diskcache==5.6.3
 durationpy==0.10
 fastapi==0.136.1
 filelock==3.29.0
 flatbuffers==25.12.19
 fsspec==2026.4.0
 googleapis-common-protos==1.74.0
 grpcio==1.80.0
 h11==0.16.0
 hf-xet==1.4.3
 httpcore==1.0.9
 httptools==0.7.1
 httpx==0.28.1
 huggingface_hub==1.13.0
 idna==3.13
 importlib_metadata==8.7.1
 importlib_resources==7.1.0
 Jinja2==3.1.6
 jsonschema==4.26.0
 jsonschema-specifications==2025.9.1
 kubernetes==35.0.0
 llama_cpp_python==0.3.21
 markdown-it-py==4.0.0
 MarkupSafe==3.0.3
 mdurl==0.1.2
 mmh3==5.2.1
 numpy==2.4.4
 oauthlib==3.3.1
 onnxruntime==1.25.1
 opentelemetry-api==1.41.1
 opentelemetry-exporter-otlp-proto-common==1.41.1
 opentelemetry-exporter-otlp-proto-grpc==1.41.1
 opentelemetry-proto==1.41.1
 opentelemetry-sdk==1.41.1
 opentelemetry-semantic-conventions==0.62b1
 orjson==3.11.8
 overrides==7.7.0
 packaging==26.2
 protobuf==6.33.6
 pybase64==1.4.3
 pydantic==2.13.3
 pydantic-settings==2.14.0
 pydantic_core==2.46.3
 Pygments==2.20.0
 pypdf==6.10.2
 PyPika==0.51.1
 pyproject_hooks==1.2.0
 python-dateutil==2.9.0.post0
 python-dotenv==1.2.2
 PyYAML==6.0.3
 referencing==0.37.0
 requests==2.33.1
 requests-oauthlib==2.0.0
 rich==15.0.0
 rpds-py==0.30.0
 shellingham==1.5.4
 six==1.17.0
 sqlite-vec==0.1.9
 sse-starlette==3.4.1
 starlette==1.0.0
 starlette-context==0.3.6
 tenacity==9.1.4
 tokenizers==0.23.1
 tqdm==4.67.3
 typer==0.25.1
 typing-inspection==0.4.2
 typing_extensions==4.15.0
 urllib3==2.6.3
 uvicorn==0.46.0
 uvloop==0.22.1
 watchfiles==1.1.1
 websocket-client==1.9.0
 websockets==16.0
 zipp==3.23.1