commit f12cc7aaf8bf396598165bd88acea75cbfcc1674 Author: tomit4 Date: Sat May 2 10:22:56 2026 -0700 :tada: Initial commit! diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..a2b9bda --- /dev/null +++ b/.gitignore @@ -0,0 +1,7 @@ +.env* +.editorconfig +*.vim +.venv/ +models/ +documents/ +*.sqlite diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..2a610c6 --- /dev/null +++ b/LICENSE @@ -0,0 +1,32 @@ +The Clear BSD License + +Copyright (c) 2026 Brian Hayes +All rights reserved. + +Redistribution and use in source and binary forms, with or without +modification, are permitted (subject to the limitations in the disclaimer +below) provided that the following conditions are met: + + * Redistributions of source code must retain the above copyright notice, + this list of conditions and the following disclaimer. + + * Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + + * Neither the name of the copyright holder nor the names of its + contributors may be used to endorse or promote products derived from this + software without specific prior written permission. + +NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE GRANTED BY +THIS LICENSE. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND +CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT +LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A +PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR +CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, +PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR +BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER +IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) +ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +POSSIBILITY OF SUCH DAMAGE. diff --git a/README.md b/README.md new file mode 100644 index 0000000..4ed3b43 --- /dev/null +++ b/README.md @@ -0,0 +1,130 @@ +# Minimal RAG PDF Reader + +## Introduction + +This repository contains a beginner implementation of a very basic Retrieval +Augmented Generation (RAG) application for PDFs. It is meant as a simple exercise +with RAG, but also demonstrates my attempts at creating a minimal RAG implementation +that runs locally without API usage during execution. + +## Setup + +This document is mainly meant for personal use, so there will not be +extensive explanation or instruction for how to set up this repository. Those +familiar with git and Python should be well versed in these procedures. + +**Cloning the repo:** + +```sh +git clone && \ +cd minimal_rag_pdf +``` + +**Starting the environment:** + +```sh +python -m venv .venv && \ +source .venv/bin/activate +``` + +**Upgrading pip:** + +```sh +python -m pip install --upgrade pip +``` + +**Fixing a CUDA/GCC version mismatch:** + +There is potentially a mismatch when setting up `llama-cpp-python` on different +systems. Please refer to their +[documentation](https://llama-cpp-python.readthedocs.io/en/latest/). + +The following environment variable declarations and subsequent installation with +proper flags are what got it working on my personal machine. 
Note that, depending +on your system, compilation can take a while: + +```sh +CC=gcc-14 CXX=g++-14 \ +CUDACXX=/opt/cuda/bin/nvcc \ +CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/gcc-14" \ +python -m pip install llama-cpp-python --no-cache-dir --force-reinstall +``` + +**Installing requirements:** + +```sh +python -m pip install -r requirements.txt +``` + +**Environment variables:** + +```sh +cp env.sample .env +``` + +Note that if you don't use the exact same LLM, embedding model, and PDF +that I used, this application will not work until you update the corresponding +file names in `.env`. + +## Downloading the models + +You can use more powerful models than the ones I used if you so choose, but if +you want to just run what I tried, you can find the instructions here. Please +note that I have a very low-end GPU and a low-end CPU, so I could only use LLMs +with low parameter counts. + +Head over to [Hugging Face](https://huggingface.co/) and download the +[bartowski/Qwen2.5-Coder-7B-Instruct-abliterated-GGUF](https://huggingface.co/bartowski/Qwen2.5-Coder-7B-Instruct-abliterated-GGUF/blob/main/Qwen2.5-Coder-7B-Instruct-abliterated-Q4_K_L.gguf) +LLM, and the +[CompendiumLabs/bge-small-en-v1.5-q8_0](https://huggingface.co/CompendiumLabs/bge-small-en-v1.5-gguf/blob/main/bge-small-en-v1.5-q8_0.gguf) +embedding model. + +Note that these models should be placed in the `models` folder. If it doesn't +exist, go ahead and make it: + +```sh +mkdir models +``` + +And if you used a different LLM and/or embedding model, make sure to +change the name(s) in the `.env` file. + +## Finding PDFs + +I made this script just as a novelty, and currently it only reads a single PDF +as data for the RAG. If you want to replicate what I did exactly, I ended up +feeding the RAG the Linux Essentials Study Guide from +[LPI](https://learning.lpi.org/en/learning-materials/010-160/). Any PDF that you +do want to use should be placed in the `documents` folder. 
Again, if it doesn't +exist, go ahead and make it: + +```sh +mkdir documents +``` + +And if you used a different PDF document, make sure to change the name in the +`.env` file. + +## Running the application + +```sh +python main.py +``` + +The first time the application runs, it populates the SQLite DB with the +chunk embeddings, so just let it do its thing. After that initial +population of the database, it should start much faster (especially with GPU +acceleration). + +## Notes/Disclaimer + +It's worth noting this is a very basic RAG application. It uses +[sqlite-vec](https://github.com/asg017/sqlite-vec) instead of ChromaDB just as an +exploration into alternatives. It doesn't utilize LangChain or LlamaIndex or +bring in a bunch of APIs. It does utilize [llama.cpp](https://github.com/ggml-org/llama.cpp) +via [llama-cpp-python](https://llama-cpp-python.readthedocs.io/en/latest/) to +bring in the LLM and embedding models, as well as +[pypdf](https://pypdf.readthedocs.io/en/stable/) to read the PDF file. + +This project is not meant to be utilized in any commercial way, but is purely +educational in purpose. 
diff --git a/env.sample b/env.sample new file mode 100644 index 0000000..3174978 --- /dev/null +++ b/env.sample @@ -0,0 +1,3 @@ +LLM_MODEL="Qwen2.5-Coder-7B-Instruct-abliterated-Q4_K_L.gguf" +EMBEDDING_MODEL="bge-small-en-v1.5-q8_0.gguf" +PDF_DOCUMENT="LPI-Learning-Material-010-160-en.pdf" diff --git a/main.py b/main.py new file mode 100644 index 0000000..a4d4a5d --- /dev/null +++ b/main.py @@ -0,0 +1,286 @@ +import os +import re +import sqlite3 +import sys + +import numpy as np +import sqlite_vec +from dotenv import load_dotenv +from llama_cpp import Llama +from pypdf import PdfReader +from sqlite_vec import serialize_float32 + +load_dotenv() + +DEBUG = False +LLM_MODEL = os.getenv("LLM_MODEL") +EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL") +PDF_DOCUMENT = os.getenv("PDF_DOCUMENT") + +os.makedirs("./models", exist_ok=True) +os.makedirs("./documents", exist_ok=True) + +conn = sqlite3.connect("./vector_db.sqlite") +conn.enable_load_extension(True) +sqlite_vec.load(conn) + + +llm = Llama( + model_path=f"./models/{LLM_MODEL}", + n_gpu_layers=-1, # Offload all layers to the GPU; comment out to run on the CPU + n_ctx=6096, # Context window size in tokens + verbose=False, + log_level="error", + # seed = 1337, # Uncomment to set a specific seed + # temperature=0.2, + # repeat_penalty=1.15, + # top_p=0.9, + # top_k=40, +) + + +_embedding_model = None + + +def get_embedding_model(): + global _embedding_model + if _embedding_model is None: + print("Loading embedding model...") + _embedding_model = Llama( + model_path=f"./models/{EMBEDDING_MODEL}", + embedding=True, + verbose=False, + log_level="error", + ) + return _embedding_model + + +def init_db(dim: int): + conn.execute("PRAGMA journal_mode=WAL;") + + conn.execute(f""" + CREATE VIRTUAL TABLE IF NOT EXISTS chunks USING vec0( + id INTEGER, + embedding float[{dim}], + text TEXT + ); + """) + + +def load_pdf(path): + reader = PdfReader(path) + text = "" + empty_pages = 0 + + for page in reader.pages: + page_text = page.extract_text() + 
if not page_text: + empty_pages += 1 + continue + + text += page_text + "\n" + + print(f"Empty pages: {empty_pages}/{len(reader.pages)}") + print(f"Total extracted chars: {len(text)}") + + return text + + +def chunk_text(text, max_chars=1200): + paragraphs = text.split("\n") + chunks = [] + current = "" + + for p in paragraphs: + if len(current) + len(p) < max_chars: + current += p + "\n" + else: + chunks.append(current.strip()) + current = p + "\n" + + if current: + chunks.append(current.strip()) + + return [c for c in chunks if c] # drop empty chunks (e.g. an oversized first line) + + +def normalize(vec): + v = np.array(vec, dtype=np.float32) + return (v / max(np.linalg.norm(v), 1e-12)).tolist() # guard against a zero vector + + +def embed_chunks(chunks, batch_size=1): + all_embeddings = [] + + model = get_embedding_model() + + for i in range(0, len(chunks), batch_size): + batch = chunks[i : i + batch_size] + + result = model.create_embedding(batch) + batch_embeddings = [normalize(e["embedding"]) for e in result["data"]] + + all_embeddings.extend(batch_embeddings) + + print(f"Embedded {i + len(batch)}/{len(chunks)}") + + return all_embeddings + + +def store_embeddings(chunks, embeddings): + dim = len(embeddings[0]) + init_db(dim) + + for i, (chunk, emb) in enumerate(zip(chunks, embeddings)): + conn.execute( + "INSERT INTO chunks (id, embedding, text) VALUES (?, ?, ?)", + (i, serialize_float32(emb), chunk), + ) + + conn.commit() + + +def tokenize(text): + # TODO: put this in a config file or something + stop_words = { + "the", + "is", + "a", + "an", + "who", + "what", + "when", + "where", + "why", + "how", + "and", + "or", + "to", + "of", + "in", + "on", + "for", + "with", + "as", + "by", + } + words = set(re.findall(r"\b\w+\b", text.lower())) + return {w for w in words if w not in stop_words} + + +def keyword_score(query, text): + q = tokenize(query) + t = tokenize(text) + + if not q: + return 0 + + overlap = q & t + + score = len(overlap) / len(q) + + if query.lower() in text.lower(): + score += 1.0 + + return score + + +def query(question, top_k=3, initial_k=10): + 
model = get_embedding_model() + + query_embedding = normalize( + model.create_embedding([question])["data"][0]["embedding"] + ) + + rows = conn.execute( + """ + SELECT id, text FROM chunks WHERE embedding MATCH ? AND k = ? + """, + (serialize_float32(query_embedding), initial_k), # type: ignore + ).fetchall() + + scored = [] + for cid, text in rows: + score = keyword_score(question, text) + scored.append((cid, text, score)) + + scored.sort(key=lambda x: x[2], reverse=True) + + if DEBUG: + print("\n--- RETRIEVAL DEBUG ---") + for cid, text, s in scored[:5]: + print(f"[{cid}] score={s:.2f} | {text[:120]}\n") + + return [(cid, text) for cid, text, _ in scored[:top_k]] + + +def ask_llm(context_chunks, question): + context = "\n\n".join(f"[{cid}] {text}" for cid, text in context_chunks) + + prompt = f"""You are a precise assistant. + + Use ONLY the provided context to answer. + + Cite sources at the end of your sentences using bracket IDs. + + If unsure , say "I don't know based on the provided context." + + Context: + {context} + + Question: + {question} + + Answer:""" + + stream = llm( + prompt, max_tokens=200, stop=["", "<|end|>", "Question:"], stream=True + ) + + print("\nANSWER:\n") + + for chunk in stream: + token = chunk["choices"][0]["text"] # type: ignore + print(token, end="", flush=True) + + print() + + +def main(): + print("Loading DB...") + + exists = conn.execute( + "SELECT name FROM sqlite_master WHERE type='table' AND name='chunks'" + ).fetchone() + + count = 0 + if exists: + count = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0] + + if count == 0: + print("No data found. Ingesting PDF...") + + text = load_pdf(f"./documents/{PDF_DOCUMENT}") + chunks = chunk_text(text) + embeddings = embed_chunks(chunks) + store_embeddings(chunks, embeddings) + + print("\nRAG is ready. 
Ask questions (type 'exit' to quit)") + + while True: + print() + + question = input("Question: ").strip() + + if question.lower() in ["exit", "quit"]: + break + + results = query(question) + ask_llm(results, question) + + +if __name__ == "__main__": + main() + print("Goodbye!") + sys.exit() diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..8c0bff5 --- /dev/null +++ b/requirements.txt @@ -0,0 +1,83 @@ +annotated-doc==0.0.4 +annotated-types==0.7.0 +anyio==4.13.0 +attrs==26.1.0 +bcrypt==5.0.0 +build==1.5.0 +certifi==2026.4.22 +charset-normalizer==3.4.7 +click==8.3.3 +diskcache==5.6.3 +durationpy==0.10 +fastapi==0.136.1 +filelock==3.29.0 +flatbuffers==25.12.19 +fsspec==2026.4.0 +googleapis-common-protos==1.74.0 +grpcio==1.80.0 +h11==0.16.0 +hf-xet==1.4.3 +httpcore==1.0.9 +httptools==0.7.1 +httpx==0.28.1 +huggingface_hub==1.13.0 +idna==3.13 +importlib_metadata==8.7.1 +importlib_resources==7.1.0 +Jinja2==3.1.6 +jsonschema==4.26.0 +jsonschema-specifications==2025.9.1 +kubernetes==35.0.0 +llama_cpp_python==0.3.21 +markdown-it-py==4.0.0 +MarkupSafe==3.0.3 +mdurl==0.1.2 +mmh3==5.2.1 +numpy==2.4.4 +oauthlib==3.3.1 +onnxruntime==1.25.1 +opentelemetry-api==1.41.1 +opentelemetry-exporter-otlp-proto-common==1.41.1 +opentelemetry-exporter-otlp-proto-grpc==1.41.1 +opentelemetry-proto==1.41.1 +opentelemetry-sdk==1.41.1 +opentelemetry-semantic-conventions==0.62b1 +orjson==3.11.8 +overrides==7.7.0 +packaging==26.2 +protobuf==6.33.6 +pybase64==1.4.3 +pydantic==2.13.3 +pydantic-settings==2.14.0 +pydantic_core==2.46.3 +Pygments==2.20.0 +pypdf==6.10.2 +PyPika==0.51.1 +pyproject_hooks==1.2.0 +python-dateutil==2.9.0.post0 +python-dotenv==1.2.2 +PyYAML==6.0.3 +referencing==0.37.0 +requests==2.33.1 +requests-oauthlib==2.0.0 +rich==15.0.0 +rpds-py==0.30.0 +shellingham==1.5.4 +six==1.17.0 +sqlite-vec==0.1.9 +sse-starlette==3.4.1 +starlette==1.0.0 +starlette-context==0.3.6 +tenacity==9.1.4 +tokenizers==0.23.1 +tqdm==4.67.3 +typer==0.25.1 
+typing-inspection==0.4.2 +typing_extensions==4.15.0 +urllib3==2.6.3 +uvicorn==0.46.0 +uvloop==0.22.1 +watchfiles==1.1.1 +websocket-client==1.9.0 +websockets==16.0 +zipp==3.23.1
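The keyword re-ranking step in `main.py` above (`tokenize`, `keyword_score`, and the sort inside `query`) can be tried on its own, without loading a model or a database. The following is a minimal sketch under stated assumptions: the `rerank` helper, the shortened `STOP_WORDS` set, and the sample rows are hypothetical and are not part of this commit.

```python
import re

# Shortened stop-word list for illustration; main.py uses a longer one.
STOP_WORDS = {"the", "is", "a", "an", "what", "how", "and", "or", "to", "of", "in"}


def tokenize(text):
    """Lower-case the text and return its unique words minus stop words."""
    words = set(re.findall(r"\b\w+\b", text.lower()))
    return {w for w in words if w not in STOP_WORDS}


def keyword_score(query, text):
    """Fraction of query keywords found in text, plus 1.0 for a verbatim match."""
    q = tokenize(query)
    if not q:
        return 0.0
    score = len(q & tokenize(text)) / len(q)
    if query.lower() in text.lower():
        score += 1.0
    return score


def rerank(question, rows, top_k=3):
    """Hypothetical helper: sort (id, text) candidates by keyword score, best first."""
    ranked = sorted(rows, key=lambda row: keyword_score(question, row[1]), reverse=True)
    return ranked[:top_k]


# Hypothetical candidate rows standing in for the vector-search results.
rows = [
    (0, "The kernel manages hardware resources."),
    (1, "Use chmod to change file permissions."),
    (2, "File permissions control who can read a file."),
]
top = rerank("how to change file permissions", rows, top_k=2)
print([cid for cid, _ in top])
```

Because the vector search only supplies the candidate set, the final order here depends entirely on keyword overlap; a chunk that is semantically close but shares no words with the question ends up last among the candidates.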