🎉 Initial commit!

tomit4 2026-05-02 10:22:56 -07:00
commit f12cc7aaf8
6 changed files with 541 additions and 0 deletions

7
.gitignore vendored Normal file

@ -0,0 +1,7 @@
.env*
.editorconfig
*.vim
.venv/
models/
documents/
*.sqlite

32
LICENSE Normal file

@ -0,0 +1,32 @@
The Clear BSD License
Copyright (c) 2026 Brian Hayes
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted (subject to the limitations in the disclaimer
below) provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from this
software without specific prior written permission.
NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE GRANTED BY
THIS LICENSE. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.

130
README.md Normal file

@ -0,0 +1,130 @@
# Minimal RAG PDF Reader
## Introduction
This repository contains a beginner's implementation of a very basic
Retrieval-Augmented Generation (RAG) pipeline for PDFs. It is meant as a simple
exercise with RAG, but also demonstrates my attempt at a minimal RAG
implementation that runs locally, with no external API calls during execution.
## Setup
This document is mainly meant for personal use, so it does not give extensive
explanations or instructions for setting up this repository. Those familiar
with git and Python should already know these procedures.
**Cloning the repo:**
```sh
git clone <this_url> && \
cd minimal_rag_pdf
```
**Starting the environment:**
```sh
python -m venv .venv && \
source .venv/bin/activate
```
**Upgrading pip:**
```sh
python -m pip install --upgrade pip
```
**Fixing a CUDA/GCC version mismatch:**
Building `llama-cpp-python` with CUDA support can fail when the CUDA toolkit
and the system's default GCC version do not match. Please refer to their
[documentation](https://llama-cpp-python.readthedocs.io/en/latest/).
The following environment variables and installation flags are what got it
working on my personal machine; adjust the compiler version and the `nvcc` path
for your system. Note that, depending on your system, compiling can take a while:
```sh
CC=gcc-14 CXX=g++-14 \
CUDACXX=/opt/cuda/bin/nvcc \
CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/gcc-14" \
python -m pip install llama-cpp-python --no-cache-dir --force-reinstall
```
**Installing requirements:**
```sh
python -m pip install -r requirements.txt
```
**Environment variables:**
```sh
cp env.sample .env
```
Note that if you don't use the exact same LLM, embedding model, and PDF that I
used, this application will not work until you change the values of the
environment variables in `.env` to match your file names.
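As a rough sketch of how such a mismatch could be caught early, the snippet below checks that each variable from `env.sample` is set and points to an existing file. It is not part of `main.py`: `find_problems` and `EXPECTED` are hypothetical names, and the folder layout is assumed from the sections below (`models/` and `documents/`).

```python
from pathlib import Path
from typing import Mapping

# Variable name -> folder its file is expected in (assumed from env.sample
# and the folder names used in this README).
EXPECTED = {
    "LLM_MODEL": "models",
    "EMBEDDING_MODEL": "models",
    "PDF_DOCUMENT": "documents",
}


def find_problems(env: Mapping[str, str], root: Path = Path(".")) -> list[str]:
    """Return readable messages for unset variables or missing files."""
    problems = []
    for name, folder in EXPECTED.items():
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
            continue
        path = root / folder / value
        if not path.is_file():
            problems.append(f"{name} points to a missing file: {path}")
    return problems
```

Calling `find_problems(os.environ)` after `load_dotenv()` would return an empty list when everything is in place.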
## Downloading the models
You can use more powerful models than the ones I used, but if you just want to
run what I tried, the instructions are below. Please note that I have a very
low-end GPU and CPU, so I could only use LLMs with a low parameter count.
Head over to [Hugging Face](https://huggingface.co/) and download the
[bartowski/Qwen2.5-Coder-7B-Instruct-abliterated-GGUF](https://huggingface.co/bartowski/Qwen2.5-Coder-7B-Instruct-abliterated-GGUF/blob/main/Qwen2.5-Coder-7B-Instruct-abliterated-Q4_K_L.gguf)
LLM model, and the
[CompendiumLabs/bge-small-en-v1.5-q8_0](https://huggingface.co/CompendiumLabs/bge-small-en-v1.5-gguf/blob/main/bge-small-en-v1.5-q8_0.gguf)
embedding model.
Note that these models should be placed in the `models` folder. If it doesn't
exist, go ahead and make it:
```sh
mkdir models
```
And if you used a different LLM model and/or embedding model, make sure to
change the name(s) in the `.env` file.
## Finding PDFs
I made this script as a novelty, and it currently reads only a single PDF as
the data source for the RAG pipeline. To replicate exactly what I did, use the
Linux Essentials Study Guide from
[LPI](https://learning.lpi.org/en/learning-materials/010-160/). Any PDF that you
want to use should be placed in the `documents` folder. Again, if it doesn't
exist, go ahead and make it:
```sh
mkdir documents
```
And if you used a different PDF document, make sure to change the name in the
`.env` file.
## Running the application
```sh
python main.py
```
The first time the application runs, it populates the SQLite database with the
vectorized embeddings of the PDF, so let it finish. After that initial
ingestion, the stored embeddings are reused and it should run much faster
(especially with GPU acceleration).
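The "ingest only once" behaviour can be sketched with the standard library alone. This is an illustration rather than the code in `main.py`: `needs_ingest` is a hypothetical helper, and a plain table in an in-memory database stands in for the sqlite-vec table in `vector_db.sqlite`.

```python
import sqlite3


def needs_ingest(conn: sqlite3.Connection, table: str = "chunks") -> bool:
    """Return True when the table is missing or holds no rows."""
    exists = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name=?", (table,)
    ).fetchone()
    if not exists:
        return True
    # Table names cannot be bound as parameters; `table` is trusted here.
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return count == 0


conn = sqlite3.connect(":memory:")  # in-memory stand-in for the real database
print(needs_ingest(conn))  # True: first run, nothing stored yet
conn.execute("CREATE TABLE chunks (id INTEGER, text TEXT)")
conn.execute("INSERT INTO chunks VALUES (0, 'hello')")
print(needs_ingest(conn))  # False: later runs reuse the stored rows
```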
## Notes/Disclaimer
It's worth noting that this is a very basic RAG application. It uses
[sqlite-vec](https://github.com/asg017/sqlite-vec) instead of ChromaDB, as an
exploration of alternatives. It doesn't use LangChain or LlamaIndex, and it
doesn't pull in a bunch of APIs. It does use [llama.cpp](https://llama-cpp.com/)
via [llama-cpp-python](https://llama-cpp-python.readthedocs.io/en/latest/) to
load the LLM and embedding models, as well as
[pypdf](https://pypdf.readthedocs.io/en/stable/) to read the PDF file.
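To give a feel for what the vector store contributes, here is a pure-Python, brute-force nearest-neighbour search over unit-length vectors. It is only a conceptual stand-in: the application delegates this step to sqlite-vec, and `top_k` and the toy two-dimensional vectors are made up for illustration.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length; an all-zero vector is returned as-is."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query: list[float], rows: list[tuple[int, list[float]]], k: int = 3) -> list[int]:
    """Return the ids of the k stored vectors closest to the query (L2 distance)."""
    q = normalize(query)
    ranked = sorted(rows, key=lambda row: math.dist(q, normalize(row[1])))
    return [row_id for row_id, _ in ranked[:k]]


# Toy "embeddings": (chunk id, vector)
rows = [(0, [1.0, 0.0]), (1, [0.0, 1.0]), (2, [0.7, 0.7])]
print(top_k([0.9, 0.1], rows, k=2))  # -> [0, 2]
```

Because all vectors are normalized first, ranking by L2 distance gives the same order as ranking by cosine similarity.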
This project is not meant to be utilized in any commercial way, but is purely
educational in purpose.

3
env.sample Normal file

@ -0,0 +1,3 @@
LLM_MODEL="Qwen2.5-Coder-7B-Instruct-abliterated-Q4_K_L.gguf"
EMBEDDING_MODEL="bge-small-en-v1.5-q8_0.gguf"
PDF_DOCUMENT="LPI-Learning-Material-010-160-en.pdf"

286
main.py Normal file

@ -0,0 +1,286 @@
import os
import re
import sqlite3
import sys

import numpy as np
import sqlite_vec
from dotenv import load_dotenv
from llama_cpp import Llama
from pypdf import PdfReader
from sqlite_vec import serialize_float32

load_dotenv()

DEBUG = False
LLM_MODEL = os.getenv("LLM_MODEL")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL")
PDF_DOCUMENT = os.getenv("PDF_DOCUMENT")

# Fail early with a clear message instead of trying to load "./models/None".
if not all((LLM_MODEL, EMBEDDING_MODEL, PDF_DOCUMENT)):
    sys.exit("Set LLM_MODEL, EMBEDDING_MODEL and PDF_DOCUMENT in .env first.")

os.makedirs("./models", exist_ok=True)
os.makedirs("./documents", exist_ok=True)

conn = sqlite3.connect("./vector_db.sqlite")
conn.enable_load_extension(True)
sqlite_vec.load(conn)
conn.enable_load_extension(False)  # Extension is loaded; disable further loading

llm = Llama(
    model_path=f"./models/{LLM_MODEL}",
    n_gpu_layers=-1,  # Offload all layers to the GPU; comment out to run on CPU
    n_ctx=6096,  # Context window size in tokens
    verbose=False,
    log_level="error",
    # seed=1337,  # Uncomment to set a specific seed
    # Sampling options are normally passed per call, e.g. llm(prompt, top_k=40):
    # temperature=0.2,
    # repeat_penalty=1.15,
    # top_p=0.9,
    # top_k=40,
)


_embedding_model = None


def get_embedding_model():
    """Load the embedding model on first use and reuse it afterwards."""
    global _embedding_model
    if _embedding_model is None:
        print("Loading embedding model...")
        _embedding_model = Llama(
            model_path=f"./models/{EMBEDDING_MODEL}",
            embedding=True,
            verbose=False,
            log_level="error",
        )
    return _embedding_model


def init_db(dim: int):
    """Create the sqlite-vec virtual table that holds one row per chunk."""
    conn.execute("PRAGMA journal_mode=WAL;")
    conn.execute(f"""
        CREATE VIRTUAL TABLE IF NOT EXISTS chunks USING vec0(
            id INTEGER,
            embedding float[{dim}],
            text TEXT
        );
    """)


def load_pdf(path):
    """Extract the text of every page, skipping pages with no extractable text."""
    reader = PdfReader(path)
    text = ""
    empty_pages = 0
    for page in reader.pages:
        page_text = page.extract_text()
        if not page_text:
            empty_pages += 1
            continue
        text += page_text + "\n"
    print(f"Empty pages: {empty_pages}/{len(reader.pages)}")
    print(f"Total extracted chars: {len(text)}")
    return text


def chunk_text(text, max_chars=1200):
    """Greedily pack lines into chunks of roughly max_chars characters."""
    lines = text.split("\n")
    chunks = []
    current = ""
    for line in lines:
        if len(current) + len(line) < max_chars:
            current += line + "\n"
        else:
            # Skip empty chunks, e.g. when the very first line exceeds max_chars.
            if current.strip():
                chunks.append(current.strip())
            current = line + "\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks


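def normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    v = np.array(vec, dtype=np.float32)
    norm = np.linalg.norm(v)
    if norm == 0:
        return v.tolist()  # Avoid dividing by zero
    return (v / norm).tolist()


def embed_chunks(chunks, batch_size=1):
    """Embed the chunks in batches and return one normalized vector per chunk."""
    all_embeddings = []
    model = get_embedding_model()
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i : i + batch_size]
        result = model.create_embedding(batch)
        batch_embeddings = [normalize(e["embedding"]) for e in result["data"]]
        all_embeddings.extend(batch_embeddings)
        print(f"Embedded {i + len(batch)}/{len(chunks)}")
    return all_embeddings


def store_embeddings(chunks, embeddings):
    """Create the table sized to the embedding dimension and insert all chunks."""
    if not embeddings:
        raise ValueError("Nothing to store: no text was extracted from the PDF.")
    dim = len(embeddings[0])
    init_db(dim)
    for i, (chunk, emb) in enumerate(zip(chunks, embeddings)):
        conn.execute(
            "INSERT INTO chunks (id, embedding, text) VALUES (?, ?, ?)",
            (i, serialize_float32(emb), chunk),
        )
    conn.commit()

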
# TODO: put this in a config file or something
STOP_WORDS = {
    "the", "is", "a", "an", "who", "what", "when", "where", "why", "how",
    "and", "or", "to", "of", "in", "on", "for", "with", "as", "by",
}


def tokenize(text):
    """Return the set of lowercase words in text, minus stop words."""
    words = set(re.findall(r"\b\w+\b", text.lower()))
    return {w for w in words if w not in STOP_WORDS}


def keyword_score(query, text):
    """Fraction of query keywords found in text, plus 1.0 for an exact phrase hit."""
    q = tokenize(query)
    t = tokenize(text)
    if not q:
        return 0
    overlap = q & t
    score = len(overlap) / len(q)
    if query.lower() in text.lower():
        score += 1.0
    return score


def query(question, top_k=3, initial_k=10):
    """Fetch initial_k nearest chunks, rerank by keyword overlap, keep top_k."""
    model = get_embedding_model()
    query_embedding = normalize(
        model.create_embedding([question])["data"][0]["embedding"]
    )
    rows = conn.execute(
        """
        SELECT id, text FROM chunks WHERE embedding MATCH ? AND k = ?
        """,
        (serialize_float32(query_embedding), initial_k),  # type: ignore
    ).fetchall()
    scored = []
    for cid, text in rows:
        score = keyword_score(question, text)
        scored.append((cid, text, score))
    # The sort is stable, so chunks with equal scores keep their KNN order.
    scored.sort(key=lambda x: x[2], reverse=True)
    if DEBUG:
        print("\n--- RETRIEVAL DEBUG ---")
        for cid, text, s in scored[:5]:
            print(f"[{cid}] score={s:.2f} | {text[:120]}\n")
    return [(cid, text) for cid, text, _ in scored[:top_k]]


def ask_llm(context_chunks, question):
    """Build a prompt from the retrieved chunks and stream the model's answer."""
    context = "\n\n".join(f"[{cid}] {text}" for cid, text in context_chunks)
    prompt = f"""You are a precise assistant.
Use ONLY the provided context to answer.
Cite sources at the end of your sentences using bracket IDs.
If unsure, say "I don't know based on the provided context."
Context:
{context}
Question:
{question}
Answer:"""
    stream = llm(
        prompt, max_tokens=200, stop=["</s>", "<|end|>", "Question:"], stream=True
    )
    print("\nANSWER:\n")
    for chunk in stream:
        token = chunk["choices"][0]["text"]  # type: ignore
        print(token, end="", flush=True)
    print()


def main():
    print("Loading DB...")
    exists = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name='chunks'"
    ).fetchone()
    count = 0
    if exists:
        count = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]
    if count == 0:
        # First run: extract, chunk, embed, and store the PDF.
        print("No data found. Ingesting PDF...")
        text = load_pdf(f"./documents/{PDF_DOCUMENT}")
        chunks = chunk_text(text)
        embeddings = embed_chunks(chunks)
        store_embeddings(chunks, embeddings)
    print("\nRAG is ready. Ask questions (type 'exit' to quit)")
    while True:
        print()
        question = input("Question: ").strip()
        if question.lower() in ["exit", "quit"]:
            break
        if not question:
            continue  # Ignore empty input instead of embedding an empty string
        results = query(question)
        ask_llm(results, question)


if __name__ == "__main__":
    main()
    conn.close()
    print("Goodbye!")
    sys.exit()

83
requirements.txt Normal file

@ -0,0 +1,83 @@
annotated-doc==0.0.4
annotated-types==0.7.0
anyio==4.13.0
attrs==26.1.0
bcrypt==5.0.0
build==1.5.0
certifi==2026.4.22
charset-normalizer==3.4.7
click==8.3.3
diskcache==5.6.3
durationpy==0.10
fastapi==0.136.1
filelock==3.29.0
flatbuffers==25.12.19
fsspec==2026.4.0
googleapis-common-protos==1.74.0
grpcio==1.80.0
h11==0.16.0
hf-xet==1.4.3
httpcore==1.0.9
httptools==0.7.1
httpx==0.28.1
huggingface_hub==1.13.0
idna==3.13
importlib_metadata==8.7.1
importlib_resources==7.1.0
Jinja2==3.1.6
jsonschema==4.26.0
jsonschema-specifications==2025.9.1
kubernetes==35.0.0
llama_cpp_python==0.3.21
markdown-it-py==4.0.0
MarkupSafe==3.0.3
mdurl==0.1.2
mmh3==5.2.1
numpy==2.4.4
oauthlib==3.3.1
onnxruntime==1.25.1
opentelemetry-api==1.41.1
opentelemetry-exporter-otlp-proto-common==1.41.1
opentelemetry-exporter-otlp-proto-grpc==1.41.1
opentelemetry-proto==1.41.1
opentelemetry-sdk==1.41.1
opentelemetry-semantic-conventions==0.62b1
orjson==3.11.8
overrides==7.7.0
packaging==26.2
protobuf==6.33.6
pybase64==1.4.3
pydantic==2.13.3
pydantic-settings==2.14.0
pydantic_core==2.46.3
Pygments==2.20.0
pypdf==6.10.2
PyPika==0.51.1
pyproject_hooks==1.2.0
python-dateutil==2.9.0.post0
python-dotenv==1.2.2
PyYAML==6.0.3
referencing==0.37.0
requests==2.33.1
requests-oauthlib==2.0.0
rich==15.0.0
rpds-py==0.30.0
shellingham==1.5.4
six==1.17.0
sqlite-vec==0.1.9
sse-starlette==3.4.1
starlette==1.0.0
starlette-context==0.3.6
tenacity==9.1.4
tokenizers==0.23.1
tqdm==4.67.3
typer==0.25.1
typing-inspection==0.4.2
typing_extensions==4.15.0
urllib3==2.6.3
uvicorn==0.46.0
uvloop==0.22.1
watchfiles==1.1.1
websocket-client==1.9.0
websockets==16.0
zipp==3.23.1