🎉 Initial commit!
commit f12cc7aaf8
6 changed files with 541 additions and 0 deletions
7
.gitignore
vendored
Normal file
@@ -0,0 +1,7 @@
.env*
.editorconfig
*.vim
.venv/
models/
documents/
*.sqlite
32
LICENSE
Normal file
@@ -0,0 +1,32 @@
The Clear BSD License

Copyright (c) 2026 Brian Hayes
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted (subject to the limitations in the disclaimer
below) provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice,
  this list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright
  notice, this list of conditions and the following disclaimer in the
  documentation and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
  contributors may be used to endorse or promote products derived from this
  software without specific prior written permission.

NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE GRANTED BY
THIS LICENSE. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
130
README.md
Normal file
@@ -0,0 +1,130 @@
# Minimal RAG PDF Reader

## Introduction

This repository contains a beginner implementation of a very basic
Retrieval-Augmented Generation (RAG) application for PDFs. It is meant as a
simple exercise with RAG, but also demonstrates my attempt at creating a minimal
RAG implementation that runs locally without API usage during execution.

## Setup

This document is mainly meant for personal use, so there will not be extensive
explanation or instruction for how to set up this repository. Those familiar
with git and Python should be well versed in these procedures.

**Cloning the repo:**

```sh
git clone <this_url> && \
cd minimal_rag_pdf
```

**Starting the environment:**

```sh
python -m venv .venv && \
source .venv/bin/activate
```

**Upgrading pip:**

```sh
python -m pip install --upgrade pip
```

**Fixing a CUDA/GCC version mismatch:**

Building `llama-cpp-python` with CUDA support can fail on some systems because
of a mismatch between the CUDA toolkit and the default GCC version. Please refer
to their [documentation](https://llama-cpp-python.readthedocs.io/en/latest/).

The following environment variables and installation flags are what got it
working on my personal machine. Note that, depending on your system, compiling
can take a while:

```sh
CC=gcc-14 CXX=g++-14 \
CUDACXX=/opt/cuda/bin/nvcc \
CMAKE_ARGS="-DGGML_CUDA=on -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/gcc-14" \
python -m pip install llama-cpp-python --no-cache-dir --force-reinstall
```

**Installing requirements:**

```sh
python -m pip install -r requirements.txt
```

**Environment variables:**

```sh
cp env.sample .env
```

Note that if you don't use the exact same LLM, embedding model, and PDF that I
used, this application will not work until you change the corresponding values
in the `.env` file.
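
To make that requirement concrete, here is a minimal sketch of a pre-flight check that reports which configured files are missing. The helper name `missing_files` and the `EXPECTED` mapping are hypothetical and not part of `main.py`; the variable names and folders are taken from `env.sample` and `main.py`.

```python
import os
from pathlib import Path

# Variable name -> folder the file is expected in (mirrors env.sample and main.py).
EXPECTED = {
    "LLM_MODEL": "models",
    "EMBEDDING_MODEL": "models",
    "PDF_DOCUMENT": "documents",
}


def missing_files(env=None, root="."):
    """Return a list of readable problems; an empty list means all files exist."""
    env = os.environ if env is None else env
    problems = []
    for var, folder in EXPECTED.items():
        name = env.get(var)
        if not name:
            problems.append(f"{var} is not set")
            continue
        path = Path(root) / folder / name
        if not path.is_file():
            problems.append(f"{var} points to a missing file: {path}")
    return problems


if __name__ == "__main__":
    # Hypothetical usage: run this after the values from .env have been loaded.
    for problem in missing_files():
        print(problem)
```

Running a check like this after `load_dotenv()` would turn a confusing model-loading error into a direct message about which value in `.env` needs to change.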

## Downloading the models

You can use more powerful models than the ones I used if you so choose, but if
you just want to run what I tried, you can find the instructions here. Please
note that I have a very low-end GPU and CPU, so I could only use LLMs with a
low parameter count.

Head over to [HuggingFace](https://huggingface.co/) and download the
[bartowski/Qwen2.5-Coder-7B-Instruct-abliterated-GGUF](https://huggingface.co/bartowski/Qwen2.5-Coder-7B-Instruct-abliterated-GGUF/blob/main/Qwen2.5-Coder-7B-Instruct-abliterated-Q4_K_L.gguf)
LLM, and the
[CompendiumLabs/bge-small-en-v1.5-q8_0](https://huggingface.co/CompendiumLabs/bge-small-en-v1.5-gguf/blob/main/bge-small-en-v1.5-q8_0.gguf)
embedding model.

Note that these models should be placed in the `models` folder. If it doesn't
exist, go ahead and make it:

```sh
mkdir models
```

And if you used a different LLM and/or embedding model, make sure to
change the name(s) in the `.env` file.

## Finding PDFs

I made this script just as a novelty, and currently it only reads a single PDF
as data for the RAG. If you want to replicate exactly what I did, I fed the RAG
the Linux Essentials Study Guide from
[LPI](https://learning.lpi.org/en/learning-materials/010-160/). Any PDF that you
want to use should be placed in the `documents` folder. Again, if it doesn't
exist, go ahead and make it:

```sh
mkdir documents
```

And if you used a different PDF document, make sure to change the name in the
`.env` file.
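
Once the PDF is in place, `main.py` extracts its text and greedily packs lines into chunks of roughly 1200 characters before embedding them. The following is a minimal standalone sketch of that packing step, modelled on `chunk_text` in `main.py`, with one assumption added for illustration: empty chunks are filtered out at the end.

```python
def chunk_text(text, max_chars=1200):
    """Greedily pack lines into chunks that stay under roughly max_chars."""
    chunks = []
    current = ""
    for line in text.split("\n"):
        if len(current) + len(line) < max_chars:
            # The line still fits: keep growing the current chunk.
            current += line + "\n"
        else:
            # The line does not fit: close the current chunk and start a new one.
            chunks.append(current.strip())
            current = line + "\n"
    chunks.append(current.strip())
    # Drop empty chunks, which appear when a single line exceeds max_chars.
    return [c for c in chunks if c]


sample = "alpha\nbeta\ngamma\ndelta"
print(chunk_text(sample, max_chars=12))
# prints ['alpha\nbeta', 'gamma\ndelta']
```

Note that a single line longer than `max_chars` is kept whole rather than split, so chunk sizes are a soft limit in this sketch.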

## Running the application

```sh
python main.py
```

The first time you run the application, it will populate the SQLite DB with the
vectorized embeddings, so just let it do its thing. After that initial
population of the database, it should run much faster (especially with GPU
acceleration).
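
The "populate only once" behaviour comes down to checking whether the `chunks` table exists and contains rows. Here is a minimal sketch of that check using only the standard library. The helper name `needs_ingest` is hypothetical, and a plain table stands in for the `vec0` virtual table so the example runs without the sqlite-vec extension.

```python
import sqlite3


def needs_ingest(conn, table="chunks"):
    """Return True when the table is missing or has no rows yet."""
    exists = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name=?", (table,)
    ).fetchone()
    if not exists:
        return True
    # Table names cannot be bound as parameters, so `table` must be trusted input.
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return count == 0


conn = sqlite3.connect(":memory:")
print(needs_ingest(conn))  # True: the table does not exist yet

# Plain table standing in for the vec0 virtual table (assumption for this sketch).
conn.execute("CREATE TABLE chunks (id INTEGER, text TEXT)")
print(needs_ingest(conn))  # True: the table exists but is empty

conn.execute("INSERT INTO chunks VALUES (0, 'hello')")
print(needs_ingest(conn))  # False: data is present, so ingestion is skipped
```

To force a fresh ingestion, for example after switching to a different PDF or embedding model, deleting `vector_db.sqlite` makes the next run start from an empty database.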

## Notes/Disclaimer

It's worth noting that this is a very basic RAG application. It uses
[sqlite-vec](https://github.com/asg017/sqlite-vec) instead of ChromaDB just as an
exploration into alternatives. It doesn't utilize LangChain or LlamaIndex or
bring in a bunch of APIs. It does utilize [llama.cpp](https://llama-cpp.com/)
via [llama-cpp-python](https://llama-cpp-python.readthedocs.io/en/latest/) to
load the LLM and embedding models, as well as
[pypdf](https://pypdf.readthedocs.io/en/stable/) to read the PDF file.
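
One detail worth illustrating: `main.py` normalizes every embedding to unit length before storing it. Assuming the vector table ranks neighbours by Euclidean (L2) distance, which I believe is the sqlite-vec default but have not verified for this version, that normalization makes the L2 ranking identical to a cosine-similarity ranking, because for unit vectors ‖a − b‖² = 2 − 2·(a·b). The sketch below checks this with made-up 3-dimensional vectors in pure Python; none of the numbers come from a real model.

```python
import math


def normalize(v):
    """Scale v to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def l2(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    """Dot product, which equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy "embeddings" (made-up numbers, not real model output).
docs = {
    "a": normalize([1.0, 0.0, 0.0]),
    "b": normalize([1.0, 1.0, 0.0]),
    "c": normalize([0.0, 0.0, 1.0]),
}
query = normalize([2.0, 1.0, 0.0])

by_l2 = sorted(docs, key=lambda k: l2(query, docs[k]))
by_cos = sorted(docs, key=lambda k: cosine(query, docs[k]), reverse=True)
print(by_l2, by_cos)
# prints ['b', 'a', 'c'] ['b', 'a', 'c']
```

Smallest distance first and largest similarity first give the same order, which is why normalizing up front lets a plain distance search behave like a cosine search.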

This project is not meant to be used in any commercial way; it is purely
educational in purpose.
3
env.sample
Normal file
@@ -0,0 +1,3 @@
LLM_MODEL="Qwen2.5-Coder-7B-Instruct-abliterated-Q4_K_L.gguf"
EMBEDDING_MODEL="bge-small-en-v1.5-q8_0.gguf"
PDF_DOCUMENT="LPI-Learning-Material-010-160-en.pdf"
286
main.py
Normal file
@@ -0,0 +1,286 @@
import os
import re
import sqlite3
import sys

import numpy as np
import sqlite_vec
from dotenv import load_dotenv
from llama_cpp import Llama
from pypdf import PdfReader
from sqlite_vec import serialize_float32

load_dotenv()

DEBUG = False
LLM_MODEL = os.getenv("LLM_MODEL")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL")
PDF_DOCUMENT = os.getenv("PDF_DOCUMENT")

os.makedirs("./models", exist_ok=True)
os.makedirs("./documents", exist_ok=True)

conn = sqlite3.connect("./vector_db.sqlite")
conn.enable_load_extension(True)
sqlite_vec.load(conn)


llm = Llama(
    model_path=f"./models/{LLM_MODEL}",
    n_gpu_layers=-1,  # Offload all layers to the GPU; comment out to run on CPU
    n_ctx=6096,  # Context window size in tokens
    verbose=False,
    log_level="error",
    # seed = 1337,  # Uncomment to set a specific seed
    # temperature=0.2,
    # repeat_penalty=1.15,
    # top_p=0.9,
    # top_k=40,
)


_embedding_model = None


def get_embedding_model():
    global _embedding_model
    if _embedding_model is None:
        print("Loading embedding model...")
        _embedding_model = Llama(
            model_path=f"./models/{EMBEDDING_MODEL}",
            embedding=True,
            verbose=False,
            log_level="error",
        )
    return _embedding_model


def init_db(dim: int):
    conn.execute("PRAGMA journal_mode=WAL;")

    conn.execute(f"""
        CREATE VIRTUAL TABLE IF NOT EXISTS chunks USING vec0(
            id INTEGER,
            embedding float[{dim}],
            text TEXT
        );
    """)


def load_pdf(path):
    reader = PdfReader(path)
    text = ""
    empty_pages = 0

    for page in reader.pages:
        page_text = page.extract_text()

        if not page_text:
            empty_pages += 1
            continue

        text += page_text + "\n"

    print(f"Empty pages: {empty_pages}/{len(reader.pages)}")
    print(f"Total extracted chars: {len(text)}")

    return text


def chunk_text(text, max_chars=1200):
    paragraphs = text.split("\n")
    chunks = []
    current = ""

    for p in paragraphs:
        if len(current) + len(p) < max_chars:
            current += p + "\n"
        else:
            chunks.append(current.strip())
            current = p + "\n"

    if current:
        chunks.append(current.strip())

    return [c for c in chunks if c]  # Drop empty chunks


def normalize(vec):
    v = np.array(vec, dtype=np.float32)
    return (v / (np.linalg.norm(v) or 1.0)).tolist()  # "or 1.0" avoids dividing by zero


def embed_chunks(chunks, batch_size=1):
    all_embeddings = []

    model = get_embedding_model()

    for i in range(0, len(chunks), batch_size):
        batch = chunks[i : i + batch_size]

        result = model.create_embedding(batch)
        batch_embeddings = [normalize(e["embedding"]) for e in result["data"]]

        all_embeddings.extend(batch_embeddings)

        print(f"Embedded {i + len(batch)}/{len(chunks)}")

    return all_embeddings


def store_embeddings(chunks, embeddings):
    dim = len(embeddings[0])
    init_db(dim)

    for i, (chunk, emb) in enumerate(zip(chunks, embeddings)):
        conn.execute(
            "INSERT INTO chunks (id, embedding, text) VALUES (?, ?, ?)",
            (i, serialize_float32(emb), chunk),
        )

    conn.commit()


def tokenize(text):
    # TODO: put this in a config file or something
    stop_words = {
        "the",
        "is",
        "a",
        "an",
        "who",
        "what",
        "when",
        "where",
        "why",
        "how",
        "and",
        "or",
        "to",
        "of",
        "in",
        "on",
        "for",
        "with",
        "as",
        "by",
    }
    words = set(re.findall(r"\b\w+\b", text.lower()))
    return {w for w in words if w not in stop_words}


def keyword_score(query, text):
    q = tokenize(query)
    t = tokenize(text)

    if not q:
        return 0

    overlap = q & t

    score = len(overlap) / len(q)

    if query.lower() in text.lower():
        score += 1.0

    return score


def query(question, top_k=3, initial_k=10):
    model = get_embedding_model()

    query_embedding = normalize(
        model.create_embedding([question])["data"][0]["embedding"]
    )

    rows = conn.execute(
        """
        SELECT id, text FROM chunks WHERE embedding MATCH ? AND k = ?
        """,
        (serialize_float32(query_embedding), initial_k),  # type: ignore
    ).fetchall()

    scored = []
    for cid, text in rows:
        score = keyword_score(question, text)
        scored.append((cid, text, score))

    scored.sort(key=lambda x: x[2], reverse=True)

    if DEBUG:
        print("\n--- RETRIEVAL DEBUG ---")
        for cid, text, s in scored[:5]:
            print(f"[{cid}] score={s:.2f} | {text[:120]}\n")

    return [(cid, text) for cid, text, _ in scored[:top_k]]


def ask_llm(context_chunks, question):
    context = "\n\n".join(f"[{cid}] {text}" for cid, text in context_chunks)

    prompt = f"""You are a precise assistant.

Use ONLY the provided context to answer.

Cite sources at the end of your sentences using bracket IDs.

If unsure, say "I don't know based on the provided context."

Context:
{context}

Question:
{question}

Answer:"""

    stream = llm(
        prompt, max_tokens=200, stop=["</s>", "<|end|>", "Question:"], stream=True
    )

    print("\nANSWER:\n")

    for chunk in stream:
        token = chunk["choices"][0]["text"]  # type: ignore
        print(token, end="", flush=True)

    print()


def main():
    print("Loading DB...")

    exists = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name='chunks'"
    ).fetchone()

    count = 0
    if exists:
        count = conn.execute("SELECT COUNT(*) FROM chunks").fetchone()[0]

    if count == 0:
        print("No data found. Ingesting PDF...")

        text = load_pdf(f"./documents/{PDF_DOCUMENT}")
        chunks = chunk_text(text)
        embeddings = embed_chunks(chunks)
        store_embeddings(chunks, embeddings)

    print("\nRAG is ready. Ask questions (type 'exit' to quit)")

    while True:
        print()

        question = input("Question: ").strip()

        if question.lower() in ["exit", "quit"]:
            break

        results = query(question)
        ask_llm(results, question)


if __name__ == "__main__":
    main()
    print("Goodbye!")
    sys.exit()
83
requirements.txt
Normal file
@@ -0,0 +1,83 @@
annotated-doc==0.0.4
annotated-types==0.7.0
anyio==4.13.0
attrs==26.1.0
bcrypt==5.0.0
build==1.5.0
certifi==2026.4.22
charset-normalizer==3.4.7
click==8.3.3
diskcache==5.6.3
durationpy==0.10
fastapi==0.136.1
filelock==3.29.0
flatbuffers==25.12.19
fsspec==2026.4.0
googleapis-common-protos==1.74.0
grpcio==1.80.0
h11==0.16.0
hf-xet==1.4.3
httpcore==1.0.9
httptools==0.7.1
httpx==0.28.1
huggingface_hub==1.13.0
idna==3.13
importlib_metadata==8.7.1
importlib_resources==7.1.0
Jinja2==3.1.6
jsonschema==4.26.0
jsonschema-specifications==2025.9.1
kubernetes==35.0.0
llama_cpp_python==0.3.21
markdown-it-py==4.0.0
MarkupSafe==3.0.3
mdurl==0.1.2
mmh3==5.2.1
numpy==2.4.4
oauthlib==3.3.1
onnxruntime==1.25.1
opentelemetry-api==1.41.1
opentelemetry-exporter-otlp-proto-common==1.41.1
opentelemetry-exporter-otlp-proto-grpc==1.41.1
opentelemetry-proto==1.41.1
opentelemetry-sdk==1.41.1
opentelemetry-semantic-conventions==0.62b1
orjson==3.11.8
overrides==7.7.0
packaging==26.2
protobuf==6.33.6
pybase64==1.4.3
pydantic==2.13.3
pydantic-settings==2.14.0
pydantic_core==2.46.3
Pygments==2.20.0
pypdf==6.10.2
PyPika==0.51.1
pyproject_hooks==1.2.0
python-dateutil==2.9.0.post0
python-dotenv==1.2.2
PyYAML==6.0.3
referencing==0.37.0
requests==2.33.1
requests-oauthlib==2.0.0
rich==15.0.0
rpds-py==0.30.0
shellingham==1.5.4
six==1.17.0
sqlite-vec==0.1.9
sse-starlette==3.4.1
starlette==1.0.0
starlette-context==0.3.6
tenacity==9.1.4
tokenizers==0.23.1
tqdm==4.67.3
typer==0.25.1
typing-inspection==0.4.2
typing_extensions==4.15.0
urllib3==2.6.3
uvicorn==0.46.0
uvloop==0.22.1
watchfiles==1.1.1
websocket-client==1.9.0
websockets==16.0
zipp==3.23.1