## Project Overview
This academic project implements a hierarchical Retrieval-Augmented Generation (RAG) system designed for analyzing Base Erosion and Profit Shifting (BEPS) action reports. The system features intelligent query routing, multi-layer retrieval, and production-ready deployment options.
## Hierarchical Architecture

### Two-Layer Design
```
┌─────────────────────────────────────────────────────┐
│ Layer 1: Keyword/Summary Store                      │
│ ┌───────────┐ ┌────────────┐ ┌───────────┐          │
│ │ Keywords  │ │ Summaries  │ │ Metadata  │          │
│ │ - BEPS    │ │ - Action 1 │ │ - Doc ID  │          │
│ │ - Transfer│ │ - Action 5 │ │ - Page    │          │
│ │ - Pricing │ │ - Action 13│ │ - Section │          │
│ └───────────┘ └────────────┘ └───────────┘          │
└─────────────────────────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────┐
│ Layer 2: Document Store                             │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Full Document Chunks (512 tokens, 50 overlap)   │ │
│ │ - Complete BEPS Action Reports                  │ │
│ │ - Detailed explanations and examples            │ │
│ │ - Regulatory text and guidelines                │ │
│ └─────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
```
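The two layers amount to a pair of lookups: Layer 1 matches the query against keywords/summaries to select candidate documents, and Layer 2 then returns those documents' chunks. The sketch below is illustrative only — the dict-based stores, field names, and keyword-overlap scoring are assumptions standing in for the system's embedding search, not its actual schema:

```python
# Illustrative two-layer lookup. In the real system Layer 1 would be
# an embedding search over keywords/summaries; here keyword overlap
# stands in for it, and the stores are plain dicts (assumed schema).

layer1 = {  # doc_id -> Layer 1 keywords + summary
    "action13": {"keywords": {"beps", "transfer", "pricing", "cbcr"},
                 "summary": "Action 13: transfer pricing documentation"},
    "action5":  {"keywords": {"beps", "harmful", "tax", "practices"},
                 "summary": "Action 5: harmful tax practices"},
}
layer2 = {  # doc_id -> Layer 2 full-document chunks
    "action13": ["CbC reporting applies to MNE groups ...",
                 "... master file and local file requirements ..."],
    "action5":  ["Substantial activity requirements ..."],
}

def retrieve(query):
    """Layer 1: pick the best-matching document; Layer 2: return its chunks."""
    terms = set(query.lower().split())
    best_doc, best_score = None, 0
    for doc_id, entry in layer1.items():
        score = len(terms & entry["keywords"])
        if score > best_score:
            best_doc, best_score = doc_id, score
    if best_doc is None:
        return []
    return layer2[best_doc]

print(retrieve("transfer pricing documentation under beps"))
```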
### Decision-Making Agent Flow
```
User Query → Query Classifier → Decision Engine → Response Router
                                      ↓
                     [RAG] ←→ [Direct] ←→ [Web Search]
```
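A minimal sketch of this routing step follows. The thresholds, the freshness heuristic, and the keyword-overlap confidence score are all illustrative assumptions, not the project's actual decision engine:

```python
# Illustrative decision engine: route a query to RAG, a direct
# answer, or web search. Scoring and thresholds are assumptions.

def score_confidence(query, index_keywords):
    """Fraction of query terms found in the keyword store — a crude
    stand-in for an embedding-based confidence score."""
    terms = {t.lower().strip("?,.") for t in query.split()}
    if not terms:
        return 0.0
    return len(terms & index_keywords) / len(terms)

def route(query, index_keywords):
    """Return 'rag', 'direct', or 'web' for a query."""
    if "latest" in query.lower():
        return "web"            # freshness cue → web fallback
    conf = score_confidence(query, index_keywords)
    if conf >= 0.3:
        return "rag"            # enough overlap with the index
    return "direct"             # general question → answer directly

keywords = {"beps", "action", "transfer", "pricing", "1", "5", "13"}
print(route("Explain BEPS Action 1", keywords))
print(route("Latest BEPS implementation updates for 2024", keywords))
```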
## Technology Stack

| Component             | Role          |
|-----------------------|---------------|
| Python 3.8+           | Core language |
| Sentence-Transformers | Embeddings    |
## Models & Methods

### Embedding Models
```
# Primary Embedding Model
sentence-transformers/all-MiniLM-L6-v2
```

- 384-dimensional vectors
- Optimized for semantic similarity
- Fast inference (critical for hierarchical retrieval)
- English-focused; for non-English BEPS documents a multilingual variant (e.g. `paraphrase-multilingual-MiniLM-L12-v2`) is a better fit
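Queries and stored text are compared by cosine similarity between these 384-dimensional vectors. A minimal pure-Python sketch of that comparison (toy 3-dimensional vectors stand in for the real embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy stand-ins for the 384-dimensional embedding vectors:
query_vec = [0.2, 0.8, 0.1]
chunk_vec = [0.25, 0.75, 0.0]
print(cosine_similarity(query_vec, chunk_vec))
```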
### Retrieval Methods

- **Hierarchical Retrieval**: two-layer approach (keywords first, then full documents)
- **Confidence Scoring**: intelligent routing based on query confidence
- **Web Fallback**: internet search for the latest updates
- **Chunk Processing**: 512-token chunks with 50-token overlap
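The chunking scheme above is a sliding window over the token stream. A sketch, using naive whitespace splitting in place of a real tokenizer:

```python
def chunk_tokens(tokens, size=512, overlap=50):
    """Split a token list into fixed-size chunks; consecutive chunks
    share `overlap` tokens so context is not cut mid-passage."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap  # 462 tokens of new material per chunk
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last chunk reached the end of the document
    return chunks

# Naive whitespace "tokenization", for illustration only:
tokens = ("base erosion and profit shifting report " * 200).split()
chunks = chunk_tokens(tokens)
print(len(tokens), len(chunks), len(chunks[0]))
```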
## Deployment Options

### CPU Deployment

- Backend: llama.cpp with GGUF models
- Container: Ubuntu 20.04 + llama.cpp
- Model: quantized 4-bit for efficiency

```bash
cd deployment/cpu
./deploy_cpu.sh
# Access: http://localhost:8000
```
### GPU Deployment

- Backend: vLLM for high-throughput inference
- Container: CUDA 11.8 + vLLM
- Model: full precision for accuracy

```bash
cd deployment/gpu
./deploy_gpu.sh
# Access: http://localhost:8000
```
## Quick Start Guide

1. **Clone the repository**

   ```bash
   git clone https://github.com/mk-knight23/hierarchical-rag-beps.git
   cd hierarchical-rag-beps
   ```

2. **Choose a deployment**

   Select CPU or GPU deployment based on your hardware.

3. **Run the deployment script**

   ```bash
   # For CPU
   cd deployment/cpu && ./deploy_cpu.sh

   # For GPU
   cd deployment/gpu && ./deploy_gpu.sh
   ```

4. **Test the API**

   ```bash
   curl -X POST http://localhost:8000/query \
     -H "Content-Type: application/json" \
     -d '{"query": "What are BEPS Action 13 requirements?"}'
   ```
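The same call can be made from Python. The endpoint and payload shape follow the curl example; the response handling is an assumption about the API and may need adjusting:

```python
import json
from urllib import request

API_URL = "http://localhost:8000/query"  # endpoint from the curl example

def build_payload(question):
    """JSON body expected by the /query endpoint."""
    return json.dumps({"query": question}).encode("utf-8")

def ask(question, url=API_URL):
    """POST a question to the RAG API and return the parsed response.
    (The response schema is an assumption; adjust to the actual API.)"""
    req = request.Request(
        url,
        data=build_payload(question),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# Usage (requires a running deployment):
#   ask("What are BEPS Action 13 requirements?")
```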
## Test Questions

### Hierarchical RAG Evaluation
- Factual: "What are the four minimum standards under BEPS?"
- Analytical: "How does BEPS Action 5 affect transfer pricing documentation?"
- Procedural: "What are the steps for implementing Country-by-Country reporting?"
### Agent Decision Evaluation
- RAG-Preferred: "Explain BEPS Action 1 regarding digital economy challenges"
- Direct Answer: "What does BEPS stand for?"
- Web Search: "Latest BEPS implementation updates for 2024"