
Knowledge Distillation Pipeline
Extract, distill, and interact with knowledge from any document corpus, and prepare it for other AI use cases
The Knowledge Distillation Pipeline is a modular, end-to-end framework for document ingestion, semantic summarization, and RAG-based question answering. With built-in PDF processing, multi-level summarization, and a conversational UI, it turns unstructured files into structured insight.
The Problem
I had 1,000 pages of new content: too much to read and too little time. I needed a summary free of hallucinations. I wanted to ask questions about the content and get answers back with citations so I could validate them myself. I dreamed of feeding this knowledge to other AI applications for a range of use cases, which meant I would need the content converted to JSON. A few of those use cases would require retrieval-augmented generation (RAG). So I built it.
This pipeline distills document sets (PDFs, DOCX files, images) into semantic chunks and multi-level summaries that power a fast, context-aware RAG chatbot.
Built for research, policy, enterprise workflows, or fun, it turns ad hoc files into persistent knowledge assets.
Core Capabilities
- Google Drive sync with DOCX-to-PDF conversion
- PDF splitting and multimodal (text/image) extraction
- FAISS vector search with semantic chunking
- Multi-level summarization: page, document, executive
- RAG-based chatbot with Gradio UI and source traceability
- Custom prompt templates for different summarization styles
- Configurable retrieval and summarization parameters
- CLI and web-based chat interfaces
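To make the chunking, retrieval, and source-traceability capabilities above concrete, here is a minimal sketch of the idea. It is not the project's code. It assumes paragraph-based chunking, and it swaps in a toy bag-of-words vector with brute-force cosine similarity where the pipeline uses an embedding model and FAISS. The names `Chunk`, `chunk_page`, `embed`, and `retrieve`, and the `max_words` and `k` parameters, are all hypothetical.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str  # hypothetical: file the chunk came from
    page: int    # hypothetical: 1-based page number, kept for citations


def chunk_page(text: str, source: str, page: int, max_words: int = 60) -> list[Chunk]:
    """Group whole paragraphs into chunks of roughly max_words words."""
    chunks, buf, count = [], [], 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        words = len(para.split())
        # Start a new chunk rather than splitting a paragraph in half.
        if buf and count + words > max_words:
            chunks.append(Chunk(" ".join(buf), source, page))
            buf, count = [], 0
        buf.append(para)
        count += words
    if buf:
        chunks.append(Chunk(" ".join(buf), source, page))
    return chunks


def embed(text: str) -> Counter:
    """Toy bag-of-words vector, standing in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[Chunk], k: int = 3) -> list[tuple[float, Chunk]]:
    """Brute-force top-k search, standing in for a FAISS index lookup."""
    q = embed(query)
    scored = [(cosine(q, embed(c.text)), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    corpus = chunk_page(
        "Solar panels convert sunlight into electricity.\n\n"
        "Wind turbines convert wind into electricity.",
        source="energy.pdf", page=1, max_words=6,
    )
    for score, chunk in retrieve("how do wind turbines work", corpus, k=1):
        # Each hit carries its source and page, which is what enables citations.
        print(f"{chunk.text} ({chunk.source}, p. {chunk.page})")
```

Keeping the source file and page number on every chunk is what lets the chatbot cite where an answer came from, whatever index sits underneath.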
Architecture Overview
KnowledgeDistillationPipeline/
├── ingest/ # Google Drive sync, DOCX to PDF
├── extract/ # PDF splitting and text/image parsing
├── summarize/ # Page, document, executive summaries
├── index/ # Chunking and FAISS indexing
├── chat/ # CLI and Gradio chatbot frontends
├── prompts/ # Summary prompt templates
├── config/ # Chunking and summarization parameters
├── main.py # Full pipeline entrypoint
├── main_chatbot.py # Unified chatbot interface
└── utils/ # Helpers and support functions
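The three levels handled by summarize/ can be pictured as a roll-up: page summaries feed a document summary, and document summaries feed the executive summary. The sketch below illustrates that flow under stated assumptions and is not the repository's implementation. `summarize_corpus`, the `Summarizer` signature, and `first_sentence` are hypothetical names, and the placeholder summarizer simply keeps the first sentence where a real one would call an LLM with the matching template from prompts/.

```python
import json
import re
from typing import Callable

# Hypothetical signature: (level, text) -> summary, where level is
# "page", "document", or "executive".
Summarizer = Callable[[str, str], str]


def summarize_corpus(docs: dict[str, list[str]], summarize: Summarizer) -> dict:
    """Roll page text up into page, document, and executive summaries."""
    result = {"documents": {}, "executive": ""}
    doc_summaries = []
    for name, pages in docs.items():
        page_summaries = [summarize("page", page) for page in pages]
        # The document summary is built from page summaries, not raw pages.
        doc_summary = summarize("document", "\n".join(page_summaries))
        result["documents"][name] = {"pages": page_summaries, "summary": doc_summary}
        doc_summaries.append(doc_summary)
    result["executive"] = summarize("executive", "\n".join(doc_summaries))
    return result


def first_sentence(level: str, text: str) -> str:
    """Placeholder summarizer: returns the first sentence and ignores level."""
    return re.split(r"(?<=[.!?])\s+", text.strip())[0]


if __name__ == "__main__":
    docs = {"a.pdf": ["Cats purr. They sleep.", "Dogs bark. They run."]}
    # The nested dict serializes directly to JSON for downstream AI applications.
    print(json.dumps(summarize_corpus(docs, first_sentence), indent=2))
```

Passing the summarizer in as a function keeps the roll-up logic independent of any particular model or prompt style, so the same flow works with different templates.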