Knowledge Distillation Pipeline

Extract, distill, and interact with knowledge from any document corpus, and prepare it for other AI use cases

The Knowledge Distillation Pipeline is a modular, end-to-end framework for document ingestion, semantic summarization, and RAG-based question answering. With built-in PDF processing, multi-level summarization, and a conversational UI, it turns unstructured files into structured insight.

The Problem

I had 1,000 pages of new content: too much to read and too little time. I needed a summary free of hallucinations. I wanted to ask questions about the content and get answers back with citations so I could validate them myself. I also wanted to feed this knowledge to other AI applications for a range of use cases, which meant converting the content to JSON. A few of those use cases would require retrieval-augmented generation (RAG). So I built it.

This pipeline distills document sets (PDFs, DOCX files, images) into semantic chunks and multi-level summaries that power a fast, context-aware RAG chatbot.
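To make the chunk-then-retrieve idea concrete, here is a minimal sketch and not the pipeline's actual code. It splits pages into overlapping word windows, keeps the document name and page number on every chunk so answers can be cited, and ranks chunks against a question. The function names (`chunk_pages`, `retrieve`), the `max_words` and `overlap` parameters, and the sample data are all hypothetical. Bag-of-words cosine similarity stands in for the embedding model and vector index a real RAG setup would use.

```python
import json
import math
import re
from collections import Counter


def chunk_pages(doc_name, pages, max_words=40, overlap=10):
    """Split each page into overlapping word windows, keeping source metadata."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    step = max_words - overlap
    chunks = []
    for page_no, text in enumerate(pages, start=1):
        words = text.split()
        if not words:
            continue  # skip blank pages
        # Stop once the remaining words are already covered by the previous window.
        for start in range(0, max(len(words) - overlap, 1), step):
            chunks.append({
                "doc": doc_name,
                "page": page_no,
                "text": " ".join(words[start:start + max_words]),
            })
    return chunks


def vectorize(text):
    """Toy stand-in for an embedding: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query, with their citations intact."""
    q = vectorize(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorize(c["text"])), reverse=True)
    return ranked[:k]


# Hypothetical two-page document.
pages = [
    "Solar panels convert sunlight into electricity using photovoltaic cells.",
    "Wind turbines convert kinetic energy from moving air into electricity.",
]
chunks = chunk_pages("energy.pdf", pages)
top = retrieve("How do wind turbines work?", chunks, k=1)
print(json.dumps(top, indent=2))
```

Because each chunk is a plain dictionary, the same records can be written out as JSON for other applications, and the `doc` and `page` fields give a chatbot something to cite.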

Built for research, policy, enterprise workflows, or fun, it turns ad hoc files into persistent knowledge assets.

Core Capabilities

Architecture Overview

KnowledgeDistillationPipeline/
├── ingest/               # Google Drive sync, DOCX to PDF
├── extract/              # PDF splitting and text/image parsing
├── summarize/            # Page, document, executive summaries
├── index/                # Chunking and FAISS indexing
├── chat/                 # CLI and Gradio chatbot frontends
├── prompts/              # Summary prompt templates
├── config/               # Chunking and summarization parameters
├── main.py               # Full pipeline entrypoint
├── main_chatbot.py       # Unified chatbot interface
└── utils/                # Helpers and support functions
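
The page, document, and executive levels listed for summarize/ can be pictured as a roll-up, where each level summarizes the output of the level below. The sketch below is an illustration under assumptions and not the code in this repository: `summarize_corpus` and `first_sentence` are hypothetical names, and `first_sentence` is a deliberately trivial placeholder for the model call a real summarizer would make.

```python
import json
from typing import Callable, Dict, List


def first_sentence(text: str, max_chars: int = 200) -> str:
    """Placeholder summarizer: keep only the first sentence of the text."""
    sentence = text.strip().split(". ")[0].rstrip(".")
    return sentence[:max_chars] + "." if sentence else ""


def summarize_corpus(
    corpus: Dict[str, List[str]],
    summarize: Callable[[str], str] = first_sentence,
) -> dict:
    """Roll summaries up from pages to documents to one executive summary."""
    documents = []
    for name, pages in corpus.items():
        # Level 1: one summary per page.
        page_summaries = [summarize(page) for page in pages]
        # Level 2: summarize the page summaries into a document summary.
        doc_summary = summarize(" ".join(page_summaries))
        documents.append({"doc": name, "pages": page_summaries, "summary": doc_summary})
    # Level 3: summarize the document summaries into an executive summary.
    executive = summarize(" ".join(d["summary"] for d in documents))
    return {"executive_summary": executive, "documents": documents}


# Hypothetical corpus: document name mapped to its page texts.
corpus = {
    "a.pdf": ["Cats sleep a lot. They like sun.", "Dogs bark. They fetch."],
    "b.pdf": ["Birds fly. They sing."],
}
result = summarize_corpus(corpus)
print(json.dumps(result, indent=2))
```

Passing the summarizer in as a function keeps the roll-up logic independent of any particular model or prompt template, and the nested result serializes directly to JSON.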