video-answer-agent
AI answer engine that powers Q&A for a video feed through an agentic AI workflow with RAG + MCP
Video Answer AI Agent (@AskAI Feature)
Demo: Watch Here
Description
This application allows users to browse a vertical video feed (similar to TikTok) and ask specific questions about the content by tagging @AskAI in the comments and submitting a question. Our AI agent then provides LLM-generated answers asynchronously, grounded in both the video's content and relevant external knowledge.
It demonstrates a modern agentic AI workflow that leverages Retrieval-Augmented Generation (RAG) to extract relevant video context and the Model Context Protocol (MCP) to communicate via Server-Sent Events (SSE) with our Perplexity MCP Server, executing live web search tools (perplexity_ask or perplexity_reason) powered by Perplexity's Sonar API. The video context and tool call result are synthesized, along with the user query, into a high-quality prompt for an OpenAI LLM, which generates the final answer. The user sees the answer as a reply to their comment from the @AskAI account.
Note: This project is inspired by Perplexity's recent initiative to bring the @AskPerplexity feature to the TikTok For You page (Blog Post: Rebuilding TikTok in America).
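At a high level, the flow described above boils down to three steps, sketched below. This outline is purely illustrative: the helper names (retrieve_video_context, run_mcp_web_search, synthesize_answer) are hypothetical stand-ins for the components detailed in the sections that follow, not functions from this repository.

```python
# Illustrative outline of the @AskAI answer flow (hypothetical helper names).
async def answer_question(video_id: str, question: str) -> str:
    # 1. RAG: embed the question and retrieve the most relevant captions,
    #    summary, and key themes for this specific video.
    video_context = await retrieve_video_context(video_id, question)

    # 2. MCP: an orchestrator picks perplexity_ask or perplexity_reason on the
    #    Perplexity MCP Server and calls it over SSE for live web knowledge.
    web_context = await run_mcp_web_search(question, video_context)

    # 3. Synthesis: combine query + video context + web results into one prompt
    #    for an OpenAI LLM, and return the answer as a reply from @AskAI.
    return await synthesize_answer(question, video_context, web_context)
```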
How It Works: The Two Core Flows
This system operates via two primary workflows: processing videos to build a knowledge base for RAG, and the agentic RAG + MCP pipeline that answers questions about a specific video.
1. Video Processing & Knowledge Base Creation (Pre-computation)
Video Processing Pipeline Diagram:
This process happens before a video appears on the main feed.
The video processing pipeline is run with the Python script /video-processing-pipeline/process_video_pipeline.py. Read more in VIDEO_PROCESSING_PIPELINE.md (in the same directory as the script); a simplified code sketch of the core steps follows the list below.
- Acquisition & Chunking: Downloads the video from a URL, splits it into chunks intelligently (scene detection with PySceneDetect, with a fallback to fixed-duration chunking), and stores important metadata about the video and chunks (timestamps, duration, chunk number, etc.).
- Generating Chunk Captions and Overall Summary: Generates detailed captions for each video chunk (using Gemini 2.5 Pro), then uses the captions to query OpenAI for an overall summary and key themes.
- Indexing for Retrieval: Creates vector embeddings from the captions with OpenAI's text-embedding-ada-002 and indexes them in Pinecone along with relevant metadata, building a searchable knowledge base specific to the video.
- Upload to S3: Video data (an .mp4 file for the full video, .mp4 files for the video chunks, and .json files with the generated chunk captions, overall video summary, key themes, and key metadata about the video/chunks) is uploaded to our S3 bucket. This S3 data, along with our Pinecone index, forms our knowledge base for retrieval-augmented generation (RAG).
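The sketch below is a simplified illustration of the chunking and indexing flow, assuming PySceneDetect, the OpenAI SDK, and the Pinecone SDK. The index name, metadata fields, and the caption_chunk stub (standing in for the Gemini 2.5 Pro captioning call) are assumptions for illustration, not the pipeline's actual code.

```python
import os
from openai import OpenAI
from pinecone import Pinecone
from scenedetect import ContentDetector, detect, split_video_ffmpeg

openai_client = OpenAI()                                  # uses OPENAI_API_KEY
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("video-captions")  # hypothetical index name

def caption_chunk(start_sec: float, end_sec: float) -> str:
    """Stub for the Gemini 2.5 Pro captioning call described above."""
    raise NotImplementedError

def process_video(video_id: str, video_path: str) -> None:
    # 1. Acquisition & Chunking: scene-detection-based splitting (PySceneDetect);
    #    the real pipeline falls back to fixed-duration chunks when needed.
    scenes = detect(video_path, ContentDetector())
    split_video_ffmpeg(video_path, scenes)  # writes one .mp4 chunk per scene

    # 2. Captions -> embeddings -> Pinecone: caption each chunk, embed the
    #    caption, and upsert it with metadata so queries can filter by video_id.
    for i, (start, end) in enumerate(scenes):
        caption = caption_chunk(start.get_seconds(), end.get_seconds())
        embedding = openai_client.embeddings.create(
            model="text-embedding-ada-002", input=caption
        ).data[0].embedding
        index.upsert(vectors=[{
            "id": f"{video_id}-chunk-{i}",
            "values": embedding,
            "metadata": {
                "video_id": video_id,
                "caption": caption,
                "start_sec": start.get_seconds(),
                "end_sec": end.get_seconds(),
            },
        }])
```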
2. Real-time: Answering a Query (@AskAI on the Feed)
RAG + MCP Answer Generation Pipeline
- This is triggered when a user tags @AskAI on a video from the main feed (which is already processed).
- Read the full details in backend/app/BACKEND.md.
- Query Understanding: Embeds the user's question.
- Retrieval (RAG): Queries Pinecone to find the video captions that are most semantically similar to the user query, and extracts the video summary and key themes from our .json file in S3.
- Context Assembly: Builds relevant video context using the query, the video summary/themes, and the retrieved captions.
- MCP Tool Selection & Use (Claude + MCP): We initialize an MCP Client, and an orchestrator (Claude) selects the best available tool from our Perplexity MCP Server (perplexity_ask or perplexity_reason) and synthesizes arguments for the tool based on the context assembled in the previous step. We then call the tool over our MCP connection, getting an answer from the Perplexity MCP Server that is grounded in high-quality internet search results. Read backend/app/perplexity-mcp-server/PERPLEXITY_MCP_SERVER.md for details on our MCP Server (a condensed code sketch follows this list).
- Synthesis: An LLM (gpt-4o-mini) generates the final answer using the query, RAG context, and MCP results.
- Asynchronous Delivery: The answer is generated in the background. The user doesn't have to wait for it; they can continue watching videos, and once the answer is ready they will see it as a reply to their comment.
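Below is a condensed, hedged sketch of the retrieval, MCP tool call, and synthesis steps, assuming the official mcp Python SDK with its SSE client plus the Anthropic, OpenAI, and Pinecone SDKs. The server URL environment variable, index name, model IDs, prompt wording, and tool-argument handling are illustrative assumptions; the project's actual implementation lives in backend/app/ (see BACKEND.md).

```python
import os
from anthropic import Anthropic
from mcp import ClientSession
from mcp.client.sse import sse_client
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()                                  # uses OPENAI_API_KEY
anthropic_client = Anthropic()                            # uses ANTHROPIC_API_KEY
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("video-captions")  # hypothetical index name

async def answer_query(video_id: str, question: str) -> str:
    # Query Understanding + Retrieval (RAG): embed the question and fetch the
    # captions most similar to it, scoped to this video via a metadata filter.
    query_emb = openai_client.embeddings.create(
        model="text-embedding-ada-002", input=question
    ).data[0].embedding
    matches = index.query(vector=query_emb, top_k=5, include_metadata=True,
                          filter={"video_id": video_id}).matches
    video_context = "\n".join(m.metadata["caption"] for m in matches)

    # MCP Tool Selection & Use: connect to the Perplexity MCP Server over SSE,
    # let Claude pick perplexity_ask or perplexity_reason, then call the tool.
    async with sse_client(os.environ["PERPLEXITY_MCP_SSE_URL"]) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = (await session.list_tools()).tools
            choice = anthropic_client.messages.create(
                model="claude-3-5-sonnet-latest",   # illustrative model ID
                max_tokens=512,
                tools=[{"name": t.name, "description": t.description,
                        "input_schema": t.inputSchema} for t in tools],
                tool_choice={"type": "any"},        # force a tool selection
                messages=[{"role": "user", "content":
                           f"Question: {question}\n\nVideo context:\n{video_context}"}],
            )
            tool_use = next(b for b in choice.content if b.type == "tool_use")
            result = await session.call_tool(tool_use.name, arguments=tool_use.input)
            web_context = "\n".join(c.text for c in result.content if c.type == "text")

    # Synthesis: gpt-4o-mini combines the query, RAG context, and MCP result.
    completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
                   f"Question: {question}\n\nVideo context:\n{video_context}\n\n"
                   f"Web search result:\n{web_context}\n\nAnswer the user's question."}],
    )
    return completion.choices[0].message.content
```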
Core Next.js Application Features
- Interactive Video Feed: Scrollable feed displaying pre-processed videos ready for Q&A. The main viewable page consists of the currently playing video and a comment button (for the user to comment and @AskAI a query about the video). Users can scroll down to see a new video.
- @AskAI Feature: Asynchronous Q&A on feed videos via a comment tag (@AskAI), leveraging RAG + MCP for answers.
- Read more about our Next.js application in frontend/for-you-feed/FRONTEND.md
User Experience Flow
- Launch & Browse: Open the app and scroll through the main feed containing fully processed videos.
- Ask on Feed Video (@AskAI):
  - Tap the comment icon on a feed video.
  - Tag @AskAI, type a question, and submit.
  - Continue browsing; the RAG + MCP answer generation runs in the background. The answer will be ready within a few seconds.
- View Answer: Navigate to the relevant video's comment section and read the AI-generated reply.
Architecture Overview
- Frontend (Next.js web application): UI, user interactions, API calls to communicate with Backend
- Backend (FastAPI / Python): API handling, background task management, and the RAG + MCP pipeline for AI answer generation (a minimal endpoint sketch follows this list).
- Storage (AWS S3): Stores raw videos, chunks, and JSON metadata (<video_id>.json, interactions.json).
- Vector DB (Pinecone): Stores and searches video caption embeddings for RAG.
- AI Services: Google Gemini (Captioning), OpenAI (Embedding, Summarization, Final Answer Synthesis), Anthropic Claude (Tool Orchestration), Perplexity (MCP Tools).
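As a minimal sketch of how the Backend could accept an @AskAI request and hand it off to a background task, consider the FastAPI endpoint below. The route path, request model, and run_rag_mcp_pipeline helper are hypothetical; the real endpoints are documented in backend/app/BACKEND.md.

```python
from fastapi import BackgroundTasks, FastAPI
from pydantic import BaseModel

app = FastAPI()

class AskRequest(BaseModel):
    video_id: str
    comment_id: str
    question: str

def run_rag_mcp_pipeline(video_id: str, comment_id: str, question: str) -> None:
    """Hypothetical wrapper around the RAG + MCP pipeline described above;
    it generates the answer and posts it as a reply from the @AskAI account."""
    ...

@app.post("/api/ask", status_code=202)  # illustrative route, not the real one
async def ask(req: AskRequest, background_tasks: BackgroundTasks):
    # Schedule the heavy AI work to run after the response is sent, so the
    # HTTP request returns immediately and the frontend can keep polling
    # for the reply (see Key Trade-offs below).
    background_tasks.add_task(run_rag_mcp_pipeline,
                              req.video_id, req.comment_id, req.question)
    return {"status": "processing"}
```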
Technology Stack
- Frontend: Next.js 15+, TypeScript, React, Tailwind CSS, App Router
- Backend: Python 3.10+, FastAPI, Uvicorn, Docker
- Data Storage and State Representation: AWS S3
- Database: Pinecone (Vector DB)
- AI APIs: OpenAI, Google Cloud (Gemini), Perplexity, Anthropic
- Protocols: Model Context Protocol (MCP)
- Deployment: Docker, AWS ECR, AWS App Runner
Key Dependencies
- Libraries/SDKs: Boto3, Pinecone Client, OpenAI Clients, Google GenAI Clients, Anthropic Client, FastAPI, Next.js, PySceneDetect, yt-dlp, mcp, and the fastmcp library.
- External APIs: AWS, Pinecone, OpenAI, Google Gemini, Perplexity, Anthropic.
Setup & Running Locally
Prerequisites:
- Node.js (v18+), Python (v3.10+), Docker, Git
- AWS Account & configured CLI
- API Keys: Pinecone, OpenAI, Google Cloud (Gemini), Perplexity, Anthropic
1. Clone Repository:
```bash
git clone https://github.com/mhuang448/video-answer-agent.git
cd video-answer-agent
```
2. Backend Setup (backend/):
```bash
cd backend
python -m venv venv && source venv/bin/activate  # or .\venv\Scripts\activate
pip install -r requirements.txt
# Create .env with API keys, AWS config, S3 bucket, Pinecone details
uvicorn app.main:app --reload --port 8000
```
3. Frontend Setup (frontend/):
```bash
cd ../frontend
npm install  # or yarn install
# Create .env.local with BACKEND_API_URL=http://localhost:8000
npm run dev  # or yarn dev
```
Access at http://localhost:3000.
4. MCP Server Setup:
- See documentation in /backend/perplexity-mcp-server/PERPLEXITY_MCP_SERVER.md
5. AWS/S3 Configuration:
- Ensure the S3 bucket exists with a public read policy for .mp4 files and CORS configured for localhost:3000 and the deployed frontend URL (a boto3 sketch of the CORS setup follows these steps).
- Ensure the Backend IAM role/user has S3 permissions.
6. Initial Content:
- Process initial videos using the scripts or the "Process New Video" feature/API endpoint (/api/process_and_query/async) to populate the main feed.
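For the CORS part of step 5, a hedged boto3 sketch is shown below; the bucket name and deployed frontend origin are placeholders you should replace with your own values.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "your-video-answer-agent-bucket"  # placeholder bucket name

# Allow the local dev frontend and the deployed frontend to fetch .mp4 files.
s3.put_bucket_cors(
    Bucket=BUCKET,
    CORSConfiguration={
        "CORSRules": [{
            "AllowedOrigins": ["http://localhost:3000",
                               "https://your-frontend.example.com"],  # placeholder origin
            "AllowedMethods": ["GET", "HEAD"],
            "AllowedHeaders": ["*"],
            "MaxAgeSeconds": 3000,
        }]
    },
)
```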
Key Trade-offs (for Development Speed)
- Polling: Frontend uses setInterval for status checks (vs. WebSockets).
- S3 State: JSON files in S3 manage state (vs. a database like DynamoDB); see the sketch below.
- Basic Background Tasks: FastAPI's BackgroundTasks is used (vs. Celery/RQ).
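To illustrate the S3-state trade-off, here is a minimal boto3 sketch of appending a Q&A record to a per-video interactions.json. The key layout and record fields are assumptions for illustration, not the project's actual schema.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "your-video-answer-agent-bucket"  # placeholder bucket name

def append_interaction(video_id: str, comment: str, answer: str) -> None:
    key = f"{video_id}/interactions.json"  # assumed key layout

    # Read the current state; treat a missing file as an empty list.
    try:
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        interactions = json.loads(body)
    except s3.exceptions.NoSuchKey:
        interactions = []

    # Append the new Q&A record and write the whole file back. Unlike a
    # database such as DynamoDB, this read-modify-write is not atomic,
    # which is part of the trade-off noted above.
    interactions.append({"comment": comment, "answer": answer, "status": "COMPLETE"})
    s3.put_object(Bucket=BUCKET, Key=key,
                  Body=json.dumps(interactions, indent=2).encode("utf-8"))
```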