# Web Scraper MCP Server

## Overview

MCP server (`buddai-scraper`) capable of scraping webpages, Instagram posts, and YouTube videos, storing their content in MongoDB with semantic embeddings for later retrieval. The Docker host configuration is `scrapper.mcp.yml` (note the doubled "p" in the name), which runs the server over STDIO.
## Components

- `index.js` – registers tools and dispatches scraping tasks based on the input URL.
- `web.js` – headless Chromium (`puppeteer`) fetch that converts raw HTML into Markdown via `turndown`.
- `instagram.js` / `youtube.js` – wrappers around Apify actors to collect social media content and subtitles.
- `apify.js` – initialises the Apify client using `APIFY_TOKEN` and executes actors, returning dataset items.
- `mongo.js` – persists documents in the `scraper` collection, generates embeddings, and performs vector search.
- `embeddings.js` / `models/` – local ONNX (`nomic-embed-text-v1.5`) embeddings, identical stack to the notes service.
- `Dockerfile` – Node 22 multi-stage build; bundles production deps and the embedding model, and runs `node index.js` as user `mcp`.
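The URL-based dispatch in `index.js` can be sketched roughly as follows. This is an illustrative assumption, not the actual export: the function name and the exact routing rules are hypothetical, but the idea (inspect the hostname, fall back to the generic web scraper) matches the description above.

```javascript
// Hypothetical sketch of the dispatch logic in index.js.
// Routes a URL to one of the three scraper modules by hostname.
function pickScraper(rawUrl) {
  let host;
  try {
    host = new URL(rawUrl).hostname.replace(/^www\./, '');
  } catch {
    throw new Error(`Invalid URL: ${rawUrl}`);
  }
  if (host === 'instagram.com' || host.endsWith('.instagram.com')) return 'instagram';
  if (host === 'youtube.com' || host.endsWith('.youtube.com') || host === 'youtu.be') return 'youtube';
  return 'web'; // fall back to the headless-Chromium web scraper
}
```

Anything that is not an Instagram or YouTube hostname would fall through to `web.js`.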
## Exposed MCP tools

| Tool | Purpose | Input |
|---|---|---|
| `addWebsiteToDB` | Scrapes a URL (web/Instagram/YouTube) and stores the processed content if it has not been saved before. | `url` (must be valid). |
| `searchDocumentationByPrompt` | Semantic search across stored documents in MongoDB. | `prompt`. |
Saved documents include fields such as `url`, `content`, media-specific metadata, `embedding`, `timestamp`, and the Apify output footprint. `saveWebData` upserts records keyed by URL to avoid duplicates.
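A minimal sketch of the upsert shape `saveWebData` likely issues against the `scraper` collection. The helper name and the `timestamp` defaulting are assumptions; only the `url`, `content`, `embedding`, and `timestamp` field names come from the description above.

```javascript
// Hedged sketch: build the arguments for a URL-keyed upsert.
// In practice these would be passed to collection.updateOne(filter, update, options).
function buildUpsert(doc) {
  return {
    filter: { url: doc.url }, // key by URL so repeated scrapes don't duplicate
    update: {
      $set: {
        ...doc,
        timestamp: doc.timestamp ?? new Date().toISOString(), // assumed default
      },
    },
    options: { upsert: true },
  };
}
```

Keying the filter on `url` alone is what makes a re-scrape overwrite the existing document instead of inserting a second copy.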
## Configuration

`scrapper.mcp.yml` launches the container:

```yaml
scraper:
  transport: stdio
  command: docker
  args:
    - run
    - --init
    - -i
    - --rm
    - --network
    - buddai_net
    - -e
    - APIFY_TOKEN
    - -e
    - MONGO_URI
    - buddai/mcp-scraper
  env:
    APIFY_TOKEN: "${APIFY_TOKEN}"
    MONGO_URI: "${MONGO_URI}"
```
Required environment variables:

- `APIFY_TOKEN` – authorises Apify actors (`apify/instagram-scraper`, `streamers/youtube-scraper`).
- `MONGO_URI` – target MongoDB connection string (Atlas Search enabled for vector queries).
- `DEBUG` – optional (`mcp_scraper:*`).
Puppeteer depends on Chromium packages bundled with the Node image; ensure the container has the necessary system dependencies when running outside Docker.
## Running locally

- Install Node.js ≥ 20 and run `npm install` (installs puppeteer and the Apify SDK).
- Export `MONGO_URI`, `APIFY_TOKEN`, and optionally `DEBUG`.
- Download the embedding model into `models/` if it is not present.
- Start MongoDB (Atlas/replica set) and run `node index.js` to expose the STDIO server.
## Docker usage

```bash
docker build -t buddai/mcp-scraper mcp_servers/scraper
```

Run per the host configuration:

```bash
docker network create buddai_net  # once
APIFY_TOKEN=your-apify-token \
MONGO_URI=mongodb://mongo:27017/buddai \
docker run --rm --init -i \
  --network buddai_net \
  -e APIFY_TOKEN \
  -e MONGO_URI \
  -e DEBUG=mcp_scraper:* \
  buddai/mcp-scraper
```
## Development notes

- `existsUrl` closes the Mongo client after each call, so repeated writes reacquire connections; optimise this if batch throughput is required.
- Apify actors can be slow or rate limited; consider queueing or caching results when integrating in a live agent loop.
- Puppeteer launches in `headless: 'new'`; adjust launch options if your environment needs a different Chromium binary.
- Tests under `test.js` and friends reference outdated tooling.
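One way to address the `existsUrl` reconnect cost is to memoise a single client and reuse it across calls. The sketch below uses a stand-in factory; in practice `createClient` would be something like `() => new MongoClient(process.env.MONGO_URI).connect()`, and the names here are hypothetical.

```javascript
// Hedged sketch: lazily create one shared client promise instead of
// opening and closing a connection on every existsUrl/save call.
let clientPromise = null;

function getClient(createClient) {
  if (!clientPromise) {
    // First caller triggers creation; everyone else awaits the same promise.
    clientPromise = Promise.resolve(createClient());
  }
  return clientPromise;
}
```

Callers would `await getClient(...)` and never call `close()` per operation; the client is closed once on process shutdown instead.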
## Troubleshooting

- Puppeteer launch errors: install the required system libraries (`libnss3`, `libatk1.0`, etc.) or run inside the provided Docker image.
- Apify actor failures: confirm the token has access to the actors and that usage limits are not exceeded.
- Semantic search returns empty: verify `vectorSearchOnEmbedding` index creation (automatic) and ensure embeddings exist for stored entries.
- Duplicate scrape warning: the service de-duplicates by URL; remove the document from MongoDB if you need to scrape again.
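For reference when debugging empty search results, an Atlas `$vectorSearch` query over the `vectorSearchOnEmbedding` index has roughly the shape below. This builder is a hedged sketch: the index name and `embedding` path come from this document, but the candidate multiplier and the projected fields are assumptions.

```javascript
// Hypothetical sketch of the aggregation pipeline behind
// searchDocumentationByPrompt, to be passed to collection.aggregate().
function buildVectorSearchPipeline(queryVector, limit = 5) {
  return [
    {
      $vectorSearch: {
        index: 'vectorSearchOnEmbedding', // index name from this README
        path: 'embedding',                // field holding the ONNX embedding
        queryVector,                      // embedding of the user's prompt
        numCandidates: limit * 20,        // assumed oversampling factor
        limit,
      },
    },
    { $project: { url: 1, content: 1, score: { $meta: 'vectorSearchScore' } } },
  ];
}
```

If this pipeline returns nothing, check that stored documents actually contain an `embedding` array and that the index covers the `embedding` path.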