
Web Scraper MCP Server

Overview

The buddai-scraper MCP server scrapes webpages, Instagram posts, and YouTube videos, storing their content in MongoDB with semantic embeddings for later retrieval. The Docker host configuration is scrapper.mcp.yml (note the double "p" in the filename), which runs the server over STDIO.

Components

  • index.js – registers tools and dispatches scraping tasks based on the input URL.
  • web.js – headless Chromium (puppeteer) fetch that converts raw HTML into Markdown via turndown.
  • instagram.js / youtube.js – wrappers around Apify actors to collect social media content and subtitles.
  • apify.js – initialises the Apify client using APIFY_TOKEN and executes actors, returning dataset items.
  • mongo.js – persists documents in the scraper collection, generates embeddings, and performs vector search.
  • embeddings.js / models/ – local ONNX (nomic-embed-text-v1.5) embeddings, identical stack to the notes service.
  • Dockerfile – Node 22 multi-stage build; bundles production deps, the embedding model, and runs node index.js as user mcp.
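As a rough illustration of how index.js routes a URL to the right scraper, the dispatch can be sketched as a pure function (the function name and exact host checks here are assumptions, not the actual exports):

```javascript
// Illustrative sketch of URL-based dispatch: Instagram and YouTube URLs go to
// the Apify wrappers, everything else falls back to the puppeteer + turndown path.
function classifyUrl(url) {
  const { hostname } = new URL(url); // throws on an invalid URL, mirroring input validation
  if (/(^|\.)instagram\.com$/.test(hostname)) return 'instagram';
  if (/(^|\.)youtube\.com$/.test(hostname) || hostname === 'youtu.be') return 'youtube';
  return 'web'; // generic webpage scrape
}

console.log(classifyUrl('https://www.instagram.com/p/abc/')); // instagram
```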

Exposed MCP tools

| Tool | Purpose | Input |
| --- | --- | --- |
| addWebsiteToDB | Scrapes a URL (web/Instagram/YouTube) and stores the processed content if it has not been saved before. | url (must be valid) |
| searchDocumentationByPrompt | Semantic search across stored documents in MongoDB. | prompt |
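For reference, an MCP client invokes addWebsiteToDB over STDIO with a standard JSON-RPC tools/call request (the URL below is illustrative):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "addWebsiteToDB",
    "arguments": { "url": "https://example.com/docs" }
  }
}
```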

Saved documents include fields such as url, content, media-specific metadata, embedding, timestamp, and the associated Apify actor output. saveWebData upserts records keyed by URL to avoid duplicates.
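The upsert that saveWebData performs can be sketched roughly as follows (a helper that builds the arguments for the MongoDB driver's updateOne; the helper name and exact field handling are assumptions):

```javascript
// Sketch of an upsert keyed by URL: the filter matches on url, and
// { upsert: true } inserts the document when the URL has not been saved before.
function buildUpsert(doc) {
  return {
    filter: { url: doc.url },                                  // de-duplicate by URL
    update: { $set: { ...doc, timestamp: doc.timestamp ?? new Date() } },
    options: { upsert: true },                                 // insert when new
  };
}

// Usage (with a real client: collection.updateOne(filter, update, options)):
const { filter, options } = buildUpsert({ url: 'https://example.com', content: '# Example' });
console.log(filter.url, options.upsert); // https://example.com true
```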

Configuration

scrapper.mcp.yml launches the container:

```yaml
scraper:
  transport: stdio
  command: docker
  args:
    - run
    - --init
    - -i
    - --rm
    - --network
    - buddai_net
    - -e
    - APIFY_TOKEN
    - -e
    - MONGO_URI
    - buddai/mcp-scraper
  env:
    APIFY_TOKEN: "${APIFY_TOKEN}"
    MONGO_URI: "${MONGO_URI}"
```

Required environment variables:

  • APIFY_TOKEN – authorises Apify actors (apify/instagram-scraper, streamers/youtube-scraper).
  • MONGO_URI – target MongoDB connection string (Atlas Search enabled for vector queries).
  • DEBUG – optional; set to mcp_scraper:* for verbose logging.

Puppeteer relies on Chromium system libraries bundled into the Docker image; when running outside Docker, ensure the host has the necessary system dependencies installed.

Running locally

  1. Install Node.js ≥ 20 and run npm install (installs puppeteer and Apify SDK).
  2. Export MONGO_URI, APIFY_TOKEN, and optionally DEBUG.
  3. Download the embedding model into models/ if not present.
  4. Start MongoDB (Atlas/replica set) and run node index.js to expose the STDIO server.

Docker usage

```bash
docker build -t buddai/mcp-scraper mcp_servers/scraper
```

Run per the host configuration:

```bash
docker network create buddai_net                  # once
APIFY_TOKEN=your-apify-token \
MONGO_URI=mongodb://mongo:27017/buddai \
docker run --rm --init -i \
  --network buddai_net \
  -e APIFY_TOKEN \
  -e MONGO_URI \
  -e DEBUG=mcp_scraper:* \
  buddai/mcp-scraper
```

Development notes

  • existsUrl closes the Mongo client after each call, so repeated writes reacquire connections; optimise if batch throughput is required.
  • Apify actors can be slow or rate limited; consider queueing or caching results when integrating in a live agent loop.
  • Puppeteer launches in headless: 'new'; adjust launch options if your environment needs a different Chromium binary.
  • Tests under test.js and friends reference outdated tooling.
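A minimal sketch of overriding the launch options mentioned above, for environments that need a different Chromium binary (reading PUPPETEER_EXECUTABLE_PATH here is an assumed convention, not something web.js necessarily does):

```javascript
// Launch options matching the headless: 'new' mode used by web.js, with an
// optional executablePath override and flags commonly needed inside containers.
const launchOptions = {
  headless: 'new',                                   // mode used by web.js
  args: ['--no-sandbox', '--disable-dev-shm-usage'], // common container flags
  ...(process.env.PUPPETEER_EXECUTABLE_PATH
    ? { executablePath: process.env.PUPPETEER_EXECUTABLE_PATH }
    : {}),
};
// const browser = await puppeteer.launch(launchOptions);
console.log(launchOptions.headless); // new
```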

Troubleshooting

  • Puppeteer launch errors: install required system libraries (libnss3, libatk1.0, etc.) or run inside the provided Docker image.
  • Apify actor failures: confirm the token has access to the actors and that usage limits are not exceeded.
  • Semantic search returns empty: verify vectorSearchOnEmbedding index creation (automatic) and ensure embeddings exist for stored entries.
  • Duplicate scrape warning: the service de-duplicates by URL; remove the document from MongoDB if you need to scrape again.
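For debugging empty search results, it helps to know the rough shape of the Atlas aggregation behind searchDocumentationByPrompt. A hedged sketch of the pipeline follows; the index name and embedding field come from the source above, while the helper name and the numCandidates/limit values are assumptions:

```javascript
// Builds an Atlas Vector Search pipeline: $vectorSearch over the stored
// embeddings, then a projection that surfaces the similarity score.
function buildVectorSearchPipeline(queryVector) {
  return [
    {
      $vectorSearch: {
        index: 'vectorSearchOnEmbedding', // index referenced in mongo.js
        path: 'embedding',                // field holding the document embedding
        queryVector,                      // embedding of the search prompt
        numCandidates: 100,
        limit: 5,
      },
    },
    { $project: { url: 1, content: 1, score: { $meta: 'vectorSearchScore' } } },
  ];
}

const pipeline = buildVectorSearchPipeline([0.1, 0.2, 0.3]);
console.log(pipeline[0].$vectorSearch.index); // vectorSearchOnEmbedding
```

If this stage returns nothing, confirm the index exists on the embedding field and that stored documents actually contain non-empty embedding arrays.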