scraper

Structured extraction and cleanup for public, user-authorized web pages. Use when the user wants to collect, clean, summarize, or transform content from accessible pages into reusable text or data. Do not use to bypass logins, paywalls, captchas, robots restrictions, or access controls. Local-only output.

Stars: 15
Source: dvcrn/openclaw-skills-marketplace
Updated: 2026-05-29
Slug: dvcrn--openclaw-skills-marketplace--scraper

Install (copy and paste into any project)

mkdir -p .claude/skills && curl -fsSL https://raw.githubusercontent.com/dvcrn/openclaw-skills-marketplace/HEAD/plugins/agistack--scraper/skills/scraper/SKILL.md -o .claude/skills/scraper.md

Drops the SKILL.md into .claude/skills/scraper.md. Works with Claude Code, Cursor, and any agent that loads SKILL.md files from .claude/skills/.

Scraper

Turn messy public pages into clean, reusable data.

Core Purpose

Scraper is a safe extraction skill for public, user-authorized pages. It helps the agent:

  • fetch page content from a URL
  • extract readable text
  • strip boilerplate where possible
  • save clean output locally
  • prepare content for later summarization or analysis
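
The text-extraction and boilerplate-stripping steps listed above could look roughly like the sketch below. It uses only the standard library, which matches the skill's stated requirements, but it is not the skill's actual extract_text.py: the SKIP_TAGS set, the TextExtractor class, and the html_to_text function are all hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags usually hold scripts, styling, or page chrome
# rather than readable content. The real skill may use a different list.
SKIP_TAGS = {"script", "style", "noscript", "nav", "header", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collect visible text while ignoring everything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Collapse runs of whitespace and drop empty fragments.
            text = " ".join(data.split())
            if text:
                self._chunks.append(text)

    def text(self):
        # Each text fragment ends up on its own line, so inline markup
        # such as <b> splits a sentence across lines in this simple sketch.
        return "\n".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the readable text of an HTML document as plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling html_to_text on a fetched page body returns one line per text fragment, which is usually enough as input for later summarization.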

Safety Boundaries

  • Only use on public or user-authorized pages
  • Do not bypass logins, paywalls, captchas, robots restrictions, or rate limits
  • Do not request or store credentials
  • Do not perform stealth scraping, account creation, or identity evasion
  • Save outputs locally only
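
One way to honor the robots restriction above is to check a site's robots.txt before fetching anything. The sketch below uses the standard library's urllib.robotparser. The source does not say whether the skill's scripts perform such a check, and the USER_AGENT value and the is_allowed helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user agent string; the skill's real value is not documented.
USER_AGENT = "scraper-skill-example"


def is_allowed(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the given robots.txt content permits fetching url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so the rules can be checked
    # without this function making any network request itself.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

In practice the robots.txt body would be downloaded once per host and passed in here, and a False result would mean the page is skipped.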

Runtime Requirements

  • Python 3 must be available as python3
  • No external packages required

Local Storage

All outputs are stored locally under:

  • ~/.openclaw/workspace/memory/scraper/jobs.json
  • ~/.openclaw/workspace/memory/scraper/output/
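
As an illustration of how this layout might be used, the sketch below writes extracted text into output/ and appends a record to jobs.json. The record fields (id, url, title, output_path, created_at), the hash-based file name, and the register_job and load_jobs helpers are assumptions; the source does not document the actual jobs.json schema.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

# Storage root taken from the paths listed above.
DEFAULT_ROOT = Path.home() / ".openclaw" / "workspace" / "memory" / "scraper"


def load_jobs(root=DEFAULT_ROOT):
    """Return the list of recorded jobs, or an empty list if none exist."""
    jobs_path = Path(root) / "jobs.json"
    if not jobs_path.exists():
        return []
    return json.loads(jobs_path.read_text(encoding="utf-8"))


def register_job(url, title, text, root=DEFAULT_ROOT):
    """Save text under output/ and append a job record to jobs.json."""
    root = Path(root)
    output_dir = root / "output"
    output_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical naming scheme: a short hash of the URL.
    job_id = hashlib.sha256(url.encode("utf-8")).hexdigest()[:12]
    output_path = output_dir / f"{job_id}.txt"
    output_path.write_text(text, encoding="utf-8")

    job = {
        "id": job_id,
        "url": url,
        "title": title,
        "output_path": str(output_path),
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    jobs = load_jobs(root)
    jobs.append(job)
    (root / "jobs.json").write_text(json.dumps(jobs, indent=2), encoding="utf-8")
    return job


if __name__ == "__main__":
    # Demo against a throwaway directory instead of the real storage root.
    with tempfile.TemporaryDirectory() as tmp:
        register_job("https://example.com", "Example", "Example Domain", root=tmp)
        print(load_jobs(tmp))
```

Keeping the registry as a single JSON list makes listing prior jobs trivial, at the cost of rewriting the whole file on every save.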

Key Workflows

  • Capture a page: fetch_page.py --url "https://example.com"
  • Extract readable text: extract_text.py --url "https://example.com"
  • Save cleaned content: save_output.py --url "https://example.com" --title "Example"
  • List prior jobs: list_jobs.py

Scripts

Script            Purpose
init_storage.py   Initialize scraper storage
fetch_page.py     Download a page with standard headers
extract_text.py   Convert HTML into cleaned plain text
save_output.py    Save extracted output and register a job
list_jobs.py      Show past scraping jobs
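
For readers wondering what "download a page with standard headers" might involve, here is a minimal standard-library sketch. It is not the contents of fetch_page.py: the header values, the 15-second timeout, and the build_request and fetch_page functions are assumptions made for illustration.

```python
import urllib.request

# Hypothetical header set; the headers the real script sends are not documented.
STANDARD_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; scraper-skill-example/0.1)",
    "Accept": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}


def build_request(url: str) -> urllib.request.Request:
    """Build a GET request for an http(s) URL with the headers above."""
    if not url.startswith(("http://", "https://")):
        # Refuse file:// and other schemes so only web pages are fetched.
        raise ValueError(f"unsupported URL scheme: {url!r}")
    return urllib.request.Request(url, headers=STANDARD_HEADERS)


def fetch_page(url: str, timeout: float = 15.0) -> str:
    """Download a page and decode it using the charset the server declares."""
    request = build_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A call such as fetch_page("https://example.com") returns the page body as a string, ready to be handed to a text-extraction step.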