Archive Acquisition Skill
Specialized patterns and techniques for acquiring media content from Internet Archive (archive.org) and other archival sources. Focuses on bulk downloads, quality selection, format filtering, and collection management.
Internet Archive Overview
Internet Archive hosts massive collections of audio, video, text, and software. Key characteristics:
- Open access: Most content freely downloadable
- Multiple formats: Same item often available in FLAC, MP3, OGG, etc.
- Collections: Items grouped by uploader, topic, or curator
- Metadata: Rich JSON metadata for every item
- API access: Full programmatic access via REST API
Discovery Patterns
Search API
Find items matching specific criteria:
# Search for items by keyword and mediatype
search_archive() {
local query="$1"
local mediatype="$2" # audio, movies, texts, etc.
local rows="${3:-100}" # Results per page
curl -s "https://archive.org/advancedsearch.php" \
-d "q=$query AND mediatype:$mediatype" \
-d "fl[]=identifier,title,creator,date,format" \
-d "rows=$rows" \
-d "output=json" \
| jq '.response.docs[]'
}
# Example: Search for Grateful Dead concerts
search_archive "grateful dead" "audio" 50
Response Format
{
"identifier": "gd1977-05-08.sbd.miller.97065.flac16",
"title": "Grateful Dead Live at Barton Hall on 1977-05-08",
"creator": "Grateful Dead",
"date": "1977-05-08",
"format": ["Flac", "VBR MP3", "Ogg Vorbis", "Metadata"]
}
Collection Browsing
List all items in a collection:
# Get items from a specific collection
get_collection_items() {
local collection="$1"
local rows="${2:-1000}"
curl -s "https://archive.org/advancedsearch.php" \
-d "q=collection:$collection" \
-d "fl[]=identifier,title,format" \
-d "rows=$rows" \
-d "output=json" \
| jq -r '.response.docs[] | .identifier'
}
# Example: Get all items from GratefulDead collection
get_collection_items "GratefulDead" 500
Advanced Query Patterns
# Find items by date range
search_by_date() {
local creator="$1"
local start_date="$2" # YYYY-MM-DD
local end_date="$3"
curl -s "https://archive.org/advancedsearch.php" \
-d "q=creator:\"$creator\" AND date:[$start_date TO $end_date]" \
-d "fl[]=identifier,title,date" \
-d "rows=100" \
-d "output=json" \
| jq '.response.docs[]'
}
# Find items with specific format available
search_by_format() {
local query="$1"
local format="$2" # FLAC, MP3, etc.
curl -s "https://archive.org/advancedsearch.php" \
-d "q=$query AND format:$format" \
-d "fl[]=identifier,title,format" \
-d "rows=100" \
-d "output=json" \
| jq '.response.docs[]'
}
# Example: Find all FLAC recordings of specific artist
search_by_format "pink floyd" "Flac"
Item Metadata Retrieval
Get Item Details
# Retrieve full metadata for an item
get_item_metadata() {
local identifier="$1"
curl -s "https://archive.org/metadata/$identifier" \
| jq '.'
}
# Extract specific metadata fields
get_item_files() {
local identifier="$1"
curl -s "https://archive.org/metadata/$identifier" \
| jq -r '.files[] | select(.format != "Metadata") | "\(.name)\t\(.format)\t\(.size)"'
}
# Example output:
# gd77-05-08d1t01.flac Flac 45123456
# gd77-05-08d1t01.mp3 VBR MP3 12345678
Filter by Format
# Get only FLAC files from an item
get_flac_files() {
local identifier="$1"
curl -s "https://archive.org/metadata/$identifier" \
| jq -r '.files[] | select(.format == "Flac") | .name'
}
# Get best quality audio files (prefer FLAC, fallback to 320K MP3)
get_best_audio_files() {
local identifier="$1"
# Try FLAC first
local flac_files=$(curl -s "https://archive.org/metadata/$identifier" \
| jq -r '.files[] | select(.format == "Flac") | .name')
if [[ -n "$flac_files" ]]; then
echo "$flac_files"
return 0
fi
# Fallback to highest bitrate MP3
curl -s "https://archive.org/metadata/$identifier" \
| jq -r '.files[] | select(.format == "VBR MP3" or .format == "320Kbps MP3") | .name'
}
Download Patterns
Single Item Download
# Download all files from a single item
download_item() {
local identifier="$1"
local output_dir="$2"
local format_filter="$3" # Optional: FLAC, MP3, etc.
mkdir -p "$output_dir"
# Get file list
local files
if [[ -n "$format_filter" ]]; then
files=$(curl -s "https://archive.org/metadata/$identifier" \
| jq -r ".files[] | select(.format == \"$format_filter\") | .name")
else
files=$(curl -s "https://archive.org/metadata/$identifier" \
| jq -r '.files[] | select(.format != "Metadata") | .name')
fi
# Download each file
while IFS= read -r file; do
echo "Downloading: $file"
wget -q --show-progress \
"https://archive.org/download/$identifier/$file" \
-P "$output_dir"
done <<< "$files"
}
# Example: Download all FLAC files from an item
download_item "gd1977-05-08.sbd.miller.97065.flac16" "/mnt/archive/grateful-dead/1977-05-08" "Flac"
Recursive Collection Download
# Download entire collection with format filtering
download_collection() {
local collection="$1"
local output_base="$2"
local format="$3" # FLAC, MP3, etc.
local max_items="${4:-100}"
# Get collection items
local items=$(get_collection_items "$collection" "$max_items")
local count=0
while IFS= read -r identifier; do
count=$((count + 1))
echo "[$count/$max_items] Downloading: $identifier"
# Create item directory
local item_dir="$output_base/$identifier"
mkdir -p "$item_dir"
# Download with format filter
download_item "$identifier" "$item_dir" "$format"
# Download metadata
curl -s "https://archive.org/metadata/$identifier" \
> "$item_dir/.curator/metadata.json"
# Rate limiting (be nice to archive.org)
sleep 2
done <<< "$items"
}
# Example: Download first 50 FLAC items from collection
download_collection "GratefulDead" "/mnt/archive/grateful-dead" "Flac" 50
Parallel Collection Download
# Download collection items in parallel (with concurrency limit)
download_collection_parallel() {
local collection="$1"
local output_base="$2"
local format="$3"
local max_concurrent="${4:-3}"
# Get collection items
local items=$(get_collection_items "$collection" 1000)
# Create job queue
local queue_file="/tmp/curator-archive-queue-$$.txt"
echo "$items" > "$queue_file"
# Process queue with concurrency limit
cat "$queue_file" | xargs -P "$max_concurrent" -I {} bash -c "
identifier={}
item_dir=\"$output_base/\$identifier\"
mkdir -p \"\$item_dir\"
echo \"Downloading: \$identifier\"
# Download best quality audio
files=\$(curl -s \"https://archive.org/metadata/\$identifier\" \
| jq -r '.files[] | select(.format == \"$format\") | .name')
while IFS= read -r file; do
wget -q --show-progress \
\"https://archive.org/download/\$identifier/\$file\" \
-P \"\$item_dir\"
done <<< \"\$files\"
# Save metadata
curl -s \"https://archive.org/metadata/\$identifier\" \
> \"\$item_dir/.curator/metadata.json\"
sleep 2 # Rate limiting
"
rm "$queue_file"
}
wget Recursive Download
# Use wget's recursive mode for efficient bulk download
download_with_wget_recursive() {
local identifier="$1"
local output_dir="$2"
local format_pattern="$3" # e.g., "*.flac" or "*.mp3"
mkdir -p "$output_dir"
wget --recursive \
--no-parent \
--no-directories \
--accept "$format_pattern" \
--directory-prefix="$output_dir" \
--wait=2 \
--random-wait \
"https://archive.org/download/$identifier/"
}
# Example: Download all FLAC files from item
download_with_wget_recursive "gd1977-05-08.sbd.miller.97065.flac16" \
"/mnt/archive/gd-1977-05-08" \
"*.flac"
Quality Selection Strategy
Format Preference Hierarchy
For audio archival, prefer formats in this order:
- FLAC - Lossless, widely supported
- 320Kbps MP3 - High quality lossy, universal compatibility
- VBR MP3 - Variable bitrate, good quality/size ratio
- Ogg Vorbis - Good quality, open format
- 128Kbps MP3 - Acceptable for spoken word, not music
Automatic Quality Selection
# Select best available format for an item
select_best_format() {
local identifier="$1"
# Get available formats
local formats=$(curl -s "https://archive.org/metadata/$identifier" \
| jq -r '.files[] | .format' | sort -u)
# Check in preference order
if echo "$formats" | grep -q "Flac"; then
echo "Flac"
elif echo "$formats" | grep -q "320Kbps MP3"; then
echo "320Kbps MP3"
elif echo "$formats" | grep -q "VBR MP3"; then
echo "VBR MP3"
elif echo "$formats" | grep -q "Ogg Vorbis"; then
echo "Ogg Vorbis"
else
echo "$(echo "$formats" | grep -i mp3 | head -1)"
fi
}
# Download using best available format
download_best_quality() {
local identifier="$1"
local output_dir="$2"
local best_format=$(select_best_format "$identifier")
echo "Selected format: $best_format"
download_item "$identifier" "$output_dir" "$best_format"
}
Size vs Quality Trade-offs
# Estimate download size before acquiring
estimate_download_size() {
local identifier="$1"
local format="$2"
local total_size=$(curl -s "https://archive.org/metadata/$identifier" \
| jq "[.files[] | select(.format == \"$format\") | .size | tonumber] | add")
# Convert to human-readable
local size_mb=$((total_size / 1048576))
local size_gb=$((size_mb / 1024))
if [[ $size_gb -gt 0 ]]; then
echo "${size_gb}GB"
else
echo "${size_mb}MB"
fi
}
# Check if size is acceptable before downloading
download_if_acceptable() {
local identifier="$1"
local output_dir="$2"
local format="$3"
local max_size_gb="${4:-5}" # Default 5GB limit
local estimated_size=$(estimate_download_size "$identifier" "$format")
local size_value=$(echo "$estimated_size" | sed 's/[^0-9]//g')
local size_unit=$(echo "$estimated_size" | sed 's/[0-9]//g')
if [[ "$size_unit" == "GB" && $size_value -gt $max_size_gb ]]; then
echo "SKIPPED: $identifier ($estimated_size exceeds ${max_size_gb}GB limit)"
return 1
fi
echo "Downloading: $identifier ($estimated_size)"
download_item "$identifier" "$output_dir" "$format"
}
Metadata Extraction and Storage
Extract Archive Metadata
# Convert Archive.org metadata to curator format
extract_archive_metadata() {
local identifier="$1"
local output_file="$2"
curl -s "https://archive.org/metadata/$identifier" | jq '{
source: "archive.org",
identifier: .metadata.identifier,
title: .metadata.title,
creator: .metadata.creator,
date: .metadata.date,
venue: .metadata.venue,
coverage: .metadata.coverage,
description: .metadata.description,
collection: .metadata.collection,
downloads: .item_size.downloads,
avg_rating: .reviews.info.avg_rating,
num_reviews: .reviews.info.num_reviews,
files: [
.files[] | {
name: .name,
format: .format,
size: .size,
md5: .md5,
sha1: .sha1
}
]
}' > "$output_file"
}
# Example usage
extract_archive_metadata "gd1977-05-08.sbd.miller.97065.flac16" \
".curator/metadata/gd-1977-05-08.json"
Merge with Curator Metadata
# Merge Archive.org metadata with existing curator metadata
merge_metadata() {
local curator_metadata="$1"
local archive_metadata="$2"
local output_file="$3"
jq -s '.[0] * .[1]' "$curator_metadata" "$archive_metadata" > "$output_file"
}
Verification and Quality Assurance
Checksum Verification
Archive.org provides MD5 and SHA1 checksums for all files:
# Verify downloaded files against Archive.org checksums
verify_archive_downloads() {
local identifier="$1"
local download_dir="$2"
# Get checksums from metadata
local checksums=$(curl -s "https://archive.org/metadata/$identifier" \
| jq -r '.files[] | "\(.md5) \(.name)"')
# Verify each file
local failed=0
while IFS= read -r line; do
local expected_md5="${line%% *}"
local filename="${line#* }"
local filepath="$download_dir/$filename"
if [[ ! -f "$filepath" ]]; then
echo "MISSING: $filename"
failed=$((failed + 1))
continue
fi
local actual_md5=$(md5sum "$filepath" | awk '{print $1}')
if [[ "$actual_md5" != "$expected_md5" ]]; then
echo "CHECKSUM FAILED: $filename"
echo " Expected: $expected_md5"
echo " Actual: $actual_md5"
failed=$((failed + 1))
else
echo "VERIFIED: $filename"
fi
done <<< "$checksums"
if [[ $failed -gt 0 ]]; then
echo "Verification failed for $failed files"
return 1
fi
echo "All files verified successfully"
return 0
}
File Integrity Check
# Verify audio file integrity after download
verify_audio_integrity() {
local file="$1"
# Check file is not zero-byte
if [[ ! -s "$file" ]]; then
echo "FAILED: Zero-byte file"
return 1
fi
# Verify with ffmpeg
if command -v ffmpeg >/dev/null; then
if ffmpeg -v error -i "$file" -f null - 2>&1 | grep -q "error"; then
echo "FAILED: Corrupted audio file"
return 1
fi
fi
# FLAC specific: verify with flac tool
if [[ "$file" == *.flac ]] && command -v flac >/dev/null; then
if ! flac -t "$file" 2>&1 | grep -q "ok"; then
echo "FAILED: FLAC verification failed"
return 1
fi
fi
echo "VERIFIED: $file"
return 0
}
# Verify all files in directory
verify_directory_integrity() {
local dir="$1"
local total=0
local verified=0
local failed=0
find "$dir" -type f \( -name "*.flac" -o -name "*.mp3" -o -name "*.ogg" \) | while read -r file; do
total=$((total + 1))
if verify_audio_integrity "$file"; then
verified=$((verified + 1))
else
failed=$((failed + 1))
fi
done
echo "Integrity check complete: $verified/$total verified, $failed failed"
}
Rate Limiting and Politeness
Archive.org requests responsible use of their bandwidth:
# Implement polite download behavior
polite_download() {
local identifier="$1"
local output_dir="$2"
local format="$3"
# Add delays between requests
local wait_time=2 # seconds
# Get file list
local files=$(curl -s "https://archive.org/metadata/$identifier" \
| jq -r ".files[] | select(.format == \"$format\") | .name")
while IFS= read -r file; do
echo "Downloading: $file"
wget --quiet \
--show-progress \
--wait="$wait_time" \
--random-wait \
--user-agent="AIWG-MediaCurator/1.0 (research/personal use)" \
"https://archive.org/download/$identifier/$file" \
-P "$output_dir"
done <<< "$files"
}
Bulk Download Best Practices
- Rate limit: Wait 2-5 seconds between requests
- Random wait: Add jitter to avoid patterns
- User agent: Identify your tool
- Off-peak hours: Download during US night hours
- Resume support: Use
wget -cto resume interrupted downloads - Error handling: Retry failed downloads after delay
Collection Management
Build Local Collection Index
# Create searchable index of downloaded items
build_collection_index() {
local collection_dir="$1"
local index_file="$2"
echo '{"items": []}' > "$index_file"
find "$collection_dir" -name "metadata.json" -type f | while read -r metadata_file; do
local item_dir=$(dirname "$metadata_file")
# Extract key metadata fields
local item_data=$(jq '{
identifier: .metadata.identifier,
title: .metadata.title,
creator: .metadata.creator,
date: .metadata.date,
local_path: "'"$item_dir"'",
file_count: (.files | length),
total_size: ([.files[] | .size | tonumber] | add)
}' "$metadata_file")
# Append to index
jq ".items += [$item_data]" "$index_file" > "$index_file.tmp"
mv "$index_file.tmp" "$index_file"
done
echo "Index built: $index_file"
}
# Search local collection
search_local_collection() {
local index_file="$1"
local query="$2"
jq -r ".items[] | select(.title | test(\"$query\"; \"i\")) | \"\(.date) - \(.creator) - \(.title) (\(.local_path))\"" \
"$index_file"
}
Deduplicate Collection
# Find and remove duplicate items in collection
deduplicate_collection() {
local collection_dir="$1"
# Build checksum database
local checksum_file="/tmp/curator-checksums-$$.txt"
find "$collection_dir" -type f \( -name "*.flac" -o -name "*.mp3" \) | while read -r file; do
local checksum=$(md5sum "$file" | awk '{print $1}')
echo "$checksum $file" >> "$checksum_file"
done
# Find duplicates
sort "$checksum_file" | uniq -d -w 32 | while read -r line; do
local checksum="${line%% *}"
echo "Duplicate found (checksum: $checksum):"
grep "^$checksum" "$checksum_file" | while read -r dup_line; do
local dup_file="${dup_line#* }"
echo " - $dup_file"
done
# Offer to delete all but first
local files=($(grep "^$checksum" "$checksum_file" | cut -d' ' -f3-))
local keep="${files[0]}"
echo "Keep: $keep"
for ((i=1; i<${#files[@]}; i++)); do
echo "Delete: ${files[$i]}"
rm "${files[$i]}"
done
done
rm "$checksum_file"
}
Advanced Patterns
Incremental Collection Sync
# Download only new items added to collection since last sync
sync_collection_incremental() {
local collection="$1"
local output_base="$2"
local format="$3"
local sync_state_file="$output_base/.curator/sync-state.json"
# Get last sync timestamp
local last_sync
if [[ -f "$sync_state_file" ]]; then
last_sync=$(jq -r '.last_sync' "$sync_state_file")
else
last_sync="1970-01-01"
fi
# Search for items added/updated since last sync
local new_items=$(curl -s "https://archive.org/advancedsearch.php" \
-d "q=collection:$collection AND publicdate:[$last_sync TO null]" \
-d "fl[]=identifier" \
-d "rows=1000" \
-d "output=json" \
| jq -r '.response.docs[] | .identifier')
# Download new items
local count=0
while IFS= read -r identifier; do
count=$((count + 1))
echo "Downloading new item [$count]: $identifier"
download_item "$identifier" "$output_base/$identifier" "$format"
sleep 2
done <<< "$new_items"
# Update sync state
echo "{\"last_sync\": \"$(date -Iseconds)\", \"items_added\": $count}" > "$sync_state_file"
echo "Sync complete: $count new items added"
}
Filtered Batch Download
# Download items matching complex filter criteria
download_filtered_batch() {
local query="$1"
local output_base="$2"
local format="$3"
local min_rating="${4:-4.0}"
local min_year="${5:-1960}"
# Search with filters
local items=$(curl -s "https://archive.org/advancedsearch.php" \
-d "q=$query AND avg_rating:[$min_rating TO 5.0] AND year:[$min_year TO 2026]" \
-d "fl[]=identifier,avg_rating,year" \
-d "rows=500" \
-d "output=json")
# Download each item
echo "$items" | jq -r '.response.docs[] | .identifier' | while read -r identifier; do
local rating=$(echo "$items" | jq -r ".response.docs[] | select(.identifier == \"$identifier\") | .avg_rating")
local year=$(echo "$items" | jq -r ".response.docs[] | select(.identifier == \"$identifier\") | .year")
echo "Downloading: $identifier (rating: $rating, year: $year)"
download_item "$identifier" "$output_base/$identifier" "$format"
sleep 2
done
}
Integration with Media Curator Framework
Source Plan Generation
# Generate source plan for /acquire command
generate_archive_source_plan() {
local collection="$1"
local output_file="$2"
local format="${3:-Flac}"
# Get collection items
local items=$(get_collection_items "$collection" 100)
# Build YAML plan
cat > "$output_file" <<EOF
source_plan:
id: archive-$(date +%Y%m%d-%H%M%S)
source: archive.org
collection: $collection
format: $format
created_at: $(date -Iseconds)
sources:
EOF
while IFS= read -r identifier; do
# Get item metadata
local metadata=$(curl -s "https://archive.org/metadata/$identifier")
local title=$(echo "$metadata" | jq -r '.metadata.title')
local creator=$(echo "$metadata" | jq -r '.metadata.creator')
local date=$(echo "$metadata" | jq -r '.metadata.date')
cat >> "$output_file" <<EOF
- identifier: $identifier
title: "$title"
creator: "$creator"
date: "$date"
url: "https://archive.org/details/$identifier"
format: $format
EOF
done <<< "$items"
echo "Source plan generated: $output_file"
}
# Usage with /acquire
# generate_archive_source_plan "GratefulDead" ".curator/sources/gd-plan.yaml" "Flac"
# /acquire --plan .curator/sources/gd-plan.yaml
Troubleshooting
Common Issues and Solutions
| Issue | Cause | Solution |
|---|---|---|
| Download fails immediately | Invalid identifier | Verify identifier on archive.org website |
| Slow download speeds | Network throttling | Use --limit-rate with wget |
| Incomplete files | Connection timeout | Use wget -c to resume |
| Missing files | Format not available | Check available formats with metadata API |
| Checksum mismatch | Corrupted download | Delete and re-download |
| Rate limiting | Too many requests | Increase wait time between downloads |
Debug Mode
# Enable verbose output for troubleshooting
export CURATOR_DEBUG=1
download_item_debug() {
local identifier="$1"
local output_dir="$2"
if [[ "$CURATOR_DEBUG" == "1" ]]; then
set -x # Enable bash tracing
fi
# Get metadata with verbose curl
curl -v "https://archive.org/metadata/$identifier" 2>&1 | tee "$output_dir/.curator/metadata-debug.log"
# Download with verbose wget
wget --verbose \
--debug \
"https://archive.org/download/$identifier/" \
2>&1 | tee "$output_dir/.curator/download-debug.log"
set +x
}
Performance Optimization
Parallel Download Tuning
# Optimal concurrent downloads based on connection speed
determine_optimal_parallelism() {
local speed_mbps="$1"
if [[ $speed_mbps -lt 10 ]]; then
echo 2 # Slow connection
elif [[ $speed_mbps -lt 50 ]]; then
echo 3 # Medium connection
elif [[ $speed_mbps -lt 100 ]]; then
echo 5 # Fast connection
else
echo 8 # Very fast connection
fi
}
# Test connection speed
test_connection_speed() {
# Download test file and measure speed
local test_file="https://archive.org/download/test_item/test.mp3"
local start_time=$(date +%s)
wget -q -O /tmp/speed-test.tmp "$test_file"
local end_time=$(date +%s)
local duration=$((end_time - start_time))
local size=$(stat -c%s /tmp/speed-test.tmp 2>/dev/null || stat -f%z /tmp/speed-test.tmp)
local speed_mbps=$(( (size * 8) / (duration * 1000000) ))
rm /tmp/speed-test.tmp
echo "$speed_mbps"
}
Caching Strategy
# Cache metadata to reduce API calls
cache_metadata() {
local identifier="$1"
local cache_dir=".curator/cache/metadata"
local cache_file="$cache_dir/$identifier.json"
mkdir -p "$cache_dir"
# Check cache age
if [[ -f "$cache_file" ]]; then
local cache_age=$(( $(date +%s) - $(stat -c%Y "$cache_file" 2>/dev/null || stat -f%m "$cache_file") ))
# Cache valid for 7 days
if [[ $cache_age -lt 604800 ]]; then
cat "$cache_file"
return 0
fi
fi
# Fetch and cache
curl -s "https://archive.org/metadata/$identifier" | tee "$cache_file"
}
Summary
Archive Acquisition skill provides:
- Discovery: Search API patterns for finding content
- Quality Selection: Automatic best-format selection (prefer FLAC, accept MP3 320K)
- Bulk Download: Recursive wget and parallel download patterns
- Verification: Checksum validation and integrity checks
- Collection Management: Indexing, deduplication, incremental sync
- Integration: Source plan generation for
/acquirecommand - Performance: Parallel tuning, caching, rate limiting
Use this skill whenever working with Internet Archive content in the Media Curator framework.
References
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/human-authorization.md — Seek explicit authorization before bulk downloads and irreversible archive operations
- @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/research-before-decision.md — Evaluate archive item quality and metadata before committing to download
- @$AIWG_ROOT/agentic/code/frameworks/media-curator/skills/acquire/SKILL.md — General acquisition skill that delegates to archive-acquisition for archive.org sources
- @$AIWG_ROOT/agentic/code/frameworks/media-curator/skills/integrity-verification/SKILL.md — Verify checksums after archive downloads
- @$AIWG_ROOT/agentic/code/frameworks/media-curator/skills/quality-filtering/SKILL.md — Quality scoring used to select best format from archive items