Book SFT Pipeline
Convert long-form books into supervised fine-tuning data for literary style transfer. Keep the entrypoint lightweight: use this file to route the work, then open only the references needed for the current phase.
Activate When
- Building fine-tuning datasets from literary works
- Creating author-voice or style-transfer models
- Preparing training data for Tinker or similar SFT platforms
- Designing text segmentation pipelines for long-form content
- Training small models on limited literary data
Fast Path
- Confirm the source is suitable: prefer ePub over PDF, remove front/back matter, and preserve paragraph breaks.
- Segment into coherent 150-400 word chunks. Never break mid-sentence.
- Generate synthetic scene descriptions without quoting the source text.
- Build JSONL conversation examples with varied system prompts and user templates.
- Train a LoRA on a base model, not an instruction-tuned model.
- Validate on modern scenarios and grep training data for suspicious phrases.
Default Parameters
| Setting | Default |
|---|---|
| Chunk size | 150-400 words |
| Prompt diversity | 15+ templates, 5+ system prompts |
| Variants | 2 per chunk |
| Model | Qwen/Qwen3-8B-Base or another base 8B-class model |
| LoRA rank | 32 |
| Epochs | 3 |
| Test set | 50 examples minimum |
Core Rules
- Source ePub before PDF because OCR noise becomes learned behavior.
- Keep chunks semantically complete and paragraph-bounded where possible.
- Teach style, not plot: instructions should describe scenes without quoting.
- Rotate prompt and system templates to reduce memorization.
- Use base models for malleable style transfer.
- Validate originality before claiming the model learned the author's voice.
Progressive Disclosure
Open these only when the task reaches that layer:
- Pipeline Workflow - full phase-by-phase workflow with extraction, segmentation, instruction generation, dataset construction, training, validation commands, costs, and troubleshooting.
- Segmentation Strategies - advanced paragraph, scene, dialogue, and LLM-assisted chunking patterns.
- Tinker Format Specification - Datum, renderer, token-weight, JSONL, and training loop details.
- Tinker API Documentation - full API reference.
- Gertrude Stein Case Study - complete working example with outputs and configuration.
Implementation Starter
Use the sample script when the user wants executable scaffolding:
python plugins/khuym/skills/book-sft-pipeline/scripts/pipeline_example.py
The script demonstrates the same pipeline semantics as this skill: segmentation, diverse prompt construction, Tinker datum construction, and originality checks.
Validation Checklist
- Chunks end at natural grammatical boundaries.
- JSONL rows contain system, user, and assistant messages.
- Prompt variants are distributed across chunks.
- Held-out test examples are excluded from training.
- Modern scenario outputs contain style markers without original plot content.
- Exact output phrases do not appear in the training JSONL.
References
Internal references:
External resources:
Skill Metadata
Created: 2025-12-26 Last Updated: 2025-12-28 Author: Muratcan Koylan Version: 2.0.0 Standalone: Yes