remote-gpu-trainer
Deploy, monitor, and debug long-running GPU jobs on rented or remote instances (AutoDL, RunPod, vast.ai, Lambda, Slurm, K8s), covering teardown and billing safety, spot-instance resilience, resumable checkpointing, and OOM/NaN triage.
Details
- Path: plugins/antigravity-awesome-skills/skills/remote-gpu-trainer
- License: MIT
- Bundled scripts: 9
- Dependencies: 9
Bundled scripts
- plugins/antigravity-awesome-skills/skills/remote-gpu-trainer/evals/run_evals.py
- plugins/antigravity-awesome-skills/skills/remote-gpu-trainer/scripts/mem_monitor.sh
- plugins/antigravity-awesome-skills/skills/remote-gpu-trainer/scripts/download_loop.sh
- plugins/antigravity-awesome-skills/skills/remote-gpu-trainer/scripts/reap_vram_zombies.sh
- plugins/antigravity-awesome-skills/skills/remote-gpu-trainer/scripts/gpu_health.sh
- plugins/antigravity-awesome-skills/skills/remote-gpu-trainer/scripts/check_staleness.py
- plugins/antigravity-awesome-skills/skills/remote-gpu-trainer/scripts/verify_local.py
- plugins/antigravity-awesome-skills/skills/remote-gpu-trainer/scripts/setup-china-mirrors.sh
- plugins/antigravity-awesome-skills/skills/remote-gpu-trainer/scripts/aggregate_to_fs.sh
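
The resumable checkpointing mentioned in the description can be illustrated with a minimal sketch. It is not taken from any of the bundled scripts; the function names (`save_checkpoint`, `load_checkpoint`, `train`), the JSON state layout, and the `stop_at` parameter used to simulate a spot preemption are all hypothetical. The idea shown is to write each checkpoint to a temporary file and rename it into place, so that an instance killed mid-write leaves the previous checkpoint intact.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # replaces any existing file in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "loss_sum": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, save_every=10, stop_at=None):
    """Toy loop that resumes from the last checkpoint found at `path`.

    `stop_at` is a hypothetical hook that simulates a preemption by
    returning early without saving.
    """
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        if stop_at is not None and step == stop_at:
            return state  # simulated preemption: progress since last save is lost
        state["loss_sum"] += 1.0 / (step + 1)  # stand-in for a real training step
        state["step"] = step + 1
        if state["step"] % save_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Calling `train("ckpt.json", 40, stop_at=25)` and then `train("ckpt.json", 40)` would restart the second run from the last saved step rather than from zero. A real job would store model and optimizer state instead of a running sum, but the write-then-rename pattern stays the same.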