remote-gpu-trainer

Skill · MIT · by sickn33

Deploy, monitor, and debug long GPU jobs on RENTED/remote instances (AutoDL, RunPod, vast.ai, Lambda, Slurm, K8s): teardown/billing safety, spot resilience, resumable checkpointing, OOM/NaN triage.

Details

Path: skills/remote-gpu-trainer
License: MIT
Bundled scripts: 9
Dependencies: 9

Bundled scripts

  • skills/remote-gpu-trainer/evals/run_evals.py
  • skills/remote-gpu-trainer/scripts/mem_monitor.sh
  • skills/remote-gpu-trainer/scripts/download_loop.sh
  • skills/remote-gpu-trainer/scripts/reap_vram_zombies.sh
  • skills/remote-gpu-trainer/scripts/gpu_health.sh
  • skills/remote-gpu-trainer/scripts/check_staleness.py
  • skills/remote-gpu-trainer/scripts/verify_local.py
  • skills/remote-gpu-trainer/scripts/setup-china-mirrors.sh
  • skills/remote-gpu-trainer/scripts/aggregate_to_fs.sh

FAQ