Running Llama.cpp Agents on AMD RX 9060 XT with Docker and gVisor
Hardware Setup
- 2x AMD RX 9060 XT GPUs
- 16GB VRAM per GPU
- 32GB VRAM in total across both cards (two separate 16GB pools, so a model larger than 16GB must be split across the GPUs)
Software Stack
llama.cpp
- Quantized model inference using GGUF format
- Multi-GPU support on AMD RDNA GPUs through the HIP (ROCm) or Vulkan backends
- CPU fallback for layers that are not offloaded to a GPU
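As a rough illustration of the multi-GPU setup, the sketch below builds a `llama-server` command line that offloads all layers and splits them evenly over two GPUs. The model path is a hypothetical placeholder, and the even split is an assumption that suits two identical cards.

```python
from typing import List, Sequence


def llama_server_args(model_path: str,
                      tensor_split: Sequence[float] = (1.0, 1.0),
                      ctx_size: int = 4096,
                      gpu_layers: int = 99) -> List[str]:
    """Build an argv list for llama-server that spreads layers over several GPUs."""
    split = ",".join(f"{share:g}" for share in tensor_split)
    return [
        "llama-server",
        "--model", model_path,
        "--n-gpu-layers", str(gpu_layers),  # a value above the layer count offloads everything
        "--split-mode", "layer",            # place whole layers on each GPU
        "--tensor-split", split,            # relative share of the model per GPU
        "--ctx-size", str(ctx_size),
    ]


# Hypothetical model file name, used only for illustration.
args = llama_server_args("/models/example-7b.Q4_K_M.gguf")
print(" ".join(args))
```

Lowering `gpu_layers` keeps some layers on the CPU, which trades speed for VRAM headroom.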
Docker with gVisor
- gVisor provides a user-space kernel and container runtime (runsc) for enhanced security
- Isolated from host kernel for agent sandboxing
- Suitable for running untrusted or experimental AI agents
Docker Configuration
Base Image
FROM docker:27.3.1-dind
GPU Passthrough
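--device=/dev/kfd --device=/dev/dri --group-add video
Note: --gpus all is specific to the NVIDIA Container Toolkit. ROCm containers receive the GPU through the /dev/kfd and /dev/dri device nodes. Confirm that the gVisor runtime in use can expose these nodes, since its documented GPU support is centered on NVIDIA devices.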
Memory Limits
--memory="32g"
--memory-swap="32g"
Volume Mounts
-v /path/to/models:/models:ro
-v /path/to/agents:/agents
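The flags in this section can be assembled into a single `docker run` invocation. The sketch below is one possible arrangement, not a tested configuration: the image name is hypothetical, it assumes `runsc` is registered as a Docker runtime, and it uses the ROCm device nodes rather than `--gpus all`, which targets the NVIDIA Container Toolkit. Whether gVisor passes AMD device nodes through needs to be verified on the target host.

```python
from typing import List


def docker_run_args(image: str, models_dir: str, agents_dir: str,
                    memory: str = "32g") -> List[str]:
    """Build a docker run argv for a sandboxed agent container."""
    return [
        "docker", "run", "--rm",
        "--runtime=runsc",               # assumes gVisor is installed and registered
        "--device=/dev/kfd",             # ROCm compute interface
        "--device=/dev/dri",             # GPU render nodes
        "--group-add", "video",
        f"--memory={memory}",
        f"--memory-swap={memory}",       # equal to --memory, so no swap is allowed
        "-v", f"{models_dir}:/models:ro",  # models are mounted read-only
        "-v", f"{agents_dir}:/agents",
        image,
    ]


# "example/llama-agent:latest" is a placeholder image name.
cmd = docker_run_args("example/llama-agent:latest",
                      "/path/to/models", "/path/to/agents")
print(" ".join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting problems with paths that contain spaces.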
Model Loading
Supported Quantizations
- Q4_K_M (4-bit, good balance of speed/accuracy)
- Q5_K_M (5-bit, improved accuracy)
- Q8_0 (8-bit, close to full-precision quality)
Model Size Considerations
- 7B models: ~4-5GB at Q4_K_M
- 13B models: ~8-9GB at Q4_K_M
- Can load multiple smaller models simultaneously
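The size figures above follow from parameter count times bits per weight. The sketch below makes that arithmetic explicit and checks whether a set of models fits on one 16GB card. The bits-per-weight values are approximations, and the 2GB reserve for KV cache and runtime buffers is an assumed figure that should be tuned per workload.

```python
from typing import Iterable, Tuple

# Approximate effective bits per weight; real GGUF files vary by architecture.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q8_0": 8.5}


def estimate_weights_gb(n_params: float, quant: str) -> float:
    """Estimate the size of the quantized weights in GB (10^9 bytes)."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


def fits_on_gpu(models: Iterable[Tuple[float, str]],
                vram_gb: float = 16.0,
                reserve_gb: float = 2.0) -> bool:
    """Check whether the combined weights fit, leaving an assumed reserve free."""
    total = sum(estimate_weights_gb(n, q) for n, q in models)
    return total <= vram_gb - reserve_gb


print(round(estimate_weights_gb(7e9, "Q4_K_M"), 2))
print(fits_on_gpu([(7e9, "Q4_K_M"), (7e9, "Q4_K_M")]))
```

The estimate covers weights only, so long context windows need a larger reserve.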
Agent Architecture
Container Structure
agents/
├── agent1/
│   ├── Dockerfile
│   └── main.py
├── agent2/
│   ├── Dockerfile
│   └── main.py
└── shared/
    └── models/
Communication
- Inter-agent communication via shared volumes
- Model sharing between agents to reduce memory footprint
- Centralized logging and monitoring
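One simple way to pass messages over a shared volume is a per-agent inbox directory with one JSON file per message. The sketch below is an assumed design, not a prescribed protocol: the function names and the message layout are hypothetical. Each message is written to a temporary file and then renamed, so a reader never sees a half-written file.

```python
import json
import os
import tempfile
import uuid
from pathlib import Path
from typing import List


def send_message(inbox: Path, sender: str, payload: dict) -> Path:
    """Write one message into the recipient's inbox directory."""
    inbox.mkdir(parents=True, exist_ok=True)
    body = json.dumps({"sender": sender, "payload": payload})
    # Write to a temp file in the same directory, then rename it into place.
    fd, tmp_name = tempfile.mkstemp(dir=inbox, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(body)
    final = inbox / f"{uuid.uuid4().hex}.json"
    os.replace(tmp_name, final)
    return final


def read_messages(inbox: Path) -> List[dict]:
    """Read and delete all pending messages (delivery order is not guaranteed)."""
    messages = []
    for path in sorted(inbox.glob("*.json")):
        messages.append(json.loads(path.read_text(encoding="utf-8")))
        path.unlink()
    return messages
```

An agent would call `send_message(Path("/agents/agent2/inbox"), "agent1", {...})`, with that inbox path being an illustrative choice under the `/agents` mount.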
VRAM Management
- Monitor VRAM usage with rocm-smi or radeontop (nvidia-smi works only with NVIDIA GPUs)
- Implement model unloading for idle agents
- Use quantization to fit larger models
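Unloading models for idle agents can be approximated with a small pool that tracks when each model was last used. The class below is a hypothetical sketch: `loader` stands in for whatever call actually loads a model, and dropping the reference frees VRAM only if the underlying model object releases its memory when destroyed.

```python
import time
from typing import Any, Callable, Dict, List, Optional


class IdleModelPool:
    """Load models on demand and drop those unused for idle_seconds."""

    def __init__(self, loader: Callable[[str], Any], idle_seconds: float = 300.0):
        self._loader = loader
        self._idle_seconds = idle_seconds
        self._models: Dict[str, Any] = {}
        self._last_used: Dict[str, float] = {}

    def get(self, name: str, now: Optional[float] = None) -> Any:
        """Return the model, loading it first if needed, and mark it as used."""
        now = time.monotonic() if now is None else now
        if name not in self._models:
            self._models[name] = self._loader(name)
        self._last_used[name] = now
        return self._models[name]

    def evict_idle(self, now: Optional[float] = None) -> List[str]:
        """Drop every model idle for at least idle_seconds; return their names."""
        now = time.monotonic() if now is None else now
        stale = [name for name, used in self._last_used.items()
                 if now - used >= self._idle_seconds]
        for name in stale:
            del self._models[name]
            del self._last_used[name]
        return stale
```

Calling `evict_idle()` from a periodic timer keeps the pool small without touching the request path.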
Inference Speed
- Batch processing for improved throughput
- Context window management
- Temperature and top-p tuning for response quality
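Context window management often comes down to trimming old conversation turns before each request. The sketch below keeps the first message, assumed to be the system prompt, plus as many of the most recent messages as fit a token budget. The four-characters-per-token counter is a crude stand-in; a real tokenizer can be passed in through `count`.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def rough_token_count(text: str) -> int:
    """Crude estimate: about four characters per token."""
    return max(1, len(text) // 4)


def trim_history(messages: List[Message], budget: int,
                 count: Callable[[str], int] = rough_token_count) -> List[Message]:
    """Keep the first message plus the newest messages that fit in the budget."""
    if not messages:
        return []
    system, rest = messages[0], messages[1:]
    used = count(system["content"])
    kept: List[Message] = []
    # Walk backwards from the newest message and stop at the first that overflows.
    for message in reversed(rest):
        cost = count(message["content"])
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return [system] + kept[::-1]
```

The budget should leave room for the model's reply, so it is usually set below the configured context size.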
Security Notes
- gVisor isolates containers from the host kernel but does not by itself restrict which network destinations a container can reach
- Additional network policies recommended
- Regular container image scanning
- Minimal base images reduce attack surface
Troubleshooting
Common Issues
- VRAM fragmentation: Use smaller quantizations
- Slow inference: Check CPU fallback is not being used
- Container crashes: Verify GPU driver compatibility
AMD GPU Specific
- Verify ROCm support in Docker image
- Check dmesg for GPU errors
- Monitor with rocm-smi
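A quick preflight check inside the container can catch a missing GPU before inference silently falls back to the CPU. The sketch below only tests whether the ROCm device nodes exist: `/dev/kfd` is the compute interface and `/dev/dri` holds the render nodes. It does not prove that the driver or llama.cpp backend works, and the function names are illustrative.

```python
import os
from typing import Dict, Iterable, List


def check_device_nodes(paths: Iterable[str] = ("/dev/kfd", "/dev/dri")) -> Dict[str, bool]:
    """Report whether each expected device node is visible in this environment."""
    return {path: os.path.exists(path) for path in paths}


def missing_nodes(status: Dict[str, bool]) -> List[str]:
    """List the paths that were not found."""
    return [path for path, present in status.items() if not present]


status = check_device_nodes()
missing = missing_nodes(status)
if missing:
    print("GPU device nodes not visible:", ", ".join(missing))
else:
    print("GPU device nodes are visible")
```

If nodes are missing, the container was likely started without the device flags, or the runtime did not pass them through.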
Future Improvements
- Implement model caching between agents
- Add persistent storage for conversation history
- Integrate with existing LLM orchestration tools