11 May 2026

Running Llama.cpp Agents on AMD RX 9060 XT with Docker and gVisor

Hardware Setup

2x AMD RX 9060 XT GPUs
16GB VRAM per GPU
Total 32GB VRAM available for model loading

Software Stack

llama.cpp

Quantized model inference using GGUF format
Multi-GPU support for AMD RDNA architecture
CPU fallback for non-GPU workloads

Docker with gVisor

gVisor provides user-space container runtime for enhanced security
Isolated from host kernel for agent sandboxing
Suitable for running untrusted or experimental AI agents

Docker Configuration

Base Image

FROM gcr.io/go-containerregistry/docker:27.3.1-dind

GPU Passthrough

--gpus all

Memory Limits

--memory="32g"
--memory-swap="32g"

Volume Mounts

-v /path/to/models:/models:ro
-v /path/to/agents:/agents

Model Loading

Supported Quantizations

Q4_K_M (4-bit, good balance of speed/accuracy)
Q5_K_M (5-bit, improved accuracy)
Q8_0 (8-bit, near-precision)

Model Size Considerations

7B models: ~4-5GB at Q4_K_M
13B models: ~8-9GB at Q4_K_M
Can load multiple smaller models simultaneously

Agent Architecture

Container Structure

agents/
├── agent1/
│   ├── Dockerfile
│   └── main.py
├── agent2/
│   ├── Dockerfile
│   └── main.py
└── shared/
    └── models/

Communication

Inter-agent communication via shared volumes
Model sharing between agents to reduce memory footprint
Centralized logging and monitoring

Performance Considerations

VRAM Management

Monitor VRAM usage with nvidia-smi (for AMD: radeontop)
Implement model unloading for idle agents
Use quantization to fit larger models

Inference Speed

Batch processing for improved throughput
Context window management
Temperature and top-p tuning for response quality

Security Notes

gVisor provides kernel isolation but not network isolation
Additional network policies recommended
Regular container image scanning
Minimal base images reduce attack surface

Troubleshooting

Common Issues

VRAM fragmentation: Use smaller quantizations
Slow inference: Check CPU fallback is not being used
Container crashes: Verify GPU driver compatibility

AMD GPU Specific

Verify ROCm support in Docker image
Check dmesg for GPU errors
Monitor with rocm-smi

Future Improvements

Implement model caching between agents
Add persistent storage for conversation history
Integrate with existing LLM orchestration tools