Vicuna 13B Model on AMD GPU with ROCm: GPU Memory and Accelerated Performance Quiz and Flashcards

Study Notes

Running Vicuna 13B Chatbot Model on Single AMD GPU with ROCm

Vicuna is an open-source chatbot model with 13 billion parameters, developed by a team from UC Berkeley, CMU, Stanford, and UC San Diego. In the team's preliminary evaluation, which used GPT-4 as the judge, it reached more than 90% of the quality of OpenAI ChatGPT.
Vicuna was created by fine-tuning a LLaMA base model on about 70K user-shared conversations collected from ShareGPT.com via public APIs.
Vicuna was released on GitHub on Apr 11, 2023, and its dataset, training code, evaluation metrics, and training cost are publicly known.
Running the Vicuna-13B model in fp16 requires around 28GB of GPU RAM, so a quantized model is needed to reduce the memory footprint.
The GPTQ paper proposed an accurate post-training quantization method for GPT models at lower bit precision, achieving accuracy comparable to fp16 for models with more than 10B parameters.
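To make the idea of lower bit precision concrete, here is a minimal sketch of plain round-to-nearest 4-bit quantization. This is not the GPTQ algorithm, which additionally adjusts the weights to compensate for quantization error; the function names and sample weights are made up for illustration.

```python
def quantize_4bit(weights):
    """Map floats to integer codes in [0, 15] using one shared scale and offset."""
    lo, hi = min(weights), max(weights)
    # 4 bits give 16 levels, so the range is split into 15 steps.
    # Fall back to 1.0 when all weights are equal to avoid dividing by zero.
    scale = (hi - lo) / 15 or 1.0
    codes = [min(15, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_4bit(codes, scale, lo):
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale + lo for c in codes]


# Illustrative weights only; real layers hold millions of values.
weights = [-0.75, -0.2, 0.0, 0.31, 0.75]
codes, scale, lo = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale, lo)

# Round-to-nearest keeps each value within half a step of the original.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, max_error)
```

Each weight is stored as a 4-bit code plus a shared scale and offset, which is where the roughly fourfold saving over fp16 comes from.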
Several 4-bit quantized Vicuna models are available from Hugging Face.
Running the Vicuna 13B model on an AMD GPU requires ROCm (Radeon Open Compute), an open-source software platform that provides AMD GPU acceleration for deep learning and high-performance computing applications.
System requirements for running the Vicuna 13B model on an AMD GPU with ROCm include Ubuntu 22.04, ROCm 5.4.3, and PyTorch 2.0.
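One way to confirm that the installed PyTorch build can see the AMD GPU is a short check like the sketch below. It assumes a ROCm build of PyTorch, which exposes AMD devices through the `torch.cuda` API; the helper name `rocm_pytorch_report` is made up for illustration, and the function degrades gracefully when PyTorch is not installed.

```python
def rocm_pytorch_report():
    """Return a small dict describing the local PyTorch/ROCm setup."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # On ROCm builds this is a version string; on other builds it is None.
        "hip_version": getattr(torch.version, "hip", None),
        # ROCm builds report AMD GPUs through the torch.cuda namespace.
        "gpu_available": torch.cuda.is_available(),
    }
    if report["gpu_available"]:
        report["gpu_name"] = torch.cuda.get_device_name(0)
    return report


report = rocm_pytorch_report()
print(report)
```

A non-empty `hip_version` together with `gpu_available` set to `True` suggests that the ROCm stack and PyTorch are set up consistently.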
A quantized model can be obtained either by downloading a quantized Vicuna-13B model from Hugging Face or by quantizing the floating-point model yourself.
The 4-bit quantized Vicuna-13B model fits in the memory of an RX 6900 XT GPU: the card has 16GB of GPU memory, and only 7.52GB is needed to run the model.
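The memory figures in these notes can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts the weights only, in decimal gigabytes, and assumes exactly 13 billion parameters; the gap between these estimates and the quoted 28GB and 7.52GB is presumably runtime overhead such as activations and quantization metadata.

```python
PARAMS = 13e9             # approximate Vicuna-13B parameter count (assumption)
GPU_MEMORY_GB = 16.0      # RX 6900 XT capacity quoted in the notes
MEASURED_4BIT_GB = 7.52   # 4-bit model usage quoted in the notes


def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(PARAMS, 16)  # 2 bytes per parameter
int4_gb = weight_memory_gb(PARAMS, 4)   # half a byte per parameter
headroom_gb = GPU_MEMORY_GB - MEASURED_4BIT_GB

print(f"fp16 weights:  {fp16_gb:.1f} GB (fits in 16 GB: {fp16_gb <= GPU_MEMORY_GB})")
print(f"4-bit weights: {int4_gb:.1f} GB (fits in 16 GB: {int4_gb <= GPU_MEMORY_GB})")
print(f"headroom with the measured 4-bit model: {headroom_gb:.2f} GB")
```

The fp16 weights alone come to about 26 GB, which exceeds the 16GB card, while the 4-bit weights come to about 6.5 GB and leave several gigabytes free.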
The latency penalty and accuracy penalty for using the 4-bit quantized model are minimal.
The Vicuna model can be exposed through a web API server and tested on tasks such as language translation or question answering, with both the fp16 and 4-bit quantized models providing accurate and fast results.