TinyLlama: Compact LLM with WebAssembly


TinyLlama is an ambitious initiative aimed at pretraining a language model on a dataset of 3 trillion tokens. What sets this project apart is not just the size of the data but the efficiency and speed of its processing. Training started in September 2023 on 16 A100-40G GPUs and is planned to span just 90 days.


The compactness of TinyLlama is its standout feature. With only 1.1 billion parameters, it is uniquely tailored for scenarios where computational and memory resources are limited. This makes it an ideal solution for edge devices.


For convenience, I’ve prepared a Docker image containing all the necessary tools, including CUDA, mlc-llm, and Emscripten, which are crucial for preparing the model for WebAssembly.

Dockerfile:

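# Stage 1: fetch the mlc-llm sources (with submodules)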
FROM alpine/git:2.36.2 as download

RUN git clone https://github.com/mlc-ai/mlc-llm.git --recursive /mlc-llm

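# Stage 2: CUDA development image with the build tools, Python environment, and mlc-llm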
FROM nvidia/cuda:12.2.2-cudnn8-devel-ubuntu22.04

RUN apt update && \
    apt install -yq curl git cmake ack tmux \
        python3-dev vim python3-venv python3-pip \
        protobuf-compiler build-essential


RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
RUN python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-chat-nightly-cu122 mlc-ai-nightly-cu122

RUN apt install -yq gcc
COPY --from=download /mlc-llm /opt/mlc-llm

RUN cd /opt/mlc-llm && pip3 install .

RUN apt-get install git-lfs -yq

ENV TVM_HOME="/opt/venv/lib/python3.10/site-packages/tvm/"

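# Install and activate Emscripten (emsdk), needed to compile the model runtime to WebAssembly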
RUN git clone https://github.com/emscripten-core/emsdk.git /opt/emsdk
RUN cd /opt/emsdk && ./emsdk install latest

ENV PATH="/opt/emsdk:/opt/emsdk/upstream/emscripten:/opt/emsdk/node/16.20.0_64bit/bin:/opt/venv/bin:$PATH"
RUN cd /opt/emsdk/ && ./emsdk activate latest
ENV TVM_HOME=/opt/mlc-llm/3rdparty/tvm

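# Build the TVM web runtime against the pinned 3rdparty/tvm revision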
RUN cd /opt/mlc-llm/3rdparty/tvm \
  && git checkout 5828f1e9e \
  && git submodule init \
  && git submodule update --recursive \
  && make webclean \
  && make web


RUN python3 -m pip install "auto_gptq>=0.2.0" transformers

CMD /bin/bash

To build the Docker image, run:

docker build -t onceuponai/mlc-llm .

Now we are ready to run the container:

docker run --rm -it --name mlc-llm -v $(pwd)/data:/data --gpus all onceuponai/mlc-llm

and execute the mlc-llm build command:

python3 -m mlc_llm.build --hf-path TinyLlama/TinyLlama-1.1B-Chat-v0.6  --target webgpu --quantization q4f32_0 --use-safetensors

where (see the Documentation):

hf-path - the Hugging Face model name, in this case TinyLlama/TinyLlama-1.1B-Chat-v0.6

target - the platform to prepare the model for, in this case webgpu

quantization - the quantization mode, named qAfB(_0), where A is the number of bits for weights and B is the number of bits for activations; available options: autogptq_llama_q4f16_0, autogptq_llama_q4f16_1, q0f16, q0f32, q3f16_0, q3f16_1, q4f16_0, q4f16_1, q4f16_2, q4f16_ft, q4f32_0, q4f32_1, q8f16_ft, q8f16_1

In our case we use the webgpu target and q4f32_0 quantization to obtain the wasm file and the converted model weights. I have shared several converted models on HuggingFace and GitHub.

The model can then be used directly in a web application.

Example TypeScript code is available here: https://github.com/onceuponai-dev/stories-thumbellama
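
For illustration, here is a minimal sketch of how the converted model could be loaded in the browser with the @mlc-ai/web-llm package. The model id, the weight and wasm URLs, and the config field names (model_list, model_lib_map) below are assumptions that vary between web-llm versions, so treat the linked repository as the working reference.

import { ChatModule } from "@mlc-ai/web-llm";

// NOTE: illustrative only -- replace the placeholder URLs with the locations
// of your own converted weights and the compiled webgpu wasm.
const appConfig = {
  model_list: [
    {
      model_url:
        "https://huggingface.co/<user>/TinyLlama-1.1B-Chat-v0.6-q4f32_0/resolve/main/",
      local_id: "TinyLlama-1.1B-Chat-v0.6-q4f32_0",
    },
  ],
  model_lib_map: {
    "TinyLlama-1.1B-Chat-v0.6-q4f32_0":
      "https://<your-host>/TinyLlama-1.1B-Chat-v0.6-q4f32_0-webgpu.wasm",
  },
};

async function main() {
  const chat = new ChatModule();
  // Downloads the quantized weights and the WebGPU wasm, then initializes the runtime.
  await chat.reload("TinyLlama-1.1B-Chat-v0.6-q4f32_0", undefined, appConfig);
  // Generate a reply, streaming partial output to the console.
  const reply = await chat.generate("Tell me a short story.", (_step, message) => {
    console.log(message);
  });
  console.log(reply);
}

main();

Once reload resolves, the same ChatModule instance can be reused for further prompts without downloading the weights again.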

You can also quickly test the model at: https://stories.onceuponai.dev/stories-thumbellama/