Petals: LLMs of the people, by the people, for the people!

(Image from https://petals.ml/.)

I’ve recently been posting about the democratization of Large Language Models (LLMs), and it seems that, in parallel, an incident has brought the topic to the center of public AI discourse: an allegedly leaked Google document titled “We have no moat”. I’m still thinking about its content (you may find a link to the article and a discussion about it on LessWrong, a forum originally founded by the AI researcher Eliezer Yudkowsky), but I’m surprised to see that it misses one of the most interesting AI democratization initiatives I’ve come across lately. It goes by the name of “Petals”.

There are a lot of summaries of what A. Karpathy calls the “Cambrian explosion” of the LLM ecosystem in the past few months, covering both closed and open-source LLM development (my favourite is Sebastian Raschka’s Ahead of AI #8). We’re all aware of OpenAI’s ChatGPT (powered by the closed-source GPT-3.5-turbo as well as the crown jewel GPT-4), for which we have, at best, partial knowledge of the dataset, methodology, code, or actual parameter values needed to replicate the model. Meta’s LLaMA was briefly the best attempt at a reasonably-sized open-source LLM, and a lot of models have built on top of it to adapt it for conversational and instruction-following purposes. The main issue with creating applications based on LLaMA is that the license for the weights is not really permissive, and they can’t be used to power commercial applications.

Petals aims to simplify access to the latest research in deep learning and broaden access to LLMs. It is brought together by BigScience, who describe themselves as follows:

BigScience is not a consortium nor an officially incorporated entity. It’s an open collaboration boot-strapped by HuggingFace, GENCI and IDRIS, and organised as a research workshop. This research workshop gathers academic, industrial and independent researchers from many affiliations and whose research interests span many fields of research across AI, NLP, social sciences, legal, ethics and public policy.

Brief History of BigScience, BLOOM and Petals

The BigScience workshop has developed and made available a number of LLMs, with its biggest non-fine-tuned model, BLOOM, reaching 176 billion parameters. For perspective, OpenAI’s GPT-4 is reportedly estimated to contain around 0.5 to 1 trillion tunable weights. BLOOM’s most exciting features are that its training corpus deliberately contains a large share of non-English text as well as code examples, as seen in the pie chart of the training data below.

Afterwards, BigScience contributed BLOOM-Z to the open-source community: an LLM that builds on BLOOM but is fine-tuned on multi-task instructions in English (as well as instructions machine-translated from English), and that can carry over the desired behaviour when prompted with tasks in other languages. Specifically, the BLOOM-Z paper notes:

We find finetuning large multilingual language models on English tasks with English prompts allows for task generalization to non-English languages that appear only in the pretraining corpus. […] We investigate whether English-only multitask finetuning also improves performance on non-English held-out tasks using the multilingual BLOOM […] We find that after finetuning on the English-only multitask mixture […] performance on a diverse set of non-English held-out tasks increases.

Next, the team behind BigScience took ease of access to the next level by developing the Petals framework. The paradigm shift that Petals introduces is bypassing the hardware requirements of running large language models (such as BLOOM) by delegating the computation to a collaborative online cluster. This decentralized cluster consists of the interconnected GPUs of volunteers, each offering to perform part of the computation needed to produce any user’s output from the LLM.

Think of it as an orchestra where each musician (volunteer) plays their instrument (GPU), contributing to the harmonious melody (LLM output) that is effortlessly enjoyed by the audience (users) without needing to own or maintain the instruments themselves.

Petals’ main features

Petals supports inference and fine-tuning of its supported LLMs (currently BLOOM and its fine-tuned BLOOM-Z counterpart, with more planned for the future). In practice, the client (user) specifies the model to use, loads only a small slice of its architecture (the parts with inexpensive computations) and sends packets to the connected servers (the swarm), in “BitTorrent style”.
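
To make this concrete, a minimal client session might look something like the sketch below. It loosely follows the Hugging Face transformers API that Petals mimics; the DistributedBloomForCausalLM class and the bigscience/bloom-petals checkpoint name (the latter visible in the printout further down) are assumptions that may have changed in newer versions.

from transformers import BloomTokenizerFast
from petals import DistributedBloomForCausalLM

# Assumed checkpoint name (it also appears in the model printout below)
MODEL_NAME = "bigscience/bloom-petals"

tokenizer = BloomTokenizerFast.from_pretrained(MODEL_NAME)
# Only the lightweight layers are materialized locally; the heavy
# Transformer blocks are reached over the network through the swarm
model = DistributedBloomForCausalLM.from_pretrained(MODEL_NAME)

inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)  # each new token is computed by the swarm
print(tokenizer.decode(outputs[0]))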

When a model is set up this way (here with the BLOOM-Z model), only the small input and output layers (the token embeddings and layer norms) are loaded on your machine, while the 70 Transformer blocks are accessed remotely through a RemoteSequential module, as seen when inspecting model.transformer:

DistributedBloomModel(
  (word_embeddings): Embedding(250880, 14336)
  (word_embeddings_layernorm): LayerNorm((14336,), eps=1e-05, elementwise_affine=True)
  (h): RemoteSequential(modules=bigscience/bloom-petals.0..bigscience/bloom-petals.69)
  (ln_f): LayerNorm((14336,), eps=1e-05, elementwise_affine=True)
)

The project reports inference at a rate of approximately 1 token per second when a single batch is sent from the client, which is enough to power chatbots and similar interactive applications. You can even use the built-in model.inference_session() context manager to keep a generation session open and implement your own customization of the model’s output.
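
As a rough illustration of how that context manager could be used, here is a bare-bones chat loop (reusing the model and tokenizer from the snippet above). Note that passing the open session to generate() via a session keyword is my assumption about the API; check the current Petals docs for the exact signature.

# Sketch of a minimal chat loop on top of an open inference session.
# The `session=` keyword argument to generate() is an assumption and
# may differ between Petals versions.
with model.inference_session(max_length=512) as session:
    while True:
        user_message = input("You: ")
        if not user_message:
            break
        prompt = tokenizer(f"Human: {user_message}\nAI:", return_tensors="pt")["input_ids"]
        # The session keeps attention caches alive on the remote servers,
        # so each new message only pushes the new tokens through the swarm
        outputs = model.generate(prompt, max_new_tokens=64, session=session)
        print("AI:", tokenizer.decode(outputs[0][prompt.shape[1]:]))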

A set of inference and fine-tuning benchmarks for the BLOOM-176B model has been published. The benchmarks describe three setup configurations with varying network performance and resource availability. They also compare Petals with two configurations of parameter offloading (see further below), an alternative method of performing inference on machines with limited resources, where the large model is kept across CPU RAM, GPU VRAM, or even disk, and parameters are swapped back and forth as needed. Offloading is reported to be an order of magnitude slower than Petals for inference, and competitive only for fine-tuning with many GPUs.

| Network (bandwidth, RTT) | Single-batch inference, seq. len. 128 (steps/s) | Single-batch inference, seq. len. 2048 (steps/s) | Parallel forward, batch size 1 (tokens/s) | Parallel forward, batch size 64 (tokens/s) |
|---|---|---|---|---|
| Petals on 3 physical servers, with one A100 each | | | | |
| 1 Gbit/s, <5 ms | 1.71 | 1.54 | 70.0 | 253.6 |
| 100 Mbit/s, <5 ms | 1.66 | 1.49 | 56.4 | 182.0 |
| 100 Mbit/s, 100 ms | 1.23 | 1.11 | 19.7 | 112.2 |
| Petals on 12 virtual servers | | | | |
| 1 Gbit/s, <5 ms | 1.24 | 1.06 | 37.9 | 180.0 |
| 100 Mbit/s, <5 ms | 1.24 | 1.05 | 25.6 | 66.6 |
| 100 Mbit/s, 100 ms | 0.57 | 0.53 | 5.8 | 44.3 |
| Petals on 14 real servers in Europe and North America | | | | |
| Real world | 0.83 | 0.79 | 32.6 | 179.4 |
| Offloading, max. speed on 1x A100 | | | | |
| 256 Gbit/s | 0.18 | 0.18 | 2.7 | 170.3 |
| 128 Gbit/s | 0.09 | 0.09 | 2.4 | 152.8 |
| Offloading, max. speed on 3x A100 | | | | |
| 256 Gbit/s | 0.09 | 0.09 | 5.1 | 325.1 |
| 128 Gbit/s | 0.05 | 0.05 | 3.5 | 226.3 |
(Table from the Petals paper.)

Fine-tuning can also be performed via parameter-efficient methods, such as LoRA (discussed further below), which has become the go-to standard. The weights of the added trainable layers and the optimizer state are stored locally, while the main model’s forward passes happen remotely in the swarm.
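
To make that division of labour concrete, here is a rough sketch of the idea rather than the exact Petals API: a small trainable head (standing in for an adapter such as LoRA) and its optimizer live on the client, while every forward pass through the frozen 176B backbone is executed by the swarm. The train_batches iterable and the head architecture are hypothetical.

import torch

# Sketch only: a small trainable module kept locally, standing in for a
# parameter-efficient adapter such as LoRA; the BLOOM backbone stays
# frozen and runs remotely on the swarm.
head = torch.nn.Linear(14336, 2)   # 14336 = BLOOM's hidden size (see printout above)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)

for input_ids, labels in train_batches:  # hypothetical iterable of tokenized batches
    hidden = model.transformer(input_ids).last_hidden_state  # forward pass runs on the swarm
    logits = head(hidden[:, -1, :])                          # local, trainable
    loss = torch.nn.functional.cross_entropy(logits, labels)
    loss.backward()        # backprop; we only step on the local head's parameters
    optimizer.step()
    optimizer.zero_grad()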

Finally, as a true open-source decentralized effort, a project like Petals relies on the goodwill of volunteers who contribute their computational resources to the server swarm. The team provides an easy, parametrizable way for users to share their GPUs with the collaborative resource pool, thus increasing the capacity of the network. For example, users can share several GPUs, or share only a portion of a single GPU by specifying how many Transformer blocks to host. The team is also considering other incentives (called “bloom points”) to attract volunteers, potentially rewarding them with higher priority in the swarm, stronger security guarantees, or other perks.

Privacy

The Petals developers have taken note of the privacy of users’ data. To address it, a private server swarm can be created among trusted parties, so that inputs are distributed for processing only within that group, in seclusion from the outside world.

Your data will be processed by other people in the public swarm. Learn more about privacy here. For sensitive data, you can set up a private swarm among people you trust.
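
On the client side, connecting to such a private swarm might look like the sketch below. I am assuming here that the client accepts an initial_peers argument pointing at your own bootstrap servers, as is typical for hivemind-based projects; the parameter name and the address format are assumptions, so consult the Petals docs for the exact setup.

# Sketch of pointing the client at a private swarm instead of the public one.
# The initial_peers argument and the multiaddress below are illustrative
# assumptions, not verified against the current Petals release.
PRIVATE_PEERS = ["/ip4/10.0.0.2/tcp/31337/p2p/QmExamplePeerId"]

model = DistributedBloomForCausalLM.from_pretrained(
    "bigscience/bloom-petals",
    initial_peers=PRIVATE_PEERS,  # only servers run by people you trust
)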

Alternatives and LLM pre-training

The open-source community has made great strides in refining approaches that allow LLMs to run on consumer-level hardware, and in some cases even to be fine-tuned there. See below for some links regarding the most widely used approaches:

However, unless one has access to a research facility, the pre-training step, i.e. the most time- and resource-consuming part of developing a modern LLM from scratch, is mostly accessible only by renting GPU pods from third-party providers online, and at a considerable price.

The Petals developers note that decentralizing the LLM pre-training process is also on their radar: they have developed hivemind, a PyTorch library for deep learning over the internet, on which Petals itself is built, but which has so far only been used to train smaller models.

Petals brings the hope of fine-tuning very large models to the widest audience possible. I am very excited and hopeful to see whether this approach gains more ground in the future!

Try Petals!

You can try Petals yourself with this Google Colab notebook:

You can also try the example conversational application that forwards HTTP requests to the swarm:

Or, if you feel like contributing your own compute resources over the Internet, read the docs and join the cluster via these links: