Unleash the Power of AI on Your Laptop with GPT4All


The world of artificial intelligence (AI) has seen significant advancements in recent years, with OpenAI’s GPT-4 being one of the most groundbreaking language models to date. However, harnessing the full potential of such models often requires high-end GPUs and expensive hardware, making them inaccessible to many users. That’s where GPT4All comes into play! In this comprehensive guide, we’ll introduce you to GPT4All, an optimized AI model that runs smoothly on your laptop using just your CPU.

Before you continue reading, please watch the short video introduction.

GPT4All was trained on a massive, curated corpus of assistant interactions, covering a diverse range of tasks and scenarios. This includes word problems, story descriptions, multi-turn dialogues, and even code.

The authors have shared the data and the code used to train the model (https://github.com/nomic-ai/gpt4all), and they have also prepared a technical report which describes all the details.

In the first stage, the authors collected one million prompt-response pairs using the OpenAI GPT API. Then they cleaned and curated the data using the Atlas project.

[Figure: training]

Finally, the released model was trained using the Low-Rank Adaptation (LoRA) approach, which reduces the number of trainable parameters and the required resources.
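
To illustrate the idea, below is a minimal sketch of LoRA fine-tuning with the Hugging Face peft library on a small stand-in model. This is my own illustration of the technique, not the authors' training code; the base model and hyperparameters are assumptions.

# Illustrative only: how LoRA reduces the number of trainable parameters,
# shown with the Hugging Face peft library on a small stand-in model.
# This is NOT the GPT4All training configuration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the adapter output
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attach adapters to the attention projection
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only a small fraction of the weights is trainable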

The authors have also shared a library which automatically downloads the model and exposes a simple Python API, as well as a console interface.
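
A minimal usage sketch is shown below; the model file name and method signatures are illustrative and may differ between versions of the gpt4all package.

# A minimal sketch using the gpt4all Python package; the model file name and
# method signatures are illustrative and may differ between package versions.
from gpt4all import GPT4All

# The library downloads the model on first use and runs it fully on the CPU.
model = GPT4All("ggml-gpt4all-j-v1.3-groovy")

response = model.generate("Write a short poem about running AI on a laptop.")
print(response)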

To simplify interactions, I have added a simple web UI: https://github.com/qooba/gpt4all-ui

To install it, clone the repository and install the requirements. Then you are ready to run the app (open: http://localhost:8000/) and start prompting:

git clone https://github.com/qooba/gpt4all-ui.git
cd gpt4all-ui
pip install -r requirements.txt

uvicorn main:app --reload

[Figure: web UI]

Now you are ready to run GPT4All on your everyday laptop, without expensive hardware or high-end GPUs, and prompt it in the browser.

Discover a Delicious Way to Use Delta Lake! Yummy Delta - #1 Introduction

[Figure: Yummy Delta]

Delta Lake is an open-source storage framework for building lakehouse architectures on top of data lakes.

Additionally, it brings reliability to data lakes with features like ACID transactions, scalable metadata handling, schema enforcement, time travel, and many more.

Before you continue reading, please watch the short video introduction.

Delta Lake can be used with compute engines like Spark, Flink, Presto, Trino, and Hive. It also has APIs for Scala, Java, Rust, Ruby, and Python.

[Figure: Delta Lake]

To simplify integrations with Delta Lake, I have built a REST API layer called Yummy Delta.

It abstracts multiple Delta Lake tables, providing operations like creating new tables, writing, and querying, but also optimizing and vacuuming. I have implemented the whole solution in Rust, based on the delta-rs project.

Delta Lake keeps the data in Parquet files, an open-source, column-oriented data file format.

Additionally, it writes the metadata to the transaction log: JSON files containing information about all performed operations.

The transaction log is stored in the _delta_log subdirectory of the Delta table.

[Figure: Delta Lake transaction log]

For example, every data write creates a new Parquet file. After the write is done, a new transaction log file is created, which finishes the transaction. Update and delete operations are handled in a similar way. On the other hand, when we read data from Delta Lake, the transaction files are read first, and then the appropriate Parquet files are loaded according to the transaction data.

Thanks to this mechanism, Delta Lake guarantees ACID transactions.

There are several Delta Lake integrations, and one of them is the delta-rs Rust library.

Currently, the delta-rs implementation lets us use multiple storage backends, including the local filesystem, AWS S3, Azure Blob Storage, Azure Data Lake Storage Gen2, and Google Cloud Storage.
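
As a side note, the delta-rs Python bindings (the deltalake package) can also be used directly; below is a minimal local-filesystem sketch, with paths and data chosen only for illustration.

# A minimal sketch of the delta-rs Python bindings (the `deltalake` package).
# The local path and data are illustrative; cloud paths work with storage options.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

table_path = "/tmp/delta-test-1/demo"

# Each write creates new Parquet files plus a JSON entry in _delta_log.
write_deltalake(table_path, pd.DataFrame({"id": [1, 2], "value": ["a", "b"]}))
write_deltalake(table_path, pd.DataFrame({"id": [3], "value": ["c"]}), mode="append")

dt = DeltaTable(table_path)
print(dt.version())    # latest transaction version
print(dt.files())      # Parquet files referenced by the current version
print(dt.to_pandas())  # read the table back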

To be able to manage multiple Delta tables on multiple stores, I have built the Yummy Delta server, which exposes a REST API.

[Figure: Yummy Delta server]

Using the API we can list and create Delta tables, inspect table schemas, append or overwrite data in Delta tables, and perform additional operations like optimize or vacuum.

You can find API reference here: https://www.yummyml.com/delta

Moreover, we can query the data using DataFusion SQL. Query results are returned as a stream, so we can process them in batches.

You can simply install Yummy Delta as a Python package:

pip3 install yummy[delta]

Then we need to prepare a config file:

stores:
  - name: local
    path: "/tmp/delta-test-1/"
  - name: az
    path: "az://delta-test-1/"

And you are ready to run the server from the command line:

yummy delta server -h 0.0.0.0 -p 8080 -f config.yaml

Now we are able to perform all operations using the REST API.
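
For illustration, a request from Python could look roughly like the sketch below. The endpoint paths and payload fields here are placeholders I made up, not the documented routes, so please consult the API reference linked above for the real ones.

# Hypothetical request shapes: the endpoint paths and payload fields below are
# placeholders for illustration, not the documented Yummy Delta routes.
import requests

base_url = "http://localhost:8080"

# e.g. list Delta tables registered on the "local" store
print(requests.get(f"{base_url}/api/tables/local").json())

# e.g. append rows to a table
requests.post(
    f"{base_url}/api/tables/local/my_table/append",
    json={"columns": {"id": [1, 2], "value": ["a", "b"]}},
)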

Improve the performance of MLflow models with Rust


Real-time model deployment is a stage where performance is critical. In this article I will show how to speed up MLflow model serving and decrease resource consumption.

Additionally benchmark results will be presented.

Before you continue reading, please watch the short video introduction.

MLflow is an open-source platform which covers the end-to-end machine learning lifecycle, including tracking experiments, organizing code into reusable projects, model versioning, and finally model deployment.

[Figure: MLOps cycle]

With MLflow we can easily serve versioned models.

Moreover, it supports multiple ML frameworks and abstracts them with a consistent REST API.

Thanks to this, we can experiment with multiple model flavors without affecting the existing integration.

[Figure: MLflow serving]

MLflow is written in Python and uses Python to serve real-time models. This simplifies the integration with ML frameworks which expose a Python API.

On the other hand, real-time model serving is a stage where prediction latency and resource consumption are critical.

Additionally, serving robustness is required even under higher load.

To check how a Rust implementation would perform, I have implemented an ML model server which can read MLflow models and exposes the same REST API.

For test purposes, I have implemented integration with the LightGBM and CatBoost model flavors, using Rust bindings to the native libraries.
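
For context, the sketch below shows how a LightGBM model ends up in the MLflow model format that a serving process later loads from disk; the dataset and output path are illustrative choices of mine.

# Illustrative: training a LightGBM model and saving it in the MLflow model
# format; the resulting directory (MLmodel plus the model file) is what a
# serving process loads from disk. Dataset and output path are placeholders.
import lightgbm as lgb
import mlflow.lightgbm
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)  # 13 numeric features, 3 classes
train_set = lgb.Dataset(X, label=y)
booster = lgb.train({"objective": "multiclass", "num_class": 3}, train_set, num_boost_round=50)

mlflow.lightgbm.save_model(booster, "wine_model")
# Later the saved model directory can be served, e.g.:
# yummy_mlflow.model_serve("wine_model", "0.0.0.0", 8080, "error")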

[Figure: architecture]

I have used Vegeta to perform load tests and measure the p99 response time for different numbers of requests per second. Additionally, I have measured the CPU and memory usage of the model serving container. All tests have been performed on my local machine.

The performance tests show that the Rust implementation is very promising.

[Figure: benchmark results]

For all models, even at 1000 requests per second, the response time stays low. CPU usage increases linearly as traffic increases, and memory usage is constant.

On the other hand, the MLflow Python serving implementation performs much worse: for higher traffic the response times exceed 5 seconds, which is above the timeout value. CPU usage quickly consumes the available machine resources. Memory usage is stable in all cases.

The Rust implementation is wrapped with a Python API and available in Yummy, so you can simply install it and run it from the command line or from Python code.

pip install yummy-mlflow

import yummy_mlflow

# yummy_mlflow.model_serve(MODEL_PATH, HOST, PORT, LOG_LEVEL)

yummy_mlflow.model_serve(model_path, '0.0.0.0', 8080, 'error')

Example request:

curl -X POST "http://localhost:8080/invocations" \
-H "Content-Type: application/json" \
-d '{
    "columns": ["0","1","2","3","4","5","6","7","8","9","10",
               "11","12"],
    "data": [
     [ 0.913333, -0.598156, -0.425909, -0.929365,  1.281985,
       0.488531,  0.874184, -1.223610,  0.050988,  0.342557,
      -0.164303,  0.830961,  0.997086
    ]]
}'

Example response:

[[0.9849612333276241, 0.008531186707393178, 0.006507579964982725]]

The whole implementation and benchmark code is available on GitHub. Currently, LightGBM and CatBoost local models are supported.

The Yummy mlflow models server usage description is available at: https://www.yummyml.com/yummy-mlflow-models-serving

Speed up feature serving with Rust - Yummy serve


In this article I will introduce the Yummy feature server, implemented in Rust. The feature server is fully compatible with the Feast implementation. Additionally, benchmark results will be presented.

Before you continue reading, please watch the short video introduction.

Another step during MLOps process creation is feature serving.

A historical feature store is used during model training to fetch a wide range of entities and a large dataset with a small number of queries. For this process, the data fetch latency is important but not critical.

On the other hand, when we serve the model, feature fetching latency is crucial and determines the prediction time.

[Figure: feature store]

That’s why we use very fast online stores like Redis or DynamoDB.

[Figure: online stores]

The question which appears at this point is: should we call the online store directly, or use a feature server?

Because multiple models can reuse already prepared features, we don’t want to add feature store dependencies to the models. Thus we abstract the online store with a feature server, which serves features using, for example, a REST API.

[Figure: architecture]

On the other hand, the latency introduced by this additional layer should be minimized.

Using Feast, we can manage the feature lifecycle and serve features using a built-in feature server implemented in Python, Java, or Go.

[Figure: assumptions]

According to the provided benchmark, the Feast feature server is very fast. But can we go faster with a smaller amount of computing resources?

To answer this question, I have implemented a feature server in Rust, which is known for its speed and safety.

One of the basic assumptions was to ensure full compatibility with Feast and usage simplicity.
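
Since the server keeps Feast's HTTP interface, a request looks like a regular Feast /get-online-features call; the feature service and entity values below are illustrative.

# The request shape follows Feast's /get-online-features HTTP endpoint, which
# the Rust server mirrors; the feature service and entity values are illustrative.
import requests

payload = {
    "feature_service": "my_feature_service",
    "entities": {"entity": [1001, 1002]},
}

response = requests.post("http://localhost:6566/get-online-features", json=payload)
print(response.json())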

I have also decided to start implementation with Redis as an online store.

The whole benchmark code is available on GitHub.

To reproduce the benchmark, we will clone the repository:

git clone https://github.com/yummyml/feature-servers-benchmarks.git
cd feature-servers-benchmarks

For simplicity I will use Docker. Thus, in the first step we will prepare all the required images: the Feast and Yummy feature servers, the Vegeta load generator, and Redis.

./build.sh

Then I will use a data generator to prepare the dataset, apply the feature store, and materialize it to Redis.

./materialize.sh

Now we are ready to start the benchmark.

In contrast to the Feast benchmark, where sixteen feature server instances were used, I will run a single instance to simulate behavior with a smaller amount of resources.

The whole benchmark contains multiple scenarios, like changing the number of entities, the number of features, or increasing the number of requests per second:

# single_run <entities> <features> <concurrency> <rps>

echo "Change only number of rows"

single_run 1 50 $CONCURRENCY 10

for i in $(seq 10 10 100); do single_run $i 50 $CONCURRENCY 10; done


echo "Change only number of features"

for i in $(seq 50 50 250); do single_run 1 $i $CONCURRENCY 10; done


echo "Change only number of requests"

for i in $(seq 10 10 100); do single_run 1 50 $CONCURRENCY $i; done

for i in $(seq 100 100 1000); do single_run 1 50 $CONCURRENCY $i; done

for i in $(seq 10 10 100); do single_run 1 250 $CONCURRENCY $i; done

for i in $(seq 10 10 100); do single_run 100 50 $CONCURRENCY $i; done

for i in $(seq 10 10 100); do single_run 100 250 $CONCURRENCY $i; done

All results are available on GitHub, but here I will limit the analysis to the p99 response times for different numbers of requests.

All tests were performed on my local machine with 6 CPU cores at 2.59 GHz and 32 GB of memory.

During these tests, I will fetch a single entity with fifty features using a feature service.

To run the Rust feature server benchmark, we will run:

./run_test_yummy.sh

For the Rust implementation, the p99 response times are stable and below 4 ms, going from 10 requests per second up to 100 requests per second.

[Figure: Yummy benchmark results]

For Feast, following the documentation, I have set go_feature_retrieval to True in feature_store.yaml:

registry: registry.db
project: feature_repo
provider: local
online_store:
  type: redis
  connection_string: redis:6379
offline_store:
  type: file
go_feature_retrieval: True
entity_key_serialization_version: 2

Additionally, I have added the --go option to the feast serve command line:

feast serve --host "0.0.0.0" --port 6566 --no-access-log --no-feature-log --go

Thus I assume that the Go implementation of the feature server is used. In this part, I have used the official feastdev/feature-server:0.26.0 Feast Docker image.

Again, I will fetch a single entity with fifty features using a feature service. For 10 requests per second the p99 response time is 92 ms.

Unfortunately, for 20 requests per second and above, the p99 response time is above 5 s, which exceeds our timeout value.

[Figure: Feast benchmark results]

Additionally, during the Feast benchmark run I noticed increasing memory allocation, which may be caused by a memory leak.

This benchmark indicates that the Rust implementation is very promising: the response times are small and stable, and the resource consumption is low.

The Yummy feature server usage description is available at: https://www.yummyml.com/feature-server

Graph Embeddings with Feature Store


In this article I will show how to generate and use graph embeddings with a feature store.

Before you continue reading, please watch the short video introduction.

Graphs are structures which contain sets of entity nodes and edges representing the interactions between them. Such data structures can be used in many areas, like social networks, web data, or even molecular biology, for modeling real-life interactions.

To use the properties contained in graphs in machine learning algorithms, we need to map them to more accessible representations called embeddings.

[Figure: embeddings]

In contrast to graphs, embeddings are structures representing node features, and they can easily be used as an input to machine learning algorithms.

Because graphs are frequently represented by large datasets, embedding calculation can be challenging. To solve this problem, I will use a very efficient open-source project, Cleora, which is entirely written in Rust.


Let’s follow the Cleora algorithm. In the first step, we need to choose the number of features, which determines the embedding dimensionality. Then we initialize the embedding matrix. In the next step, based on the input data, we calculate the random-walk transition matrix. This matrix describes the relations between nodes: each entry is the ratio of the number of edges running from the first node to the second node to the degree of the first node. The training phase is an iterative multiplication of the embedding matrix by the transition matrix, followed by L2 normalization of the embedding rows.

Finally, we get the embedding matrix after the defined number of iterations.
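
A compact NumPy sketch of this iteration is shown below; it is illustrative only, the real implementation is the optimized Rust code described above.

# Illustrative NumPy sketch of the Cleora-style iteration described above;
# the real implementation is the optimized Rust code.
import numpy as np

def embed(edges, num_nodes, dim=8, iterations=5, seed=0):
    rng = np.random.default_rng(seed)

    # Random-walk transition matrix: edge count from i to j divided by the degree of i.
    counts = np.zeros((num_nodes, num_nodes))
    for i, j in edges:
        counts[i, j] += 1
    degree = counts.sum(axis=1, keepdims=True)
    transition = np.divide(counts, degree, out=np.zeros_like(counts), where=degree > 0)

    # Initialize the embedding matrix, then iterate: multiply and L2-normalize the rows.
    emb = rng.normal(size=(num_nodes, dim))
    for _ in range(iterations):
        emb = transition @ emb
        norms = np.linalg.norm(emb, axis=1, keepdims=True)
        emb = np.divide(emb, norms, out=np.zeros_like(emb), where=norms > 0)
    return emb

print(embed([(0, 1), (1, 0), (1, 2), (2, 1)], num_nodes=3).shape)  # (3, 8)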


Moreover, to be able to build the solution simply, I have extended the project with the possibility of reading from and writing to an S3 store, and with Apache Parquet format support, which significantly reduces the embedding size.


Additionally, I have wrapped the Rust code with Python bindings, so we can simply install it and use it as a Python package.

Based on the Cleora example, I will use the Facebook dataset from SNAP to calculate embeddings for the page-to-page graph, and train a machine learning model which classifies the page category.

curl -LO https://snap.stanford.edu/data/facebook_large.zip
unzip facebook_large.zip

As an S3 store we will use MinIO storage:

docker run --rm -it -p 9000:9000 \
 -p 9001:9001 --name minio \
 -v $(pwd)/minio-data:/data \
 --network app_default \
 minio/minio server /data --console-address ":9001"

import os
import boto3
from botocore.client import Config

os.environ["AWS_ACCESS_KEY_ID"]= "minioadmin"
os.environ["AWS_SECRET_ACCESS_KEY"]= "minioadmin"
os.environ["FEAST_S3_ENDPOINT_URL"]="http://minio:9000"
os.environ["S3_ENDPOINT_URL"]= "http://minio:9000"

s3 = boto3.resource('s3', endpoint_url='http://minio:9000')
s3.create_bucket(Bucket="input")
s3.create_bucket(Bucket="output")
s3.create_bucket(Bucket="data")

In the first step, we need to prepare the input file in the appropriate clique or star expansion format.

# based on: https://github.com/Synerise/cleora/blob/master/example_classification.ipynb
import pandas as pd
import s3fs
import numpy as np
import random
from sklearn.model_selection import train_test_split
random.seed(0)
np.random.seed(0)

df_cleora = pd.read_csv("./facebook_large/musae_facebook_edges.csv")
train_cleora, test_cleora = train_test_split(df_cleora, test_size=0.2)

fb_cleora_input_clique_filename = "s3://input/fb_cleora_input_clique.txt"
fb_cleora_input_star_filename = "s3://input/fb_cleora_input_star.txt"

fs = s3fs.S3FileSystem(client_kwargs={'endpoint_url': "http://minio:9000"})

with fs.open(fb_cleora_input_clique_filename, "w") as f_cleora_clique, fs.open(fb_cleora_input_star_filename, "w") as f_cleora_star:
    grouped_train = train_cleora.groupby('id_1')
    for n, (name, group) in enumerate(grouped_train):
        group_list = group['id_2'].tolist()
        group_elems = list(map(str, group_list))
        f_cleora_clique.write("{} {}\n".format(name, ' '.join(group_elems)))
        f_cleora_star.write("{}\t{}\n".format(n, name))
        for elem in group_elems:
            f_cleora_star.write("{}\t{}\n".format(n, elem))

Then we use the Cleora Python bindings to calculate the embeddings and write them as Parquet files to the S3 MinIO store.

Cleora star expansion training:

import time
import cleora
output_dir = 's3://output'
fb_cleora_input_star_filename = "s3://input/fb_cleora_input_star.txt"

start_time = time.time()
cleora.run(
    input=[fb_cleora_input_star_filename],
    type_name="tsv",
    dimension=1024,
    max_iter=5,
    seed=None,
    prepend_field=False,
    log_every=1000,
    in_memory_embedding_calculation=True,
    cols_str="transient::cluster_id StarNode",
    output_dir=output_dir,
    output_format="parquet",
    relation_name="emb",
    chunk_size=3000,
)
print("--- %s seconds ---" % (time.time() - start_time))

Cleora clique expansion training:

fb_cleora_input_clique_filename = "s3://input/fb_cleora_input_clique.txt"
start_time = time.time()

cleora.run(
    input=[fb_cleora_input_clique_filename],
    type_name="tsv",
    dimension=1024,
    max_iter=5,
    seed=None,
    prepend_field=False,
    log_every=1000,
    in_memory_embedding_calculation=True,
    cols_str="complex::reflexive::CliqueNode",
    output_dir=output_dir,
    output_format="parquet",
    relation_name="emb",
    chunk_size=3000,
)
print("--- %s seconds ---" % (time.time() - start_time))

For each node, I have added an additional column, datetime, which represents a timestamp and will help to check how the calculated embeddings change over time. Additionally, every embedding recalculation will be saved as a separate Parquet file, e.g. emb__CliqueNode__CliqueNode_20220910T204145.parquet. Thus we will be able to follow the embeddings history.

Now we are ready to consume the calculated embeddings with the Feast feature store and the Yummy extension.

feature_store.yaml

project: repo
registry: s3://data/registry.db
provider: yummy.YummyProvider
backend: polars
online_store:
    type: sqlite
    path: data/online_store.db
offline_store:
    type: yummy.YummyOfflineStore

features.py

from datetime import timedelta
from feast import Entity, Field, FeatureView
from yummy import ParquetSource
from feast.types import Float32, Int32

my_stats_parquet = ParquetSource(
    name="my_stats",
    path="s3://output/emb__CliqueNode__CliqueNode_*.parquet",
    timestamp_field="datetime",
    s3_endpoint_override="http://minio:9000",
)

my_entity = Entity(name="entity", description="entity",)

schema = [Field(name="entity", dtype=Int32)] + [Field(name=f"f{i}", dtype=Float32) for i in range(0,1024)]

mystats_view_parquet = FeatureView(
    name="my_statistics_parquet",
    entities=[my_entity],
    ttl=timedelta(seconds=3600*24*20),
    schema=schema,
    online=True, source=my_stats_parquet, tags={},)

Then we apply the feature store definition:

feast apply

Now we are ready to fetch the embeddings for a defined timestamp.

from feast import FeatureStore
import polars as pl
import pandas as pd
import time
import os
from datetime import datetime
import yummy

store = FeatureStore(repo_path=".")
start_time = time.time()

features = [f"my_statistics_parquet:f{i}" for i in range(0,1024)]

training_df = store.get_historical_features(
    entity_df=yummy.select_all(datetime(2022, 9, 14, 23, 59, 42)),
    features = features,
).to_df()

print("--- %s seconds ---" % (time.time() - start_time))
training_df

Moreover, I have introduced the method:

yummy.select_all(datetime(2022, 9, 14, 23, 59, 42))

which will fetch all entities.

Then we prepare the training data for the SNAP dataset:

import numpy as np
from sklearn.model_selection import train_test_split
df = pd.read_csv("../facebook_large/musae_facebook_target.csv")

classes = df['page_type'].unique()
class_ids = list(range(0, len(classes)))
class_dict = {k:v for k,v in zip(classes, class_ids)}
df['page_type'] = [class_dict[item] for item in df['page_type']]

train_filename = "fb_classification_train.txt"
test_filename = "fb_classification_test.txt"

train, test = train_test_split(df, test_size=0.2)

training_df=training_df.astype({"entity": "int32"})

entities = training_df["entity"].to_numpy()

train = train[["id","page_type"]].to_numpy()
test = test[["id","page_type"]].to_numpy()

df_embeddings=training_df.drop(columns=["event_timestamp"])\
    .rename(columns={ f"f{i}":i+2 for i in range(1024) })\
    .rename(columns={"entity": 0}).set_index(0)

valid_idx = df_embeddings.index.to_numpy()
train = np.array(train[np.isin(train[:,0], valid_idx) & np.isin(train[:,1], valid_idx)])
test = np.array([t for t in test if (t[0] in valid_idx) and (t[1] in valid_idx)])

Finally, we will train the page classifier.

from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from tqdm import tqdm
epochs=[20]
batch_size = 256
test_batch_size = 1000
embeddings=df_embeddings
y_train = train[:, 1]
y_test = test[:, 1]

clf = SGDClassifier(random_state=0, loss='log_loss', alpha=0.0001)
for e in tqdm(range(0, max(epochs))):
    for idx in range(0,train.shape[0],batch_size):
        ex=train[idx:min(idx+batch_size,train.shape[0]),:]
        ex_emb_in = embeddings.loc[ex[:,0]].to_numpy()
        ex_y = y_train[idx:min(idx+batch_size,train.shape[0])]
        clf.partial_fit(ex_emb_in, ex_y, classes=[0,1,2,3])
    
    if e+1 in epochs:
        acc = 0.0
        y_pred = []
        for idx in range(0, test.shape[0], test_batch_size):
            ex = test[idx:min(idx+test_batch_size, test.shape[0]), :]
            ex_emb_in = embeddings.loc[ex[:,0]].to_numpy()
            pred = clf.predict_proba(ex_emb_in)
            classes = np.argmax(pred, axis=1)
            y_pred.extend(classes)

        f1_micro = f1_score(y_test, y_pred, average='micro')
        f1_macro = f1_score(y_test, y_pred, average='macro')
        print(' epochs: {}, micro f1: {}, macro f1:{}'.format( e+1, f1_micro, f1_macro))

Because the feature store can merge multiple sources, we can easily enrich the graph embeddings with additional features, like extra page information.

We can also track historical changes of the embeddings.


Moreover, using the feature store we can materialize the embeddings to the online store, which simplifies building a comprehensive MLOps process.
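
With Feast this is a single materialize call over a time range; the dates below are placeholders.

# Illustrative: materializing the embedding features from the offline Parquet
# source to the configured online store (sqlite in this setup); the date range
# is a placeholder.
from datetime import datetime
from feast import FeatureStore

store = FeatureStore(repo_path=".")
store.materialize(start_date=datetime(2022, 9, 1), end_date=datetime(2022, 9, 15))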

You can find the whole example.ipynb on GitHub, along with the Yummy documentation.