Yummy - delicious Feast extension

In this article I’d like to present a really delicious Feast extension: Yummy.

Before you continue reading, please watch the short introduction:

Last time I showed the Feast integration with the Dask framework, which helps to distribute ML solutions across a cluster but doesn’t solve other problems. Currently Feast uses a warehouse-based approach, where it builds and executes a query tailored to a specific database engine. Because of this architecture Feast can’t use multiple data sources at the same time. Moreover, the logic which fetches historical features from offline data sources is duplicated in every data source implementation, which makes it difficult to maintain.

To solve these problems I have decided to create the Yummy Feast extension, which is also published as a PyPI package.

In Yummy I have used a backend-based approach, which centralizes the logic that fetches historical data from offline stores. Currently the Spark, Dask, Ray and Polars backends are supported. Moreover, because the selected backend is responsible for joining the data, we can use multiple different data sources at the same time.
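To make the idea of a backend joining several sources more concrete, here is a minimal pure-Python sketch of a point-in-time join. It is only an illustration under stated assumptions, not Yummy's implementation: the function name `point_in_time_join`, the row layout and the sample values are hypothetical, and a real backend would do the same work on distributed dataframes.

```python
from datetime import datetime, timedelta


def point_in_time_join(entity_rows, sources, ttl):
    """For every entity row, attach the latest feature values from each source
    that are not newer than the entity timestamp and not older than the ttl."""
    result = []
    for row in entity_rows:
        joined = dict(row)
        for source in sources:
            candidates = [
                r for r in source
                if r["entity_id"] == row["entity_id"]
                and row["event_timestamp"] - ttl <= r["datetime"] <= row["event_timestamp"]
            ]
            if candidates:
                latest = max(candidates, key=lambda r: r["datetime"])
                for key, value in latest.items():
                    if key not in ("entity_id", "datetime"):
                        joined[key] = value
        result.append(joined)
    return result


# Hypothetical rows standing in for a parquet source and a csv source.
parquet_rows = [
    {"entity_id": 1, "datetime": datetime(2021, 1, 1), "f0": 0.1},
    {"entity_id": 1, "datetime": datetime(2021, 1, 5), "f0": 0.5},
]
csv_rows = [
    {"entity_id": 1, "datetime": datetime(2021, 1, 3), "f11": 7.0},
]
entity_rows = [{"entity_id": 1, "event_timestamp": datetime(2021, 1, 4)}]

joined = point_in_time_join(entity_rows, [parquet_rows, csv_rows], timedelta(days=20))
print(joined)
```

The row from 5 January is ignored because it lies after the requested timestamp, so no future values leak into the training data.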

Additionally, with Yummy we can start using a feature store on a single machine and then distribute it using the selected cluster type. We can also use ready-to-use platforms such as Databricks, Coiled and Anyscale to scale our solution.

To use Yummy we have to install it:

pip install yummy

Then we have to prepare the Feast configuration file feature_store.yaml:

project: repo
registry: s3://feast/data/registry.db
provider: local
online_store:
    type: redis
    connection_string: "redis:6379"
offline_store:
    type: yummy.YummyOfflineStore
    backend: dask

In this case we will use s3 as the feature store registry and redis as the online store. Yummy takes over the offline store responsibility, and here we have selected the dask backend. For the dask, ray and polars backends we don’t have to set up a cluster: if we don’t provide a cluster configuration, they will run locally. For Apache Spark, additional configuration is required even on a local machine.

In the next step we need to provide the feature store definition in a Python file, e.g. features.py:

from google.protobuf.duration_pb2 import Duration
from feast import Entity, Feature, FeatureView, ValueType
from yummy import ParquetDataSource, CsvDataSource, DeltaDataSource

my_stats_parquet = ParquetDataSource(path="/mnt/dataset/all_data.parquet", event_timestamp_column="datetime",)
my_stats_delta = DeltaDataSource(path="/mnt/dataset/all/", event_timestamp_column="datetime",)
my_stats_csv = CsvDataSource(path="/mnt/dataset/all_data.csv", event_timestamp_column="datetime",)

my_entity = Entity(name="entity_id", value_type=ValueType.INT64, description="entity id",)

mystats_view_parquet = FeatureView(name="my_statistics_parquet", entities=["entity_id"], ttl=Duration(seconds=3600*24*20),
    features=[
        Feature(name="f0", dtype=ValueType.FLOAT),
        Feature(name="f1", dtype=ValueType.FLOAT),
        Feature(name="y", dtype=ValueType.FLOAT),
    ], online=True, input=my_stats_parquet, tags={},)

mystats_view_delta = FeatureView(name="my_statistics_delta", entities=["entity_id"], ttl=Duration(seconds=3600*24*20),
    features=[
        Feature(name="f2", dtype=ValueType.FLOAT),
        Feature(name="f3", dtype=ValueType.FLOAT),
    ], online=True, input=my_stats_delta, tags={},)

mystats_view_csv = FeatureView(name="my_statistics_csv", entities=["entity_id"],
    ttl=Duration(seconds=3600*24*20),
    features=[
        Feature(name="f11", dtype=ValueType.FLOAT),
        Feature(name="f12", dtype=ValueType.FLOAT),
    ], online=True, input=my_stats_csv, tags={},)

In this case we have used three Yummy data sources: ParquetDataSource, DeltaDataSource and CsvDataSource. Beforehand I generated the three underlying data sets:

  • parquet file (/mnt/dataset/all_data.parquet)
  • delta lake (/mnt/dataset/all/)
  • csv file (/mnt/dataset/all_data.csv)

Currently Yummy won’t work with other Feast data sources like BigQuerySource or RedshiftSource.

Then we can apply our feature store definition and keep it on s3:

feast apply

Now we are ready to fetch required features from defined stores. To do this we simply run:

from feast import FeatureStore
import time

store = FeatureStore(repo_path='.')
start_time = time.time()
training_df = store.get_historical_features(
    entity_df=edf,
    features = [
        'my_statistics_parquet:f0',
        'my_statistics_parquet:f1',
        'my_statistics_parquet:y',
        'my_statistics_delta:f2',
        'my_statistics_delta:f3',
        'my_statistics_csv:f11',
        'my_statistics_csv:f12',
    ]
).to_df()
print("--- %s seconds --- " % (time.time() - start_time))
training_df
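The entity dataframe (`edf` above) is not defined in the snippet. As a minimal sketch, assuming that it holds one row per entity and timestamp, the rows could be prepared as shown here. The helper name `build_entity_rows` and the sample ids are hypothetical; `event_timestamp` is the column name Feast conventionally expects in an entity dataframe.

```python
from datetime import datetime, timedelta


def build_entity_rows(entity_ids, start, days):
    """One row per entity and day: the keys and timestamps to fetch features for."""
    return [
        {"entity_id": entity_id, "event_timestamp": start + timedelta(days=day)}
        for entity_id in entity_ids
        for day in range(days)
    ]


rows = build_entity_rows(entity_ids=[1, 2, 3], start=datetime(2021, 1, 1), days=2)
print(len(rows), rows[0])

# Assumption: with pandas installed, `edf = pandas.DataFrame(rows)` gives a
# dataframe that can be passed as entity_df to get_historical_features.
```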

We have started with the dask backend, but we can simply switch to ray by changing the feature_store.yaml configuration to:

project: repo
registry: s3://feast/data/registry.db
provider: local
online_store:
    type: redis
    connection_string: "redis:6379"
offline_store:
    type: yummy.YummyOfflineStore
    backend: ray

or to polars backend (which is currently the fastest option):

project: repo
registry: s3://feast/data/registry.db
provider: local
online_store:
    type: redis
    connection_string: "redis:6379"
offline_store:
    type: yummy.YummyOfflineStore
    backend: polars

We can also use a Spark cluster, where additional configuration options are available (they are used during Spark session initialization):

project: repo
registry: s3://feast/data/registry.db
provider: local
online_store:
    type: redis
    connection_string: "redis:6379"
offline_store:
    type: yummy.YummyOfflineStore
    backend: spark
    config:
        spark.master: "local[*]"
        spark.ui.enabled: "false"
        spark.eventLog.enabled: "false"
        spark.sql.session.timeZone: "UTC"
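Since the four configurations above differ only in the offline store section, the file can also be generated. The following is a small sketch that renders the same YAML text for a chosen backend; the helper `render_feature_store_yaml` is hypothetical and not part of Feast or Yummy, and the fixed values are simply copied from the examples above.

```python
SUPPORTED_BACKENDS = ("dask", "ray", "polars", "spark")


def render_feature_store_yaml(backend, spark_config=None):
    """Render feature_store.yaml text for the given Yummy backend."""
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"unsupported backend: {backend}")
    lines = [
        "project: repo",
        "registry: s3://feast/data/registry.db",
        "provider: local",
        "online_store:",
        "    type: redis",
        '    connection_string: "redis:6379"',
        "offline_store:",
        "    type: yummy.YummyOfflineStore",
        f"    backend: {backend}",
    ]
    # Only the spark backend takes the extra session options.
    if backend == "spark" and spark_config:
        lines.append("    config:")
        for key, value in spark_config.items():
            lines.append(f'        {key}: "{value}"')
    return "\n".join(lines) + "\n"


print(render_feature_store_yaml("spark", {"spark.master": "local[*]"}))
```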

Finally we can materialize data from the offline stores to the online store using the preferred backend:

feast materialize 2020-01-03T14:30:00 2023-01-03T14:30:00

Yummy solves several Feast limitations:

Feast

Distributed Feature Store with Feast and Dask

In this article I will show how we can combine Feast and the Dask library to create a distributed feature store.

Before you continue reading, please watch the short introduction:

The Feature Store is a very important component of the MLOps process, which helps to manage historical and online features. With Feast we can, for example, read historical features from parquet files and then materialize them to Redis as an online store.

But what should we do if the historical data size exceeds our machine's capabilities? The Dask library can help to solve this problem. Using Dask we can distribute the data and computations across multiple machines. Dask can run on a single machine or on a cluster (k8s, yarn, cloud, HPC, SSH, manual setup). We can start with a single machine and then smoothly move to a cluster if needed. Moreover, thanks to Dask we can read a bunch of parquet files using a path pattern and run distributed training using libraries like scikit-learn or XGBoost.

Feast with Dask

I have prepared a ready-to-use docker image, so you can simply reproduce all the steps.

docker run --name feast -d --rm -p 8888:8888 -p 8787:8787 qooba/feast:dask

Then check the Jupyter notebook token which you will need to login:

docker logs -f feast

And open (use the token to login):

http://localhost:8888/notebooks/feast-dask/feast-dask.ipynb#/slide-0-0

The notebook is also available on https://github.com/qooba/feast-dask/blob/main/docker/feast-dask.ipynb.

But with docker you will have the whole environment ready.

In the notebook you can find all the steps:

Random data generation

I have used numpy and scikit-learn to generate 1M entities and their historical data (10 features generated with the make_hastie_10_2 function) for 14 days, which I saved as a parquet file (1.34GB).
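For readers unfamiliar with make_hastie_10_2: it draws 10 standard Gaussian features per sample and sets the target to 1 when the sum of their squares exceeds 9.34, and to -1 otherwise. The following is a minimal standard-library sketch of that rule, not the notebook's code; the function name `make_hastie_like` and the sample size are hypothetical.

```python
import random


def make_hastie_like(n_samples, seed=0):
    """Generate samples following the Hastie 10.2 rule with plain Python."""
    rng = random.Random(seed)
    features, targets = [], []
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(10)]
        features.append(x)
        # 9.34 is close to the median of a chi-squared distribution with
        # 10 degrees of freedom, so the two classes are roughly balanced.
        targets.append(1.0 if sum(v * v for v in x) > 9.34 else -1.0)
    return features, targets


X, y = make_hastie_like(1000)
print(len(X), len(X[0]), y[:5])
```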

Feast configuration and registry

feature_store.yaml - where I use a local registry and an SQLite database as the online store.

features.py - with one file source (the generated parquet file) and the features definition.

To create the Feast registry we have to run:

feast apply

Additionally I have created simple library which helps to inspect feast schema directly in the Jupyter notebook

pip install feast-schema
from feast_schema import FeastSchema

FeastSchema('.').show_schema()

Feast schema

Dask cluster setup

Then I set up a simple Dask cluster with a scheduler and 4 workers.

dask-scheduler --host 0.0.0.0 --port 8786 --bokeh-port 8787 &

dask-worker --host 0.0.0.0 0.0.0.0:8786 --worker-port 8701 &
dask-worker --host 0.0.0.0 0.0.0.0:8786 --worker-port 8702 &
dask-worker --host 0.0.0.0 0.0.0.0:8786 --worker-port 8703 &
dask-worker --host 0.0.0.0 0.0.0.0:8786 --worker-port 8704 &

The Dask dashboard is exposed on port 8787 thus you can follow Dask metrics on:

http://localhost:8787/status

Dask dashboard

Fetching historical features

In the next step I fetched the historical features using Feast with Dask:

from feast import FeatureStore

store = FeatureStore(repo_path='.')
training_df = store.get_historical_features(
    entity_df=entity_df,
    feature_refs=[
        "my_statistics:f0",
        "my_statistics:f1",
        "my_statistics:f2",
        "my_statistics:f3",
        "my_statistics:f4",
        "my_statistics:f5",
        "my_statistics:f6",
        "my_statistics:f7",
        "my_statistics:f8",
        "my_statistics:f9",
        "my_statistics:y",
    ],
).to_df()
training_df

This takes about 14 seconds, which is much faster than Feast without Dask (roughly 12 times faster by wall time: 2 min 52 s versus 14.7 s).

Pandas
CPU times: user 2min 51s, sys: 6.64 s, total: 2min 57s
Wall time: 2min 52s

Dask
CPU times: user 458 ms, sys: 65.3 ms, total: 524 ms
Wall time: 14.7 s

Distributed training with Sklearn

After fetching the data we can start the training. We can use the fetched Pandas dataframe, but we can also fetch a Dask dataframe instead:

from feast import FeatureStore
store=FeatureStore(repo_path='.')
training_dd = store.get_historical_features(
    entity_df=entity_df,
    feature_refs=[
        "my_statistics:f0",
        "my_statistics:f1",
        "my_statistics:f2",
        "my_statistics:f3",
        "my_statistics:f4",
        "my_statistics:f5",
        "my_statistics:f6",
        "my_statistics:f7",
        "my_statistics:f8",
        "my_statistics:f9",
        "my_statistics:y",
    ]
).evaluation_function()

Using the Dask dataframe we can continue with distributed training on the distributed data. On the other hand, if we use the Pandas dataframe, the data will be collected on a single node.

To start distributed training with scikit-learn we can use the Joblib library with the Dask backend:

import joblib
from sklearn.ensemble import GradientBoostingClassifier
from dask_ml.model_selection import train_test_split

predictors = training_dd[["f0","f1","f2","f3","f4","f5","f6","f7","f8","f9"]]
targets = training_dd[["y"]]

X_train, X_test, y_train, y_test = train_test_split(predictors, targets, test_size=.3)

with joblib.parallel_backend('dask'):
    clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0, verbose=1).fit(X_train, y_train)
    
    score=clf.score(X_test, y_test)
    
score

Online features materialization

Finally I materialized the data to the local SQLite database:

feast materialize 2021-01-01T01:00:00 2021-01-31T23:59:00

In this case the materialization data is also prepared using Dask.
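Conceptually, materialization takes the latest feature values per entity inside the given time window and writes them to the online store. The following is a simplified sketch of that idea using an in-memory SQLite database. It does not reproduce the table layout Feast actually uses; the table name `online_features`, the helper `materialize` and the sample rows are hypothetical.

```python
import sqlite3
from datetime import datetime


def materialize(conn, rows, start, end):
    """Write the latest feature row per entity within [start, end] to SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS online_features "
        "(entity_id INTEGER PRIMARY KEY, event_ts TEXT, f0 REAL)"
    )
    latest = {}
    for row in rows:
        if start <= row["datetime"] <= end:
            current = latest.get(row["entity_id"])
            if current is None or row["datetime"] > current["datetime"]:
                latest[row["entity_id"]] = row
    for row in latest.values():
        conn.execute(
            "INSERT OR REPLACE INTO online_features VALUES (?, ?, ?)",
            (row["entity_id"], row["datetime"].isoformat(), row["f0"]),
        )
    conn.commit()
    return len(latest)


history = [
    {"entity_id": 1, "datetime": datetime(2021, 1, 2), "f0": 0.2},
    {"entity_id": 1, "datetime": datetime(2021, 1, 10), "f0": 0.9},
    {"entity_id": 1, "datetime": datetime(2021, 2, 5), "f0": 5.0},  # outside the window
    {"entity_id": 2, "datetime": datetime(2021, 1, 3), "f0": 0.3},
]
conn = sqlite3.connect(":memory:")
written = materialize(conn, history, datetime(2021, 1, 1, 1), datetime(2021, 1, 31, 23, 59))
online_value = conn.execute(
    "SELECT f0 FROM online_features WHERE entity_id = 1"
).fetchone()[0]
print(written, online_value)
```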

Toxicless texts with AI – how to measure text toxicity in the browser

In this article I will show how to measure comment toxicity using Machine Learning models.

Before you continue reading, please watch the short introduction:

Hateful, rude and toxic comments are a common problem on the internet, which affects many people. Today we will prepare a neural network which detects comment toxicity directly in the browser. The goal is to create a solution which detects toxicity in real time and warns the user while writing, which can discourage them from posting toxic comments.

To do this, we will train a TensorFlow Lite model, which will run in the browser using the WebAssembly backend. WebAssembly (WASM) allows running C, C++ or Rust code at near-native speed. Thanks to this, prediction performance will be better than with the JavaScript TensorFlow.js version. Moreover, we can serve the model on a static page, with no additional backend servers required.

web assembly

To train the model, we will use the Kaggle Toxic Comment Classification Challenge training data, which contains labeled comments with the following toxicity types:

  • toxic
  • severe_toxic
  • obscene
  • threat
  • insult
  • identity_hate

data set

Our model will only classify whether the text is toxic or not, so we need to start by preprocessing the training data. Then we will use the TensorFlow Lite Model Maker library with the Averaging Word Embedding specification, which creates word embeddings and dictionary mappings from the training data; thanks to this we can train the model in different languages. A model based on the Averaging Word Embedding specification is small (<1MB). If we have a small dataset, we can use pretrained embeddings instead by choosing the MobileBERT or BERT-Base specification. In that case the models will be much bigger: 25MB with quantization and 100MB without quantization for MobileBERT, and 300MB for BERT-Base (based on the tutorial).
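The preprocessing step can be sketched as follows: a comment counts as toxic if any of the six toxicity columns is set. This is a minimal standard-library illustration rather than the notebook's code. The column names follow the Kaggle training file; the helper `to_binary_rows`, the output key `is_toxic` and the two sample rows are hypothetical.

```python
import csv
import io

TOXICITY_COLUMNS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def to_binary_rows(csv_text):
    """Collapse the six toxicity labels into a single binary label."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "comment_text": row["comment_text"],
            "is_toxic": int(any(row[column] == "1" for column in TOXICITY_COLUMNS)),
        }
        for row in reader
    ]


# Two made-up rows in the layout of the Kaggle train.csv file.
sample = (
    "id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate\n"
    "1,thanks for the help,0,0,0,0,0,0\n"
    "2,some insulting text,0,0,0,0,1,0\n"
)
binary_rows = to_binary_rows(sample)
print(binary_rows)
```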

train

Using a simple model architecture (Averaging Word Embedding), we can achieve about ninety-five percent accuracy and a small model size, appropriate for the web browser and WebAssembly.
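The core idea of an averaging word embedding classifier fits in a few lines: look up a vector for each token, average the vectors and feed the result to a small classifier. The sketch below uses made-up numbers and a single logistic layer; the model produced by Model Maker is trained and has its own layers on top of the average, so the vocabulary, embeddings, weights and function names here are purely illustrative assumptions.

```python
import math

# Toy vocabulary and 2-dimensional embeddings (hypothetical values).
VOCAB = {"<unk>": 0, "you": 1, "are": 2, "great": 3, "awful": 4}
EMBEDDINGS = [[0.0, 0.0], [0.1, 0.1], [0.0, 0.1], [-1.0, 0.5], [1.5, -0.5]]
WEIGHTS = [2.0, -1.0]
BIAS = -0.5


def average_embedding(text, vocab, embeddings):
    """Average the embedding vectors of all tokens; unknown words map to <unk>."""
    ids = [vocab.get(token, vocab["<unk>"]) for token in text.lower().split()]
    dim = len(embeddings[0])
    if not ids:
        return [0.0] * dim
    return [sum(embeddings[i][d] for i in ids) / len(ids) for d in range(dim)]


def toxicity_score(text, vocab=VOCAB, embeddings=EMBEDDINGS, weights=WEIGHTS, bias=BIAS):
    """Logistic layer on top of the averaged embedding, returns a value in (0, 1)."""
    averaged = average_embedding(text, vocab, embeddings)
    logit = sum(w * x for w, x in zip(weights, averaged)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


print(toxicity_score("you are great"), toxicity_score("you are awful"))
```

Because the vectors are averaged, word order does not influence the score, which is one reason why such a model stays small and fast.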

tensorflow lite

Now let’s prepare the non-toxic forum web application, where we can write comments. When we write non-toxic comments, the model won’t block them.

On the other hand, toxic comments will be blocked and the user warned.

Of course, this is only client-side validation, which can discourage users from writing toxic comments.

web application

To run the example simply clone git repository and run simple server to serve the static page:

git clone https://github.com/qooba/ai-toxicless-texts.git
cd ai-toxicless-texts
python3 -m http.server

The code for preparing the data, training and exporting the model is here: https://github.com/qooba/ai-toxicless-texts/blob/master/Model_Maker_Toxicity.ipynb

How to extract music sources: bass, drums, vocals and other? – music separation with AI

In this article I will show how we can extract music sources: bass, drums, vocals and other accompaniments using neural networks.

Before you continue reading, please watch the short introduction:

Separation of individual instruments from arranged music is another area where machine learning algorithms can help. Demucs solves this problem using neural networks.

The trained model (https://arxiv.org/pdf/1909.01174v1.pdf) uses a U-Net architecture which contains two parts: an encoder and a decoder. We feed the original track to the encoder input and, after processing, get bass, drums, vocals and other accompaniments at the decoder output.

The encoder is connected to the decoder through an additional LSTM layer, as well as through residual connections between subsequent layers.

neural network architecture

OK, we have the neural network architecture, but what about the training data? This is another difficulty, which can be handled by the unlabeled data remixing pipeline.

We start with another classifier, which finds the parts of the music that do not contain a specific instrument, for example drums. Then we mix them with a known drums signal and separate the tracks using the model.

Now we can compare the separation results with the known drums track and the mixture of the other instruments.

Based on this, we can calculate the loss (L1 loss) and use it during the training.

Additionally, we set different loss weights for the known track and for the other one.
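The weighted loss described above can be sketched with plain Python lists standing in for audio signals. The weight values and the function names `l1_loss` and `remix_loss` are hypothetical and chosen only for illustration; the actual values and implementation are in the paper and the Demucs code.

```python
def l1_loss(estimate, target):
    """Mean absolute difference between two signals of equal length."""
    return sum(abs(e - t) for e, t in zip(estimate, target)) / len(target)


def remix_loss(est_known, known, est_rest, rest, known_weight=1.0, rest_weight=0.5):
    """L1 loss on the known track plus a differently weighted L1 loss on the rest."""
    return known_weight * l1_loss(est_known, known) + rest_weight * l1_loss(est_rest, rest)


# Tiny made-up signals: the known drums track and the mixture of other instruments.
known_drums = [0.0, 1.0, 0.0, -1.0]
estimated_drums = [0.0, 0.5, 0.0, -0.5]
other_mix = [1.0, 1.0, 1.0, 1.0]
estimated_other = [0.5, 1.0, 1.5, 1.0]

loss = remix_loss(estimated_drums, known_drums, estimated_other, other_mix)
print(loss)
```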

training data

The whole UI is packaged in a docker image, so you can simply try it:

#for CPU
docker run --name aiaudioseparation -it -p 8000:8000 -v $(pwd)/checkpoints:/root/.cache/torch/hub/checkpoints --rm qooba/aimusicseparation

#for GPU
docker run --name aiaudioseparation --gpus all -it -p 8000:8000 -v $(pwd)/checkpoints:/root/.cache/torch/hub/checkpoints --rm qooba/aimusicseparation

web UI

Bored with classical computers? – Quantum AI with OpenFermion

In this article I will show how we can prepare and perform calculations on quantum computers using OpenFermion, Cirq and PySCF.

Before you continue reading, please watch the short introduction:

Currently, there are many supercomputing centers, where we can run complicated simulations. However, there are still problems that are beyond the capabilities of classical computers, which can be addressed by quantum computers.

Quantum chemistry and materials science problems that are described by the laws of quantum mechanics can be mapped to quantum computers and projected onto qubits.

OpenFermion is a library which helps to perform such calculations on a quantum computer.

Additionally, we will use the PySCF package, which will help to perform the initial structure optimization (if you are interested in the PySCF package, I have shared an example DFT-based band structure calculation of a single-layer graphene structure: pyscf_graphene.ipynb).

In our example we will investigate the [latex]H_2[/latex] molecule for simplicity. We will use the PySCF package to find the optimal bond length of the molecule.
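Finding the optimal bond length amounts to scanning candidate distances and keeping the one with the lowest energy. The sketch below shows only that scan. As a loudly stated assumption, it replaces the PySCF energy calculation with a toy Morse potential whose parameters are only roughly H2-like, so the numbers are illustrative and not a quantum chemistry result; the function names are hypothetical.

```python
import math


def morse_energy(r, d_e=4.75, a=1.94, r_e=0.741):
    """Toy Morse potential (eV, angstroms); a stand-in for a real PySCF energy."""
    return d_e * (1.0 - math.exp(-a * (r - r_e))) ** 2 - d_e


def scan_bond_length(energy_fn, start, stop, steps):
    """Evaluate the energy on an evenly spaced grid and return the best distance."""
    grid = [start + i * (stop - start) / (steps - 1) for i in range(steps)]
    return min(grid, key=energy_fn)


best_bond_length = scan_bond_length(morse_energy, 0.3, 2.5, 221)
print(best_bond_length)
```

In the real workflow `energy_fn` would run a PySCF calculation for each bond length, which is far more expensive than this toy function.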

Thanks to the OpenFermion-PySCF plugin we can smoothly use the initial molecule state obtained from a PySCF run in the OpenFermion library (openfermionpyscf_h2.ipynb).

from openfermion.chem import MolecularData
from openfermionpyscf import run_pyscf

geometry = create_molecule(bond_length)
basis = 'sto-3g'
multiplicity = 1

run_scf = 1
run_mp2 = 1
run_cisd = 0
run_ccsd = 0
run_fci = 1

molecule = MolecularData(geometry, basis, multiplicity)
 
# Run pyscf.
molecule = run_pyscf(molecule,
                     run_scf=run_scf,
                     run_mp2=run_mp2,
                     run_cisd=run_cisd,
                     run_ccsd=run_ccsd,
                     run_fci=run_fci)
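The snippet above calls create_molecule, which is not shown here. A plausible minimal version, assuming it only has to return the geometry in the list-of-tuples form that MolecularData accepts (atom symbol plus x, y, z coordinates in angstroms), might look like this. The real helper in the notebook may differ, and the bond length used in the example is an arbitrary value.

```python
def create_molecule(bond_length):
    """Geometry of H2: two hydrogen atoms on the z axis, bond_length angstroms apart."""
    return [("H", (0.0, 0.0, 0.0)), ("H", (0.0, 0.0, bond_length))]


geometry = create_molecule(0.74)
print(geometry)
```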

Now it is time to compile the molecule to the representation readable by the quantum computer using OpenFermion and Cirq library. Currently you can use several methods to achieve this:

Using one of these methods we get an optimized quantum circuit. In our case the quantum circuit for the [latex]H_2[/latex] system will be represented by 4 qubits and the operations that act on them (a moment is a collection of operations that act in the same abstract time slice).
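The number 4 follows from simple counting, assuming an encoding that uses one qubit per spin orbital (for example the Jordan-Wigner transformation; the text does not state which encoding is used). In the sto-3g basis each hydrogen atom contributes one spatial orbital, and every spatial orbital holds two spin orbitals. The helper below is a hypothetical illustration of that arithmetic.

```python
# Number of sto-3g basis functions per atom; only hydrogen is needed here.
STO3G_BASIS_FUNCTIONS = {"H": 1}


def qubits_one_per_spin_orbital(atoms):
    """Count qubits when each spin orbital is mapped to one qubit."""
    spatial_orbitals = sum(STO3G_BASIS_FUNCTIONS[atom] for atom in atoms)
    return 2 * spatial_orbitals  # spin up and spin down


n_qubits = qubits_one_per_spin_orbital(["H", "H"])
print(n_qubits)
```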

Finally we can use the quantum circuit to run the calculations on the Cirq simulator or on a real quantum computer.