Implementation of a Quantum Random Forest
This article walks you through an exploration of a Quantum Random Forest (QRF), drawing on the original paper, the authors' GitHub repository, and some adjustments of my own.
The paper "A kernel-based quantum random forest for improved classification" by Srikumar et al., presents a Quantum Random Forest (QRF) model. This model extends the linear quantum support vector machine (QSVM) by including a kernel function via quantum kernel estimation (QKE), forming a decision tree classifier. The QRF aims to address the limitations of previous quantum models. Key aspects include developing a decision tree structure with QSVM nodes, incorporating a low-rank Nyström approximation to mitigate overfitting, and theoretical guarantees to limit finite sampling errors. The QRF shows improved performance over QSVMs, especially in multi-class classification problems, and requires fewer kernel estimations.
The conclusion emphasizes that the QRF, unlike QNNs and QSVMs, is non-linear, and that it works better on datasets where the quantum embedding does not perfectly separate instances. Its probabilistic output is also beneficial for multi-class problems. The paper suggests potential enhancements and acknowledges the need for further exploration of hyperparameters and quantum split function optimization.
Fortunately, the authors shared the code on GitHub for our tests and explorations. The repository can be found here.
In this article, we will follow the repository code step by step, testing at the end with the UCI Credit Card dataset. It is crucial to download all the *.py modules available in the repository so you don't run into issues when testing the "example" Jupyter Notebook. The simplest way to get the whole repository is to click "<> Code" and then "Download ZIP". Also remember that the libraries must be installed with the same versions the authors used:
- cirq==0.11.0
- cirq-core==0.11.0
- matplotlib==3.4.2
- more-itertools==8.8.0
- numpy==1.19.5
- pandas==1.3.0
- qiskit==0.27.0
- scikit-learn==0.24.2
- scipy==1.7.0
- tqdm==4.61.1
- tensorflow==2.4.1
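Assuming a pip-based environment (a fresh virtual environment is advisable, given how old these pins are), one way to install everything at the pinned versions is a single command:
pip install cirq==0.11.0 cirq-core==0.11.0 matplotlib==3.4.2 more-itertools==8.8.0 numpy==1.19.5 pandas==1.3.0 qiskit==0.27.0 scikit-learn==0.24.2 scipy==1.7.0 tqdm==4.61.1 tensorflow==2.4.1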
Set up your environment
from quantum_random_forest import QuantumRandomForest, set_multiprocessing
from split_function import SplitCriterion
from data_construction import data_preprocessing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics, datasets
from sklearn.model_selection import train_test_split
Load and adapt your dataset
Remember to adapt your datasets during the preprocessing phase. You can follow the instructions shared in this article:
Dimensionality Reduction for Quantum Machine Learning: Integrating LDA and K-Means
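As a rough illustration (my own sketch, not the code from that article): for a binary problem, LDA yields at most n_classes - 1 = 1 component, so one way to reach the two features needed for n_qubits = 2 is to pair the LDA component with a K-Means-derived feature. The function name here is hypothetical:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import MinMaxScaler

def reduce_to_two_features(X_raw, y_raw):
    # One supervised feature from LDA (binary labels allow a single component)
    lda_feat = LinearDiscriminantAnalysis(n_components=1).fit_transform(X_raw, y_raw)
    # One unsupervised feature: distance to the nearest K-Means centroid
    kmeans = KMeans(n_clusters=2, random_state=42).fit(X_raw)
    km_feat = kmeans.transform(X_raw).min(axis=1, keepdims=True)
    X_2d = np.hstack([lda_feat, km_feat])
    # Rescale so the features sit in a range suitable for the quantum embedding
    return MinMaxScaler().fit_transform(X_2d)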
It is important to generate a training_set and testing_set for the later stages of the code. The authors' example already includes a data_preprocessing function, but I suggest editing it and creating your own transformations so you can experiment with variations until you eventually get better results. If you made your own adjustments in the preprocessing phase, consider the following change:
Remove this:
training_set, testing_set = data_preprocessing(X, y,
train_prop=0.75,
X_dim=2)
And add this:
training_set = pd.DataFrame(zip(X_train, y_train), columns=['X', 'y'])
testing_set = pd.DataFrame(zip(X_test, y_test), columns=['X', 'y'])
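Here X_train, X_test, y_train, and y_test are assumed to come from a standard split, for example with the train_test_split imported earlier (test_size = 0.3 matches the exploration at the end of this article):
# Assumed upstream split; X and y are your preprocessed features and labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)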
Setting Model Parameters
n_qubits = 2
dt_type = 'qke'
ensemble_var = None
branch_var = ['eff_anz_pqc_arch',
'iqp_anz_pqc_arch',
'eff_anz_pqc_arch']
num_trees = 3
split_num = 2
pqc_sample_num = 2024
num_classes = 2
max_depth = 4
num_params_split = n_qubits * (n_qubits + 1)
num_rand_gen = 1
num_rand_meas_q = n_qubits
svm_num_train = 5
svm_c = 20
min_samples_split = svm_num_train
embedding_type = ['as_params_all',
'as_params_iqp',
'as_params_all']
criterion = SplitCriterion.init_info_gain('clas')
device = 'cirq'
Purpose: To establish the necessary parameters for building and training the Quantum Random Forest model.
Parameter Details:
- n_qubits: The number of qubits used for quantum embedding.
- ensemble_var, dt_type, branch_var: Settings for the ensemble and decision tree types, including the ansatz types used at different levels of the tree.
- num_trees: The number of trees in the ensemble.
- split_num, pqc_sample_num, svm_num_train, svm_c: Parameters related to quantum circuit samples, the SVM landmark number, and SVM optimization.
- max_depth, num_params_split, embedding_type: Settings for the maximum tree depth, the number of parameters in the embedding (with n_qubits = 2, this gives 2 × (2 + 1) = 6), and the type of quantum embedding.
- criterion, device: The criterion for splitting nodes and the quantum computing device or simulator to use.
Model setup
qrf = QuantumRandomForest(n_qubits, 'clas', num_trees, criterion,
                          max_depth=max_depth,
                          min_samples_split=min_samples_split,
                          tree_split_num=split_num,
                          num_rand_meas_q=num_rand_meas_q,
                          ensemble_var=ensemble_var,
                          dt_type=dt_type,
                          num_classes=num_classes,
                          ensemble_vote_type='ave',
                          num_params_split=num_params_split,
                          num_rand_gen=num_rand_gen,
                          pqc_sample_num=pqc_sample_num,
                          embed=embedding_type,
                          branch_var=branch_var,
                          svm_num_train=svm_num_train,
                          svm_c=svm_c,
                          nystrom_approx=True,
                          device=device)
Purpose: To initialize the Quantum Random Forest model with the specified parameters.
How It Works:
An instance of QuantumRandomForest is created with the defined parameters, including the number of qubits, the tree structure, the embedding types, the SVM settings, and the device for quantum computation.
Training the Model
cores = 6
set_multiprocessing(True, cores)
qrf.train(training_set, partition_sample_size=100)
Purpose: To train the Quantum Random Forest model on the training dataset.
How It Works:
Enables multiprocessing with the specified number of cores to parallelize the training process.
The train method of the QRF model is called with the training data. partition_sample_size indicates the size of the data subset each tree in the ensemble receives.
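If multiprocessing causes trouble on your platform (simulators sometimes misbehave with forked processes), my assumption, based only on the signature shown above and not verified against the repository, is that the first argument toggles it:
set_multiprocessing(False, 1)  # assumption: False disables parallel training (unverified)
qrf.train(training_set, partition_sample_size=100)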
Testing the Model
acc, preds_qrf = qrf.test(testing_set, ret_pred=True, parallel=False, calc_tree_corr=True)
Purpose: To test the QRF model on the testing dataset and evaluate its performance.
How It Works:
The test method evaluates the QRF model on the testing set. It returns the accuracy and the predictions, and it also calculates the correlation between trees if calc_tree_corr is True.
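As a quick sanity check (my addition, using plain scikit-learn), the returned accuracy can be recomputed from the predictions:
acc_manual = metrics.accuracy_score(testing_set.y, preds_qrf)  # should match the acc returned by qrf.test
print(acc_manual)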
Analyzing the Model
print(metrics.classification_report(testing_set.y, preds_qrf))
print(metrics.roc_auc_score(testing_set.y, preds_qrf))
Purpose: To provide a detailed classification report and calculate the AUC (Area Under the Curve) score.
How It Works:
Uses Scikit-Learn's metrics to print the classification report and compute the AUC score, offering insights into the model's performance.
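For a per-class view of the errors (again my addition, standard scikit-learn), the confusion matrix is worth printing alongside the report:
print(metrics.confusion_matrix(testing_set.y, preds_qrf))  # rows: true classes, columns: predictions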
Exploration with the UCI Credit Card dataset
After performing dimensionality reduction with LDA and K-Means, a test was executed with the following adjusted parameters:
sample = 500
test_size = 0.3
n_qubits = 2
svm_c = 50
svm_num_train = 5
partition_sample_size = 100
And the results were the following:
Classification report for QRF:
              precision    recall  f1-score   support

           0       0.80      0.90      0.85       114
           1       0.50      0.31      0.38        36

    accuracy                           0.76       150
   macro avg       0.65      0.60      0.62       150
weighted avg       0.73      0.76      0.74       150
AUC for QRF:
0.60453
The result is clearly not satisfactory, but the outcome depends heavily on the parameters you choose at each stage of the code and on the dataset you decide to experiment with.