Examples

Here are a few examples built on top of the hdlib library.

What is the Dollar of Mexico?

Let's try to reproduce the What is the Dollar of Mexico? example from the following paper:

Kanerva P. What we mean when we say "What's the dollar of Mexico?": Prototypes and mapping in concept space. In: 2010 AAAI Fall Symposium Series; 2010 Nov 1.

Just by looking at the question, we can extract different kinds of information. We want to determine the currency of a specific country. Starting from the term Dollar, we know it is the currency of the USA, while the Peso, the currency of Mexico, is the answer to the question.

In order to build a vector-symbolic architecture, all this information must be encoded as Vectors in a high-dimensional space. We are going to use random bipolar vectors to represent entities:

from hdlib.space import Vector, Space

space = Space(size=10000, vtype="bipolar")

# Define features and country information
entities = [
   "NAM", "MON", "CAP", # Features
   "USA", "DOL", "WDC", # USA
   "MEX", "PES", "MXC"  # Mexico
]

# Build a random bipolar vector for each feature and country information
# Finally, add vectors to the space
space.bulk_insert(entities)

Note that we also defined the Capital City information for both the USA and Mexico entities just to make the information retrieval process (see the unbind operation at the end of this example) a bit trickier.
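Each entity is now associated with a random bipolar vector of 10,000 elements in {-1, +1}. As a quick sanity check, we can retrieve one of them with space.get; note that the .vector attribute exposing the underlying numpy array is an assumption about hdlib internals:

# Retrieve the NAM vector from the space
nam = space.get(names=["NAM"])[0]

# Assumption: Vector objects expose their numpy array through .vector
print(len(nam.vector))           # 10000
print(set(nam.vector.tolist()))  # {-1.0, 1.0} (exact dtype may vary)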

We can now bind different kinds of information together: the vectors called NAM (the name of a country) and USA (the United States of America), MON (the name of a currency) and DOL (the US Dollar), and CAP (the name of the Capital City) and WDC (Washington DC). We can finally collapse all these bound pairs by bundling them together, producing a single vector representation of the USA.

And we can then do the same with information about Mexico.

from hdlib.arithmetic import bind, bundle

# Collapse USA information into a single vector
# USTATES = [(NAM * USA) + (CAP * WDC) + (MON * DOL)]
ustates_nam = bind(space.get(names=["NAM"])[0], space.get(names=["USA"])[0]) # Bind NAM with USA
ustates_cap = bind(space.get(names=["CAP"])[0], space.get(names=["WDC"])[0]) # Bind CAP with WDC
ustates_mon = bind(space.get(names=["MON"])[0], space.get(names=["DOL"])[0]) # Bind MON with DOL
ustates = bundle(bundle(ustates_nam, ustates_cap), ustates_mon) # Bundle ustates_nam, ustates_cap, and ustates_mon

# Repeat the last step to encode MEX information in a single vector
# MEXICO = [(NAM * MEX) + (CAP * MXC) + (MON * PES)]
mexico_nam = bind(space.get(names=["NAM"])[0], space.get(names=["MEX"])[0]) # Bind NAM with MEX
mexico_cap = bind(space.get(names=["CAP"])[0], space.get(names=["MXC"])[0]) # Bind CAP with MXC
mexico_mon = bind(space.get(names=["MON"])[0], space.get(names=["PES"])[0]) # Bind MON with PES
mexico = bundle(bundle(mexico_nam, mexico_cap), mexico_mon) # Bundle mexico_nam, mexico_cap, and mexico_mon

We can finally bind ustates and mexico together to build a single vector representation of our relatively complex set of information:

# F_UM = USTATES * MEXICO
#      = [(USA * MEX) + (WDC * MXC) + (DOL * PES) + noise]
f_um = bind(ustates, mexico)

In order to answer the initial question What is the Dollar of Mexico?, we can now unbind the vector representation of the entity DOL from f_um in order to retrieve a vector that is mostly similar to the vector representation of PES:

# DOL * F_UM = DOL * [(USA * MEX) + (WDC * MXC) + (DOL * PES) + noise]
#            = [(DOL * USA * MEX) + (DOL * WDC * MXC) + (DOL * DOL * PES) + (DOL * noise)]
#            = [noise1 + noise2 + PES + noise3]
#            = [PES + noise4]
#            ≈ PES
guess_pes = bind(space.get(names=["DOL"])[0], f_um)
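The reason why DOL is able to cancel itself out is that binding bipolar vectors is an element-wise multiplication, which is its own inverse: every element is either -1 or +1, so DOL * DOL yields the all-ones vector. Here is a minimal sketch of this property (again assuming the .vector attribute holds the underlying numpy array):

import numpy as np

dol = space.get(names=["DOL"])[0]
pes = space.get(names=["PES"])[0]

# DOL * (DOL * PES) recovers PES exactly, element by element
recovered = bind(dol, bind(dol, pes))
assert np.array_equal(recovered.vector, pes.vector)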

Let's search for the closest vector in the space:

closest = space.find(guess_pes)

assert closest[0] == "PES"
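Since binding is symmetric, the same composite vector can also answer the reverse question What is the Peso of the United States?, this time by unbinding PES from f_um:

# PES * F_UM ≈ DOL
guess_dol = bind(space.get(names=["PES"])[0], f_um)

closest = space.find(guess_dol)

assert closest[0] == "DOL"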

Note: The same example has been implemented as a unit test in test/test.py. You can run all the unit tests with python -m unittest test.test from the root folder of hdlib.

Supervised Machine Learning Model

The hdlib library also integrates the supervised machine learning model described in the following paper and implemented in the chopin2 software:

Cumbo F, Cappelli E, Weitschek E. A brain-inspired hyperdimensional computing approach for classifying massive dna methylation data of cancer. Algorithms. 2020 Sep 17;13(9):233

Starting from an input numerical matrix, with a vector dimensionality of 10,000 and 10 bipolar level vectors, we can easily initialise a supervised classification model built following the hyperdimensional computing paradigm as follows:

from hdlib.model import Model

model = Model(size=10000, levels=10, vtype="bipolar")

Note: Have a look at the aforementioned scientific paper for an in-depth explanation about what level vectors are and how the classification model is actually built.
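In short, level vectors discretize the range of the input values into a fixed number of intervals, each mapped to a high-dimensional vector, and consecutive level vectors are built to be strongly correlated so that close numerical values remain close in the hyperdimensional space. The following is an illustrative sketch of this idea in plain numpy, not hdlib's actual implementation:

import numpy as np

size, levels = 10000, 10
rng = np.random.default_rng(42)

# Start from a random bipolar vector and progressively flip a fixed,
# non-overlapping fraction of its positions to obtain a chain of
# correlated level vectors
level_vectors = [rng.choice([-1, 1], size=size)]
flips_per_level = size // (2 * levels)
positions = rng.permutation(size)

for level in range(1, levels):
    vector = level_vectors[-1].copy()
    start = (level - 1) * flips_per_level
    vector[positions[start:start + flips_per_level]] *= -1
    level_vectors.append(vector)

# Adjacent level vectors are highly similar, while the first and
# last ones are nearly orthogonal
print(np.dot(level_vectors[0], level_vectors[1]) / size)   # 0.9
print(np.dot(level_vectors[0], level_vectors[-1]) / size)  # 0.1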

We are going to use the iris dataset from the scikit-learn package, which can be installed by typing pip install scikit-learn in your terminal:

from sklearn import datasets

iris = datasets.load_iris()

points = iris.data.tolist()
classes = iris.target.tolist()
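As a quick check, the iris dataset contains 150 samples, each described by 4 features and assigned to one of 3 classes:

print(len(points), len(points[0]))  # 150 4
print(sorted(set(classes)))         # [0, 1, 2]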

Time to fit the model:

model.fit(points, classes)

We can then cross-validate the classification model by running the following code:

# 5-folds cross-validation
# 10 retraining iterations
predictions = model.cross_val_predict(points, classes, cv=5, retrain=10)

Note: The retraining process is a quite important step that can significantly affect the performance of your classification model in terms of accuracy. Have a look at the Cumbo F. et al. paper mentioned above for additional information about the retraining process.

We can then compute the accuracy of the predictions on each fold and report the average accuracy. We are going to use the accuracy_score function provided by the sklearn.metrics module:

from sklearn.metrics import accuracy_score

# Collect the accuracy scores computed on each fold
scores = list()

for y_indices, y_pred, _, _ in predictions:
    y_true = [label for position, label in enumerate(classes) if position in y_indices]
    scores.append(accuracy_score(y_true, y_pred))

print("Accuracy: {:.2f}".format(sum(scores) / len(scores)))

For any additional technical details about the classification model, please refer to the chopin2 GitHub repository.

Note: A version of chopin2 that makes use of hdlib for building the supervised classification model is provided under the examples folder of the hdlib repository.

Stepwise Feature Selection

The Model class also provides a stepwise_regression method for performing a backward variable elimination or a forward variable selection on the input dataset.

The following example shows how to run the feature selection on the same iris dataset used in the previous section:

# Get the set of features
features = iris.feature_names

# Run the feature selection in backward mode
importance, scores, top_importance, count_models = model.stepwise_regression(
    points,
    features,
    classes,
    method="backward",
    cv=5,
    retrain=5,
    n_jobs=2,
    metric="accuracy",
    threshold=0.6,
    uncertainty=1.0
)

The example above shows how to call the stepwise_regression method in order to run the feature selection with a backward variable elimination technique (method="backward").

Note that every feature selection iteration works by building a Model with a specific subset of features in order to evaluate in cross-validation (cv=5) whether the removal of a specific feature affects the accuracy of the model (metric="accuracy"), thus assigning an importance score to every feature in the dataset.

The method stops running if the accuracy drops below a specific threshold (threshold=0.6). It can also consider groups of features equally important based on the best accuracy reached in a specific iteration: suboptimal models are considered if their accuracy is greater than the best accuracy minus its uncertainty percentage (uncertainty=1.0, i.e. 1% in this case). For instance, if the best accuracy in an iteration is 0.90, every model with an accuracy above 0.90 - 0.009 = 0.891 is treated as equally good.

As a result, it produces:

  • an importance dictionary mapping <feature: importance>, where the importance is an integer (the lower the better in case of method="backward"; the higher the better in case of method="forward");
  • a scores dictionary mapping <importance: score>, with the average accuracy reached at a specific iteration of the feature selection method (the importance number actually corresponds to a specific iteration);
  • the top feature importance;
  • and the total number of models generated during the backward elimination or forward selection process.
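For instance, assuming the structures described above, the returned objects can be used to rank the features and inspect the accuracy reached at each iteration:

# Rank features by their importance (the lower the better with method="backward")
for feature, rank in sorted(importance.items(), key=lambda pair: pair[1]):
    print(feature, rank)

# Average accuracy reached at each feature selection iteration
for iteration in sorted(scores):
    print(iteration, scores[iteration])

print("Top importance:", top_importance)
print("Total models built:", count_models)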