
Pre-processing your input data

As with all families of machine learning algorithms, performance is a function of the quality of the input data. Clust4j includes the following Transformer classes:

  • BoxCoxTransformer
  • MeanCenterer
  • MedianCenterer
  • MinMaxScaler
  • PCA
  • RobustScaler
  • StandardScaler
  • WeightTransformer
  • YeoJohnsonTransformer

As with many other clust4j classes, the interface will be familiar to sklearn users. All Transformer classes follow the same fit/transform pattern, shown here in pseudo-code:

RealMatrix X1 = some_data;
RealMatrix X2 = some_other_data;

// initialize and fit
Transformer t = new StandardScaler().fit(X1);

// transform train and test
RealMatrix train = t.transform(X1);
RealMatrix test  = t.transform(X2);

// you can also inverse transform
RealMatrix inverse_train = t.inverseTransform(train); // should equal X1
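Because the transformers operate on Apache Commons Math RealMatrix objects, you can sanity-check the round trip directly. This is a minimal sketch; the tolerance is an arbitrary choice for illustration, not a clust4j constant:

// verify that the inverse transform approximately recovers the original data
double roundTripError = X1.subtract(inverse_train).getFrobeniusNorm();
if (roundTripError > 1e-8) // arbitrary tolerance for this sketch
    throw new IllegalStateException("inverse transform did not recover X1");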

Pre-processing in practice (in conjunction with the Pipeline object)

Clust4j includes a toy DataSet of two intertwined crescents for benchmarking various algorithms (see ExampleDataSets.loadToyMoons()).

Here's the setup for an example you can run:

// load the dataset
DataSet moons = ExampleDataSets.loadToyMoons();
final int[] actual_labels = moons.getLabels();

// Add a Z offset: X3 = 0.5 * label
moons.addColumn("X3", VecUtils.scalarMultiply(VecUtils.asDouble(moons.getLabels()), 0.5));

So, the data head now looks like:

       X1         X2    X3  labels
 1.582023  -0.445815   0.5       0
 0.066045   0.439207   0.5       1
 0.736631  -0.398963   0.5       0
-1.056928   0.242456   0.0       1

Plotted in three dimensions, the new X3 column offsets the points along the z-axis.

Most algorithms cannot segment the two classes without any pre-processing:

RealMatrix data = moons.getData();
KMeansParameters params = new KMeansParameters(2);

KMeans model = params.fitNewModel(data);
int[] predicted_labels = model.getLabels();
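
To put a number on how poorly the un-preprocessed model does, you can compare the predicted labels to the actual ones. The helper below is not part of clust4j; it is a hypothetical sketch that measures agreement for a two-cluster problem while accounting for the fact that cluster IDs may come back permuted:

// Hypothetical helper (not part of clust4j): fraction of points whose
// predicted cluster matches the true label, allowing for swapped cluster IDs.
static double twoClusterAgreement(int[] actual, int[] predicted) {
    int matches = 0;
    for (int i = 0; i < actual.length; i++) {
        if (actual[i] == predicted[i])
            matches++;
    }
    double frac = (double) matches / actual.length;
    return Math.max(frac, 1.0 - frac); // labels 0/1 may be flipped
}

double agreement = twoClusterAgreement(actual_labels, predicted_labels);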

However, using a WeightTransformer, we can emphasize the importance of the X3 feature over the others:

// With just a bit of preprocessing...
UnsupervisedPipeline<KMeans> pipe = new UnsupervisedPipeline<KMeans>(
    params,
    new WeightTransformer(new double[]{0.5, 0.0, 2.0})
);
		
predicted_labels = pipe.fit(data).getLabels();
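
Conceptually, the WeightTransformer rescales each column by its weight before the clusterer ever sees the data. Assuming simple column-wise scaling is all that is applied, a hand-rolled equivalent using plain Commons Math would look something like this:

// Scale each column of the data by the same weights used in the pipeline above.
// (Sketch only; it assumes WeightTransformer amounts to column-wise scaling.)
final double[] weights = new double[]{0.5, 0.0, 2.0};
RealMatrix weighted = data.copy();
for (int j = 0; j < weights.length; j++)
    for (int i = 0; i < weighted.getRowDimension(); i++)
        weighted.setEntry(i, j, weighted.getEntry(i, j) * weights[j]);

// Fitting KMeans on the manually weighted matrix should behave similarly
// to the pipeline above.
predicted_labels = params.fitNewModel(weighted).getLabels();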

Final thoughts

Though this is a trivial example, it underscores the importance of exploring your data before modeling and applying transformations or pre-processing techniques where appropriate to get the most out of your clustering.
