Classification of Marbles

In [1]:
%matplotlib inline
In [2]:
import os
import zipfile
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from IPython.display import display, clear_output, Image, HTML
In [3]:
HTML("""
<h2>Machine Learning in Action</h2>
<video width="640" height="360" controls>
  <source src="http://files.point-8.de/mp4/point8_i40_demonstrator.mp4" type="video/mp4">
</video>
""")
Out[3]:

[Embedded video player: "Machine Learning in Action"]

In this example we are going to use machine learning to automatically sort marbles by color. Our only raw data are RGB values from a light sensor. The data was taken from nine different types of marbles, sampled every 20 milliseconds while rotating each marble.

Our task is to set up a machine learning workflow. Let's take a short look at the basic steps of machine learning. We will see that there is no magic behind it. The overall workflow is:

The overall workflow is an iterative process. The scikit-learn package provides the relevant algorithms and other tools.
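
As a minimal sketch of what such a workflow looks like in scikit-learn (the dummy data and the choice of classifier here are stand-ins, not part of this example):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Dummy stand-in data: 100 samples, 3 features (think R, G, B), 2 classes
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = rng.integers(0, 2, size=100)

# 1. Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 2. Train a model (k-nearest neighbours is just one possible choice)
model = KNeighborsClassifier()
model.fit(X_train, y_train)

# 3. Evaluate on held-out data
print(model.score(X_test, y_test))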

Let's dig into the data!

Data Import and Preparation

The data preparation steps can take up most of the time of the full workflow: in real-world data, information is often missing, sanity checks have to be performed, data sets have to be joined from different sources, and much more.

This function will help us import the raw data: each file contains a list of tuples with color values ([(R,G,B), (R,G,B), (R,G,B), ...]) for one marble type.

In [4]:
def parse_lines(lines):
    """Parse a string of marble data of the form '[(R, G, B), (R, G, B), ...]'."""
    # Strip the enclosing '[(' and ')]'
    lines = lines[2:-2]
    # Split into individual 'R, G, B' groups, then into single values
    rows = [d.split(', ') for d in lines.split('), (')]
    # Drop stray ')][(' artifacts and convert the values to integers
    data = [[int(v.replace(')][(', '')) for v in r] for r in rows]
    # Keep only the first three columns: R, G, B
    return pd.DataFrame(data)[[0, 1, 2]]
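
For illustration, the function behaves like this on a small made-up input string (not taken from the real data files):

sample = '[(8, 16, 11), (8, 18, 12), (12, 18, 15)]'
parse_lines(sample)
# -> a DataFrame with three integer columns (0, 1, 2), one row per tuple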

We import the raw data and create a list dfs containing a pandas.DataFrame for each file.

In [5]:
data = []

files = [
    'blue-white-glass.data',
    'cyan-glass.data',
    'glass-blue.data',
    'glass-green.data',
    'glass-red.data',
    'glass-yellow.data',
    'planet-black-blue.data',
    'planet-green.data',
    'planet-ocean.data',
]

dfs = []

for i, fname in enumerate(files):
    print(f'Load data {i}: {fname}')

    with zipfile.ZipFile(f'../.assets/data/marbles/{fname}.zip', 'r') as zipf:
        with zipf.open(f'{fname}', 'r') as infile:
            content = infile.readlines()[0].decode()
            dfs.append(parse_lines(content))
Load data 0: blue-white-glass.data
Load data 1: cyan-glass.data
Load data 2: glass-blue.data
Load data 3: glass-green.data
Load data 4: glass-red.data
Load data 5: glass-yellow.data
Load data 6: planet-black-blue.data
Load data 7: planet-green.data
Load data 8: planet-ocean.data

So far, the columns only have numerical indices and no names. Here we set the column names.

In [6]:
for df in dfs:
    df.columns = ['R', 'G', 'B']

We define a color code so that we know which marble type we are talking about and can use the colors consistently in our plots.

In [7]:
plt.figure(figsize=(18, 2))
for i in range(9):
    plt.scatter([i], [1], s=5000)
plt.xticks(np.linspace(0, 8, 9))
plt.yticks([])
plt.show()
In [8]:
display(Image('../.assets/data/marbles/Murmeln.png'))

Remark: Real world data

For other data sets, obtaining information such as column names can be the first task in data preparation and take some time. In addition, one of the main tasks is to aggregate data from different sources into one data structure (here: pandas.DataFrame) on which the machine learning model will be applied. In general, ML algorithms need numerical data as input; accordingly, strings need to be encoded (see Feature Engineering). Units need to be checked as well, and time stamps need to be converted to the correct format (see pandas.to_datetime). Another major task is to perform sanity checks on the data, to check for missing values, and to handle outliers if needed.
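
As a hedged illustration of such steps, here is a sketch on a made-up raw table (all column names are invented for this example):

raw = pd.DataFrame({
    'color': ['red', 'blue', 'red'],
    'timestamp': ['2018-01-01 12:00', '2018-01-01 12:01', '2018-01-01 12:02'],
    'value': [1.0, None, 3.0],
})

raw['timestamp'] = pd.to_datetime(raw['timestamp'])      # time stamps to the correct format
raw = pd.get_dummies(raw, columns=['color'])             # encode strings numerically
print(raw.isna().sum())                                  # sanity check: missing values?
raw['value'] = raw['value'].fillna(raw['value'].mean())  # one way to fill gaps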

Data Exploration

It's your turn! Have a look at standard methods of the DataFrame and some statistical information.

For example check:

  • How many samples do we have?
  • What are min, max, mean and so on?
  • Do we have outliers or missing values?
  • Do you already see some significant differences between the types?
In [9]:
# Give it a try!
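
One possible starting point (a sketch, not the only valid set of checks):

for fname, df in zip(files, dfs):
    print(f'{fname}: {len(df)} samples')

print(dfs[0].describe())    # min, max, mean, quartiles, ...
print(dfs[0].isna().sum())  # any missing values?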

First Plots

When the items of interest can be separated well in a feature space, an ML algorithm can learn the patterns and classify with high accuracy. Let's start with some simple plots of one, two or more types of marbles and try to find features which separate the classes well.

In [10]:
# One type: Change the type or features to plot
dataset = 5
X = 'R'
Y = 'G'

plt.scatter(dfs[dataset][X], dfs[dataset][Y], s=10, alpha=0.01, color=f'C{dataset}')
plt.xlabel(X)
plt.ylabel(Y);
In [11]:
# Two types: Change the types or feature to plot
dataset_A = 1
dataset_B = 3
X = 'R'
Y = 'G'

plt.scatter(dfs[dataset_A][X], dfs[dataset_A][Y], s=10, alpha=0.01, color=f'C{dataset_A}')
plt.scatter(dfs[dataset_B][X], dfs[dataset_B][Y], s=10, alpha=0.01, color=f'C{dataset_B}')
plt.xlabel(X)
plt.ylabel(Y);
In [12]:
# Three types: Change the type or features to plot

dataset_A = 1
dataset_B = 3
dataset_C = 5
X = 'R'
Y = 'G'

plt.scatter(dfs[dataset_A][X], dfs[dataset_A][Y], s=10, alpha=0.01, color=f'C{dataset_A}')
plt.scatter(dfs[dataset_B][X], dfs[dataset_B][Y], s=10, alpha=0.01, color=f'C{dataset_B}')
plt.scatter(dfs[dataset_C][X], dfs[dataset_C][Y], s=10, alpha=0.01, color=f'C{dataset_C}')
plt.xlabel(X)
plt.ylabel(Y);
In [13]:
# It's your turn. Create a plot of four types of marbles.
# Can you still obtain a good separation?
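
One possible version (the four indices below are an arbitrary choice):

X = 'R'
Y = 'G'
for dataset in [1, 3, 5, 7]:
    plt.scatter(dfs[dataset][X], dfs[dataset][Y], s=10, alpha=0.01, color=f'C{dataset}')
plt.xlabel(X)
plt.ylabel(Y);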
In [14]:
# Two features but showing all types of marbles.
X = 'R'
Y = 'B'

plt.figure()
for df in dfs:
    plt.scatter(df[X], df[Y], s=10, alpha=0.01)
plt.xlabel(X)
plt.ylabel(Y);

Complete chaos?

It seems to be total chaos when plotting all types of marbles. But we can see that they do differ somewhat. Maybe ML can take several combinations of features into account to come up with a model for what is too hard for us to do "by hand". This is the power of ML!

Feature Selection and Engineering

So far, we have gained a broad overview of our data and detected some possible features for a classification task. For the actual training of an ML model we need to select features (feature selection) as input to classify our target. In our example we use all three features, but we could also select only some of them. With real-world data it often makes sense to select a subset, since computing power can be a limiting factor. Also, more features do not necessarily improve the overall performance of the classifier.

"Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." — Andrew Ng

Besides selection, creating additional features (feature engineering) can be another crucial step. In this case we are fine with the three features we have, but with real-world data we usually need to perform feature engineering to develop the full potential of ML. Some examples (illustrated in the sketch after this list) are

  • encoding of features (e.g. categories to numerical features),
  • applying transformations to features (e.g. log scale),
  • generating new features (e.g. simple statistics),
  • rounding, binning, sampling, ...
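
A few hedged one-liners illustrating these techniques on a made-up DataFrame (the column names are placeholders, not part of our marble data):

df_demo = pd.DataFrame({'category': ['a', 'b', 'a'], 'value': [1.0, 10.0, 100.0]})

encoded = pd.get_dummies(df_demo, columns=['category'])  # categories -> numerical columns
df_demo['log_value'] = np.log10(df_demo['value'])        # transformation to log scale
df_demo['value_sq'] = df_demo['value'] ** 2              # a simple generated feature
df_demo['value_bin'] = pd.cut(df_demo['value'], bins=3)  # binning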

Maybe additional feature engineering will help our ML model. For this step we usually need some domain expertise. In the case of colors, we can transform to another spectrum or parameter set (Hue, Saturation, Value/Brightness; see also HSL and HSV). If we arrange all colors in one 2D plane, we can describe each color with two parameters (X, Y) or, equivalently, an amplitude (I) plus one angle (Phi). This amounts to a kind of dimensionality reduction which keeps all necessary information in two parameters; we only leave out the brightness information.

In [15]:
display(Image('../.assets/data/marbles/Koordinaten.png', width=300, height=300))

For the transformation, we define some helper functions (and skip an explanation of their technical details):

In [16]:
def generate_xy_values(df):
    # Project (R, G, B) onto a 2D color plane
    df['X'] = 0.5 * np.sqrt(3) * df['G'] - 0.5 * np.sqrt(3) * df['B']
    df['Y'] = df['R'] - (1 / 3 * df['G']) - (1 / 3 * df['B'])

def generate_intensity_values(df):
    # Squared distance from the origin of the color plane
    df['I'] = np.square(df['X']) + np.square(df['Y'])

def generate_angles(df):
    # Angle in the color plane
    df['Phi'] = np.arctan2(df['Y'], df['X'])
In [17]:
# We save the original data
from copy import deepcopy
dfs_orig = deepcopy(dfs)

# And apply the transformations
for df in dfs:
    generate_xy_values(df)
    generate_intensity_values(df)
    generate_angles(df)

Let's see what has changed.

In [18]:
dfs_orig[0].head()
Out[18]:
    R   G   B
0   8  16  11
1   8  18  12
2  12  18  15
3  15  20  16
4  16  20  18
In [19]:
dfs[0].head()
Out[19]:
    R   G   B         X         Y          I       Phi
0   8  16  11  4.330127 -1.000000  19.750000 -0.226961
1   8  18  12  5.196152 -2.000000  31.000000 -0.367422
2  12  18  15  2.598076  1.000000   7.750000  0.367422
3  15  20  16  3.464102  3.000000  21.000000  0.713724
4  16  20  18  1.732051  3.333333  14.111111  1.091580
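
As a quick check of the first row: (R, G, B) = (8, 16, 11) gives X = 0.5 · √3 · (16 − 11) ≈ 4.33, Y = 8 − 16/3 − 11/3 = −1, I = X² + Y² = 19.75 and Phi = arctan2(−1, 4.33) ≈ −0.23, matching the table.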

Data Exploration II

Use the additional features to obtain a better separation.

In [20]:
# It's your go! Does the separation improve?

Example: You can get a good separation for these four types of marbles, which was less good in R, G, B space before.

In [21]:
# X, Y
plt.scatter(dfs[1]['X'], dfs[1]['Y'], s=10, alpha=0.01, color='C1')
plt.scatter(dfs[2]['X'], dfs[2]['Y'], s=10, alpha=0.01, color='C2')
plt.scatter(dfs[3]['X'], dfs[3]['Y'], s=10, alpha=0.01, color='C3')
plt.scatter(dfs[7]['X'], dfs[7]['Y'], s=10, alpha=0.01, color='C7');
In [22]:
# Phi, I
plt.scatter(dfs[1]['Phi'], dfs[1]['I'], s=10, alpha=0.01, color='C1')
plt.scatter(dfs[2]['Phi'], dfs[2]['I'], s=10, alpha=0.01, color='C2')
plt.scatter(dfs[3]['Phi'], dfs[3]['I'], s=10, alpha=0.01, color='C3')
plt.scatter(dfs[7]['Phi'], dfs[7]['I'], s=10, alpha=0.01, color='C7');
In [23]:
# R, B
plt.scatter(dfs[1]['R'], dfs[1]['B'], s=10, alpha=0.01, color='C1')
plt.scatter(dfs[2]['R'], dfs[2]['B'], s=10, alpha=0.01, color='C2')
plt.scatter(dfs[3]['R'], dfs[3]['B'], s=10, alpha=0.01, color='C3')
plt.scatter(dfs[7]['R'], dfs[7]['B'], s=10, alpha=0.01, color='C7');

But we still get chaos when plotting all types.

In [24]:
plt.figure()
for df in dfs:
    plt.scatter(df['X'], df['Y'], s=10, alpha=0.01)
In [25]:
plt.figure()
for df in dfs:
    plt.scatter(df['Phi'], df['I'], s=10, alpha=0.01)

This notebook is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Copyright © 2018 Point 8 GmbH