Now, let's tackle a slightly more complex problem: image classification!
First, the basics:
%matplotlib inline
# mahotas used to compute features
import mahotas as mh
import numpy as np
from matplotlib import pyplot as plt
from ipywidgets import interact, fixed  # formerly IPython.html.widgets
plt.rcParams['figure.figsize'] = (10.0, 8.0) # 10 x 8 inches
plt.gray()
This is a subcellular location problem: each image shows either a nuclear or a cytoplasmic pattern, and we want to tell them apart automatically.
The data is organized by directories:
data/
    nuclear/
    cytoplasmic/
The images are stored in JPEG format (which is a terrible idea for scientific use, but this is a demo, and JPEG downloads faster).
We can use the glob function to get the same effect as a wildcard on the command line:
from glob import glob
images = glob('data/nuclear/*jpeg')
images += glob('data/cytoplasmic/*jpeg')
Let's just look at some basic stats on the data:
print("Number of images: {}".format(len(images)))
n_nuclear = sum(('nuclear' in im) for im in images)
print("Number of nuclear images: {}".format(n_nuclear))
print("Number of cytoplasmic images: {}".format(len(images)-n_nuclear))
@interact(n=(0, len(images)-1))
def look_at_images(n):
    im = images[n]
    if 'nuclear' in im:
        label = 'nuclear'
    else:
        label = 'cytoplasmic'
    print("Image {} is {}".format(n, label))
    plt.imshow(mh.stretch(mh.imread(im)))
The widget above lets us browse through the images.
The general idea of classification using features: compute a vector of numeric values (features) for each image, then train a classifier that maps these feature vectors to labels. The sketch below makes this concrete.
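To see the idea in isolation before we apply it to the real images, here is a minimal, self-contained sketch on synthetic data (all numbers and names here are made up for illustration):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(20, 5)         # feature matrix: one row of 5 features per "image"
y = rng.rand(20) > 0.5      # binary labels, e.g. nuclear vs. cytoplasmic

clf = RandomForestClassifier()
clf.fit(X, y)               # learn a mapping from feature vectors to labels
print(clf.predict(X[:3]))   # predict labels for (here, already-seen) vectors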
Mahotas makes it pretty easy to compute features. We are just going to use Haralick features in this example:
im = images[0]
im = mh.imread(im)
print(mh.features.haralick(im, return_mean_ptp=True))
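As a quick sanity check: with return_mean_ptp=True, mahotas returns, for each Haralick texture feature, its mean and its range (peak-to-peak) over the four directions, so, assuming the default 13 features, we expect a 26-element vector:

vec = mh.features.haralick(im, return_mean_ptp=True)
print(len(vec))  # 26 = 13 means + 13 peak-to-peak ranges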
# Use a library function to compute features for all the pictures.
features = []
is_nuclear = []
for im in images:
    is_nuclear.append('nuclear' in im)   # the label comes from the directory name
    im = mh.imread(im)
    features.append(mh.features.haralick(im, return_mean_ptp=True))
We convert to NumPy arrays for convenience:
features = np.array(features)
is_nuclear = np.array(is_nuclear)
We import scikit-learn and will use a random forest classifier:
# Feed the random forest classifier the features we extracted.
from sklearn import ensemble
from sklearn.model_selection import cross_val_predict
clf = ensemble.RandomForestClassifier()
We now use cross-validation to obtain predictions:
predictions = cross_val_predict(clf, features, is_nuclear)
acc = np.mean(predictions == is_nuclear)
print("Accuracy: {:.2%}".format(acc))
This accuracy may look reasonable, but for such an easy problem it reflects a rather poor classifier.
We can look at the images where the classifier makes mistakes:
errors = np.where(predictions != is_nuclear)[0]

@interact(n=(0, len(errors)-1))  # the @interact decorator is interpreted by Jupyter
def spot_error(n):
    err = errors[n]
    im = images[err]
    label = 'nuclear' if is_nuclear[err] else 'cytoplasmic'
    print("Image {} should have been identified as {} (was not)".format(err, label))
    plt.imshow(mh.imread(im))
This is not a great classification method (80% is not awful, but nuclear vs. cytoplasmic is an easy problem; try ER vs. cytoplasmic if you want a hard one). In fact, with good methods, we could hope to obtain close to 100% accuracy on these data.
A large improvement could be obtained by using more features that capture other aspects of the images, by taking the DNA channel into account (which we ignored in this exercise), or by other ideas; a sketch of the first option follows.
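As a sketch of the "more features" idea, we could concatenate several feature families per image. This assumes mahotas' parameter-free threshold adjacency statistics (mh.features.pftas) as the extra descriptor; it is not used anywhere else in this notebook:

features = []
for im in images:
    pixels = mh.imread(im)
    har = mh.features.haralick(pixels, return_mean_ptp=True)  # texture features
    tas = mh.features.pftas(pixels)                           # threshold adjacency statistics
    features.append(np.concatenate([har, tas]))               # one longer vector per image
features = np.array(features)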
The paper "Determining the subcellular location of new proteins from microscope images using local features" by Coelho et al. introduced these data and presented much better methods.
Some closing notes on deep learning:

Convolutional neural networks can be useful when there is a very large amount of training data and the problem is one where humans do well but computers do poorly (recognizing street signs, for example).

There are not many such problems in biology: there is some deep learning in biology, but it is rarely the case that it dramatically beats other methods. There is not yet a biological problem where deep learning has enabled a real breakthrough, that is, a problem where machines struggle a lot while humans perform well.

Note also that when we use deep learning for image analysis, we cannot select the features ourselves: the network uses some features internally, but we do not know a priori what they are.
Another approach that has been used: take an intermediate layer of a pre-trained neural network as a ready-made set of features and combine it with another method. One can also slice out a middle layer and look at which parts of the image its units are firing on; a sketch of the feature-extraction idea is below.
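Here is a minimal sketch of that feature-extraction idea; it assumes TensorFlow/Keras and its pre-trained VGG16 model are available, neither of which is part of this notebook's dependencies:

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array

# Drop the classification head; global average pooling turns the last
# convolutional layer into a 512-dimensional feature vector per image.
model = VGG16(weights='imagenet', include_top=False, pooling='avg')

def cnn_features(path):
    x = img_to_array(load_img(path, target_size=(224, 224)))  # VGG16 expects 224x224 RGB
    x = preprocess_input(x[np.newaxis])                       # add batch dimension, normalize
    return model.predict(x)[0]

cnn_feats = np.array([cnn_features(im) for im in images])
# These vectors can be fed to the same random forest as before.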
Unsupervised learning is also an option, but the results are hard to interpret (in the case of organism classification, for instance).
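For instance, we can cluster the feature vectors computed above without ever looking at the labels (a sketch using scikit-learn's KMeans; deciding what the two clusters mean is exactly the interpretation problem):

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, n_init=10).fit(features)
# Cluster indices are arbitrary: cluster 0 could correspond to either class,
# so check the agreement with the known labels under both assignments.
agree = np.mean(kmeans.labels_ == is_nuclear)
print("Agreement with labels: {:.2%}".format(max(agree, 1 - agree)))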