=> detect a specific TF binding site in sequences
=> understand whether a CNN is able to find a motif
In this exercise we will learn how a CNN can be used to detect a known motif in DNA sequences. We will focus on the FOXO1 transcription factor binding motif from HOCOMOCO v11 [1]. More specifically, we will learn how to build a CNN model that exploits 1D convolutions in order to detect whether a specific window over the genome contains the aforementioned FOXO1 motif.
The FOXO1 motif considered is 12 bp long. We used the HOCOMOCO probabilities of finding a specific base at each position of the motif to generate our dataset. The generated dataset is composed of 20000 sequences of length 100, divided into two classes: sequences containing the FOXO1 motif and sequences made of uniformly random bases.
The sequences are then split randomly into two sets with balanced examples from the two classes and saved in the files FOXO1_train.csv and FOXO1_test.csv in /sib_autumn_school/jupyter_root/data/motifs.
[1]: "HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-Seq analysis" Ivan V. Kulakovskiy et al., Nucl. Acids Res., 11 November 2017
# importing needed modules
from __future__ import print_function
import pandas as pd
import numpy as np
import seaborn as sns
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Conv1D, MaxPooling1D, GlobalMaxPooling1D
# fixing seed to ensure reproducibility of the results
np.random.seed(0)
# configuring plotting
%matplotlib inline
sns.set(
    context='notebook',
    style='white',
    palette='colorblind',
    font_scale=2,
    rc={
        'figure.figsize': (20.0, 10.0)
    }
)
# utilities
# sequences are saved on disk as integer-encoded bases
# for each position of a sequence => expand the integer into a one-hot nucleotide vector
def load_data():
    """
    Function to load the data for the exercise.
    It returns a tuple of tuples containing:
    ( (x_train, y_train), (x_test, y_test) )
    where:
    - x_* are numpy arrays with shape (number of sequences, sequence length, number of nucleotide bases).
      At a given sequence position a one-hot vector is used to represent the base:
      - [1, 0, 0, 0] -> A
      - [0, 1, 0, 0] -> C
      - [0, 0, 1, 0] -> G
      - [0, 0, 0, 1] -> T
    - y_* are numpy arrays with shape (number of sequences,).
      For every sequence a binary label is assigned: 1 (it contains the FOXO1 motif), 0 (uniform bases).
    """
    # file coordinates
    data_path = '/sib_autumn_school/jupyter_root/data'
    train_filepath = '{}/FOXO1_train.csv'.format(data_path)
    test_filepath = '{}/FOXO1_test.csv'.format(data_path)
    print('loading data')
    train = pd.read_csv(train_filepath, index_col=0)
    test = pd.read_csv(test_filepath, index_col=0)
    print('number of train sequences: {}'.format(train.shape[0]))
    print('number of test sequences: {}'.format(test.shape[0]))
    number_of_bases = max(train.values.max(), test.values.max()) + 1
    print('number of bases: {}'.format(number_of_bases))
    # helper for data preparation: rows of the identity matrix act as one-hot codes
    mapping_bases = np.eye(number_of_bases)
    return (
        (
            np.array([
                mapping_bases[sequence]
                for sequence in train.loc[:, train.columns != 'label'].values
            ]),
            train['label'].values
        ),
        (
            np.array([
                mapping_bases[sequence]
                for sequence in test.loc[:, test.columns != 'label'].values
            ]),
            test['label'].values
        )
    )
def get_learning_report(history_callback):
    """
    Generate a learning report pandas DataFrame from a keras history callback.
    """
    learning_report = pd.DataFrame(history_callback.history)
    learning_report.index += 1
    learning_report.index.name = 'Epochs'
    learning_report.columns = pd.MultiIndex.from_tuples(
        [
            ('Accuracy', 'Train'), ('Loss', 'Train'),
            ('Accuracy', 'Validation'), ('Loss', 'Validation'),
        ], names=['Metric', 'Set']
    )
    return learning_report
def plot_model(model, summary=True):
    """Plot a keras model and conditionally print its summary."""
    if summary:
        model.summary()
    return SVG(model_to_dot(model).create(prog='dot', format='svg'))
def plot_learning_report(learning_report):
    """Plot learning report."""
    learning_report['Loss'].plot(title='Loss')
    learning_report['Accuracy'].plot(title='Accuracy')
# loading the data
(x_train, y_train), (x_test, y_test) = load_data()
x_train.shape
# all sequences are 100 bases long, each base encoded as a 4-dimensional one-hot vector
# store some useful variables
# dropout rate: used later to limit overfitting
batch_size = 256
epochs = 10
dropout_rate = 0.2
length, number_of_bases = x_train.shape[1:]
learning_reports = {}
# fix a window of 6 nucleotide bases and learn 64 different filters
# output shape: (number of sliding window positions) x 64
# i.e. how many windows you slide over the sequence, with 64 filter activations each
# 1st convolution -> a 64-channel output, then take the maximum over positions (global max pooling)
# simple model
filters = 64
kernel_size = 6
model = Sequential()
# apply a first convolutional layer
model.add(Conv1D(
filters,
kernel_size,
padding='valid',
activation='relu',
strides=1,
input_shape=(length, number_of_bases)
))
# global max pooling:
model.add(GlobalMaxPooling1D())
# output layer with class prediction
model.add(Dense(1))
model.add(Activation('sigmoid'))
# compilation
model.compile(
loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy']
)
plot_model(model)
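To make the comments about the output shape concrete, here is a small sanity check (a sketch based on the model just defined) of the Conv1D output length and parameter count with 'valid' padding:
# sanity check: output length and parameter count of the first convolution
# with 'valid' padding and stride 1 there are length - kernel_size + 1 = 100 - 6 + 1 = 95
# window positions, each producing one activation per filter (64 channels)
conv_output_length = length - kernel_size + 1
print(conv_output_length, model.layers[0].output_shape)  # 95 and (None, 95, 64)
# each filter has kernel_size * number_of_bases weights plus one bias:
# 6 * 4 * 64 + 64 = 1600 parameters for the convolution; the dense output layer adds 64 + 1 = 65
conv_parameters = kernel_size * number_of_bases * filters + filters
print(conv_parameters, model.layers[0].count_params())  # both 1600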
# fitting
history = model.fit(
x_train, y_train,
batch_size=batch_size,
epochs=epochs,
validation_data=(x_test, y_test)
)
learning_reports['v1'] = get_learning_report(history)
# the loss will need more epochs to converge
# in terms of accuracy we are already plateauing at a good classification score
plot_learning_report(learning_reports['v1'])
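If you want to let the loss converge further, one optional variation (not part of the original exercise) is to train for more epochs and let Keras stop as soon as the validation loss stops improving:
# optional sketch: train longer but stop early when the validation loss stops improving
from keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(monitor='val_loss', patience=3)
history = model.fit(
    x_train, y_train,
    batch_size=batch_size,
    epochs=50,
    validation_data=(x_test, y_test),
    callbacks=[early_stopping]
)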
# => try to change the dimension of the window
# same architecture, but with a larger window
# doubling the window size roughly doubles the number of convolution weights (12 * 4 * 64 + 64 = 3136 vs 1600)
# simple model with kernel_size equal to the motif size
filters = 64
kernel_size = 12
model = Sequential()
# apply a first convolutional layer
model.add(Conv1D(
filters,
kernel_size,
padding='valid',
activation='relu',
strides=1,
input_shape=(length, number_of_bases)
))
# global max pooling:
model.add(GlobalMaxPooling1D())
# output layer with class prediction
model.add(Dense(1))
model.add(Activation('sigmoid'))
# compilation
model.compile(
loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy']
)
plot_model(model)
# fitting
history = model.fit(
x_train, y_train,
batch_size=batch_size,
epochs=epochs,
validation_data=(x_test, y_test)
)
learning_reports['v2'] = get_learning_report(history)
plot_learning_report(learning_reports['v2'])
# the larger window improved the network by about 3%
# by stacking layers we can improve the results further:
# the network learns how to combine features into more complex ones,
# at the price of adding weights (more computation)
# with dropout => we keep overfitting very low
# typical behaviour when using strong dropout:
# the model performs worse on the training set because it is prevented from using all the neurons at once
# simple model with kernel_size equal to the motif size and an additional convolution
# add a convolutional layer (go deeper)
filters = 64
kernel_size = 12
model = Sequential()
# apply a first convolutional layer
model.add(Conv1D(
filters,
kernel_size,
padding='valid',
activation='relu',
strides=1,
input_shape=(length, number_of_bases)
))
# deeper is better
filters = max(filters//2, 2)
kernel_size = max(kernel_size//2, 2)
model.add(MaxPooling1D())
model.add(Dropout(dropout_rate))
model.add(Conv1D(
filters,
kernel_size,
padding='valid',
activation='relu',
strides=1,
))
# global max pooling:
model.add(GlobalMaxPooling1D())
# output layer with class prediction
model.add(Dense(1))
model.add(Activation('sigmoid'))
# compilation
model.compile(
loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy']
)
plot_model(model)
# fitting
history = model.fit(
x_train, y_train,
batch_size=batch_size,
epochs=epochs,
validation_data=(x_test, y_test)
)
learning_reports['v3'] = get_learning_report(history)
plot_learning_report(learning_reports['v3'])
# simple model with kernel_size equal to the motif size and two additional convolutions
filters = 64
kernel_size = 12
model = Sequential()
# apply a first convolutional layer
model.add(Conv1D(
filters,
kernel_size,
padding='valid',
activation='relu',
strides=1,
input_shape=(length, number_of_bases)
))
# deeper is better
filters = max(filters//2, 2)
kernel_size = max(kernel_size//2, 2)
model.add(MaxPooling1D())
model.add(Dropout(dropout_rate))
model.add(Conv1D(
filters,
kernel_size,
padding='valid',
activation='relu',
strides=1,
))
# even deeper is even better
filters = max(filters//2, 2)
kernel_size = max(kernel_size//2, 2)
model.add(MaxPooling1D())
model.add(Dropout(dropout_rate))
model.add(Conv1D(
filters,
kernel_size,
padding='valid',
activation='relu',
strides=1,
))
# global max pooling:
model.add(GlobalMaxPooling1D())
# output layer with class prediction
model.add(Dense(1))
model.add(Activation('sigmoid'))
# compilation
model.compile(
loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy']
)
plot_model(model)
# fitting
history = model.fit(
x_train, y_train,
batch_size=batch_size,
epochs=epochs,
validation_data=(x_test, y_test)
)
learning_reports['v4'] = get_learning_report(history)
plot_learning_report(learning_reports['v4'])
# simple model with kernel_size equal to the motif size and three additional convolutions
filters = 64
kernel_size = 12
model = Sequential()
# apply a first convolutional layer
model.add(Conv1D(
filters,
kernel_size,
padding='valid',
activation='relu',
strides=1,
input_shape=(length, number_of_bases)
))
# deeper is better
filters = max(filters//2, 2)
kernel_size = max(kernel_size//2, 2)
model.add(MaxPooling1D())
model.add(Dropout(dropout_rate))
model.add(Conv1D(
filters,
kernel_size,
padding='valid',
activation='relu',
strides=1,
))
# even deeper is even better
filters = max(filters//2, 2)
kernel_size = max(kernel_size//2, 2)
model.add(MaxPooling1D())
model.add(Dropout(dropout_rate))
model.add(Conv1D(
filters,
kernel_size,
padding='valid',
activation='relu',
strides=1,
))
# abyssal would be the best
filters = max(filters//2, 2)
kernel_size = max(kernel_size//2, 2)
model.add(MaxPooling1D())
model.add(Dropout(dropout_rate))
model.add(Conv1D(
filters,
kernel_size,
padding='valid',
activation='relu',
strides=1,
))
# global max pooling:
model.add(GlobalMaxPooling1D())
# output layer with class prediction
model.add(Dense(1))
model.add(Activation('sigmoid'))
# compilation
model.compile(
loss='binary_crossentropy',
optimizer='adam',
metrics=['accuracy']
)
plot_model(model)
# fitting
history = model.fit(
x_train, y_train,
batch_size=batch_size,
epochs=epochs,
validation_data=(x_test, y_test)
)
learning_reports['v5'] = get_learning_report(history)
plot_learning_report(learning_reports['v5'])
# pd.Panel is deprecated in recent pandas versions, so we collect
# all reports in a single DataFrame with hierarchical columns instead
final_report = pd.concat(learning_reports, axis=1)
final_report
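To compare the five runs at a glance, the validation accuracy of every version can be plotted side by side (a small sketch based on the learning_reports dictionary built above):
# sketch: compare the validation accuracy of all model versions in a single plot
validation_accuracy = pd.DataFrame({
    version: report[('Accuracy', 'Validation')]
    for version, report in learning_reports.items()
})
validation_accuracy.plot(title='Validation accuracy per model version')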
In case you are more interested in the topic of sequence analysis and deep learning, we suggest looking at the following publications:
DeeperBind: uses more layers (a more complex model)
filters are fixed in size
you want to learn 64 filters
these filters are the weights of the network
you can have correlated weights (filters)
look at what the filters look like to see if you really need all of them
filters could be highly correlated, or many of them might be dying
if all of a filter's weights are low => you can remove it
just by plotting the filter vectors in a heatmap you will see it (see the sketch below)
a plot of the accuracy alone means little without cross-validation
usually you should at least plot a confidence band for the accuracy
(typically from something like 20-fold cross-validation)
with keras you can save all the logs/checkpoints to get the weights of all the filters
you can visualize how the filters transform your input or inspect the values of the weights
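As a concrete illustration of the heatmap idea, the kernels of the first convolutional layer can be pulled out of the last trained model and plotted directly (a sketch; any of the models above would do):
# sketch: inspect the learned filters of the first convolutional layer
# Conv1D kernels in keras have shape (kernel_size, number_of_bases, filters)
kernels, biases = model.layers[0].get_weights()
print(kernels.shape)
# flatten each filter into a row vector and show all filters as a heatmap:
# rows of near-zero weights hint at "dying" filters, near-identical rows at correlated ones
flat_filters = kernels.reshape(-1, kernels.shape[-1]).T
sns.heatmap(flat_filters, cmap='coolwarm', center=0)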
problem of class imbalance:
- weight the loss function (multiply the loss of each class by a factor): penalize performing well only on the dominant class and reward performing well on the least represented class (see the sketch after these notes)
- balance the dataset upstream to get a balanced training set (even if the test set stays imbalanced); if you have enough data it is better to create a balanced training set
in this kind of network, with lots of weights, a 70%-30% imbalance can lead to overfitting much of the time
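A hedged sketch of the first option in Keras: the class_weight argument of fit multiplies the loss contribution of each class by a factor, here computed with the usual inverse-frequency heuristic (in this exercise the classes are balanced, so both weights end up close to 1):
# sketch: compensate for class imbalance by weighting the loss per class
counts = np.bincount(np.asarray(y_train).astype(int))
class_weight = {
    label: len(y_train) / (2.0 * count)
    for label, count in enumerate(counts)
}
print(class_weight)
model.fit(
    x_train, y_train,
    batch_size=batch_size,
    epochs=epochs,
    validation_data=(x_test, y_test),
    class_weight=class_weight
)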
hyperopt => a very good Python package for hyperparameter optimization (see the sketch below)
there are also alternatives based on Bayesian optimization or genetic algorithms
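A minimal sketch of what a hyperopt search over the two hyperparameters used above could look like; the search space and the helper build_and_evaluate are illustrative, not part of the exercise:
# sketch: hyperparameter search with hyperopt over filters and kernel_size
from hyperopt import fmin, tpe, hp

def build_and_evaluate(params):
    """Train a small CNN with the sampled hyperparameters and return the validation loss."""
    model = Sequential()
    model.add(Conv1D(
        params['filters'], params['kernel_size'],
        padding='valid', activation='relu',
        input_shape=(length, number_of_bases)
    ))
    model.add(GlobalMaxPooling1D())
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    history = model.fit(
        x_train, y_train,
        batch_size=batch_size, epochs=epochs,
        validation_data=(x_test, y_test), verbose=0
    )
    return history.history['val_loss'][-1]

search_space = {
    'filters': hp.choice('filters', [16, 32, 64, 128]),
    'kernel_size': hp.choice('kernel_size', [4, 6, 8, 12]),
}
best = fmin(fn=build_and_evaluate, space=search_space, algo=tpe.suggest, max_evals=10)
print(best)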
when you get a classification and you want a p-value: with other machine learning methods you could obtain one from bootstrapping, but in deep learning this is usually too costly
another way: take similar inputs and look at how many times they score as well as the one you got
conv2d layers (and their transposed counterparts) can be used to reconstruct the image
CAM and Grad-CAM -> produce a kind of heatmap of the features the filters respond to