What this small notebook will show just a quick run-through of the thought process behind some of the work I've done and how I attempt to visualize with some meaning the results I find.
So, for this, we're going to be examining the result of one type of output for the 256 signatures for 1-methyladenosine (m1A) in biological context:
with N = A,C,G,T.
For our sequencing data, the signal we see comes in 3 main flavors:
Here, we'll be focusing on mutations only caused in deep sequencing as a result of the modification m1A.
So, we're looking at signals of C, G, and T.
To start, we'll import the necessary libraries we'll need: numpy for array handling, pandas to import the data, matplotlib and Axes3D for some interesting visualization, and sklearn's KMeans for clustering the data. Since we don't know what to expect, this is just an unsupervised learning to see what we'll find.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans
Importing the data and making sure everything looks good.
The data contains several columns:
The signature of the modification, so the A in the center with flanking random letters on either side (Remember, NNm1ANN)
Several columns containing the ratio of incorporated nucleotide at that position.
The last column is just the ratio of non-A signal in this case.
To verify, we can just write 'sig_df' to veirfy that we have 256 contexts, from the NN(mod)NN, or 4*4*1*4*4.
sig_df = pd.read_csv('./Aggregate.txt',sep='\t', header = 0)
sig_df.head()
So, again, we're looking at mutations only. So, in this case, we just want to see the non-A ratios, so we'll throw this into a numpy array since we can't use pandas' dataframes with our scikit-learn tools.
sig_np = sig_df.as_matrix(columns=['C_Ratio','G_Ratio','T_Ratio'])
We need to make one quick adjustment here. As you can see above, even though there is a modification at this position, it will sometimes incorporate the right letter in the output.
To account for variable incorporation, we'll normalize these mutational columns to add to 1.
sig_np = sig_np/sig_np.sum(axis=1)[:,None]
sig_np[0]
Now let's just take a look at the data. Very basic 3-D scatter plot, just showing the rough scatter of the ratios of misincorporation.
fig = plt.figure(figsize=(8,6))
ax = Axes3D(fig)
ax.scatter(sig_np[:,0],sig_np[:,1],sig_np[:,2])
ax.set_xlabel('C ratio')
ax.set_ylabel('G ratio')
ax.set_zlabel('T ratio')
ax.set_title('Scatter Plot of our Mutational Ratios')
ax.view_init(30,30)
plt.show()
There appears to be one very large cluster, but the rest of the points are sparse. However, there may be some patterning to the rest of the data.
And now we fit...
For the sake of this example, I'm going to allow for 6 different clusters. In various other data not presented here we've seen different patterns with fewer clusters, but this data looks to be pretty spread.
estimator = KMeans(6)
estimator.fit(sig_np)
labels = estimator.labels_
labels
This code just allows us to rotate the plot. As well, I wanted to set a color map just so I know what colors represent each cluster so I can call this later.
%matplotlib nbagg
label_color_map = {0:'k',
1:'r',
2:'g',
3:'y',
4:'b',
5:'c',
6:'m',
7:'orange'
}
label_color = [label_color_map[l] for l in labels]
fig = plt.figure(figsize=(8,6))
ax = Axes3D(fig)
ax.scatter(sig_np[:,0],sig_np[:,1],sig_np[:,2],c=label_color)
ax.set_xlabel('C ratio')
ax.set_ylabel('G ratio')
ax.set_zlabel('T ratio')
ax.set_xlim(0,.8)
ax.set_xlim(0,.8)
ax.set_xlim(0,.8)
ax.set_title('KMeans Clustered Signature Data (Misincorporated Nucleotide)')
for angle in range(0, 360):
ax.view_init(30, angle)
plt.draw()
cluster = pd.DataFrame(data=estimator.labels_,index=sig_df['Signature'])
cluster.columns=['KMeans_Cluster_ID']
cluster.head(n=10)
cluster[cluster['KMeans_Cluster_ID']==1]
This data represents the ID label for black points, which represent a high G ratio with low C and T.
For the most part, upon visual inspection, most of these seem to contain a G in the second position, so our context appears to be:
NG(m1A)NN
Possibly signifying that this nearest neighbor effect causes a G to be incorporated at the site of modification.