MSM estimation

Assuming our data was sampled in a time-correlated manner, as is the case for MD simulation data, we can use this clustering result as the basis for estimating a core-set Markov state model.
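In essence, an MSM is estimated by counting transitions between states at a chosen lag time and row-normalizing the resulting count matrix. The minimal sketch below illustrates the plain (non-core-set) variant; it is not the `csmsm` implementation, which additionally handles frames that lie outside the core sets via milestoning:

```python
import numpy as np

def estimate_transition_matrix(dtraj, n_states, lag=1):
    """Count state-to-state transitions at the given lag time and
    row-normalize the count matrix into transition probabilities.

    Simplified sketch: assumes every state is visited, i.e. no
    zero-count rows."""
    counts = np.zeros((n_states, n_states))
    for i, j in zip(dtraj[:-lag], dtraj[lag:]):
        counts[i, j] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# Toy discrete trajectory over two states
dtraj = np.array([0, 0, 1, 1, 0, 1, 1, 1, 0])
T = estimate_transition_matrix(dtraj, n_states=2)
```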

[37]:
from csmsm.estimator import CoresetMarkovStateModel
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In [37], line 1
----> 1 from csmsm.estimator import CoresetMarkovStateModel

ModuleNotFoundError: No module named 'csmsm'
[8]:
langerin = cluster.Clustering(
    np.load("md_example/langerin_projection.npy", allow_pickle=True),
    alias="C-type lectin langerin"
    )
[9]:
langerin.labels = np.load("md_example/langerin_labels.npy")
[11]:
M = CoresetMarkovStateModel(langerin.to_dtrajs(), unit="ns", step=1)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-42957f32a014> in <module>
----> 1 M = CoresetMarkovStateModel(langerin.to_dtrajs(), unit="ns", step=1)

src/csmsm/estimator.pyx in csmsm.estimator.CoresetMarkovStateModel.__init__()

TypeError: __init__() got an unexpected keyword argument 'unit'
[251]:
# Estimate csMSM for different lag times (given in steps)
lags = [1, 2, 4, 8, 15, 30]
for i in lags:
    M.cmsm(lag=i, minlenfactor=5, v=False)
    M.get_its()
[252]:
# Plot the time scales
fig, ax, *_ = M.plot_its()
fig.tight_layout(pad=0.1)
ax.set(**{
    "ylim": (0, None)
})
[252]:
[(0.0, 10459.606311240232)]
../_images/tutorial_md_example__msm.backup_7_1.png
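The implied timescales plotted above follow from the eigenvalues of the estimated transition matrix via the standard relation t_i = -τ / ln λ_i. A small illustration of this relation (not the `plot_its` implementation itself):

```python
import numpy as np

def implied_timescales(T, lag):
    """Implied timescales t_i = -lag / ln(lambda_i), computed from the
    transition-matrix eigenvalues below the stationary one (lambda_0 = 1)."""
    eigvals = np.sort(np.abs(np.linalg.eigvals(T)))[::-1]
    return -lag / np.log(eigvals[1:])

# Toy two-state transition matrix with eigenvalues 1.0 and 0.7
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])
ts = implied_timescales(T, lag=1)
```

For a converged MSM these timescales should stay constant as the lag time grows, which is what the plot above is used to check.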
[224]:
fig, ax = plt.subplots()
matrix = ax.imshow(M.T, cmap=mpl.cm.inferno)
fig.colorbar(matrix)
ax.set(**{
    "aspect": "equal",
    "xticks": range(len(M.T)),
    "xticklabels": range(1, len(M.T) + 1),
    "yticks": range(len(M.T)),
    "yticklabels": range(1, len(M.T) + 1)
})
plt.show()
../_images/tutorial_md_example__msm.backup_8_0.png

Prediction

[134]:
# Let's make sure we work on the correctly clustered object
print("Label", "r", "c", sep="\t")
print("-" * 20)
for k, v in sorted(langerin_reduced.labels.info.params.items()):
    print(k, *v, sep="\t")
Label   r       c
--------------------
1       0.19    15
2       0.4     5
3       0.25    15
4       0.4     5
5       0.375   10
6       0.375   10
7       0.19    15
8       0.19    15
9       0.5     5
10      0.5     5
11      0.5     5
12      0.5     5
13      0.375   10
14      0.375   10
15      0.5     5
16      0.19    15
17      0.25    15
[142]:
langerin_reduced_less = langerin.cut(points=(None, None, 50))
[143]:
langerin_reduced_less.calc_dist(langerin_reduced, mmap=True, mmap_file="/home/janjoswig/tmp/tmp.npy", chunksize=1000)  # Distance map calculation
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-143-4379ad10935c> in <module>
----> 1 langerin_reduced_less.calc_dist(langerin_reduced, mmap=True, mmap_file="/home/janjoswig/tmp/tmp.npy", chunksize=1000)  # Distance map calculation

~/CNN/cnnclustering/cnn.py in calc_dist(self, other, v, method, mmap, mmap_file, chunksize, progress, **kwargs)
   1732                 len_self = self.data.points.shape[0]
   1733                 len_other = other.data.points.shape[0]
-> 1734                 self.data.distances = np.memmap(
   1735                     mmap_file,
   1736                     dtype=settings.float_precision_map[

~/CNN/cnnclustering/cnn.py in distances(self, x)
   1124     def distances(self, x):
   1125         if not isinstance(x, Distances):
-> 1126             x = Distances(x)
   1127         self._distances = x
   1128

~/CNN/cnnclustering/cnn.py in __new__(cls, d, reference)
    722             d = []
    723
--> 724         obj = np.atleast_2d(np.asarray(d, dtype=np.float_)).view(cls)
    725         obj._reference = None
    726         return obj

~/.pyenv/versions/miniconda3-4.7.12/envs/cnnclustering/lib/python3.8/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     81
     82     """
---> 83     return array(a, dtype, copy=False, order=order)
     84
     85

MemoryError: Unable to allocate 10.5 GiB for an array with shape (52942, 26528) and data type float64
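The allocation fails because the full 52942 × 26528 distance matrix is materialized in memory as float64 before it reaches the memmap. A generic workaround, independent of the `cnnclustering` API, is to compute the distances chunk by chunk directly into a disk-backed memmap, here using float32 to halve the footprint:

```python
import tempfile
import numpy as np

def chunked_cdist_memmap(a, b, mmap_file, chunksize=1000, dtype=np.float32):
    """Write the pairwise Euclidean distance matrix between point sets
    `a` (n, d) and `b` (m, d) into a disk-backed memmap, one chunk of
    rows at a time, so the full (n, m) matrix never lives in RAM."""
    out = np.memmap(mmap_file, dtype=dtype, mode="w+", shape=(len(a), len(b)))
    for start in range(0, len(a), chunksize):
        stop = min(start + chunksize, len(a))
        # Broadcasted differences for this chunk of rows only
        diff = a[start:stop, None, :] - b[None, :, :]
        out[start:stop] = np.sqrt((diff ** 2).sum(axis=-1))
    out.flush()
    return out

a = np.array([[0.0, 0.0], [3.0, 4.0]])
b = np.array([[0.0, 0.0]])
with tempfile.NamedTemporaryFile(suffix=".npy") as f:
    distances = chunked_cdist_memmap(a, b, f.name)
    d0, d1 = float(distances[0, 0]), float(distances[1, 0])
```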

Cluster alternatives

It is always recommended to cross-validate a clustering result against the outcome of other clustering approaches. Here we take a quick look at the alternative that density-peak clustering provides.

[61]:
pydpc_clustering = pydpc.Cluster(langerin_reduced.data.points)
../_images/tutorial_md_example__msm.backup_15_0.png

Clustering in this case is a single step without the need for parameter specification. We do, however, need to extract the actual clusters afterwards by inspecting the decision graph below. Points that are clearly isolated in this plot are highly reliable cluster centers.
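The two quantities behind such a decision graph are each point's local density and its distance to the nearest point of higher density; genuine cluster centers score high on both. A naive O(n²) sketch of these quantities follows (the actual pydpc implementation differs in details such as the density kernel and tie handling):

```python
import numpy as np

def decision_graph(points, cutoff):
    """Compute the two density-peak quantities: local density (number
    of neighbours within `cutoff`) and delta (distance to the nearest
    point of higher density; for maximum-density points, the distance
    to the farthest point)."""
    dist = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(-1))
    density = (dist < cutoff).sum(axis=1) - 1  # exclude self
    delta = np.empty(len(points))
    for i in range(len(points)):
        higher = density > density[i]
        delta[i] = dist[i, higher].min() if higher.any() else dist[i].max()
    return density, delta

# Two toy groups: a dense triple near the origin, a sparse pair at (5, 5)
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0],
                   [5.0, 5.0], [5.1, 5.0]])
density, delta = decision_graph(points, cutoff=0.5)
```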

[65]:
pydpc_clustering.autoplot = True
[66]:
pydpc_clustering.assign(0, 1.8)
../_images/tutorial_md_example__msm.backup_18_0.png

This gives us 7 clusters.

[70]:
langerin_reduced.labels = (pydpc_clustering.membership + 1)
draw_evaluate(langerin_reduced)
../_images/tutorial_md_example__msm.backup_20_0.png

As we are interested in core clusters, we want to apply the core/halo criterion and disregard points with a low cluster membership probability as noise.

[71]:
langerin_reduced.labels[pydpc_clustering.halo_idx] = 0
draw_evaluate(langerin_reduced)
../_images/tutorial_md_example__msm.backup_22_0.png
[72]:
M = cmsm.CMSM(langerin_reduced.get_dtraj(), unit="ns", step=1)
[73]:
# Estimate csMSM for different lag times (given in steps)
lags = [1, 2, 4, 8, 15, 30]
for i in lags:
    M.cmsm(lag=i, minlenfactor=5)
    M.get_its()

*********************************************************
---------------------------------------------------------
Computing coreset MSM at lagtime 1 ns
---------------------------------------------------------

Using 116 trajectories with 25900 steps over 7 coresets

All sets are connected
---------------------------------------------------------
*********************************************************


*********************************************************
---------------------------------------------------------
Computing coreset MSM at lagtime 2 ns
---------------------------------------------------------

Using 116 trajectories with 25900 steps over 7 coresets

All sets are connected
---------------------------------------------------------
*********************************************************


*********************************************************
---------------------------------------------------------
Computing coreset MSM at lagtime 4 ns
---------------------------------------------------------

Using 116 trajectories with 25900 steps over 7 coresets

All sets are connected
---------------------------------------------------------
*********************************************************


*********************************************************
---------------------------------------------------------
Computing coreset MSM at lagtime 8 ns
---------------------------------------------------------

Using 116 trajectories with 25900 steps over 7 coresets

All sets are connected
---------------------------------------------------------
*********************************************************


*********************************************************
---------------------------------------------------------
Computing coreset MSM at lagtime 15 ns
---------------------------------------------------------

Trajectories [0, 1, 73]
are shorter then step threshold (lag * minlenfactor = 75)
and will not be used to compute the MSM.

Using 113 trajectories with 25732 steps over 7 coresets

All sets are connected
---------------------------------------------------------
*********************************************************


*********************************************************
---------------------------------------------------------
Computing coreset MSM at lagtime 30 ns
---------------------------------------------------------

Trajectories [0, 1, 4, 63, 73]
are shorter then step threshold (lag * minlenfactor = 150)
and will not be used to compute the MSM.

Using 111 trajectories with 25447 steps over 7 coresets

All sets are connected
---------------------------------------------------------
*********************************************************

[74]:
# Plot the time scales
fig, ax, *_ = M.plot_its()
fig.tight_layout(pad=0.1)
../_images/tutorial_md_example__msm.backup_25_0.png
[75]:
figsize = mpl.rcParams["figure.figsize"]
mpl.rcParams["figure.figsize"] = figsize[0], figsize[1] * 0.2
M.plot_eigenvectors()
mpl.rcParams["figure.figsize"] = figsize
../_images/tutorial_md_example__msm.backup_26_0.png
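For reference, the stationary distribution underlying such an eigenvector analysis is the left eigenvector of the transition matrix for eigenvalue 1. A small sketch, assuming a row-stochastic matrix `T` (not the `plot_eigenvectors` implementation):

```python
import numpy as np

def stationary_distribution(T):
    """Stationary distribution: the left eigenvector of the
    row-stochastic matrix T for eigenvalue 1, normalized to sum to one."""
    eigvals, eigvecs = np.linalg.eig(T.T)  # left eigenvectors of T
    pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    return pi / pi.sum()

# Toy two-state transition matrix; analytic result is (2/3, 1/3)
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])
pi = stationary_distribution(T)
```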

This result is in good agreement with the one we obtained manually, and it is arguably faster and easier to achieve. If we decide that this result is exactly what we consider valid, then this is fine. If, on the other hand, we want to tune the clustering result further, with respect to splitting, the noise level, and what is considered noise in the first place, the manual approach gives us more flexibility.