Finding Subsequences of Two-Valued Sequences with Correlation.

Let's say that we have some sequences of -1's and 1's like this: \[a = [1,-1,-1,-1,-1,1,1,-1,-1,1]\] Note that this kind of sequence is not so rare: we can convert this easily to a binary sequence by replacing -1 as 0. To get a binary sequence into this form, normalize it.

Let's also say that we want to find a subsequence that we think is something like \[sub = [1,1,-1,-1]\] Where should we look?

In this case it's pretty easy to spot potentials. In a much larger set, this would not be as easy. Moreover, it's not as easy as looping over the first sequence to see if there's an exact match because we also want to look at things that look similar to $$sub$$. Hm.

One potential way of doing this is to use the correlation to figure out where our subsequence is.

Recall that correlation of two random variables $$X$$ and $$Y$$ of the same size, which I'll call $$\rho$$ is given by \[\rho_{XY} = \frac{\sum_{i}x_{i}y_{i}}{\sigma_{X}\sigma_{Y}}\] where the $$x_{i}, y_{i}$$ are the $$i$$th elements of $$X, Y$$ and the $$\sigma$$'s are their respective standard deviations. This tells us how correlated $$X$$ and $$Y$$ are; that is, given $$X$$ how well can we predict values of $$Y$$ and vice versa.

This gist of our attack is this: we take $$sub$$ and notice that it's length is 4. We'll then take the first 4 elements of $$a$$ and find the correlation with $$sub$$. Then we'll take the 2nd to 5th element of $$a$$ and find the correlation to $$sub$$. We keep sliding over 1 unit in $$a$$ and comparing it to $$sub$$ until we reach the end of $$a$$. Notice that if the values match (are both 1 or both -1) then this contributes 1 to the sum in the correlation. If they do not match, this contributes -1. Our result will them be an array of values of matches minus mismatches.

At this point, note that we're going to drop the values of $$\sigma$$ in the correlation. These will be the same value for every pairing we make (as the value is computed across the entire sequences) and only serve to scale our values. We won't need them for this post.

In our above case, our resulting array from doing this sliding correlation algorithm is: \[[2, 0, -2, -4, 0, 4, 0]\] and looking at the index of the maximum value in this array (also called the $$\arg\max$$) we find that it occurs at the 6th place (or, in Python, the 5th index). This is, as you can check, when the subsequences occurs in the first sequence. If we plot the points, notice that the "peak" is at the 6th place (it's index 5 because Python is 0-indexed). Code is also included below.

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

a = np.array([1,-1,-1,-1,-1, 1, 1,-1,-1, 1])
sub = np.array([1, 1, -1, -1])

result = [sum(sub*a[i: i+len(sub)]) for i in range(len(a) - len(sub) + 1)]
plt.plot(range(len(result)), result)

Could this be a coincidence? Let's do a longer sequence.

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import random

a = np.array([random.choice([-1, 1]) for _ in range(10000)])
sub = np.array([1]*50 + [-1])

result = [sum(sub*a[i: i+len(sub)]) for i in range(len(a) - len(sub) + 1)]
plt.plot(range(len(result)), result)

Here, I have an original sequence of 10,000 elements and I'm looking for a subsequence that is fifty 1's followed by a -1. I get this figure as a result. Notice that there is a peak around 7,900 or so. Let's try to look closer at that.

Neat. Okay, blowing this up a little more, we find the sequence in the original sequence that we're matching to. It looks like this:

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
-1, 1, 1, 1, 1, 1, -1, 1, 1, 1,
-1, -1, -1, 1, 1, 1, 1, 1, 1, 1,
1, -1, -1, 1, 1, 1, -1, 1, 1, -1,
-1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1, 1]

Close, but not exact. But, then again, this sequence was randomly generated so there is a good chance our subsequence wasn't in our original sequence and this is the best we can do.

Homework. Try some sequences on your own. You can also try things with values like -2, -1, 1, and 2 (for example, looking at DNA sequences and coding each letter with one of these digits). This won't always work but will work well for a surprisingly large number of cases.

Homework. Our method here weighted matches and mismatches equally. What if we wanted to weight mismatches more heavily? How could we accomplish this? Similarly, what if we did not care about mismatches at all but wanted to just maximize matches?

We'll extend this to more general sequences in a later post (there's some tweaks that need to be done to have it work in general).