# 14 – Markov Models

Let’s now look at a popular model for time-series data. Time-series data means data in a sequence, where the order of the elements has some meaningful interpretation. For example, the weather across the days of a week is time-series data; weather across random days of a year is not. Time-series data presents interesting challenges, because the value of an element typically depends on the elements preceding it.

In this post, we’ll talk about Markov models. Let’s discuss notation. Let $S = \{ s_1, s_2, \ldots, s_{|S|}\}$ represent the possible states that you can observe. In our weather example, we might have sunny, rainy, and cloudy states. Suppose we are given a sequence with $T$ elements: $\{z_1,z_2,\ldots,z_T\}$. We’ll make the assumption that the time-series data came from a Markov chain.

In a Markov chain, the assumption is that the value of the current element depends only on the preceding one, and not the others. A Markov chain has a transition matrix, which holds the probability values for each transition. Continuing our weather example, the transition matrix would have probabilities that today will be sunny, cloudy, or rainy, given that yesterday was any of these. So this transition matrix will be $3 \times 3$.  Another assumption of Markov chains is that this transition matrix remains constant over time.

We can also have a parameter representing the initial state distribution, denoted by $\pi$. For our weather example, we might choose to simply leave it as $1/3$ for all 3 weather states. Now given a state sequence, we can compute its probability.
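To make this concrete, here’s a minimal sketch of the weather chain in Python (NumPy assumed; the transition probabilities are made up for illustration):

```python
import numpy as np

# Hypothetical weather chain. A[i, j] = p(today is state j | yesterday was state i);
# the probability values here are made up for illustration.
states = ["sunny", "cloudy", "rainy"]
A = np.array([
    [0.7, 0.2, 0.1],   # yesterday was sunny
    [0.3, 0.4, 0.3],   # yesterday was cloudy
    [0.2, 0.4, 0.4],   # yesterday was rainy
])

# Uniform initial state distribution pi over the 3 states.
pi = np.full(3, 1 / 3)
```

Each row of `A` is a probability distribution over today’s weather given yesterday’s, so each row must sum to 1.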

## Probability of a state sequence

If $z_0$ is the start state, then \begin{aligned} p(z) &= p(z_T, z_{T-1}, \ldots, z_1, z_0; A) \\ &= p(z_T|z_{T-1}; A)p(z_{T-1}|z_{T-2}; A)\ldots p(z_1|z_0; A) \\ &= \prod\limits_{t=1}^T p(z_t|z_{t-1}; A) \\ &= \prod\limits_{t=1}^T A_{z_{t-1}z_t} \end{aligned}

This shouldn’t be too hard to follow. The second step combines the chain rule of probability with the Markov assumption, so each conditional only needs the preceding element; the $p(z_0)$ term disappears because the start state is fixed.
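As a sketch, computing this product for an observed sequence is just a walk along the transition matrix (NumPy assumed; states are encoded as integer indices into $A$, and the matrix values are made up):

```python
import numpy as np

def sequence_probability(z, A):
    # p(z_1, ..., z_T | z_0) = prod_{t=1}^T A[z_{t-1}, z_t],
    # where z = [z_0, z_1, ..., z_T] is a list of integer state indices.
    return float(np.prod([A[z[t - 1], z[t]] for t in range(1, len(z))]))

# Made-up weather transition matrix (0=sunny, 1=cloudy, 2=rainy).
A = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.4, 0.4]])

# sunny -> sunny -> rainy: 0.7 * 0.1 = 0.07
p = sequence_probability([0, 0, 2], A)
```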

So finding the probability of a state sequence given the parameter $A$ isn’t very hard, but it raises the question: how do you find $A$ in the first place? We’ve already covered the tool for this:

## Maximum likelihood parameter assignment

Let’s define the log-likelihood of the Markov chain: \begin{aligned} l(A) &= \log p(z; A) \\ &= \log \prod_{t=1}^T A_{z_{t-1}z_t} \\ &= \sum\limits_{t=1}^T \log A_{z_{t-1}z_t} \\ &= \sum\limits_{i=1}^{|S|} \sum\limits_{j=1}^{|S|} \sum\limits_{t=1}^T [z_{t-1} = s_i, z_t = s_j] \log A_{ij} \end{aligned}

Here, $[\cdot]$ is the Iverson bracket, which evaluates to 1 if the condition inside it holds and 0 otherwise.
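In code, the log-likelihood is just the log of the product above, computed as a sum of log-transition probabilities for numerical stability (a sketch, assuming NumPy and integer-encoded states):

```python
import numpy as np

def log_likelihood(z, A):
    # l(A) = sum_{t=1}^T log A[z_{t-1}, z_t] for one observed sequence
    # z = [z_0, ..., z_T] of integer state indices.
    return float(sum(np.log(A[z[t - 1], z[t]]) for t in range(1, len(z))))

# Tiny 2-state example: every transition has probability 0.5.
A = np.array([[0.5, 0.5],
              [0.5, 0.5]])
ll = log_likelihood([0, 1, 0], A)  # log(0.5) + log(0.5) = log(0.25)
```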

We need to carefully note that each row of the transition matrix must add up to 1. Further, all these probability values must be non-negative. Our optimization problem is \begin{aligned} \max\limits_{A} \text{ } &l(A) \\ \text{s.t.} & \sum\limits_{j=1}^{|S|} A_{ij}=1, i = 1, \ldots, |S| \\ & A_{ij} \geq 0, i, j = 1, \ldots, |S| \end{aligned}

Our Lagrangian looks like this (we only enforce the equality constraints; the non-negativity constraints turn out to be satisfied automatically by the solution): \begin{aligned} \mathcal{L}(A, \alpha) = &\sum\limits_{i=1}^{|S|} \sum\limits_{j=1}^{|S|} \sum\limits_{t=1}^T [z_{t-1} = s_i, z_t = s_j] \log A_{ij} + \sum\limits_{i=1}^{|S|} \alpha_i \left( 1 - \sum\limits_{j=1}^{|S|} A_{ij} \right) \end{aligned}

It’s quite easy to solve for the parameters analytically. Let’s take the partial derivatives and set them to 0 to get them: \begin{aligned} \frac{\partial \mathcal{L}}{\partial A_{ij}} &= \frac{1}{A_{ij}}\sum\limits_{t=1}^T [z_{t-1}=s_i, z_t = s_j] - \alpha_i \equiv 0 \\ &\Rightarrow \boxed{A_{ij} = \frac{1}{\alpha_i}\sum\limits_{t=1}^T [z_{t-1}=s_i, z_t = s_j]} \\ \frac{\partial \mathcal{L}}{\partial \alpha_i} &= 1 - \sum\limits_{j=1}^{|S|} A_{ij} \\ &= 1 - \sum\limits_{j=1}^{|S|}\frac{1}{\alpha_i}\sum\limits_{t=1}^T [z_{t-1}=s_i, z_t = s_j] \equiv 0 \\ &\Rightarrow \alpha_i = \sum\limits_{j=1}^{|S|}\sum\limits_{t=1}^T [z_{t-1}=s_i, z_t =s_j] \\ &\Rightarrow \boxed{ \alpha_i = \sum\limits_{t=1}^T [z_{t-1}=s_i] } \end{aligned}

Let’s briefly review. For the first equation, the two outer summations collapse because every term not involving the specific $A_{ij}$ we’re differentiating with respect to is a constant, so its derivative is zero. It can be a little confusing in the beginning, but you’ll get used to it. We use the same collapsing trick for the next equation, noting that at each time step $z_t$ can only equal one particular $s_j$, which removes both that condition from the Iverson bracket and the corresponding summation over $j$. For the last step, we also substituted the previously obtained value of $A_{ij}$.

Now we substitute the values of $\alpha_i$ in the equation for $A_{ij}$: $\boxed{ A_{ij} = \frac{\sum\limits_{t=1}^T [z_{t-1}=s_i, z_t = s_j]}{ \sum\limits_{t=1}^T [z_{t-1}=s_i]}}$

This is very intuitive. It says that to find the probability of a transition from one state to another, look at the sequence you have for times when you made this transition (the numerator), and to get a probability, divide this by the number of times you were in the first state (the denominator).
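The boxed estimator is just “count and normalize.” A minimal sketch (NumPy assumed; the observed sequence is made up):

```python
import numpy as np

def fit_transition_matrix(z, n_states):
    # counts[i, j] = number of times state i is immediately followed by state j.
    counts = np.zeros((n_states, n_states))
    for prev, cur in zip(z[:-1], z[1:]):
        counts[prev, cur] += 1
    # Divide each row by the number of times we were in state i
    # (rows for unvisited states are left as all zeros).
    row_totals = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_totals,
                     out=np.zeros_like(counts), where=row_totals > 0)

# Made-up observed sequence over 3 states.
z = [0, 0, 1, 0, 2, 2, 0, 0]
A_hat = fit_transition_matrix(z, 3)
# From state 0 we observed transitions 0->0, 0->1, 0->2, 0->0,
# so the first row is [0.5, 0.25, 0.25].
```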

We’ll very briefly mention here that this is pretty much exactly what n-gram models in Natural Language Processing do (if you don’t know what n-grams are, skip this paragraph). Recall that in an n-gram model, to find the probability of say, a bigram, you essentially compute conditional probabilities, which evaluate to what we derived above. Even in NLP, you could use MLE to estimate the n-gram probabilities, and you’d arrive at the same formula above, though the notation might differ.

An observant reader would notice that we haven’t talked much about our initial state distribution. This is something that you have to set up, maybe by looking at historical data. So you might look at the weather over the last 5 years, and say, “Okay, it’s been sunny about 70% of the days, cloudy about 10%, and rainy about 20% of the days, so let’s assign the initial state distribution as 0.7 for sunny, 0.1 for cloudy, and 0.2 for rainy.”
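That frequency-based estimate of $\pi$ is a one-liner (a sketch assuming NumPy, with made-up historical data chosen to match the numbers above):

```python
import numpy as np

# Hypothetical historical observations: 0=sunny, 1=cloudy, 2=rainy.
history = [0, 0, 0, 1, 2, 0, 0, 2, 0, 0]

# Initial state distribution = empirical frequency of each state.
pi = np.bincount(history, minlength=3) / len(history)  # [0.7, 0.1, 0.2]
```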

## Conclusion

We’ve talked about an interesting time-series model, the Markov model. It is built on Markov chains, which have many other uses as well, most notably Markov chain Monte Carlo (MCMC). Next, we’ll discuss an extension of Markov models: Hidden Markov Models (HMMs).