Question
Build your own Hidden Markov Model
So far we have discussed the Markov chain and its relation to HMMs. Now it’s time to construct your own HMM to solve a specific problem.
Scenario:
Suppose you are a bioinformatician working in a microbiology research group. The microbiologists in your lab have identified a new bacterium that possibly causes stomach cancer. Your job is to annotate the genome of this new bacterium. The first step of genome annotation is to identify the coding and non-coding regions. You decide to achieve that by constructing an HMM.
Construct an HMM by answering the following questions:
What are the hidden states? What is the observation?
Hidden states: whether a region of the genome is coding or non-coding.
Observations: the nucleotide sequence of the genome itself.
In the Urn and Ball problem, each hidden state (urn) is linked with one observation (a ball). If you applied the same rule to this problem, every neighboring nucleotide could come from a different hidden state, which is obviously a wrong assumption for a genome. Therefore, you need to link one hidden state to a window of nucleotides.
To simplify the problem, let’s assume that you have in total K windows of length L (w1, w2, …, wK), and the position of each window wi is fixed on the genome. Consequently, you will have a sequence of hidden states s1, s2, …, sK.
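For concreteness, here is a minimal sketch of this windowing step in Python; the genome string and L below are placeholders, not real data:

```python
# A minimal sketch of splitting a genome into K fixed windows of length L.
genome = "ATGCGTAACTGATAGGCCATTAGGCTA"   # illustrative string, not real data
L = 4                                    # window length (an assumption here)

# Fixed, non-overlapping windows; a tail shorter than L is dropped here.
windows = [genome[i:i + L] for i in range(0, len(genome) - L + 1, L)]
K = len(windows)

# Each window wi will be emitted by one hidden state si (coding / non-coding),
# so the hidden-state sequence s1 ... sK has the same length as 'windows'.
print(K, windows)
```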
Draw the hidden states and nucleotide windows as in the Urn and Ball example on Wikipedia (https://en.wikipedia.org/wiki/Hidden_Markov_model#A_concrete_example).
In the Urn and Ball example, the hidden states and observations are connected by probabilities, e.g. the probability of a particular ball being drawn from a particular urn. How would you connect the hidden states and observations in this problem? (Hint: as we spent so much time discussing Markov chains, it shouldn’t be hard for you to guess the answer.)
Observe your graph. How many Markov chains are needed in this problem? What are they? (Hint: neighboring nucleotide windows may be from different hidden states. How would you model the relationship of neighboring hidden states? Are they independent of each other?)
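One possible reading of this hint (a sketch, not the only valid answer): give each hidden state its own nucleotide-level Markov chain, so the emission probability of a window is the product of that chain's transition probabilities along the window; a third chain then links neighboring hidden states. The transition numbers below are made up for illustration:

```python
from math import log

# Illustrative first-order nucleotide transition tables for the two hidden
# states; in practice these would be estimated from annotated training data.
coding = {a: {b: 0.25 for b in "ACGT"} for a in "ACGT"}
coding["G"] = {"A": 0.20, "C": 0.40, "G": 0.20, "T": 0.20}   # GC-rich bias
noncoding = {a: {b: 0.25 for b in "ACGT"} for a in "ACGT"}
noncoding["A"] = {"A": 0.20, "C": 0.20, "G": 0.20, "T": 0.40}  # AT-rich bias

def window_log_prob(window, chain):
    """Log emission probability of a window under one state's Markov chain:
    the sum of log transition probabilities between neighboring nucleotides."""
    return sum(log(chain[a][b]) for a, b in zip(window, window[1:]))

w = "GCGC"
print(window_log_prob(w, coding), window_log_prob(w, noncoding))
```

Under this setup, the window above scores higher under the coding chain than the non-coding one, which is exactly how the emission probabilities discriminate between hidden states.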
Assume you will use first-order Markov chains. Describe your strategy for calculating the transition matrix and the initial probabilities. (Hint: obviously, you need data to calculate these parameters. In the previous exercise, you calculated the transition matrix from a given DNA sequence (Q5 on Wednesday). However, that strategy will not work for this problem because you don’t know where the coding regions are. Therefore, what data do you want to use? Where can you find that data?)
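As one possible strategy (a sketch assuming you can obtain annotated genomes of related bacteria, e.g. from a public database such as GenBank, where coding regions are already labeled): convert the annotation into one hidden-state label per window, then count and normalize. The labels below are illustrative:

```python
from collections import Counter

# Hidden-state labels per window for one annotated training genome
# (illustrative; real labels would come from annotated genomes of related
# species with known coding regions).
labels = ["N", "C", "C", "C", "N", "N", "C", "C", "N"]  # C = coding, N = non-coding
states = ["C", "N"]

# Initial probability: frequency of each state at the first window.
# With one training sequence this is a point estimate; pool many genomes in practice.
init = {s: 1.0 if labels[0] == s else 0.0 for s in states}

# Transition matrix: normalized counts of state-to-state steps between windows.
pair_counts = Counter(zip(labels, labels[1:]))
trans = {}
for s in states:
    total = sum(pair_counts[(s, t)] for t in states)
    trans[s] = {t: pair_counts[(s, t)] / total for t in states}
print(init, trans)
```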
Now you have set up your HMM. It is obvious that for each window wi, you will have several options of hidden states. The amount of computation is enormous if you use a brute-force strategy (does this problem sound familiar to you?). However, statisticians developed a really smart method called the Viterbi algorithm. It is based on dynamic programming (our old friend!).
Review what you’ve learnt about dynamic programming in the global alignment problem:
for each cell (i, j), the score is the max of: score(i−1, j) − gap penalty, score(i, j−1) − gap penalty, and score(i−1, j−1) − mismatch penalty (or score(i−1, j−1) + matching reward). The backtracking direction is determined accordingly. The score for the base case (the very top-left cell in the grid) is 0.
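As a refresher, here is a minimal sketch of that score-matrix fill; the gap, mismatch, and match values are arbitrary placeholders:

```python
def global_alignment_scores(a, b, gap=-2, mismatch=-1, match=1):
    """Fill the global alignment score matrix; score[i][j] is the best score
    for aligning a[:i] with b[:j]. The top-left base case is 0."""
    score = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):          # first column: all gaps in b
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, len(b) + 1):          # first row: all gaps in a
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap,       # gap in a
                              score[i - 1][j - 1] + diag)  # match/mismatch
    return score

print(global_alignment_scores("GAT", "GCT")[-1][-1])
```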
Fill in the following table to adapt the HMM into a dynamic programming problem:
What are the counterparts?
Global alignment | HMM
You are interested in probabilities rather than scores. How would you apply the same rule to the HMM? Let’s work on the probability for the base case (i.e. the beginning window w1). What parameters do you need? How do you connect the next case (w2) with the base case (w1)?
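To make the analogy concrete, here is a minimal Viterbi sketch under the setup above. The function and parameter names are illustrative, and emit_log_prob stands in for whatever emission model you chose (e.g. the per-state nucleotide chains sketched earlier):

```python
from math import log, inf

def viterbi(windows, states, init, trans, emit_log_prob):
    """Most probable hidden-state path for a sequence of windows.
    init[s] and trans[p][s] are probabilities (transitions assumed nonzero;
    use pseudocounts when estimating them if zeros can occur).
    emit_log_prob(window, s) returns the log emission probability."""
    # Base case (window w1): probability of starting in state s and
    # emitting the first window.
    V = [{s: (log(init[s]) if init[s] > 0 else -inf) + emit_log_prob(windows[0], s)
          for s in states}]
    back = [{}]
    # Recursion: for each later window, keep only the best predecessor per state,
    # exactly as each alignment cell keeps only its best incoming direction.
    for w in windows[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + log(trans[p][s]))
            ptr[s] = prev
            col[s] = V[-1][prev] + log(trans[prev][s]) + emit_log_prob(w, s)
        V.append(col)
        back.append(ptr)
    # Backtracking, as in global alignment.
    state = max(states, key=lambda s: V[-1][s])
    path = [state]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    return path[::-1]
```

Given per-window emission scores like window_log_prob from the earlier sketch, this returns one coding/non-coding label per window, which is the annotation you set out to produce.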
Now you have an HMM. Remember that gene structure is complicated, as we discussed in Unit 1. Do you think this model is good at identifying genes? If not, what changes do you want to make? (Hint: the setting of an HMM is adjustable, e.g. the number of hidden states, the length of the nucleotide windows, the positions of the windows, the Markov chains, etc.)
Explanation / Answer
The hidden Markov model is one of the most important machine learning models in speech and language processing. To define it properly, we need to first introduce the Markov chain, sometimes called the observed Markov model. Markov chains and hidden Markov models are both extensions of the finite automata of Chapter 3. Recall that a weighted finite automaton is defined by a set of states and a set of transitions between states, with each arc associated with a weight. A Markov chain is a special case of a weighted automaton in which weights are probabilities (the probabilities on all arcs leaving a node must sum to 1) and in which the input sequence uniquely determines which states the automaton will go through. Because it can’t represent inherently ambiguous problems, a Markov chain is only useful for assigning probabilities to unambiguous sequences.

A Markov chain is useful when we need to compute a probability for a sequence of events that we can observe in the world. In many cases, however, the events we are interested in may not be directly observable in the world. For example, in Chapter 10 we’ll introduce the task of part-of-speech tagging, assigning tags like Noun and Verb to words. We didn’t observe part-of-speech tags in the world; we saw words and had to infer the correct tags from the word sequence. We call the part-of-speech tags hidden because they are not observed.