Question
I am working with a set of DNA motifs that are predicted as potential regulatory motifs (e.g. transcription factor binding sites). The motifs belong to several species, and I wanted to cluster these motifs via their Position Weight Matrices (PWMs) (also known as PSSMs) to collapse similar motifs together into groups.
There is a tool called MATLIGN that does what I need, but the format it requires for the PWMs is different from what I have. Its documentation states:
"Matrices must be in the frequency matrix format (only integer numbers are acceptable)"
The problem is that my PWMs contain decimal values rather than integers, e.g.:
Pos  A         C         G         T
1    0.000000  1.000000  0.000000  0.000000
2    1.000000  0.000000  0.000000  0.000000
3    0.000000  0.000000  1.000000  0.000000
4    0.000000  0.421755  0.000000  0.578245
5    0.289407  0.000000  0.282556  0.428038
In other words, instead of the decimal values I have in my matrix I need to have integer counts. Could anybody suggest what I can do? Would I need to create artificial "pseudo-counts"?
Explanation / Answer
So what you need is your data expressed as counts instead of proportions. Even if you do not have the raw count matrix, you only need to multiply each proportion by the total number of binding sites used to build the motif (i.e. the number of sequences that were analysed), since proportion = count / total number of binding sites. Because the proportions are given to six decimal places, the products may need rounding to the nearest integer. You should have that total somewhere.
Indeed, this missing piece of information was available to me: it came as a variable called nsites, which is the total number of DNA sites that the PWM was generated from.
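The conversion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name proportions_to_counts is hypothetical, the value nsites = 20 is made up for the example (use the real value that comes with your motif), and the largest-remainder rounding is one possible way to make each row sum to exactly nsites, not something MATLIGN is known to require.

```python
import math


def proportions_to_counts(pwm, nsites):
    """Convert rows of proportions (each summing to ~1) into integer counts.

    Largest-remainder rounding is used so every row sums to exactly nsites.
    """
    counts = []
    for row in pwm:
        scaled = [p * nsites for p in row]
        floors = [math.floor(s) for s in scaled]
        # Units still to hand out so that the row totals nsites.
        leftover = max(nsites - sum(floors), 0)
        # Give them to the entries with the largest fractional parts.
        order = sorted(
            range(len(row)),
            key=lambda i: scaled[i] - floors[i],
            reverse=True,
        )
        for i in order[:leftover]:
            floors[i] += 1
        counts.append(floors)
    return counts


# Columns are A, C, G, T; rows are motif positions 1 to 5 from the question.
pwm = [
    [0.000000, 1.000000, 0.000000, 0.000000],
    [1.000000, 0.000000, 0.000000, 0.000000],
    [0.000000, 0.000000, 1.000000, 0.000000],
    [0.000000, 0.421755, 0.000000, 0.578245],
    [0.289407, 0.000000, 0.282556, 0.428038],
]
nsites = 20  # ASSUMPTION: hypothetical value, replace with your motif's nsites

for position, row in enumerate(proportions_to_counts(pwm, nsites), start=1):
    print(position, *row)
```

Plain rounding of each value (round(p * nsites)) is simpler and often good enough, but it can leave a row summing to nsites plus or minus one, which is why the sketch distributes the remainder explicitly.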