There are many questions that come up when looking at the results of our algorithm run on genomic data.
- We find 608 motifs.
- These 608 motifs have about 2,500,000 occurrences in the human genome.
- Of these occurrences, about 2,000,000 are within 500 kb of a gene with orthologs in all three species.
- Of these 2,000,000 occurrences, approximately 100,000 (about 5%) are duplicates. That
is, at about 5% of the locations in the human genome where some motif A hits, another motif
B hits at exactly the same position. A larger percentage overlap; the 5% figure counts only
exact hits (so if motif A hits at 1:100000 and B hits at 1:99999, this is not counted
as a match, even though A and B overlap over nearly their entire lengths).
- Overlapping motifs cannot be discarded a priori. There is some evidence
(completely independent of our study) suggesting that there may be two active forms of the RE1 binding site, and there may be other such factors, based on some sort of dimeric complex. We are currently checking whether any of our reported motifs can be classified this way.
- An individual motif's examination page (reached by clicking through from the front page of motifs) shows which other motifs it "overlaps"; there, I define overlap as "occurrence within some delta of distance", though I do not technically place restrictions on how big that delta can be. (However, everything is precomputed, so there are only a few pre-set values.) This means that "overlap" can mean physical overlap (when the overlap window is ~5 bp), or it can mean "co-occurrence" (when the overlap window is ~50 bp). We use this feature in determining whether a particular motif belongs to a "family". There are only a few families, some of which are noise, that is, microsatellite or similarly uninformative DNA.
- Nearly every conserved gene has at least one motif occurrence nearby.
- The maximum number of motif occurrences within a single gene is about 3500. (This is
obviously an area with high conservation.) Here is a histogram:

- The Anderson paper from PNAS (1996) has 9 findable sites that have known
(validated) binding and/or silencing activity, and 9 more sites with unknown activity.
Motif #3 recovers all but 1 of these sites. Several other motifs appear to be related: 211, 371, 243. Most often, a number of hits overlap the site, but PWM #3 seems to do a
really good job of locating it to the nucleotide.
- Of the 16 Anderson sites, 14 have conserved hits in the three species.
- Of the Buckley and/or Anderson hits, how many are conserved in HMR? (about half)
- What are the rules governing the viability of an RE1, based on the known sites? The Buckley lab uses the rule "an RE1 is a consensus hit with 0 mutations", but with a very broad consensus. Can we do better? It appears not: the RE1 string is pretty good. But we're pretty close.
- We assume that TFBSs are conserved, with few mutations across sequences. Is this true for the validated sites? Yes.
- Given these rules, what are your 10 best predictions for NRSE-regulated genes? (TBD)
- How do TRANSFAC motifs compare to what we find? We consider POU domains and Zn finger domains specifically,
and only consider factors that have binding sites or binding matrices (or both) in human, mouse, or rat. There is often overlap between a TRANSFAC motif (e.g., YY1) and one of our sites (e.g., 169), but determining the exact rules for a "match" or a "nonmatch" is somewhat puzzling.
- What is our NRSE prediction? Which motifs? (See this for a more complete list.)
- There should be some hits for NRSE on Chr 22 (?). Where are they, and what do we find? (TBD)
- Of our predicted motifs, can we cluster them based on hit correlation? (This is in progress.)
- What is the distribution of gene lengths (not including flanking regions) in the genome?
Conserved genes:

All genes:

Comparing the same gene's length in different species:
- How does the number of hits relate to the length of a gene?
Conserved hits:

All hits:
- How big are the flanking regions? We use a cap of 500 kb, so between 0 kb and 500 kb. Here is a distribution:
Conserved genes:

All genes:

- How variable is the number of hits for a motif? Here is a histogram, though it is slightly off because it assumes that all genes are in the + orientation:
All hits:

Conserved hits only:

- What is the distribution of z-scores for random vs nonrandom motifs?
"Our motifs":

"Random motifs":

These distributions are neither uniform nor normal, so a t-test is not applicable. Instead, a Mann-Whitney test can determine
if "our" motifs are significantly better than "random" motifs.
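
As a rough illustration of that comparison, here is a minimal sketch of a one-sided Mann-Whitney U test using only the Python standard library. It is not the code used for these results: the function name `mann_whitney_u` and the example z-score lists are hypothetical, and the p-value comes from the normal approximation (with a tie correction and no continuity correction), which is reasonable only for samples that are not very small.

```python
import math


def mann_whitney_u(xs, ys):
    """Return (U, p) where U is the Mann-Whitney statistic for xs and
    p is the one-sided p-value for "xs tends to be larger than ys".

    Assumption: normal approximation with tie correction, no continuity
    correction. Both samples must be non-empty.
    """
    n1, n2 = len(xs), len(ys)
    n = n1 + n2
    # Pool both samples, remembering which sample each value came from.
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values occupy ranks i+1 .. j and share their average rank.
        avg_rank = (i + 1 + j) / 2.0
        t = j - i
        tie_term += t**3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        # All values identical: no evidence either way.
        return u, 1.0
    z = (u - mean_u) / math.sqrt(var_u)
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability P(Z >= z)
    return u, p


if __name__ == "__main__":
    # Hypothetical z-scores, made up purely to show the call.
    our_scores = [4.1, 3.8, 5.2, 2.9, 6.0, 4.4]
    random_scores = [1.2, 0.8, 2.1, 1.9, 0.5, 2.5]
    u, p = mann_whitney_u(our_scores, random_scores)
    print(f"U = {u}, one-sided p = {p:.4g}")
```

A small p-value here would indicate that z-scores for "our" motifs tend to rank above those of the "random" motifs. Because the test works on ranks, it does not depend on the shape of either distribution.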