Gene Finding Methods
Outline
- Overview
- Evidence collection
- Building EST-based gene models
- Computational prediction of tRNAs and RNAs
- Training Gene Prediction Programs
- Computational Gene Models
- Transfer Annotation
- Selection of Consensus Gene Models
- Predicting Alternatively Spliced Transcript Models
- Predicting UTR
- Assigning Gene Product Names
- Reporting Annotation Accuracy
- Gene Numbering
- Overview of Reference genes
- Overview of Query Genes
- Feature Mapping
- Feature Comparisons
- Splice Analysis
- Possible Problems
Overview
This document provides a general description of our automated genome annotation for eukaryotic genomes.
Gene annotation at Broad is a multi-step process.
Evidence collection
- Blast evidence:
- EST evidence:
- Pfam domains:
- EST alignments:
Blast homology search against the Genbank's NR database produces a set of raw blast out put. Individual blast alignments are then clustered into single blast clusters by linking the blast alignments derived from the same blast hit. Several such overlapping blast clusters on the genomic axis represents what we call as blast loci on the genome assembly. Currently, all blast data with e-values greater than 1e-10 are used considered as usable blast evidence.
For most of our fungal and other eukaryotic genomes, we use species-specific ESTs sequenced here are broad as well as publicly available EST sequences from genbank to produce a set of high confidence gene models.
We run Hmmer searches using pFAM library to find pFAM domains on six-frame translations of the genomic sequence.
First, we align ESTs to the genome using BLAT and then collapse them into distinct clusters transcripts using a Broad-developed program called CallReferenceGenes (described below). EST alignments with 90% identity over 50% of the EST length with canonical splice junctions are considered valid EST alignments suitable for building gene models.
Building EST-based gene models
EST clusters produced in the previously described section are used as inputs for building high confidence gene models for the purposes of training and evaluating gene finding tools.
FindEstOrf is an internal tool used for constructing CDS from EST transcripts. This tool assigns start and stop codons to create a longest ORFs. We then compare these constructed ORFs to overlapping BLAST evidence and pick the ones with comparable reading frame and protein length to the best blast hits from closely related species. FL_EST gene models created by this process are fully covered by EST alignments and are within + 20% of the length of homologous proteins.
In addition, we also build EST-based gene models by a manual process. By combining blast, EST and ab initio predictions produced using a parameter file from a related genome, manual annotators carefully build gene models that are otherwise missed by our highly conservative automated findORF script.
Computational prediction of tRNAs and RNAs
tRNA scan is used for detecting tRNAs on the genome assembly. We use RNA finding programs such as RNAmmer and RFamSearch to detect the common RNA features.
Training Gene Prediction Programs
Commonly used gene finding programs such as Augustus, GENEID, GeneMark, FGENESH and SNAP are trained in house or by the developers of these programs using the high confidence EST gene sets.
Computational Gene Models
Gene structures are predicted using a combination of gene models from computational gene prediction programs such as FGENESH, GENEID, GeneMark and EST-based automated and manual gene models. FGENESH is a commercial gene prediction program sold by Softberry, while GENEID, by Enrique Blanco and Roderic Guigo, is available under the GPL. GeneMark is another gene finding tool developed by the Mark Borodovsky's group (Borodovsky & McIninch, Comp. Chem., 1993). Both GENEID and FGENESH are usually trained using a set of high confidence EST-based gene models generated by clustering blat-aligned species-specific ESTs. In addition, we also use other commonly used gene finding programs such as SNAP (Ian Korf) and Augustus (Mario Stanke) if and when genome-specific parameter files are available.
After training the gene prediction programs, we run each of them on the assembly and evaluate their performance by comparing the gene models with EST and blast evidence. Those that perform adequately are used in the automated gene calling pipeline. The modular architecture of our automated pipeline makes it easy to incorporate new gene prediction programs and customize the pipeline to suit the genomes annotated.
Transfer Annotation
We use well-curated annotations from other sources and reference genomes to improve our automated annotations. Broad's in-house synteny-based gene transfer process has two main steps:
First, we generate collinear block alignments between the two genomes by creating pair-wise alignments between the two genomes. Further, a global alignment is generated for the entire region the collinear block covers.
Second, an in-house gene mapping program then transfers genes from reference onto the target genome within the specific syntenic blocks. We use genewise to further refine a gene model at each locus.
Selection of Consensus Gene Models
Broad's automated gene calling process uses a rule-based selection process to evaluate the evidence and build consensus gene models.
Ab initio predictions, blast and EST alignments, reference gene models, and manual and automated EST-gene models are clustered into potential gene loci. We select the most likely non-conflicting gene models based on the best evidence available at each locus. Our method uses heuristics such as splice agreement with ESTs and relative overlap with the BLAST hits to choose the prediction most in accord with the evidence. It does not have an internal model of gene structure and thus runs on a wide variety of eukaryotic and prokaryotic organisms without training.
Our gene selection process offers several useful options to exclude calling genes at certain loci. For example, we can exclude genes in regions with tRNA, rRNA and known repeat elements using a conservative overlap criterion.
Gene models with problems are tagged appropriately with curation flags and notes in the gene report to indicate potential problems. Despite all the progress in the field of gene finding, accurate gene finding on draft genomes is still a challenge. We make an effort to track easily identifiable problematic gene models and tag them with appropriate curation flags to alert the users of the nature of the problems. These tags are also used by manual annotators to specifically target manual editing and fine-tuning of bad gene models.
Predicting Alternatively Spliced Transcript Models
We do not predict alternatively spliced transcript unless there is manual or full-length EST evidence in support of their existence. Transcript models that differ only with respect to un-spliced 3' and 5' ends with the reference gene models are not considered evidence for alternative splicing. We include only canonically spliced and uniquely aligned ESTs with alternate splice junctions as valid alternatively spliced transcripts.
Predicting UTR
When an EST alignment uniquely aligns with >= 95% identity and overlaps a gene prediction, and the region of overlap has absolute labeling agreement (e.g., every nucleotide in the region of overlap is exonic in both the prediction and the alignment, or is intronic in both), we consider the prediction and alignment to be compatible. UTR predictions are generated by walking out along these compatible EST alignments from the end(s) of each prediction. A chain of one or more overlapping, compatible EST alignments that begin at the 5' or 3' end forms each UTR extension. When an EST aligns to more than one location on the genome, or touches more than one gene prediction (on either strand) it is ignored.
Assigning Gene Product Names
We do not assign gene symbols. Instead, our gene naming protocol currently relies on high confidence blast homology and in some cases, community inputs to assign gene product names. We hope to improve the gene naming process in the future based on other functional annotation protocols and tools.
Reporting Annotation Accuracy
We run in-depth reports on each annotation we produce to get a measure of our annotation accuracy. A host of accuracy statistics is compiled for each prediction that touches an EST; following Guigo's Evaluation of gene structure prediction programs (Genomics, 1996 Jun 15; 34(3):353-67). We calculate specificity, sensitivity, correlation coefficient and simple matching coefficient on the levels of nucleotides, splice sites, introns and exons.
Gene Numbering
Every annotated gene is given a Locus Number of the form MGG_##### that should be considered the only guaranteed way to identify a gene uniquely. Each locus number is guaranteed to identify a unique gene even over different assemblies. Loci are simply identifiers and are not guaranteed to have any particular order or internal structure.
The mapping of gene accessions from the previous release to the current version is available here.
Overview of Reference genes
8358 genes (4531 on '+' strand, 3827 on '–')
9217 transcripts
(3374 spliced, 5843 unspliced)
14855 exons, 5638 introns
| len | %cov | %gc | %at | ||
|---|---|---|---|---|---|
| genic | 5328453 | 12.78 | 53.15 | 46.85 | |
| intergenic | 36367004 | 87.22 | 51.34 | 48.66 | |
| exonic | 4811617 | 11.54 | 53.77 | 46.23 | |
| intronic | 470769 | 1.13 | 47.08 | 52.92 | |
| coding | 4129984 | 9.91 | 54.78 | 45.22 | |
| 5′ UTR | 318331 | 0.76 | 50.25 | 49.75 | |
| 3′ UTR | 480571 | 1.15 | 46.49 | 53.51 | |
| alt. spliced | 46067 | 0.11 | 50.90 | 49.10 | |
| genomic | 41617957 | 99.81 | 51.57 | 48.43 | |
min | median | mean | n50 | max | |
| total length (incl. UTR + introns) | 22 | 565 | 650 | 734 | 3486 |
| coding length | 22 | 453 | 488 | 567 | 2421 |
| exons per transcript | 1 | 1 | 1.61 | 2 | 13 |
| exons per spliced transcript | 2 | 2 | 2.67 | 5 | 13 |
| nt per exon | 4 | 312 | 361 | 502 | 2885 |
| nt per intron | 23 | 87 | 112 | 108 | 978 |
| intergenic nt | 1 | 2118 | 4353 | 9506 | 183972 |
| 5′ UTR nt | 1 | 17 | 67 | 160 | 1212 |
| 3′ UTR nt | 1 | 50 | 102 | 237 | 2216 |
Overview of Query Genes
11043 genes (5580 on '+' strand, 5463 on '–')
11054 transcripts
(8800 spliced, 2254 unspliced)
30705 exons, 19651 introns
| len | %cov | %gc | %at | ||
|---|---|---|---|---|---|
| genic | 19361282 | 46.43 | 55.98 | 44.02 | |
| intergenic | 22334175 | 53.57 | 47.72 | 52.28 | |
| exonic | 16968750 | 40.70 | 57.23 | 42.77 | |
| intronic | 2390421 | 5.73 | 47.12 | 52.88 | |
| coding | 15977408 | 38.32 | 57.92 | 42.08 | |
| 5′ UTR | 399584 | 0.96 | 47.89 | 52.11 | |
| 3′ UTR | 602820 | 1.45 | 45.02 | 54.98 | |
| alt. spliced | 2111 | 0.01 | 51.83 | 48.17 | |
| genomic | 41617957 | 99.81 | 51.57 | 48.43 | |
min | median | mean | n50 | max | |
| total length (incl. UTR + introns) | 132 | 1503 | 1754 | 2073 | 19707 |
| coding length | 99 | 1194 | 1446 | 1791 | 19500 |
| exons per transcript | 1 | 2 | 2.78 | 3 | 18 |
| exons per spliced transcript | 2 | 3 | 3.23 | 3 | 18 |
| nt per exon | 3 | 316 | 553 | 1056 | 12935 |
| nt per intron | 4 | 88 | 122 | 122 | 1029 |
| intergenic nt | 1 | 1023 | 2028 | 3838 | 167583 |
| 5′ UTR nt | 2 | 136 | 169 | 225 | 1077 |
| 3′ UTR nt | 2 | 196 | 240 | 323 | 1465 |
Feature Mapping
Searched for mappings from 11043 query features to 8358 reference features.
Reference genes farther than 200 nt from any query gene: 1452
Containing 1507 transcripts:
1330 unspliced
(713 with CDS),
0 with non-canonical splices
(0 with CDS),
177 with canonical splices
(71 with CDS).
These may indicate missed genes.
379 query genes (3.4%) hit 555 reference genes (6.6%) on the opposing strand. Hits to opposing strands are considered misses in the following comparisions.
Strand mismatches such as these may be orientation problems in the reference set, or incorrect predictions in the query set. Unspliced reference transcripts are difficult to orient correctly. 522 reference gene(s) were unspliced. These may easily have been placed on the wrong strand; queries hitting these genes are likely to be oriented correctly. 33 reference gene(s) had at least one canonically-spliced transcript; everything else being equal, their orientation is more likely to be correct than the queries hitting them.
The table below shows how query genes (rows, roman type) map to reference genes (columns, italic type). We cluster genes into overlapping groups on the same strand and record eight types of relationship. To the left of each arrow is the number of query genes in each type of cluster; the number of reference genes is to the right.
| none | reference one | many | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| none | - | 0 | ↔ | 2546 | 0 | ↔ | 0 | |||
| query | one | 6349 | ↔ | 0 | 3659 | ↔ | 3659 | 995 | ↔ | 2125 |
| many | 0 | ↔ | 0 | 30 | ↔ | 15 | 10 | ↔ | 13 | |
| counts per cluster type | ||||||||||
| none | reference one | many | ||||||||
| none | - | 0 | ↔ | 30.5 | 0 | ↔ | 0 | |||
| query | one | 57.5 | ↔ | 0 | 33.1 | ↔ | 43.8 | 9.0 | ↔ | 25.4 |
| many | 0 | ↔ | 0 | 0.3 | ↔ | 0.2 | 0.1 | ↔ | 0.2 | |
| percentages per cluster type | ||||||||||
one↔one clear, unambiguous mappings from one query gene to one reference gene. none↔one reference genes that do not map to any query gene (misses). none↔many reference genes that overlap other reference genes but do not map to query genes. None↔X mappings may represent underpredictions. one↔none query genes that do not map to any reference gene. many↔none query genes that overlap other query genes but do not map to reference genes. X↔none mappings may represent overpredictions or deficiencies in the reference gene set. one↔many single query genes overlapped by multiple reference genes (merges). These may occur if the reference set contains partial genes. many↔one single reference genes spanning multiple query genes (splits). These are often annotation errors. many↔many complex clusters where multiple query genes map to multiple reference genes.
Feature Comparisons
Predicted genes that hit reference genes on the wrong strand, or hit reference genes containing only non–canonically-spliced transcripts, are ignored in the following calculations. When scoring genes with multiple transcripts, we select the single best transcript↔transcript comparison. For clusters in the one↔many and many↔many categories above, each query is compared to as many references as possible. The best non-overlapping comparisons are used to score these loci.
| #comp | TP | TN | FP | FN | ||
|---|---|---|---|---|---|---|
| nucleotides | 4054284 | 3554853 | 457456 | 4293 | 37682 | |
| splice sites | 14311 | 13335 | - | 551 | 425 | |
sn | sp | smc | acp | ac | cc | |
| nucleotides | 0.9895 | 0.9988 | 0.9896 | 0.9757 | 0.9515 | 0.9510 |
| splice sites | 0.9691 | 0.9603 | 0.9318 | - | - | - |
| #comp | TP | TN | FP | FN | ||
|---|---|---|---|---|---|---|
| introns | 4583 | 4154 | - | 113 | 13 | |
| exons | 8725 | 7906 | - | 4 | 41 | |
sn | sp | sn_sp | wrong | miss | ||
| introns | 0.9611 | 0.9064 | 0.9338 | 0.0030 | 0.0247 | |
| exons | 0.8466 | 0.9061 | 0.8764 | 0.0044 | 0.0005 |
#comp the number of subfeatures compared.
TP true positives: nucleotides predicted as exonic in both
query and reference; or, splice junctions with exact agreement, in
position and type (donor:donor, acceptor:acceptor), between query and
reference; or, introns or exons with both splice sites in exact
agreement with the reference.
TN true negatives: nucleotides predicted as intronic in
both query and reference; not defined for splice sites, introns or
exons.
FP false positives (overpredictions): nucleotides marked
as exonic in the query but intronic in the reference; or, splice
junctions identified only in the query*; or query introns/exons that
do not overlap a reference intron/exon.
FN false negatives (underpredictions): nucleotides marked as
intronic in the query but exonic in the reference; or, splice
junctions identified only in the reference*; or reference
introns/exons that do not overlap a query intron/exon.
sn sensitivity,
The table below briefly summarizes the relative coverage of
overlapping query and reference genes. The first column shows the
amount of sequence (in nt and percent) in each gene set that is
overlapped by sequence in the other gene set. The next three columns
show the overhanging sequence from transcripts that overlap but are
offset. Overhangs are categorized as 5′ (one transcript begins
upstream of the other), 3′ (one transcript extends downstream
past the other), and mixed (indeterminate). Finally, the last column,
"complete miss", counts the number of nt in queries that did not
hit a reference at all, and vice versa.
A high degree of overlap can indicate a good match between the query
and reference sets. A high "missed" score may simply indicate a small
reference set.
| overlap | overhang, 5′ | overhang, 3′ | overhang, mixed | complete miss | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| query | 4054284 nt | / | 20.9% | 2364563 nt | / | 12.2% | 1214752 nt | / | 6.3% | 1066835 nt | / | 5.5% | 10661781 nt | / | 55.1% |
| reference | 4054284 nt | / | 76.3% | 30403 nt | / | 0.6% | 33953 nt | / | 0.6% | 26294 nt | / | 0.5% | 1166359 nt | / | 22.0% |
Splice Analysis
13335 splice agreements, 976 disagreements.
6624 ignored:
216 due to EST misalignment,
6408 due to partial initial/terminal exon coverage.
perfect exon:exon/intron:intron matches: 4709/4154
0 query transcripts contained noncanonical splices.
| transcripts with no splice problems | 4307 | 91.8% |
| ... with complete reference coverage | 901 | 19.2% |
| explainable by alternate splicing | 162 | 3.5% |
| ... with spliced reference | 113 | 2.4% |
| clashes | 225 | 4.8% |
| ... with spliced reference | 102 | 2.2% |
Transcripts that have a splice site disagreement with an overlapping reference gene are placed into two categories, depending on the severity of the clash. If all splice disagreements could be explained by well-known types of alternate splicing, we call the transcript a "possible alternate splice." If the two transcripts cannot be reconciled in this way, we label the query a "clash." In partitioning splice disagreements into two categories, we are not asserting that 3.5% of this genome shows alternate splicing. We do this as a form of triage: genes in the "clash" category are manually inspected before release.
| in ref. | in query | ||
|---|---|---|---|
| cassette exons | ![]() | 4 | 0 |
| retained introns | ![]() | 88 | 12 |
| early 3′ splices | ![]() | 15 | 41 |
| late 5′ splices | ![]() | 12 | 20 |
cassette exon an exon that falls completely with an intron of a variant transcript. Such exons may represent alternative splice forms but are more likely instances of exonic over- and under-prediction. retained intron an intron that falls within the exon of a variant transcript. These introns may indicate alternative splicing but usually are over- and under-predicted introns. early 3′ splices two introns agree on their 5′ splice site but differ on the 3′ side, relative to the affected intron. In other words, differing 3′ splice sites lie on the leading edge of an exon. late 5′ splices two introns agree on their 3′ splice site but differ on the 5′ side, again relative to the affected intron. Most terminology is from Matlin AJ, et. al. Understanding alternative splicing: towards a cellular code. Nat Rev Mol Cell Biol. 2005 May;6(5):386-98.
Possible Problems
| multiple sources | 5538 | |||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no evidence feature | 876 | |||||||||||
| MG6_FGENESH_3 | 1067 | |||||||||||
| MG6_FL_EST_GENES | 93 | |||||||||||
| MG6_GENEID_2 | 589 | |||||||||||
| MG6_GENEMARK | 1389 | |||||||||||
| MG6_GENEWISE_1 | 96 | |||||||||||
| MG6_MANUAL_1 | 953 | |||||||||||
| MG6_SNAP_7 | 453 | |||||||||||
| short protein, < 50 aa | 15+0 | 4 | 0 | 1 | 0 | 5 | 4 | 0 | 1 | 0 | ← not tallied in problems | |
| shorter protein, < 30 aa | 0+0 | - | - | - | - | - | - | - | - | - | ||
| very short protein, < 10 aa | 0+0 | - | - | - | - | - | - | - | - | - | ||
| initial exon ≤ 6 nt | 18+1 | 4 | 4 | 4 | 0 | 2 | 3 | 0 | 1 | 1 | ||
| internal exon ≤ 6 nt | 5+0 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | ||
| terminal exon ≤ 6 nt | 12+0 | 0 | 6 | 2 | 0 | 2 | 0 | 0 | 2 | 0 | ||
| ≥ 15 exons | 2+0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ||
| intron ≥ 1000 nt | 1+0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ||
| intron ≤ 20 nt | 16+1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 14 | ||
| first codon not Met | 1+0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ← not tallied in problems | |
| first codon not xTG | 0+0 | - | - | - | - | - | - | - | - | - | ||
| first codon not known START | 0+0 | - | - | - | - | - | - | - | - | - | ||
| last codon not known STOP | 1+1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | ||
| contains in-frame STOP | 0+0 | - | - | - | - | - | - | - | - | - | ||
| coding length not modulo 3 | 0+1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ||
| non-canonical splicing | 0+0 | - | - | - | - | - | - | - | - | - | ||
| has ≥1 tagged BLAST hit | 8551+118 | 4455 | 860 | 756 | 52 | 505 | 740 | 96 | 764 | 441 | ← not tallied in problems | |
| ≤1/3 as long as BLAST tags | 72+5 | 20 | 1 | 8 | 7 | 7 | 19 | 1 | 11 | 3 | ||
| ≥3× longer than BLAST tags | 16+1 | 0 | 3 | 0 | 0 | 1 | 1 | 0 | 12 | 0 | ||
| contains ≥1 N in exon | 57+74 | 7 | 27 | 54 | 6 | 4 | 0 | 0 | 26 | 7 | ||
| low-quality exonic sequence | 0+142 | 20 | 23 | 42 | 4 | 5 | 18 | 0 | 25 | 5 | ← not tallied in problems | |
| touches gap(s) | 7+8 | 2 | 0 | 5 | 0 | 1 | 3 | 0 | 4 | 0 | ||
| spans contigs | 7+8 | 2 | 0 | 5 | 0 | 1 | 3 | 0 | 4 | 0 | ||
| within 1 kb of contig edge | 61+19 | 28 | 2 | 19 | 0 | 4 | 10 | 2 | 13 | 2 | ← not tallied in problems | |
| any overlap (UTR or CDS) | 121+0 | 67 | 4 | 7 | 4 | 1 | 7 | 2 | 22 | 7 | ← in 60 clusters | |
| CDS overlap only | 7+0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 6 | 0 | ← in 3 clusters | |
| CDS overlap > 50 nt | 7+0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 6 | 0 | ← in 3 clusters | |
| CDS overlap > 100 nt | 7+0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 6 | 0 | ← in 3 clusters | |
| CDS overlap > 200 nt | 7+0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 6 | 0 | ← in 3 clusters | |
| has predicted UTR | 3509+44 | 2211 | 184 | 223 | 91 | 160 | 233 | 33 | 324 | 94 | ← not tallied in problems | |
| UTR ≥ CDS length | 346+3 | 245 | 5 | 17 | 25 | 13 | 9 | 2 | 29 | 4 | ← not tallied in problems | |
| UTR is spliced | 196+3 | 120 | 7 | 13 | 7 | 13 | 10 | 2 | 20 | 7 | ← not tallied in problems | |
| one or more problems | 319+82 | 100 | 47 | 78 | 15 | 17 | 33 | 3 | 77 | 31 | ||




