Gene Confidence
Confidence Summary
Overall confidence
Determined by combining BLAST agreement, reference/EST agreement, sequence quality and heuristic confidence.
- high: Has evidence and is in clear accord with it: has strong agreement (full or partial) with reference/EST and/or strong agreement (length within +/- 25%) with BLAST evidence, no sequence problems, no model problems.
- medium: Has evidence but is not in clear accord with it: has reference/EST and/or BLAST evidence, but either clashes with reference/EST evidence or has a length that is within +/- 50%, but not within +/- 25%, of BLAST evidence. Genes in which we would otherwise have high confidence, but which have sequence and/or model problems, are also in this category.
- low: Conflicts with evidence and/or has serious structural problems: has major model problems, has length that is not within +/- 50% of the BLAST homology evidence, and/or conflicts with both reference/EST evidence and BLAST evidence. Genes in which we would otherwise have medium confidence, but have sequence and/or model problems, are also in this category.
- unknown: Has no sequence problems, no model problems, no BLAST evidence, and no reference/EST evidence.
| | high | medium | low | unknown |
|---|---|---|---|---|
| N. crassa | 6030 | 1531 | 559 | 1706 |
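The category definitions above can be read as a small decision procedure. The sketch below is one possible reading, not the pipeline's actual logic: the function name `overall_confidence` is hypothetical, the input strings are the category labels used in the per-criterion tables on this page, and several cases the text leaves open are filled with labeled assumptions (split/merge treated like a clash, "major model problems" equated with "severe problems", and genes with no evidence but some problem graded low).

```python
def overall_confidence(blast, ref_est, heuristics, sequence):
    """Combine the four per-criterion categories into one overall grade.

    Hypothetical sketch: arguments are category labels such as "<= 10%",
    "partial", "no problems". Tie-breaking rules are assumptions.
    """
    has_problems = heuristics != "no problems" or sequence != "no problems"
    has_evidence = blast != "no evidence" or ref_est != "no evidence"

    # Grade on evidence and major model problems first.
    if not has_evidence:
        base = "unknown"
    elif blast == "> 50%" or heuristics == "severe problems":
        base = "low"
    elif ref_est in ("clash", "split", "merge") or blast == "26-50%":
        # Assumption: split and merge are handled like a clash.
        base = "medium"
    else:
        base = "high"

    if not has_problems:
        return base

    # Sequence or model problems push a gene down one level.
    # Assumption: "unknown" with any problem is graded "low", because the
    # definition of "unknown" requires that there be no problems.
    downgrade = {"high": "medium", "medium": "low", "low": "low", "unknown": "low"}
    return downgrade[base]
```

For example, under these assumptions a gene with a BLAST length difference of `"11-25%"`, `"partial"` reference/EST agreement, and a minor model problem would be graded medium rather than high.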
Confidence Grading Process
Predictions are graded according to four criteria:
- BLAST agreement
- Reference or EST agreement (when available)
- Sequence quality
- Gene model confidence
These four criteria are summarized in an overall confidence rating for
each gene prediction. Overall confidence is assigned in three steps,
according to the flowchart below.
BLAST length difference
Calculated by comparing the length of the prediction to the average length of the one to three top-scoring overlapping BLAST hits. Each hit must have greater than 30% average identity and 30% query coverage, and the hits must come from distinct taxIds.
- <= 10%: CDS length is within +/- 10% of the average length of the top 3 overlapping BLAST hits.
- 11-25%: CDS length is within +/- 25% of the average length of the top 3 overlapping BLAST hits and is not within +/- 10% of that length.
- 26-50%: CDS length is within +/- 50% of the average length of the top 3 overlapping BLAST hits and is not within +/- 25% of that length.
- > 50%: CDS length is not within +/- 50% of the average length of the top 3 overlapping BLAST hits.
- no evidence: Has no overlapping BLAST evidence of sufficient identity and coverage.
| | <= 10% | 11-25% | 26-50% | > 50% | no evidence |
|---|---|---|---|---|---|
| N. crassa | 4766 | 1503 | 705 | 355 | 2497 |
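The binning described above can be sketched as follows. This is an illustrative sketch only: the function name `blast_length_category` and the hit fields (`length`, `identity`, `coverage`, `taxid`, `score`) are hypothetical. It assumes that the prediction and hit lengths are in the same units, that both thresholds are strict (greater than 30), and that "distinct taxIds" means keeping the best-scoring hit per taxId.

```python
from statistics import mean


def blast_length_category(cds_length, hits):
    """Bin a prediction by its length difference from the top BLAST hits.

    hits: list of dicts with hypothetical keys 'length', 'identity',
    'coverage' (both in percent), 'taxid' and 'score'.
    """
    # Keep only hits with sufficient identity and query coverage.
    qualifying = [h for h in hits if h["identity"] > 30 and h["coverage"] > 30]

    # Keep the best-scoring hit for each taxId (assumed interpretation).
    best_per_taxid = {}
    for hit in sorted(qualifying, key=lambda h: h["score"], reverse=True):
        best_per_taxid.setdefault(hit["taxid"], hit)

    # dicts preserve insertion order, so these are the top hits by score.
    top_hits = list(best_per_taxid.values())[:3]
    if not top_hits:
        return "no evidence"

    average = mean(h["length"] for h in top_hits)
    difference = abs(cds_length - average) / average * 100

    if difference <= 10:
        return "<= 10%"
    if difference <= 25:
        return "11-25%"
    if difference <= 50:
        return "26-50%"
    return "> 50%"
```

With three qualifying hits of lengths 100, 110 and 90 (average 100), a prediction of length 120 differs by 20% and falls in the `11-25%` bin.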
Reference/EST agreement
Calculated by comparing splice site predictions to those of overlapping EST clusters or reference gene models.
- full: Has complete reference/EST coverage and no splice site conflicts.
- partial: Has incomplete reference/EST coverage and no splice site conflicts.
- clash: Has at least one splice site conflict with an overlapping reference/EST.
- split: Has incomplete reference/EST coverage indicating a possible split of a single gene.
- merge: Has incomplete reference/EST coverage indicating a possible merge of two distinct genes.
- no evidence: Has no overlapping reference/EST evidence.
| | full | partial | clash | split | merge | no evidence |
|---|---|---|---|---|---|---|
| N. crassa | 1927 | 2838 | 1404 | 0 | 0 | 3657 |
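A simplified version of the splice-site comparison can be sketched as below. It is an illustration under strong assumptions, not the method used to produce the table: the function `ref_est_agreement` is hypothetical, the evidence is reduced to a single span with a set of intron coordinates, a "splice site conflict" is taken to mean that the predicted and evidence introns differ inside the region both cover, and split/merge detection is not attempted.

```python
def ref_est_agreement(pred_span, pred_introns, ev_span=None, ev_introns=()):
    """Classify agreement between a prediction and one piece of evidence.

    Spans and introns are (start, end) tuples on the same coordinate
    system. Hypothetical sketch; split and merge cases are not handled.
    """
    # No evidence at all, or evidence that does not overlap the prediction.
    if ev_span is None or ev_span[1] < pred_span[0] or ev_span[0] > pred_span[1]:
        return "no evidence"

    # Region covered by both the prediction and the evidence.
    lo = max(pred_span[0], ev_span[0])
    hi = min(pred_span[1], ev_span[1])

    def inside(intron):
        return lo <= intron[0] and intron[1] <= hi

    # Assumed definition of a conflict: the introns lying fully inside the
    # shared region are not identical in the two models.
    shared_pred = {i for i in pred_introns if inside(i)}
    shared_ev = {i for i in ev_introns if inside(i)}
    if shared_pred != shared_ev:
        return "clash"

    # Full coverage means the evidence spans the whole prediction.
    covered = ev_span[0] <= pred_span[0] and pred_span[1] <= ev_span[1]
    return "full" if covered else "partial"
```

Restricting the comparison to the shared region is what lets an EST covering only part of a gene count as `partial` rather than `clash`.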
Heuristics quality
Heuristic confidence in the gene model itself, based on intron/exon lengths, splice sites, and other annotation attributes.
- no problems: Model is longer than 90 bases, all splice sites are canonical, there are no in-frame stops or frameshifts, all introns are between 20 and 1000 bases (inclusive), and all exons are longer than 6 bases.
- minor problems: Model is 90bp or shorter, or has an intron < 20bp or > 1000bp, or has an exon <= 6bp.
- severe problems: Model has a frameshift or in-frame stop, is non-canonically spliced, and/or is 30bp or shorter.
| | no problems | minor problems | severe problems |
|---|---|---|---|
| N. crassa | 9354 | 469 | 3 |
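The thresholds above translate directly into a short check. The following is a minimal sketch: the `GeneModel` class and its field names are hypothetical, and it assumes that a severe problem takes precedence when a model has both severe and minor problems.

```python
from dataclasses import dataclass, field


@dataclass
class GeneModel:
    """Hypothetical container for the attributes the heuristics inspect."""
    length: int                       # model length in bases
    intron_lengths: list = field(default_factory=list)
    exon_lengths: list = field(default_factory=list)
    canonical_splicing: bool = True   # all splice sites canonical
    has_frameshift: bool = False
    has_inframe_stop: bool = False


def heuristic_quality(model):
    """Grade a gene model using the thresholds listed above."""
    # Severe problems are checked first (assumed precedence).
    if (model.has_frameshift or model.has_inframe_stop
            or not model.canonical_splicing or model.length <= 30):
        return "severe problems"

    short_model = model.length <= 90
    bad_intron = any(i < 20 or i > 1000 for i in model.intron_lengths)
    tiny_exon = any(e <= 6 for e in model.exon_lengths)
    if short_model or bad_intron or tiny_exon:
        return "minor problems"

    return "no problems"
```

For instance, `heuristic_quality(GeneModel(length=1200, intron_lengths=[15], exon_lengths=[600, 600]))` returns `"minor problems"` because of the 15-base intron.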
Sequence quality
Calculated by examining underlying and neighboring sequence.
- no problems: No N's in exons, all basecalls higher than Q10, and more than 1Kb from any sequence gap.
- minor problems: Within 1Kb of a sequence gap.
- severe problems: Underlying sequence has N's within exons and/or basecalls of Q10 or lower.
| | no problems | minor problems | severe problems |
|---|---|---|---|
| N. crassa | 9822 | 0 | 4 |
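The sequence quality rules can be sketched in the same way. The function `sequence_quality` and its parameters are hypothetical. The sketch assumes that base qualities are Phred-scaled integers for the exonic bases, that the distance to the nearest gap is given in bases (or `None` when there is no gap), and that severe problems take precedence over minor ones.

```python
def sequence_quality(exon_sequences, base_qualities, gap_distance=None):
    """Grade the sequence underlying a gene model.

    exon_sequences: list of exon sequence strings.
    base_qualities: Phred quality values for the underlying bases.
    gap_distance: bases to the nearest sequence gap, or None if no gap.
    """
    has_n = any("N" in seq.upper() for seq in exon_sequences)
    low_quality = any(q <= 10 for q in base_qualities)
    if has_n or low_quality:
        return "severe problems"

    # "Within 1Kb" is read as a distance of 1000 bases or less.
    if gap_distance is not None and gap_distance <= 1000:
        return "minor problems"

    return "no problems"
```

A model whose exons contain no N's and only basecalls above Q10, but which lies 500 bases from a gap, would be graded `"minor problems"`.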

