Gene Confidence


Confidence Summary

Overall confidence

Determined by combining BLAST agreement, reference/EST agreement, sequence quality and heuristic confidence.

  1. high: Has evidence and is in clear accord with it: has strong agreement (full or partial) with reference/EST and/or strong agreement (length within +/- 25%) with BLAST evidence, no sequence problems, no model problems.
  2. medium: Has evidence but is not in clear accord with it: has reference/EST and/or BLAST evidence, but either clashes with reference/EST evidence or has length within +/- 50% of BLAST evidence. Genes in which we would otherwise have high confidence, but have sequence and/or model problems, are also in this category.
  3. low: Conflicts with evidence and/or has serious structural problems: has major model problems, has length that is not within +/- 50% of the BLAST homology evidence, and/or conflicts with both reference/EST evidence and BLAST evidence. Genes in which we would otherwise have medium confidence, but have sequence and/or model problems, are also in this category.
  4. unknown: Has no sequence problems, no model problems, no BLAST evidence, and no reference/EST evidence.

             high   medium   low   unknown
N. crassa    6030   1531     559   1706
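
The category definitions above can be read as a small decision procedure. The sketch below is one possible reading, not the pipeline's actual implementation: the function name and arguments are hypothetical, the category strings are taken from the tables on this page, and several choices are assumptions (treating split and merge as conflicts, lowering the grade by exactly one level for any sequence or model problem, and grading evidence-free genes with problems as low).

```python
def overall_confidence(blast_bin, est_agreement, sequence_quality, heuristic_quality):
    """Combine the four criteria into high / medium / low / unknown.

    Hypothetical sketch; inputs are the category labels used on this page.
    """
    has_problems = (sequence_quality != "no problems"
                    or heuristic_quality != "no problems")
    no_blast = blast_bin == "no evidence"
    no_est = est_agreement == "no evidence"

    if no_blast and no_est:
        # Assumption: the source does not say how to grade evidence-free
        # genes that have problems; this sketch puts them in "low".
        return "low" if has_problems else "unknown"

    # Grade agreement with the evidence alone.
    est_ok = est_agreement in ("full", "partial")
    # Assumption: split and merge count as conflicts, like clash.
    est_conflict = est_agreement in ("clash", "split", "merge")
    blast_strong = blast_bin in ("<= 10%", "11-25%")

    if blast_bin == "> 50%" or heuristic_quality == "severe problems":
        grade = "low"
    elif (est_ok or blast_strong) and not est_conflict:
        grade = "high"
    else:
        grade = "medium"

    # Sequence or model problems lower the grade by one level.
    if has_problems and grade != "low":
        grade = "medium" if grade == "high" else "low"
    return grade
```

For example, a gene with full reference/EST agreement and a BLAST length difference of 10% or less grades as high under this reading, and drops to medium if it also has a minor model problem.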

Confidence Grading Process

Predictions are graded according to four criteria:

  1. BLAST agreement
  2. Reference or EST agreement (when available)
  3. Sequence quality
  4. Gene model confidence

These four criteria are summarized in an overall confidence rating for
each gene prediction. Overall confidence is assigned in three steps,
according to the flowchart below.

BLAST length difference

Calculated by comparing the length of the prediction to the average length of the top-scoring overlapping BLAST hits (up to three). Each hit must have greater than 30% average identity and 30% query coverage, and the hits must come from distinct taxIds.

  1. <= 10%: CDS length is within +/- 10% of the average length of the top 3 overlapping BLAST hits.
  2. 11-25%: CDS length is within +/- 25% of the average length of the top 3 overlapping BLAST hits and is not within +/- 10% of that length.
  3. 26-50%: CDS length is within +/- 50% of the average length of the top 3 overlapping BLAST hits and is not within +/- 25% of that length.
  4. > 50%: CDS length is not within +/- 50% of the average length of the top 3 overlapping BLAST hits.
  5. no evidence: Has no overlapping BLAST evidence of sufficient identity and coverage.

             <= 10%   11-25%   26-50%   > 50%   no evidence
N. crassa    4766     1503     705      355     2497
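
The binning above amounts to a relative length difference against the mean hit length. A minimal sketch, assuming the hits have already been filtered for identity, coverage, and distinct taxIds, are sorted best first, and are measured in the same units as the CDS length (the function and parameter names are hypothetical):

```python
from statistics import mean


def blast_length_bin(cds_length, hit_lengths):
    """Return the BLAST length-difference bin for one prediction.

    hit_lengths: lengths of qualifying BLAST hits, best-scoring first
    (assumed to be pre-filtered; only the first three are used).
    """
    if not hit_lengths:
        return "no evidence"
    reference = mean(hit_lengths[:3])
    # Relative difference between the prediction and the average hit length.
    diff = abs(cds_length - reference) / reference
    if diff <= 0.10:
        return "<= 10%"
    if diff <= 0.25:
        return "11-25%"
    if diff <= 0.50:
        return "26-50%"
    return "> 50%"
```

A prediction of length 1200 with a single qualifying hit of length 1000 differs by 20% and falls in the 11-25% bin.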

Reference/EST agreement

Calculated by comparing splice site predictions to those of overlapping EST clusters or reference gene models.

  1. full: Has complete reference/EST coverage and no splice site conflicts.
  2. partial: Has incomplete reference/EST coverage and no splice site conflicts.
  3. clash: Has at least one splice site conflict with an overlapping reference/EST.
  4. split: Has incomplete reference/EST coverage indicating a possible split of a single gene.
  5. merge: Has incomplete reference/EST coverage indicating a possible merge of two distinct genes.
  6. no evidence: Has no overlapping reference/EST evidence.

             full   partial   clash   split   merge   no evidence
N. crassa    1927   2838      1404    0       0       3657

Heuristics quality

Heuristic confidence in the gene model itself, based on intron/exon lengths, splice sites, and other annotation attributes.

  1. no problems: Model is longer than 90 bases, all splice sites are canonical, there are no in-frame stops or frameshifts, all introns are between 20 and 1000 bases (inclusive), and all exons are longer than 6 bases.
  2. minor problems: Model is 90bp or shorter, or has an intron < 20bp or > 1000bp, or has an exon <= 6bp.
  3. severe problems: Model has a frameshift or in-frame stop, is non-canonically spliced, and/or is 30bp or shorter in length.

             no problems   minor problems   severe problems
N. crassa    9354          469              3
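
The thresholds above translate directly into a short classifier. This is a sketch under stated assumptions: the function and parameter names are hypothetical, lengths are in bases, and severe conditions are checked first so that a model meeting both severe and minor criteria is reported as severe.

```python
def heuristic_quality(model_length, intron_lengths, exon_lengths,
                      has_frameshift=False, has_inframe_stop=False,
                      all_splice_sites_canonical=True):
    """Classify a gene model as no / minor / severe problems."""
    # Severe: frameshift, in-frame stop, non-canonical splicing,
    # or a model of 30 bases or fewer.
    if (has_frameshift or has_inframe_stop
            or not all_splice_sites_canonical or model_length <= 30):
        return "severe problems"
    # Minor: short model, intron outside 20..1000 bases, or very short exon.
    if (model_length <= 90
            or any(i < 20 or i > 1000 for i in intron_lengths)
            or any(e <= 6 for e in exon_lengths)):
        return "minor problems"
    return "no problems"
```

For instance, a 1200-base model with a single 15-base intron is graded as having minor problems, while the same model with an in-frame stop is graded as severe.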

Sequence quality

Calculated by examining underlying and neighboring sequence.

  1. no problems: No N's in exons, all basecalls higher than Q10, and more than 1 kb from any gap.
  2. minor problems: Within 1 kb of a sequence gap.
  3. severe problems: Underlying sequence has N's within exons and/or basecalls of Q10 or lower.

             no problems   minor problems   severe problems
N. crassa    9822          0                4
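
The sequence quality categories can be sketched the same way. The function and parameter names below are hypothetical, and the sketch assumes the exon sequence, its per-base quality scores, and the distance in bases to the nearest gap (None when the sequence has no gaps) are already available:

```python
def sequence_quality(exon_sequence, base_qualities, distance_to_nearest_gap=None):
    """Classify the underlying sequence as no / minor / severe problems."""
    # Severe: any N within the exons, or any basecall of Q10 or lower.
    if "N" in exon_sequence.upper() or any(q <= 10 for q in base_qualities):
        return "severe problems"
    # Minor: the gene lies within 1 kb (1000 bases) of a sequence gap.
    if distance_to_nearest_gap is not None and distance_to_nearest_gap <= 1000:
        return "minor problems"
    return "no problems"
```

Severe conditions are checked before the gap distance, so a gene that both contains an N and sits near a gap is reported as severe.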