Frequently Asked Questions:

All pages were tested only with Netscape 7 and IE 6.

1. What is DNannotator?

2. What is it not?

3. What is DNannotator good at?

4. What is an annotated sequence, and what is an un-annotated sequence?

5. Why do we need DNannotator?

6. What is the basic principle of DNannotator?

7. What is gb-header, why use gb-header?

8. What is the basic requirement for using this tool?

9. How can I view my annotation graphically?

10. Why can't I get the annotation I am looking for?

11. What are the limitations of DNannotator?

12. Why is there a limit on data size?

13. Can I have oversized data analyzed in DNannotator?

14. What is coming?

15. Subscribe to mailing list.


DNannotator:

It's a tool for doing LARGE-SCALE annotation (> 100 annotations) on a SIZEABLE genomic region (> 1 Mb).

1. If you have sequence annotations in a draft sequence and would like the same annotations in a new sequence version, but don't want to go through the whole tedious and delicate manual annotation procedure again.

2. If you have hundreds of SNPs or primers to map/annotate onto genomic DNA sequences, no matter where the sequences came from.

Then you may need DNannotator.

        If you have only one annotation to make, in a small single-gene region, don't bother learning how DNannotator works.  I recommend that you use Vector NTI (not cheap, but easy to use), Sequin, or even a simple text editor to open your sequence and do the annotation manually.  BUT DNannotator will still work for you, and it should not introduce any "typo" into your annotation.

        Then why can't "manual" tools do "large-scale annotation on a sizeable region"?  1) You may have trouble opening the target sequence because of its size.  2) Manual annotation is time-consuming, error-prone, and extremely boring.

DNannotator is not:

It's not a replacement for tools such as Genotator, NIX, or the ORNL Genome Analysis Pipeline, which perform gene prediction and searches against static public databases.

It's not a replacement for public annotation resources such as NCBI Map Viewer, UCSC Genome Browser, or Sanger Center Ensembl, because DNannotator is not intended to search whole public databases or to do whole-genome annotation.

DNannotator does not provide a graphic viewer.  You have to take DNannotator's output and use a viewer such as Artemis or Genome Browser.

DNannotator is good at:

        Annotation using your own source data and your own customized gDNA sequence.

  1. It's good at working with pure Genbank format.  For a graphic view of the annotated results, Vector NTI is a good but costly choice, and Artemis is a good free alternative.  You need to decide which viewer to use in the "choice" setting for output data.  Using VNTI data in Artemis is not difficult either: DNannotator supplies a function to convert between VNTI-format Genbank data and standard Genbank format data.
  2. It's good at batch data processing for massive annotations or for many sequences.  If you have only one fragment to map to one gDNA sequence, you may use any text editor to FIND the sequence, or do a BLAST and count to the right position, and then write it down on paper, as many researchers do today.

Annotated sequence vs. Un-annotated sequence

Annotation means labels of features or notes for fragments/parts of a sequence, specifying what the fragments are, what they do, etc.  For example, the following annotation in Genbank format:

 exon 67408..67569
     /vntifkey="61"
     /label=gene1\exon2
states that bases 67408 to 67569 are a region of exon2 of gene1.
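
To make the location syntax concrete, here is a minimal Python sketch that reads a simple feature line like the one above and reports the span it covers. The function name parse_simple_feature is hypothetical, and the sketch assumes a plain "start..end" location only; compound locations such as complement(...) or join(...) are not handled.

```python
import re

def parse_simple_feature(line):
    """Parse a feature line such as ' exon 67408..67569' into (key, start, end).

    Hypothetical helper: only plain 'start..end' locations are recognized.
    Returns None for anything else (complement, join, qualifier lines, ...).
    """
    m = re.match(r"\s*(\S+)\s+(\d+)\.\.(\d+)\s*$", line)
    if m is None:
        return None
    return m.group(1), int(m.group(2)), int(m.group(3))

feature = parse_simple_feature(" exon 67408..67569")
key, start, end = feature
# Genbank coordinates are 1-based and inclusive, hence the +1.
length = end - start + 1
print(key, start, end, length)
```

Because both ends of the range are included, the exon in this example is 162 bases long.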

We are using a narrowly defined concept here.  Both "annotated sequence" and "un-annotated sequence" are related to a certain kind of annotations or features.

For example, when we talk about mapping SNP "rs123" to "sequence A": if there is already an annotation specifying where rs123 is located in sequence A (here we mean specifically a label in standard Genbank format or in a feature table), then sequence A is regarded as "annotated" for SNP rs123.  Conversely, if sequence B does not contain an annotation for "rs123", even though it may contain other annotations, such as exons of gene xxx, it is still regarded as "un-annotated" for SNP rs123.  Therefore, sequence B can be annotated by DNannotator for rs123.
 
 

Why or when do you need DNannotator?

There are three major purposes for sequence annotation: 1. Organize all sequence-related data, such as gene structures, expression regulation elements, etc., especially those that can be submitted to public repositories and exchanged with other researchers.  2. Manage all lab-related data, such as oligos, primers, and amplicons used in the lab.  These data are more project-specific, but important for efficient administration of a research project such as gene mutation screening or association linkage screening.  3. Preserve annotations when a sequence is updated.  DNannotator helps you do batch annotation with your own data (primers, SNPs, exons) on any target gDNA sequence you want to use.  The target gDNA can be downloaded from a public database or assembled in your own way.

You may wish to use DNannotator,

  1. when you have hundreds of primers, SNPs, exons, etc. scattered around in your computer and wish to have a common ground to organize them.  With the common ground,  you can see clearly all the positional relationships between the elements.

  2. when you have hundreds of elements annotated in an old version of a draft sequence, but a new version of the sequence is coming out.  You will want to move to the new version of the sequence but don't want to redo all the carefully performed manual annotations.

  3. potentially, you can adapt this method to do homologue mapping between sequences from different species by adjusting the threshold of the BLAST filter.

    For example, with one completely un-annotated Genbank format sequence as a starting point, you can use DNannotator to add "features" for any SNP to the sequence; later, you can use the output of that SNP mapping analysis as the un-annotated sequence for annotating "features" of new SNPs, primers, or gene exons.  This cycle can be repeated as many times as you wish.  New features/annotations accumulate easily on this one sequence-based platform.
    There is no need to worry about losing the variety of annotations that were painstakingly accumulated, because the "annotation migration" function takes care of this.  All annotations in an old version of a sequence that are not disrupted in the new version will be kept in the new version (a disrupted annotation means that the new version of the sequence has a completely different organization at the annotated fragment).  Therefore, you can focus on creating new annotations based on the new version of the sequence without wasting time repeating analyses already performed.
 
 

Basic principle:

        1.  For SNP, primer mapping, and annotation migration:

        BLAST results provide all the homologous relationships between annotated and un-annotated sequence fragments.  If we look only at perfect matches, then generally speaking the matches identify all the corresponding elements in the two sequences.  With accurate coordinate calculation, the annotation can therefore be transferred to the un-annotated sequence based on the matching relationship.

      A feature is transferred only when a continuous match (very limited gaps are allowed) covers the whole feature location.  In different matching scenarios, an annotation either can or cannot be transferred, depending on the position/range of the annotation and of the matches.

        2. For exon mapping:

         Basically, we just provide a handy parser for Sim4 results. The program here converts the Sim4 results into Genbank format feature data. 

          BLAT-based exon mapping is also provided.     

        3. For STS mapping:

        e-PCR is used for STS mapping based on primer-pair information.  So, if you have the primer pairs used to amplify the markers, you should use this method to map them.  In many instances, this method is more sensitive than the BLAST-based approach.  The tradeoff is that the annotation results rely on the accuracy of the primers supplied.

        In annotation migration, features that span a large region, such as "gene", use the e-PCR approach.
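
The coverage rule for BLAST-based transfer described in item 1 above can be sketched in Python as follows. This is only an illustration under simplifying assumptions, not DNannotator's actual code: the Match type and transfer_feature function are hypothetical names, the match is taken to be ungapped and on the same strand, and all coordinates are 1-based and inclusive.

```python
from typing import NamedTuple, Optional, Tuple

class Match(NamedTuple):
    """An ungapped, same-strand match (hypothetical simplification)."""
    src_start: int  # match start in the annotated (old) sequence
    src_end: int    # match end in the annotated (old) sequence
    tgt_start: int  # match start in the un-annotated (new) sequence

def transfer_feature(feat_start: int, feat_end: int,
                     match: Match) -> Optional[Tuple[int, int]]:
    """Return the feature location in the new sequence, or None.

    The feature is transferred only when the match covers it completely.
    """
    if match.src_start <= feat_start and feat_end <= match.src_end:
        offset = match.tgt_start - match.src_start
        return feat_start + offset, feat_end + offset
    return None

match = Match(src_start=67000, src_end=68000, tgt_start=70500)
inside = transfer_feature(67408, 67569, match)    # fully covered by the match
overhang = transfer_feature(67900, 68100, match)  # runs past the match end
print(inside, overhang)
```

In this sketch the covered exon simply shifts by the offset between the two match starts, while the feature that extends beyond the match is left for manual review.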

 

gb-header and its related utilities

    gb-header is the beginning part of Genbank format data, which includes the contents of LOCUS, DEFINITION, ACCESSION, VERSION, SOURCE, REFERENCE, FEATURES, BASE COUNT and ends with ORIGIN.  In other words, only the sequence part is excluded from gb-header.

    A tool is provided for you to extract gb-header from standard Genbank format sequence data.
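
    For readers who want to see what such an extraction amounts to, here is a minimal Python sketch. It is not the DNannotator tool itself: split_genbank is a hypothetical name, the demo record is made up, and the sketch assumes a single record whose ORIGIN line starts in the first column.

```python
def split_genbank(text):
    """Split one Genbank record into (gb_header, sequence_body).

    The header runs from the first line through the ORIGIN line;
    everything after ORIGIN is treated as the sequence part.
    """
    lines = text.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if line.startswith("ORIGIN"):
            return "".join(lines[:i + 1]), "".join(lines[i + 1:])
    raise ValueError("no ORIGIN line found")

# A tiny made-up record, for illustration only.
record = (
    "LOCUS       DEMO         20 bp    DNA\n"
    "FEATURES             Location/Qualifiers\n"
    "     exon            5..12\n"
    "ORIGIN\n"
    "        1 acgtacgtac gtacgtacgt\n"
    "//\n"
)
header, body = split_genbank(record)
print(header)
```

    Concatenating the two parts again gives back the original record, which is the idea behind merging a gb-header with its sequence body.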

    We use gb-header rather than whole Genbank data for annotation, since gb-header is where all the feature data reside.  Sequence data are analyzed separately with BLAST, and the BLAST results are used as the basis for generating the new annotation.

    By using gb-header alone, a much smaller amount of data needs to be uploaded for processing, since the sequence part obviously makes up the majority of a Genbank format file.  This speeds up the procedure, and much less data needs to be transferred over the network.  Like gb-header, the GFF format proposed by Sanger also provides only annotation data.

    If you do annotation many times with many batches of source data, you will generate a series of gb-headers differing only in their "FEATURE" contents.  A tool is supplied for merging all gb-headers into one.  You just need to put all the gb-header files into one gzip archive and submit it to DNannotator.

    To make a pretty and clean gb-header, you can run a small DNannotator function that re-organizes and cleans the final combined gb-header file.

    Certainly, you may need to merge the gb-header with its corresponding sequence body after all annotations are finished.  A small utility is implemented to do this for you.

     If you want to use Artemis to view the annotation, and you wish to put different categories of annotation into different layers, then don't merge all the gb-header files.  By reading them in separately, each as an individual "entry", you get a better layer-by-layer view in Artemis.  (More information about Artemis)

Basic requirements:

  1.  Knowledge of sequence data formats.  For example, you should know the basic differences between Genbank and FASTA format, between white space and tab-delimited space, etc.
  2. Certainly, you need to organize your input data efficiently, using other tools such as MS EXCEL or LOTUS NOTES.
  3.  A good enough network connection.  All of these requirements vary with the size of the data you want to analyze.  For example, if you want to analyze 1 Kb of data, you can even use a dialup network connection.  If the data are over 1 Mb, dialup might be too slow for both uploading data for analysis and downloading results.

 

View annotation graphically

    Artemis is recommended as a good choice for viewing annotation in Genbank format or gb-header.

    To get further information about Artemis, please follow the link.

    If you are doing custom annotation on Genome Browser's latest freeze, you can take the "custom track" data file and view it in Genome Browser.

Possible reasons for receiving unexpected results:

      Assuming you already checked that you have proper hits in the BLAST results and all required input files are ready, the following errors may be the cause of problems:

  1. Wrong input.  DNannotator is not yet smart enough to detect whether you have given it BLAST results where a Genbank file was asked for.  The only error checking available now is making sure that every required input field has been filled in.  So, you have to be aware of what you are doing and which file is which.
  2. Wrong input format.  Never use a run of white space in place of a tab.  Other format errors are very easy to spot.  DNannotator supplies an example data file wherever a special format is required (such as FASTA and text format annotation).  Please take a close look at it if you are using that function for the first time.  If you are using a data file from a Mac machine, please be sure to check the box (right above the "Submit" button) to specify that you are using Mac format data.
  3. Wrong BLAST version.  The BLAST parser provided here does not work for program versions earlier than 2.1.  You can find the version number at the very beginning of your BLAST results.
  4. Improper parameter settings.  If you set the parameters for filtering out undesired hits (size of minimum match, percentage of identity) improperly, you may receive mis-assigned mappings or annotations (if the threshold is too low), or miss the annotation you are expecting (if the threshold is too high).
  5. Not all annotations are transferable.  Because of the basic principle of DNannotator, annotation migration relies heavily on the BLAST matches and the annotation data.  A feature is transferred only when a continuous match (very limited gaps are allowed) covers the whole feature location.  Otherwise, human intervention is required.  You should be very cautious in this situation: an interrupted BLAST match can have quite different biological interpretations.  So it's recommended to take the "annotation evidence data" (parsed & filtered BLAST results), check the fragments that were not annotated, and make your own biologically correct judgment.
  6. "Dirty FASTA" data.  If you copy sequence data from a webpage, or a consensus sequence from a sequence assembly program, it is very likely that the sequence is full of "junk" symbols/characters or spaces, which may break some sequence analyses, such as Sim4 or even BLAST.  If that's the case, it's recommended to run sequence cleanup to make a clean FASTA sequence before doing other analysis.
  7. Program bugs.  If you have already checked all the possible errors listed above and still can't figure out what's wrong, you may want to use the bug report form with the detailed information you have, especially the "filename of returned results".  Please don't send the whole results to me.  I can check the server myself if needed.
  8. Wrong email address.  I've had a number of emails bounce back to me due to "unknown user" or "user don't have an account in ...".  So, please be sure that you fill in the correct email address to get the result.
  9. Other possibilities?  None that I can think of yet.  If you find one, please let me know.
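
To illustrate what cleaning up "dirty FASTA" data (item 6 above) involves, here is a minimal Python sketch. It is not DNannotator's own sequence cleanup function: clean_fasta is a hypothetical name, and the sketch assumes that every letter on a sequence line is a valid sequence symbol, so it only strips digits, spaces, and punctuation. It also normalizes Mac-style line endings as a side effect, since str.splitlines accepts "\r", "\n", and "\r\n".

```python
import re

def clean_fasta(text, width=60):
    """Return FASTA text with non-letter characters removed from sequence lines.

    Header lines (starting with '>') are kept; the sequence is re-wrapped
    to `width` characters per line.
    """
    out = []
    seq = []

    def flush():
        # Emit the sequence collected so far, wrapped to `width` columns.
        joined = "".join(seq)
        for i in range(0, len(joined), width):
            out.append(joined[i:i + width])
        seq.clear()

    for line in text.splitlines():
        if line.startswith(">"):
            flush()
            out.append(line.strip())
        else:
            seq.append(re.sub(r"[^A-Za-z]", "", line))
    flush()
    return "\n".join(out) + "\n"

# Sequence copied from a web page: position numbers, spaces, stray symbols,
# and Mac-style carriage returns.
dirty = ">seq1 demo\r  1 acgt acgt\r 11 NNac-gt*\r"
cleaned = clean_fasta(dirty)
print(cleaned)
```

A stricter version could restrict the allowed letters to the nucleotide codes you expect, at the cost of silently dropping anything else.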

Limitations of DNannotator:

  1. DNannotator relies on the annotation you already have.  Alternatively, it can start from scratch, helping you annotate the fragments or elements you wish to place onto the DNA sequence.  Either way, the user needs to supply correct information for correct annotation or annotation migration.

  2. Since DNannotator relies on BLAST match data, certain annotations may not be made by DNannotator under strict conditions.  Manual annotation is always required for those difficult scenarios.

  3. Sim4-based "exon mapping" can generate errors in exon serial number assignment.  If an incomplete cDNA or gDNA sequence was used for the Sim4 analysis, certain exons may be missing, but the DNannotator exon mapping function numbers exons consecutively in the order they appear in the Sim4 results.  Therefore, if you know that exons are missing from the draft gDNA sequence, or you are using a partial cDNA sequence, you should manually revise the exon numbers in the annotated results.  DNannotator will generate an alert message if a cDNA is not completely covered by annotation, for example if only 800 bp of a 1 Kb cDNA sequence is annotated by Sim4.  So, pay attention to those "WARNING" messages, and watch out for possible cDNA and gDNA sequence errors.
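
The kind of coverage check behind such a warning can be sketched in Python as follows. This is an illustration only, not DNannotator's implementation: cdna_coverage is a hypothetical name, the exon ranges are made-up cDNA coordinates (1-based, inclusive), and overlapping ranges are counted once.

```python
def cdna_coverage(exon_ranges, cdna_length):
    """Fraction of cDNA bases covered by the given exon ranges."""
    covered = 0
    last_end = 0
    for start, end in sorted(exon_ranges):
        # Skip any part already counted by an earlier range.
        start = max(start, last_end + 1)
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / cdna_length

# Made-up example matching the 800 bp of 1 Kb case mentioned above.
exons = [(1, 300), (301, 650), (851, 1000)]
fraction = cdna_coverage(exons, 1000)
if fraction < 1.0:
    print("WARNING: only %d%% of the cDNA is annotated" % round(fraction * 100))
```

Here bases 651 to 850 of the cDNA are not assigned to any exon, so the check reports incomplete coverage.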

 

Why size limit?

  1. Web-based applications all have this restriction due to the limits of network performance.
  2. Currently the system runs on one slow G3 machine.  We have to make this small amount of resource work well and fairly for most users.

    With better computing resources, we will lift the limit.

For oversized data

  1. The first thing you should try is to use gzip to compress the data.  If you have not used it before, http://www.gzip.org/ is where you can get executables for almost all OS platforms.  Currently, this is the only compression format DNannotator will accept.  DNannotator automatically processes gzip-compressed data as long as the ".gz" suffix is attached.  So, REMEMBER to put ".gz" at the end of your file name if the file is gzip compressed.  Commercial software might suit general users better.  We recommend Winzip and Winrar.  They work well on the .tar and .gz files that all our applications generate.  Also, to refresh the memory of Unix users: "tar xzf filename.tar.gz" and "gunzip filename.gz".
  2. The major problem with the size limit occurs for BLAST results rather than other data files.  Therefore, you may try to shrink your BLAST results by setting a more stringent threshold, such as "-e 1e-8" or an even lower e-value, in the BLAST search.
  3. If you get an email asking you to download the result files, you can follow the hyperlink in the email to download the results from the DNannotator server.  The data will be removed after 10 days, so get them before they are gone.
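
If you prefer to compress files from a script rather than with the gzip executable, the following Python sketch does the same thing using the standard gzip module. The gzip_file function and the file name "blast_results.txt" are hypothetical; the demo writes a throwaway file with placeholder content so that it runs on its own.

```python
import gzip
import os
import shutil
import tempfile

def gzip_file(path):
    """Compress `path` to `path + '.gz'` and return the new file name."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path

# Demo with a throwaway file standing in for a large BLAST result.
workdir = tempfile.mkdtemp()
demo_path = os.path.join(workdir, "blast_results.txt")
with open(demo_path, "w") as handle:
    handle.write("placeholder line\n" * 1000)

gz_path = gzip_file(demo_path)
print(os.path.basename(gz_path))
```

The output file keeps the original name with ".gz" appended, which is the suffix DNannotator looks for.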

  

Functions to be implemented or improved

1. Implement a complete data set to support the DAS system.  Ideally, if you have DAS up and running, you will be able to directly import DNannotator output into your "reference sequence server" and "annotation server".

2. Gene prediction, transcription-factor binding site prediction, etc. will be implemented soon.

3. BLAT-match based annotation for STS.

4. Let annotation migration take gb-header as input too.

5. Modify data merging so that .zip files, rather than only .tar, can be used to merge multiple gb-header files.

Merging the sequence at the same time could be offered as an option, so the user does not need to use two functions to do one thing.

6. Automatically update UniSTS data.

7. Use Genbank accession numbers as the cDNA annotation source.

Subscribe to mailing list.

        Follow this link.