Introduction to Features [PIR - Protein Information Resource]


Introduction to Features

	General Principles
	Repeat Domains
	Product Records
	Signal Sequences and Transit Peptides
	Miscellaneous Rules
	Region Records
	Membrane Crossing Regions
	Repeats
	Suggestions for Annotators
	Homology Domains
	Domains

"Domain" Record

The format for the "Domain:" record is

"Domain:" ["(or" hyphenated pairs ")"] domain name ["(" form ")"] ["#status" status] "<" tag ">"

This record should generally be applied to a single hyphenated pair. A "domain" carries the connotation of having some degree of spatial coherence, that is, secondary or tertiary structure. Separate segments of sequence that together form the same domain should be placed in the same record. Separate segments of sequence that form spatially distinct domains that happen to have the same description should be placed in separate records.

We have attempted to standardize most "Domain" records, but this format is still somewhat variable. Here we set forth some very general guidelines pertaining to certain types of domains.

General Principles

Use the same name for the same kind of domain. Insofar as possible, use the same or similar tags for the same kind of domain. Domain names should be INFORMATIVE; avoid names such as "first", "A", "II", etc. A domain or region should be annotated only when it is biologically significant and the name should reflect that interesting structural or functional property. Names that are obvious or used only for the convenience of particular authors should be suspect.
Do not include enumeration within the names given to repeated domains of the same type within the same sequence. This results in needless proliferation of names that are all the same except for a number or letter. The enumeration should be in the tag instead.

The boundaries of domains are assumed to have "predicted" status and are understood to be not necessarily precise; usually no additional indication of uncertainty is needed. If there is considerable uncertainty, for example if any of three Mets might be the initiator, this is indicated by an initial parenthetical phrase. For example,

Domain: (or 5-32 or 11-32)

The "or" form should be avoided whenever possible.

The boundaries of homology domains are understood to be more or less arbitrary and defined on the basis of sequence similarities; do not put a status on such domains.

The question of what types of regions we should call domains is still under discussion.
Each instance of a given kind of domain within a sequence should have a separate domain record, thus use

20-42/Domain: transmembrane #status predicted
50-72/Domain: transmembrane #status predicted

and do not use

20-42,50-72/Domain: transmembrane #status predicted

In some cases a single 3-dimensionally defined domain does consist of separated segments of sequence; a list of ranges may appear in such cases, but this is rare.
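The record layout used in the examples above can be checked mechanically. The following is a minimal sketch, assuming a feature line has the shape location "/" record type ":" description, optionally followed by "#status" and a tag; the regular expression, the `Feature` type and the function names are illustrative and are not part of any PIR software.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Feature(NamedTuple):
    ranges: List[Tuple[int, int]]   # one (start, end) per hyphenated pair
    record_type: str                # e.g. "Domain", "Product", "Region"
    description: str                # name, including any "(form)" or "(fragment)"
    status: Optional[str]           # "predicted", "experimental", or None
    tag: Optional[str]              # tag text without the angle brackets


# Assumed line shape: location "/" type ":" description [#status s] [<TAG>]
FEATURE_RE = re.compile(
    r"^(?P<loc>\d+-\d+(?:,\s*\d+-\d+)*)/"
    r"(?P<type>[A-Za-z ]+):\s*"
    r"(?P<desc>.*?)"
    r"(?:\s+#status\s+(?P<status>\w+))?"
    r"(?:\s+<(?P<tag>\w+)>)?\s*$"
)


def parse_feature(line: str) -> Feature:
    """Split one feature line into location, type, description, status, tag."""
    m = FEATURE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a recognizable feature line: {line!r}")
    ranges = []
    for pair in m.group("loc").split(","):
        start, end = pair.strip().split("-")
        ranges.append((int(start), int(end)))
    return Feature(ranges, m.group("type"), m.group("desc"),
                   m.group("status"), m.group("tag"))


def has_range_list(feature: Feature) -> bool:
    """True when a record lists several ranges, which the guideline calls rare."""
    return len(feature.ranges) > 1
```

A checker built on this could flag "Domain" records for which `has_range_list` is true, so that an annotator can confirm the segments really form one spatial domain.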

Signal sequences and transit peptides

These domains have been standardized in PIR. Please follow the format given below for the simple cases; in more complex cases, use the examples as a guide. A form must appear with a transit peptide. Tags are required, but these examples are suggestions.

"Domain:" ["(or" hyphenated pair ["or" hyphenated pair ...] ")"] "signal sequence" ["(fragment)"] ["#status" status] "<SIG>"
"Domain:" ["(or" hyphenated pair ["or" hyphenated pair ...] ")"] "transit peptide (" form ")" ["(fragment)"] ["#status" status] "<TNP>"

Examples:

Domain: signal sequence #status predicted <SIG>
Domain: signal sequence (fragment) #status experimental <SIG>
Domain: transit peptide (amyloplast) #status predicted <TNP>
Domain: transit peptide (chloroplast) #status predicted <TNP>
Domain: transit peptide (chloroplast) (fragment) #status experimental <TNP>
Domain: transit peptide (mitochondrion) #status predicted <TNP>

The "or" form should be avoided whenever possible.

Domain: (or 1-15) signal sequence (fragment) #status predicted <SIG>
Domain: (or 1-43 or 1-49) signal sequence #status predicted <SIG>

When the boundary between a signal sequence and the following domain has not been determined or predicted, use a record like one of these:

Domain: signal sequence and propeptide #status predicted <SIG>
Domain: signal sequence (fragment) and propeptide #status predicted <SIG>

When more than one protein product is presented in an entry, use this format.

"Domain: signal sequence (of " product name ")" ["#status" status] "<SIG>"

For example

Domain: signal sequence (of membrane glycoprotein E1) #status predicted <SIG>
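The simple signal sequence and transit peptide formats can be assembled from their parts. This is a sketch only: the function name `targeting_record` and its parameters are hypothetical, and it covers just the simple cases, not the "and propeptide" or "(of product name)" variants.

```python
def targeting_record(kind, status=None, form=None, fragment=False, alternatives=()):
    """Build a "Domain:" record for a signal sequence or a transit peptide.

    kind         -- "signal sequence" or "transit peptide"
    status       -- e.g. "predicted" or "experimental"; omitted when None
    form         -- required for a transit peptide, e.g. "chloroplast"
    fragment     -- add the "(fragment)" modifier
    alternatives -- (start, end) pairs for the "or" form, to be avoided when possible
    """
    if kind == "signal sequence":
        name, tag = "signal sequence", "<SIG>"
    elif kind == "transit peptide":
        if not form:
            raise ValueError("a form must appear with a transit peptide")
        name, tag = f"transit peptide ({form})", "<TNP>"
    else:
        raise ValueError(f"unsupported kind: {kind!r}")

    parts = ["Domain:"]
    if alternatives:
        parts.append("(" + " ".join(f"or {a}-{b}" for a, b in alternatives) + ")")
    parts.append(name)
    if fragment:
        parts.append("(fragment)")
    if status:
        parts.append(f"#status {status}")
    parts.append(tag)
    return " ".join(parts)
```

For instance, `targeting_record("signal sequence", "predicted", alternatives=[(1, 43), (1, 49)])` reproduces the second "or" example given above.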

Membrane-crossing regions

These are currently annotated as domains.

Domain: transmembrane #status predicted <TMM>
Domain: transmembrane beta strand #status predicted <TMM>
Domain: transmembrane helix #status experimental <TMM>

Try to be consistent in assigning boundaries of transmembrane domains within a group of closely related proteins. Lacking any other criteria, use the minimum range suggested by the ALOM program. The preferred tags are "<TMM>" when there is only one, and "<TM1>", "<TM2>", etc., when there is more than one. When there are more than nine, use tags like "<TM01>".
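The tag convention just described can be sketched as a small helper. The function name is hypothetical, and the two-digit padding for ten or more domains is an assumption based on the single "<TM01>" example; the guideline does not say what to do beyond 99.

```python
def transmembrane_tags(count):
    """Return the preferred tags for `count` transmembrane domains, in order."""
    if count < 1:
        raise ValueError("count must be at least 1")
    if count == 1:
        return ["<TMM>"]
    # Assumption: pad to two digits only when there are more than nine domains.
    width = 2 if count > 9 else 1
    return [f"<TM{i:0{width}d}>" for i in range(1, count + 1)]
```

Note that the enumeration goes into the tag only; the name stays "transmembrane" for every record.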

Do not use the following kinds of names for transmembrane domains:

"transmembrane 2" or "transmembrane II" (the numbers should not be part of the name)
"transmembrane domain"
"transmembrane region"
"membrane-spanning segment"
"potential transmembrane sequence"
"membrane anchor domain"

The following cases also appear and are under review.

Domain: intramembrane
Domain: membrane anchor
Domain: membrane associated
Domain: membrane insertion
Domain: membrane-bound
Domain: transmembrane amphipathic helix #status predicted

Homology domains

Homology domains form a special class. They are distinguished by the property that those of a given type (with the same name) are homeomorphic and share sequence homology although they are found in different (nonhomeomorphic) proteins. The names of such domains end with the word "homology". Many such domains are homologous with most of the entire length of some other protein, in which case they may be named after such a protein, either exactly ("trypsin homology") or with a more general designator ("protein kinase homology"). Other domains have, so far, been found as domains within multidomain proteins ("homeobox homology"). Some are named to indicate that they are repeated in a certain protein ("complement factor H repeat homology"). The conversion of homology domain names to include the terms "homology" or "repeat homology" is still incomplete.

The boundaries of homology domains should be consistent with an alignment of representative domains of the named type. Dr. Barker is collecting such alignments. A preferred tag will be assigned for each type of homology domain. Some examples:

Domain: basic proteinase inhibitor homology <BPI>
Domain: cytochrome b5 core homology <CB5>
Domain: protein kinase homology <KIN>
Domain: calmodulin repeat homology <EF1>

Note that some domains with names of proteins may have been assigned not by sequence homology but by predicted activity. Defining a homology domain is preferable to a name that predicts structure ("EF hand") or function ("calcium-binding") because structure may be distorted or function lost in homologous domains. Do NOT add the word "homology" without affirming that there is sequence homology! Please read carefully the discussion and proposal of the use of "Domain" and "Region" records for repeated sequence elements.

Do not use a status for this type of domain. Such domains are assigned only by homology, which is always an inference of predicted status and never experimental. Boundaries are understood to be somewhat a matter of human judgement.

The following names are NOT acceptable:

"alpha chain homolog"
"complement binding protein-related"
"endozepine-like"
"homology with Ig C region domains"
"malK protein homolog 1"
"lipoyl domain 1 #status predicted"

In this last example, it's not clear whether it is a homology domain or a function prediction. The word "domain" is always superfluous. Domains should not be enumerated.

To denote regions that are under consideration as homology domains, it has become acceptable practice to annotate them as a "similarity", for example

Domain: platelet-derived growth factor chain B similarity

with the understanding that they will be changed at a later date.

Repeat Domains

Domains that are repeated in a protein should be named as homology domains if they are also known to occur in diverse proteins. Otherwise, they may be named for the specific protein or an example of it omitting the term "homology", for example "CDC23 repeat".

Miscellaneous Rules

Use a hyphen before "binding" as in "cAMP-binding" when it is used attributively, that is, as an adjective with a following noun. For example,

Domain: DNA-binding core #status predicted
Domain: alpha-actinin actin-binding domain homology

Otherwise, if it is used nominatively, as a name with no following noun, do not use a hyphen. For example,

Domain: DNA binding #status predicted
Region: actin binding #status predicted

If required, use "fragment". It appears before the status (if used) and tag.
Always try to use names that are at least 3 characters in length.

Old "Duplication" records cannot be used. Any features of this type should be entered as "Region" or "Domain" records in accordance with the discussion below.

Following these guidelines will at least reduce the heterogeneity in the current database and make it easier to convert.

Duplications are of two major types: short repeats (usually tandem) and longer domains. There is no firm cut-off without studying the situation; as a guideline, treat 25 or fewer residues as a repeat and 50 or more as a domain. In between, it must be a domain if such domains exist in other types of proteins (e.g., EGF-like) but may be treated as a tandem repeat if it is unique to this type of protein.
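The length guideline for duplications can be written as a simple decision function. This is only a sketch of the rule of thumb, with a hypothetical function name; as the text says, there is no firm cut-off without studying the situation.

```python
def classify_duplication(length, occurs_in_other_proteins=False):
    """Suggest "repeat" or "domain" for a duplicated element of `length` residues.

    occurs_in_other_proteins -- True if elements of this kind are known in
    other types of proteins (e.g., EGF-like); only matters for 26-49 residues.
    """
    if length <= 25:
        return "repeat"
    if length >= 50:
        return "domain"
    # Intermediate lengths: a domain if found elsewhere, otherwise a repeat.
    return "domain" if occurs_in_other_proteins else "repeat"
```

A "repeat" result points to a "Region:" record and a "domain" result to a "Domain:" record; the annotator's judgement still takes precedence.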

Repeats

Repeats are very short to fairly short, usually occur in tandem, and the pattern is often, but not always, specific to this type of protein. The annotation to use is the "Region:" record. Use a hyphenated pair for the entire region in the location field, "22-300", and do not give the boundaries of individual repeats. Several unconnected regions may be listed if they contain the same pattern, "22-100, 200-298" (note: we don't have the permission to use a semicolon yet; hopefully it comes very soon!). If there is a list, then all the other information within the record must be applicable to the entire list.

In the description field, use the following format:

n "-residue repeats" ["(" sequence pattern ")"] [", " descriptive phrase]

For example,

Region: 11-residue repeats (D-P-A-K-A-S-Q-G-G-L-E)

"n" is the typical number of residues in the repeat pattern; the number of repeats is not given.

A sequence pattern is a simple representation of the canonical pattern using the single-letter code, with positions separated by hyphens; when necessary, alternatives are indicated for only the most common residues, separated by a slash. For example, (A-C-D/E-F-G).

No tag is usually used with the "Region" record. "tandem repeat" is used as a KEYWORD if the repeats are tandem. "repeat" may be allowed as a KEYWORD for non-tandem repeats.
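The description format and the pattern notation lend themselves to a short illustration. The sketch below assumes that "n" equals the number of positions in the written pattern and that "X" stands for any residue; both are assumptions made for the example, and the function names are hypothetical.

```python
import re


def repeat_description(pattern, phrase=None):
    """Format the description field of a "Region:" record for repeats."""
    n = len(pattern.split("-"))          # assumption: n = positions in the pattern
    text = f"{n}-residue repeats ({pattern})"
    if phrase:
        text += f", {phrase}"
    return text


def pattern_to_regex(pattern):
    """Translate a pattern such as "A-C-D/E-F-G" into a compiled regex."""
    parts = []
    for position in pattern.split("-"):
        alternatives = position.split("/")
        if alternatives == ["X"]:        # assumption: X matches any residue
            parts.append(".")
        elif len(alternatives) == 1:
            parts.append(re.escape(alternatives[0]))
        else:
            parts.append("[" + "".join(alternatives) + "]")
    return re.compile("".join(parts))
```

Searching a sequence with the compiled pattern shows where the canonical repeat occurs, which helps in choosing the hyphenated pair for the entire region.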

Domains

"Product" Records

A Product is any relatively stable (i.e. isolatable) peptide chain, including chains that experience cleavage of a precursor form and remain bound together in the same molecule. This definition has several implications.

Some sequence elements previously identified as "Peptide" are probably not stable and will not fit the proposed definition of "Product". You may use "Domain" or "Region" for these. Activation peptides are normally annotated as a "Domain" unless they have been isolated and appear to be physiologically significant. What can usually be easily determined is what segments are present in the final mature protein(s) and what segments are removed.

Do not use Product records like

20-50,70-90/Product: mcguffin A and B chains #status experimental <MAT>

Several options are available, and there are good examples where it has been necessary to use one or the other of these forms. You may represent the chains in two separate "Product" features.

20-50/Product: mcguffin chain A #status experimental <ACH>
70-90/Product: mcguffin chain B #status experimental <BCH>

It is also possible to present a single "Product" feature and two "Domain" features, especially when the chains are covalently linked and only a single molecular entity with one molecular weight actually exists.

20-50,70-90/Product: mcguffin #status experimental <MAT>
20-50/Domain: mcguffin chain A #status experimental <ACH>
70-90/Domain: mcguffin chain B #status experimental <BCH>

[The use of the second approach is evident in annotating protein splicing.] Do not mix these forms in the same entry, and try to standardize them across a family.

So far there has been little standardization of "Product" records; however, the following guidelines should be used.
Do not use "Product" for a segment that has insufficient lifetime to be isolated.
A name in a "Product" feature should repeat the protein name as given in the entry title or a name in the "Contains" record, usually omitting "precursor" and including a chain designation. Version and clone designations may be omitted. This may be enforced at a later date.

Chain designations that are words or Greek letters should precede the word "chain" and designations that are English letters, numbers (Arabic or Roman) or combinations of them should follow the word "chain": thus,

"chain B2"
"chain IV"
"pi chain"
"heavy chain"
"catalytic chain"

The tag is required. Use "<MAT>" for a single mature product.
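The ordering rule for chain designations can be approximated in code. This sketch relies on a loud assumption that is not stated in the guideline: designations written entirely in upper-case letters and digits are treated as English letters or numbers (including Roman numerals), and anything else as a word or spelled-out Greek letter. The function name is hypothetical.

```python
import re


def chain_name(designation):
    """Place a chain designation before or after the word "chain"."""
    d = designation.strip()
    # Assumption: upper-case letters/digits only -> letter, number or Roman numeral.
    if re.fullmatch(r"[A-Z0-9]+", d):
        return f"chain {d}"
    # Otherwise treat it as a word or a spelled-out Greek letter.
    return f"{d} chain"
```

The heuristic would misplace a lower-case single-letter designation, so its output should be reviewed rather than trusted.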

If you can determine that both boundaries of a product have been experimentally determined AS PROTEIN, with enough of the portion between them also determined to leave little doubt that no additional processing or splice forms occur, then use the status "#status experimental". Use "#status predicted" if the boundaries are assigned by homology or the sequence is determined substantially as nucleic acid. The experimental determination of only one end (almost always the amino end) is not sufficient to justify use of "#status experimental" for an entire "Product" feature because protein splicing, alternate transcripts, frame-shift errors and carboxyl-terminal propeptide processing introduce too many uncertainties.

Do not use "amino end of" and "carboxyl end of". Instead use the modifier "(fragment)".

"Region" Record

This record remains generally unstandardized at this time to allow the annotation of new features that are not yet well-understood or standardized. A "Region" should probably carry only the connotation of being contiguous sequence, as opposed to the spatial connotation of a "Domain". The following guidelines should be followed:

The tag is not usually used.

Status is often not appropriate.

See the discussion elsewhere for how to handle regions of tandem repeats.

The word "rich" should be appended with a hyphen. The word "binding" should be appended with a hyphen if it is used as an adjective, and it should not have a hyphen if it is not followed by a noun.

Do not use the word "region" in the description; no "Region: xxx region".

Avoid using expressions which match other record types, such as:

Region: active site
Region: extracellular domain

This first expression should especially not be used if specific residues are listed. It should either be annotated as an "Active site" or as

Region: catalytic

The second would be better as

Domain: extracellular #status predicted

Regions of a specific type of secondary structure should not be annotated. In the NRL_3D database only, the PDB HELIX, TURN and SHEET features are converted to PIR "Region" features. The definitions and descriptions will use the PDB annotations in parentheses.

Region: helix (right hand alpha)
Region: turn (type II)
Region: beta sheet
Region: beta barrel

No other PIR databases should have entries with such conformational information annotated. The feature

Domain: beta barrel

is acceptable.

Motifs or patterns combining various types of secondary structure may be annotated as "Regions". For example,

Region: helix-turn-helix motif <HTH>

Do not use the word "motif" except for defined or accepted sequence motifs. Use the word "pattern" instead.

Do not use "#status predicted" for any feature that is defined by a sequence motif or pattern. Do not use something like

Region: pentapeptide motif (X-F-X-F-G) #status predicted

This is nonsensical because either the pattern is in the sequence or it is not. Instead use

Region: pentapeptide motif (X-F-X-F-G)

Unfortunately it becomes more difficult to appreciate this rule when the name given to the motif is supposedly descriptive of a function. In cases like

Region: DNA-binding motif (K/R-G-R-G-R-P)

it is very tempting to use "#status predicted". But does the status mean that the property of DNA binding is predicted, or only that a motif is present? If the motif is present, it certainly isn't predicted, it is experimentally observed. But putting "experimental" would suggest that "DNA-binding" is not just a name but an observation. Don't be confused or confusing; never use a status with "motif", "pattern", "homology", "similarity", etc.
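The rule against combining a status with "motif", "pattern", "homology" or "similarity" is easy to check automatically. The following is a minimal sketch with a hypothetical function name; it covers only the four words named in the text and ignores the "etc.".

```python
import re

# Words named in the guideline as incompatible with a status.
NO_STATUS_WORDS = ("motif", "pattern", "homology", "similarity")


def status_misused(record):
    """True if a record carries "#status" and its description uses one of the words."""
    m = re.search(r"#status\b", record)
    if m is None:
        return False
    description = record[:m.start()].lower()
    return any(re.search(rf"\b{word}\b", description) for word in NO_STATUS_WORDS)
```

Applied to the pentapeptide example above, the version with "#status predicted" is flagged and the version without a status passes.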

Suggestions for Annotators

Annotators may wish to use this checklist in preparing an annotation.
Usually the annotation should be the same as an annotation already in the database. Check for the feature in other database entries. Only these record types should be used:

Active site:
Binding site:
Cleavage site:
Cross-link:
Disulfide bonds:
Domain:
Inhibitory site:
Modified site:
Product:
Region:

The use of only these types is enforced in PIR databases.

Except for the special cases of "selenocysteine" and "N-formylmethionine", standard 3-letter residue codes should appear after the colon of "Active site", "Cleavage site" and "Inhibitory site" records, and in parentheses immediately after the first name in "Binding site", "Cross-link" and "Modified site" records. Be certain the residue code appears, that the residue has the correct number and that it corresponds to the proper residue in the sequence. This identity check is enforced in PIR databases.
Check that all other required fields are present and in the preferred order. The status should always be added to new entries in these records:

Active site:
Binding site:
Cleavage site:
Cross-link:
Disulfide bonds:
Inhibitory site:
Modified site:
Product:

A status may be appropriately used in only some "Region" and "Domain" records. If the extent field is used, only the word "partial" should appear; it should be placed immediately before the status, and the status should be "experimental", not "predicted".

Check your spelling and punctuation. Spelling errors in chemical terms can be especially difficult to catch. When appropriate, check that the names in "Product" records correspond to names in the title or "Contains" record.

Check that there are unique tags on all "Product" and "Domain" records, and that they are different from other tags in the entry. Tags are not required on any other types of features.
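The tag check in the last item can be sketched as follows. The sketch assumes each feature is one line ending in an optional "<TAG>", as in the examples throughout this page; the function name and the message wording are hypothetical.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<(\w+)>\s*$")                   # tag at the end of the line
NEEDS_TAG_RE = re.compile(r"(?:^|/)(?:Product|Domain):")


def check_tags(records):
    """Return a list of problems: missing tags on Product/Domain, duplicate tags."""
    problems = []
    tags = []
    for record in records:
        m = TAG_RE.search(record)
        if m:
            tags.append(m.group(1))
        elif NEEDS_TAG_RE.search(record):
            problems.append(f"missing tag: {record}")
    for tag, count in Counter(tags).items():
        if count > 1:
            problems.append(f"duplicate tag <{tag}> used {count} times")
    return problems
```

An empty list means every "Product" and "Domain" record in the entry has a tag and no tag is used twice.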

Revised 10/22/01

Protein Information Resource