LEXICON / GLOSSARY

Alignment

Representation of a prediction of the amino acids in tertiary structures of homologues that overlay in three dimensions. Alignments held by SMART are mostly based on published observations (see domain annotations for details), but are updated and edited manually.

Alignment block

Ungapped alignments that usually represent a single secondary structure element.

Bits scores

Alignment scores are reported by HMMer and BLAST as bits scores. The likelihood that the query sequence is a bona fide homologue of the database sequence is compared to the likelihood that the sequence was instead generated by a "random" model. Taking the logarithm (to base 2) of this likelihood ratio gives the bits score.
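
As a small illustration of the definition above, the following Python sketch computes a bits score as the base-2 logarithm of a likelihood ratio. The function name `bits_score` and the example likelihoods are hypothetical; this is not how HMMer or BLAST are implemented internally, only the arithmetic behind the reported number.

```python
import math


def bits_score(p_homologue_model: float, p_random_model: float) -> float:
    """Return log2 of the likelihood ratio (homologue model vs. random model)."""
    if p_homologue_model <= 0 or p_random_model <= 0:
        raise ValueError("likelihoods must be positive")
    return math.log2(p_homologue_model / p_random_model)


# Hypothetical likelihoods: the sequence is 2**10 times more likely
# under the homologue model than under the random model.
print(bits_score(2.0 ** -10, 2.0 ** -20))  # 10.0
```

A score of 0 bits means both models explain the sequence equally well; negative scores favour the random model.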

BLAST, Basic local alignment search tool.

An excellent database searching tool developed at the National Center for Biotechnology Information (NCBI) ([1], [2], [3], [4]). SMART uses NCBI-BLAST for detection of outlier homologues and homologues of known structure.

Cellular Role

Chromatin-associated: Chromatin is the tangled fibrous complex of DNA and protein within a eukaryotic nucleus
Interaction (with the environment): Molecules that sense cellular environmental change, such as osmolarity, light flux, acidity, ion concentration etc
Metabolic: Enzymes that catalyze reactions in living cells that transform organic molecules
Replication: The process of making an identical copy of a section of duplex (double-stranded) DNA, using existing DNA as a template for the synthesis of new DNA strands
Signalling: Proteins that participate in a pathway that is initiated by a molecular stimulus and terminated by a cellular response
Transport: Transporters (permeases) are proteins that straddle the cell membrane and carry specific nutrients, ions, etc. across the membrane
Translation: The process in which the genetic code carried by messenger RNA directs the synthesis of proteins from amino acids
Transcription: The synthesis of an RNA copy from a sequence of DNA (a gene); the first step in gene expression

Coiled coils

Intimately-associated bundles of long alpha-helices ([1], [2], [3]). Coiled coils are detected in SMART using the method of Lupas et al. ([4]).

Domain

Conserved structural entities with distinctive secondary structure content and a hydrophobic core. In small disulphide-rich and Zn²⁺-binding or Ca²⁺-binding domains the hydrophobic core may be provided by cystines and metal ions, respectively. Homologous domains with common functions usually show sequence similarities.

Domain composition

Proteins with the same domain composition have at least one copy of each of the domains of the query.
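
The definition above amounts to a set-containment test. The Python sketch below shows one way to express it; the function name `same_domain_composition` is hypothetical, and the domain names in the example are used purely for illustration.

```python
def same_domain_composition(query_domains, protein_domains):
    """True if the protein has at least one copy of every domain in the query.

    Order and copy number are ignored; extra domains in the protein are allowed.
    """
    return set(query_domains) <= set(protein_domains)


query = ["SH3", "SH2", "TyrKc"]
print(same_domain_composition(query, ["SH2", "SH3", "SH3", "TyrKc", "PH"]))  # True
print(same_domain_composition(query, ["SH3", "TyrKc"]))                      # False
```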

Domain organisation

Proteins having all the domains of the query in the same order (additional domains are allowed).
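
One possible reading of this definition is an ordered-subsequence test, sketched below in Python. The function name `same_domain_organisation` is hypothetical, and the sketch assumes that additional domains may appear anywhere (before, between or after the query domains), which the definition does not state explicitly.

```python
def same_domain_organisation(query_domains, protein_domains):
    """True if the query domains occur in the protein in the same order.

    Assumption: extra domains may be interspersed at any position.
    """
    remaining = iter(protein_domains)
    # 'domain in remaining' advances the iterator, so each match must
    # occur after the previous one.
    return all(domain in remaining for domain in query_domains)


query = ["SH3", "SH2", "TyrKc"]
print(same_domain_organisation(query, ["SH3", "PH", "SH2", "TyrKc"]))  # True
print(same_domain_organisation(query, ["SH2", "SH3", "TyrKc"]))        # False
```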

E-value

This represents the number of sequences with a score greater than or equal to X that are expected purely by chance. The E-value connects the score ("X") of an alignment between a user-supplied sequence and a database sequence, generated by any algorithm, with the number of alignments with similar or greater scores that would be expected from a search of a random sequence database of equivalent size. Since version 2.0, E-values are calculated using Hidden Markov Models, leading to more accurate estimates than before.
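
To make the link between score and database size concrete, the Python sketch below uses the standard BLAST relationship between a bit score S' and the E-value, E = m * n * 2^(-S'), where m is the query length and n the database length. This is only an illustration of the general idea: the function name `expected_hits` is hypothetical, raw sequence lengths are used in place of the effective lengths real tools apply, and SMART's HMM-based E-values are not necessarily computed this way.

```python
def expected_hits(bit_score: float, query_length: int, database_length: int) -> float:
    """E-value from a bit score under the BLAST-style formula E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search: 1,024-residue query against 1,048,576 database residues.
print(expected_hits(40, 2 ** 10, 2 ** 20))  # 0.0009765625
```

Under this formula, doubling the database size doubles the E-value for the same score, and each additional bit of score halves it.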

Extracellular Domains

Domain families that are most prevalent in proteins outside of the cytoplasm and the nucleus.

Gap

A position in an alignment that represents a deletion within one sequence relative to another. Alignment algorithms require gap penalties in order to limit excessively gapped regions. Gaps in alignments represent insertions that usually occur in protruding loops or beta-bulges within protein structures.

Genomic database

Protein database used in SMART's 'Genomic' mode. It contains data from completely sequenced genomes only, and is synchronized with a recent version of the STRING database. The complete list of genomes included in the current SMART version is available here.

HMM, Hidden Markov model

HMMs are statistical models of the sequence consensus of a homologous family (see HMMer). A particular class of HMMs has been shown to be equivalent to generalised profiles ([1]). Applications of HMMs to sequence analysis are provided by the HMMer package.

HMM consensus

The HMM consensus is a 'one line summary' of the corresponding HMM. The amino acid shown for the consensus is the highest probability amino acid at that position according to the HMM. Capital letters mean "highly conserved" residues (probability > 0.5 for protein models). (modified from the HMMer User's Guide)
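
The rule above can be sketched in a few lines of Python. The input format (one dictionary of residue probabilities per model position) and the function name `hmm_consensus` are hypothetical, and writing less conserved positions in lower case is an assumption based on the usual HMMer convention rather than something stated in the definition.

```python
def hmm_consensus(emissions, threshold=0.5):
    """Build a one-line consensus from per-position residue probabilities."""
    consensus = []
    for position in emissions:
        # Pick the highest-probability residue at this position.
        residue, probability = max(position.items(), key=lambda item: item[1])
        if probability > threshold:
            consensus.append(residue.upper())   # highly conserved
        else:
            consensus.append(residue.lower())   # assumed convention for the rest
    return "".join(consensus)


# Hypothetical three-position protein model.
model = [
    {"G": 0.9, "A": 0.1},
    {"L": 0.4, "I": 0.35, "V": 0.25},
    {"K": 0.6, "R": 0.4},
]
print(hmm_consensus(model))  # GlK
```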

HMMer

The HMMer package ([1], [2]) provides multiple alignment and database searching capabilities.
There are several programs in the package, including one (hmmfs) that searches databases for non-overlapping LOCAL similarities (i.e. that match across at least part of the HMM), and another (hmmls) that searches databases for non-overlapping GLOBAL similarities (i.e. that match across the full HMM). These correspond approximately to profile-based searches using negative and positive profiles, respectively (see WiseTools). Database searches using hmmls or hmmfs provide alignment scores as bits scores.

Homology

Evolutionary descent from a common ancestor due to gene duplication.

Intracellular Domains

Domain families that are most prevalent in proteins within the cytoplasm.

Localisation

The numbers of domains that are thought, on the basis of SwissProt annotations, to be present in different cellular compartments (cytoplasm, extracellular space, nucleus, and membrane-associated) are shown in annotation pages.

Motif

Sequence motifs are short conserved regions of polypeptides. Sets of sequence motifs need not necessarily represent homologues.

NRDB, non-redundant database

A database that contains no identical pairs of sequences. It can contain multiple sequences originating from the same gene (fragments, alternative splicing products...). SMART's 'normal' mode uses an NRDB created from UniProt and Ensembl protein databases.
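
A minimal Python sketch of the idea follows, assuming records are simple (identifier, sequence) pairs. The function name `make_non_redundant`, the record format, the example identifiers and the choice to keep the first identifier seen for each sequence are all assumptions made for illustration; the actual procedure used to build SMART's NRDB is not described here.

```python
def make_non_redundant(records):
    """Keep one record per distinct sequence; non-identical fragments are retained."""
    seen = set()
    unique = []
    for identifier, sequence in records:
        key = sequence.upper()  # compare sequences case-insensitively
        if key not in seen:
            seen.add(key)
            unique.append((identifier, sequence))
    return unique


# Hypothetical records: two identical sequences and one shorter fragment.
records = [
    ("seqA", "MKTAYIAKQR"),
    ("seqB", "MKTAYIAKQR"),
    ("seqC", "MKTAYI"),
]
print(make_non_redundant(records))
# [('seqA', 'MKTAYIAKQR'), ('seqC', 'MKTAYI')]
```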

ORF

Open reading frame.

Outlier homologues

These are often difficult to detect using HMM methodology. A complementary approach to their detection is to query a database of sequences taken from multiple sequence alignments, using BLAST.
Selecting this option will also activate searches against sequence databases derived from proteins of known structure. A simple BLAST search of the PDB is performed, together with a search of RPS-BLAST profiles derived from SCOP. These profiles were kindly provided by Steffen Schmidt (see Schmidt et al. J. Chem. Inf. Comput. Sci. 2002 (42) 405-7).

P-value

This represents a probability that, given a database of a particular size, random sequences score higher than a value X. P-values are generated by the BLAST algorithm that has been integrated into SMART.
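
For BLAST, the P-value and the E-value (see the E-value entry) are related by P = 1 - exp(-E), the probability of seeing at least one chance hit when E are expected. The Python sketch below illustrates that conversion only; the function name `p_value_from_e_value` is hypothetical.

```python
import math


def p_value_from_e_value(e_value: float) -> float:
    """Probability of at least one chance hit, given the expected number of hits."""
    if e_value < 0:
        raise ValueError("E-value must be non-negative")
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


print(p_value_from_e_value(1e-5))
print(p_value_from_e_value(5.0))
```

For small values the two measures are nearly identical, whereas for large E-values the P-value saturates towards 1.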

PDB, protein data bank

PDB is an archive of experimentally-determined three-dimensional structures (Brookhaven National Laboratory or EBI). Domain families represented in SMART and in the PDB are annotated as being of known structure; links are provided in SMART to the PDB via PDBsum. PDBsum links can be used to access a variety of sequence-based and structure-based tools.

PFAM

Pfam is a database of protein domain families represented as (i) multiple alignments, and (ii) HMM-profiles ([1], [2]). Pfam WWW servers do not exist anymore and their functionality has been merged into InterPro. SMART contains a facility to search the Pfam domain collection using HMMer.

Prokaryotic domains

SMART now also searches for domains found in two-component regulatory systems. These can be found mainly in prokaryotes, but a few were also found in eukaryotes like yeast and plants.

Profile

A profile is a table of position-specific scores and gap penalties, representing a homologous family, that may be used to search sequence databases (Ref.: [1], [2], [3]).
In CLUSTAL-W-derived profiles those sequences that are more distantly related are assigned higher weights ([4], [5], [6]). Issues in profile-based database searching are discussed in Bork & Gibson (1996) [7].

PROSITE

This is a dictionary of protein sites and motif patterns. Some SMART domain annotations contain links to PROSITE.

Schnipsel database; domain sequence database

"Schnipsel" is a German word meaning 'snippet' or 'fragment'. The schnipsel database consists of the sequences off all domains found with SMART in NRDB. Outliers of a family often cannot be detected by a profile, yet are detectable by pairwise similarity to one or more established members of a sequence family. So, searching against the schnipsel database gives complementary information to the profile searches.

Secondary Literature

The secondary literature is derived by the following procedure. For each of the hand-selected papers referenced by a domain, 100 neighbouring papers are retrieved using PubMed. If one of these neighbouring papers is referenced from more than two original papers, it is included in the secondary literature list.
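
The procedure can be sketched as a simple counting step, shown below in Python. The function name `secondary_literature` and the `neighbours_of` callable are hypothetical: `neighbours_of` stands in for whatever PubMed neighbour lookup is actually used and is not a real PubMed API call.

```python
from collections import Counter


def secondary_literature(primary_papers, neighbours_of, max_neighbours=100):
    """Return neighbouring papers reached from more than two primary papers."""
    counts = Counter()
    for paper in primary_papers:
        # Count each neighbour at most once per primary paper.
        for neighbour in set(neighbours_of(paper)[:max_neighbours]):
            counts[neighbour] += 1
    return sorted(paper for paper, n in counts.items() if n > 2)


# Hypothetical neighbour lists keyed by made-up paper labels.
neighbour_table = {
    "P1": ["N1", "N2", "N3"],
    "P2": ["N1", "N2"],
    "P3": ["N1", "N4"],
}
print(secondary_literature(["P1", "P2", "P3"], neighbour_table.__getitem__))  # ['N1']
```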

Seed Alignment

An alignment that contains only one of each pair of homologues that are linked, in a CLUSTALW-derived phylogenetic tree, by a branch of length less than 0.2 (see the related article).

SEG

A program of Wootton & Federhen [1] that detects regions of the query sequence that have low compositional complexity [2].

Sequence ID or ACC

Sequence identifiers or accession codes may be entered via the SMART homepage to initiate a query. You can use either UniProt or Ensembl sequence identifiers.

Signalling Domains

The original set of domains used in SMART was collected as those that satisfied one or both of two criteria:

1. cytoplasmic domains that possess kinase, phosphatase, ubiquitin ligase or phospholipase enzymatic activities or those that stimulate GTPase-activation or guanine nucleotide exchange
2. cytoplasmic domains that occur in at least two proteins with different domain organisations, of which one also contains a domain that satisfies criterion 1

These domains mostly mediate or regulate the transduction of an extracellular signal towards the nucleus resulting in the initiation of a cellular response. More recently, prokaryotic two-component signalling domains have been added to the SMART set.

SignalP

This program predicts the presence and location of signal peptide cleavage sites in amino acid sequences (SignalP home page).

SwissProt

The SwissProt database is an extensively annotated and non-redundant collection of protein sequences. SwissProt annotations have been mined for SMART-derived annotations of alignments.

TMHMM2

This program predicts the location and topology of transmembrane helices (TMHMM2).

Thresholds

For each of the domains found by SMART a combination of thresholds is used to distinguish between true and false hits. The different thresholds are described in the SMART paper.