Posts Tagged ‘Bioinformatics’

What is GeneMark

Sunday, May 3rd, 2009

GeneMark, developed in 1993, was the first gene finding method recognized as an efficient and accurate tool for genome projects. GeneMark was used for annotation of the first completely sequenced bacterium, Haemophilus influenzae, and the first completely sequenced archaeon, Methanococcus jannaschii. The GeneMark algorithm uses species-specific inhomogeneous Markov chain models of protein-coding DNA sequence as well as homogeneous Markov chain models of non-coding DNA. Parameters of the models are estimated from training sets of sequences of known type. The major step of the algorithm computes the a posteriori probability that a sequence fragment carries the genetic code in one of six possible frames (including three frames in the complementary DNA strand) or is “non-coding”.
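The posterior computation described above can be illustrated with a toy sketch. This is not GeneMark's actual implementation: it assumes first-order chains only, uppercase sequences containing only A, C, G and T, training sequences that start at codon position 1, and a hypothetical prior of 0.5 for the non-coding hypothesis. The function names (train_markov, log_likelihood, frame_posteriors) and the frame labels (+1..+3, -1..-3) are invented for illustration.

```python
import math

BASES = "ACGT"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def train_markov(sequences, period, pseudocount=1.0):
    """Estimate first-order transition probabilities P_phase(b | a).

    period=3 gives a three-periodic (inhomogeneous) chain for coding DNA;
    period=1 gives an ordinary homogeneous chain for non-coding DNA.
    """
    counts = [{a: {b: pseudocount for b in BASES} for a in BASES}
              for _ in range(period)]
    for seq in sequences:
        for i in range(1, len(seq)):
            counts[i % period][seq[i - 1]][seq[i]] += 1
    model = []
    for phase in counts:
        model.append({a: {b: n / sum(row.values()) for b, n in row.items()}
                      for a, row in phase.items()})
    return model


def log_likelihood(seq, model, offset=0):
    """Log-probability of seq given a model, with base i in phase (i + offset).

    The probability of the first base is omitted; it is treated as equal
    under every hypothesis, so it cancels in the posterior.
    """
    period = len(model)
    return sum(math.log(model[(i + offset) % period][seq[i - 1]][seq[i]])
               for i in range(1, len(seq)))


def frame_posteriors(fragment, coding, noncoding, prior_noncoding=0.5):
    """Posterior over seven hypotheses: six coding frames plus non-coding."""
    log_joint = {"noncoding": math.log(prior_noncoding)
                 + log_likelihood(fragment, noncoding)}
    prior_frame = (1.0 - prior_noncoding) / 6
    strands = (("+", fragment), ("-", reverse_complement(fragment)))
    for strand, seq in strands:
        for offset in range(3):
            log_joint[f"{strand}{offset + 1}"] = (
                math.log(prior_frame) + log_likelihood(seq, coding, offset))
    # Normalize with log-sum-exp to avoid underflow on long fragments.
    top = max(log_joint.values())
    norm = top + math.log(sum(math.exp(v - top) for v in log_joint.values()))
    return {h: math.exp(v - norm) for h, v in log_joint.items()}
```

In practice a fragment would be scored in a sliding window along the genome, and higher-order chains trained on real annotated genes would replace the first-order toy model.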

Comparative Genomics Approaches of Gene prediction

Sunday, May 3rd, 2009

As the entire genomes of many different species are sequenced, a promising direction in current research on gene finding is a comparative genomics approach. This is based on the principle that the forces of natural selection cause genes and other functional elements to undergo mutation at a slower rate than the rest of the genome, since mutations in functional elements are more likely to negatively impact the organism than mutations elsewhere. Genes can thus be detected by comparing the genomes of related species to detect this evolutionary pressure for conservation. This approach was first applied to the mouse and human genomes, using programs such as SLAM, SGP and Twinscan/N-SCAN.
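As a rough illustration of the conservation idea, the sketch below scans two sequences that are assumed to be already aligned (equal length, with "-" marking gaps) and reports the fraction of identical positions in each window. It is not how SLAM, SGP or Twinscan work internally; the function names, the window size and the 0.9 threshold are hypothetical choices.

```python
def window_identity(aligned_a, aligned_b, window=10, step=1):
    """Return (start, identity) for each window over two aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    scores = []
    for start in range(0, len(aligned_a) - window + 1, step):
        pairs = zip(aligned_a[start:start + window],
                    aligned_b[start:start + window])
        # A gap aligned to a gap is not counted as a match.
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        scores.append((start, matches / window))
    return scores


def conserved_windows(aligned_a, aligned_b, window=10, step=1, threshold=0.9):
    """Keep only the window starts whose identity reaches the threshold."""
    return [start
            for start, identity in window_identity(aligned_a, aligned_b,
                                                   window, step)
            if identity >= threshold]
```

Windows with unusually high identity between related species are candidates for functional elements, although real comparative gene finders combine this signal with models of gene structure.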

Comparative gene finding can also be used to project high quality annotations from one genome to another. Notable examples include Projector, GeneWise and GeneMapper. Such techniques now play a central role in the annotation of all genomes.

Other Signals of Gene prediction

Sunday, May 3rd, 2009

Among the derived signals used for prediction are sub-sequence statistics such as k-mer statistics, the Fourier transform of a pseudo-number-coded DNA sequence, Z-curve parameters and certain run features.[2]
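The simplest of these signals, k-mer statistics, can be sketched in a few lines. The function names below are hypothetical, and the sketch counts overlapping k-mers on a single strand only.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping k-mer in a DNA sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_frequencies(sequence, k):
    """Convert k-mer counts to relative frequencies."""
    counts = kmer_counts(sequence, k)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}
```

A gene finder would typically compare such frequencies between known coding and non-coding regions and use the difference as a content measure.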

It has been suggested that signals other than those directly detectable in sequences may improve gene prediction. For example, the role of secondary structure in the identification of regulatory motifs has been reported.[3] In addition, it has been suggested that RNA secondary structure prediction helps splice site prediction.[4][5][6][7]

Ab Initio Approaches of Gene prediction

Sunday, May 3rd, 2009

Because of the inherent expense and difficulty in obtaining extrinsic evidence for many genes, it is also necessary to resort to ab initio gene finding, in which genomic DNA sequence alone is systematically searched for certain tell-tale signs of protein-coding genes. These signs can be broadly categorized as either signals, specific sequences that indicate the presence of a gene nearby, or content, statistical properties of protein-coding sequence itself. Ab initio gene finding might be more accurately characterized as gene prediction, since extrinsic evidence is generally required to conclusively establish that a putative gene is functional.

In the genomes of prokaryotes, genes have specific and relatively well-understood promoter sequences (signals), such as the Pribnow box and transcription factor binding sites, which are easy to systematically identify. Also, the sequence coding for a protein occurs as one contiguous open reading frame (ORF), which is typically many hundreds or thousands of base pairs long. The statistics of stop codons are such that even finding an open reading frame of this length is a fairly informative sign. (Since 3 of the 64 possible codons in the genetic code are stop codons, one would expect a stop codon approximately every 20-25 codons, or 60-75 base pairs, in a random sequence.) Furthermore, protein-coding DNA has certain periodicities and other statistical properties that are easy to detect in sequences of this length. These characteristics make prokaryotic gene finding relatively straightforward, and well-designed systems are able to achieve high levels of accuracy.
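The stop-codon arithmetic and the ORF scan can be sketched as follows. This is a minimal illustration, not a production gene finder: it assumes ATG as the only start codon, scans the forward strand only, and uses hypothetical function names and a hypothetical default of 100 codons as the minimum length.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def expected_codons_between_stops():
    """Mean spacing of stop codons if all 64 codons were equally likely."""
    return 64 / len(STOP_CODONS)  # 64 / 3, a little over 21 codons


def find_orfs(sequence, min_codons=100):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    start is the index of the A in ATG, end is one past the stop codon,
    and min_codons counts the codons before the stop codon.
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs
```

To cover all six reading frames, the same scan would be run again on the reverse complement of the sequence.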

Ab initio gene finding in eukaryotes, especially complex organisms like humans, is considerably more challenging for several reasons. First, the promoter and other regulatory signals in these genomes are more complex and less well-understood than in prokaryotes, making them more difficult to reliably recognize. Two classic examples of signals identified by eukaryotic gene finders are CpG islands and binding sites for a poly(A) tail.

Second, splicing mechanisms employed by eukaryotic cells mean that a particular protein-coding sequence in the genome is divided into several parts (exons), separated by non-coding sequences (introns). (Splice sites are themselves another signal that eukaryotic gene finders are often designed to identify.) A typical protein-coding gene in humans might be divided into a dozen exons, each less than two hundred base pairs in length, and some as short as twenty to thirty. It is therefore much more difficult to detect periodicities and other known content properties of protein-coding DNA in eukaryotes.

Advanced gene finders for both prokaryotic and eukaryotic genomes typically use complex probabilistic models, such as Hidden Markov Models, in order to combine information from a variety of different signal and content measurements. The GLIMMER system is a widely used and highly accurate gene finder for prokaryotes. GeneMark is another popular approach. Eukaryotic ab initio gene finders, by comparison, have achieved only limited success; notable examples are the GENSCAN and geneid programs. A few programs like CONTRAST also use machine learning approaches like support vector machines for successful gene prediction. The SNAP gene finder is HMM-based like GENSCAN and attempts to be more adaptable to different organisms, addressing problems related to using a gene finder on a genome sequence that it was not trained against.[1]

Extrinsic Approaches of Gene prediction

Sunday, May 3rd, 2009

In extrinsic (or evidence-based) gene finding systems, the target genome is searched for sequences that are similar to extrinsic evidence in the form of the known sequence of a messenger RNA (mRNA) or protein product. Given an mRNA sequence, it is trivial to derive a unique genomic DNA sequence from which it had to have been transcribed. Given a protein sequence, a family of possible coding DNA sequences can be derived by reverse translation of the genetic code. Once candidate DNA sequences have been determined, it is a relatively straightforward algorithmic problem to efficiently search a target genome for matches, complete or partial, and exact or inexact. BLAST is a widely used system designed for this purpose.
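The "family of possible coding DNA sequences" can be made concrete with a short sketch based on the standard genetic code. The function names are hypothetical, and enumerating every candidate is only practical for short peptides because the count grows multiplicatively with length.

```python
from itertools import product
from math import prod

# Standard genetic code, codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(_BASES, repeat=3), _AMINO)}

# Invert the table: amino acid -> list of codons that encode it.
CODONS_FOR = {}
for codon, aa in CODON_TABLE.items():
    CODONS_FOR.setdefault(aa, []).append(codon)


def mrna_to_dna(mrna):
    """DNA sequence (coding strand) corresponding to an mRNA: U becomes T."""
    return mrna.upper().replace("U", "T")


def count_reverse_translations(protein):
    """Number of distinct DNA sequences that encode the given protein."""
    return prod(len(CODONS_FOR[aa]) for aa in protein.upper())


def reverse_translations(protein):
    """Enumerate every DNA sequence encoding a (short) protein."""
    options = [CODONS_FOR[aa] for aa in protein.upper()]
    return ["".join(codons) for codons in product(*options)]
```

For example, methionine and tryptophan each have a single codon, while leucine and serine have six each, so even a three-residue peptide such as MLS already has 36 possible coding sequences. This is why protein-based searches usually translate the genome in all six frames rather than enumerating DNA candidates.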

A high degree of similarity to a known messenger RNA or protein product is strong evidence that a region of a target genome is a protein-coding gene. However, to apply this approach systematically requires extensive sequencing of mRNA and protein products. Not only is this expensive, but in complex organisms, only a subset of all genes in the organism’s genome is expressed at any given time, meaning that extrinsic evidence for many genes is not readily accessible in any single cell culture. Thus, in order to collect extrinsic evidence for most or all of the genes in a complex organism, many hundreds or thousands of different cell types must be studied, which itself presents further difficulties. For example, some human genes may be expressed only during development as an embryo or fetus, which might be difficult to study for ethical reasons.

Despite these difficulties, extensive transcript and protein sequence databases have been generated for humans as well as other important model organisms in biology, such as mice and yeast. For example, the RefSeq database contains transcript and protein sequences from many different species, and the Ensembl system comprehensively maps this evidence to human and several other genomes. It is, however, likely that these databases are both incomplete and contain small but significant amounts of erroneous data.

What is Gene prediction

Sunday, May 3rd, 2009

Gene finding typically refers to the area of computational biology that is concerned with algorithmically identifying stretches of sequence, usually genomic DNA, that are biologically functional. This especially includes protein-coding genes, but may also include other functional elements such as RNA genes and regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.

In its earliest days, “gene finding” was based on painstaking experimentation on living cells and organisms. Statistical analysis of the rates of homologous recombination of several different genes could determine their order on a certain chromosome, and information from many such experiments could be combined to create a genetic map specifying the rough location of known genes relative to each other. Today, with comprehensive genome sequence and powerful computational resources at the disposal of the research community, gene finding has been redefined as a largely computational problem.

Determining that a sequence is functional should be distinguished from determining the function of the gene or its product. The latter still demands in vivo experimentation through gene knockout and other assays, although frontiers of bioinformatics research are making it increasingly possible to predict the function of a gene based on its sequence alone.

Web services in Bioinformatics

Sunday, May 3rd, 2009

SOAP and REST-based interfaces have been developed for a wide variety of bioinformatics applications, allowing an application running on one computer in one part of the world to use algorithms, data and computing resources on servers in other parts of the world. The main advantage lies in the end user not having to deal with software and database maintenance overheads.[2] Basic bioinformatics services are classified by the EBI into three categories: SSS (Sequence Search Services), MSA (Multiple Sequence Alignment) and BSA (Biological Sequence Analysis). The availability of these service-oriented bioinformatics resources demonstrates the applicability of web-based bioinformatics solutions, which range from a collection of standalone tools with a common data format under a single, standalone or web-based interface, to integrative, distributed and extensible bioinformatics workflow management systems.

Software and tools of Bioinformatics

Sunday, May 3rd, 2009

Software tools for bioinformatics range from simple command-line tools to more complex graphical programs and standalone web services available from various bioinformatics companies or public institutions. The computational biology tool best known among biologists is probably BLAST, an algorithm for determining the similarity of arbitrary sequences against other sequences, possibly from curated databases of protein or DNA sequences. BLAST is one of a number of generally available programs for doing sequence alignment. The NCBI provides a popular web-based implementation that searches its databases.

Protein-protein docking of Bioinformatics

Sunday, May 3rd, 2009

In the last two decades, tens of thousands of protein three-dimensional structures have been determined by X-ray crystallography and protein nuclear magnetic resonance spectroscopy (protein NMR). One central question for the biological scientist is whether it is practical to predict possible protein-protein interactions based only on these 3D shapes, without doing protein-protein interaction experiments. A variety of methods have been developed to tackle the protein-protein docking problem, though it seems that there is still much work to be done in this field.

High-throughput image analysis of Bioinformatics

Sunday, May 3rd, 2009

Computational technologies are used to accelerate or fully automate the processing, quantification and analysis of large amounts of high-information-content biomedical imagery. Modern image analysis systems augment an observer’s ability to make measurements from a large or complex set of images, by improving accuracy, objectivity, or speed. A fully developed analysis system may completely replace the observer. Although these systems are not unique to biomedical imagery, biomedical imaging is becoming more important for both diagnostics and research. Some examples are:
high-throughput and high-fidelity quantification and sub-cellular localization (high-content screening, cytohistopathology)
morphometrics
clinical image analysis and visualization
determining the real-time air-flow patterns in breathing lungs of living animals
quantifying occlusion size in real-time imagery from the development of and recovery during arterial injury
making behavioral observations from extended video recordings of laboratory animals
infrared measurements for metabolic activity determination
inferring clone overlaps in DNA mapping, e.g. the Sulston score

