BIOINFORMATICS
What is BIOINFORMATICS?
- Bioinformatics is conceptualizing biology in terms of macromolecules (in the sense of physical chemistry) and then applying "informatics" techniques (derived from disciplines such as applied math, computer science, and statistics) to understand and organize the information associated with these molecules on a large scale.
Aims of Bioinformatics:
In general, the aims of bioinformatics are three-fold.
- First, at its simplest, bioinformatics organizes data in a way that allows researchers to access existing information and to submit new entries as they are produced.
- The second aim is to develop tools and resources that aid in the analysis of data. For example, having sequenced a particular protein, it is of interest to compare it with previously characterized sequences.
- The third aim is to use these tools to analyse the data and interpret the results in a biologically meaningful manner. Traditionally, biological studies examined individual systems in detail and frequently compared them with a few related ones.
What is a Database?
- A database is a collection of structured, searchable and up-to-date data.
- A biological database contains biological data.
Classification of databases:
On the basis of accessibility:
- Public Database: The information is freely accessible to everybody, everywhere in the world.
- Private Database: The information is available in exchange for a subscription.
On the basis of type of information available:
- Primary Database: Contains information obtained directly from an experiment; also known as raw data.
- Primary Sequence Database: Raw data about the sequence of nucleotides (e.g. GenBank, EMBL, DDBJ) or proteins (e.g. PIR).
- Primary Structure Database: Raw data about the structure of nucleic acids (e.g. NDB) or proteins (e.g. PDB).
- Secondary Database: Contains information derived from primary databases.
- Secondary Sequence Database: Contains information derived from primary databases about the sequence of nucleotides or proteins, e.g. Pfam.
- Secondary Structure Database: Contains information derived from primary databases about the structure of nucleic acids or proteins, e.g. DSSP.
- Composite Database: Joins a variety of primary databases.
Some other types of Database:
- Pathway Database: A pathway database describes biochemical pathways, reactions, and enzymes, e.g. KEGG and BioCyc.
- Literature Database: Literature or bibliographic databases contain scientific articles or their abstracts. Search results usually give the author's name, title, publication and date (citation information). Some also provide the abstract of the article, e.g. PubMed.
What is Sequence Alignment?
Sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.
Why is sequence alignment necessary?
Often little is known about the function of a sequence from a genome sequencing program, but if similar sequences with known functional or structural information are found in a database, this information can be used as the basis for predicting the function or structure of the new sequence.
Essential concepts in Sequence Alignment:
- Score: The score of a sequence alignment reflects its overall quality. The higher the score, the greater the similarity between the two or more sequences. For example, a simple score consists of an additive contribution of +1 for every matching pair of letters in the alignment.
- Gaps: Gaps are introduced during sequence alignment to enable a better alignment of 2 sequences, one in which most of the aligned pairs are identical letters.
Let us see an example:
SEQ 1: AATTGATTGCCAATG
SEQ 2: AATTGA---CCAATG
If SEQ 1 is the ancestral sequence, then there has been a deletion of 3 nucleotides in SEQ 2.
If SEQ 2 is the ancestral sequence, then there has been an insertion of 3 nucleotides in SEQ 1.
Since we do not know which is the ancestral sequence, such gaps are called indels (insertions or deletions).
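To make the simple +1-per-match score concrete, here is a minimal Python sketch that scores the example alignment above and reports the length of each gap run. The function name `score_alignment` and the use of "-" as the gap character are assumptions made for illustration only.

```python
def score_alignment(row_a, row_b, gap="-"):
    """Return (matches, gap_runs) for two equal-length alignment rows.

    matches  : number of columns where both rows hold the same letter
               (+1 per matching pair, as in the simple score above)
    gap_runs : lengths of consecutive gapped columns; adjacent gaps in
               different rows are merged into one run for simplicity
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    matches = 0
    gap_runs = []
    run = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            run += 1          # still inside a gap (indel)
            continue
        if run:               # a gap run has just ended
            gap_runs.append(run)
            run = 0
        if a == b:
            matches += 1
    if run:                   # alignment ended inside a gap
        gap_runs.append(run)
    return matches, gap_runs


# The example alignment: 12 matching columns and one indel of length 3.
print(score_alignment("AATTGATTGCCAATG", "AATTGA---CCAATG"))  # (12, [3])
```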
Gap penalty:
Often, many gaps are introduced in order to align sequences. However, we need to control this, as insertion and deletion of monomers is a relatively slow evolutionary process, and alignments with a large number of gaps do not make biological sense. To solve this problem, gap penalties are used.
A gap penalty is subtracted from the score for each gap that is introduced.
Types of Gap penalty:
Constant Penalty:
Independent of gap length. Denoted by A, where the size of A controls how strongly gaps are penalized.
Proportional Penalty:
Dependent on gap length. Denoted by Bl, where the size of B controls how strongly gaps are penalized and l is the gap length.
Affine Penalty:
Has both constant and proportional contributions. Denoted by A+Bl, where A is the gap opening penalty, applied once to a gap of any length, and B is the gap extension penalty, applied for each unit by which the gap is extended, so that a gap of length l costs A+Bl in total.
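The three penalty types can be compared with a small Python sketch. The function names and the values A = 2 and B = 1 are assumptions chosen only for illustration. The 12 matches and the single gap of length 3 come from the SEQ 1 / SEQ 2 example, scored at +1 per match.

```python
def constant_penalty(length, a):
    """Constant penalty: A for any gap, whatever its length."""
    return a if length > 0 else 0


def proportional_penalty(length, b):
    """Proportional penalty: B times the gap length l."""
    return b * length


def affine_penalty(length, a, b):
    """Affine penalty: A + B*l, following the convention used in the text."""
    return a + b * length if length > 0 else 0


# Illustrative values only (assumed, not taken from any real scoring scheme).
A, B = 2, 1
matches, gap_length = 12, 3   # SEQ 1 vs SEQ 2 example, +1 per match

print("constant    :", matches - constant_penalty(gap_length, A))       # 12 - 2 = 10
print("proportional:", matches - proportional_penalty(gap_length, B))   # 12 - 3 = 9
print("affine      :", matches - affine_penalty(gap_length, A, B))      # 12 - 5 = 7
```

Note that some tools define the affine penalty as A + B(l-1) instead; the sketch follows the A+Bl form given here.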
- P-Value: The probability of obtaining an alignment score S or higher by pure chance when aligning 2 sequences. A low P-value means that a score this high is unlikely to arise by chance, which supports (but does not prove) a real relationship between the sequences, such as an evolutionary one.
- E-Value: The number of distinct alignments with score S or higher that are expected to occur by chance in a database search.
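As a rough illustration of how these two quantities relate, the sketch below uses the Karlin-Altschul form commonly associated with BLAST-style searches: E = K·m·n·e^(-λS) and P = 1 - e^(-E). This formula is not given in the notes above. The parameters K and λ depend on the scoring system, and the values used here are hypothetical placeholders, not real BLAST parameters.

```python
import math


def expected_hits(score, query_len, db_len, k, lam):
    """E-value: expected number of chance alignments with score >= `score`.

    Uses E = K * m * n * exp(-lambda * S); k and lam are assumed to be
    supplied by whoever calibrated the scoring system.
    """
    return k * query_len * db_len * math.exp(-lam * score)


def p_value_from_e_value(e_value):
    """P-value: probability of at least one chance alignment, 1 - exp(-E)."""
    return -math.expm1(-e_value)   # numerically stable form of 1 - exp(-E)


# Hypothetical parameters, for illustration only.
K, LAMBDA = 0.1, 0.3
for s in (30, 40, 50):
    e = expected_hits(s, query_len=100, db_len=1_000_000, k=K, lam=LAMBDA)
    print(s, e, p_value_from_e_value(e))
```

For small E the two numbers are nearly equal, and a higher score S always gives a lower E-value for the same search space.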
Types of sequence alignment:
Pairwise Alignment
Pairwise sequence alignment methods are used to find local or global alignments of two sequences. Pairwise alignments can only be used between two sequences at a time, but they are efficient to calculate and are often used for methods that do not require extreme precision. The three primary methods of producing pairwise alignments are dot-matrix methods, dynamic programming, and word methods.
Dot matrix method
To construct a dot-matrix plot, the two sequences are written along the top row and leftmost column of a two-dimensional matrix, and a dot is placed at every point where the character in that row matches the character in that column; this is a typical recurrence plot. The dot plot of two very closely related sequences will appear as a single line along the matrix's main diagonal.
- Advantages:
- A dot plot can be used to identify long regions of strong similarity between two sequences
- It produces a plot, which is easy to make and to interpret
- It can be used to compare very short or long sequences (even whole chromosomes – millions of bases)
- Disadvantages:
- It is necessary to find the best window size and threshold by trial and error
- A dot plot can only be used to compare 2 sequences, not more than 2
- It doesn’t tell you what mutations occurred in the region of similarity (if there is one) since the two sequences shared a common ancestor
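The dot-matrix construction described in this section can be sketched in a few lines of Python. The `window` and `threshold` parameters correspond to the window size and threshold mentioned above; the function names and default values are assumptions for illustration.

```python
def dot_plot(seq_rows, seq_cols, window=1, threshold=1):
    """Return the set of (i, j) positions that receive a dot.

    A dot is placed at (i, j) when the windows of length `window`
    starting at seq_rows[i] and seq_cols[j] agree in at least
    `threshold` positions. With window=1 and threshold=1 this is the
    plain "dot wherever the characters match" plot.
    """
    dots = set()
    for i in range(len(seq_rows) - window + 1):
        for j in range(len(seq_cols) - window + 1):
            agree = sum(seq_rows[i + k] == seq_cols[j + k] for k in range(window))
            if agree >= threshold:
                dots.add((i, j))
    return dots


def render(seq_rows, seq_cols, dots):
    """Draw the plot as text: '*' for a dot, '.' for an empty cell."""
    lines = ["  " + " ".join(seq_cols)]
    for i, ch in enumerate(seq_rows):
        cells = ("*" if (i, j) in dots else "." for j in range(len(seq_cols)))
        lines.append(ch + " " + " ".join(cells))
    return "\n".join(lines)


seq = "GATTACA"
print(render(seq, seq, dot_plot(seq, seq)))                         # noisy plot
print(render(seq, seq, dot_plot(seq, seq, window=3, threshold=3)))  # diagonal only
```

Comparing a sequence with itself shows the main diagonal; raising the window size and threshold removes the scattered off-diagonal dots that single-character matches produce.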
Dynamic programming algorithm method
Takes 2 sequences as input and produces as output the best-scoring alignment between them under the chosen scoring scheme.
Local Alignment
Finds short patches of sequence similarity between 2 sequences.
Algorithm: Smith–Waterman algorithm
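A compact Python sketch of Smith–Waterman local alignment is shown below. It assumes a simple scoring scheme (match +2, mismatch -1) and a proportional gap penalty of 2 per gapped position; these values and the function name are illustrative assumptions, and practical tools use substitution matrices and affine gaps instead.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere: this is what makes it local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Only the shared patch is reported; the unrelated flanks are ignored.
print(smith_waterman("CCCCGATTACACCCC", "TTGATTACATT"))  # (14, 'GATTACA', 'GATTACA')
```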
Global alignment
Aligns the 2 sequences in their entirety, from the left-most to the right-most residue.
Algorithm: Needleman–Wunsch algorithm
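The following Python sketch shows Needleman–Wunsch global alignment under an assumed scoring scheme of +1 per match, -1 per mismatch and a proportional gap penalty of 1 per gapped position. The function name and scores are illustrative choices rather than a standard parameter set.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # F[i][j] = best score for aligning the prefixes a[:i] and b[:j]
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        F[i][0] = i * gap            # a[:i] aligned against gaps only
    for j in range(1, cols):
        F[0][j] = j * gap            # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diag = F[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            F[i][j] = max(diag, F[i - 1][j] + gap, F[i][j - 1] + gap)

    # Trace back from the bottom-right corner to the top-left corner,
    # so that both sequences are covered from end to end.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return F[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


# Recovers the gapped alignment from the indel example: 12 matches - 3 gaps = 9.
score, top, bottom = needleman_wunsch("AATTGATTGCCAATG", "AATTGACCAATG")
print(score)
print(top)
print(bottom)
```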
Word method
Word methods are heuristic methods that are not guaranteed to find an optimal alignment, but are significantly more efficient than dynamic programming. These methods are especially useful in large-scale database searches, where it is understood that a large proportion of the candidate sequences will have essentially no significant match with the query sequence. Word methods are best known for their implementation in the database search tools FASTA and BLAST.
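The core idea behind word methods, finding short exact words shared by the query and a database sequence before doing any expensive alignment, can be sketched as follows. This shows only a simplified seeding step; it is not how FASTA or BLAST are actually implemented, and the function names and the word length k are assumptions for illustration.

```python
from collections import defaultdict


def index_words(sequence, k):
    """Map every word of length k in `sequence` to its start positions."""
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return index


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    index = index_words(subject, k)
    seeds = []
    for q_pos in range(len(query) - k + 1):
        word = query[q_pos:q_pos + k]
        for s_pos in index.get(word, []):
            seeds.append((q_pos, s_pos))
    return seeds


# A database sequence with no shared word is rejected without any alignment.
print(find_seeds("GATTACA", "CCGATTCC", k=4))  # [(0, 2)]
print(find_seeds("GATTACA", "CCCCCCCC", k=4))  # []
```

Sequences that yield seeds would then be passed on to a slower extension or dynamic programming step, which is where the speed advantage over aligning against every database entry comes from.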
BLAST:
The BLAST (Basic Local Alignment Search Tool) programs are a set of sequence comparison algorithms.
Applications:
Sequence Analysis
Sequence analysis uses sequencing information to determine which genes encode peptides and which regions act as regulatory sequences. Many powerful computational tools exist for analysing the genomes of various organisms. These tools can also detect DNA mutations in an organism and identify sequences that are related to one another. Shotgun sequencing techniques produce numerous fragments of DNA, and special software is used to find the overlaps between the fragments and assemble them.
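The fragment-overlap step can be illustrated with a very small Python sketch. It assumes error-free fragments from the same strand and looks only for exact suffix-prefix overlaps; real assembly software also handles sequencing errors, reverse complements and repeats. The function names and the `min_overlap` parameter are hypothetical.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no overlap of at least `min_overlap` letters exists.
    """
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_fragments(left, right, min_overlap=3):
    """Join two fragments on their overlap, or return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    return left + right[size:]


# Two overlapping fragments of AATTGATTGCCAATG are joined on the shared "TTG".
print(merge_fragments("AATTGATTG", "TTGCCAATG"))  # AATTGATTGCCAATG
```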
Prediction of Protein Structure
It is relatively easy to determine the primary structure of a protein, that is, its amino acid sequence, because this sequence is encoded in the DNA. It is much more difficult to determine the secondary, tertiary or quaternary structure of a protein. For this purpose, either crystallography is used to determine the structure experimentally, or bioinformatics tools are used to predict these complex protein structures.
Genome Annotation
In genome annotation, genomes are marked up to identify regulatory sequences and protein-coding regions. It is a very important part of the Human Genome Project, as it determines the regulatory sequences.
Comparative Genomics
Comparative genomics is the branch of bioinformatics that studies the relationships in genomic structure and function between different biological species. For this purpose, intergenomic maps are constructed, which enable scientists to trace the evolutionary processes that occur in the genomes of different species. These maps contain information about point mutations as well as about the duplication of large chromosomal segments.
Health and Drug discovery
The tools of bioinformatics are also helpful in drug discovery, diagnosis and disease management. Complete sequencing of human genes has enabled scientists to make medicines and drugs which can target more than 500 genes. Computational tools and well-characterized drug targets have made drug delivery easier and more specific, because only those cells that are diseased or mutated can now be targeted. It has also become easier to determine the molecular basis of a disease.