Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation
Introduction
In recent years several authors have outlined different 2-D graphical representations of DNA sequences [1], [2], [3], [4], [5], [6]. Such representations make differences among related DNA sequences easy to recognize by visual inspection. More recently it was shown that graphical representations of a DNA sequence can lead to numerical characterizations of the sequence [7], [8]. The 2-D graphical representations involve an arbitrary assignment of the four types of bases (adenine (A), thymine (T), guanine (G), and cytosine (C)) to various geometrical alternatives, such as the four directions of the Cartesian coordinate axes or four symmetry non-equivalent horizontal lines. Hence, the resulting numerical characterization is model dependent. In contrast, the 3-D graphical representation of DNA sequences [9], being based on four mutually equivalent tetrahedral directions, has an important advantage: the assignment of the four bases to the four tetrahedral directions does not involve arbitrary decisions. The same is true for the 4-D representation of DNA recently introduced by Randić and Balaban [10].
Both 2-D and 3-D graphical representations of DNA are accompanied by some loss of information, because the curve representing the DNA can overlap and cross itself. Consequently, it is not possible to reconstruct the underlying DNA sequence from such representations. In the case of the 2-D representations this deficiency has recently been corrected by modifying the directions of the vectors assigned to the four bases [5], [6]. The recently introduced 2-D graphical representation of a DNA sequence, in which A, T, G, and C are assigned to four horizontal lines [11], also allows reconstruction of the sequence. In this representation, just as in the other 2-D representations mentioned, the numerical characterization of the sequence depends on how the individual lines are labeled. Hence, it is desirable to construct structural sequence descriptors that do not depend on the convention used to assign the four bases to the four horizontal lines.
In this Letter we outline a procedure that circumvents the conventions for assigning individual bases to the four horizontal lines. The procedure results in a numerical characterization of a DNA sequence that is independent of the labeling of the individual lines. We achieve this by viewing the numerical data obtained from different assignments of the four types of bases to the four lines as components of a vector that characterizes a DNA sequence.
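As a rough illustration of how the different assignments might be enumerated, the following Python sketch lists the orderings of the four bases on the four lines. It assumes that an ordering and its top-to-bottom reversal produce mirror-image curves and are therefore counted once, which leaves 12 of the 24 permutations. This agrees with the 12-component vectors named in a later section heading, but the pairing rule is an assumption made here, and the function name `distinct_assignments` is hypothetical.

```python
from itertools import permutations

BASES = "ATGC"


def distinct_assignments(bases=BASES):
    """List orderings of the bases on lines 1..4, one per mirror-image pair.

    Assumption: an ordering and its reversal are treated as equivalent,
    so only the first member of each pair encountered is kept.
    """
    kept = []
    seen = set()
    for perm in permutations(bases):
        if perm[::-1] in seen:
            # The reversed ordering was already kept; skip its mirror image.
            continue
        seen.add(perm)
        kept.append("".join(perm))
    return kept


if __name__ == "__main__":
    for labeling in distinct_assignments():
        print(labeling)
```

Each string gives the base placed on the first, second, third, and fourth line, so one numerical descriptor per string would yield one component of the characterizing vector.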
Section snippets
2-D graphical representation based on four horizontal lines
In Fig. 1 we show the 2-D graphical representation of the segment of human DNA, consisting of the first 20 bases of the coding sequence of the first exon of β-globin gene, based on the four horizontal lines separated by unit distance. We assigned (Fig. 1 top) A to the first line, T to the second one, G to the third one, and C to the last one. This is the order in which the four types of bases appear for the first time in the considered segment of DNA. The consecutive bases are represented by
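The construction described above can be sketched in a few lines of Python. The sketch assumes that consecutive bases are placed at unit horizontal spacing and that the four lines sit at heights 1 to 4; the excerpt states only the unit separation of the lines, so the horizontal spacing and the vertical orientation are assumptions. The function name `zigzag_points` and the short input string are illustrative and are not data from the paper.

```python
def zigzag_points(sequence, assignment="ATGC"):
    """Return the (x, y) vertices of the zigzag curve for a DNA sequence.

    assignment[0] is the base on the first line, assignment[1] the base on
    the second line, and so on. Assumption: x advances by 1 per base and
    line k lies at height y = k.
    """
    level = {base: index + 1 for index, base in enumerate(assignment)}
    points = []
    for x, base in enumerate(sequence.upper(), start=1):
        if base not in level:
            raise ValueError(f"unexpected symbol {base!r} at position {x}")
        points.append((x, level[base]))
    return points


if __name__ == "__main__":
    # Illustrative input only, not a sequence taken from Table 1.
    for point in zigzag_points("ATGGTGCA"):
        print(point)
```

Because each base occupies its own x position, the sequence can be read back from the points, which reflects the reconstruction property noted for this representation.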
Characterization of DNA sequences with 12-component vectors
In order to numerically characterize a DNA sequence given by the 2-D graphical representation based on four horizontal lines one can associate with a corresponding zigzag curve a matrix and consider matrix invariants that are sensitive to the form of the curve [11]. One of the matrices which meet this condition is the L/L matrix (the length/length matrix) whose elements are defined as the quotient of the Euclidean distance between a pair of vertices (dots) of the zigzag curve and the sum of
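A minimal sketch of an L/L-type matrix and one matrix invariant follows. The sentence above is cut off before it names the denominator, so the code assumes the denominator is the path length measured along the curve between the two vertices; this is an assumption, not a statement of the authors' definition. The choice of the leading eigenvalue as the invariant, the power-iteration routine, and the function names are likewise illustrative.

```python
import math


def ll_matrix(points):
    """Build a symmetric matrix of distance ratios for an ordered vertex list.

    Assumption: entry (i, j) is the Euclidean distance between vertices i and j
    divided by the length of the curve between them; the diagonal is zero.
    """
    n = len(points)
    # Cumulative curve length from the first vertex to each vertex.
    cumulative = [0.0]
    for p, q in zip(points, points[1:]):
        cumulative.append(cumulative[-1] + math.dist(p, q))
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            ratio = math.dist(points[i], points[j]) / (cumulative[j] - cumulative[i])
            matrix[i][j] = matrix[j][i] = ratio
    return matrix


def leading_eigenvalue(matrix, iterations=500, tol=1e-12):
    """Estimate the largest eigenvalue of a symmetric non-negative matrix
    by power iteration with a Rayleigh quotient."""
    n = len(matrix)
    vector = [1.0 / math.sqrt(n)] * n
    value = 0.0
    for _ in range(iterations):
        product = [sum(row[k] * vector[k] for k in range(n)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            return 0.0
        vector = [x / norm for x in product]
        new_value = sum(
            vector[i] * sum(matrix[i][k] * vector[k] for k in range(n))
            for i in range(n)
        )
        if abs(new_value - value) < tol:
            return new_value
        value = new_value
    return value
```

Under this assumed definition every entry lies between 0 and 1, and entries for adjacent vertices equal 1, so the matrix responds to how strongly the curve bends rather than to its overall scale.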
Similarities/dissimilarities among the coding sequences of the first exon of β-globin gene of different species
We illustrate the use of the novel quantitative characterization of DNA sequences by examining similarities/dissimilarities among the 11 coding sequences of Table 1. A direct computational comparison of these sequences is less straightforward because the sequences have different lengths, from 86 bases (goat, bovine) to 105 bases (chimpanzee). If we represent the sequences with the corresponding 12-component vectors then different lengths of the
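Once every sequence is summarized by a vector of fixed length, sequences of different lengths can be compared component by component. The sketch below uses the Euclidean distance between vectors as the dissimilarity measure; the excerpt is truncated before it names a measure, so this choice is an assumption, and the function names are hypothetical.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two descriptor vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of components")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_table(vectors):
    """Return pairwise distances for a mapping of name -> descriptor vector.

    Each unordered pair appears once, keyed as (first_name, second_name)
    in the order the names occur in the mapping.
    """
    names = list(vectors)
    table = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            table[(first, second)] = euclidean_distance(vectors[first], vectors[second])
    return table
```

A smaller distance would indicate more similar sequences under this measure; any actual values depend on descriptor vectors computed from the sequences themselves.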
Acknowledgements
This work was supported in part by the Ministry of Science and Technology of the Republic of Croatia and in part by the Ministry of Education, Science and Sport of the Republic of Slovenia, contracts PI-0507 and PI-0508. M.R. thanks the National Institute of Chemistry, Ljubljana, Slovenia for cordial hospitality.
References (14)
- et al., Chem. Phys. Lett. (2001)
- et al., Chem. Phys. Lett. (2003)
- BioTechniques (1989)
- Curr. Sci. (1994)
- et al., Comput. Appl. Biosci. (1995)
- et al., J. Biosci. (1998)
- et al., J. Chem. Inf. Comput. Sci. (2002)