Analysis of similarity/dissimilarity of DNA sequences based on novel 2-D graphical representation

https://doi.org/10.1016/S0009-2614(03)00244-6

Abstract

The recently proposed 2-D graphical representation of DNA based on four horizontal lines involves an arbitrary assignment of the four types of bases to the lines. While each such assignment is legitimate, it is desirable to have a scheme free of such arbitrary choices among non-equivalent geometrical representations. We outline one such approach, which is based on the construction of a 12-component vector whose components are the leading eigenvalues of the L/L matrices associated with DNA. The examination of similarities/dissimilarities among the coding sequences of the first exon of β-globin gene of different species illustrates the utility of the approach.

Introduction

In recent years several authors have outlined different 2-D graphical representations of DNA sequences [1], [2], [3], [4], [5], [6]. Such representations make differences among related DNA sequences recognizable by visual inspection. More recently it was shown that graphical representations of a DNA sequence can lead to numerical characterizations of the sequence [7], [8]. The 2-D graphical representations involve an arbitrary assignment of the four types of bases (adenine (A), thymine (T), guanine (G), and cytosine (C)) to various geometrical alternatives, such as the four directions of the Cartesian coordinate axes or the four symmetry non-equivalent horizontal lines. Hence, the resulting numerical characterization is model dependent. In contrast, the 3-D graphical representation of DNA sequences [9], being based on four mutually equivalent tetrahedral directions, has the important advantage that the assignment of the four bases to the four tetrahedral directions involves no arbitrary decisions. The same is true for the 4-D representation of DNA recently introduced by Randić and Balaban [10].

Both 2-D and 3-D graphical representations of DNA are accompanied by some loss of information, because the curve representing DNA can overlap and cross itself. Consequently, the underlying DNA sequence cannot be reconstructed from such representations. In the case of the 2-D representations this deficiency has recently been corrected by modifying the directions of the vectors assigned to the four bases [5], [6]. The recently introduced 2-D graphical representation of a DNA sequence, in which A, T, G, and C are assigned to four horizontal lines [11], also allows the sequence to be reconstructed. In this representation, just as in the other 2-D representations mentioned, the numerical characterization of the sequence depends on how the individual lines are labeled. Hence, it is desirable to construct structural sequence descriptors that do not depend on the convention used to assign the four bases to the four horizontal lines.

In this Letter we outline a procedure that circumvents conventions for assigning the individual bases to the four horizontal lines. The procedure results in a numerical characterization of a DNA sequence that is independent of the labeling of the individual lines. We achieve this by viewing the numerical data obtained from the different assignments of the four types of bases to the four lines as components of a vector that characterizes a DNA sequence.

Section snippets

2-D graphical representation based on four horizontal lines

In Fig. 1 we show the 2-D graphical representation of the segment of human DNA, consisting of the first 20 bases of the coding sequence of the first exon of β-globin gene, based on the four horizontal lines separated by unit distance. We assigned (Fig. 1 top) A to the first line, T to the second one, G to the third one, and C to the last one. This is the order in which the four types of bases appear for the first time in the considered segment of DNA. The consecutive bases are represented by
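The assignment described above can be illustrated with a minimal sketch. It assumes that the base at position x of the sequence is drawn as a dot with horizontal coordinate x on the line assigned to that base; the horizontal spacing, the function names, and the short toy input are assumptions of this sketch, not taken from the text or from Fig. 1.

```python
def first_appearance_order(seq):
    """Return the distinct bases of seq in the order in which they first occur."""
    order = []
    for base in seq:
        if base not in order:
            order.append(base)
    return "".join(order)


def zigzag_points(seq, assignment="ATGC"):
    """Map every base of seq to a dot (x, y).

    y is the 1-based number of the horizontal line assigned to the base
    (lines are separated by unit distance). x is the 1-based position of
    the base in the sequence, which is an assumption of this sketch.
    """
    line_of = {base: line for line, base in enumerate(assignment, start=1)}
    return [(x, line_of[base]) for x, base in enumerate(seq, start=1)]


# Toy input chosen for illustration only; it is not the segment shown in Fig. 1.
toy = "ATGGTGCA"
assignment = first_appearance_order(toy)
points = zigzag_points(toy, assignment)
print(assignment)
print(points)
```

Connecting consecutive dots by straight segments would give a zigzag curve; passing a different four-letter string as `assignment` relabels the lines without changing the sequence.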

Characterization of DNA sequences with 12-component vectors

In order to numerically characterize a DNA sequence given by the 2-D graphical representation based on four horizontal lines one can associate with a corresponding zigzag curve a matrix and consider matrix invariants that are sensitive to the form of the curve [11]. One of the matrices which meet this condition is the L/L matrix (the length/length matrix) whose elements are defined as the quotient of the Euclidean distance between a pair of vertices (dots) of the zigzag curve and the sum of
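A rough sketch of how such a descriptor might be computed follows. The definition above is cut off before the denominator is stated, so the sketch assumes that the denominator is the length measured along the zigzag curve between the same two vertices; this is an assumption, as are the unit horizontal spacing of the dots, the function names, the use of power iteration, and the ordering of the vector components. The 24 orderings of the four bases fall into 12 mirror-image pairs with identical inter-vertex distances, which is one plausible reading of why the vector has 12 components.

```python
import math
from itertools import permutations


def zigzag_points(seq, assignment):
    """Dots (x, y): x = position in seq (assumed), y = line assigned to the base."""
    line_of = {base: line for line, base in enumerate(assignment, start=1)}
    return [(x, line_of[base]) for x, base in enumerate(seq, start=1)]


def ll_matrix(points):
    """Euclidean distance divided by the ASSUMED along-curve distance."""
    n = len(points)
    along = [0.0]  # cumulative curve length up to each vertex
    for p, q in zip(points, points[1:]):
        along.append(along[-1] + math.dist(p, q))
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            m[i][j] = m[j][i] = math.dist(points[i], points[j]) / (along[j] - along[i])
    return m


def leading_eigenvalue(m, iterations=500):
    """Largest eigenvalue of a symmetric non-negative matrix by power iteration."""
    n = len(m)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(row[k] * v[k] for k in range(n)) for row in m]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    w = [sum(row[k] * v[k] for k in range(n)) for row in m]
    return sum(a * b for a, b in zip(v, w))  # Rayleigh quotient of the unit vector v


def nonequivalent_assignments(bases="ATGC"):
    """Keep one ordering from each pair (ordering, reversed ordering)."""
    kept = []
    for perm in permutations(bases):
        label = "".join(perm)
        if label[::-1] not in kept:
            kept.append(label)
    return kept


def descriptor(seq):
    """Vector of leading eigenvalues, one per non-equivalent assignment."""
    return [leading_eigenvalue(ll_matrix(zigzag_points(seq, a)))
            for a in nonequivalent_assignments()]


# Toy input for illustration only.
vector = descriptor("ATGGTGCA")
print(len(vector))
```

The length of the vector returned by `descriptor` is fixed by the number of assignments and does not depend on the length of the input sequence.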

Similarities/dissimilarities among the coding sequences of the first exon of β-globin gene of different species

We will illustrate the use of the novel quantitative characterization of DNA sequences with the examination of similarities/dissimilarities among the 11 coding sequences of Table 1. A direct comparison of these sequences using computer codes is somewhat less straightforward due to the fact that the sequences have different lengths, from 86 bases (goat, bovine) to 105 bases (chimpanzee). If we represent the sequences with the corresponding 12-component vectors then different lengths of the

Acknowledgements

This work was supported in part by the Ministry of Science and Technology of the Republic of Croatia and in part by the Ministry of Education, Science and Sport of the Republic of Slovenia, contract: PI-0507 and PI-0508. M.R. thanks the National Institute of Chemistry, Ljubljana, Slovenia for cordial hospitality.

References (14)

  • X. Guo et al., Chem. Phys. Lett. (2001)
  • M. Randić et al., Chem. Phys. Lett. (2003)
  • E. Hamori, BioTechniques (1989)
  • A. Nandy, Curr. Sci. (1994)
  • P.M. Leong et al., Comput. Appl. Biosci. (1995)
  • A. Roy et al., J. Biosci. (1998)
  • Y. Liu et al., J. Chem. Inf. Comput. Sci. (2002)
There are more references available in the full text version of this article.
