Entropy and Compressibility of Symbol Sequences

Document Type
Conference Object
Issue Date
Ebeling, Werner
Pöschel, Thorsten
Neiman, Alexander
Complex Systems Institute

The purpose of this paper is to investigate long-range correlations in symbol sequences using methods of statistical physics and nonlinear dynamics. Beside the principal interest in the analysis of correlations and fluctuations comprising many letters, our main aim is related here to the problem of sequence compression. In spite of the great progress in this field achieved in the work of Shannon, Fano, Huffman, Lempel, Ziv and others [1] many questions still remain open. In particular one must note that since the basic work by Lempel and Ziv the improvement of the standard compression algorithms is rather slow not exceeding a few percent per decade. One the other hand several experts expressed the idee that the long range correlations, which clearly exist in texts, computer programs etc. are not sufficiently taken into account by the standard algorithms [1]. Thus, our interest in compressibility is twofold: (i) We would like to explore how far compressibility is able to measure correlations. In particular we apply the standard algorithms to model sequences with known correlations such as, for instance, repeat sequences and symbolic sequences generated by maps. (ii) We aim to detect correlations which are not yet exploited in the standard compression algorithms and belong therefore to potential reservoirs for compression algorithms.

First, the higher-order Shannon entropies are calculated. For repeat sequences analytic estimates are derived, which apply in some approximation to DNA sequences too. For symbolic strings obtained from special nonlinear maps and for several long texts a characteristic root law for the entropy scaling is detected. Then the compressibilities are estimated by using grammar representations and several standard computing algorithms. In particular we use the Lempel-Ziv compression algorithm. Further the mean square fluctuations of the composition with respect to the letter content and several characteristic scaling exponents are calculated. We show finally that all these measuresare able to detect long-range correlations. However, as demonstrated by shuffling experiments, different measuring methods operate on different length scales. The algorithms based on entropy or compressibility operate mainly on the word and sentence level. The characteristic scaling exponents reflect mainly the long-wave fluctuations of the composition which may comprise a few hundreds or thousands of letters.


@inproceedings{EbelingPoeschelNeiman:1996, address = {Boston}, author = {Ebeling, Werner and P"oschel, Thorsten and Neiman, Alexander}, booktitle = {Proceedings of PhysComp96: Fourth Workshop on Physics and Computation, Boston, 22-24 November 1996}, editor = {Toffoli, Tommaso and Biafore, Michael and Le{~a}o, Jo{~a}o}, keywords = {TPauthor,reviewed}, organization = {Complex Systems Institute}, pages = {103-107}, title = {Entropy and compressibility of symbol sequences}, year = {1996}}

Faculties & Collections