Pattern identification method using relational arrays in DNA sequences

Keywords: Sequential pattern mining, frequent contiguous patterns, biological sequences, DNA sequences, bioinformatics

Abstract

Biological sequences carry a significant amount of the genetic information of living organisms, and analyzing them can help biologists better understand those organisms. Discovering frequent patterns in a group of DNA sequences has become one of the major challenges in applying data mining techniques. Obtaining frequent sequential patterns takes considerable time and effort when the methods used are based on Apriori, such as GSP or Key-segment. This paper proposes a sequence mapping-based algorithm designed to speed up the search for contiguous frequent patterns in a group of DNA sequences. Experiments are reported on several sets of DNA sequences, obtained from a biological database, whose total length varies between 1000 and 5000 nucleotides. In these experiments, the proposed algorithm mined frequent patterns in DNA sequences faster than other related algorithms.
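
To make the task concrete, the following is a minimal sketch of contiguous frequent pattern mining for a fixed pattern length. It is a naive baseline and not the algorithm proposed in the paper, whose mapping structure is not described in the abstract. The function name `frequent_contiguous_patterns`, its parameters, and the toy sequences are assumptions made for illustration. Support is taken here to mean the number of sequences that contain a pattern at least once.

```python
from collections import defaultdict


def frequent_contiguous_patterns(sequences, length, min_support):
    """Return {pattern: support} for contiguous patterns of a fixed length.

    Support is the number of input sequences that contain the pattern
    at least once (an assumed definition, chosen for this sketch).
    """
    # Map every substring of the requested length to the set of
    # sequence indices in which it occurs.
    occurrences = defaultdict(set)
    for seq_id, seq in enumerate(sequences):
        for start in range(len(seq) - length + 1):
            occurrences[seq[start:start + length]].add(seq_id)

    # Keep only the patterns found in at least min_support sequences.
    return {
        pattern: len(ids)
        for pattern, ids in occurrences.items()
        if len(ids) >= min_support
    }


# Toy input, invented for illustration only.
toy_sequences = ["ACGTACGT", "TTACGTAA", "GGACGTCC"]
result = frequent_contiguous_patterns(toy_sequences, length=4, min_support=3)
print(result)  # {'ACGT': 3}
```

Because the patterns are contiguous, a single pass over each sequence enumerates every candidate of a given length, so no Apriori-style candidate generation is needed in this sketch.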

Published
2022-08-31
How to Cite
Sobrevilla-Solis, V. I., Franco-Árcega, A., Garcia-Islas, L. H., Rueda-Soriano, E., López-Morales, V., & Suárez-Cansino, J. (2022). Pattern identification method using relational arrays in DNA sequences. Pädi Boletín Científico De Ciencias Básicas E Ingenierías Del ICBI, 10(Especial3), 22-29. https://doi.org/10.29057/icbi.v10iEspecial3.8928