Structural optimization of a full-text n-gram index using relational normalization

The VLDB Journal - Tập 17 - Trang 1485-1507 - 2007
Min-Soo Kim1, Kyu-Young Whang1, Jae-Gil Lee1, Min-Jae Lee1
1Department of Computer Science, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea

Tóm tắt

As the amount of text data grows explosively, an efficient index structure for large text databases becomes ever important. The n-gram inverted index (simply, the n-gram index) has been widely used in information retrieval or in approximate string matching due to its two major advantages: language-neutral and error-tolerant. Nevertheless, the n-gram index also has drawbacks: the size tends to be very large, and the performance of queries tends to be bad. In this paper, we propose the two-level n-gram inverted index (simply, the n-gram/2L index) that significantly reduces the size and improves the query performance by using the relational normalization theory. We first identify that, in the (full-text) n-gram index, there exists redundancy in the position information caused by a non-trivial multivalued dependency. The proposed index eliminates such redundancy by constructing the index in two levels: the front-end index and the back-end index. We formally prove that this two-level construction is identical to the relational normalization process. We call this process structural optimization of the n-gram index. The n-gram/2L index has excellent properties: (1) it significantly reduces the size and improves the performance compared with the n-gram index with these improvements becoming more marked as the database size gets larger; (2) the query processing time increases only very slightly as the query length gets longer. Experimental results using real databases of 1 GB show that the size of the n-gram/2L index is reduced by up to 1.9–2.4 times and, at the same time, the query performance is improved by up to 13.1 times compared with those of the n-gram index. We also compare the n-gram/2L index with Makinen’s compact suffix array (CSA) (Proc. 11th Annual Symposium on Combinatorial Pattern Matching pp. 305–319, 2000) stored in disk. Experimental results show that the n-gram/2L index outperforms the CSA when the query length is short (i.e., less than 15–20), and the CSA is similar to or better than the n-gram/2L index when the query length is long (i.e., more than 15–20).

Tài liệu tham khảo

Baeza-Yates, R., Navarro, G.: A practical q-gram index for text retrieval allowing errors. CLEI Electron. J. 1(2), (1998) Baeza-Yates, R., Navarro, G.: Block addressing indices for approximate text retrieval. J. Am. Soc. Inf. Sci. 51(1), 69–82 (2000) Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. ACM Press (1999) Barroso, L.A., Dean, J., Holzle, U.: Web search for a planet: the google cluster architecture. IEEE Micro 23(2), 22–28 (2003) Cao, X., Li, S.C., Tung, A.K.H.: Indexing DNA sequences using q-grams. In: Proc. Int’l Conf. on Database Systems for Advanced Applications (DASFAA). Beijing, pp. 4–16 (2005) Elmasri, R., Navathe, S.B.: Fundamentals of Database Systems, 4th edn. Addison Wesley (2003) Gao, J., Goodman, J., Li, M., Lee, K.: Toward a unified approach to statistical language modeling for Chinese. ACM Trans. Asian Lang. Inf. Process. (TALIP) 1(1), 3–33 (2002) Grossi, R., Vitter, J.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In: Proc. 32nd ACM Symposium on Theory of Computing (STOC), pp. 397–406 (2000) Karkkainen, J., Rao, S.: 7. Full-text indexes in external memory. In: Algorithms for Memory Hierarchies pp. 149–170 (2003) Karkkainen, J., Sutinen, E.: Lempel-Ziv index for q-grams. Algorithmica 21(1), 137–154 (1998) Karkkainen, J., Ukkonen, E.: Lempel-Ziv parsing and sublinear-size index structures for string mathcing. In: Proc. 3rd South American Workshop on String Processing (WSP), pp. 141–155 (1996) Kim, M., Whang, K., Lee, J.: n-gram/2L-approximation: a two-level n-gram inverted index structure for approximate string matching. J. Comput. Systems Sci. Eng. (2007) (to appear) Kim, M., Whang, K., Lee, J., Lee, M.: n-Gram/2L: a space and time efficient two-level n-gram inverted index structure. In: Proc. the 31th Int’l Conf. on Very Large Data Bases (VLDB), Trondheim, pp. 325–336 (2005) Kukich, K.: Techniques for automatically correcting words in text. ACM Comput Surv 24(4), 377–439 (1992) Lee, J.H., Ahn J.S.: Using n-grams for korean text retrieval. In: Proc. Int’l Conf. on Information Retrieval. ACM SIGIR, Zurich, pp. 216–224 (1996) Lehtinen, O., Sutinen, E., Tarhio, J.: Experiments on block indexing. In: Proc. 3rd South American Workshop on String Processing pp. 183–193 (1996) Makinen, V.: Compact suffix array. In: Proc. 11th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 305–319 (2000) Mayfield, J., McNamee, P.: Single N-gram stemming. In: Proc. Int’l Conf. on Information Retrieval. ACM SIGIR, Toronto, pp. 415–416 (2003) Melnik, S., Raghavan, S., Yang, B., Garcia-Molina, H.: Building a distributed full-text index for the Web. ACM Trans. Inf. Systems 19(3), 217–241 (2001) Miller, E., Shen, D., Liu, J., Nicholas, C.: Performance and scalability of a large-scale N-gram based information retrieval system. J. Digital Inf. 1(5), 1–25 (2000) Navarro, G., Baeza-Yates, R., Sutinen, E., Tarhio, J.: Indexing methods for approximate string matching. IEEE Data Eng Bull 24(4), 19–27 (2001) Navarro, G., Makinen, V.: Compressed full-text indexes. Technical report TR/DCC-2006-6, Department of Computer Science, University of Chile, (2006). (accepted to ACM Computing Surveys) Navarro, G., Sutinen, E., Tanninen, J., Tarhio, J.: Indexing text with approximate q-grams. In: Proc. 11th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 350–363 (2000) Puglisi, S., Smyth, W., Turpin, A.: Inverted files versus suffix arrays for locating patterns in primary memory. In: Proc. 13th Symposium on String Processing and Information Retrieval (SPIRE), Glasgow, pp. 122–133 (2006) Ramakrishnan, R.: Database Management Systems. McGraw-Hill, New York (1998) Scholer, F., Williams, H.E., Yiannis, J., Zobel, J.: Compression of inverted indexes for fast query evaluation. In: Proc. Int’l Conf. on Information Retrieval, ACM SIGIR, Tampere, pp. 222–229 (2002) Sutinen, E., Tarhio, J.: Filtration with q-samples in approximate string matching. In: Proc. 7th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 50–63 (1996) Ullman, J.D.: Principles of Database and Knowledge-Base Systems, Vol. I. Computer Science Press, USA (1988) Whang, K., Lee, M., Lee, J., Kim, M., Han, W.: Odysseus:a high-performance ORDBMS tightly-coupled with IR features. In: Proc. 21st IEEE Int’l Conf. on Data Engineering (ICDE), Tokyo, pp. 1104–1105, (2005) (this paper received the Best Demonstration Award) Williams, H.E.: Genomic information retrieval. In: Proc. 14th Australasian Database Conferences, Adelaide, pp. 27–35 (2003) Williams, H.E., Zobel, J.: Indexing and retrieval for genomic databases. IEEE Trans. Knowl. Data Eng. 14(1), 63–78 (2002) Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edn., Morgan Kaufmann (1999) Yasushi, O., Masajirou, I.: A new character-based indexing method using frequency data for Japanese documents. In: Proc. Int’l Conf. on Information Retrieval, pp. 121–129. ACM SIGIR, Seattle (1995) Zobel, J., Moffat, A.: Inverted files for text search engines. ACM Comput Surv 38(2), (2006)