Optimization for data de-duplication algorithm based on file content

Frontiers of Optoelectronics - Volume 3 - Pages 308-316 - 2010

Xuejun Nie¹, Leihua Qin¹, Jingli Zhou¹, Ke Liu¹, Jianfeng Zhu¹, Yu Wang¹

¹School of Computer Science and Technology, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan, China

Abstract

Content-defined chunking (CDC) is a prevalent data de-duplication algorithm for removing redundant data segments in archival storage systems. Current research on CDC does not consider the distinct content characteristics of different file types; it determines chunk boundaries in a random way and applies a single strategy to all file types. It has been shown that such a method cannot achieve optimal performance for compound archival data. We analyze the content characteristics of different file types and propose the candidate anchor histogram (CAH) to capture them. We propose an improved strategy for determining chunk boundaries based on CAH and tune some key parameters of CDC based on the data layout of the underlying data de-duplication file system (TriDFS), which can efficiently store variable-sized chunks on fixed-sized physical blocks. These strategies are evaluated with representative archival data, and the results indicate that they increase the compression ratio by 16.3% and the write throughput by 13.7% on average, while decreasing the read throughput by only 2.5%.
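To make the terms in the abstract concrete, the following is a minimal sketch of generic content-defined chunking, not the authors' implementation. The abstract does not give the definition of CAH or the parameters tuned for TriDFS, so everything here is an assumption made for illustration: a simple polynomial rolling hash stands in for Rabin fingerprinting, the constants `WINDOW`, `BASE`, `MOD`, and `MASK` are arbitrary, and `anchor_gap_histogram` is only one plausible reading of a "candidate anchor histogram" (a count of the distances between consecutive candidate anchors).

```python
import hashlib
from collections import Counter

# All constants are illustrative assumptions, not values from the paper.
WINDOW = 16          # bytes covered by the sliding window
BASE = 257           # base of the polynomial rolling hash
MOD = (1 << 31) - 1  # modulus of the rolling hash
MASK = (1 << 6) - 1  # anchor pattern: low 6 bits zero (about one anchor per 64 bytes)


def candidate_anchors(data):
    """Return end offsets whose preceding WINDOW bytes hash to the anchor pattern."""
    anchors = []
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte that leaves the window
    h = 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # append the newest byte
        if i + 1 >= WINDOW and (h & MASK) == 0:
            anchors.append(i + 1)
    return anchors


def chunk(data, min_size=32, max_size=256):
    """Cut data at candidate anchors, subject to minimum and maximum chunk sizes."""
    chunks, start = [], 0
    for anchor in candidate_anchors(data):
        while anchor - start > max_size:  # no usable anchor in time: force a cut
            chunks.append(data[start:start + max_size])
            start += max_size
        if anchor - start >= min_size:    # anchors that come too early are skipped
            chunks.append(data[start:anchor])
            start = anchor
    while len(data) - start > max_size:
        chunks.append(data[start:start + max_size])
        start += max_size
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def anchor_gap_histogram(data, bucket=16):
    """Hypothetical CAH stand-in: bucketed counts of gaps between candidate anchors."""
    anchors = candidate_anchors(data)
    gaps = (b - a for a, b in zip(anchors, anchors[1:]))
    return Counter(gap // bucket * bucket for gap in gaps)


def dedup_store(chunks):
    """Keep one copy of each distinct chunk, keyed by its SHA-1 digest."""
    store = {}
    for c in chunks:
        store.setdefault(hashlib.sha1(c).hexdigest(), c)
    return store
```

Because each anchor depends only on the bytes inside its window, inserting data near the start of a file shifts the anchors of the unchanged content without destroying them, which is what lets CDC rediscover duplicate chunks after an edit. A histogram such as the one above differs between file types, which is the kind of signal that a per-type boundary strategy could exploit.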

