Efficient Reconciliation of Unstructured and Structured Data Over Networks
Abstract
The unifying theme of this work is to provide new scalable solutions comprising algorithms, protocols, and data-structures, for solving data synchronization and set difference estimation problems. These problems are are repeatedly encountered in distributed systems and solving them efficiently directly affects the scalability of the distributed system, i.e., how many network hosts can participate in the distributed system. Our new solutions, if deployed, can significantly reduce communication, computational overhead, and meta-data stored on hosts as compared to currently used approaches for data synchronization and set difference estimation. Modern distributed network applications often utilize a wholesale data trans¬fer protocol known as "slow sync" for reconciling data on constituent hosts. This approach is markedly inefficient with respect to bandwidth usage, latency, and en¬ergy consumption. We analyze and implement a novel data synchronization scheme (CPlsync) whose salient property is that its communication complexity depends on the number of differences between the PDA and PC, and is essentially independent of the overall number of records. Moreover, our implementation shows that the computational complexity and energy consumption of CPlsync is practical, and that the overall latency is typically much smaller than that of slow sync or alter-native synchronization approaches based on Bloom filters. We also consider the related problem of exchanging two similar strings held by different hosts with a minimum amount of communication. Our approach involves transforming a string into a multi-set of substrings that are reconciled efficiently using the set recon¬ciliation algorithms, and then put back together on a remote host using recent graph-theoretic techniques. We present analyses, experiments and results to show that its communication complexity compares favorably to rsync. We also quantify the possible tradeoff between communication complexity and the computational complexity of our approach. We propose several protocols for estimating the number of differences between sets held on remote hosts. Our approaches are based on modifications of the count¬ing Bloom filter and are designed for minimizing communication and interaction between the hosts. As such, they are appropriate for streamlining a variety of com¬munication sensitive network applications, including data synchronization in mobile networks, gossip protocols and content delivery networks. We provide analytical bounds on the expected performance of our approach, together with heuristic im¬provements and then back our claims with experimental evidence that our protocols can outperform existing difference estimation techniques.