This article shows how to deduplicate a string, given the string and a chunk size.
Case : Let's say you have a large text file. Each row contains an email ID and some other information (say, a product ID). Assume there are millions of rows in the file. How would you efficiently deduplicate the data?
Deduplication : The process that returns an intermediate string, which can later be used for reduplication, that is, rebuilding the original string.
Steps to follow :
1. Break the string into chunks of the given size (the chunk size can be, for example, 1 KB, 10 KB, and so on).
2. Find the unique chunks and make a note of where these chunks occur in the string.
3. The intermediate string should contain the unique chunks and their positions.
4. This string alone should be used to perform reduplication, which constructs the original string.
Example :
Input:
abcdexyzvwabcde
chunk size: 5 bytes
Output after deduplication:
abcde-0-2,xyzvw-1
Output after reduplication:
abcdexyzvwabcde
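The steps and example above can be sketched in Python as follows. This is only a minimal illustration, not the code from the repo: the function names `deduplicate` and `reduplicate` are hypothetical, the intermediate format is assumed to be `chunk-pos-pos,chunk-pos` with a dash before every position, and chunks are assumed not to contain the `-` or `,` separators. The chunk size is counted in characters here, which matches bytes only for single-byte text such as the example input.

```python
def deduplicate(text: str, chunk_size: int) -> str:
    """Return an intermediate string listing each unique chunk and its positions."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Maps each unique chunk to the chunk indexes where it occurs.
    # Dicts keep insertion order, so chunks appear in order of first occurrence.
    positions: dict[str, list[int]] = {}
    for index, start in enumerate(range(0, len(text), chunk_size)):
        chunk = text[start:start + chunk_size]  # the last chunk may be shorter
        positions.setdefault(chunk, []).append(index)
    # Assumed format: chunk and positions joined by "-", entries joined by ",".
    return ",".join(
        "-".join([chunk, *map(str, indexes)])
        for chunk, indexes in positions.items()
    )


def reduplicate(encoded: str) -> str:
    """Rebuild the original string from the intermediate string alone."""
    if not encoded:
        return ""
    placed: dict[int, str] = {}  # chunk index -> chunk text
    for entry in encoded.split(","):
        chunk, *indexes = entry.split("-")
        for index in indexes:
            placed[int(index)] = chunk
    # Chunk indexes are contiguous from 0, so walk them in order.
    return "".join(placed[i] for i in range(len(placed)))


if __name__ == "__main__":
    intermediate = deduplicate("abcdexyzvwabcde", 5)
    print(intermediate)               # abcde-0-2,xyzvw-1
    print(reduplicate(intermediate))  # abcdexyzvwabcde
```

For real data, where chunks can contain any character, a plain-text separator like this is ambiguous; a length-prefixed or structured encoding would be a safer choice for the intermediate string.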
Repo : Check out the working example Here