| Date: | 13-03-2017 |
|---|---|
| Deadline: | 22-03-2017 23:59 |
You must implement a hash function API, an array API and a program that uses a bloom filter to detect duplicates in its input.
You must write a program dups which can process a very large number of lines on its standard input (containing possibly billions of input lines) and reproduces its input on its standard output without any duplicates. The program may drop a small fraction of the input which are not duplicates.
Your solution must use a Bloom filter to detect duplicate items.
The new aspect in this assignment is that there is not one single good answer: a part of your grade will be proportional to "how well" your program is able to reproduce the original input without duplicates.
Your program must accept multiple positional arguments on the command line:
The input items will be provided as zero or more separate lines of text on the standard input. The input can be arbitrarily large: you cannot assume a maximum number of lines in the input. Each line of input will contain only printable characters, in particular there will be no nul character in the input, and will contain at most 1024 characters.
Your program must print out the unique input items as-is on its standard output, as soon as possible after they are read. A unique input item is an item that has a low probability to have been read before (since the program started). For example in the following input:
hello world hello jane
There are 3 unique input items, and the corresponding expected output is:
hello world jane
Because when "hello" is encountered the 2nd time, it was seen before with a high probability (in this case, the probability is 1), so it is skipped.
When there are no more input lines available to read (if that ever happens), Your program must then:
For example with the input above the complete output would be:
hello world jane 4 3
Informatics only: While you are iterating in step 6, keep a journal of your results and your intermediate implementations to present in your PAV report.
Your grade starts from 0, and the following tests determine your grade:
Copyright © 2017, Raphael Poss. Permission is granted to distribute, reuse and modify this document and other documents for the Systems Programming course by the same author according to the terms of the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/.