Erlang Exercise 19-3

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Erlang Exercise 19-3

カカキ
Hello, everyone.

I'm a little confused with an Exercise, and I cannot understand the question.

Here is the question.

Write a program to detect plagiarisms in text. To do this, use a two-pass algorithm. In pass 1, break the text into 40-character blocks and compute a checksum for each 40-character block. Store the checksum and filename in an ETS table. In pass 2, compute the checksum of each 40-character block in the data and compare with the checksums in the ETS table.

Hint: You will need to compute a  "rolling checksum" to do this. For example, if C1 = B1 + B2 + ... + B40 and C2 = B2 + B3+ ... + B41, then C2 can be quickly computed by obversing that C2 = C1 + B41 - B1.

What I've done is finding two files, say file1.txt and file2.txt. I want to check whether file2 plagiarizes file1. So I get first 40-characters of file1 and calculate the checksum of it. And then I move the block one character to the right and calculate the checksum. At last, I will get a list of checksum of file1, and I store them into an ETS table.

On pass two, I do the same thing with file2 and get a list of checksum [C1, C2, ..., Cn]. For each element, I check whether it exists in ETS table or not. If exists, it means that there is a block with the same content in file 1. So it is plagiarism. Finally, I count the number of these blocks and output it.

However, it seems that I haven't use the filename. So I am wondering that there is something wrong with my opinion.

Thanks.


_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Erlang Exercise 19-3

Richard O'Keefe
The question means something like this.
You are to construct and then use a table
   source : checksum -> set of filename
by doing
   for each filename given
      open the file
      for each 40-character window of the contents of the file
         compute the checksum of the window
         add filename to source{checkum}
      close the file

   s := {}
   open the test data
      for each 40-character window of the contents of the testdata
         compute the checksum of the window
         let s' be source{checksum} with default {}
         s := s U s'
      close the test data
   report the set of possible source file names s

Now the test is to accomplish, in Erlang, the same
effect as that would, without necessarily doing it
that way.

This is actually a fairly simplistic version of an
Information Retrieval task called Passage Retrieval.
With a little more work, you can not only report
that the test data may contain portions of various
files, but *which* portions.  With a bit more work
than that, you can rank them, distinguishing files
that contribute many passages from ones that contribute
few.  The neat thing here is that, done with care,
you can handle very large document collections in
Erlang without as much work as you might have to do in
languages lacking a native-data-structures persistent store.

When you look at this functionally, you see that this is
basically a Map (over files) Reduce (the set unions) and
that the index building could be done in a distributed
way, again with relatively little effort.  (The set
unions I'm talking about are the "add ... to ..." ones.
This is idempotent, so indexing the same document on two
nodes is wasteful but not dangerous.)


On Wed, 31 Oct 2018 at 18:55, カカキ <[hidden email]> wrote:
Hello, everyone.

I'm a little confused with an Exercise, and I cannot understand the question.

Here is the question.

Write a program to detect plagiarisms in text. To do this, use a two-pass algorithm. In pass 1, break the text into 40-character blocks and compute a checksum for each 40-character block. Store the checksum and filename in an ETS table. In pass 2, compute the checksum of each 40-character block in the data and compare with the checksums in the ETS table.

Hint: You will need to compute a  "rolling checksum" to do this. For example, if C1 = B1 + B2 + ... + B40 and C2 = B2 + B3+ ... + B41, then C2 can be quickly computed by obversing that C2 = C1 + B41 - B1.

What I've done is finding two files, say file1.txt and file2.txt. I want to check whether file2 plagiarizes file1. So I get first 40-characters of file1 and calculate the checksum of it. And then I move the block one character to the right and calculate the checksum. At last, I will get a list of checksum of file1, and I store them into an ETS table.

On pass two, I do the same thing with file2 and get a list of checksum [C1, C2, ..., Cn]. For each element, I check whether it exists in ETS table or not. If exists, it means that there is a block with the same content in file 1. So it is plagiarism. Finally, I count the number of these blocks and output it.

However, it seems that I haven't use the filename. So I am wondering that there is something wrong with my opinion.

Thanks.

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions

_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions
Reply | Threaded
Open this post in threaded view
|

Re: Erlang Exercise 19-3

Joe Armstrong-2
I'd have two types of entries in  the ets table
first a set of files
{{file,1},"/path/to/file1"}
{{file,2},"/path/to/file2"}
and a set of hashes
{{<Hash>, {<FileNumber>,<ByteOffset>}}

<Hash> is your 40 character Hash, <FileNumber> is 1,... (ie one of the
numbers in the table)
<ByteOffset> is the offset in the file that has this hash.

You compute this table in pass one iterating over all files in your collection

In pass 2 you compute the rolling hash and look at each value in the
ets table. Don't bother to store all the rolling hashes in a list. If
you get a hit try to extend the match to see how much more data
matches.

The sum of bytes hash is actually rather bad and will give a lot of
false hits (ie the hash agrees but the
 text differs) If you want a better rolling hash algorithm use one of
the methods from this reference
https://en.wikipedia.org/wiki/Rolling_hash

Cheers

/Joe
On Wed, Oct 31, 2018 at 9:29 AM Richard O'Keefe <[hidden email]> wrote:

>
> The question means something like this.
> You are to construct and then use a table
>    source : checksum -> set of filename
> by doing
>    for each filename given
>       open the file
>       for each 40-character window of the contents of the file
>          compute the checksum of the window
>          add filename to source{checkum}
>       close the file
>
>    s := {}
>    open the test data
>       for each 40-character window of the contents of the testdata
>          compute the checksum of the window
>          let s' be source{checksum} with default {}
>          s := s U s'
>       close the test data
>    report the set of possible source file names s
>
> Now the test is to accomplish, in Erlang, the same
> effect as that would, without necessarily doing it
> that way.
>
> This is actually a fairly simplistic version of an
> Information Retrieval task called Passage Retrieval.
> With a little more work, you can not only report
> that the test data may contain portions of various
> files, but *which* portions.  With a bit more work
> than that, you can rank them, distinguishing files
> that contribute many passages from ones that contribute
> few.  The neat thing here is that, done with care,
> you can handle very large document collections in
> Erlang without as much work as you might have to do in
> languages lacking a native-data-structures persistent store.
>
> When you look at this functionally, you see that this is
> basically a Map (over files) Reduce (the set unions) and
> that the index building could be done in a distributed
> way, again with relatively little effort.  (The set
> unions I'm talking about are the "add ... to ..." ones.
> This is idempotent, so indexing the same document on two
> nodes is wasteful but not dangerous.)
>
>
> On Wed, 31 Oct 2018 at 18:55, カカキ <[hidden email]> wrote:
>>
>> Hello, everyone.
>>
>> I'm a little confused with an Exercise, and I cannot understand the question.
>>
>> Here is the question.
>>
>> Write a program to detect plagiarisms in text. To do this, use a two-pass algorithm. In pass 1, break the text into 40-character blocks and compute a checksum for each 40-character block. Store the checksum and filename in an ETS table. In pass 2, compute the checksum of each 40-character block in the data and compare with the checksums in the ETS table.
>>
>> Hint: You will need to compute a  "rolling checksum" to do this. For example, if C1 = B1 + B2 + ... + B40 and C2 = B2 + B3+ ... + B41, then C2 can be quickly computed by obversing that C2 = C1 + B41 - B1.
>>
>> What I've done is finding two files, say file1.txt and file2.txt. I want to check whether file2 plagiarizes file1. So I get first 40-characters of file1 and calculate the checksum of it. And then I move the block one character to the right and calculate the checksum. At last, I will get a list of checksum of file1, and I store them into an ETS table.
>>
>> On pass two, I do the same thing with file2 and get a list of checksum [C1, C2, ..., Cn]. For each element, I check whether it exists in ETS table or not. If exists, it means that there is a block with the same content in file 1. So it is plagiarism. Finally, I count the number of these blocks and output it.
>>
>> However, it seems that I haven't use the filename. So I am wondering that there is something wrong with my opinion.
>>
>> Thanks.
>>
>> _______________________________________________
>> erlang-questions mailing list
>> [hidden email]
>> http://erlang.org/mailman/listinfo/erlang-questions
>
> _______________________________________________
> erlang-questions mailing list
> [hidden email]
> http://erlang.org/mailman/listinfo/erlang-questions
_______________________________________________
erlang-questions mailing list
[hidden email]
http://erlang.org/mailman/listinfo/erlang-questions