When does close count?

September 4th, 2007 by Andrew Jenks

They say close only counts in horseshoes and hand grenades. Could it count in e-discovery? Near-dupes are all the rage today, and we’re on the bandwagon. A couple of weeks ago Discovery Mining announced the ability to use near-duplicate technology in your review. The thought in the market is that it will speed up your review by letting reviewers make more document decisions per hour, or that it can serve as a filtering option before you even begin review, which in my opinion is a possible mistake.

The methodology can really work to your advantage during a review, allowing you to mark or find “close” documents without relying on the strict duplication of the MD5 hash. I’ll try to break down what we’ve done at DM with the near-dupe algorithm. Traditional de-duplication takes the entire document and hashes it; this works, but it is very exact: change a single character and the hash no longer matches. Near-dupes allow you to take a document that is almost the same and treat it like a duplicate.
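To make the "very exact" part concrete, here is a minimal Python sketch of hash-based exact de-duplication. It is not DM's code; the helper names (`md5_of`, `exact_dedupe`) and the three sample documents are made up for illustration.

```python
import hashlib

def md5_of(text: str) -> str:
    """Return the MD5 hex digest of a document's full text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def exact_dedupe(docs: dict) -> list:
    """Group document IDs whose text hashes to the same MD5 value."""
    groups = {}
    for doc_id, text in docs.items():
        groups.setdefault(md5_of(text), []).append(doc_id)
    return list(groups.values())

# Hypothetical sample data: "c" differs from "a" and "b" by one character.
docs = {
    "a": "Meeting at 3pm to go over the rules.",
    "b": "Meeting at 3pm to go over the rules.",
    "c": "Meeting at 3pm to go over the rules!",
}

groups = exact_dedupe(docs)
print(groups)  # "a" and "b" share a group; "c" lands in its own
```

Documents "a" and "b" collapse into one group, while "c" is treated as a completely different document even though a reviewer would call it the same email.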

Here’s the way a near-dupe would work with an email example: the office football pool. One email goes out to the 15 people participating, announcing a 3pm meeting to go over the rules. 13 people respond with “yes I will attend”, 1 person “has a conflict”, and 1 tells everyone “you don’t have a chance so why even have the meeting”. There are 15 “exact” dupes of the email, 13 “almost exact” dupes in response to the email, and 2 “kinda dupes”. Near-duplicate technology allows you to mark all 30 documents as irrelevant, if the office football pool is not part of your review, and move on. With traditional exact de-duplication you would only eliminate or mark the 15 “exact dupes” and then run across the other 15 responses during the review. As you can see, near-dupes let one decision cover twice as many of the football pool emails.

This is getting to that hand grenade approach for review. Make one decision and see all the adjacent docs get marked with the same attribute. A near-dupe is calculated from the distinguishing words that make up a document. These unique words are then searched across the collection to score and rank other documents that contain exactly those words, or some combination of them. The higher the score, the closer the duplicate. There’s a whole bunch of geeky math that goes into it, but those are the basics. (If you want the math stuff, send me an email.)
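For readers who want a feel for the scoring idea without the geeky math, here is a rough Python sketch. It is a simplification and not the DM algorithm: it assumes "distinguishing words" are simply whatever is left after dropping a tiny, made-up stop-word list, and it scores each other document by the fraction of those words it contains. The names `STOP_WORDS`, `distinguishing_words`, and `rank_near_dupes` are hypothetical.

```python
import re

# Assumption: a tiny hand-picked stop list stands in for real term weighting.
STOP_WORDS = {"a", "an", "and", "at", "go", "i", "is", "of",
              "over", "the", "to", "will"}

def tokens(text: str) -> set:
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def distinguishing_words(text: str) -> set:
    """Words of a document that are not in the stop list."""
    return tokens(text) - STOP_WORDS

def rank_near_dupes(seed_id: str, docs: dict) -> list:
    """Score every other document by the share of the seed's
    distinguishing words it contains, highest score first."""
    key = distinguishing_words(docs[seed_id])
    scores = []
    for doc_id, text in docs.items():
        if doc_id == seed_id:
            continue
        score = len(key & tokens(text)) / len(key) if key else 0.0
        scores.append((doc_id, score))
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical sample collection.
docs = {
    "seed": "Office football pool meeting at 3pm to go over the rules",
    "reply1": "Yes I will attend. > Office football pool meeting at 3pm to go over the rules",
    "reply2": "Quarterly budget review is due Friday",
}

ranked = rank_near_dupes("seed", docs)
print(ranked)  # reply1 scores 1.0, reply2 scores 0.0
```

The reply that quotes the original email contains every distinguishing word and ranks at the top; the unrelated budget email shares none and falls to the bottom.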

The one danger point is the score that the algorithm assigns to the “near-dupe”. The wider the threshold, the more documents will be seen as dupes. If you’re filtering before review based on this method, this is the part you need to pay attention to, and it is why I feel it’s tough to pre-process based on near-dupes. The vendors (DM included) that offer the ability to review and auto-mark near-dupes give you the best of both worlds. A common pitfall when using near-dupes is contract language. Many companies use the same unique words, even the same specific terms, for many different contracts across divisions. In this case near-dupes may catch contracts that do not even relate to each other, but differ by only a few items. What happens when the only difference in a contract is price? This doc would be considered a near-dupe, but the amount of the contract is the important fact. I say put it all up for review and let the tool guide you as to what needs to be marked or not. That’s the DM perspective.
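The contract problem is easy to reproduce. The sketch below uses plain word-set overlap (Jaccard similarity) as a stand-in for a near-dupe score; that is a swapped-in technique chosen for brevity, not the DM scoring, and the contract text, the helper names, and both threshold values are invented for illustration.

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two texts: shared words / all words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0

def is_near_dupe(a: str, b: str, threshold: float) -> bool:
    """Treat two documents as near-dupes when their score meets the threshold."""
    return jaccard(a, b) >= threshold

# Hypothetical contracts that differ only in price.
contract_a = ("Supplier agrees to deliver 500 units to Buyer "
              "for a total price of $10,000 payable in 30 days")
contract_b = ("Supplier agrees to deliver 500 units to Buyer "
              "for a total price of $90,000 payable in 30 days")

score = jaccard(contract_a, contract_b)
loose = is_near_dupe(contract_a, contract_b, threshold=0.80)   # wide net
strict = is_near_dupe(contract_a, contract_b, threshold=0.95)  # narrow net
print(round(score, 2), loose, strict)
```

With the looser cutoff the two contracts are lumped together and the price difference never reaches a reviewer; with the stricter one they stay separate. Neither setting knows that the price is the fact that matters, which is the argument for letting a person make the final call.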

E-discovery keeps making strides to allow human review to happen at a faster rate. This is one more method to help reviewers move through the increasing mountain of data generated by litigation. At the end of the day, it’s really a judgment call on your part: how are you going to get through all these documents in the shortest time frame? Near-dupes are only one part of the equation, but they can eliminate your need to look at an extra 25-30% of the collection. I would still caution against eliminating the near-dupes from the reviewable collection. Instead, you can use this technology to give you, the reviewer, the ability to make the call as to whether it’s truly a near-dupe or not.

Posted in De-dupe, Technology

