Cool stuff in search

October 1st, 2007 by Andrew Jenks

This is a bit off-topic, but it is still somewhat relevant to us in e-discovery. I was poking around the PBS website and ran across a cool search feature. PBS has done a good job of integrating ‘search inside video’ into their website and offering it as a standard option. I was playing around with the News Hour search, and you can search the last seven years of video right there on the site. Not only does it give you the episode that you’re looking for, it will also take you to the section in the transcript AND the section in the video that you’re interested in.

Video search is big business and there are lots of companies going after this market, but why aren’t the major networks offering this as well? Most shows are closed captioned, so I would guess the text is there. When thinking about this type of technology, you can easily apply it to e-discovery and litigation support all around. Imagine the following scenario: during a deposition a reference is made to a document. On the fly, that document is found on a review platform. The closely related documents are brought up along with a timeline and references to other depositions. Integrating multiple search technologies with the power of a review platform and trial prep in this manner would be groundbreaking. I’m sure people are working on it, but our friends at PBS have a nifty tool in the meantime.

Just some cool findings for this Monday.


When does close count?

September 4th, 2007 by Andrew Jenks

They say close only counts in horseshoes and hand grenades. Could it count in e-discovery? Near-dupes are all the rage today, and we’re on the bandwagon. A couple of weeks ago Discovery Mining announced the ability to use near-duplicate technology for your review. The thought in the market is that it will speed up your review and allow reviewers to make more document decisions per hour, or that it can serve as a filtering option before you even begin review, which in my opinion is a possible mistake.

The methodology can really work to your advantage during a review, allowing you to mark or find “close” documents that are not strict duplicates by MD5 hash. I’ll try to break it down as best I can to explain what we’ve done at DM with the near-dupe algorithm. Traditional de-duplication methodology takes the entire document and hashes it for de-duplication; this works but is very exact, because any difference at all produces a different hash. Near-dupes allow you to take a document that is almost the same and treat it like a duplicate.
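To make the “very exact” part concrete, here is a minimal sketch of hash-based de-duplication in Python. It is an illustration only, not DM’s pipeline: the helper names `md5_of` and `exact_dupe_groups` and the sample documents are made up for the example.

```python
import hashlib


def md5_of(text: str) -> str:
    """Return the hex MD5 digest of a document's full text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def exact_dupe_groups(docs: dict) -> dict:
    """Group document ids by the MD5 of their text (hypothetical helper)."""
    groups = {}
    for doc_id, text in docs.items():
        groups.setdefault(md5_of(text), []).append(doc_id)
    return groups


# Sample documents invented for illustration.
docs = {
    "a": "Football pool meeting at 3pm to go over the rules.",
    "b": "Football pool meeting at 3pm to go over the rules.",
    "c": "Football pool meeting at 3pm to go over the rules!",
}

groups = exact_dupe_groups(docs)
dupes = sorted(ids for ids in groups.values() if len(ids) > 1)
print(dupes)  # [['a', 'b']]
```

Document “c” differs by a single punctuation mark, so its hash is different and it is not grouped with the other two.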

Here’s the way a near-dupe would work with an email example: the office football pool. One email goes out to the 15 people participating, announcing a 3pm meeting to go over the rules. Thirteen people respond with “yes I will attend”, one person “has a conflict”, and one tells everyone “you don’t have a chance so why even have the meeting”. There are 15 “exact” dupes of the email, 13 “almost exact” dupes in response to it, and 2 “kinda dupes”. Near-duplicate technology allows you to mark all 30 documents as irrelevant, if the office football pool is not part of your review, and move on. In traditional exact de-duplication methodology you would only eliminate or mark the 15 “exact dupes” and would run across the other 15 responses during the review. As you can see, near-dupes make you twice as fast when you review the football pool email.

This is getting to that hand grenade approach for review. Make one decision and see all the adjacent docs get marked with the same attribute. A near-dupe is calculated from the distinguishing words that make up a document. These unique words are then searched across the collection to score and rank other documents that contain exactly these unique words or a combination of them. The higher the score, the closer the duplicate. There’s a whole bunch of geeky math that goes into it, but those are the basics. (If you want the math stuff, send me an email.)
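For the curious, here is a rough sketch of that scoring idea in Python. It is not the DM algorithm. I am assuming that “distinguishing words” simply means everything left after dropping a small stop-word list, and that the score is the share of those words found in the other document. The names `STOP_WORDS`, `distinguishing_words` and `rank_near_dupes` are invented for the example.

```python
import re

# Assumption: a tiny stop-word list stands in for real term weighting.
STOP_WORDS = {"a", "at", "for", "go", "have", "i", "over", "the", "to", "will", "with"}


def words(text):
    """Lowercased set of alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def distinguishing_words(text):
    """Words that remain once stop words are removed."""
    return words(text) - STOP_WORDS


def rank_near_dupes(doc_id, collection):
    """Score every other document by the share of doc_id's distinguishing words it contains."""
    key = distinguishing_words(collection[doc_id])
    scores = []
    for other_id, text in collection.items():
        if other_id == doc_id:
            continue
        score = len(key & words(text)) / len(key) if key else 0.0
        scores.append((other_id, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Sample collection invented for illustration.
collection = {
    "invite": "Office football pool meeting at 3pm to go over the rules",
    "yes": "Yes I will attend. Office football pool meeting at 3pm to go over the rules",
    "conflict": "I have a conflict with the football pool meeting at 3pm",
    "memo": "Quarterly budget review for the finance team",
}

for other_id, score in rank_near_dupes("invite", collection):
    print(other_id, round(score, 2))
```

The reply that quotes the whole invite scores highest, the “conflict” reply lands in the middle, and the unrelated memo scores zero. Note that this score is one-directional: it measures how much of the source document shows up in the candidate, not the reverse.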

The one danger point is the score that the algorithm assigns to the “near-dupe”. The wider the threshold, the more will be seen as a dupe. If you’re filtering before review based on this method, this is the part you need to pay attention to. This is why I feel it’s tough to pre-process based on near-dupes. The vendors (DM included) that offer up the ability to review and auto-mark near-dupes give you the best of both worlds. A common pitfall when using near-dupes is in contract language. Many companies use the same unique words, specific terms even, for many different contracts across divisions. In this case near-dupes may catch contracts that do not even relate to each other, but are only off by a few items. What happens when the only difference in a contract is price? This doc would be considered a near-dupe, but the amount of the contract is the important fact. I say put it all up for review and let the tool guide you as to what needs to be marked or not. That’s the DM perspective.
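Here is a small Python sketch of the contract problem. As a stand-in for a real near-dupe score I am using plain Jaccard similarity on word sets, which is my assumption and not what any particular vendor does; the two contracts and the thresholds are made up.

```python
def tokens(text):
    """Lowercased set of whitespace-separated tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Shared tokens divided by all tokens seen in either text."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


# Two invented contracts that differ only in price.
contract_a = "Supplier agrees to deliver 500 units to Buyer for a total price of 10,000 dollars"
contract_b = "Supplier agrees to deliver 500 units to Buyer for a total price of 75,000 dollars"

score = jaccard(contract_a, contract_b)
print(round(score, 3))

# Widening the threshold pulls this pair into the near-dupe bucket.
for threshold in (0.95, 0.90, 0.80):
    print(threshold, score >= threshold)

# The only tokens that differ are the two prices.
print(sorted(tokens(contract_a) ^ tokens(contract_b)))
```

At a strict threshold the pair stays separate; loosen it to 0.80 and the two contracts are treated as the same document, even though the price is the one thing that matters.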

E-discovery keeps making strides to allow human review to happen at a faster rate. This is one more method to allow reviewers to move through the increasing mountain of data generated by litigation. At the end of the day, it’s really a judgment call on your part: how are you going to get through all these documents in the shortest time frame? Near-dupes are only one part of the equation, but can eliminate your need to look at an extra 25-30% of the collection. I still would caution against eliminating the near-dupes from the reviewable collection. Instead you can use this technology to give you, the reviewer, the ability to make the call as to whether it’s truly a near-dupe or not.


Clamoring for Features

July 25th, 2007 by Sam Carter

Guest Blogger: Sam Carter, Director of Product Management, Discovery Mining, Inc.

As Director of Product Management, an essential part of my job is prioritizing feature development for our application. Sure, I spend a lot of time writing specifications, looking over developers’ shoulders at in-progress implementations and providing input, but at the core it’s about prioritization, and mapping that to our vision. And why? Well, with so many innovative ideas brewing inside the company along with client feedback, it makes for quite a selection!

In order to make sense of the mountain of features, I use several broad guidelines. The overarching guideline is to determine the specific need the feature is expected to address:

  • Features that prevent undesirable events or results from recurring
  • Features that prevent human error - automation of repetitive tasks that do not require human involvement
  • Features that clients / prospects are requesting
  • Features that are high value and map to our vision
  • Features that reduce support burden
  • Features that increase our service to clients

The first category of features tends to trump everything else, because our goal is to experience one type of problem only once. But prioritizing features in the remaining categories can be a bit trickier. An additional company goal is to build a general purpose product and in order to meet that objective we have to strike a delicate balance across general purpose, market, client, and prospect requirements.

If we cater too extensively to existing customers, then we run the risk that our product becomes too specialized to their needs and does not solve the problems of the general market as a whole. But at the same time, we learn from our customers’ input and want to meet client needs. This is where the skill of driving toward a vision while at the same time meeting immediate needs comes in.

Our philosophy has always been to enable computers to do the ‘heavy lifting’ in the document review process. Moving from paper-based review to native document review is just the first step towards eliminating the tedium. We want to automate as many of the repetitive aspects as possible. Our ultimate objective is designing software for electronic data discovery that combines human intellect with computer automation, brute speed and processing capability.


Litigation is not flat

July 13th, 2007 by Andrew Jenks

As the world gets flat, litigation needs to stay local. It may seem counter-intuitive, but the technology designed to support document-intensive litigation may hold the key to dealing with cross-jurisdictional disputes. E-discovery software, which provides secure collection of required records and centralized storage, together with rapid sorting, access, and retrieval, can satisfy local requirements when it is paired with a presence that has strong local cultural and legal experience.

U.S.-based multinationals are confronted with differing demands regarding the proper handling of data in Europe. These companies, which rely on their local technology partners when dealing with discovery in the U.S., are faced with a new layer of complexity when foreign subsidiaries are involved. Foreign information privacy laws must be respected as part of U.S. discovery; the European Union has very strict rules regarding the movement of electronic data containing any personal information, so the “go-to” partners in the U.S. may not be as helpful if they have no local presence. In this case business may be global, but litigation is still local. Local law requires local expertise and local language support. The test for law firms and technology providers, particularly e-discovery companies, is to offer a local interface to a capability proven in the rigors of the U.S. system. In striking the necessary balance between U.S. and non-U.S. privacy requirements, local experts in key markets and multi-lingual software are mandatory parts of the solution.

Forward-thinking vendors are serving their clients better by supporting each market in which their clients do business. Presence and expertise in the local business culture will be a significant asset for any company seeking to take advantage of global markets. The way I see it, you’ll need to ask the vendors out there to answer the following questions:

  • Technology: Can this technology be implemented in an environment outside the US? If it can, how easily can it scale?
  • People: Do you have the experts, who can not only navigate discovery, but also the local culture, laws and business community? Does the team represent a real bridge to the expertise created in the United States?
  • Physical Location: Do you have a presence in strategic areas, designated by privacy or other data protection laws?
  • Languages Supported: Do you speak the language? Can you process the language?
  • Partnerships: Are there strategic relationships in place or possible (with local experts, for example) in the event they can contribute to enhancing credibility and trust?
  • Certification: Are you sailing into a “Safe Harbor?”

Going global isn’t for everyone and it’s tough to navigate these business challenges. However, as corporations and the firms that support them go global, selecting a vendor that can scale with you is an important decision.


Is your vendor skimming an extra 10% off the top?

June 8th, 2007 by Sam Carter

Guest Blogger: Sam Carter, Director of Product Management, Discovery Mining, Inc.

Like nearly all vendors in this market, we handle our pricing on a per gigabyte basis, where we bill based on the size of your data. Bear with me because I’m going to put on my geek hat for this, but I’d like to talk about how some binary math affects your wallet. Something that’s a little funny about computers is that they like to count in powers of 2, instead of 10 like you might be familiar with in everyday math. This means that most software considers a kilobyte to be 1024 bytes (that’s 2 to the 10th power), not 1000 bytes like you might expect. Similarly, one megabyte is usually 1024×1024 bytes (or 1,048,576), rather than 1000×1000 (or 1,000,000). So if you ask Microsoft Windows to calculate the size of a folder, and it tells you that you’ve got 1 TB of data, what it actually means is that you’ve got 1,099,511,627,776 bytes in that folder.

Some clever (and disreputable) fellows working in the hard drive business realized that they could take advantage of this ambiguity about the actual size, and claim that they’re selling 100 GB hard drives, even though Windows thinks that they are only 93.13 GB in size. Various consumers have objected to being swindled and filed suit (see for example Willem Vroegh v. Eastman Kodak Company, Case No. CGC-04-428953 in Superior Court in San Francisco, or Orin Safier v. Western Digital Corporation, et al., Case No. CGC-05-442812 also in the Superior Court in San Francisco, later moved to the Northern District of California, Case No. 05-03353 BZ), but the hard drive makers have chosen to settle cheaply, and to continue marketing their drives as bigger than what the rest of the computer industry considers their size to be. If Microsoft Windows tells you that it’s 1 TB of data, then we’re going to bill you for 1 TB of data. Even though the actual size is 1,099,511,627,776 bytes, we’re not going to call it 1.099 TB, and try to squeeze you for the extra 10%. You might want to ask yourself if the other vendors are doing the same thing.
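If you want to check the arithmetic yourself, here is a quick Python sketch. The constant names are mine; the numbers are the ones quoted above.

```python
# Binary units, as most software (including Windows) counts them.
KB = 2 ** 10
MB = 2 ** 20
GB = 2 ** 30
TB = 2 ** 40

print(f"{KB:,}")  # 1,024
print(f"{MB:,}")  # 1,048,576
print(f"{TB:,}")  # 1,099,511,627,776

# A drive marketed as "100 GB" using decimal gigabytes (10**9 bytes each).
marketed_bytes = 100 * 10 ** 9
print(round(marketed_bytes / GB, 2))  # 93.13

# How much bigger a binary terabyte is than a decimal one.
overage = TB / 10 ** 12 - 1
print(f"{overage:.1%}")  # 10.0%
```

The gap grows with each unit: about 2.4% at the kilobyte level, and roughly 10% by the time you are talking about terabytes.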


Can you support Glagolitic?

May 25th, 2007 by Andrew Jenks

The Glagolitic alphabet was invented during the 9th century by the missionaries St Cyril (827-869 AD) and St Methodius (826-885 AD) in order to translate the Bible and other religious works into the language of the Great Moravia region. What does that have to do with e-discovery? Believe it or not, everything, if you care about Unicode.

There are different revisions of Unicode, each adding or improving writing systems to be used within the electronic world. Within Unicode there is a concept of encoding, and encoding is where things can get crazy. Unicode assigns each character a number; an encoding is the way in which the computer stores those numbers as a series of bytes. It leaves the display up to the software (browser, office doc, email client) and does a nice job of separating the two. There are limitations to encoding documents this way, but I’m not going into that now. You can check the Wikipedia page if you want more info.
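To see what that means in practice, here is a small Python sketch that takes one Glagolitic letter and stores it three different ways. The choice of letter and of these three encodings is just for illustration.

```python
import unicodedata

# U+2C00 is the first letter in the Glagolitic block.
letter = "\u2c00"
print(unicodedata.name(letter))  # GLAGOLITIC CAPITAL LETTER AZU
print(hex(ord(letter)))          # 0x2c00

# Same character, three different byte sequences on disk.
for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    data = letter.encode(encoding)
    print(encoding, data.hex(" "), len(data), "bytes")

# Decoding with the matching encoding gets the letter back.
assert letter.encode("utf-8").decode("utf-8") == letter
```

One character, one number, but three byte counts (3, 2 and 4). Software that guesses the wrong encoding when reading those bytes will show you garbage instead of the letter.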

Back to our problem: what do you need to look for in a vendor that can support Unicode? Well, there are different standards that are all considered “Unicode”.


Throwdown!

May 23rd, 2007 by Andrew Jenks

Let’s talk numbers. Last night I attended the sf geekSessions event about scaling web apps, specifically Ruby on Rails web apps. We don’t develop in Ruby around here, but the general takeaway applicable to our business/industry was measurement. We’re dealing with A LOT of data. Taking hundreds of gigabytes of unstructured data, putting structure around it and building an inventory is hard. Keeping it all in one single repository, with sub-second search and a dynamic set of attributes/tags/folders (you name it), is VERY hard.

In an industry that throws around terabytes like they are floppy disks, I find it amazing that there is no real sense of measurement. Where are the metrics? Why don’t we, the vendors, put our stats out there for all the world to see? My guess: it’s not as rosy a picture as we’d like, and we’d have to stick by it. I see it all the time: cases that start small and get big take systems to the edge of usability.

What we are dealing with is not a one-solution-fits-all problem. Each case is different and sometimes it makes no sense to get a vendor involved. When you do, however, it’s important that you know where the strengths and weaknesses are. A weakness in one area should not be a deal breaker if it’s not relevant to the case, but you should know what you’re getting before it’s too late.

Let’s start with a simple one: Document Account Ceiling

I’m going to define it this way:
The maximum number of documents in a single customer account (matter) that can be searched with a single term, from a single interface, and return results in under one minute.

There, that wasn’t so hard, was it? In computer science circles around the world they would say that any search that takes a minute to return is broken, but it’s a good starting point. (I’ll be posting on this subject later.)
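Here is a rough Python sketch of how you might measure a Document Account Ceiling. Everything in it is hypothetical: `timed_search` wraps whatever single-term search call your platform exposes, and the latency figures in `measurements` are invented placeholders, not real results from any vendor.

```python
import time


def timed_search(search, term):
    """Run search(term) once and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    search(term)
    return time.perf_counter() - start


def document_account_ceiling(latency_by_doc_count, limit_seconds=60.0):
    """Largest tested document count before the first one that misses the limit."""
    ceiling = None
    for doc_count in sorted(latency_by_doc_count):
        if latency_by_doc_count[doc_count] >= limit_seconds:
            break
        ceiling = doc_count
    return ceiling


# Timing a toy in-memory "search" just to show the helper runs.
toy_docs = ["football pool rules", "budget memo", "football schedule"]
elapsed = timed_search(lambda term: [d for d in toy_docs if term in d], "football")
print(elapsed >= 0.0)  # True

# Invented placeholder measurements: document count -> seconds for one search.
measurements = {1_000_000: 0.8, 5_000_000: 12.0, 20_000_000: 95.0}
print(document_account_ceiling(measurements))  # 5000000
```

In real life you would fill `measurements` by running `timed_search` against a single matter at several sizes, ideally more than once per size, since one timing run can be noisy.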

I’d be interested to hear what you’ve got to say so add comments: the good, bad, and not so good.


We’re on the same team

May 22nd, 2007 by Andrew Jenks

Like most vendors in the space, we are typically dealing with large cases that carry a high level of stress and anxiety. It always amazes me when clients, however right or wrong they are, take to using the vendor as a pin cushion. For some reason it is “right” to blame the vendor, and we take the heat for things that are mostly out of our control.


Hello world!

May 21st, 2007 by Andrew Jenks

This is the new blog of Discovery Mining’s take on the e-discovery market.