In my last post, I discussed the nature and definition of concepts and how our solution is built to find concepts in any language. Whether it’s a concept you’ve defined or an example from a paragraph, we extract a fingerprint for analysis, scoring etc. But to find a concept, first you must be able to describe the concept you’re looking for.

We know one when we read one, but learning to describe a concept isn’t as easy as it sounds

We enable you to create and teach intelligent agents to “read” documents in order to rank and find similar concepts. The first step in the process is to describe the concept in a simple text box. Starting with an empty text box can be surprisingly difficult, especially if you don’t have an example. When one user asked us how to describe the concept of anger or dishonesty, we had to take a step back and rethink the user experience.

The legal concept is expressed indirectly, by the act or consequence

The objective of the review attorney is to find evidence of wrongdoing in corporate documents, emails and social media. The critical evidence is rarely a clear statement like “I just gave my friend inside information on our earnings announcement so they could trade ahead of our disclosure.”

People don’t recite the definition of the crime when they talk about it. The language used is always more subtle and disguised. It also varies dramatically from one context to another (email vs. interview transcript for example). Tweets and text messages are full of acronyms, slang, phrases and partial sentences. The guilty party is usually aware of the act and tries to avoid being discovered. He is more likely to say, “Hi Dave, here are some stats you might find interesting.” Is he talking about the company earnings report or his fantasy football team? If he says, “we’ll have a big surprise for you tomorrow”, is it a surprise birthday party or a merger announcement?

Kant’s tree concept (again)

To underscore this point, let’s revisit the example of the tree concept from my last post. The concept of a tree distilled down (abstracted) from descriptions of many trees is clear enough. The challenge in eDiscovery and many (most) language processing problems is that we are looking for the indirect effect or consequences resulting from the existence or actions of the tree.

This is easier to understand with examples: “The fall colors in New England are beautiful this time of year,” or “We need to get some shade for the yard at the nursery.” The concept of the tree is there, but if I were searching for trunks, roots, branches and leaves, the “criminal” tree would escape detection!
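To make the point concrete, here is a minimal sketch of a literal keyword search run against those two sentences. The keyword list and the third sentence are made up for illustration, and this is not how our agents work; it only shows why matching on the parts of a tree misses the indirect mentions.

```python
import re

# Hypothetical keyword list describing the literal parts of a tree.
TREE_KEYWORDS = {"tree", "trees", "trunk", "trunks", "root", "roots",
                 "branch", "branches", "leaf", "leaves"}

def keyword_hits(text):
    """Return the literal tree keywords that appear in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return words & TREE_KEYWORDS

sentences = [
    "The fall colors in New England are beautiful this time of year.",
    "We need to get some shade for the yard at the nursery.",
    "The old oak dropped its leaves across the yard.",  # invented direct mention
]

for sentence in sentences:
    print(sorted(keyword_hits(sentence)), "<-", sentence)
```

Only the invented third sentence produces a hit. The two indirect examples, which are clearly about trees to a human reader, return nothing.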

Context is critical: sometimes simple, often complex

Capturing context for your search can be as simple as limiting the analysis to documents exchanged between specific individuals during the periods when they had access to the information, resources and counterparts needed to commit wrongdoing. This is easily done during agent review by limiting the documents to that period or set of individuals. Document metadata and entity extraction are commonly used for this purpose.
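As a rough sketch of that kind of metadata filter, the snippet below keeps only documents sent between people of interest inside a date window. The `Document` fields, the names and the dates are all hypothetical; real review platforms expose metadata in their own formats.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Document:
    # Hypothetical metadata fields; real collections will differ.
    sender: str
    recipients: tuple
    sent_on: date
    text: str

def in_scope(doc, people, start, end):
    """True if the document was sent within [start, end] and both the
    sender and at least one recipient are people of interest."""
    in_window = start <= doc.sent_on <= end
    between_people = doc.sender in people and any(r in people for r in doc.recipients)
    return in_window and between_people

docs = [
    Document("alice", ("dave",), date(2015, 3, 2), "Here are some stats you might find interesting."),
    Document("alice", ("mom",), date(2015, 3, 3), "See you on Sunday."),
    Document("alice", ("dave",), date(2014, 1, 10), "Fantasy football draft tonight?"),
]

people = {"alice", "dave"}
window = (date(2015, 1, 1), date(2015, 6, 30))

scoped = [d for d in docs if in_scope(d, people, *window)]
print([d.text for d in scoped])
```

Only the first message survives the filter: the second goes to someone outside the set of individuals, and the third falls outside the time window.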

Where possible, we recommend extracting context from the source documents to ensure that the more complex contextual factors are incorporated automatically into the agent. For example, the same person in the same period does not use the same language on Twitter as they do in email. Nor do they use the same language with their mother as with their girlfriend.

We are all guilty of injecting our own bias and filters into understanding language. Good technical solutions capture the richness and subtlety in the context of the communications and ensure consistency of review.

We will be testing many variations on the above themes but would love to hear from the experts (you). What’s the toughest concept you’ve ever had to find? What’s the most difficult concept you’ve ever had to train someone to find? How did you do it?