Spring 2016 PSC Presentation
Xpriori is a privately held company headquartered in Denver and Colorado Springs, CO. The company has developed proprietary technologies, products and services for the Information Governance (“IG”), eDiscovery, compliance and review markets since 2006 as well as for XML-based data management. As part of its product and service offerings, Xpriori offers services based upon automated advanced semantically based as well as textual and visual similarity Clustering technologies, together with best of breed predictive coding, concept search and other proprietary and leading IG and eDiscovery solutions.
Xpriori handles large scale IG, data migration and eDiscovery projects. The overarching objectives are not limited to the classification of legacy content (business related documents and email). The objectives also include the development and codification of sustainable day forward procedures and methodologies deployable into a set of scalable, automated processes to handle newly added data as well as legacy information. By persisting the code associated with the clustering processes and applying it to newly ingested data, Xpriori enables this as an automated process. The goal of Xpriori's IG service offerings is to build a model tailored to its customer's processes and to enable fully automated processes.
Your organization creates and receives a very large volume of documents, emails and unstructured information. Unless you associate that information with your projects and business processes, and classify and cull it in a timely way for project use and/or storage, you will have, and no doubt already have: (1) unknown liabilities, (2) a mismatch of information to operational state, and (3) untold inefficiencies that sap your profits. We can help you automate classification and review processes so that the right information is matched with your business processes and workflows in a timely way. We can help you respond in an orderly way to project, compliance and litigation challenges. We do so with a minimum of direct labor and in a manner that does not throw out existing infrastructure and does not disrupt ongoing operations.
What we offer is unique: Data content self organizes and informs us, not the other way around!
Clustering algorithms now help us understand data categories in ways heretofore unachievable when working with larger data sets or in the cloud. Clusters can be created without human definition for review and culling or accepting. Much manual effort is avoided and all content is considered.
We offer the ability to view an organization’s enterprise content from a perspective that is completely different from the conventional approach to content. Before algorithms could help vast quantities of data self-contextualize, people would come at their data from a pre-defined point such as a key word search, Boolean inquiry, or predictive coding. This pre-defined view of data is grounded in presumptions about the data, which by its very nature creates a “data horizon” (data risk or value that is not “visible”, addressable or otherwise readily usable by an organizational stakeholder).
Having data describe itself to the user enables the user to see data in its complete context. Data blind spots for which there is no or insufficient classification are revealed in terms of relevance to other known data objects or documents within the corpus of content examined.
We work with all forms of data including text and non-textual forms of documentary information.
The following figure illustrates flexibility of cluster review at varying percentages of similarity including non-textual material.
Figure 1 Substantive Similarity Chart
We have significant value-add in working with emails.
Our email process gives users the ability to cluster and analyze data within or across email threads, with or without the emails and their attachments. This flexibility removes analysis constraints that are typically encountered in data classification work such as litigation, records management or internal investigations.
We scale to your projects and your enterprise.
Our solution scales to accommodate organizational and initiative driven requirements. We scale horizontally across different functional groups and departments. We scale vertically to address small document collections as well as large enterprise wide data silos. Handling millions of documents is not a problem.
We consider all documents, not just key word search results, to enable project use, to provide an informed preservation/legal hold and to discover risk and provide leadership to the entire process.
- Analyze all of the data – not just key word search results – held by any custodian of potentially relevant material quickly and easily, avoiding the risk of misfiled documents and of the custodian’s limited understanding
Early identification and containment of risk:
- Discover problems and factual basis for effecting disposition
- Act from full knowledge and not avoidance of what you don’t know
- Act having reviewed all potential documents and not just those that turn up on a key word search
- Provide discrete responses to a requesting party and not categorical responses that appear to be obfuscation
In both preservation and production, be more productive and accurate by deploying a few highly informed managers, controllers or reviewers – people with subject matter knowledge – to make assessments and to provide informed results
- Limit soft costs associated with staff time
- Make audits of process and content meaningful
- Move from an ad hoc response to an ordered, consistent, and accurate approach
- Avoid the “smoking gun” problem
- Look at all of the data but at an acceptable cost
Best Practice: Creating a foundation for effective Information Governance, Training and Corporate Culture.
- IG is a comprehensive set of controls, processes and technologies that help your organization maximize the value of enterprise information, while mitigating risk and costs.
Among others, components include:
- Policy Development
- Records Management
- HR policies and procedures
- Security and risk management
- Legal Hold
With a holistic approach, the benefits extend to most aspects of the company:
- Providing accurate information on a timely basis,
- Asset management
- Eliminate inefficiencies associated with getting the right information
- Provide information across the enterprise
- You can start small with areas of legal risk that you face on a day-to-day basis and carry forward process and content knowledge to new areas.
- Our systems learn and carry forward code that has been created – avoiding duplicative efforts
- Our systems work with minimal disruption to day-to-day operations
- Our systems are cost effective
- Xpriori has people who have “been there and done that!”
How we engage:
- Assessment of project
- Use of Gap Analysis
- Load sample data to confirm price and terms
- Proof of concept on smaller data set
- Statement of Work
- Team with client
The Ultimate Information Governance Goal:
Figure 2 Holistic Value Proposition to Stakeholders Across Functions
Key Personnel Resumes
Tim Dix, Co-Founder, Chairman and CEO
Tim has been instrumental in starting several businesses as the legal and financial leader behind the efforts. He was previously Founder, Chairman and CEO of NeoCore, Inc., a high-tech start-up company based in Colorado Springs, and is founder and chairman of Xpriori, LLC, which purchased the assets of NeoCore in 2003. He has guided Xpriori for the past 11 years into new adaptations of its proprietary technologies and representation of new technologies. His long standing interest in the storage, use and validation of unstructured information has made him a thought leader in the world of Information Governance (“IG”), eDiscovery and Compliance effective technologies. Over the past year, he has moved the company to a services model offering new technologies such as text and visual similarity clustering. As an entrepreneur, Dix participated in the founding of several companies including Woodland Park Cable Systems, Gold Point Development Corporation, Western Pacific Airlines, Vista Bank, NeoCore and Xpriori. As a lawyer, Mr. Dix has practiced in the areas of business and corporate law, with emphasis on capital-formation, business transfers, corporate finance, securities, tax and trade regulations for more than 40 years. In that capacity he has represented both public and private companies in a variety of industries. Contact Tim directly at: email@example.com; 01-718-210-5318.
Rich E. Davis, JD, Practice Lead and Chief Information Governance Strategist
Rich Davis is Xpriori’s Senior Consultant and Chief Information Governance Strategist. With JD and BBA degrees and numerous technology certifications, Rich started his career with IBM designing data management solutions for Fortune 500 companies. His career path is impressive, taking him from IBM to:
- New York State Court Administration as a technical analyst and teacher of Judges and other personnel about technologies;
- Managing Cravath Swaine & Moore’s Technical Litigation Support Group, where he advised the firm’s lawyers and clients on a host of eDiscovery, regulatory, information security and data privacy issues;
- Founding Manager of Kenyon and Kenyon’s Practice Management Department
Rich has been a consultant in the commercial marketplace for some time, recently founding his own firm, Veri Solutions, LLC while continuing to serve Xpriori in the above capacity. Veri and Xpriori team on certain opportunities as well. In his consulting practice, Rich has assisted a variety of major companies including BP, having directed its first ever Proactive Litigation Readiness Gap Analysis in 2009. That engagement led to his being engaged to address data collection and information management needs related to the Deepwater Horizon oil spill in May of 2010. He has a history of management and technology adoption in the IG field including major data collection, migration, remediation and discovery projects.
In addition to being a U.S. Army veteran and supporter of veterans’ causes, he has significant other experience in the US court system and several major international engagements, and is known in the industry as a lecturer and prolific writer.
Contact Rich directly at: firstname.lastname@example.org; 01-646-306-3833
Case Study: Unlocking Value In Information
An Actual Case on Unlocking Value in Information – Part 1
A Large Petroleum Company’s Use of Data Similarity Technology in Large Scale Classification Projects
Below is the first installment of an actual case where a large petroleum company (the “Company”) used and continues to use Data Similarity clustering technology as part of a large scale data classification effort to unlock the corporate information and knowledge that is trapped in that company’s stores of unstructured information -- Extracting Information To Support Profit Making Activities; Moving from the Concept of Cost Center to Creation of a Competitive Edge.
The first installment deals with the existential facts and concerns that shaped the Company’s approach and the make-up of the team that processed the information; the second installment deals with adapting the approach to the existing document creation/storage environments; and the third deals with the actual use of data similarity clustering as the core technology supporting the process.
CORE CONCERNS AND EXISTENTIAL FACTS THAT SHAPED THE CASE
True Information Governance and Enterprise Content Management, including their subset of eDiscovery, have traditionally been viewed as a business cost. Most companies have deployed content management systems in compliance with legal regulation or as a defensive measure to limit or eliminate possible legal action. These accomplishments, while noble, are usually costly, time consuming, and limited in scope.
In recent months, the Company has launched a series of projects using new automated document clustering technologies to evaluate and organize these largely digital information assets with a view to creation of value and competitive edge. The Company is undergoing a revamp of its enterprise content management strategy and philosophy at the functional and business asset level. With advances in Big Data and Cloud Computing, the Company believes that the extraction and analysis of data trapped in large document stores stands to offer significant returns, presenting an opportunity to turn a service once viewed as a cost center into one that offers business value - even a competitive edge.
The primary objectives of this program include enterprise-wide cost containment, risk reduction, development of a competitive edge with more efficient access to and extraction of value and knowledge from heretofore inaccessible content.
The Company has recognized that it will require approaches that assure validity of the information extracted; extracting and mining this data is only possible with a well-defined structure in place. In developing guiding principles for the program, the Company believes that the system must have the following qualities to assure validity and usefulness of the information:
Accuracy– relevance, currency, and featuring a single source of truth; delivery of the right information on a timely basis; delivering the right version and avoiding the massive duplication that mires most enterprise information stores;
Findability– quality search results that reduce time spent finding answers; intuitive navigation that guides users to the information; reduced storage options, or putting stuff in the right place, to avoid confusion; storage containers that are well defined; and an environment that promotes information sharing.
Consistency– a consistent user experience to drive user adoption and use; simple interfaces to lower support requirements; established metrics to measure the results and the user experience; and promote replication of results and reuse.
Governance– improved ability to enforce standards, promote higher degree of compliance with corporate standards and policies, including retention; foster continuous improvement by capturing lessons learned; and security.
Solid technology and automated services must be used. Traditionally, document attribution and migration has been a manual effort. A seasoned document controller could be expected to review and attribute 6,000 documents a month. In the course of day to day business, this rate is usually more than adequate. However, when faced with a large acquisition or undertaking and potentially the need to review many years of accumulation of unclassified unstructured content -- hundreds of thousands if not millions of documents, it becomes clear that 6,000 a month is not going to get the job done. Compounding the effort, an organization's unchecked data doubles in size every 18 months, and for years that landscape typically has been maintained by the individual user with little regard for its organization at best, and at worst intentional obscurity in the name of job security. The Company has found that providing a system that will assure the foregoing qualities is impossible where there is heavy reliance on manual effort.
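The scale problem described above can be made concrete with a little arithmetic. The following is a minimal sketch in Python, assuming the 6,000 documents per month review rate and the 18 month doubling period stated above; the one million document backlog and the function names are illustrative assumptions, not figures or tools from the case.

```python
def controller_months(documents: int, docs_per_month: int = 6000) -> float:
    """Months of effort for one document controller to review `documents`."""
    return documents / docs_per_month


def projected_volume(current: float, months: float,
                     doubling_months: float = 18.0) -> float:
    """Volume after `months`, assuming it doubles every `doubling_months`."""
    return current * 2 ** (months / doubling_months)


backlog = 1_000_000  # illustrative backlog size, not a figure from the case
effort = controller_months(backlog)
print(f"{effort:.0f} controller-months of manual review")
print(f"{effort / 10:.1f} months for ten controllers working in parallel")
print(f"{projected_volume(backlog, 36):,.0f} documents after 36 months unchecked")
```

Under these assumptions, ten controllers would need well over a year to clear the backlog, during which the unmanaged content would keep growing.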
The system must be highly automated and: (a) deal with large quantities of legacy content that is largely unorganized or organized not in a very helpful manner; and (b) provide a framework for which content can be organized accurately on an ongoing basis, ensuring that order is maintained with little to no additional effort on the part of users.
Thus, there are two overriding requirements: (1) the highly automated solution not only migrates and organizes existing content efficiently with a high degree of accuracy, but also (2) it provides a framework for which content can be classified and attributed in an ongoing basis, ensuring that order is maintained with little to no additional effort on the part of your users.
To get a system that fulfills the foregoing requirements, the Company has concluded that content must be analyzed, organized, and attributed – digitally tagged for association to a discrete organizational structure – and migrated or often physically moved to the right spot. Dealing with this unstructured content would be difficult, time-consuming, and, without the right plan and technology, prone to failure at every turn.
In recent years, numerous companies have developed systems to help automate (or partially automate) the process. Most vendors in this space offer an incomplete service, typically relying on one methodology to facilitate the process. This "one size fits all" approach is shortsighted and prone to inefficiencies. It may produce favorable results on a particular document class, but subpar results on many others. One needs a "best fit" solution that addresses the wide array of highly nuanced business documents.
After many assessments, the Company concluded that technologies that automate the clustering of documents based upon the similarity of their textual and/or visual content – the data similarity technology offered by Xpriori – should be central to the process. Document managers are able to view clusters presented at varying percentages of similarity and classify them to a defined hierarchy for use or storage. The defined hierarchy of classes is normally referred to as a taxonomy. There typically is a starting point taxonomy which is modified or upgraded as the clusters themselves suggest changes. The clustering enables the use of human judgment where it is most important in the process and enables decision making on larger numbers of documents – members of the clusters – at the same time. In the use case, the Company actually more than doubled the number of discrete classes in the applicable taxonomies, significantly improving user access to information.
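To illustrate what reviewing clusters "at varying percentages of similarity" means, here is a minimal sketch in Python. It is not Xpriori's proprietary technology: it assumes a simple word-overlap (Jaccard) similarity on text and a single-link grouping rule, and the sample documents and function names are hypothetical. Visual (pixel based) similarity is out of scope for this sketch.

```python
import re
from typing import Dict, List, Set


def tokens(text: str) -> Set[str]:
    """Lower-case word and number tokens in a document."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of tokens two documents have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(docs: Dict[str, str], threshold: float) -> List[List[str]]:
    """Group documents so that any pair at or above `threshold` shares a cluster."""
    names = list(docs)
    toks = {name: tokens(docs[name]) for name in names}
    parent = {name: name for name in names}

    def find(name: str) -> str:
        # Follow parent links to the cluster representative.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if jaccard(toks[first], toks[second]) >= threshold:
                parent[find(second)] = find(first)

    groups: Dict[str, List[str]] = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())


# Hypothetical sample documents.
sample = {
    "log_a": "well log depth 5000 ft gamma ray",
    "log_b": "well log depth 5200 ft gamma ray",
    "invoice": "invoice purchase order 4471 total due",
    "log_c": "well log summary gamma ray porosity",
}
for pct in (0.7, 0.4):
    print(pct, cluster(sample, pct))
```

Lowering the threshold merges looser matches into larger clusters, so a document manager can choose how broad a group to classify in a single decision.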
THE TEAM THAT PROCESSED THE INFORMATION; WHAT WAS DONE
Leveraging the data similarity clustering technology, the Company found it could deploy a small team – not hundreds of reviewers – that could effectively use the technology to accomplish this large task. The team of eight to ten included: business subject matter experts who are familiar with the domain, vertical and content types related to the documents as well as how they are used; information technology (“IT”) project managers; compliance specialists; electronic content management (“ECM”) application developer/engineers; and document controllers who have significant familiarity with the type of documents being collected.
The Document Controllers were critical and were all subject matter experts (“SME’s”). They include persons familiar with the documents used in the industry. In the case, the document controllers had vast amounts of experience in the documentation and processes related to wells, fields, drilling operations and supporting functions. Their qualifications varied from individuals with strong clerical skills to PhD geophysicists.
The work was broken down into two phases: Phase One: an agnostic collecting, culling, de-NISTing and de-duplicating of files from targeted file shares within the enterprise with limited regard to their discrete content – i.e. clustering not yet applied; and Phase Two: very content sensitive sorting including application of automated text and visual similarity processing of documents to clusters for analysis and classification to: (a) apply, modify and re-apply a broad and hierarchical classification taxonomy; (b) identify documents that will not fit the existing taxonomy and modify it to include them at an appropriate place; and (c) create an operational system that will automatically do the same with any newly introduced documents as they are ingested into a Documentum archive.
Working in two phases is a standard approach in the marketplace. Phase One really can be done without regard to the subject matter of the content. It represents an early cull and triage of information and typically results in culling of documents for non-business or non-content reasons. Phase Two is a more subjective analysis of the customer’s content, driven by the content itself and by business need. In this sense, all cases are different – from one Company or setting to another.
An Actual Case on Unlocking Value in Information – Part 2
As we indicated in our first installment, in recent months, a large Petroleum Company has launched a series of projects using new automated document clustering technologies to evaluate and organize these largely digital information assets with a view to creation of value and competitive edge. The Company is undergoing a revamp of its enterprise content management strategy and philosophy at the functional and business asset level. As is the case with many organizations, the primary objectives of this exercise include enterprise-wide cost containment, risk reduction and the extraction of value from content while enabling more effective use and management of the asset.
In Part 1, we discussed the goals and values for the project; the need for new technologies to supplant the slow pace of manual review and the small but expert team that was capable of using the new technologies to meet the goals, values and objectives. In Part 2, we address the team’s ability to meet the overarching goals for the system of Accuracy, Findability, Consistency and Governance at the various stages of creation, use and storage of information at the Company. The Company outlined objectives for each stage, and the team deployed the new technologies to meet those objectives at each stage.
2. STRUCTURAL CONSIDERATIONS: Location and Various Requirements of Documents during Content Life Cycle
The Company maintains its documents/content at different places in the corporate network environment depending on the kinds of information and their association with the stage of creation, use and retention:
|(1) Early Stage Work in Progress Documents||(1) Unmanaged Global File Shares|
|(2) Department Documents||(2) Managed File Shares|
|(3) Project Documents||(3) Managed SharePoint|
|(4) "Published" Documents||(4) Documentum or Stored Documents|
The Company identified four areas of storage, use and retrieval, coupled with standard attribution and functionality, to be used at progressive stages of the data classification process. The overarching goals of managing collaborative SharePoint sites, classifying unstructured content on managed and unmanaged file shares, and ultimately publishing managed documents and metadata were achieved by the Company’s defining and meeting the following functional and qualitative protocols and objectives for the file shares associated with each stage: (1) Future Policy, the applicable time during which the policies are effective; (2) Permission Management, the applicable enterprise rules for storage and use; (3) Structure and persons who might have access and use; and (4) Types of files covered together with various characteristics either required or available. The application of these criteria to the four stages and locations outlined in the document/content life cycle is outlined immediately below.
Stage 1 -- Early Stage Work in Progress --
Stage 2 -- Department Documents --
Stage 3 -- Project Documents --
Stage 4 -- Published Documents --
3. GENERAL CONSIDERATIONS - MIGRATION OF DOCUMENTS
To meet the Company’s goal, significant amounts of legacy information have had to be moved through various processes so that they ultimately reside in “published” storage and are available for enterprise wide use and collaboration. The Company also has deployed a centralized stack of Electronic Content Management (“ECM”) tools or applications to help users find what they need. In the use case in question, the underlying data set was comprised of documents related to the field operations and management associated with oil and gas exploration and production. The information, by and large unstructured in nature, consisted mostly of large quantities of oil logs, maps, CAD drawings, and other documentation largely non-textual in nature. The initiative involved the collection, review, remediation and storage to Documentum of large amounts of this unclassified information – approximately 1.9 million documents – that existed in various file shares on the Company servers. Initially, the Company worked through a pilot on approximately 100,000 documents taken from a couple of file shares that contained part of the 1.9 million.
4. THE RESULTS
The documents processed have yielded real value. In fact, initial assessment on documents classified to particular aggregations of acreage has demonstrated positive results on finding new drilling opportunities from information that is, in part, decades old. Also, 90% of information subjected to the process has been culled and culled defensibly in compliance with legal process.
There were many superfluous documents. The Company has found that clustering leads to aggregations of information more or less homogeneous in nature. As a result, search technologies available to users are working better and yielding better results. Even sophisticated tools such as “concept search” and “predictive coding”, often deployed in eDiscovery projects, could now be applied against smaller and more homogeneous datasets with a common content based organizational presence. This has meant quicker response time, fewer false positives in the search returns and quicker understanding of the context in which the information was created and used. To repeat, initial assessment on documents classified to particular aggregations of acreage has demonstrated positive results on finding new drilling opportunities from information that is in part decades old, and that is real value.
The program continues by developing new projects covering information being used or retained by other aspects of the business or other business units or functions.
5. KEYS TO UNDERSTANDING
There are several keys to understanding this process:
- The clustering process is automated and suggests clusters of documents that are either visually similar or textually similar based upon various algorithms;
- The results of the process are tuned to a percentage of similarity that produces results that you can review and accept or decline;
- The process is to start small – that is to start with a smaller batch from the larger collection that you want to assess and classify, and the process will apply what it has learned to subsequent batches as processed;
- The clustering will suggest changes to your taxonomy;
- Particularly where you have documents not stored according to any rules, you can expect a significant cull of superfluous information;
- The process should be engaged with a smaller team – in the current case about 10 – but the team should be made up of subject matter experts, some technology support and people somewhat familiar with the business source of the documents;
- The process groups the documents to the particular contexts identified by the team, making the information more useful and findable;
- The process should be governed by an overarching set of principles and goals; in this case, Accuracy, Findability, Consistency and Governance; and should be applied to the various stages of the content lifecycle of the organization.
- The process is best applied to discrete business units and/or their processes and workflows;
- What is learned at one stage is always carried forward to the next without the need to redo or recreate classification code; what has been created is carried forward.
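The idea of carrying forward what was learned on an earlier batch can be sketched as follows. This is an illustrative assumption about how such a step could work, not a description of the code the Company or Xpriori persisted: each accepted class keeps an exemplar vocabulary, a new document goes to the best-matching class, and anything below the similarity threshold is held for human review. The class names, sample text and function names are hypothetical.

```python
import re
from typing import Dict, Optional, Set


def tokens(text: str) -> Set[str]:
    """Lower-case word and number tokens in a document."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of tokens two token sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify(text: str, exemplars: Dict[str, Set[str]],
             threshold: float = 0.5) -> Optional[str]:
    """Return the best-matching class, or None to flag the document for review."""
    doc = tokens(text)
    best_label, best_score = None, 0.0
    for label, vocabulary in exemplars.items():
        score = jaccard(doc, vocabulary)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None


# Exemplars accepted by reviewers on an earlier batch (hypothetical classes).
learned = {
    "Well Logs": tokens("well log depth gamma ray"),
    "Invoices": tokens("invoice purchase order total due"),
}
print(classify("well log depth 5200 gamma ray", learned))
print(classify("seismic survey map section 12", learned))
```

A document that matches nothing is exactly the kind of item that may suggest a new class for the taxonomy.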
An Actual Case on Unlocking Value in Information – Part 3
As we indicated in Parts 1 and 2, in recent months, a large Petroleum Company has launched a series of projects using new automated document clustering technologies to evaluate and organize these largely digital information assets with a view to creation of value and competitive edge. The Company is undergoing a revamp of its enterprise content management strategy and philosophy at the functional and business asset level. As is the case with many organizations, the primary objectives of this exercise include enterprise-wide cost containment, risk reduction and the extraction of value from content while enabling more effective use and management of the asset.
In Part 1, we discussed the goals and values for the project; the need for new technologies to supplant the slow pace of manual review and the small but expert team that was capable of using the new technologies to meet the goals, values and objectives. In Part 2, we addressed the team’s ability to meet the overarching goals for the system of Accuracy, Findability, Consistency and Governance at the various stages of creation, use and storage of information at the Company. We also offered a number of keys to understanding the process and deployment of the technology.
In this Part 3, we address the two phase work flow deployed at the Company and provide some details of the step by step process that was followed. In Part 1, we addressed the structure of the team; we repeat some of that discussion immediately below.
The Team. The typical team assembled for a large scale classification project must include business subject matter experts who are familiar with the domain, vertical and content types related to the documents as well as how they are used. At the Petroleum Company, the team included the following:
Information Technology (“IT”) project managers;
Electronic Content Management (“ECM”) application developer/engineers; and
Document controllers (“DCs”) who have significant familiarity with the type of documents being collected.
The Document Controllers (“DCs”) were critical and were all subject matter experts (“SME’s”). They include persons familiar with the documents in the dataset. In the Petroleum industry, as in the use case, DCs can include subject matter and domain experts in a particular discipline, i.e. energy exploration and production (E&P). In the use case, the document controllers had vast amounts of experience in the documentation and processes related to wells, fields, drilling operations and supporting functions. Their qualifications varied from individuals with strong clerical skills to PhD level geophysicists.
What was actually done?
The work was broken down into two phases: (a) Phase One: an agnostic collecting, culling, de-NISTing and de-duplicating of files from targeted file shares within the enterprise with limited regard to their discrete content – and as a result, clustering not yet applied; and (b) Phase Two: content sensitive sorting including application of automated text and visual similarity processing of documents to clusters for analysis and classification to: (i) apply, modify and re-apply a broad and hierarchical classification taxonomy; (ii) identify documents that will not fit the existing taxonomy and modify the taxonomy to include them at an appropriate place; and (iii) create an operational system that will automatically do the same with any newly introduced documents as they are ingested into staged storage.
A. Phase One Detail: Data Collection and Initial Analysis
The Phase One effort was typical, e.g. (a) eliminate system files, (b) exact duplicates, (c) organize files by file type, (d) index them for common search tools, and (e) develop and apply some business rules for higher level classification. At the Petroleum Company, some rules already existed and were the basis for aligning a certain number of documents with the existing taxonomy – low hanging fruit if you will. To the extent possible, those documents could be stored and used depending upon stage of use or creation. In the case of legacy information, the storage is more likely to the “published” documents or, at the Petroleum Company, a Documentum store. There were significant numbers of documents that required further attention after completion of Phase One.
A few words about the Collection Process: at the Petroleum Company, the DCs and their project managers worked to identify potential authoritative data sources. Thus, they interviewed various people who might be involved as leaders or custodians of data. In this regard, among other efforts, they sought advice from their “Big Data” people, a group with data mining and other similar functions. That group helped identify databases and other sources that contain authoritative data. Once the DCs completed the clustering/classification/attribution activities described below, they created a register that they used to validate against the data source and to enhance the attributes of the documents that were collected.
Phase 1 activities are illustrated by the following workflow diagram and in the notes to each of the activities below:
Collect Data – data residing on business unit file shares is collected in ways that do not alter the file dates or content. Data is collected in a forensically sound fashion, preserving original metadata such as the time and date stamps of the files. Collection included file de-duplication and removal of known superfluous system files.
Triage Content – the data is then organized and ordered by data type, e.g. Word documents, CAD files and file types created by end-user software. Xpriori works with the client to prioritize the analysis of the collected data depending on client objectives.
Index Content – the indexing process makes the files and file properties searchable through common search tools and specialized tools for particular types of data; it also helps identify files that need to be subjected to an OCR process. This process enables text-based and pixel-based visual clustering. It also facilitates the extraction of attributes, e.g. P.O. numbers, company names and other important content.
Apply Rules to Documents and Classification Indicia – Business rules are created and then configured for use as basic Boolean and other search rules, categorized consistent with the storage taxonomy, and then applied to the data.
Classification Indicia – At this stage, there will be some percentage of documents that fall into classification categories and, as such, may not require further classification related activities. In some instances these documents will be ready for ingestion in an ECM system. There will also be some percentage of documents that need further processing in order to be indexed and classified. Those documents that have been classified can then go through the validation process (block g in Fig. 1) and be used as exemplars going forward.
Subsequent to the application of rules in item “d” above, we typically have a sense of how much content can be immediately classified without further processing – the identification of low hanging fruit. The most common form of a rule is expressed as a Boolean keyword search. For example, if an analyst wanted to find all well logs and related closeout reports, they would simply construct a query similar to the following: (well W/2 log) AND (closeout W/2 report). This example search would find all documents that contain the word “well” within 2 words of “log” and the word “closeout” within 2 words of “report”.
f-g. Validation and QA. Once the results are returned, they are validated then subjected to a comprehensive QA process which includes:
Visual confirmation by subject matter experts.
Conversion of unclassified files to normalized PDF format. To further analyze the content, Phase 2 processing requires that the files be converted to a normalized PDF format. Once this is done, the files are then re-indexed and subjected to the processes described in “d” and “e” above, and then passed to Phase 2 processing.
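The proximity rule used as the example above, (well W/2 log) AND (closeout W/2 report), can be sketched in a few lines of Python. This is only an illustration of the idea and not the search tool used at the Petroleum Company: the function names and sample documents are hypothetical, and we assume W/2 means "within two words, in either order", which can differ between search products.

```python
import re

def tokens(text):
    """Lower-case word tokens in document order."""
    return re.findall(r"[a-z0-9]+", text.lower())

def within(words, first, second, distance):
    """True if `first` and `second` occur within `distance` words of each other."""
    first_pos = [i for i, w in enumerate(words) if w == first]
    second_pos = [i for i, w in enumerate(words) if w == second]
    return any(abs(i - j) <= distance for i in first_pos for j in second_pos)

def matches_well_log_rule(text):
    """Hypothetical business rule: (well W/2 log) AND (closeout W/2 report)."""
    words = tokens(text)
    return (within(words, "well", "log", 2)
            and within(words, "closeout", "report", 2))

# Hypothetical sample documents
docs = {
    "a.txt": "Well log attached together with the final closeout report for the field.",
    "b.txt": "The well was drilled in May. A log of expenses follows.",
}
hits = [name for name, text in docs.items() if matches_well_log_rule(text)]
```

Only a.txt satisfies both proximity conditions; b.txt mentions “well” and “log” but too far apart, and it has no closeout report.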
B. Phase Two Detail: Recursive Extraction of Documents from Container Files and Deployment of Data Similarity Analysis For Clustering
The remaining documents typically require at least two types of transformation to be susceptible to the automated processing for clustering. Many of the documents not processed in Phase One will be in container files – ZIP, PST, TAR files etc. – and have to be removed from the containers. Once the removal is completed: (a) hash-based deduplication procedures are applied; and (b) all files that remain are converted to a normalized PDF format.
Application of hash-based deduplication further reduces duplicative content and identifies unique files that need to be extracted from a parent object in order to determine whether they fall into a particular classification category. The conversion to PDF enables the clustering algorithms to operate across all of the documents consistently. Please note that over time, it is common for text-based files to have been converted to other formats, in effect creating duplicates that will not be identified as such prior to application of this process. Visual similarity analysis also enables the identification of non-textual symbols as part of the process. Original files are maintained and the converted files are always available.
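As an illustration of hash-based deduplication, the following Python sketch hashes every file under a folder, keeps the first file seen for each hash value, and records each file's hash and location in a register. It is a minimal sketch under our own assumptions: SHA-256 is chosen for the example (the text above does not name a hash function), and the function names and register layout are hypothetical.

```python
import hashlib
from pathlib import Path

def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's bytes, read in chunks so large files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def deduplicate(root):
    """Return (unique, register).

    unique   maps each digest to the first file found with that content.
    register has one row per file (hash plus absolute and relative location),
             the kind of record a project database could keep for audit purposes.
    """
    root = Path(root)
    unique = {}
    register = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest = file_digest(path)
        is_duplicate = digest in unique
        unique.setdefault(digest, path)
        register.append({
            "sha256": digest,
            "absolute": str(path.resolve()),
            "relative": str(path.relative_to(root)),
            "duplicate": is_duplicate,
        })
    return unique, register
```

Note that this catches only exact, byte-for-byte duplicates. A Word file and a PDF made from it will hash differently, which is why the similarity processing is still needed.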
The following diagram, Figure 2, and explanatory notes illustrate both processes.
Files Ingestion – files that come out of Phase 1 block “f” are pulled into a tool that recursively extracts files from container objects.
Recursive Extraction – this process targets compressed files such as ZIP, RAR, TAR, PSTs and embedded objects in files.
ZIP, RAR and TAR – these are known compressed file types that may contain other compressed files types or objects that need to be opened before they can be fully de-duplicated against the rest of the data collection.
PST’s, LOTUS NOTES, etc. – these file types are containers that have other objects that should be extracted for comparison against the overall data set.
Hash de-duplication – similar to the process used in Phase 1.
Save information in the project database – the hash values of all files, together with the absolute and relative locations of the files, are stored in a database for audit and chain of custody purposes.
Fuzzy hash compares – this process compares certain metadata attributes of files to assist in the clustering process.
Save the information in the project database
All files are converted to a normalized format, PDF, such that the text clustering algorithms can be run across the entire corpus of collected, recursively extracted and text enriched content. Automated OCR (“Optical Character Recognition”) procedures are applied where the state of the information requires.
Data Similarity Clustering is applied; visual similarity will disclose the same document appearing as multiple file types, for example a Word file that has been converted to PDF with both files still in the collection; this enables further elimination of duplicates.
Hash, metadata values, and other pertinent information are stored to a project database.
Attributes can be extracted and associated with particular documents as tags or metadata; more on this in Part 4 of this series.
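The recursive extraction step in the notes above can be sketched with Python's standard library. This is an illustration only, not the tool used on the project: it handles ZIP and TAR containers (PST, RAR and Lotus Notes files need third-party software), the function names are hypothetical, and it treats any file that begins with the ZIP signature as a container, which would also unpack Office files such as .docx.

```python
import io
import tarfile
import zipfile

def open_container(data):
    """Return [(member_name, member_bytes), ...] for a ZIP or TAR, else None."""
    if data[:4] == b"PK\x03\x04":  # ZIP local file header signature
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                return [(info.filename, archive.read(info))
                        for info in archive.infolist() if not info.is_dir()]
        except zipfile.BadZipFile:
            return None
    try:
        with tarfile.open(fileobj=io.BytesIO(data)) as archive:
            members = [(m.name, archive.extractfile(m).read())
                       for m in archive.getmembers() if m.isfile()]
        return members or None  # an empty archive is kept as an ordinary file
    except tarfile.ReadError:
        return None

def extract_recursive(name, data, max_depth=10):
    """Yield (logical_path, bytes) for every non-container file inside `data`."""
    members = open_container(data) if max_depth > 0 else None
    if members is None:
        yield name, data
        return
    for member_name, member_data in members:
        yield from extract_recursive(f"{name}/{member_name}", member_data,
                                     max_depth - 1)
```

Each extracted file keeps a logical path that records which containers it came from, so it can be hashed, de-duplicated and traced back to its parent object.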
Both text and visual similarity clustering operate based upon a percentage of similarity that can be adjusted by the DCs. The DCs manually compare results at varying percentages until they obtain a consistent level of comfort in the result. The process enables the deployment of human judgment at the right point – the point at which similar documents suggesting common attributes have been identified, with duplicates and system files culled.
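To show how an adjustable percentage of similarity changes the clusters, here is a small Python sketch. It is a stand-in under our own assumptions and not the clustering algorithm used in the product: it measures similarity as the Jaccard overlap of two-word shingles and groups documents greedily around the first member of each cluster. All names and sample texts are hypothetical.

```python
import re

def shingles(text, size=2):
    """Set of overlapping word sequences of length `size`."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}

def similarity(a, b):
    """Jaccard similarity of the two documents' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

def cluster(docs, threshold):
    """Greedy clustering: a document joins the first cluster whose exemplar
    (its first member) is at least `threshold` similar, else starts a new one."""
    clusters = []
    for name, text in docs.items():
        for members in clusters:
            if similarity(text, docs[members[0]]) >= threshold:
                members.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# Hypothetical sample documents
docs = {
    "log1": "daily drilling report for well A7 north field",
    "log2": "daily drilling report for well B2 north field",
    "inv": "invoice for pipe supplies purchase order 4471",
}
loose = cluster(docs, 0.5)   # the two drilling reports group together
strict = cluster(docs, 0.8)  # every document stands alone
```

Comparing the output at 0.5 and 0.8 mirrors what the DCs do when they review results at varying percentages before settling on a setting.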
The system presents clusters for the DC to associate with the existing taxonomy and/or to render new orders of classification; and, to discover duplicates occurring where the same document appears in multiple file types.
The speed of process is further enhanced by the preservation and continued application of the coding that supported the creation of any cluster. At the discretion of the user, the coding can be applied to all new information introduced to a project or beyond. New documents are automatically aggregated to existing clusters. This speeds by a large factor the management of large projects or continuing introduction of new documents to a dataset.
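The idea of preserving the coding behind a cluster and applying it to new documents can be sketched as follows. This is a simplified illustration under our own assumptions: each saved cluster is reduced to an exemplar word set and a similarity threshold, and a new document is assigned to the best matching cluster or left as an outlier. The cluster labels, structure and function names are hypothetical.

```python
import re

def word_set(text):
    """Set of lower-case words in a document."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical cluster definitions preserved from earlier review work
saved_clusters = {
    "Well Logs": {
        "exemplar": word_set("daily drilling report well north field"),
        "threshold": 0.5,
    },
    "Invoices": {
        "exemplar": word_set("invoice purchase order pipe supplies"),
        "threshold": 0.5,
    },
}

def classify(text, clusters=saved_clusters):
    """Return the label of the best matching saved cluster, or None for an outlier."""
    words = word_set(text)
    best_label, best_score = None, 0.0
    for label, definition in clusters.items():
        score = jaccard(words, definition["exemplar"])
        if score >= definition["threshold"] and score > best_score:
            best_label, best_score = label, score
    return best_label

new_report = classify("Daily drilling report, well B2, north field")
new_invoice = classify("Invoice for pipe supplies, purchase order 4471")
outlier = classify("Minutes of the safety committee meeting")
```

Documents that return None are the outliers that would be set aside for a DC to review, possibly leading to a new cluster or a change to the taxonomy.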
At the Petroleum Company, as mentioned above, sometimes the clusters presented required further analysis and subdivision to achieve collections with sufficient homogeneity to be associated as a group with the taxonomy. There is really a “crossruff” between deployment of the visual/textual similarity tools and other tools such as Boolean search, keyword search, and file metadata to identify content. In these circumstances, the latter tools are applied to clusters that are typically much smaller sets of documents. The tools operate far more effectively on the smaller sets.
Finally, in Phase Two, the DCs apply descriptive tags to the documents within a cluster or set of clusters. These tags function to associate the documents to the taxonomy and to identify any other common attributes associated with them. For example, common attributes such as dates and authors are pulled forward from the document metadata. Metadata is expressed as tags as well. This process is called “attributing”, converting the noun to a verb to identify the process. Corporate policy is reflected in the substance of the tags used. Part 4 of the Case Study will provide more information on attributing. At the Petroleum Company, the result was significant improvements to the taxonomies as well as new ways of looking at existing information.
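A rough Python sketch of “attributing” follows: it combines a taxonomy label, file metadata and attributes extracted from the text into one set of tags. The patterns, tag names and sample values are hypothetical and chosen only for illustration; real P.O. number and date formats vary by organization.

```python
import re

# Hypothetical patterns for two kinds of extracted attributes
PO_PATTERN = re.compile(
    r"\bP\.?O\.?\s*(?:No\.?|Number|#)?\s*[:#]?\s*(\d{4,})", re.IGNORECASE)
DATE_PATTERN = re.compile(r"\b(\d{4}-\d{2}-\d{2})\b")

def attribute(text, metadata, taxonomy_label):
    """Build the tag set for one document.

    taxonomy_label comes from the cluster the DC ratified; author is pulled
    forward from file metadata; P.O. numbers and dates come from the text.
    """
    return {
        "taxonomy": taxonomy_label,
        "author": metadata.get("author"),
        "po_numbers": PO_PATTERN.findall(text),
        "dates": DATE_PATTERN.findall(text),
    }

tags = attribute(
    "Closeout report. P.O. No. 448812 issued 2015-03-09 to Acme Drilling.",
    {"author": "J. Smith"},
    "Well Closeout Reports",
)
```

Because the tags are plain key and value pairs, they can be stored alongside the hash and location records kept for each document.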
With the foregoing processes completed, the documents are stored consistent with their phase of use and creation.
Conclusion: Three Major Value Propositions Are Presented
While this Part 3 discussion deals primarily with handling legacy information at business units of the petroleum company, there are a number of fundamental value propositions that arise. We mention them briefly here but will develop use cases around them in future editions of the Xpriori Report.
Value Proposition 1 - Data content informs our knowledge of our environment
As illustrated in the foregoing, clustering algorithms now help us understand our data categories in ways heretofore unachievable when working with big data. Clusters can be created without human definition for review and culling or accepting. Much manual effort is avoided and all content is considered. Prior to having the ability to use algorithms to help vast quantities of data contextualize itself to a user, people would come at their data from a pre-defined point. This pre-defined view of data is grounded in presumptions about the data, which by its very nature creates a “data horizon”: data risk or value that is not “visible”, addressable or otherwise readily usable by an organizational stakeholder lives below that stakeholder’s data horizon. Having data describe itself to the user allows the user to see data in its complete context. Data blind spots for which there is no or insufficient classification are revealed in terms of relevance to other known data objects or documents within the corpus of content examined.
Value Proposition 2 – Taming big data
What is big data? Big data is characterized by three key variables, a set that has been coined “V3”:
Data Volume – large data volumes. This is a relative term that changes based on innovations in storage device areal density (the number of bits that can be stored in a given area on a device).
Data Variety – the kinds of data, i.e. unstructured, structured, semi-structured and newer polymorphic content formats.
Data Velocity – the rate at which content is created.
The ability to identify substantively similar and duplicative documents gives organizations the ability to select “the” business records that should be kept in an ECM archive. The impact on storage budgets can be significant. The amount of storage needed for an organization can now be projected with pinpoint accuracy. The key metrics that allow us to do this can be generated from metadata: storage growth year over year and document duplicates based on clustering.
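As a worked illustration of how those two metrics feed a storage projection, consider the following Python sketch. The figures are hypothetical, and the model is a deliberate simplification: it assumes a constant year-over-year growth rate and that duplicates continue to be removed at the same rate as new content arrives.

```python
def project_storage(current_tb, duplicate_fraction, annual_growth, years):
    """Projected storage (TB) for year 0..years after removing duplicates.

    current_tb          storage in use today
    duplicate_fraction  share of content identified as duplicative by clustering
    annual_growth       year-over-year growth rate taken from file metadata
    """
    retained = current_tb * (1 - duplicate_fraction)
    return [round(retained * (1 + annual_growth) ** year, 2)
            for year in range(years + 1)]

# Hypothetical figures: 100 TB today, 30% duplicates, 20% annual growth
projection = project_storage(100, 0.30, 0.20, 3)
```

With these assumed inputs the retained baseline is 70 TB, and each later year is the prior year multiplied by 1.2.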
Value Proposition 3 – Sustainable, objective and automated data classification based on iterative clustering and informed and automated modification to the storage taxonomy
The greater the corpus of information that is clustered, the more we know about an organization’s document and content types. The process is automated and provides objective review of all content. Each ratified (classified) grouping of document types within an organization becomes the classification exemplar for new documents that enter that particular managed storage environment. New information added to a collection will be subjected to the same code for classification and organization. The ROI comes from dealing effectively with the large increases in unstructured information experienced by all organizations; from development and validation of the storage taxonomies based upon similarity of content; from automating classification of information from external sources such as new entities acquired through M & A; and more.
An Actual Case on Unlocking Value in Information – Part 4
Concepts of Similarity or Sameness are Critical to Human Understanding; A Technology that Automates Clustering/Classification by Similarity is a BIG DEAL!; A Technology that can continually automate classification without much by way of human intervention is an even BIGGER DEAL!
As we indicated in Parts 1-3, in recent months, a large Petroleum Company has launched a series of projects using new automated document clustering technologies to evaluate and organize its largely digital information assets with a view to creation of value and competitive edge. The Company is undergoing a revamp of its enterprise content management strategy and philosophy at the functional and business asset level. As is the case with many organizations, the primary objectives of this exercise include enterprise-wide cost containment, risk reduction and the extraction of value from content while enabling more effective use and management of the asset.
In Part 1, we discussed the goals and values for the project; the need for new technologies to supplant the slow pace of manual review and the small but expert team that was capable of using the new technologies to meet those goals, values and objectives. In Part 2, we addressed the team’s ability to meet the overarching goals for the system of Accuracy, Findability, Consistency and Governance at the various stages of creation, use and storage of information at the Company. We also offered a number of keys to understanding the process and deployment of our technology. In Part 3, we addressed the two phase work flow deployed at the Company and provided some details of the step by step process that was followed.
In this Part 4, we now discuss how “similarity” has been and is the basis for assessment or categorizing all sorts of phenomena and things for most of history, and how it is the basis for classification today even in the new digital settings. We also consider again the main value propositions for our approach in this new era.
II. Similarity As The Basis For Classification
Assessments of similarity seem to be important for a variety of cognitive acts, ranging from problem solving to categorization to memory retrieval. William James (1890/1950) was correct in stating that “this sense of sameness is the very keel and backbone of our thinking”. This statement could appear in any number of articles in scholarly journals dealing with vector and other algorithmic analysis designed to assess similarity of documents, events, or anything expressed within spatial bounds.
When one perceives similarity of various things, expressions or phenomena, one tends to use it as suggesting a context for what makes them “similar”. Context adds meaning to information or even idle expressions. In fact, context is necessary for us to have information. Think about it. I can say the word “Tim” and it means little or nothing. However, if I say “My name is Tim,” the word Tim now has a useful context from which the word can be understood, used or consumed. Without the context, the term would rest on a stack of total miscellany to be ignored until a suitable context is found or provided.
In the case of document classification, eDiscovery, information governance etc., we tend to be dealing with aligning content with events or business activities that, at least in a larger context, have a degree of commonality at some level of understanding or presentation. There is less of a chance of finding the total chaos of, say, 1,000,000 unrelated or totally dissimilar expressions of content, where the content has been created in the course of a particular activity, such as in this case, the finding, drilling for and producing fossil fuels and/or related support activities. In this context, finding 1,000,000 totally dissimilar documents would be unexpected.
One of my colleagues argues that all electronic data that is created anywhere in the world has a threshold of similarity >0% to every other known data type that exists (when one considers metadata in addition to the substantive content). I guess that I am more willing to accept that electronic data created in the context of a business or functional activity will more likely have some threshold of similarity >0%. I can’t prove it directly as I write this monograph, but I am willing to advance it for more than argument’s sake here. This is particularly the case where we are looking for and grouping items based upon “similarity” and then submitting the clusters to users (Document Controllers) for confirmation that they have some useful degree of similarity. Users decide whether or not to confirm the proposed associations based upon their knowledge, education, experience and likely some consensus arising from group discussion. The “machine” has only enabled or processed the associations considering all of the content presented in the data set.
III. Automating Clustering By Similarity Saves Time, Money and Speeds Process, and Enables One That is More Accurate Than Purely Human Manual Review
Why is all of this important? Essentially, there are four primary reasons: (a) the machine enabled clustering takes into consideration every document in the data set; (b) one can reasonably classify and/or cull documents by reviewing less than all documents in a cluster, without having read each document; (c) the algorithmically based process takes into consideration non-text based information; and (d) the underlying code that created the clusters is preserved and is automatically applied to information newly introduced to a particular dataset. Let’s take a look at each of these primary reasons.
(a) Consider a process that looks at every document. Many of us possess experience-based intuition that will produce thoughtful and productive Boolean searches. If one knows the territory, one can do a pretty good job with those searches… but we are then relying on experientially developed guesses that are not directly derived from the content we desire to assess. Useful? … Well, yes. But, are conclusions drawn from an assessment of all of the content presented? … Well, no. The same is true for predictive coding. We can spot exemplars with some alacrity; however, there is no guarantee that those exemplars will lead to an assessment of every document in the set. Our clustering process does.
(b) Consider information a cluster at a time; not a document at a time. It is easy to think of culling of clusters of third party information – say, groups of periodicals. It is just as easy to recognize after reading several documents in a cluster the substantive basis for their being clustered together and stored based upon a header in a taxonomy. At the Petroleum Company, both activities occurred. There was an existing taxonomy that was available. However, in the process of cluster review, it was substantially updated – more than doubling the number of classification points in the existing taxonomy. The results of the largely automated clustering processes, both textual and visual in nature, caused the team to suggest significant changes to the taxonomy and its hierarchical presentation. The completeness of the existing taxonomy was assessed while new clusters simultaneously suggested a basis for enabling content/data driven changes to it.
Figures 1 and 2 below illustrate the before and after result of data analysis. Figure 1 is the “pre-Copernican” view of the data taxonomy as we knew it. Figure 2 is the result of our using the new clusters of data to add additional categories to the taxonomy, providing a greater degree of granularity in the organization of the content.
Figure 1- Illustration of the taxonomy before clustering - Content Sub-type is blank for all records.
Figure 2 - Post clustering and analysis - Content Sub-type is populated for new records.
(c) Deal with both textual and visual similarity to enable consideration of all content. Many documents are presented in image format and contain non-textual symbols or tokens such as logos, illustrative diagrams, pictures, etc. By converting all files in a document set to image format and then comparing them at selected percentages of similarity, Document Controllers are able to find those that are duplicates of text files and can also now include the non-textual material as part of the clustering process. The process and results are illustrated in the diagrams below. Every day, millions of substantively identical files get converted into different file types. Even though the PDF or TIFF or JPEG contains the same information as the actual email or PowerPoint or Word document from which it was derived, they have different hash values. As a result, the “system” fails to see substantively identical files as duplicates. Figure 3 below illustrates this point.
Figure 4 below illustrates how visual clustering ignores the file format and focuses on the substance of documents in order to group them based on their similarity with one another. It also illustrates varying results in applying two different percentages of similarity. This visual approach enables organizations to identify substantive duplicates and, for the first time, to identify potential versions of the same documents that are scattered throughout email and document storage servers. This ability has proved absolutely critical in the engineering, architectural and energy fields, where one must know that the versions of documents being relied upon are the most recent. The results at the Petroleum Company were quite compelling.
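The contrast between hash comparison and visual comparison can be sketched in Python. This is an illustration under our own assumptions and not the visual clustering technology itself: each rendered page is represented as a tiny grid of grayscale values (0 is black, 255 is white), and similarity is measured with a simple “average hash”, a common perceptual hashing technique used here only as a stand-in. The sample grids are hypothetical.

```python
import hashlib

def average_hash(pixels):
    """One bit per pixel: 1 if the pixel is brighter than the page average."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]

def visual_similarity(a, b):
    """Fraction of matching bits between two pages' average hashes."""
    ha, hb = average_hash(a), average_hash(b)
    return sum(x == y for x, y in zip(ha, hb)) / len(ha)

def byte_digest(pixels):
    """Exact hash of the raw pixel bytes, as used for ordinary de-duplication."""
    return hashlib.sha256(bytes(v for row in pixels for v in row)).hexdigest()

# Hypothetical 4x4 renderings: the same page exported from Word and as PDF
# (slightly different pixel values), plus an unrelated page
page_word = [[250, 250, 250, 250], [20, 20, 20, 250],
             [20, 20, 250, 250], [250, 250, 250, 250]]
page_pdf = [[245, 248, 251, 249], [25, 22, 18, 247],
            [23, 19, 246, 250], [248, 250, 249, 251]]
page_other = [[20, 20, 20, 20], [250, 250, 250, 250],
              [250, 250, 250, 250], [20, 20, 20, 20]]

same_bytes = byte_digest(page_word) == byte_digest(page_pdf)
same_look = visual_similarity(page_word, page_pdf)
different_look = visual_similarity(page_word, page_other)
```

The two renderings of the same page have different exact hashes yet score as fully similar visually, while the unrelated page scores low. A DC-adjustable percentage would sit between those two scores.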
(d) Code generated to create the clusters is preserved and is subsequently applied to newly introduced information. This is the key to implementing an automated self-classification system for new information as it is received at the enterprise or to a program.
New clusters might suggest changes to the taxonomy; however, there is no need to recode for documents that fit to previously created clusters. Projects evolve over time with new information being introduced from time to time for extended periods in the future. Post-closing M & A integration is made far easier. Cluster documents from the acquired company and associate those clusters with those of the acquiring company. The same is true for eDiscovery activities from litigation hold to production. It goes on over time and now you can look at everything that you think might even have a remote chance of containing relevant information.
The implications of this process are significant. Once the classification work previously done throughout the prior Phases is codified, the resulting set of rules can operate independent of human interaction on new information when added. Outliers can still be identified and set aside for further action. As illustrated in Figure 5 below, various departments or other organizational entities in the enterprise can have their file shares monitored by a classification engine which is specifically tuned to their specific classification requirements.
Once classified, the documents can be made available through a common set of access tools, search methodologies and the like to assure that stakeholders can access the information that they require and when they need it. On completion of a classification project in a business unit at the Petroleum Company, all of the foregoing was put in place.
IV. What are the value propositions?
Value Proposition 1 - Data content informs our knowledge of our environment
As illustrated in the foregoing, clustering algorithms now help us understand our data categories in ways heretofore unachievable when working with big data. Clusters can be created without human definition for review and culling or accepting. Much manual effort is avoided and all content is considered. We can now view an organization’s enterprise content from a perspective that is completely different from conventional approaches. We no longer have to come at the data from a pre-defined point that is, by its nature, grounded in presumptions about the data. Having data describe itself to the user allows the user to see data in its complete context. Data blind spots for which there is no or insufficient classification are revealed in terms of relevance to other known data objects or documents within the corpus of content examined.
Value Proposition 2 – Taming big data
What is big data? Big data is characterized by three key variables, a set that has been coined “V3”:
- Data Volume – large data volumes. This is a relative term that changes based on innovations in storage device areal density (the number of bits that can be stored in a given area on a device).
- Data Variety – the kinds of data, i.e. unstructured, structured, semi-structured and newer polymorphic content formats.
- Data Velocity – the rate at which content is created.
If one accepts the foregoing definition of big data, then taming big data is effectively the ability to scale and extrapolate Value Proposition 1 to increasingly large and disparate data volumes. The ability to identify substantively similar and duplicative documents gives organizations the ability to select “the” business records that should be kept in a document archive. The impact on storage budgets can be significant. The amount of storage needed for an organization can now be projected with pinpoint accuracy. The key metrics that allow us to do this can be generated from metadata: storage growth year over year and document duplicates based on clustering.
The impact on understanding and deploying the right information on a timely basis is even more valuable.
Value Proposition 3 – Sustainable, objective and automated data classification based on iterative clustering and informed and automated modification to the storage taxonomy
The greater the corpus of information that is clustered, the more we know about an organization’s document and content types. The process is automated and provides objective review of all content. Each ratified (classified) grouping of document types within an organization becomes the classification exemplar for new documents that enter that particular managed storage environment. For organizations that grow inorganically (by acquisition or merger), the larger the data volumes under management, the more statistically probable it is that there will be significant data clusters that are related to one another irrespective of traditional data horizon impediments such as language or character sets.
Organizations seeking to implement automated and sustainable data classification solutions such as those described and proposed in this document can embark on these types of initiatives with “out of the gate” ROI that positively impacts the bottom line. This value proposition has already impacted the following functional business units at the Petroleum Company: (1) Profit centers, (2) Legal departments, (3) Information security groups, (4) Field engineering, (5) HR, and (6) Marketing.
Similarity is at the core of how we analyze and categorize information. Our automated clustering is based upon similarity and will cluster documents in highly useful ways. This results in more complete classification of documents and understanding of their importance to the project and the enterprise.