Big Data Technologies, eDiscovery and You

A wide variety of technology solutions are available for managing Big Data

Fifteen years ago a personal computer might have held 10 gigabytes of storage, social media was not even a discussion and a cloud only stored precipitation. Discovery of electronic data was often a matter of blowback printing for hard copy review and EDD software, when available, was measured in page increments. Storage, distribution, and processing of data typically leveraged large servers from Sun Microsystems, Compaq and Hewlett-Packard that utilized the UNIX operating system and Oracle relational database technology with a client/server architecture.

Since its beginnings, electronic discovery has been transformed drastically. Practitioners now have a broad array of technology options for managing data that differs dramatically in variety, scale, source and complexity from the data of fifteen years ago.

The Explosion of Data across the Enterprise

Yesterday’s 10 gigabyte personal computer has grown to 1 terabyte today, with mobile phones, cameras and other mobile devices easily doubling, if not tripling, yesterday’s PC capacity. This simple comparison illustrates the proliferation of data across enterprises of all sizes and industries. The following statistics outline the sheer magnitude of data in today’s world:

  • The Internet carries 1,826 petabytes (1.8 billion gigabytes) of information per day.[1]
  • There are 294 billion emails sent every day.[2]
  • The annual growth rate of unstructured data in the enterprise is 80 percent.[3]
  • In 2011, the amount of information created and replicated surpassed 1.8 zettabytes (1.8 trillion gigabytes).[4]
  • The digital universe is 1.8 trillion gigabytes in 500 quadrillion “files” and more than doubling every two years.[5]
  • The volume of business data worldwide doubles every 1.2 years.[6]

What Makes Data Big

Data in today’s enterprise has evolved from basic text, numbers and dates to include audio, video, geospatial, 3D data, transactional data and social media generated from diverse feeds that can experience millions of updates per second. The term “Big Data” does not refer to a specific volume, type or source but rather a broader trend in data management. According to MongoDB, “Big Data refers to technologies and initiatives that involve data that is too diverse, fast-changing or massive for conventional technologies, skills and infrastructure to address efficiently. Said differently, the volume, velocity or variety of data is too great.”[7]

How Did Clouds Become Technical?

Similar to the term “Big Data,” the “cloud” and “cloud computing” have become oft-used terms that seem to confuse more than they explain. The National Institute of Standards and Technology has defined cloud computing as “a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.”[8] Software as a Service (SaaS), the typical method for providing eDiscovery applications to end users via a web browser, is a key service model within cloud computing that has been utilized by the legal technology industry for many years.

The technology that enabled the SaaS concept has evolved to support Platform as a Service (PaaS) and Infrastructure as a Service (IaaS), delivered privately to a single organization, to the public or as a hybrid of the two. Popular examples of these newer services are Elastic Compute Cloud (EC2) and Simple Storage Service (S3) from Amazon Web Services, which provide on-demand virtual computing (including popular environments such as Microsoft Windows Server) and data storage infrastructure via the Internet. EC2 On-Demand instances are charged by usage hours, starting at $0.060 per hour, with a small free tier available[9], and the S3 storage rate is $0.095 per gigabyte for the first terabyte per month.[10]
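To make the usage-based pricing concrete, the following is a minimal sketch of a monthly cost estimate built only from the two 2013 rates quoted above. The helper name `estimate_monthly_cost`, the treatment of one terabyte as 1,000 gigabytes and the decision to ignore the free tier are assumptions made for illustration; actual AWS billing has many more dimensions and current prices differ.

```python
# Rates quoted in the article (2013 list prices); current pricing will differ.
EC2_ON_DEMAND_USD_PER_HOUR = 0.060   # entry-level On-Demand instance
S3_USD_PER_GB_MONTH = 0.095          # rate for the first terabyte per month
S3_FIRST_TIER_GB = 1000              # assumption: 1 TB treated as 1,000 GB


def estimate_monthly_cost(instance_hours: float, stored_gb: float) -> float:
    """Hypothetical helper: compute plus storage cost in US dollars.

    The free tier is ignored, and storage beyond the first terabyte is
    rejected because the article does not quote a rate for it.
    """
    if stored_gb > S3_FIRST_TIER_GB:
        raise ValueError("the article only quotes the first-terabyte rate")
    compute = instance_hours * EC2_ON_DEMAND_USD_PER_HOUR
    storage = stored_gb * S3_USD_PER_GB_MONTH
    return round(compute + storage, 2)


# One instance running for a 30-day month with 500 GB stored in S3
print(estimate_monthly_cost(instance_hours=24 * 30, stored_gb=500))  # 90.7
```

Under these assumptions, a single always-on instance with 500 GB of storage comes to roughly $91 per month, split almost evenly between compute and storage.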

Big Data Technologies

The need of organizations to efficiently harness and analyze Big Data has fostered development of innovative new technologies. Legacy file and database systems simply did not and do not have the scalability, performance, reliability and affordability associated with Big Data technology, such as Hadoop Distributed File System and NoSQL databases.

NoSQL Database

NoSQL is commonly read as “Not Only SQL,” signaling an alternative to the SQL-based relational database management system. Its development was driven by the data management struggles of companies such as Google, Amazon, Twitter and Facebook. Traditional relational databases simply could not provide the performance and scalability required to process and analyze the massive volumes of data those companies manage.

Unlike traditional relational databases that rely on tables and data mapping, NoSQL databases, such as MongoDB and Couchbase, store data in document objects and operate on key/value pairs. These schema-less databases are designed to grow with horizontal scalability, simply adding commodity servers as required.
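The ideas in the paragraph above (schema-less documents, key/value access and horizontal scale-out) can be sketched in a few lines. This is a toy illustration, not how MongoDB or Couchbase are implemented: the class name `ShardedDocumentStore`, the sample fields and the simple hash-modulo routing are all assumptions chosen for brevity.

```python
import hashlib
import json


class ShardedDocumentStore:
    """Toy key/value document store spread across several 'servers'."""

    def __init__(self, shard_count: int) -> None:
        # Each dict stands in for one commodity server.
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key: str) -> int:
        # A stable hash of the key decides which shard owns the document.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key: str, document: dict) -> None:
        # Documents are stored as JSON text; no table schema is enforced,
        # so two documents may carry entirely different fields.
        self.shards[self._shard_for(key)][key] = json.dumps(document)

    def get(self, key: str) -> dict:
        return json.loads(self.shards[self._shard_for(key)][key])


store = ShardedDocumentStore(shard_count=3)
store.put("doc-001", {"custodian": "J. Smith", "type": "email", "attachments": 2})
store.put("doc-002", {"custodian": "A. Jones", "type": "chat", "channel": "legal"})
print(store.get("doc-002")["channel"])  # legal
```

Note that plain hash-modulo routing would reassign most keys if a shard were added; production systems generally use more careful partitioning schemes so that capacity can grow without moving all of the data.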

Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System is a Java-based file system, developed under the Apache Software Foundation, that stores files across multiple commodity servers. Modeled on the file system Google developed for its internal data storage needs, HDFS gives applications the ability to store high volumes of data across a cluster of computers and access that data in parallel, with high performance and availability.
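As a rough illustration of how a distributed file system spreads one file across a cluster, the sketch below splits a file into fixed-size blocks and assigns each block to several nodes. The function `place_blocks`, the node names and the round-robin assignment are hypothetical simplifications; real HDFS placement also takes rack topology and node load into account.

```python
def place_blocks(file_size_mb: int, block_size_mb: int, nodes: list[str],
                 replicas: int) -> dict[int, list[str]]:
    """Hypothetical helper: map each block index to the nodes holding a copy."""
    if replicas > len(nodes):
        raise ValueError("cannot place more replicas than there are nodes")
    block_count = -(-file_size_mb // block_size_mb)  # ceiling division
    placement = {}
    for block in range(block_count):
        # Start at a different node for each block, then use the following
        # nodes for the additional copies so no node holds a block twice.
        placement[block] = [
            nodes[(block + r) % len(nodes)] for r in range(replicas)
        ]
    return placement


cluster = ["node-a", "node-b", "node-c", "node-d"]
# Illustrative values: a 300 MB file, 128 MB blocks, three copies of each block
for block, holders in place_blocks(300, 128, cluster, replicas=3).items():
    print(block, holders)
```

Because every block lives on more than one machine, different blocks can be read from different servers at the same time, and the loss of a single server does not make any block unavailable.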

Big Data Technologies in Use

  • In tests conducted in December 2012, Intel reduced the time to sort a terabyte of data from approximately four hours to approximately seven minutes using Big Data Technologies. This 97 percent reduction produced near-real-time results at a significantly lower cost than was previously possible with legacy systems.[11]
  • The New York Times used 100 Amazon EC2 instances, Amazon S3 and a Hadoop application to process 4 TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 (not including bandwidth).[12]
  • Facebook has the largest reported Hadoop Distributed File System cluster, with over 100 petabytes[13] of disk space across more than 2000 machines.[14]
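The figures in the first two examples can be checked with simple arithmetic. The snippet below uses only the numbers reported above; the implied hourly rate assumes that all 100 instances ran for the full 24 hours, which the source does not state explicitly.

```python
# Intel sort test: approximately four hours down to approximately seven minutes
before_minutes = 4 * 60
after_minutes = 7
reduction_percent = (before_minutes - after_minutes) / before_minutes * 100
print(f"{reduction_percent:.1f}% reduction")  # 97.1% reduction

# New York Times job: 100 instances, 24 hours, about $240 in computation cost
instance_hours = 100 * 24
implied_rate = 240 / instance_hours  # assumes every instance ran all 24 hours
print(f"${implied_rate:.2f} per instance-hour")  # $0.10 per instance-hour
```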

Benefits of Big Data Technologies

  • Scalability: able to scale out horizontally across multiple commodity servers
  • Performance: can rapidly process large amounts of data in parallel
  • Reliability: fault tolerant, with no need for downtime
  • Affordability: low-cost commodity servers and applications with open source licensing

Big Data and eDiscovery

There is no question that the scope of electronic discovery has grown in parallel with the proliferation of data across the enterprise. Today’s projects are commonly measured in terabytes of data, with the potential of petabytes in the not-too-distant future. Budgets, timelines and resources have not, however, increased in proportion to project sizes, and legal teams are embracing new technologies in order to more efficiently process, manage and review discovery data for litigation, arbitration, investigations and compliance.

While the spotlight has focused on the benefits of technology-assisted review, new platforms have come online that leverage Big Data Technologies to provide superior performance, reliability, security and functionality at ultra-competitive price points. These eDiscovery processing, analysis and review platforms have moved away from the legacy architecture of Microsoft SQL Server, massive multi-core servers, Storage Area Network devices and data centers to on-demand cloud computing with Big Data Technologies. This shift not only lowers costs for clients but also provides the following key differentiators:

  • Near real-time availability of new data for search and review
  • On-demand, automated ability to add technology resources during periods of high demand
  • Parallel processing to maximize performance for projects of all sizes
  • Regional hosting centers around the globe
  • Failure-tolerant architecture with near zero downtime requirements
  • Increased security, with world-class data centers and encryption of data both in place and in transit

Even in an industry so resistant to change, one is hard-pressed to find argument with new solutions that provide such compelling benefits and a valuable alternative to the legacy eDiscovery platforms in use today.

End Notes

  1. The National Security Agency, “The National Security Agency: Missions, Authorities, Oversight and Partnerships,” August 9, 2013
  2. The Radicati Group, “Email Statistics Report, 2010-2014,” April 2010
  3. Symantec, “Symantec Helps Organizations Get Control of Runaway Data Growth,” March 12, 2012
  4. IDC, “Extracting Value from Chaos,” June 2011
  5. IDC, “Extracting Value from Chaos,” June 2011
  6. W. P. Carey School of Business — Arizona State University, “eBay Study: How to Build Trust and Improve the Shopping Experience,” May 8, 2012
  7. MongoDB, “Big Data Explained”
  8. National Institute of Standards and Technology, “The NIST Definition of Cloud Computing,” September 2011
  9. Amazon Web Services, “Amazon EC2 Pricing,” aws.amazon.com, 2013
  10. Amazon Web Services, “Amazon Simple Storage Service (Amazon S3),” aws.amazon.com, 2013
  11. Intel, “Big Data Technologies for Near-Real-Time Results,” 2013
  12. Derek Gottfrid, “Self-service, Prorated Super Computing Fun!,” The New York Times, November 1, 2007
  13. Facebook, “Under the Hood: Hadoop Distributed Filesystem reliability with Namenode and Avatarnode,” June 13, 2012
  14. Dhruba Borthakur, “Facebook has the world’s largest Hadoop cluster!,” hadoopblog.blogspot.com, May 9, 2010

First published on Dec 6, 2013 in Findlaw