Just as cloud computing has been a buzzword within the IT industry over the past decade, Big Data has rapidly grabbed the attention of CIOs and CTOs around the globe. With promises of squeezing more information out of traditional data sets, Big Data also brings a new dimension to data analytics: the ability to aggregate, store and manage entirely new types of data in ways never thought of before.
But standing up a Big Data installation is not an overnight project. There are complex decisions to be made prior to implementation as well as throughout the project itself.
I like Lisa Arthur’s definition of Big Data:
“Big data is a collection of data from traditional and digital sources inside and outside your company that represents a source for ongoing discovery and analysis.”
There are two important types of data sets: unstructured data and multistructured data. Unstructured data is essentially chaotic and not easily organized or interpreted. A good example of unstructured data is information coming from social media. Multistructured data, as defined by Arthur, “refers to a variety of data formats and types and can be derived from interactions between people and machines” — web logs, for example.
A Gartner survey states that 64 percent of companies had plans to invest in Big Data, or had already done so, in 2013. While media, banking, e-commerce and customer service are industries often cited as embracing Big Data, any company looking to understand more about its customers and operations could benefit from Big Data tools. Innocentive.com lists a number of other industries that are jumping on the bandwagon as well.
Even if your company’s business line isn’t among them, you should consider using Big Data to gain further insight into your business.
Why Move to Big Data?
Apart from the obvious, Big Data analytics has some compelling advantages over traditional number crunching. A CIO.com article points out several reasons for moving to Big Data and also outlines some potential challenges.
To briefly summarize the article, companies achieve better data management by using different and new types of data mining. This can potentially give you more insight into data you have already collected or are currently collecting.
Hosting a Big Data environment is also getting easier: many cloud service providers accommodate larger data sets through scalable cloud storage and the ability to scale up on processing power.
So you now have multiple sources and types of data, and your company wants to make sense of it. Where do you turn? You move to a Big Data analytic environment like Hadoop. Hadoop is an open-source software framework, produced and maintained by the Apache Software Foundation, for storage and large-scale processing of data sets on clusters of commodity hardware. It consists of four modules:
- Hadoop Common: Libraries and utilities used by other modules
- HDFS (Hadoop Distributed File System): Stores data on commodity machines in a distributed manner
- YARN: Manages compute resources in the clusters
- MapReduce: Programming model for large-scale data processing
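To get a feel for the MapReduce model, consider the classic word-count example. In a real cluster the framework runs mappers and reducers on different machines and handles the shuffle and sort between them; the sketch below simulates that in a single Python process purely for illustration, and is not how you would submit an actual Hadoop job.

```python
# Word count in the MapReduce style. Hadoop would run the map and
# reduce phases across the cluster and shuffle/sort the pairs in
# between; here a plain sort() stands in for that shuffle step.
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word seen."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(pairs):
    """Reduce phase: sum the counts for each word.
    Expects the pairs to arrive sorted by key, as Hadoop guarantees."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

data = ["big data big insight", "data at scale"]
shuffled = sorted(mapper(data))          # simulate the shuffle/sort
print(dict(reducer(shuffled)))
# {'at': 1, 'big': 2, 'data': 2, 'insight': 1, 'scale': 1}
```

The same mapper/reducer shape is what you would write for Hadoop Streaming, which lets you author jobs in languages other than Java by reading pairs from stdin and writing them to stdout.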
When you are dealing with incredibly large data sets of structured and unstructured data, you need to ensure that the data can be stored as well as clustered and distributed across multiple machines. Commodity hardware allows companies to reduce costs by avoiding proprietary or highly specific hardware requirements.
“Hadoop [is] used to augment or extend traditional forms of analytical processing, such as OLAP, rather than completely replace them,” says Cynthia Saracco, senior solutions architect at IBM’s Silicon Valley Laboratory, in a recent article. “[It] is often deployed to bring large volumes of new types of information into the analytical mix — information that might have traditionally been ignored or discarded.”
Hadoop, by its very nature, is designed for failure: If a node of the cluster containing the data goes down, there are automatic redundancies built into the architecture to ensure data integrity. In order to ensure this integrity and resiliency, the cluster must be set up properly, a task that may be daunting without the proper expertise.
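That built-in redundancy comes largely from HDFS block replication: each block of a file is stored on multiple machines, three by default, so the loss of a single node or disk doesn’t lose data. The replication factor is controlled by the `dfs.replication` property in `hdfs-site.xml` — a minimal sketch of that one setting, not a complete cluster configuration:

```xml
<!-- hdfs-site.xml: keep three copies of every block so the
     cluster tolerates the failure of a node or disk -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```

Raising the value buys more resilience at the cost of storage; a single-node test setup typically drops it to 1, since there are no other machines to replicate to.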
It’s easy enough to set up a single Hadoop node. This is probably a good place to start if you are evaluating the service or testing it out. Apache has a simple setup documented on their site. However, you would not want to implement a single-node setup as your production environment, because it defeats the whole purpose of having redundancy.
Paxel.net has a fairly straightforward setup guide for Linux but clearly states “installation of Hadoop on Linux is not [as] easy as it looks.”
Approaching Your Big Data Project
Any Big Data implementation should not be taken lightly; if done wrong, it carries many challenges and risks. IBM’s Saracco outlines a few of these challenges, paraphrased below:
- Lack of scoped objectives: If you don’t clearly outline your corporate goals for implementing Big Data within your organization, you are primed to fail.
- Lack of appropriate skill sets: If your organization does not have the skills or knowledge of Big Data or Hadoop, you will waste time and money doing something that an expert could help you implement more efficiently.
- Getting sidetracked: With any new technology, it is quite easy to get distracted in pursuing a particular path, but these distractions can cause delays to your project.
- High demand for data scientists: If you are embarking on a large Big Data project, it’s recommended that you get an expert, such as a data scientist, to guide you. However, they are scarce and in high demand.
When you do undertake your Big Data project, Saracco has the following tips:
- Start with a clear definition of objectives, timelines and executive sponsorship.
- Evaluate the technical options that best fit the scope of the project.
- Think about the Big Data solution to use (e.g., Hadoop).
- Break the complexity of the project into manageable groups, such as data collection, data preparation, data analysis and data sharing.
- Consider the scalability of the Big Data solution that you choose.
You may want to consider working with your hosting provider as a Big Data resource. They often have experienced staff on hand to help you with your Big Data implementation. Similarly, they may also have Big Data solutions as managed services that can dramatically reduce your implementation timeline.
While you may be inclined to resist paying for these services, it’s important to understand that the hidden costs of doing it yourself may be much higher than if you were to hire an expert or pay for a hosted solution or service.
Big Data promises to introduce new types of information into the analytical mix, information that may traditionally have been ignored or overlooked. It allows companies to capture new and previously unattainable insights, positioning the companies that embrace it ahead of their competition.
[image: Sergey Nivens/iStock/ThinkStockPhotos]