Big data – Sources


Legacy Data Sources:

HTTP/HTTPS web services



JMS/MQ based services

Text/flat file/csv logs

XML data sources

IM Protocol requests


New Age Data Sources:

High Volume Sources

1. Switching devices data

2. Access point data messages

3. Call data records, driven by exponential growth in the user base

4. Feeds from social networking sites

Variety of Sources

1. Image and video feeds from social networking sites

2. Transaction data

3. GPS data

4. Call center voice feeds

5. E-mail

6. SMS

High Velocity Sources

1. Call data records

2. Social networking site conversations

3. GPS data

4. Call center – voice-to-text feeds

Big data challenges

  • Separating useful data from the garbage coming into the enterprise, and prioritizing it. Ninety percent of all the data is noise, and it is a daunting task to classify the data and filter the knowledge from the noise.
  • In the search for inexpensive methods of analysis, organizations have to balance cost against the confidentiality requirements of the data. Cloud computing and virtualization further complicate the decision to host big data solutions outside the enterprise, yet using those technologies is a trade-off against the cost of ownership that every organization has to weigh.
  • Data is piling up so rapidly that it is becoming costlier to archive it. Organizations struggle to determine how long this data has to be retained. This is a tricky question, as some data is useful for making long-term decisions, while other data is not relevant even a few hours after it has been generated and analyzed and insight has been obtained.
  • With the advent of new technologies and tools required to build big data solutions, availability of skills is a big challenge for CIOs. A higher level of proficiency in the data sciences is required to implement big data solutions today because the tools are not user-friendly yet.

Big data analytics

Analyzing unstructured data

  • Involves identifying patterns in text, video, images, and other such content. This differs from a conventional search, which returns the documents relevant to a search string. Text analytics instead looks for recurring patterns within documents, e-mails, conversations, and other data to draw inferences and insights.
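
A minimal sketch of the difference between search and pattern finding, assuming a tiny in-memory set of e-mails. The function name `repeated_phrases`, its parameters, and the sample messages are all hypothetical and only illustrate the idea; real text analytics would add much more linguistic processing.

```python
from collections import Counter
import re


def repeated_phrases(documents, n=2, min_docs=2):
    """Return n-word phrases that occur in at least `min_docs` documents."""
    doc_counts = Counter()
    for text in documents:
        words = re.findall(r"[a-z']+", text.lower())
        # Use a set so that each phrase counts once per document.
        phrases = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        doc_counts.update(phrases)
    return {p: c for p, c in doc_counts.items() if c >= min_docs}


# Hypothetical sample data.
emails = [
    "The billing error was reported again this morning.",
    "Customer reports a billing error on the latest invoice.",
    "Please reset my password.",
]

patterns = repeated_phrases(emails)
print(patterns)  # {'billing error': 2}
```

No search string is supplied here: the recurring phrase surfaces from the data itself, which is the point of contrast with a conventional search.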

Methods for analyzing unstructured data

  • Unstructured data is analyzed using methods like natural language processing (NLP), data mining, master data management (MDM), and statistics. Text analytics uses NoSQL databases to standardize the structure of the data so that it can be analyzed with query tools like Pig, Hive, and others. The analysis and extraction processes take advantage of techniques that originated in linguistics, statistics, and numerical analysis.
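
As a rough illustration of standardizing structure before querying, the sketch below parses free-text log lines into uniform records and then runs a simple aggregate over them. The log format, field names, and sample lines are assumptions made for this example; it uses plain Python rather than a NoSQL store, Pig, or Hive.

```python
import re
from collections import Counter

# Hypothetical log format, assumed for illustration only.
LINE_PATTERN = re.compile(
    r"(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<service>\w+): (?P<message>.*)"
)


def to_records(lines):
    """Turn raw log lines into dictionaries, skipping lines that do not match."""
    records = []
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match:
            records.append(match.groupdict())
    return records


# Hypothetical sample data.
raw_lines = [
    "2024-01-05T10:00:00 ERROR billing: card declined",
    "2024-01-05T10:00:02 INFO auth: login ok",
    "garbled line without structure",
    "2024-01-05T10:00:07 ERROR billing: timeout calling gateway",
]

records = to_records(raw_lines)

# Roughly equivalent to:
#   SELECT service, COUNT(*) FROM records WHERE level = 'ERROR' GROUP BY service
errors_by_service = Counter(r["service"] for r in records if r["level"] == "ERROR")
print(errors_by_service)  # Counter({'billing': 2})
```

Once every record has the same fields, the question "which service produces the most errors?" becomes an ordinary group-and-count query.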

What is Big Data?


  • A big data solution must address the three Vs of big data (volume, velocity, and variety) as well as the complexity of the data.
  • Velocity describes the speed at which different types of data enter the enterprise and are then analyzed.
  • Variety addresses the unstructured nature of data such as weblogs, radio frequency ID (RFID), meter data, stock-ticker data, tweets, images, and video files on the Internet, in contrast to the structured data held in conventional systems.
  • For a data solution to be considered big data, the volume has to be at least in the range of 30-50 terabytes (TB).
  • However, large volume alone is not an indicator of a big data problem. A small amount of data could come from multiple sources of different types, both structured and unstructured, and would also be classified as a big data problem.

Traditional BI vs Big Data

  • Data is retained in a distributed file system instead of on a central server.

  • The processing functions are taken to the data rather than the data being taken to the functions.

  • Data is of different formats, both structured as well as unstructured.

  • Data is both real-time data as well as offline data.

  • Technology relies on massively parallel processing (MPP) concepts.
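
The idea of taking the function to the data can be sketched in a few lines, under heavy simplification. Here each inner list stands in for a block of data held on a separate node, and threads stand in for the worker nodes; the names `partitions`, `count_local`, and `merge` are hypothetical and do not come from any particular framework.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Each inner list stands in for a data block stored on a different node.
partitions = [
    ["ERROR", "INFO", "INFO"],
    ["INFO", "ERROR", "ERROR"],
    ["WARN", "INFO"],
]


def count_local(partition):
    """Runs where the data lives and returns only a small summary."""
    return Counter(partition)


def merge(summaries):
    """Combine the per-partition summaries into one result."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total


# Threads stand in for worker nodes; a real cluster would schedule the
# function on the machines that hold each block.
with ThreadPoolExecutor(max_workers=3) as pool:
    summaries = list(pool.map(count_local, partitions))

totals = merge(summaries)
print(totals)
```

Only the small per-partition summaries travel to the merge step, not the raw records, which is what makes this pattern attractive when the data is large and distributed.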


Big data use cases

  • Energy companies monitor and combine usage data recorded from smart meters in real time to provide better service to their consumers and improved uptime.

  • Web sites and television channels are able to customize their advertisement strategies based on viewer household demographics and program viewing patterns.

  • Fraud-detection systems are analyzing behaviors and correlating activities across multiple data sets from social media analysis.

  • High-tech companies are using big data infrastructure to analyze application logs to improve troubleshooting, decrease security violations, and perform predictive application maintenance.

  • Social media content analysis is being used to assess customer sentiment and improve products, services, and customer interaction.