Distributed/Hadoop Storage Layer

Storage layer:

The storage layer is usually loaded with data using a batch process.

The integration component of the ingestion layer invokes various mechanisms, such as Sqoop, MapReduce jobs, ETL jobs, and others, to upload data to the distributed Hadoop storage layer (DHSL).
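For example, a minimal Sqoop import of a relational table into HDFS might look like the sketch below; the JDBC URL, credentials, table name, and target directory are placeholders, not values from this article.

    # Hypothetical example: pull an RDBMS table into HDFS with Sqoop.
    # Connection string, user, table, and target directory are placeholders.
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username etl_user -P \
      --table orders \
      --target-dir /data/raw/orders \
      --num-mappers 4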

Hadoop:

  • Open source framework
  • Allows us to store huge volumes of data in a distributed fashion.
  • Decouples the distributed-computing plumbing from the actual application logic you want to execute.
  • Enables you to interact with a logical cluster of processing and storage nodes.

Components:

1. HDFS

  • File system designed to store very large volumes of information (terabytes or petabytes) across a large number of machines in a cluster.
  • Stores data reliably and runs on commodity hardware.
  • Uses blocks to store a file or parts of a file.
  • Supports a write-once-read-many model of data access, as shown in the example below.
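As a rough illustration of that model (paths and file names are placeholders), the HDFS command line is used to write a file once and then read it many times:

    # Copy a local log file into HDFS; HDFS splits it into blocks and replicates them.
    hdfs dfs -mkdir -p /data/raw/logs
    hdfs dfs -put access.log /data/raw/logs/
    # Files are then read many times but not edited in place.
    hdfs dfs -ls /data/raw/logs
    hdfs dfs -cat /data/raw/logs/access.log | head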

2. MapReduce

  • Distributes the computation across the cluster as map and reduce tasks and computes results in batch.
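As a minimal sketch, assuming the stock Hadoop examples jar is available on the node (its exact path and version vary by install), a batch MapReduce job is submitted like this:

    # Run the classic word-count example as a batch MapReduce job over data in HDFS.
    # The examples jar location and the HDFS directories are placeholders.
    hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
      wordcount /data/raw/logs /data/out/wordcount
    # Results are written back to HDFS once the batch job completes.
    hdfs dfs -cat /data/out/wordcount/part-r-00000 | head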

 

Storage pattern:

  • Defines how data moves from the ingestion layer to the storage layer.
  • Can be implemented in different ways based on the performance, scalability, and availability requirements.

 

Big data – Sources


Legacy Data Sources:

  • HTTP/HTTPS web services
  • RDBMS
  • FTP
  • JMS/MQ based services
  • Text/flat file/csv logs
  • XML data sources
  • IM protocol requests

 

New Age Data Sources:

High Volume Sources

1. Switching devices data

2. Access point data messages

3. Call data records, growing exponentially with the user base

4. Feeds from social networking sites

Variety of Sources

1. Image and video feeds from social networking sites

2. Transaction data

3. GPS data

4. Call center voice feeds

5. E-mail

6. SMS

High Velocity Sources

1. Call data records

2. Social networking site conversations

3. GPS data

4. Call center – voice-to-text feeds

OBIEE presentation service doesn’t start even though nothing has changed

Debug steps:

  • Make sure the RPD, the web catalog, and the Presentation Services configuration file instanceconfig.xml are correct; if you have made any changes to them, try reverting the changes and restarting.
  • If the issue still persists and nqserver.log shows an error like the following:
Error Message From BI Security Service: [nQSError: 46164] HTTP Server returned 404
Authentication failed: invalid user/password.

Resolution:

  • Stop all opmnctl services
  • Stop the WebLogic managed server
  • Stop the WebLogic Admin server
  • Empty /tmp
  • chmod 1777 /tmp (restore the default permissions on /tmp, including the sticky bit)
  • Start the WebLogic Admin server
  • Start the WebLogic managed server
  • Go to WebLogic -> Deployments and check that all deployments are in the Active state; if any deployment is not active, start it.
  • Start the opmnctl services
  • If any service does not start, try opmnctl startall again without stopping, or try opmnctl startproc for the individual component (see the sketch below for the full sequence).
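A rough shell sketch of the same sequence on an OBIEE 11g host follows; the MW_HOME, instance, and domain paths are assumptions based on a default install, so adjust them to your environment.

    # Stop services (paths below are assumed defaults for an OBIEE 11g install).
    $MW_HOME/instances/instance1/bin/opmnctl stopall
    $MW_HOME/user_projects/domains/bifoundation_domain/bin/stopManagedWebLogic.sh bi_server1
    $MW_HOME/user_projects/domains/bifoundation_domain/bin/stopWebLogic.sh

    # Empty /tmp and restore its default permissions (including the sticky bit).
    rm -rf /tmp/*
    chmod 1777 /tmp

    # Start services back up in the reverse order.
    nohup $MW_HOME/user_projects/domains/bifoundation_domain/bin/startWebLogic.sh &
    nohup $MW_HOME/user_projects/domains/bifoundation_domain/bin/startManagedWebLogic.sh bi_server1 &
    $MW_HOME/instances/instance1/bin/opmnctl startall
    $MW_HOME/instances/instance1/bin/opmnctl status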

Big data challenges

  • Understanding and prioritizing the useful data amid the garbage that is coming into the enterprise. Ninety percent of all the data is noise, and it is a daunting task to classify and filter the knowledge from the noise.
  • In the search for inexpensive methods of analysis, organizations have to compromise and balance against the confidentiality requirements of the data. The use of cloud computing and virtualization further complicates the decision to host big data solutions outside the enterprise. But using those technologies is a trade-off against the cost of ownership that every organization has to deal with.
  • Data is piling up so rapidly that it is becoming costlier to archive it. Organizations struggle to determine how long this data has to be retained. This is a tricky question, as some data is useful for making long-term decisions, while other data is not relevant even a few hours after it has been generated and analyzed and insight has been obtained.
  • With the advent of new technologies and tools required to build big data solutions, availability of skills is a big challenge for CIOs. A higher level of proficiency in the data sciences is required to implement big data solutions today because the tools are not user-friendly yet.

Big data analytics

Analyzing structured data

  • Involves identifying patterns in text, video, images, and other such content. This is different from a conventional search, which brings up the relevant document based on the search string. Text analytics is about searching for repetitive patterns within documents, e-mails, conversations and other data to draw inferences and insights.

Analyzing unstructured data

  • Analyzed using methods like natural language processing (NLP), data mining, master data management (MDM), and statistics. Text analytics uses NoSQL databases to standardize the structure of the data so that it can be analyzed using query languages like Pig, Hive, and others (see the sketch below). The analysis and extraction processes take advantage of techniques that originated in linguistics, statistics, and numerical analysis.
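For instance, once text has been landed in a table, a simple Hive query over it could look like the line below; the tweets table and sentiment column are hypothetical outputs of an earlier NLP step, and only the hive -e invocation itself is standard:

    # Hypothetical example: count records per extracted sentiment label with Hive.
    hive -e "SELECT sentiment, COUNT(*) AS cnt FROM tweets GROUP BY sentiment ORDER BY cnt DESC;"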

What is Big Data?

Introduction

  • A big data solution must address the three Vs of big data: volume, velocity, and variety, along with the complexity of the data.
  • Velocity defines the speed with which different types of data enter the enterprise and are then analyzed.
  • Variety addresses the unstructured nature of data such as weblogs, radio-frequency ID (RFID) readings, meter data, stock-ticker data, tweets, images, and video files on the Internet, in contrast to traditional structured data.
  • For a data solution to be considered big data, the volume has to be at least in the range of 30-50 terabytes (TB).
  • However, large volume alone is not an indicator of a big data problem. A small amount of data could have multiple sources of different types, both structured and unstructured, which would also be classified as a big data problem.

Traditional BI vs Big Data

  • Data is retained in a distributed file system instead of on a central server.

  • The processing functions are taken to the data rather than the data being taken to the functions.

  • Data is of different formats, both structured as well as unstructured.

  • Data is both real-time data as well as offline data.

  • Technology relies on massively parallel processing (MPP) concepts.

Use

  • Energy companies monitor and combine usage data recorded from smart meters in real time to provide better service to their consumers and improved uptime.

  • Web sites and television channels are able to customize their advertisement strategies based on viewer household demographics and program viewing patterns.

  • Fraud-detection systems are analyzing behaviors and correlating activities across multiple data sets from social media analysis.

  • High-tech companies are using big data infrastructure to analyze application logs to improve troubleshooting, decrease security violations, and perform predictive application maintenance.

  • Social media content analysis is being used to assess customer sentiment and improve products, services, and customer interaction.

How to automate web services testing/performance-testing on *nix – Approach

JMeter

It’s a tool used for performance testing, but as it’s freely available we can also use it for normal testing.

Jenkins

It’s a tool that can be used to automate tasks.

Integrate Jenkins and JMeter

JMeter comes with a *nix command-line mode. We should be triggering a shell script from Jenkins, which in turn should invoke JMeter, then collect the results and display them on the Jenkins console.
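For instance, the shell script Jenkins triggers could look roughly like this; the JMeter install path, JMX file name, and results file name are placeholders:

    #!/bin/sh
    # Hypothetical Jenkins build step: run JMeter in non-GUI mode and show results.
    JMETER_HOME=/opt/apache-jmeter        # assumed install location
    TEST_PLAN=webservice-tests.jmx        # placeholder test plan
    RESULTS=results.jtl

    "$JMETER_HOME/bin/jmeter" -n -t "$TEST_PLAN" -l "$RESULTS"

    # Dump the raw results onto the Jenkins console.
    cat "$RESULTS"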

Looks simple?

Idea

You can create one JMX file per web service, or one for multiple web services, depending on the requirement. The main problem we usually face in automation tasks is parameterisation: we want to keep some parameters for run time, parameters which we cannot hard-code, and this is actually what makes it automation. So first identify the parameters you want to supply at run time.

Now, when you trigger the job from Jenkins, it should prepare those parameters in the form that is required; you can write a *nix script for that if needed.

Once the parameters are ready, you invoke JMeter from your script. There is a twist here: a JMeter JMX file can accept input at run time in two ways:

  1. CSV files
  2. -J[prop name]=[value]

Whichever way you choose, make sure you have prepared your JMX file accordingly in advance.

For example, if you want to use option 1, you can create the CSVs at run time from your scripts, but make sure the JMX file points to the correct path and file name.

And in case you want to go with option 2, there is a different syntax you will have to use in the JMX file in order to read the properties passed at run time.
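Here is a hedged sketch of both options; the property names, CSV columns, and file names are made up for the example, and only the -J flag and JMeter’s __P property function are standard features:

    # Option 1: generate the CSV that the JMX file's CSV Data Set Config reads.
    printf 'userId,token\n1001,%s\n' "$RUNTIME_TOKEN" > users.csv

    # Option 2: pass properties on the command line; inside the JMX they are read
    # with the __P function, e.g. ${__P(host)} or ${__P(threads,5)}.
    "$JMETER_HOME/bin/jmeter" -n -t webservice-tests.jmx -l results.jtl \
      -Jhost=staging.example.com -Jthreads=10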

 

The point is that you can customize this the way you want; use these two ways to provide parameters at run time, because that, I think, is what matters most in any automation. The GUI we get from Jenkins, the preparation and invocation we do from shell scripts, the testing is done by JMeter, and the parameters are handled as described above.

 

Output

Well, I have written much about how to run the tests; here is how to get the results. Use JMeter’s result listeners (for example, the View Results Tree listener or a Simple Data Writer) to create as many result CSV files as you want, with different names, then read these files from your script and display them on the Jenkins console. You can even keep the names and locations of the files dynamic using the two options explained above.
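A rough sketch of that read-and-display step; the file naming pattern and the position of the success column are assumptions about how the results were saved:

    # Print each saved result file and a simple pass/fail summary on the Jenkins console.
    for f in results_*.csv; do
      echo "==== $f ===="
      cat "$f"
      # Assumes the default JMeter CSV layout, where column 8 is the 'success' flag.
      awk -F, 'NR > 1 { if ($8 == "true") pass++; else fail++ }
               END { print "passed:", pass + 0, "failed:", fail + 0 }' "$f"
    done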

 

Performance testing

Well, Jenkins even supports performance output; great, isn’t it? You need to install and configure the Performance plugin for that. Find more details here:

https://wiki.jenkins-ci.org/display/JENKINS/Performance+Plugin