Big data is a term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis. But it’s not the amount of data that’s important. It’s what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves.
Client Profile: Leading Bank (Under NDA)
The client, a leading bank, has an existing system that collects account data, account details, transaction details and other account-related details, processes them, and generates a file that an email-sending system uses to send out emails. Approximately 5 million e-statements among ten need to be processed by the client's existing system. Collecting and processing this data takes approximately 18 hours of production batch run time, which is very long.
End Users of Data Output
The IT team and Customer Care departments of the customer's organization.
Big Data Problem
The client wants to bring down the data processing time from the current 18 hours.
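To put the 18-hour figure in perspective, the following is a rough sketch of the throughput it implies, assuming roughly 5 million e-statements are processed in a single 18-hour batch. The function names and the 4-hour target are hypothetical and used only for illustration; the source does not state a target run time.

```python
def statements_per_second(total_statements: int, batch_hours: float) -> float:
    """Average throughput needed to finish the batch in the given time."""
    return total_statements / (batch_hours * 3600)


def required_speedup(current_hours: float, target_hours: float) -> float:
    """How many times faster the batch must run to hit the target."""
    return current_hours / target_hours


if __name__ == "__main__":
    total = 5_000_000          # approximate e-statement count from the case study
    current_hours = 18         # current production batch run time
    target_hours = 4           # hypothetical target, not from the source

    print(f"Current rate: {statements_per_second(total, current_hours):.2f} statements/s")
    print(f"Target rate:  {statements_per_second(total, target_hours):.2f} statements/s")
    print(f"Speedup:      {required_speedup(current_hours, target_hours):.1f}x")
```

The current rate works out to about 77 statements per second, which is the baseline any parallel solution has to beat.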
Solution: Big Data Architecture
Big Data Solution for Data Processing
Big Data Project Lifecycle
Hadoop was chosen for batch processing of this unstructured data because it can process both unstructured data and large data volumes quickly by working in parallel. For raw data such as transactions, backup is crucial. Hadoop handles this by keeping three replicas of the same data (the default HDFS replication factor) as a safeguard, rather than relying on the secondary NameNode. Hadoop's map and reduce approach makes data processing fast. MapReduce processing can be written in Java or expressed through Pig and Hive, and Sqoop, which runs map-only jobs, is an easy way to import data from SQL databases. Hadoop was therefore selected for this project because it makes it easier to join, filter and sort the datasets, and it is also good at processing data and maintaining backups.
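The map and reduce idea can be illustrated with a minimal in-memory sketch. This is not Hadoop code and not the client's implementation: the record layout ("account_id,amount"), the function names, and the sample data are all assumptions made for illustration. On a real cluster, the map and reduce steps would run in parallel across many nodes, and the framework would handle the shuffle.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_transaction(line: str) -> Iterator[Tuple[str, float]]:
    """Map step: turn one raw 'account_id,amount' line into a (key, value) pair."""
    account_id, amount = line.strip().split(",")
    yield account_id, float(amount)


def shuffle(pairs: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Shuffle step: group all values by key (Hadoop does this between map and reduce)."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_account(account_id: str, amounts: List[float]) -> Tuple[str, float]:
    """Reduce step: collapse all amounts for one account into a single total."""
    return account_id, round(sum(amounts), 2)


def run_job(lines: Iterable[str]) -> Dict[str, float]:
    """Run map, shuffle, and reduce sequentially over the input lines."""
    mapped = (pair for line in lines for pair in map_transaction(line))
    grouped = shuffle(mapped)
    return dict(reduce_account(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    # Hypothetical sample transactions, not real client data
    sample = ["A100,250.00", "B200,75.50", "A100,-40.25"]
    print(run_job(sample))  # {'A100': 209.75, 'B200': 75.5}
```

Because each map call looks at only one record and each reduce call looks at only one key, the work can be split across machines without coordination, which is where the speedup over a single sequential batch comes from.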