
ITM 4273

Essay by   •  February 21, 2016  •  Coursework  •  1,953 Words (8 Pages)  •  862 Views



Baljit Kaur

Anmol Kharbanda

Khou Xiong

Parham Dehnadfar

Richard Gillman                                                   

ITM 4273-01

Balu Rajagopal

Group Article Review #1

                                        

1.

Differentiators

Each differentiator below lists the Hadoop approach, the data warehouse approach, and remarks on the impact on extraction of real-time business insights.

Data Repository
Hadoop: Raw data
Data Warehouse: Aggregate and refined data
Remarks: Storing raw data in Hadoop defers the extraction of insights to on-demand questions asked later, whereas a data warehouse is designed to answer predefined questions.
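The contrast can be sketched in a small, purely illustrative Python example (the record fields and questions are hypothetical): the warehouse path aggregates at load time to answer a predefined question, while the Hadoop path keeps the raw records so a question defined only later can still be answered.

```python
# Raw event records, as they might land in HDFS (schema-on-read).
raw_events = [
    {"user": "a", "page": "home", "ms": 120},
    {"user": "b", "page": "cart", "ms": 340},
    {"user": "a", "page": "cart", "ms": 200},
]

# Warehouse-style (schema-on-write): aggregate at load time for a
# predefined question -- "visits per page".
warehouse_table = {}
for e in raw_events:
    warehouse_table[e["page"]] = warehouse_table.get(e["page"], 0) + 1

# Hadoop-style (schema-on-read): the raw data is still there, so a new,
# on-demand question -- "average latency per user" -- can be answered
# later without re-ingesting anything.
def avg_latency(events, user):
    times = [e["ms"] for e in events if e["user"] == user]
    return sum(times) / len(times)

print(warehouse_table)               # answer to the predefined question
print(avg_latency(raw_events, "a"))  # answer to the on-demand question
```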

Query Format
Hadoop: NoSQL
Data Warehouse: SQL
Remarks: During query processing, the data warehouse optimizer examines incoming SQL and considers various plans for executing each query as fast as possible. It does so by comparing the SQL request against extensive data statistics and the database design, which help identify the best combination of execution steps. Hadoop, on the other hand, does not use SQL.
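The cost-based idea behind that optimizer can be sketched as a toy (not any real optimizer's API, and the statistics are made up): each candidate plan gets an estimated cost, and the cheapest one is chosen.

```python
import math

# Hypothetical table statistics the optimizer would consult.
stats = {"orders": 1_000_000, "customers": 10_000}

def estimate_cost(plan):
    """Toy cost model: a full scan costs one unit per row, while an
    index lookup costs roughly log2(rows)."""
    table, access = plan
    rows = stats[table]
    return rows if access == "full_scan" else math.log2(rows)

candidate_plans = [("orders", "full_scan"), ("orders", "index_lookup")]
best = min(candidate_plans, key=estimate_cost)
print(best)  # the cheaper plan wins: the index lookup
```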

Database Technology
Hadoop: HBase
Data Warehouse: RDBMS
Remarks: HBase is column-family oriented, unlike an RDBMS, which is row oriented. An RDBMS handles thousands of queries per second, whereas HBase can handle millions. A data warehouse's maximum data size is in terabytes, while Hadoop scales to hundreds of petabytes.
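The row-versus-column distinction can be illustrated with a small sketch in plain Python (a simplification, not the HBase API): a row store keeps each record together, while a columnar layout groups the values of one column, so scanning a single column touches only that column's data.

```python
# Row-oriented layout (RDBMS-style): one complete record per row.
row_store = [
    {"id": 1, "name": "ann", "balance": 10},
    {"id": 2, "name": "bob", "balance": 25},
]

# Column-oriented layout (HBase-style, simplified): one list per column.
col_store = {
    "id": [1, 2],
    "name": ["ann", "bob"],
    "balance": [10, 25],
}

# Summing one column from the row store visits every record...
total_rows = sum(r["balance"] for r in row_store)
# ...while the column store reads only the one column it needs.
total_cols = sum(col_store["balance"])
print(total_rows, total_cols)  # same answer either way: 35 35
```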

File System
Hadoop: HDFS
Remarks: Hadoop scales out to large clusters of servers and storage, using HDFS to manage huge data sets and spread them across the servers.
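A minimal sketch of that idea (toy block size and node names, not real HDFS code): a file is split into fixed-size blocks, which are then spread round-robin across the cluster's nodes.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Split a file's bytes into fixed-size blocks, HDFS-style."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes):
    """Assign blocks to nodes round-robin so the data spans the cluster."""
    placement = {n: [] for n in nodes}
    for i, block in enumerate(blocks):
        placement[nodes[i % len(nodes)]].append(block)
    return placement

file_bytes = b"x" * 1000                       # a 1000-byte "file"
blocks = split_into_blocks(file_bytes, 256)    # toy 256-byte blocks
layout = place_blocks(blocks, ["node1", "node2", "node3"])
print({n: len(bs) for n, bs in layout.items()})  # blocks per node
```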

Tool
Hadoop: File copy (Extract and Transform only)
Data Warehouse: ETL (Extract, Transform, Load)
Remarks: Hadoop is not an ETL tool; it is a platform that supports running ETL processes in parallel. In data warehousing, by contrast, the ETL server becomes infeasible at big-data volumes, because all of the data must be moved to one storage area.
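The "ETL in parallel" point can be sketched in plain Python, with threads standing in for what Hadoop would run across machines: the transform step is applied to independent partitions concurrently rather than funneled through a single ETL server. The data and transform are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def transform(partition):
    """A toy transform step: clean and normalize one partition of records."""
    return [s.strip().lower() for s in partition]

# Extracted data arrives already partitioned (as it would be in HDFS).
partitions = [[" Alpha ", "BETA"], ["Gamma", " dElTa "]]

# Run the transform over all partitions in parallel -- the Hadoop model --
# instead of routing everything through one central ETL server.
with ThreadPoolExecutor() as pool:
    transformed = list(pool.map(transform, partitions))

print(transformed)
```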

                        

Setup
Hadoop: Multiple machines
Data Warehouse: Single relational database (serves as the central store)
Remarks: Hadoop file systems are designed to span multiple machines and can handle huge volumes of data that surpass the capability of any single machine.

Data
Hadoop: Raw data
Data Warehouse: Structured relational data
Remarks: Hadoop uses HDFS, which is often backed by cloud storage (the cloud is cheap and flexible); one can still run ETL and build a data warehouse on top of it using Hive. Because the raw data remains available in Hadoop, new questions can be defined and complex analyses run over all of the raw historical data.

Managing and Analyzing Data
Hadoop: Hive
Data Warehouse: ETL
Remarks: The Hadoop toolset offers great flexibility and analytical power: by splitting a task across large numbers of cheap commodity machines, it enables far more powerful, speculative, and rapid analyses than are possible in a traditional warehouse.
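The split-compute model described here is essentially MapReduce. A minimal single-process sketch (illustrative only, not the Hadoop API) of the classic word count:

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(w, 1) for w in line.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big insight", "big cluster"]
# Each line could be mapped on a different commodity machine; here the
# map outputs are simply concatenated before the reduce step.
pairs = [p for line in lines for p in map_phase(line)]
print(reduce_phase(pairs))
```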

Running Workloads
Hadoop: Fluctuating
Data Warehouse: Constant and predictable
Remarks: Hadoop can spin virtual servers up or down on demand within minutes; in the cloud, it provides the flexible scalability needed to handle fluctuating workloads.

Finance / Cost
Hadoop: Inexpensive
Data Warehouse: Costly
Remarks: Hadoop can be a very inexpensive alternative to a data warehouse. Structured data can be stored across cheap commodity compute and storage nodes rather than on large servers. Hadoop breaks the data into chunks and lets HDFS, its distributed file system, maintain three copies of each chunk. Tools such as Pig and Hive can then run queries much as a data warehouse would, with Ambari and Flume supporting cluster management and data ingestion.
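The threefold replication mentioned above can be sketched as follows (hypothetical chunk and node names; real HDFS placement is also rack-aware):

```python
def replicate(chunks, nodes, copies=3):
    """Assign each chunk to `copies` distinct nodes, round-robin."""
    placement = {}
    for i, chunk in enumerate(chunks):
        placement[chunk] = [nodes[(i + r) % len(nodes)] for r in range(copies)]
    return placement

layout = replicate(["chunk0", "chunk1"], ["n1", "n2", "n3", "n4"])
print(layout)
# Every chunk lives on three different nodes, so losing any single
# node still leaves two copies of each chunk.
```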

                        

2.

Shades of Grey Areas

Each grey area below is scored for Hadoop and for the data warehouse, with remarks.

Provisional Data, such as Clickstream Data
Hadoop: 9 - Flexible, fast time to value, and not limited by governance committees or administrators.
Data Warehouse: 7 - Limited to the regional bank example, but it provides quick identification of overlapping consumers and comparison of account quality against existing accounts.
Remarks: Hadoop stores and refines the data, then loads some of the refined data into the data warehouse for further analysis.

Sandbox Analysis (small samples versus all data)
Hadoop: 8 - Raw data in quantity, with no limitation on data set size.
Data Warehouse: 7 - Has clean, integrated data.
Remarks: For both Hadoop and the data warehouse, data-mining capability is determined by how much data is used.

In-Memory Data Processing
Hadoop: 8 - Enables fast data processing, avoids duplication of data, and eliminates unnecessary data movement.
Data Warehouse: 9 - Provides self-service analytics.
Remarks: By providing self-service analytics, the data warehouse is the better choice here.

Complex Batch Analysis
Hadoop: 8 - Can process massive amounts of data.
Data Warehouse: 8 - Can process massive amounts of data; jobs run in minutes and are not invoked by business users.
Remarks: Both Hadoop and the data warehouse can run complex batch jobs to process massive amounts of data.

Interactive Analysis
Hadoop: 7 - Used when programs must run in parallel for scalability and are highly complex.
Data Warehouse: 9 - SQL programming combined with a parallel data warehouse is probably the best choice.
Remarks: If there is a requirement to run any language, at any level of program complexity, in parallel, Hadoop is favored.

Prediction Analysis
Hadoop: 8 - Runs predictive analytics in parallel against enormous quantities of data.
Data Warehouse: 8 - Works on small samples of data compared to Hadoop, but the data is clean and integrated.
Remarks: Hadoop is favored when the data is too big for a data warehouse to process.

Recommendation Analysis
Hadoop: 8 - Ideal for pulling apart clickstreams from websites to find consumer preferences.
Data Warehouse: 8 - Includes components that detect customer activity and use a recommendation engine to persuade the consumer to stay.
Remarks: Scrubbed data from Hadoop can be imported into the data warehouse to ensure data governance.

Text Analysis
Hadoop: 10 - Good at finding keywords and performing analysis.
Data Warehouse: 3 - Relational databases are not good at parsing text.
Remarks: Hadoop maps the text and, after refinement, stores it in the data warehouse.
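The keyword-finding step can be sketched as a map phase over raw text (the documents and the tiny stop-word list are hypothetical):

```python
STOP_WORDS = {"the", "a", "is", "and", "of"}  # toy stop-word list

def extract_keywords(document):
    """Map step: lowercase, split, and drop stop words to keep
    candidate keywords from one raw document."""
    return [w for w in document.lower().split() if w not in STOP_WORDS]

docs = ["The price of the product is high", "Shipping is slow and costly"]
keywords = [extract_keywords(d) for d in docs]
print(keywords)  # refined output, ready to load into the warehouse
```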

                                

...

...
