C-rappy Cacophony

Tuesday, July 31, 2012

Big data... Hadoop and such

What is "Big Data"?
The short answer is that size does matter after all.

Sometimes big data is measured in terabytes, petabytes, or more. In the real world, it's usually measured in frustration, annoyance, anxiety, and money down the drain. 

Data becomes "big data" when it basically outgrows your current ability to process it, store it, and cope with it efficiently. Storage has become very cheap in the past decade, which means it has become easy to collect mountains of data. However, our ability to actually process the mountains of data quickly has not scaled as fast. Traditional tools to analyse and store data -- SQL databases, spreadsheets, the Chinese abacus -- were not designed to deal with vast data problems.

The amount of information in the world is now measured in zettabytes. A zettabyte, which is 10^21 bytes (that is 1 followed by twenty-one zeroes), is a big number. Imagine you wrote three paragraphs describing your favorite movie -- that's about 1 kilobyte. Next, imagine you wrote three paragraphs for every grain of sand on the earth -- that amount of information is in the zettabyte range.
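
As a rough sanity check on that comparison, the arithmetic can be sketched in a few lines of Python. The grain-of-sand count used here (about 7.5 x 10^18) is an assumed, commonly quoted order-of-magnitude estimate and not a figure from this post; the 1 kilobyte per write-up is the post's own number.

```python
# Back-of-the-envelope check: how much data is "three paragraphs per grain of sand"?

BYTES_PER_KILOBYTE = 1000        # decimal (SI) kilobyte
BYTES_PER_ZETTABYTE = 10 ** 21   # 1 followed by twenty-one zeroes

# ASSUMPTION: rough order-of-magnitude estimate of the grains of sand on Earth.
GRAINS_OF_SAND = 7.5e18


def sand_essays_in_zettabytes(grains=GRAINS_OF_SAND, kb_per_essay=1):
    """Return the total size, in zettabytes, of one short write-up per grain."""
    total_bytes = grains * kb_per_essay * BYTES_PER_KILOBYTE
    return total_bytes / BYTES_PER_ZETTABYTE


if __name__ == "__main__":
    print(f"{sand_essays_in_zettabytes():.1f} zettabytes")
```

With that assumed grain count the total comes out at a handful of zettabytes, which is all the comparison needs; a different sand estimate shifts the number but not the order of magnitude by much.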

You may "only" have some number of terabytes in your databases, but you still have a lot of data to work with. And that number is only going to balloon in size every year.
It is not advisable to dig out the hole for a pool using only an ice cream scooper; you need a big tool.

What is this Big Tool for Big Data?

Hadoop is the best tool available today for processing and storing herculean amounts of big data. Hadoop throws hundreds or thousands of computers at the big data problem, rather than using a single computer.

Hadoop makes data mining, analytics, and processing of big data cheap and fast. Hadoop can take most of your big data problems and unlock the answers, because you can keep all your data, including all of your historical data, and get an answer before your children graduate college.

Apache Hadoop is an open-source project inspired by research published by Google. Since you were wondering, Hadoop is named after the stuffed toy elephant that belonged to the lead programmer's son. This explains the preponderance of pachyderms wherever Hadoop is mentioned.

In Hadoop parlance, the group of coordinated computers is called a cluster, and the individual computers in the cluster are called nodes.

What is Hadoop good at?

Hadoop is awesome:

Hadoop is cheap. Hadoop is an open-source Apache project, which means anybody is free to use it. Hadoop runs on commodity hardware (i.e. normal everyday computers), so you don't have to buy million-dollar specialized database machines.

Hadoop is fast. On a large enough cluster, Hadoop can deal with terabytes of data in minutes, and with petabytes in hours. Hadoop is how companies with gigantic amounts of data like Facebook, Twitter, Yahoo, eBay, and Amazon can cost-effectively and quickly make decisions.

Hadoop scales to large amounts of big data storage. Need to add more space? Just add more hard drives to a node, or even add more nodes to your cluster. You never have to shut down Hadoop to do it.

Hadoop scales to large amounts of big data computation. Is your cluster slow? Just add more nodes to spread out the computation. Hadoop scales almost linearly in many cases -- this means you can roughly halve the time it takes to do a job by doubling the number of compute nodes.

Hadoop is flexible with types of big data. Are you dealing with structured data? Great. Do you have semi-structured or unstructured (document-oriented) data? Lovely. Hadoop stores and processes any kind of data.

Hadoop is flexible with programming languages. Hadoop itself is written in Java, but you can query your data through Apache Hive, which offers a SQL-inspired language. If you want a more procedural language for analysis, there is Apache Pig. If you want to get deep into the framework, you can custom-analyse your data by writing code in Java, C/C++, Ruby, Python, C#, QBASIC or anything else.
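
To make the language flexibility concrete: Hadoop ships a utility called Hadoop Streaming that lets any program that reads lines from standard input and writes lines to standard output act as the map or reduce step. Below is a minimal sketch of a word count in that style in Python. The function names and the local `run_locally` helper are illustrative assumptions, not part of Hadoop; the helper only imitates the sort-by-key step that Hadoop performs between map and reduce, so the example can be tried without a cluster.

```python
import itertools


def mapper(lines):
    """Map step: emit a 'word<TAB>1' line for every word seen."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Reduce step: input arrives sorted by key, so equal words are adjacent."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in itertools.groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """HYPOTHETICAL helper: stands in for Hadoop's shuffle/sort between the steps."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for result in run_locally(["big data big tool", "Big elephant"]):
        print(result)
```

In an actual Streaming job the mapper and reducer would typically live in two separate scripts that loop over standard input, and the job would be launched with the Hadoop Streaming jar, pointing its -mapper and -reducer options at those scripts.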

What is (Plain) Hadoop bad at?

In the real world, just downloading Plain Hadoop from the Apache website and trying to use it has some shortcomings:

Plain Hadoop is hard to set up. Have you tried setting up this thing? Your best bet may be to kidnap some professors and press them into your service.

Plain Hadoop is hard to manage. How do you do anything? Where is the graphical user interface? Oh, there is none.

Plain Hadoop is hard to keep alive. Hadoop has various single points of failure. When Hadoop collapses, you lose data and you lose time. That hurts.

Plain Hadoop is hard to use. Seriously, this is not a joke. Even adding up a list of numbers is painful.

Plain Hadoop is not secure. Your files are not secure and users can easily corrupt or steal data. I hope you trust everybody.

Plain Hadoop is not optimized for your hardware. Hadoop does not run at full capacity for your hardware, which is like being stuck in second gear.
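
The complaint above about adding up a list of numbers is easier to appreciate with a sketch. In plain Python the whole job is `sum(numbers)`; expressed in the MapReduce shape that Hadoop expects, the same sum needs a map step, a grouping step, and a reduce step. The code below only simulates that shape locally. The function names and the single "total" key are illustrative assumptions, and a real Hadoop job would add job configuration and input/output handling on top of this.

```python
from collections import defaultdict


def map_step(number):
    """Emit a (key, value) pair; one shared key routes every number to one reducer."""
    return ("total", number)


def shuffle(pairs):
    """Group values by key, imitating what the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_step(key, values):
    """Collapse all values collected for a key into a single result."""
    return (key, sum(values))


def mapreduce_sum(numbers):
    """Run the three phases in order and return the grand total (0 for no input)."""
    pairs = [map_step(n) for n in numbers]
    grouped = shuffle(pairs)
    results = dict(reduce_step(k, v) for k, v in grouped.items())
    return results.get("total", 0)


if __name__ == "__main__":
    numbers = [3, 1, 4, 1, 5, 9]
    print(mapreduce_sum(numbers))  # same answer as sum(numbers)
```

The extra ceremony only pays off when the list is too large for one machine, because each phase can then be spread across the nodes of the cluster.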

The good news is that you can have all the good parts of Hadoop with none of the bad parts.

Zettaset Big Data is a faster, more reliable, easier, and more secure Hadoop.

Zettaset Big Data is just better.

Rajesh Vijayaraghavan

Posted by rajesh

