news4geeks.net
19 Jun 2012

Hadoop becomes critical cog in the big data machine

Apache's Hadoop technologies are becoming critical in helping enterprises manage vast amounts of data, with users ranging from NASA to Twitter to Netflix increasing their reliance on the open source distributed computing platform.

Hadoop has gathered momentum as a way to tackle big data, the effort by enterprises to derive value from the rapidly growing amounts of data in their computer systems. Recognizing Hadoop's potential, users are both adopting the existing Hadoop platform technologies and developing their own technologies to complement the Hadoop stack.

Hadoop's corporate usage now and in the future

NASA expects Hadoop to handle large data loads in projects such as its Square Kilometer Array sky-imaging effort, which will churn out 700TBps when built in the next decade. The data systems will include Hadoop, as well as technologies such as Apache OODT (Object Oriented Data Technology), to cope with the massive data loads, says Chris Mattmann, a senior computer scientist at NASA.


Twitter is a big user of Hadoop. "All of the relevance products [offering personalized recommendations to users] have some interaction with Hadoop," says Oscar Boykin, a Twitter data scientist. The company has been using Hadoop for about four years and has even developed Scalding, a Scala library intended to make it easy to write Hadoop MapReduce jobs; it is built on top of the Cascading Java library, which is designed to abstract away Hadoop's complexity.

Hadoop subprojects include MapReduce, which is a software framework for processing large data sets on compute clusters; HDFS (Hadoop Distributed File System), which provides high-throughput access to application data; and Common, which offers utilities to support other Hadoop subprojects.

Movie rental service Netflix has begun using Apache ZooKeeper, a Hadoop-related technology for configuration management. "We use it for all kinds of things: distributed locks, some queuing, and leader election" for prioritizing service activity, says Jordan Zimmerman, a senior platform engineer at Netflix. "We open-sourced a client for ZooKeeper that I wrote called Curator"; the client serves as a library for developers to connect to ZooKeeper.
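
For readers new to the MapReduce model named among the subprojects above, the idea can be illustrated with a toy word count. This is only a single-process sketch of the map, shuffle, and reduce steps, not Hadoop's API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and a real Hadoop job would distribute each step across a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group emitted values by key before reducing."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence, entirely in memory."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Calling word_count on a handful of text lines returns a dictionary of word totals. On a cluster, the map and reduce steps would run in parallel on many machines, with the framework handling the grouping in between.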

The Tagged social network is using Hadoop technology for data analytics, processing about half a terabyte of new data daily, says Rich McKinley, Tagged's senior data engineer. Hadoop is being applied to tasks beyond the capacity of its Greenplum database, which is still in use at Tagged: "We're looking toward doing more with Hadoop just for scale."

Although they laud Hadoop, users see issues that need fixing, such as deficiencies in reliability and job-tracking. Tagged's McKinley notes a problem with latency: "The time to get data in is quite quick and then, of course, I think everybody's big complaint is the high latency for doing your queries." Tagged has used Apache Hive, another Hadoop-derived project, for ad hoc queries. "That can take several minutes to get in a result that in Greenplum would return in a couple of seconds." Using Hadoop is cheaper than using Greenplum, though.

What's in store for Hadoop 2.0

Hadoop 1.0 was released late in 2011, featuring strong authentication via Kerberos and support for the HBase database. The release also keeps individual users from taking down clusters by placing constraints on MapReduce. But a new version is on the horizon: Hortonworks CTO Eric Baldeschwieler has provided a road map for Hadoop that includes the upcoming 2.0 release. (Hortonworks has been a contributor to Apache Hadoop.) Version 2.0, which went into an alpha release phase earlier this year, "has an end-to-end rewrite of the MapReduce layer and a pretty complete rewrite of all the storage logic and the HDFS layer as well," Baldeschwieler says.

Hadoop 2.0 focuses on scale and innovation, with YARN (next-generation MapReduce) and federation capabilities. YARN will let users add their own compute models so that they do not have to stick to MapReduce. "We're really looking forward to the community inventing many new ways of using Hadoop," Baldeschwieler says. Expected uses include real-time applications and machine-learning algorithms. Scalable, pluggable storage is also planned.

Always-on capabilities in Version 2.0 will enable clusters with no downtime. General availability of Hadoop 2.0 is expected within a year.

(Source: computerworld.com)

 
