Hadoop has been the cornerstone of the Big Data movement, and the Hadoop ecosystem has kept growing to stimulate advanced data analytics. According to an Allied Market Research report, the global Hadoop market was $26.47 billion in 2019 and is forecast to reach $340.35 billion by 2027, a CAGR of 37.5%.
In essence, Hadoop is an open-source framework for processing huge volumes of data, composed of several components: Common, the Hadoop Distributed File System (HDFS), the MapReduce engine, YARN, Hive, HBase, Pig, Oozie, Sqoop, Flume, Mahout, Atlas, Ambari, Ranger, etc. The key benefits of Hadoop include cost-effectiveness, commodity hardware, data reliability, scalable computing, NoSQL storage, and simplicity. Hadoop is a pivotal technology that broke the stranglehold of relational databases and forced a paradigm shift in data warehouses and data lakes in the 2010s.
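To make the MapReduce model concrete, here is a minimal word-count sketch in Java, modeled on the canonical example from the Hadoop MapReduce tutorial; the class name and the input/output paths are illustrative assumptions, not prescriptions.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. an HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

A job like this is typically packaged into a jar and submitted with the standard `hadoop jar` command, with the results written back to HDFS.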
Let’s take a look at how Hadoop evolved from the start. In the very beginning, Google published a paper on GFS in 2003, followed by another paper on MapReduce the following year. In 2005, the Nutch project developers wrote an open-source implementation of GFS and MapReduce. In 2007, Doug Cutting split the distributed computing parts out of Nutch to create Hadoop. In 2008, Yahoo released Hadoop to open source. In 2011, Hadoop version 1.0 was launched. In 2012, Hadoop version 2.0 with YARN went live. Hadoop 3.0 was rolled out in 2017, and the latest version as of 2021 is 3.3.1. Hadoop has been widely used for parallel data processing and data lake storage in various industry sectors. Enterprise adoption of Hadoop became mandatory, as a principal analyst at Forrester pointed out in 2014. More than half of the Fortune 50 companies use Hadoop. Yahoo built a 4,000-node cluster, and Quantcast assembled a cluster with 3,000 cores. Facebook developed a cluster of 2,300 nodes with storage capacity exceeding 100 PB.
On the other hand, Hadoop has a few constraints: batch-mode processing, performance limitations, file-based storage, limited storage scalability, the small-file problem, and execution overheads (the small-file issue is illustrated in the sketch after this paragraph). In retrospect, I introduced Not only MapReduce (NoMR) in 2012 to supplement MapReduce with real-time processing and analytics capabilities. In the same year, I initiated Hadoopability as a measure to gauge the aptness of Hadoop in a business context, because it was difficult for early adopters to assess Hadoop's suitability more than 10 years ago. Hadoop is the bedrock of the enablement-dependent model in the Big Data Architecture Framework I defined in 2013. Also in 2013, I singled out the complexity and ambiguity of Hadoop in "Blind Men and Hadoop", which led to Hadooplicability, a term I coined in 2014 as a measure of Hadoop's applicability in the enterprise, since it was hard to formulate a solid business case amid Hadoop's vagueness. Further, I called out Not only Hadoop (NoHadoop) in 2014, emphasizing that Hadoop by itself is not enough for real-life Big Data implementations, which resulted in a comprehensive open-source stack, CHIRPS. Seeing the field changes and momentum shift first-hand in multiple real-world initiatives at PB scale, I also pinpointed that “Big Data is really dead” in early 2015.
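To put the small-file constraint in perspective, here is a back-of-the-envelope sketch. It assumes the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per file-system object (file, directory, or block) and a hypothetical workload of 100 million small files; exact figures vary by version and configuration.

```java
public class SmallFileHeapEstimate {
  public static void main(String[] args) {
    // Rule-of-thumb assumption: each file-system object (file, directory, block)
    // tracked in NameNode memory costs on the order of 150 bytes of heap.
    final long bytesPerObject = 150L;

    // Hypothetical workload: 100 million files, each smaller than one HDFS block,
    // so every file contributes one file object plus one block object.
    final long files = 100_000_000L;
    final long objectsPerFile = 2L;

    long heapBytes = files * objectsPerFile * bytesPerObject;
    System.out.printf("Approximate NameNode heap required: %.1f GB%n", heapBytes / 1e9);
    // Roughly 30 GB of metadata for 100 million small files, which is why many
    // small files strain HDFS far more than a few large ones of the same total size.
  }
}
```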
Needless to say, the adoption pace of Hadoop has declined steadily in recent years. Competitive technologies like Spark started by supplementing Hadoop and then gradually overtook it. New products addressed gaps that Hadoop did not cover, such as object storage. More innovative platforms, such as Flink for stream processing, provide better solutions.
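To illustrate why engines like Spark overtook raw MapReduce for many workloads, here is the same word count expressed with Spark's Java API; this is a minimal sketch running in local mode with hypothetical input and output paths, not a production configuration.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    // Local mode for illustration; on a real cluster the master URL would differ.
    SparkConf conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // The same word count as the MapReduce sketch, expressed as chained
      // in-memory transformations instead of separate Mapper/Reducer classes.
      JavaPairRDD<String, Integer> counts = sc.textFile(args[0])
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
          .mapToPair(word -> new Tuple2<>(word, 1))
          .reduceByKey(Integer::sum);
      counts.saveAsTextFile(args[1]);
    }
  }
}
```

The chained, in-memory transformations replace the Mapper/Reducer boilerplate and the disk-bound job stages of classic MapReduce, which is a large part of why such engines gained ground.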
Hadoop began to fall. Hortonworks exited the market in 2018 when it merged with Cloudera. We saw MapR run out of cash and close its doors in the summer of 2019, to be acquired by HPE. Even the dominant distributor, Cloudera, has been in the red on net profit for the last six years. After years of sluggish revenue and quarterly losses, Cloudera became a private firm in October 2021 in an attempt to find a new path.
Hadoop was the elephant in the room, but it inevitably became outdated, past its prime. The Apache Software Foundation retired 10 Hadoop-related projects in 2021: Apex, Chukwa, Crunch, Eagle, Falcon, Hama, Lens, Sentry, Tajo, and Twill. This is another indicator of its downfall.
Like it or not, the time has arrived for Hadoop to rest in peace. Cloud computing and cloud-native capabilities have been catalyzing the migration away from on-prem Hadoop clusters, and Hadoop is on the path to being completely replaced in the near future. Nevertheless, that does not mean it will vanish quickly in the enterprise world. Just like the death of the mainframe declared years ago, Hadoop will remain in sight, or even revive for a while if renovated properly, as we march forward into the data fabric paradigm. Only time will tell.
It is imperative to have a solid strategy for migrating off Hadoop sooner rather than later, to keep you ahead of the curve. More insights and procedures are detailed in a comprehensive set of guides and workshops covering assessment, pathfinding, planning, enablement, bakeoff, investment, operationalization, integration, patterns, quality, validation, tools, governance, best practices, antipatterns, pitfalls, and so on.
©Tony Shan. All rights reserved. All standard disclaimers apply here.
