Tuesday, November 11, 2014

Big Data Characterized

A recent survey by Evans Data Corp. of Santa Cruz, California, found that more than one-third of the people surveyed expected the overall size of their organization’s data stores to grow by more than 75% over the next year. Nearly 15% indicated that storage requirements would more than double, and 71% of respondents reported needing advanced processing more than half the time. The findings are in line with the trend toward real-time event processing for Big Data. Hence I compiled the following secondary Vs to further characterize Big Data:

  • Viscosity: Viscosity measures the resistance encountered as mass data flows through a system. The resistance arises from friction among integration flow rates, disparate data-origination sources, and the transformations required to convert raw data into information. Efficient messaging systems like Kafka provide strong ordering guarantees in a persistent, high-throughput message queue. Streaming technologies like Storm enable distributed, continuous processing of incoming data in real time. Sophisticated CEP engines further strengthen rule-based, event-driven processing of Big Data, with support for standards such as PMML.
  • Virality: Virality is the ability of data to be distributed over networks, measured as the speed of dispersion across peer-to-peer networks. Time and the number of crosslinks are vital factors that determine the spreading rate. A CDN is a large distributed system that serves content to end users with high performance and availability. P2P-assisted streaming technologies are leveraged in online video by vendors like Netflix.
  • Vigilance: Project teams need to be watchful for the traps and pitfalls in Big Data implementations. More than a handful of organizations have deployed Hadoop extensively in an attempt to process data in real time, not realizing that Hadoop was designed for batch processing. Users must handle data in motion and data at rest differently. For example, one can leverage the Lambda architecture to take advantage of both batch- and stream-processing methods for massive quantities of data. Hybrid use of SQL and NoSQL is also advantageous, but be alert to operational difficulties.
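To make the viscosity point concrete, here is a minimal sketch (not Kafka or Storm code) of the kind of continuous, incremental computation a streaming bolt performs: per-key counts over a sliding window of recent events, updated as each event flows in. The class name, window size, and event labels are illustrative assumptions.

```python
from collections import Counter, deque

class SlidingWindowCounter:
    """Illustrative stream operator: count events per key over the
    last `window` events, updated incrementally as data flows in."""
    def __init__(self, window=3):
        self.window = window
        self.events = deque()    # recent events, oldest first
        self.counts = Counter()  # current per-key counts

    def process(self, key):
        """Ingest one event; evict the oldest once the window is full."""
        self.events.append(key)
        self.counts[key] += 1
        if len(self.events) > self.window:
            old = self.events.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]
        return dict(self.counts)

stream = SlidingWindowCounter(window=3)
for event in ["click", "view", "click", "buy"]:
    snapshot = stream.process(event)
# After the 4th event, the window holds: view, click, buy
print(snapshot)  # {'click': 1, 'view': 1, 'buy': 1}
```

In a real deployment this operator would consume from a partitioned Kafka topic, which preserves per-partition ordering so the window sees events in sequence.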
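The virality factors above (time and crosslinks) can be sketched with a toy dispersion model: a breadth-first traversal over a peer network, where the number of hops to reach all peers stands in for spreading time. The topologies and node names are hypothetical examples.

```python
from collections import deque

def spread_time(adjacency, seed):
    """Hops (BFS rounds) for content seeded at `seed`
    to reach every reachable peer in the network."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for peer in adjacency.get(node, []):
            if peer not in dist:
                dist[peer] = dist[node] + 1
                queue.append(peer)
    return max(dist.values())

# A sparse chain of peers vs. the same peers with one extra crosslink
chain = {"a": ["b"], "b": ["c"], "c": ["d"]}
crosslinked = {"a": ["b", "c"], "b": ["c"], "c": ["d"]}
print(spread_time(chain, "a"))        # 3 hops
print(spread_time(crosslinked, "a"))  # 2 hops
```

Adding a single crosslink cuts the dispersion time, which is why crosslink density is a vital factor in the spreading rate.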
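The Lambda architecture mentioned under Vigilance can be illustrated with a minimal sketch: a batch layer recomputes totals from the full historical dataset, a speed layer holds increments from events that arrived since the last batch run, and a serving-layer query merges the two. Function names and the event data are illustrative assumptions, not a specific product API.

```python
def batch_view(historical_events):
    """Batch layer: recompute totals from the full, immutable master dataset."""
    totals = {}
    for key, value in historical_events:
        totals[key] = totals.get(key, 0) + value
    return totals

def merged_query(batch, realtime, key):
    """Serving layer: merge the precomputed batch view with the
    incremental speed-layer view to answer a query."""
    return batch.get(key, 0) + realtime.get(key, 0)

# Historical data processed in batch; recent events held in the speed layer
batch = batch_view([("clicks", 100), ("clicks", 50), ("views", 300)])
speed = {"clicks": 7}  # events that arrived since the last batch run
print(merged_query(batch, speed, "clicks"))  # 157
print(merged_query(batch, speed, "views"))   # 300
```

The batch layer gives accuracy over the massive historical set (a natural Hadoop workload), while the speed layer keeps query results current between batch runs.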

For more information, please contact Tony Shan (blog@tonyshan.com) or leave your comments below. 
 ©Tony Shan. All rights reserved. All standard disclaimers apply here.
