Next Gen Real-time Streaming with Storm-Kafka Integration

At Infochimps, we are committed to embracing cutting edge technology, while ensuring that the latest Big Data innovations are enterprise-ready. Today, we are proud to deliver on that promise once again by announcing the integration of Storm and Kafka into the Cloud::Streams component of the Infochimps Cloud.

Cloud::Streams provides solutions for challenges involving:

  • Large-scale data collection – clickstream web data, social media and online monitoring, financial market data, machine-to-machine data, sensors, business transactions, listening to or polling application APIs and databases, etc.
  • Real-time stream processing – real-time alerting, tagging and filtering, real-time applications, fast analytical processing like fraud detection or sentiment analysis, data cleansing and transformation, real-time queries, distribution to multiple clients, etc.
  • Analytics system ETL – providing normalized/de-normalized data using customer-defined business logic for various analytics data stores and file systems including Hadoop HDFS, HBase, Elasticsearch, Cassandra, MongoDB, PostgreSQL, MySQL, etc.

Storm and Kafka

Recently in my guest blog post on TechCrunch, I mentioned why you should care about Storm and Kafka.

“With Storm and Kafka, you can conduct stream processing at linear scale, assured that every message gets processed in real-time, reliably. In tandem, Storm and Kafka can handle data velocities of tens of thousands of messages every second.”

Ultimately, Storm and Kafka form the best enterprise-grade real-time ETL and streaming analytics solution on the market today. Our goal is to put the same technology that Twitter uses to process over 400 million tweets per day into your hands. Other companies running Storm in production include Groupon, Alibaba, The Weather Channel, and FullContact, among many others.

Nathan Marz, Storm creator and senior Twitter engineer, comments on Storm’s rapid growth:

“Storm has gained an enormous amount of traction in the past year due to its simplicity, robustness, and high performance. Storm’s tight integration with the queuing and database technologies that companies already use have made it easy to adopt for their stream computing needs.”

Storm solves a broad set of use cases, including “processing messages and updating databases (stream processing), doing a continuous query on data streams and streaming the results into clients (continuous computation), parallelizing an intense query like a search query on the fly (distributed RPC), and more.”
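The stream-processing case is easiest to see with a toy example. The sketch below is plain Python, not the Storm API: the generator stands in for a Storm "spout" (a data source) and the functions stand in for "bolts" (processing steps), which in real Storm run as parallel tasks across a cluster rather than chained in-process.

```python
from collections import Counter

def sentence_spout():
    # Stand-in for a spout: emits raw tuples, like reading from a queue.
    for line in ["storm processes streams", "kafka stores streams"]:
        yield line

def split_bolt(sentences):
    # Stand-in for a bolt: splits each sentence tuple into word tuples.
    for sentence in sentences:
        for word in sentence.split():
            yield word

def count_bolt(words):
    # Stand-in for a stateful bolt: maintains rolling counts, the way a
    # real bolt might update a database as the stream flows through.
    counts = Counter()
    for word in words:
        counts[word] += 1
    return counts

counts = count_bolt(split_bolt(sentence_spout()))
print(counts["streams"])  # 2 — each sentence contributes one "streams"
```

Storm's contribution is making exactly this kind of pipeline distributed, parallel, and fault-tolerant, with guarantees that every tuple is processed.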

Apache Kafka, which was developed by LinkedIn to power its activity streams, provides an additional reliability guarantee, robust message queueing, and distributed publish-subscribe capabilities.
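Kafka's core abstraction is an append-only, partitioned commit log that consumers read at their own pace by tracking offsets. The toy model below illustrates that idea only; the class and method names are illustrative, not the Kafka client API.

```python
class Log:
    # Toy model of a Kafka topic: a set of append-only partitions.
    def __init__(self, partitions=2):
        self.partitions = [[] for _ in range(partitions)]

    def publish(self, key, message):
        # The same key always lands in the same partition, which is how
        # Kafka preserves ordering per key.
        part = hash(key) % len(self.partitions)
        self.partitions[part].append(message)
        return part

    def read(self, partition, offset):
        # Consumers pull from an offset they manage themselves; messages
        # are retained, so independent subscribers can replay the same data.
        return self.partitions[partition][offset:]

log = Log(partitions=2)
p = log.publish("user-42", "click:/home")
log.publish("user-42", "click:/pricing")

# Two independent consumers at two different offsets see the same log.
assert log.read(p, 0) == ["click:/home", "click:/pricing"]
assert log.read(p, 1) == ["click:/pricing"]
```

This retained-log design is what gives Storm-Kafka pipelines their reliability: if a downstream processor fails, it can simply rewind its offset and replay.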


Cloud::Streams is fault-tolerant and linearly scalable, and performs enterprise data collection, transport, and complex in-stream processing. In much the same way that Hadoop provides batch ETL and large-scale batch analytical processing, Cloud::Streams provides real-time ETL and large-scale real-time analytical processing — the perfect complement to Hadoop (or in some cases, what you needed instead of Hadoop).

Cloud::Streams adds important enterprise-class enhancements to Storm and Kafka, including:

  • Integration Connectors to your existing tech environment for collecting required data from a huge variety of data sources in a way that is robust yet as non-invasive as possible
  • Optimizations for highly scalable, reliable data import and distributed ETL (extract, transform, load), fulfilling data transport needs
  • Developer Toolkit for rapid development of decorators, which perform the real-time stream processing
  • Guaranteed delivery framework and data failover snapshots to send processed data to analytics systems, databases, file systems, and applications with extreme reliability
  • Rapid solution development and deployment, along with our expert Big Data methodology and best practices

Infochimps has extensive experience implementing Cloud::Streams, both for clients and for our internal data flows, including large-scale clickstream web data flows, massive Twitter scrapes, the Foursquare firehose, customer purchase data, product pricing data, and more.

Obviously, data failover and optimizations are key to enterprise readiness. Above and beyond that though, Cloud::Streams is a joy to work with because of its flexible Integration Connectors and the Developer Toolkit. No matter where your data is, you can access and ingest it with a variety of input methods. No matter what kind of work you need to perform (parse, transform, augment, split, fork, merge, analyze/process, …) you can quickly develop that processor unit, test it, and deploy it as a Cloud::Streams decorator.
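To make the decorator idea concrete, here is a minimal sketch of how small processing units compose into a flow. The function names and record fields below are hypothetical illustrations, not Cloud::Streams APIs: each decorator consumes a stream of records and yields processed records to the next stage.

```python
import json

def parse(raw_lines):
    # Parse decorator: turn raw JSON lines into dicts, skipping garbage.
    for line in raw_lines:
        try:
            yield json.loads(line)
        except ValueError:
            continue  # a real flow might route failures to an error stream

def transform(records):
    # Transform decorator: normalize a field in place.
    for rec in records:
        rec["text"] = rec.get("text", "").strip().lower()
        yield rec

def analyze(records):
    # Analyze decorator: tag each record (a toy keyword match stands in
    # for real analysis like sentiment scoring).
    for rec in records:
        rec["mentions_storm"] = "storm" in rec["text"]
        yield rec

raw = ['{"text": " Storm is fast "}', "not json", '{"text": "hello"}']
out = list(analyze(transform(parse(raw))))
print(len(out))  # 2 — the malformed line was dropped
```

Because each unit is small and single-purpose, it can be developed and tested in isolation, then deployed into the flow, which is what makes the Developer Toolkit workflow fast.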

One of our most recent customers built an entire production application flow for large-scale social media data analysis using the Infochimps Cloud development framework in just 30 days with only 3 developers. That is both unheard of from an enterprise timeline perspective and an amazing case of business ROI. Big Data is too important to spend months and months developing. Your business needs results now, and the Infochimps Cloud leverages the talent you have today for fast project success.

How much is it worth to you to launch your own revenue-generating applications for your customers? Or for your internal stakeholders as part of a Big Data business intelligence initiative? How much value would launching 12 months sooner provide your organization? These are questions to which we're trying to make the answer obvious.

Steve Blackmon, Director of Data Sciences at W2O Group, explains why they are working with Infochimps and Cloud::Streams:

“Storm and Kafka are excellent platforms for scalable real-time data processing. We are very pleased that Infochimps has embraced Storm and Kafka for Cloud::Streams. This new offering gives us the opportunity to supplement our listening and analytics products with Infochimps’ data sources, to integrate capabilities seamlessly with our partners who also use Storm, and to retain Infochimps’ unique technical team to support and optimize our data pipelines.”

More Information

Check out the full press release here, including quotes from CEO Jim Kaskade and co-founder and CTO Flip Kromer.

You can access additional resources from the Cloud::Streams web page or our general resources directory.

Lastly, check out our previous product announcements! In February, we launched the Infochimps Platform. In April we launched Dashpot as well as our support of OpenStack. In August, we announced the Platform’s newest release.

