
Wednesday, July 31, 2013

Chasing That Killer Application Of Big Data

I often get asked, "what is the killer application of Big Data?" Unfortunately, the answer is not that simple.

In the early days of enterprise software, automation fueled the growth of enterprise applications. The vendors that eventually managed to stay in business and grow were the ones that expanded their footprint to automate more business processes in more industries. What made some of these applications "killer" was merely the existence, and the somewhat mature state, of business processes in alternate forms. Organizations did have financials and supply chains, but those processes were paper-based or partly realized in a set of tools that didn't scale. The objective was to replace these homegrown, non-scalable processes and tools with standardized packaged software that would automate the processes after being customized to the needs of an organization. Some vendors did work hard to understand the problems they set out to solve, but most didn't; they simply poured concrete into existing processes.

The traditional Business Intelligence (BI) market grew the same way; customers were looking for a specific set of reporting problems to be solved in order to run their business. The enterprise applications that automated business processes were not powerful enough to deliver the kind of reporting that organizations expected in order to gain insight into their operations and make decisions. These applications were designed to automate processes, not to provide insights. The BI vendors created packaged tools and technology solutions to address this market. Once again, the vendors didn't have to think about what application problems the organizations were trying to solve.

Now, with the rise of Big Data, the same vendors, and some new ones, are asking that same question: what's the killer application? If Big Data turns out to be as big a wave as the Internet or the cloud, we are certainly at a very early stage. This wave is different from the previous ones in a few ways; it is technology-led innovation that is opening up new ways of running a business. We are at an inflection point of cheap commodity hardware and MPP software designed from the ground up to treat data as a first-class citizen. This is not about automation or filling a known gap. I live this life working with IT and business leaders of small and large organizations worldwide as they struggle to figure out how best to leverage Big Data. These organizations know there's something in this trend for them, but they can't quite put a finger on it.

As a vendor, the best way to look at your strategy is to help customers with their Big Data efforts without chasing a killer application. The killer applications will emerge when you pay attention and observe patterns across your customers. Make Big Data tangible for your customers and design tools that take them away from the complexity of the technology layer. Organizations continue to have massive challenges with semantics as well as with the location and format of their data sources. This is not an exciting domain for many vendors, but help these organizations bring their data together. And, most importantly, try hard to become a trusted advisor and a go-to vendor for Big Data regardless of your portfolio of products and solutions. Waiting for a killer application to get started, or marketing your product as THE killer application of Big Data, are perhaps not the smartest things to do right now.

Big Data is a nascent category; an explosive, promising, but nascent category. Organizations are still trying to get a handle on what it means to them. Business processes in this domain are not yet mature, and the well-defined unsolved problems are not yet clear. While this category plays out on its own, don't chase those killer applications or place your bets on one or two of them. Just get started and help your customers. I promise you will stumble upon that killer application during your journey.

About the picture: I took this picture inside a historic fort in Jaisalmer, India, a place with a rich history. History has taught me a lot about all things enterprise software, as well as things that have nothing to do with enterprise software.

Wednesday, April 18, 2012

4 Big Data Myths - Part II



This is the second and final part of this two-part series on Big Data myths. If you haven't read the first part, check it out here.

Myth # 2: Big Data is old wine in a new bottle

I hear people say, "Oh, that Big Data, we used to call it BI." One of the main challenges with legacy BI has been that you pretty much have to know what you're looking for, based on a limited set of data sources available to you. The so-called "intelligence" is people going around gathering, cleaning, staging, and analyzing data to create pre-canned "reports and dashboards" that answer a few very specific, narrow questions. By the time a question is answered, its value has been diluted. These restrictions stemmed from the fact that computational power was still scarce and the industry lacked sophisticated frameworks and algorithms to actually make sense of data. Traditional BI introduced redundancies at many levels, such as staging, cubes, etc. This in turn reduced the actual data size available to analyze. On top of that, there were no self-service tools to do anything meaningful with this data. IT has always been the gatekeeper, and it has always been resource-constrained. A lot of you can relate to this. If you asked IT to analyze traditional clickstream data, you became a laughingstock.

What is different about Big Data is not only that there's no real need to throw away any kind of data, but also that "enterprise data", which always got VIP treatment in the old BI world while everyone else waited, has lost that elite status. In the world of Big Data, you don't know which data is valuable and which is not until you actually look at it and do something with it. Every few years the industry reaches some sort of an inflection point. In this case, the inflection point is the combination of cheap computing (cloud as well as on-premise appliances) and the emergence of several open, data-centric software frameworks that can leverage this cheap computing.

Traditional BI is a symptom of hardware restrictions and of legacy architectures unable to use relatively newer data frameworks such as Hadoop and plenty of others in the current landscape. Unfortunately, retrofitting the existing technology stack may not be that easy if an organization truly wants to reap the benefits of Big Data. In many cases, buying some disruptive technology is nothing more than a line item on many CIOs' wish lists. I would urge them to think differently. This is not BI 2.0. This is not BI at all as you have known it.


Myth # 1: Data scientist is a glorified data analyst

The role of the data scientist has grown exponentially in popularity. Recently, DJ Patil, a data scientist in residence at Greylock, was featured on Generation Flux by Fast Company. He is the kind of guy you want on your team. I know of quite a few companies that are unable to hire good data scientists despite their willingness to offer above-market compensation. This is also a controversial role, where people argue that a data scientist is just a glorified data analyst. This is not true. The data scientist is the human side of Big Data, and it's real.

If you closely examine the skill set of people in the traditional BI ecosystem, you'll recognize that they fall into two main categories: database experts and reporting experts. Either people specialize in complicated ETL processes, database schemas, vendor-specific data warehousing tools, SQL, etc., or they specialize in reporting tools, working with the "business" and delivering dashboards, reports, etc. This is a broad generalization, but you get the point. There are two challenges with this set-up: a) people are hired for vendor-specific skills such as databases and reporting tools, and b) they have a shallow mandate of getting things done within restrictions that typically lead to silos and a lack of the bigger picture.

The role of a data scientist is not to replace any existing BI people but to complement them. You can expect data scientists to have the following skills:

  • A deep understanding of data and data sources, to explore and discover the patterns in which data is being generated.
  • A theoretical as well as practical (tool-level) understanding of advanced statistical algorithms and machine learning.
  • Strategic connections with the business at all levels, to understand broader as well as deeper business challenges and to translate them into experiments designed around data.
  • The ability to design and instrument the environment and applications to generate and gather new data, and to establish an enterprise-wide data strategy, since one of the promises of Big Data is to leave no data behind and to have no silos.

I have seen some enterprises that have a few people with some of these skills, but they are scattered around the company and typically lack high-level visibility and executive buy-in.

Whether data scientists should be domain experts or not is still being debated. I would strongly argue that the primary thing to look for while hiring a data scientist is how they deal with data, with great curiosity and a lot of whys, and not what kind of data they have dealt with. In my opinion, if you ask a domain expert to be a data expert, preconceived biases and assumptions (the curse of knowledge) will hinder discovery. Being naive and curious about a specific domain actually works better, since such people have no preconceived biases and are open to looking for insights in unusual places. Also, when they look at data in different domains, it helps them connect the dots and apply the insights gained in one domain to solve problems in a different domain.

No company would ever confess that its decisions are not based on hard facts derived from extensive data analysis and discovery. But, as I have often seen, most companies don't even know that many of their decisions could prove to be completely wrong had they had access to the right data and insights. It's scary, but that's the truth. You don't know what you don't know. BI never had one human face that we all could point to. Now, in the new world of Big Data, we can. And it's called a data scientist.

Photo courtesy: Flickr

Friday, March 30, 2012

4 Big Data Myths - Part I



It was cloud then, and it's Big Data now. Every time there's a new disruptive category, it creates a lot of confusion. These categories are not well-defined. They just catch on. What hurts the most are the myths. This is the first part of my two-part series to debunk Big Data myths.

Myth # 4: Big Data is about big data

It's a clear misnomer. "Big Data" is a name that sticks, but it's not just about big data. Defining a category just based on the size of the data is quite primitive and rather silly. And you could argue all day about what size of data qualifies as "big." But the name sticks, and that counts. Insights can come from a very small dataset or a very large one. Big Data is, finally, a promise not to discriminate against any data, small or large.

It used to be prohibitively expensive and almost technologically impossible to analyze large volumes of data. Not anymore. Today, technology (commodity hardware and sophisticated software that leverages this hardware) changes the way people think about small and large data. It's a data continuum. Big Data is not just about technology, either. Technology is just an enabler. It has always been. If you think Big Data is about adopting new shiny technology, that's very limiting. Big Data is an amalgamation of a few trends: data growth of an order of magnitude or two, external data becoming more valuable than internal data, and a shift in computing business models. Companies mainly looked at their operational data, invested in expensive BI solutions, and treated those systems as gold. Very few people in a company got value out of those systems, and even then, very little.

Big Data is about redefining what data actually means to you. Examine the sources you never cared to look at before, and instrument your systems to generate the kind of data that is valuable to you, not to your software vendor. This is not about technology. This is about a completely new way of doing business where data finally gets the driver's seat. The conversations about organizations' brands and their competitors' brands are happening in social media that they neither control nor have a good grasp of. At Uber, Bradley Voytek, a neuroscientist, is looking at interesting ways to analyze real-time data to improve the way Uber does business. Recently, Target came under fire for using data to predict the future needs of a shopper. Opportunities are in abundance.

Myth # 3: Big Data is for expert users    

The last mile of Big Data is the tools. As technology evolves, the tools that allow people to interact with data have significantly improved as well. Without these tools the data is worth nothing. The tools have evolved in all categories, ranging from simple presentation charting frameworks to complex tools used for deep analysis. With the rising popularity and adoption of HTML5 and people's desire to consume data on tablets, the investment in the presentation side of the tools has gone up. Popular JavaScript frameworks such as D3 have allowed people to do interesting things, such as creating a personal annual report. The availability of various datasets published by several public sector agencies in the US has also spurred some creative analysis by data geeks, such as this interactive report that tracks money as people move to different parts of the country.

The other exciting trend has been self-service reporting in the cloud and better abstraction tools on top of complex frameworks such as Hadoop. Without self-service tools, most people will likely be cut off from the data chain even if they have access to the data they want to analyze. I cannot overemphasize how important the tools are in the Big Data value chain. They make it an inclusive system where more people can participate in data discovery, exploration, and analysis. Unusual insights rarely come from experts; they invariably come from people who were always fascinated by data but for whom analyzing data was never part of their day-to-day job. Big Data is about enabling these people to participate - all information accessible to all people.
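
To make the self-service point concrete, here is a minimal Python sketch of the kind of ad hoc question a non-expert can answer with a high-level abstraction such as pandas, instead of waiting in an IT queue for a pre-canned report. The file name and column names are hypothetical.

```python
import pandas as pd

# Load a raw export that would otherwise sit untouched behind an IT queue.
# The file and its columns (user_id, page, duration_sec) are assumptions.
clicks = pd.read_csv("clickstream_export.csv")

# Ad hoc question, asked today: which pages hold attention the longest?
top_pages = (
    clicks.groupby("page")["duration_sec"]
          .mean()
          .sort_values(ascending=False)
          .head(10)
)
print(top_pages)
```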

Coming soon in Part II: Myth # 2 and Myth # 1.

Tuesday, November 9, 2010

Challenging Stonebraker’s Assertions On Data Warehouses - Part 2

Check out Part 1 if you haven't already read it, to better understand the context and my disclaimer. This is Part 2, covering assertions 6 to 10.

Assertion 6: Appliances should be "software only."

“In my 40 years of experience as a computer science professional in the DBMS field, I have yet to see a specialized hardware architecture—a so-called database machine—that wins.”

This is a black swan effect; just because someone hasn't seen an event occur in his or her lifetime, it doesn't mean that it won't happen. This statement could also be re-written as "In my 40 years of experience, I have yet to see a social network that is used by 500 million people." You get the point. I am the first one who would vote in favor of commodity hardware over specialized hardware, but there are very specific reasons why specialized hardware makes sense in some cases.

“In other words, one can buy general purpose CPU cycles from the major chip vendors or specialized CPU cycles from a database machine vendor.”

Specialized machines don't necessarily mean specialized CPU cycles. I hope the term "CPU cycles" is used as a metaphor and not in its literal meaning.

“Since the volume of the general purpose vendors are 10,000 or 100,000 times the volume of the specialized vendors, their prices are an order of magnitude under those of the specialized vendor.”

This isn’t true. The vendors who make general-purpose hardware also make specialized hardware, and no, it’s not an order of magnitude expensive.

“To be a price- performance winner, the specialized vendor must be at least a factor of 20-30 faster.”

It’s a wrong assumption that BI vendors use specialized hardware just because of the performance reasons. The “specialized” in many cases for an appliance is simply a specialized configuration. The appliance vendors also leverage their relationship with the hardware vendors to fine tune the configuration based on their requirements, negotiate a hefty discount, and execute a joint go-to-market strategy.

Enterprise software follows value-based pricing, not cost-based pricing. The price difference between a commodity system and a specialized appliance is not just the difference in the cost of the hardware it runs on.

“However, every decade several vendors try (and fail).”

I am not sure what the success criteria behind this assertion are for declaring someone a winner or a failure. The acquisitions of Netezza, Greenplum, and Kickfire are recent examples of how well the appliance companies have performed. The incumbent appliance vendors are doing great, too.

“Put differently, I think database appliances are a packaging exercise”

The appliances are far more than a packaging exercise. Other than making sure that the appliance software works on the selected hardware, commoditized or otherwise, they provide a black-box lifecycle-management approach to the customers. The upfront cost of an appliance is a small fraction of the overall money that customers end up spending during the entire lifecycle of an appliance and the related BI efforts. Customers do welcome an approach where they are responsible for managing one appliance instead of five different systems at ten different levels with fifteen different technology-stack versions.

Assertion 7: Hybrid workloads are not optimized by "one-size fits all."

Yes, I agree, but that's not the point. It's difficult to optimize hybrid workloads for a row or a column store, but it is not as difficult if it's a hybrid store.

“Put differently, two specialized systems can each be a factor of 50 faster than the single "one size fits all" system in solution 1.”

Once again, I agree, but it does not apply to all situations. As I discussed earlier, performance is not the only criterion that matters in the BI world. In fact, I would argue the opposite. Because OLTP and OLAP systems are orthogonal, the vendors compromised everything else to gain performance. Now that's changing. Let's take the example of an operational report. This is the kind of report that only has value if consumed in real time. For such reports, the users can't wait until the data is extracted out of the OLTP system, cleaned up, and transferred into the OLAP system. Yes, it could be 50 times faster, but completely useless, since you missed the boat.

The hybrid systems, the ones that combine OLTP and OLAP, are fairly new, but they promise to solve a very specific problem: true real-time. While the hybrid systems evolve, the computational capabilities of OLTP and OLAP systems have started to change as well. I now see OLAP systems supporting write-backs with reasonable throughput, and OLTP systems with good BI-style query performance, all of this achieved through modern hardware and clever use of architectural components.

Let’s not forget what optimization really is. It means desired functionality at reasonable performance. A real-time report, that takes 10 seconds to run could be far more valuable than a report that runs under ten milliseconds, three days later.

“A factor of 50 is nothing to sneeze at.”

Yes, point taken. :-)

Assertion 8: Essentially all data warehouse installations want high availability (HA).

No, they don’t. This is like saying all the customers want five 9 SLA on the cloud. I don’t underestimate the business criticality of a DW if it goes down, but not all the DW are being used 24x7 and are mission critical. One size doesn’t fit all. And, if your DW is not required to be highly available, you need to ask yourself, whether it is fair for you to pay for the HA architectural cost, if you don’t want it. Tiered SLAs are not new, and tiered HA is not a terrible idea.

Let’s talk about the DWs that do require to be highly available.

“Moreover, there is no reason to write a DBMS log if this is going to be the recovery tactic. As such, a source of run-time overhead can be avoided.”

I am a little confused about how this is worded. Which logs are we referring to - those of the source systems or the target systems? The source systems are beyond the control of a BI vendor. There are newer approaches to designing an OLTP system without a log, but that's not up for discussion for this assertion. If the assertion is referring to the logs of the target system, how does that become a run-time overhead? Traditional DW systems are read-only at runtime. They don't write logs back to the system. If he is referring to the logs written while the data is being moved to the DW, that's not really run-time, unless we are referring to it as a hot transfer.

There is one more approach, NoSQL, where eventual consistency is achieved over a period of time and the concept of a "corrupted system" is going away. Incomplete data is expected behavior, and people should plan for it. That's the norm, regardless of whether a system is HA or not. Recently, Netflix moved some of its applications to the cloud, where it has designed a background data fixer to deal with data inconsistencies.
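
To illustrate the idea of a background data fixer, here is a toy Python sketch, not Netflix's actual design; the replica layout and the last-writer-wins resolution rule are assumptions for illustration only. A periodic job scans keys across replicas and converges stale copies toward the newest version.

```python
# Toy sketch of a background "data fixer" for an eventually consistent store.
from typing import Dict, Tuple

Record = Tuple[int, str]  # (version or timestamp, value)

def repair(replicas: Dict[str, Dict[str, Record]]) -> None:
    """Scan all keys across replicas and copy the newest version everywhere."""
    all_keys = {key for replica in replicas.values() for key in replica}
    for key in all_keys:
        # Pick the newest version seen anywhere (last-writer-wins).
        newest = max(
            (replica[key] for replica in replicas.values() if key in replica),
            key=lambda rec: rec[0],
        )
        for replica in replicas.values():
            if replica.get(key) != newest:
                replica[key] = newest  # converge the stale or missing copy

# Example: replica_b is stale for "order:42"; the fixer converges it.
replica_a = {"order:42": (7, "shipped")}
replica_b = {"order:42": (3, "packed")}
repair({"a": replica_a, "b": replica_b})
assert replica_b["order:42"] == (7, "shipped")
```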

HA is not black and white, and there are many more approaches, beyond logs, to achieve the desired outcome.

Assertion 9: DBMSs should support online reprovisioning.

“Hardly anybody wants to take the required amount of down time to dump and reload the DBMS. Likewise, it is a DBA hassle to do so. A much better solution is for the DBMS to support reprovisioning, without going offline. Few systems have this capability today, but vendors should be encouraged to move quickly to provide this feature.”

I agree. I would add one thing. The vendors, even today, have trouble supporting offline provisioning to cater to increasing load. Online reprovisioning is not trivial, since in many cases it requires re-architecting their systems. The vendors typically get away with this, since most customers don't do capacity planning in real time. Unfortunately, traditional BI systems are not commodity systems where the customers can plug in more blades when they want and take them out when they don't.

This is the fundamental reason why the cloud, with its elastic computing, makes a great BI platform for addressing such reprovisioning issues. Read my post "The Future Of BI In The Cloud" if you are inclined to understand how horizontal scale-out systems can help.

Assertion 10: Virtualization often has performance problems in a DBMS world.

This assertion, and the one before this, made me write the post “The Future Of BI In The Cloud”. I would not repeat what I wrote there, but I will quickly highlight what is relevant.

“Until better and cheaper networking makes remote I/O as fast as local I/O at a reasonable cost, one should be very careful about virtualizing DBMS software.”

Virtualizing I/O is not a solution for large DWs with complex queries. However, as I wrote in that post, a good solution is not to make the remote I/O faster, but rather to tap into the innovation of software-only SSD block I/O that is local.

“Of course, the benefits of a virtualized environment are not insignificant, and they may outweigh the performance hit. My only point is to note that virtualizing I/O is not cheap.”

This is what a disruption initially looks like. You start seeing good-enough value in an approach for certain types of solutions, while it still seems expensive for another set of solutions. Over a period of time, rapid innovation and economies of scale remove this price barrier. I think that's where virtualization stands today. Organizations have started to use the cloud for IaaS and SaaS for a variety of solutions, including good-enough self-service BI and performance-optimization solutions. I expect to see more and more innovation in this area, where traditional large DWs will be able to get enough value out of the cloud, even after paying the virtualization overhead.

Thursday, October 28, 2010

Challenging Stonebraker’s Assertions On Data Warehouses - Part 1

I have tremendous respect for Michael Stonebraker. He is an adept visionary. What I like the most about him is his drive and passion to commercialize academic concepts. ACM recently published his article "My Top 10 Assertions About Data Warehouses." If you haven't read it, I would encourage you to do so.

I agree with some of his assertions and disagree with a few. I am grounded in reality, but I do have a progressive viewpoint on this topic. This is my attempt to bring an alternate perspective to the rapidly changing BI world that I am seeing. I hope the readers take it as constructive criticism. This post has been sitting in my draft folder for a while; I finally managed to publish it. This is Part 1, covering assertions 1 to 5. Part 2, with the rest of the assertions, will follow in a few days.

“Please note that I have a financial interest in several database companies, and may be biased in a number of different ways.”

I appreciate Stonebraker’s disclaimer. I do believe that his view is skewed toward what he has seen and invested in. I don’t believe there is anything wrong with that. I like it when people put their money where their mouth is.

As you might know, I work for SAP, but this is my independent blog, and these are my views and not those of SAP. I also try hard not to have SAP product or strategy references on this blog, to maintain my neutral perspective and avoid any possible conflict of interest.

Assertion 1: Star and snowflake schemas are a good idea in the data warehouse world.

This reads like an incomplete statement. Star and snowflake schemas are a good idea because they have been proven to perform well in the data warehouse world with row and column stores. However, there are emergent NoSQL-based data warehouse architectures, which I have started to see, that are far from a star or a snowflake. They are, in fact, schemaless.

“Star and Snowflake schemas are clean, simple, easy to parallelize, and usually result in very high-performance database management system (DBMS) applications.”

The following statement contradicts the statement above.

“However, you will often come up with a design having a large number of attributes in the fact table; 40 attributes are routine and 200 are not uncommon. Current data warehouse administrators usually stand on their heads to make "fat" fact tables perform on current relational database management systems (RDBMSs).”

There are a couple of problems with this assertion:
  1. The schema is not simple; 200 attributes, fat fact tables, and complex joins (see the miniature sketch after this list). What exactly is simple?
  2. Efficient parallelization of a query depends on many factors beyond the schema. How the data is stored and partitioned, the performance of the database engine, and the hardware configuration, to name a few.
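
For readers who want the debate grounded, here is a miniature star-schema sketch in Python with made-up tables: a fact table of keys and measures joined to a dimension table at query time. Real fact tables carry dozens or hundreds of attributes, which is where the "simple" claim starts to strain.

```python
import pandas as pd

# Miniature star schema (made-up data): a fact table of keys and measures.
fact_sales = pd.DataFrame({
    "date_key":    [20100901, 20100901, 20100902],
    "product_key": [1, 2, 1],
    "revenue":     [120.0, 80.0, 95.0],
})
# A dimension table describing the products.
dim_product = pd.DataFrame({
    "product_key": [1, 2],
    "category":    ["laptop", "tablet"],
})

# A typical DW query: join the fact table to a dimension and aggregate.
report = (
    fact_sales.merge(dim_product, on="product_key")
              .groupby("category")["revenue"]
              .sum()
)
print(report)  # category-level revenue; real fact tables are much "fatter"
```
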
"If you are a data warehouse designer and come up with something other than a snowflake schema, you should probably rethink your design.”

Really?

The requirement that the schema be perfect upfront has introduced most of the problems in the BI world. I call it the design-time latency. This is the time between when a business user decides what report or information to request and when she gets it (mostly the wrong one). The problem is that you can only report based on what you have in your DW and what it's tuned for.

This is why the schemaless approach seems more promising: it can cut down the design-time latency by allowing business users to explore the data and run ad hoc queries without locking into a specific structure.
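
In contrast to the star-schema sketch above, here is an illustrative Python sketch of ad hoc exploration over schemaless records: plain dicts whose fields vary from event to event, with no upfront fact and dimension design. The field names are hypothetical.

```python
# Illustrative only: schemaless records with fields that vary per event.
events = [
    {"type": "order",  "region": "EMEA", "amount": 120.0},
    {"type": "order",  "region": "APJ",  "amount": 80.0, "coupon": "SPRING"},
    {"type": "return", "region": "EMEA", "amount": -40.0},
]

# An ad hoc question asked today, not one designed into a schema months ago:
# net amount by region, counting only records that carry an "amount" field.
net_by_region = {}
for e in events:
    if "amount" in e:
        net_by_region[e["region"]] = net_by_region.get(e["region"], 0.0) + e["amount"]

print(net_by_region)  # {'EMEA': 80.0, 'APJ': 80.0}
```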

Assertion 2: Column stores will dominate the data warehouse market over time, replacing row stores.

This assertion assumes that there are only two ways of organizing data, either in a row store or in a column store. This is not true. Look at my NoSQL explanation above and also in my post “The Future Of BI In The Cloud”, for an alternate storage approach.

This assertion also assumes that access performance is tightly dependent on how the data is stored. While this is true in most cases, many vendors are challenging this assumption by introducing an acceleration layer on top of the storage layer. This approach makes it feasible to achieve consistent query performance through a clever acceleration architecture that acts as an access layer and does not depend on how the data is stored and organized.

“Since fact tables are getting fatter over time as business analysts want access to more and more information, this architectural difference will become increasingly significant. Even when "skinny" fact tables occur or where many attributes are read, a column store is still likely to be advantageous because of its superior compression ability."

I don’t agree with the solution that we should have fatter fact tables when business analysts want more information. Even if this is true, how will column store be advantageous when the data grows beyond the limit where compression isn’t that useful?

“For these reasons, over time, column stores will clearly win”

Even if it is only about rows versus columns, the column store may not be a clear commercial winner in the marketplace. Runtime performance is just one of many factors that customers consider when investing in DW and business intelligence.

“Note that almost all traditional RDBMSs are row stores, including Oracle, SQLServer, Postgres, MySQL, and DB2.”

Exactly!

Row stores, with optimization and acceleration, have demonstrated reasonably good performance and stayed competitive. Not that I favor one over the other, but not all row-based DWs are that large, growing that rapidly, or suffering from performance issues serious enough to warrant a switch from a row store to a column store.

This leads me to my last issue with this assertion. What about a hybrid store, combining row and column? Many vendors are trying to figure this one out, and if they are successful, it could change the BI outlook. I will wait and watch.
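
For completeness, here is a toy Python illustration, with made-up data, of the row-versus-column trade-off this assertion debates: scanning a single attribute out of a fact table, and why a column of repetitive values compresses well under even a naive run-length encoding.

```python
# Toy illustration (made-up data) of row versus column layout.
rows = [
    {"order_id": i, "region": "EMEA" if i % 2 else "APJ", "amount": 100 + i}
    for i in range(6)
]

# Row layout: a scan walks whole records even when it needs one attribute.
total = sum(r["amount"] for r in rows)

# Column layout: the same table stored as one array per attribute.
columns = {
    "order_id": [r["order_id"] for r in rows],
    "region":   [r["region"] for r in rows],
    "amount":   [r["amount"] for r in rows],
}
total_columnar = sum(columns["amount"])  # touches only the "amount" array

def rle(values):
    """Naive run-length encoding: [(value, run_length), ...]."""
    out, run = [], 1
    for prev, cur in zip(values, values[1:]):
        if cur == prev:
            run += 1
        else:
            out.append((prev, run))
            run = 1
    out.append((values[-1], run))
    return out

# Repetitive columns collapse into a few (value, count) pairs.
print(total == total_columnar, rle(sorted(columns["region"])))
```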

Assertion 3: The vast majority of data warehouses are not candidates for main memory or flash memory.

I am assuming that he is referring to volatile flash memory and not flash memory as storage. That said, SSD block storage has huge potential in the BI world.

“It will take a long time before main memory or flash memory becomes cheap enough to handle most warehouse problems.”

Not all DWs are growing at the same speed. One size does not fit all. Even if I agree that the price won't go down significantly, at the current price point main memory and flash memory can speed up many DWs without breaking the bank.

The cost of flash memory is a small fraction of the overall cost of a DW: hardware, licenses, maintenance, and people. If the added cost of flash memory makes the business more agile, reduces maintenance cost, and allows companies to make faster decisions based on smarter insights, it's worth it. The upfront capital cost is not the only deciding factor for BI systems.

“As such, non-disk technology should only be considered for temporary tables, very "hot" data elements, or very small data warehouses.”

This is easier said than done. Customers will spend significantly more time and energy on a complicated architecture to isolate the hot elements and run them on a different software/hardware configuration.

Assertion 4: Massively parallel processor (MPP) systems will be omnipresent in this market.

Yes, MPP is the future. No disagreements. The assertion is not about on-premise versus the cloud, but I truly believe that the cloud is the future of MPP. There are other BI issues that need to be addressed before the cloud becomes a good BI platform for a massive-scale DW, but the cloud will beat any other platform when it comes to MPP with computational elasticity.

Assertion 5: "No knobs" is the only thing that makes any sense.

“In other words, look for "no knobs" as the only way to cut down DBA costs.”

I agree that "no knobs" is what customers should strive for to simplify and streamline their DW administration, but I don't expect these knobs to significantly drive down the overall operational cost, or even the cost associated just with the DBAs. Not all DBAs have a full-time job managing and tuning the DW. DW deployments go through a cycle where the tasks include schema design, requirements gathering, ETL design, etc. Tuning, or using the "knobs," is just one of many tasks that DBAs perform. I absolutely agree that no-knobs would take some burden off the shoulders of a DBA, but I disagree that it would result in significant DBA cost savings.

For a fairly large deployment, there is significant cost associated with the number of IT layers responsible for channeling reports to the business users. There is an opportunity to invest in the right kind of architecture and technology stack for the DW, and in the tools on top of it, to help increase the ratio of business users to BI IT. This should also help speed up the decision-making process based on the insights gained from the data. Isn't that the purpose of having a DW to begin with? I see self-service BI as the only way to make IT scale. Instead of cutting DBA cost, I would rather focus on scaling BI IT with the same budget and broader coverage among the business users in an organization.

Monday, October 25, 2010

The Future Of BI In The Cloud



Actual numbers vary based on whom you ask, but the general consensus is that Business Intelligence (BI) and analytics in the cloud is a fast-growing market. IDC expects a compound annual growth rate (CAGR) of 22.4% through 2013. This growth is primarily driven by two kinds of SaaS applications. The first kind is a purpose-specific, analytics-driven application for business processes such as financial planning, cost optimization, inventory analysis, etc. The second kind is a self-service horizontal analytics application/tool that allows customers and ISVs to analyze data and to create, embed, and share analyses and visualizations.

The category that is still nascent, and would require significant work, is traditional general-purpose BI on large data warehouses (DW) in the cloud. For most enterprises, not only are all their DWs on-premise, but the majority of the business systems that feed data into these DWs are on-premise as well. If these enterprises were to adopt BI in the cloud, it would mean moving all the data, the warehouses, and the associated processes such as ETL into the cloud. But then, the biggest opportunities to innovate for the cloud lie outside of it. I see significant potential to build black-box, appliance-style systems that sit on-premise and encapsulate the on-premise complexity (ETL, lifecycle management, and integration) involved in moving the data to the cloud.

Assuming that the enterprises succeed in moving their data to the cloud, I see a couple of challenges that, if treated as opportunities, will spur the most BI innovation in the cloud.

Traditional OLAP data warehouses don’t translate well into the cloud:

The majority of on-premise data warehouses run on some flavor of a relational or a columnar database. Most BI tools use SQL to access data from these DWs. These databases are not inherently designed to run natively on the cloud. On top of that, the optimizations performed on these DWs, such as sharding, indices, compression, etc., don't translate well into the cloud either, since the cloud is a horizontally elastic, scale-out platform and not a vertically integrated, scale-up system.

Organizations are rethinking their persistence options, as well as their access languages and algorithms, while moving their data to the cloud. Recently, Netflix started moving its systems into the cloud. It's not a BI system, but it has similar characteristics, such as a high volume of read-only data, a few index-based look-ups, etc. The new system uses S3 and SimpleDB instead of Oracle (on-premise). During this transition, Netflix picked availability over consistency. Eventual consistency is certainly an option that BI vendors should consider in the cloud. I have also started seeing DWs in the cloud that use HDFS, Dynamo, and Cassandra. Not all relational and columnar DW systems will translate well into NoSQL, but I cannot overemphasize the importance of re-evaluating your persistence store and access options when you decide to move your data into the cloud.

Hive, a DW infrastructure built on top of Hadoop, is a MapReduce-meets-SQL approach. Facebook has 15 petabytes of data in its DW running Hive to support its BI needs. There are very few companies that require such scale, but the best thing about this approach is that you can grow linearly, technologically as well as economically.
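
Here is "MapReduce meets SQL" in miniature: the HiveQL string below is the declarative form (the table and column names are hypothetical), and the map/reduce pair underneath is a plain-Python stand-in for the kind of job a Hive-like engine would generate from it. This is a sketch of the idea, not Hive's actual execution plan.

```python
from collections import defaultdict

# Declarative form a Hive-style engine would accept (hypothetical table/columns).
HIVEQL = """
SELECT page, COUNT(*) AS views
FROM page_views
GROUP BY page
"""

def map_phase(records):
    # Emit (key, 1) pairs, one per page view.
    for rec in records:
        yield rec["page"], 1

def reduce_phase(pairs):
    # Sum the counts per key.
    counts = defaultdict(int)
    for page, one in pairs:
        counts[page] += one
    return dict(counts)

page_views = [{"page": "/home"}, {"page": "/pricing"}, {"page": "/home"}]
print(reduce_phase(map_phase(page_views)))  # {'/home': 2, '/pricing': 1}
```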

The cloud is not a good platform for I/O-intensive applications such as BI:

One of the major issues with large data warehouses is, well, the data itself. Any kind of complex query typically involves intensive I/O. But I/O virtualization on the cloud simply does not work for large data sets. Remote I/O, due to its latency, is not a viable option. Block I/O is a popular approach for I/O-intensive applications. Amazon EC2 does have block I/O for each instance, but it obviously can't hold all the data, and it's still a disk-based approach.

For BI in the cloud to be successful, what we really need is the ability to scale out block I/O, just like scale-out computing. The good news is that there is at least one company I know of, SolidFire, working on it. I met Dave, the founder, at the Structure conference reception, and he explained to me what he is up to. SolidFire has a software solution that uses solid-state drives (SSDs) as scale-out block I/O. I see huge potential in how this can be used for BI applications.

When you put all the pieces together, it makes sense. The data is distributed across the cloud on a number of SSDs that are available to the processors as block I/O. You run some flavor of NoSQL to store and access this data, one that leverages modern algorithms and, more importantly, the horizontally elastic cloud platform. What you get is commodity, blazingly fast BI at a fraction of the cost, with a pay-as-you-go subscription model.
Now, that’s what I call the future of BI in the cloud.