Why I picked Flexiant as my next challenge

Dear all, I am really happy and proud to announce that I am joining the Flexiant team this week. Over the last few years, Flexiant has been building a stunning Cloud Management Platform with the goal of enabling service providers to enter the cloud space in a few easy steps, while still being able to highly differentiate their service.

The cloud infrastructure market landscape is anything but settled, and I have the ambition to actively contribute to how it will look in the next few years. I am joining Flexiant at a moment when the cloud industry is experiencing tremendous growth, with just a handful of players out there, still-immature technologies, vendors struggling to adapt their business models and a general misperception around cloud services. There is plenty of work to do!

But let me give you a little more insight into why I picked Flexiant and what great things I think we can do together.

A differentiated cloud service

I have enjoyed observing the recent signs that differentiation is required in the cloud infrastructure market. After broad consensus formed around certain technologies, and with such a big (and growing) market to conquer, competition is getting tougher as more players try to come on board every day. Although price initially appears to be the main competitive driver, considering the impressive cloud services portfolio of Amazon Web Services, highly differentiated service offerings will be required by anyone who seriously aims to compete against the giant.

Why would I want to compete with that giant? Wouldn't it be enough to offer some complementary service and exploit Amazon's market reach, instead of going against it? Well, we all know the consequences of a market dominated by a single player; we have seen it before (Microsoft, Oracle, etc.) and we can all agree that in those times innovation was slower than ever, with abuses of dominant positions that hurt the customer experience. The opportunity out there is big and I don't think we want to leave the entire market to one player again, do we? And if the goal of the cloud is to commoditize technology by offering it as a service, it's right there, on the service side, that there is the need and the opportunity to innovate.

Recently we had concrete proof of this need for differentiation. The acquisition of Enstratius by Dell was driven by the need for a highly differentiated cloud service that fills the gap between commodity infrastructure and enterprise requirements. I was lucky enough to work with the Enstratius team, and I can tell you they were winning deals whenever governance and compliance, typical enterprise requirements, were on the table. But the real news was Dell dropping its previously announced OpenStack-powered cloud service, which will now never come to life. All those players betting on OpenStack wanted to make it the industry standard for building cloud infrastructure, and now what? They suddenly remembered they have to compete with each other. And the imperative is: differentiate!

On this matter, our own Tony Lucas (@tonylucas), European pioneer of cloud services and SVP Product at Flexiant (if you don't believe me, check out this video of Tony talking about cloud with Jeff Barr of AWS back in 2007), has written an extensive white paper in which he methodically goes through why cloud federation is not the optimal model for competing in the IaaS market, with differentiation as the winning alternative. Besides suggesting that everyone in this industry read it carefully, it reminded me of the biggest failure of cloud federation we have recently witnessed: vCloud providers. The launch of VMware's own hybrid cloud service is a clear demonstration that federating providers with the same technology but different cultures, goals and SLAs does not work. It can be a short-term opportunity for the "federatable" cloud software vendor, but a sure failure in the mid to long term. Read Tony's paper to understand exactly why.

A matching vision

Those who know me know I am a public-cloud-only believer. "Private cloud" was just a name given by legacy vendors who didn't want to give up their on-premises business while still exploiting the marketing hype and selling extra stuff to their rich customers. "Hybrid cloud" is the name we give to the period it takes to complete the journey to the public cloud.

Again, the most recent moves of the big guys confirm that public cloud is the way to go. Legacy software vendors are trying to convert themselves into service providers, mostly by acquiring companies rather than innovating from the inside (e.g. yesterday's news of IBM's multi-billion-dollar acquisition of SoftLayer). So should we foresee a public cloud market dominated by AWS and challenged only by a few other big whales? I don't think so. While AWS really "gets" the cloud, the cultural conversion needed within traditional vendors will be painful and won't bring anything substantial for at least the next 3 to 5 years. Their current size and their internal resistance to giving up the recurring revenue of the on-premises business will not let them be a real challenge to AWS in the near term. Instead, small, agile, highly innovative and differentiated niche players are the ones that will eventually contribute to defining the next cloud infrastructure market landscape.

For more scientific evidence of why public clouds will take over the world, I can suggest another brilliant read by Alex Bligh (@alexbligh), the Internet rock star who was behind Nominet (the UK domain registry) and is currently CTO at Flexiant. His detailed, methodical analysis led him to this conclusion:

And [so] will be for cloud computing: it’s not the technology that matters per se, it’s the consequent effect on economics. Private cloud is in essence an attempt to use cloud’s technology without gaining any of the efficiencies. It is for service providers to educate their customers and prospects, and the audience will often be financial or strategic as opposed to technical.

Alex Bligh, CTO at Flexiant

An enthusiastic choice

Visionaries like Tony and Alex, a mature product like Flexiant Cloud Orchestrator, and the guidance and business savvy of our CEO George Knox (@GeorgeKnox) are all ingredients that will eventually lead to making a real difference in the coming months. Finding myself aligned with the company's vision and culture, I am really enthusiastic to be on board and I foresee big things ahead of us. Stay tuned and ping me if you want to know more about Flexiant!

ABOUT FLEXIANT

Flexiant is a leading international provider of cloud orchestration software for on-demand, fully automated provisioning of cloud services. Headquartered in Europe, Flexiant's cloud management software gives cloud service providers the business agility, freedom and flexibility to scale, deploy and configure cloud servers, simply and cost-effectively. Vendor agnostic and supporting multiple hypervisors, Flexiant Cloud Orchestrator is a cloud management software suite that is service provider ready, enabling cloud service provisioning through to granular metering, billing and reseller white-label capabilities. Used by over one hundred organizations worldwide, from hosting providers to large MSPs and telcos, Flexiant Cloud Orchestrator is simple to understand, simple to deploy and simple to use. Flexiant was named a 'Gartner Cool Vendor' in Cloud Management, received the Info-Tech Research Group Trendsetter Award and was called an industry double threat by 451 Group. Flexiant customers include ALVEA Services, FP7 Consortium, IS Group, ITEX, and NetGroup. Visit www.flexiant.com.

Virtualization no longer matters

There is no doubt. The product is there. The vision, too. At times they leave some room for arrogance as well but, come on, they are the market leader, aware of being far ahead of anybody else in this field. A field they actually invented themselves. We almost feel like forgiving that arrogance. Don't we?

The AWS Summit 2013 in London was, once again, confirmation that the cloud infrastructure market is there, that the potential is higher than ever and that Amazon "gets" it, drives it and dominates it largely undisturbed. Everyone else struggles to stand out among a huge number of technology companies, old and new, who are strongly convinced they have jumped into the cloud business but whose executives, I'm pretty sure, mostly think that cloud is just the new name for hosting services.

Before going forward, I want to thank Garret Murphy (@garrettmurphy) for having transferred his AWS summit ticket to me, without even knowing who I was, but simply and kindly responding to my tweeted inquiry. I wish him and his Dublin-based startup 247tech.ie the required amount of luck that, coupled with great talent, leads to success.

Now, I won't go through the whole event because, this being a roadshow of which London wasn't the first stop, much has been said already here and here. The general perception I had is that AWS is still focused on presenting the advantages of cloud-based over on-premises IT infrastructure, showing off the rich toolset they have put in place and bringing MANY (I counted nearly 20) customers to testify to how they are using the AWS cloud and what advantages they got from doing so. OK, most of them were the usual hyper-scale Internet companies, but I've seen the effort to bring enterprise testimonials like ATOC (the Association of Train Operating Companies in the UK). However, they all said they use AWS only for web-facing applications, staging environments or big data analytics. The usual stuff we know to be cloud friendly.

What really impressed me was the OpsWorks demo. OpsWorks was released not long ago as the nth complementary Amazon Web Service, this one meant to help operate resilient, self-healing applications in the cloud. Aside from the confusion around what to use when, given the large number of tools available (not counting those from third parties, which grow uncontrolled day by day), there is one evident trend arising from it.

For those who don't know OpsWorks, it is an API-driven layer built on top of Chef to automate the setup, deployment and un-deployment of application stacks. An attempt at DevOps automation. How this is going to meet customers' actual requirements while keeping things simple (i.e. without having to provide too many options) is not clear yet.
During the session demonstrating OpsWorks, the AWS solution architect remarked that no custom AMIs (Amazon Machine Images) are available for selection while creating an application stack. Someone in the audience immediately complained about this on Twitter, probably because he wasn't happy about having to rebuild all his customizations as Chef recipes on top of lightweight base OS images, throwing away his custom VM image.

In fact, there are several advantages to moving the actual machine setup to the post-bootstrap automation layer. For example, the ease of upgrading software versions (e.g. Apache, MySQL) simply by changing a line in a configuration file instead of having to rebuild the whole operating system image. But mostly because, by keeping OS images adherent to clean vendor releases, you will probably find the same images available at other cloud providers, making your application setup completely cross-cloud. Of course there are disadvantages too, including the delay added by operations like software download and configuration that may be necessary each time you decide to scale up your application.
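To make the principle concrete, here is a minimal sketch of the idea (not OpsWorks or Chef themselves, just the bare mechanism): a first-boot script that installs pinned software versions on top of a clean vendor image. The config path, package names and versions are all invented for illustration.

```js
// bootstrap.js - hypothetical first-boot script run on a clean vendor OS image.
// Upgrading Apache or MySQL means editing one line in versions.json instead of
// rebuilding and re-uploading a custom machine image.
'use strict';

const fs = require('fs');
const { execSync } = require('child_process');

// e.g. { "apache2": "2.4.10-1", "mysql-server": "5.6.20-1" }  (made-up versions)
const versions = JSON.parse(fs.readFileSync('/etc/myapp/versions.json', 'utf8'));

for (const [pkg, version] of Object.entries(versions)) {
  console.log(`installing ${pkg} ${version}`);
  // apt-get is just for illustration; the same idea works with yum, pkgin, etc.
  execSync(`apt-get install -y ${pkg}=${version}`, { stdio: 'inherit' });
}
```

A tool like Chef obviously does far more (idempotence, dependency handling, templates), but the portability argument is exactly this: the image stays vanilla, and the knowledge lives in the automation layer.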

Cross-cloud application deployment. No vendor lock-in. Cool. There is actually a Spanish startup called Besol that is building its entire (amazing) product “Tapp into the Cloud” on the management of cross-cloud application stacks, leveraging a rich library of Chef cookbook templates. And while I was writing this post on a flight from London, Jason Hoffman (@jasonh) was being interviewed by GigaOM and, while announcing a better integration between Joyent and Chef, he mentioned the compatibility between cloud environments as a major advantage of using Chef.

What we're observing is a major shift away from leveraging operating system images and towards automation layers that can quickly prepare whatever application you want your virtual server to host. That means that one of the major advantages introduced by virtualization technology, the software manipulation of OS images, which was one of the triggers of the rise of cloud computing, no longer matters.

Potentially, with the adoption of automation platforms like Chef, Puppet or CFEngine, service providers could build a complete cloud infrastructure service, without employing any kind of hypervisor. And this trend is further confirmed by facts like:

Of course there are still advantages to using a hypervisor, because certain applications require architectures made of many micro-instances performing parallel computing, so it's still necessary to slice a server into many small portions. However, with silicon processors gaining ever more cores and applications able to use threads, virtualization may not be so important for the cloud anymore.

In the end, I think we can no longer say that virtualization is the foundation of cloud computing. The correct statement could perhaps be that virtualization inspired cloud computing. But the future may leave even less space for it.

IaaS eats the biggest slice

When I read market research firms saying that SaaS is the cloud model most adopted by enterprises, I can't help but concur, given its ease of use and the simplicity of integration with existing IT assets. Actually, the integration ends up being minimal and entirely in developers' hands: they can make use of the SaaS service's usually comprehensive API, completely bypassing their internal IT department.

So what about IaaS and PaaS? Should those who invested heavily in those two cloud models start worrying about their choice? No way. As my provocative title says, I am quite convinced that the lower layers of the cloud stack will eventually share the whole cloud business, with IaaS eating the biggest slice of it, both directly and indirectly.

I am actually writing this post to give further insight and supporting data to a tweet of mine I wrote some time ago:

Now let’s see what I mean by indirectly.

Layers over layers

In computer science we are used to having layers upon layers, called "abstraction layers", each aimed at hiding the complexity of the layer below while providing some added value and an interface through which the layer above can access resources. With the rise of cloud services, the community's approach has been the same once again: use abstraction layers to handle the increased complexity of IT infrastructures, which now involve thousands of resources to be managed and orchestrated.

As mentioned above, there are three main cloud layers largely accepted by the community: IaaS, PaaS and SaaS. However, many cloud providers don’t fit exclusively in one of them as they tend to enlarge their offering with different services at multiple layers of the stack. Since this creates a little confusion among cloud consumers, I want to take the opportunity to present them one more time from a different perspective, trying to concentrate on what added value each layer brings to the stack.

OK, I still have to work a bit on my ability to represent concepts visually, but I hope the above chart helps clarify things. First, we have raw resources at the bottom of the stack; if we add some elasticity we obtain an IaaS. This is over-simplified, as there is certainly more value brought by any good IaaS layer; however, for the sake of understanding, I'll limit myself to the most evident one: elasticity, a.k.a. the ability to create, destroy, enlarge and shrink computing resources on demand via an API.
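If "via an API" sounds abstract, this is roughly all it means in practice. The endpoint, token and field names below are invented for illustration (every IaaS exposes some equivalent of these two calls), and the sketch assumes a recent Node.js with a global fetch.

```js
// elasticity.js - what "elasticity via an API" boils down to.
'use strict';

async function createServer(apiBase, token) {
  const res = await fetch(`${apiBase}/servers`, {
    method: 'POST',
    headers: { 'Authorization': `Bearer ${token}`, 'Content-Type': 'application/json' },
    body: JSON.stringify({ name: 'web-01', ram: 1024, cpus: 1, image: 'vendor-base-lts' })
  });
  return res.json();   // e.g. { id: '...', state: 'provisioning' }
}

async function destroyServer(apiBase, token, id) {
  await fetch(`${apiBase}/servers/${id}`, {
    method: 'DELETE',
    headers: { 'Authorization': `Bearer ${token}` }
  });
}

// Scale out when load grows, scale back in when it drops: that, in this
// over-simplified view, is the whole added value of the IaaS layer.
```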

Now let's go up a layer: we have an IaaS and we decide to add some DevOps tools and operations such as middleware, auto-scaling, application deployment and code validation mechanisms. In doing so, if the principle of abstraction layers is respected, we no longer need to care about how raw resources are handled, since the IaaS provides us with the tools to automate their management. What we obtain is a Platform-as-a-Service, an environment where multiple users can deploy their applications.

Finally, let's take some business logic that solves a specific problem (e.g. CRM, ERP, etc.) and, provided of course that we have done all the multi-tenancy work and that we want it to be consumed as a service, we are now working at the SaaS layer. At this stage we can concentrate on making our software more powerful, adding killer features and conquering our market niche. We don't need (nor do we want, right?) to take care of all the infrastructure that serves our users, nor do we want to know what hardware lies underneath, as that would just be a distraction from our core business.

Sounds logical, doesn't it? All the layers stack up so nicely and look so complementary. Indeed they are. In fact, cloud companies end up buying services from other cloud companies that operate at a lower level of the stack. For further evidence, I did some small research and found that most SaaS companies deploy their software on top of a PaaS provider which, in turn, deploys its automation layer on top of one (or more) IaaS providers. What does that mean? That if an enterprise adopts a SaaS cloud service and pays for it, some dollars will eventually end up in some IaaS provider's pocket, whether you like it or not.

The infrastructure of PaaS providers

To bring supporting examples, let's check the infrastructures of the most popular PaaS providers, since they're most likely obliged to reveal their backends in order to inform their customers of their data center locations.

Most popular PaaS services rely on IaaS providers:

| PaaS Provider | Supported Languages              | IaaS backends                     |
|---------------|----------------------------------|-----------------------------------|
| Heroku        | Ruby, Java, Python, Node.js      | AWS                               |
| Engine Yard   | Ruby, PHP, Node.js               | AWS, Verizon Terremark            |
| AppFog        | PHP, Java, Node, .NET, Ruby      | AWS, Rackspace, HP Cloud Services |
| OpenShift     | PHP, Java, Node.js, Python, Ruby | AWS, Rackspace                    |
| Nodejitsu     | Node.js                          | Joyent                            |
| AppHarbor     | .NET                             | AWS                               |
| CloudBees     | Java                             | AWS, HP Cloud Services            |

The cloud market is known to be huge, and it is mandatory for every player in the IT industry today to take up a position, a vision and a direction within this space. If you're an investor who wants to participate in the cloud opportunity, it is extremely valuable to understand how the different cloud models currently share the market. On the other hand, if you're an enterprise evaluating the adoption of any cloud service, you should be concerned about who's running the game up and down the cloud stack, as this will eventually affect your service level, your security and your data integrity.

POST UPDATE on 4/8/2013

I've been asked by Jack Clarke (@mappingbabel) of ZDNet on what basis I singled out the above PaaS providers as "most popular". The answer I gave him is press coverage as well as "the field", meaning talking to customers and gathering experiences. It's a simple personal feeling and is not based on any scientific data; I'm a field person rather than a researcher. Besides, I don't think any of those providers is really willing to disclose customer data.

However, it's worth mentioning that there are other PaaS services offered by large vendors that are difficult to rank in terms of popularity: the press usually refers to the vendor as a whole, and since they're no longer in the startup phase you can't even measure the funding they've received from VCs. Despite the difficult measurability, I owe them a mention in this post for being active players in the PaaS landscape, contributing effectively to the cloud awareness battle.

And one can assume the above theory holds for those providers as well, for example with Elastic Beanstalk running on top of EC2 and App Engine running on top of Compute Engine. However, given that the IaaS and the PaaS are provided by the same vendor, these don't trigger any economic transaction and thus no real shift in the measurement of the market size.

The truth on enterprise private clouds

Oh yes!

It feels so great when someone among the most recognized high tech analysts out there writes down exactly what you think. It’s an endorsement of your own thinking to read James Staten (@Staten7) from Forrester Research on “Why your enterprise private cloud is failing”, where he describes so clearly what you’ve always been thinking and trying to explain.

His blog post says two important things:

  1. Enterprise private clouds are failing. As I also wrote in a Quora answer to "What is the future of private cloud?", no matter what marketing and vendors say, efficient, large-scale, production enterprise private clouds don't exist today. In my opinion, cloud is such a new model of delivering IT infrastructure that the culture of its use won't reach the enterprise through a bottom-up approach (evolving from the current infrastructure) but only in a top-down direction (deploying into public clouds and then migrating back in-house). A revolution as opposed to an evolution.
  2. Enterprise private clouds are failing because of the wrong approach taken by the IT department: treating the cloud just like an infrastructure stack instead of a service, because "you are building the private cloud without engaging the buyers who will consume this cloud", Staten says.

And of course, I wasn’t the only one recognizing “the truth” in James Staten’s words. His opinion on failing private clouds echoed throughout the web, generating a large consensus among cloud experts and visionaries such as James Urquhart (@jamesurquhart):

The two cloud models

Much has already been written about the different approaches to the cloud, and big brains have concluded that all of them can be summarized in two different cloud models. They have been given various names depending on the author, but I shall refer to the nomenclature of the OpenNebula blog post.

  1. Datacenter Virtualization model: cloud as an extension of virtualization in the datacenter. Some more automation, a service catalogue, etc. A VMware vCloud-like approach.
  2. Infrastructure Provision model: a powerful, service-oriented API to provision commodity computing resources effectively and efficiently. An AWS-like approach.

With reference to the above models, James Staten is basically saying that the Datacenter Virtualization cloud model is the wrong one. It is not the right approach to implementing a private cloud, because "a Porsche is [not] just a Volkswagen with better engine, tires, suspension and seats."

Awesome. I've been convinced of that for some time. As the title of my very first post, "Cloud Computing is not the evolution of virtualization", says, I have always considered the Infrastructure Provision model the only possible cloud implementation, to the point of excluding Datacenter Virtualization from even being called cloud.

And I don't think this was an extremist position. As I have said many times, cloud is a tremendous opportunity for the enterprise to start thinking differently. In my opinion, cloud will reach enterprise IT departments only through a top-down approach: from a public cloud implementation back in-house. Enterprise cloud consumers will try (and love) the public cloud and eventually drive the implementation of something similar within the enterprise itself. But trying to transform the current virtualized infrastructure into a private cloud will simply fail: fail to deliver a truly elastic and service-oriented cloud infrastructure to the real cloud consumers.

Vendors didn’t get it

So what? Did all enterprise IT departments simply not get it? What's their problem? It's a vendor problem. Enterprise software vendors didn't get it. Every one of them started to think of the cloud as an opportunity (which is fine, as a matter of principle) and they all just tried to profit from the hype. For virtualization technology vendors that was an easy path: add a new product to the portfolio to "cloudify" the existing virtualization products, a natural extension of existing implementations within the enterprise. The perfect scenario for IT departments. Pity that it doesn't deliver what cloud consumers are looking for.

But recently we heard something new from virtualization vendors. They have actively started perceiving public clouds, and AWS in particular, as a threat to the workloads which are (were?) running on their virtualization technology and which are failing to migrate to private clouds for the above reasons. Despite their very rich cloud product portfolios, workloads are still moving from enterprises to commodity public clouds. Why?

Hearing VMware CEO Pat Gelsinger say that he finds it hard to believe they cannot beat a company that sells books makes me think they really didn't get the point at all. Good luck, guys.

There is no such thing as the cloud uptime

Yesterday readwrite.com featured an article by Mike Pav titled "Storm Warning: Why 100% Cloud Uptime Is Impossible", and I thought it was such a piece of misinformation that I decided to write this blog post to help clarify a few things, as I'm really fed up with hearing about the unreliability of "the cloud" in general terms.

Titles are usually provocative and I won't judge this one's veracity. However, there is no such thing as "cloud uptime" because, although the cloud is talked about as a whole, it is made of thousands of components and not all of them go down at once. Rather, the outage of a cloud service tends to be bigger the more these components are interdependent. I'm going to explain this in more detail.

Cloud Outages

The article says "cloud outages" are ultimately inevitable because doing better than 99.99% availability would cost too much, and companies like Netflix (which suffered its cloud provider's outage right on Christmas Eve) will keep using the cloud anyway because "it does a great job of providing ready-to-use features". In other words, it says that using the cloud requires a compromise that companies with multi-million-dollar businesses are ready to make: losing money from time to time in exchange for the flexibility of the cloud. My dear, I refuse to believe that.

First off, cloud providers do things differently and we can't generalize. Let's narrow it down to AWS, since this is the cloud provider the article mainly refers to. AWS is primarily an IaaS provider with some service components operating at the PaaS layer, such as ELB (Elastic Load Balancer). In this context, there is no such thing as a "cloud outage": there is the outage of a component of the cloud that your application relies on and that your application has not been instructed to handle in case of failure.

When working at the PaaS layer your freedom is limited. On the one hand, you don't have to worry about how things work underneath, because the provider does everything for you; on the other hand, you also have to rely on it when it comes to availability and SLAs. Netflix relied on ELB, and their application had no way to handle its failure other than waiting for AWS to fix the problem.

So how should Netflix prevent such problems from now on? As others have also said, they should just build their own load balancing service by operating at the IaaS layer. In that case, they would have the freedom, and the responsibility, to set up multiple LBs in different availability zones or even different data centers, making their application more resilient to any infrastructure outage.
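To give an idea of what that freedom and responsibility look like, here is a minimal sketch of client-side failover between two self-managed load balancers in different availability zones. The host names are invented, it assumes a recent Node.js with a global fetch, and a real setup would add health checks, DNS failover and so on.

```js
// failover.js - falling back to a second, self-managed LB in another zone.
'use strict';

const endpoints = [
  'http://lb-zone-a.example.internal',   // hypothetical LB in availability zone A
  'http://lb-zone-b.example.internal'    // hypothetical LB in availability zone B
];

async function resilientFetch(path) {
  let lastError;
  for (const base of endpoints) {
    try {
      // Short timeout so a dead zone doesn't stall the request.
      return await fetch(base + path, { signal: AbortSignal.timeout(2000) });
    } catch (err) {
      lastError = err;                   // try the next availability zone
    }
  }
  throw lastError;                       // every zone failed
}
```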

The responsibility of a PaaS provider

Later, the article goes through a list of PaaS provider duties in case of an outage. When I read it a second time, I realized that the term PaaS was being misused: the author was actually referring to a generic provider offering any kind of service through the cloud.

However, this gives me the chance to say that a real PaaS provider should never, ever suffer from an underlying infrastructure outage. The PaaS software should be the very best example of a highly available, resilient application, architected to exploit most of the isolation and redundancy mechanisms made available by the underlying IaaS. After all, a PaaS provider mostly employs DevOps engineers who master cloud automation tools and best practices and who know how to make an application resilient.

Moreover, a PaaS cloud is not about elasticity or scalability, as the article says; those two come from the underlying IaaS: it's the infrastructure that scales, it's the infrastructure that grows and shrinks fast. PaaS is all about automation: automated deployment, auto-scaling, automated failover and recovery from infrastructure failures.

What cloud uptime is about

In conclusion, more than 99.99% uptime is actually possible and there are examples of it: Joyent, for one, managed to deliver 99.9999% uptime over the last two years. So how do you build more reliable clouds? Simply by architecting an infrastructure with the smallest possible number of interdependent components. A cloud infrastructure made of distributed and replicated micro-components can deliver scalability and reliability while limiting the impact of an outage, preserving the overall SLA.

Two things to keep in mind for the best uptime of your application in the cloud:

  1. Choose an IaaS provider with an architecture designed to limit the impact of outages. If this sounds too theoretical, think of EBS (AWS Elastic Block Store), a centralized macro-component that is highly dependent on the network.
  2. Choose to keep the freedom to build your own resilient app at the IaaS layer and, if you decide to go PaaS, pick a provider with a refund policy in case of outages that is significant enough for your business.

And in the end, Netflix will keep using the cloud, because they learnt from this experience and they know that mastering cloud best practices can save them from the next (indeed inevitable) infrastructure outage.

Checklist: is my app ready for the cloud?

The cloud is finally losing a bit of its hype, and many organizations' CIOs have heard enough that they are now ready to do something real with it. And the question comes to their minds: which application do I move first?

Enough has been said about the choice between IaaS, PaaS and SaaS that I assume the first step to the cloud will be towards raw infrastructure, giving up a bit of sovereignty but still keeping all the power to architect and manage applications.

But the first moves to the cloud will lead many CIOs into a few mistakes. First off, they will think of the cloud as a simple shift of responsibility for infrastructure management, making cloud adoption only a matter of SLAs, data integrity and security.

As a consequence of the same assumption, they will think they can probably move their business-critical applications over to the cloud "as is", looking for a cloud provider that offers exactly the same manageability and features they were used to in their own data centre.

The cloud is a tremendous opportunity to start thinking differently

I've read two interesting articles recently that contain a couple of very important points about doing things with cloud infrastructure. The first one, titled "Which Apps to Move to the Cloud?", starts by quoting Forrester Research:

[…] you shouldn’t be thinking about what applications you can migrate to the cloud. That isn’t the path to lower costs and greater flexibility. Instead, you should be thinking about how your company can best leverage cloud platforms to enable new capabilities. Then create those new capabilities as enhancements to your existing applications… you have to think differently as you approach cloud development. There’s far more power in application design and configuration once you free yourself from assumed reliance on the infrastructure. The end result is new degrees of freedom for developers – if you embrace the new model.

Later, the author goes through the different types of applications used in the enterprise, comparing them to the layers of an onion (yeah, just like ogres). The inner layers are the applications with the most innovation, intellectual property and value to the company's core business; the outer layers are commodity apps. His conclusion is that the outer layers are probably the better place to start when moving to the cloud.

Again, it's only about risk. Let's start with a lower risk (of losing data or interrupting business processes) in exchange for the popular "more flexibility at lower cost" of the cloud.

The second article (a very smart read, IMHO), which appeared on cio.com, tries to think about the cloud in the enterprise world, something with very few success stories so far, and lists some very important advice. One piece of it caught my attention:

[…] a leading cloud provider would never consider adding any application to its portfolio without a clear plan for how it will scale over time. Corporate IT? Not so much. “They build infrastructure to scale out,” Paquet says, “but if their applications don’t, what problem have they actually solved?” Think scale first. And that may mean ruling out many packaged application. “Most of them are not built to scale out,” says Paquet.

What we understand from these two articles is that the cloud gives you a new infrastructure footprint that "enables new capabilities", and thus it's you who has to adapt to the cloud, not vice versa. Moreover, you have "more power in application design and configuration", meaning application architecture really does make a difference.

Ok but… technically speaking, what exactly can I move out today?

Now that we've got the above statements, we still want to start moving something to the cloud today, and we don't want to develop everything from scratch, adding delay to our cloud adoption. What neither article explains clearly is: what are the technical characteristics of the applications I can move to the cloud?

Here's a checklist that helps you understand whether your application is cloud-ready:

  1. The application must be designed to scale by adding instances of the same application process side by side on different machines, with some kind of mechanism to share the workload that does not depend on the OS.
    This methodology, which results in an app that is both scalable and resilient, is strictly required when moving to the cloud, as you don't know what kind of hardware is being used underneath (you can safely assume it's just commodity servers).
  2. The application data store must be partitionable. If you have a large amount of data growing linearly, you can split it into chunks, each bound to one of the application nodes (see the sketch after this list).
  3. The data store partitions should be replicable to other nodes in order to achieve redundancy.
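Here is a minimal sketch of points 2 and 3: deterministically mapping a key to the application node that owns its partition, plus a second node that holds the replica. The node names are invented, and a real system would use consistent hashing, as in the Dynamo paper mentioned below, so that adding nodes doesn't reshuffle everything.

```js
// partition.js - naive partitioning and replication across application nodes.
'use strict';

const crypto = require('crypto');

const nodes = ['app-node-1', 'app-node-2', 'app-node-3', 'app-node-4'];

function partitionFor(key) {
  // Hash the key so data spreads evenly; modulo picks the owning node.
  const hash = crypto.createHash('md5').update(key).digest();
  return hash.readUInt32BE(0) % nodes.length;
}

function placement(key) {
  const primary = partitionFor(key);
  const replica = (primary + 1) % nodes.length;   // point 3: redundancy
  return { primary: nodes[primary], replica: nodes[replica] };
}

console.log(placement('customer:42'));
// e.g. { primary: 'app-node-3', replica: 'app-node-4' }
```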

If you are running an application that matches the above patterns, feel free to move it to the cloud today without worrying about losing data or interrupting your business processes, provided that you make good use of the application's configuration capabilities!

Since I'm pretty sure you'll have to go through some architectural review, keep in mind, while doing it, to think only at the application level, with nothing strictly depending on the operating system. This will give you extra freedom to migrate between cloud providers and complete self-sufficiency to implement your highly available application tiers the way you prefer.

If you want to dig deeper into these principles, I advise you to read the Amazon Dynamo paper, which explains the theory and the trade-off between consistency and availability, and which inspired great cloud-ready applications like the Riak NoSQL key-value store.

In conclusion, the cloud enables commodity IT infrastructure at an extremely low price. With this in mind, you simply can't demand that if you move your single-instance database onto one virtual machine in the cloud, it will never go down. On the other hand, cloud infrastructures today offer all the mechanisms and features that, if mastered, can help you build the most highly available application clusters you've ever had.

SmartOS improves Node.js debugging experience (part 2)

This blog post is part of a two-part series about debugging Node.js applications. The first part focuses on post-mortem debugging tools and practices; this second part illustrates how to debug latency bubbles in production using DTrace.

Debugging Node.js latency bubbles

Soft real-time systems

One thing that has become clear with Node.js is that it is extremely good for a new breed of applications: Internet-facing, soft real-time systems.

A real-time system is one where the timeliness of the system is, at some level, also its correctness. There is a clear distinction between hard real-time systems, where being late means failing, and soft real-time systems, where being late just means the system kind of "sucks".

With the rise of mobile, social and HTML5, we've seen more and more of this new breed of applications, DIRTy systems (data-intensive real-time systems): Internet-facing, real-time systems that have a human in the loop. And when humans are in the loop, the good news is that deadlines are soft (the system sucks but it doesn't die; people will just complain), but the bad news is that demand is typically non-linear.

Let's imagine you've carefully built your real-time mobile application and suddenly a DJ from Cleveland tells all his listeners that they gotta go download your app and… boom! You get 100,000 people showing up the same night, 400,000 more by the end of the week and 1 million by the end of the month. This happens, it has happened repeatedly, and it will happen again. We are seeing this trend accelerate, and the more computers we have in our pockets, the more we will have to cope with it.

And this is why it's extremely difficult to deal with the challenge of scalability at the same time as the challenge of delivering data in real time.

Debugging latency with DTrace

How do you debug these systems when they go wrong? How do you debug the latency bubbles that constitute failures in these kinds of systems?

Bryan Cantrill (@bcantrill) has worked extensively on building real-time systems during his career, and debugging them has always been a challenge. So he developed DTrace to dynamically instrument those systems: to walk them while they're running, grab timestamps at different parts of the stack and correlate them to figure out where the latency is coming from.

The question was: how could we take DTrace into Node.js?

As was true for interpreting core dumps, in interpreted environments it's extremely difficult to figure out from the bottom of the stack what is going on at the top. Bryan and team had a bunch of ideas, one of them taken from other interpreted environments: instrument the actual VM wherever it performs a function call. It's great and powerful (Erlang did a terrific job of it) but it is too fine-grained.

Eventually, they decided to add USDT (Userland Statically Defined Tracing) probes at certain points of interest like HTTP requests, HTTP responses, GC and so on.

But how can we effectively use DTrace to debug our latency in Node.js? Let’s start by listing all the probes available for all my node processes by typing the following command in a SmartOS shell:

dtrace -l -n 'node*:::'

And we’ll get an output like this:

[Screenshot: dtrace output listing the available node*::: probes]

Apart from the C++ name mangling, you can actually see the points of interest (USDT probes) named http-client-request, http-client-response, etc.

Let’s go enable all of them so that we can see in real time what our node processes are doing.

[root@23c5d173-9973-4d7c-8935-46c6-23ef47a6 ~]# dtrace -n 'node*:::{ printf("%d does %s\n", pid, probename); }' -q

On the left you can see the process IDs and on the right what they’re doing:

[Screenshot: live probe firings, one line per process ID and probe name]

Let’s try to isolate the incoming HTTP activity by instrumenting only the http-client-request:

[root@23c5d173-9973-4d7c-8935-46c6-23ef47a6 ~]# dtrace -n 'http-client-request{ printf("%d does a %s to %s on %s\n", pid, args[0]->method, args[0]->url, args[1]->remoteAddress); }' -q

And we get some more information out of it:

[Screenshot: http-client-request output showing pid, method, URL and remote address]

If we want to see the code actually executed upon HTTP requests, we can generate a stack trace whenever they occur by using the ustack() function:

[root@23c5d173-9973-4d7c-8935-46c6-23ef47a6 ~]# dtrace -n 'http-client-request{ printf("%s:\n", args[0]->method); ustack(); }' -q

That prints out the stack backtrace:

[Screenshot: ustack() output with raw, unresolved stack frames]

We printed the actual method called, "PUT" (args[0]->method), and right after it the stack trace of what was executed upon the request.

But we're now back to the other problem: what the hell is this? Bryan and team faced another challenge: how do you turn all of this into V8 frames from the context of the kernel?

And Dave Pacheco (@dapsays), who doesn't know the definition of impossible (see part 1 of this blog post), solved this for the JavaScript environment. This is how: when V8 starts, it expresses in an intermediate representation how to take one of these frames and turn it into an actual string, and all of that is downloaded into the kernel when the virtual machine starts. Then, whenever a stack trace is generated, this time by the jstack() function, the mapping table is evaluated and the frames are turned into properly readable ones.

[root@23c5d173-9973-4d7c-8935-46c6-23ef47a6 ~]# dtrace -n 'http-client-request{ printf("%s:\n", args[0]->method); jstack(); }' -x jstackstrsize=8k -q

Now we can see the actual JavaScript that was executed upon a GET:

[Screenshot: jstack() output with resolved JavaScript frames]

As you may have realized, this shines a very bright light on what was previously a total black hole. If you have a misbehaving Node.js program and you don't have this kind of technology, you're hosed.

During Node Summit back in January 2012, we heard practitioners talking about the big problems of Node.js, and it was all about production debuggability. This is what Joyent has invested a lot in with SmartOS, even if the truth is that we did it to debug our own problems; that's true of DTrace too!

The remaining challenge was that the USDT methodology was difficult to use from JavaScript. Fortunately, Chris Andrews developed the Node.js DTrace provider, which allows you to define your own probes (your own "points of interest") entirely in JavaScript.

All of the above has been available in Node.js since 0.6.7 and it's there by default; you don't have to do anything to enable it.
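To give an idea of what defining your own probes looks like, here is a minimal sketch using Chris Andrews' dtrace-provider module. The provider name, probe name and arguments are my own example, the exact API details are written from memory of the module's documentation, and the native part of the module only does something useful on DTrace-capable platforms like SmartOS.

```js
// custom-probes.js - application-level USDT probes from JavaScript
// (npm install dtrace-provider).
const d = require('dtrace-provider');

const provider = d.createDTraceProvider('myapp');
// The strings after the probe name declare the argument types it fires with.
provider.addProbe('query-start', 'char *', 'int');
provider.enable();

function runQuery(sql) {
  // The callback only runs when the probe is actually enabled by dtrace,
  // so firing it is essentially free in production.
  provider.fire('query-start', function () {
    return [sql, sql.length];
  });
  // ... run the query ...
}

runQuery('SELECT 1');
// From a SmartOS shell you could then watch it with something like:
//   dtrace -n 'myapp*:::query-start{ trace(copyinstr(arg0)); }'
```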

Visualizing latency

In terms of visualizing latency, another colleague of mine at Joyent, Brendan Gregg (@brendangregg), has done a terrific job. One of the most common problems is Node.js programs using too much CPU. Brendan hunts this down by profiling the CPU at regular intervals, taking the stack traces, aggregating them by smashing the results together, re-sorting them and displaying them as a "flame graph":

[Screenshot: a flame graph of a Node.js program]

The stack shows both JavaScript and C++ frames in a way that lets you easily identify where your program is spending most of its CPU time. And it's good to know that all the tools to generate flame graphs are open source and on GitHub; you can already use them in production to find important bugs or latency bubbles throughout your Node.js code.

Conclusion

SmartOS is Joyent's foundation for the NodeStack, but Node.js runs everywhere. We at Joyent are not binding Node.js to a particular platform. We're committed to investing further in SmartOS to make it the natural choice for your production Node.js environment. And we're going to do that by giving you great technology that allows you to understand your Node.js app in a way you can't on any other platform.

SmartOS is an open-source project and it can be consumed as a service on the Joyent Public Cloud, where all the above-mentioned tools are enabled by default.

But now I would like to hear from you: how do you debug your Node.js applications today? Do you consider debugging in production to be one of the biggest Node.js challenges?

End of part 2. You can watch all NodeStack videos, including the one by Bryan Cantrill that was summarized here, by registering yourself for free on the conference website.

SmartOS improves Node.js debugging experience (part 1)

This blog post is part of a two-part series about debugging Node.js applications. This first part focuses on post-mortem debugging tools and practices; the second part illustrates how to debug latency bubbles in production using DTrace.

Recently I watched one of the very first online conferences about the rise of a new software stack built for mobile and web applications. The new stack is NodeStack and it comprises Node.js, MongoDB and SmartOS. It is intended to replace the now-surpassed LAMP stack, as modern applications have to deliver real-time responses at scale when interacting with an exploding number of mobile devices.

The NodeStack conference featured a fabulous talk by Bryan Cantrill (@bcantrill), Joyent SVP of Engineering, going through why Joyent's operating system, SmartOS, really makes a difference within the stack. His talk was so good that I decided to make my own contribution by translating it into this blog post. If debugging your Node.js apps in production sounds like a dream come true, read on!

SmartOS, the foundation of the NodeStack

Does the foundation really matter? It's often very tempting to dismiss the foundation and concentrate on the appearance of things but, just like with buildings, the foundation is really critical, and it matters not so much when things are working as when things fail.

When your program fails you need the foundation, the operating system, to really understand what happened. When your component has failed, it's gone, and all that's left is inside the operating system, like footprints of what used to be the component.

Sir Maurice Wilkes, the father of computing, built the first stored-program computer back in 1949. The first programmer in history already realized that it wasn't easy to get programs right.

[Slide: Maurice Wilkes quote on the difficulty of getting programs right]

SmartOS is Joyent's open-source operating system. It is a derivative of illumos, the community-driven fork of OpenSolaris that was born when that project was made proprietary. It is backed by many former Sun Microsystems engineers and is built to be the operating system for the cloud. You may want to check out www.smartos.org for more information.

Debugging Node.js logic failures

First off, programs fail because of internal logic errors. A bug can cause them to die, exit improperly or end up in infinite loops. To debug these kinds of failures you often need tight integration with the underlying OS.

A real use case

To give real examples, Bryan speaks from his own experience, since Joyent builds all the software orchestrating its cloud using Node.js. In the past, Bryan and his team, including Ryan Dahl (the creator of Node.js) and Dave Pacheco (@dapsays), hit a black hole: a non-deterministic infinite loop inside their application, right before deploying to production.

They were looking at the generated stack, which looked like this:

[Screenshot: the original stack trace, with no recognizable JavaScript frames]

Obviously, you have no idea where you are in the code.

They eventually deployed the application to production and, even though they expected to see the bug happen immediately, they didn't see it for months. And here comes the difference between an amateur and a professional: an amateur happily says "the bug just went away!", while the professional knows that the bug is still there and will strike at the weakest moment. In fact, the bug actually appeared while a customer was watching a demo of the software.

Bryan and team decided they would have to write something new to help debug the software.

mdb and v8.so

Historically, we have always looked at core dumps for post-mortem debugging. It’s a very old idea commonly used to debug operating systems, databases, web servers, etc. It is really great because it allows for asynchronous debugging: after a failure, you can restart your system immediately and debug it in parallel.

The problem is that this has not worked well in interpreted environments.

The challenge for Bryan and team was to add support for post-mortem debugging to Node.js. Bryan thought it was basically impossible because it implies being able to reconstruct the VM state. From the bottom of the system (the operating system), it is very difficult to determine what is happening further up the stack. And this is reinforced by the fact that no one had done it satisfactorily so far: not Java, not Python, Ruby, Erlang or PHP. Bryan thought it was an impossible problem to solve. Dave Pacheco proved him wrong.

Among the anecdotes in The Soul of a New Machine by Tracy Kidder, there is one about a college hire joining an engineering team. The senior engineers didn't have time to look after him, so they gave him an impossible problem to solve (a simulator), just to make him kill some time. But he came back after a couple of months saying, "the simulator's done". To their surprise, the senior engineers realized they had never told him it was an impossible problem. He solved it because he didn't know it was impossible.

In the same way, Dave Pacheco solved the problem of visualizing a stack trace for interpreted environments. The result is that now we have the v8.so dmod for mdb that can be used for debugging Node.js programs post-mortem.

Let’s take a look at how it works.

[root@23c5d173-9973-4d7c-8935-46c6-23ef47a6 ~/dmod]# mdb corefile
> ::load v8.so
> ::jstack

After loading v8.so, the stack trace we have seen before looks like this, displaying all the actual JavaScript frames:

[Screenshot: ::jstack output with JavaScript frames pointing into heatmap.js]

Now it is definitely much easier to identify that the source of the problem is inside the heatmap.js file.

But Dave went one step further. With his dmod, we can also take an arbitrary object and see what the actual arguments are, printed out as JSON. Now, if you look at the following output, you will notice something a bit suspicious, considering that the pathology was an infinite loop.

[Screenshot: the object's arguments printed as JSON, with min and max holding the same value]

Note that "min" and "max" have the exact same value. heatmap.js shouldn't be called with such parameter values but, at the same time, the function should handle this situation without generating an infinite loop. Both the caller and the callee were fixed.

This is a concrete example of how to understand a production problem that couldn’t be debugged in any other way than with an effective post-mortem debugging tool.

Memory leaks

"Where is my memory going?" With the broad adoption of Java in the mid '90s we saw the rise of garbage collection problems, and ever since, programmers have watched their programs spend too much time doing GC. But that happens either because of actual garbage collection or because you're not actually generating any garbage. In the second case, it means you've got a semantic leak: a data structure you no longer care about still has a reference somewhere, so the GC can't collect it. You focus on GC as the cause of the problem when it's just a symptom. And it's very easy in JavaScript to keep implicit references that result in heap growth whose origin you can't pinpoint.
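A tiny, made-up example of the kind of implicit reference that causes this (the function and variable names are mine, not from Bryan's talk):

```js
// leak.js - a semantic leak: nobody uses old requests anymore, but a
// long-lived array still references them, so the GC can never reclaim them.
'use strict';

const seen = [];                        // lives for the whole process lifetime

function handleRequest(req) {
  // Meant as lightweight bookkeeping, but it captures the whole request
  // object, so every request ever handled stays reachable and the heap
  // keeps growing while GC appears to achieve nothing.
  seen.push({ at: Date.now(), req });
  return 'handled ' + req.url;
}

// Keeping only what is actually needed (say, req.url) or using a bounded
// structure would let the GC collect the rest.
```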

Walking the memory to find the source of leaks is not an easy task, but Dave helped solve another impossible problem. Bryan and team did it by scanning all of memory, looking for objects that satisfy all the constraints a proper JavaScript object must have.

The result is the ::findjsobjects mdb command, which scans the core file and prints out all the objects that are recognized and that can be visualized by piping their address into ::jsprint.

[Screenshot: ::findjsobjects output]

But to hunt down the source of our memory leak we can go even further and print all the JavaScript objects that match a certain property signature.

> fc431cd1::findjsobjects | ::jsprint

End of part 1. To be continued here.

It’s already happening in Europe

Recently I read an article about Europe being an unfriendly environment for entrepreneurship, and specifically for startups. I liked its underlying optimism about getting a new beginning, but I think it is completely wrong to consider Europe as a whole when legislation and culture are so different from country to country. And it is unfair not to see what Europe has already been doing.

Well then, where exactly will this new beginning start? I've been trying to locate the hot spots for Internet startups on the Old Continent, and I've actually seen much more than the common perception of the scene would suggest.

New technologies are arising, technologies designed specifically for the cloud and for scale. Internet and mobile application frameworks and platforms (like Node.js and MongoDB) are getting more and more popular throughout the entire continent. Just look at the growing number of conferences such as Node Dublin, Node.js Conf in Italy, JSConf EU in Berlin or Railsberry in Krakow. And then notice that they usually take place on weekends, so developers can join out of passion, leaving space for creativity and focusing on real innovation.

Moreover, it's not only about startups. There are Internet companies in Europe that are already at the next stage. They developed a business model. They became profitable. And somebody believed in them, believed in the environment they settled in, and that somebody was eventually proved right. SoundCloud (Germany), Spotify (Sweden), Wonga (UK) and JustEat (Denmark) are just a few worth mentioning.

So is this just the new beginning? No, it is much more than that: it's already happening, and I really want to be there while it happens. I work for Joyent and we run a public cloud (IaaS) that hosts many of the successful Internet companies in the United States. Many of them have chosen Joyent because our technology is designed for those who make money through the Internet, for those who can't afford to lose a single click. Because one click means money.

But I live in Europe, and I want the next success story to be European.

This is what I work for every day. I observe the evolving scene of Internet companies in Europe, supporting conferences (I will be attending Node Dublin, the most important European Node conference, next week) and helping companies drive their business better by hosting their new-generation applications on a new-generation cloud. On top of an infrastructure that runs just as fast as bare metal, because it was built from the ground up with the cloud in mind.

It’s simply so exciting.

Cloud computing is not the evolution of virtualization

Many of you probably think that after the success of virtualization technology, someone had to invent something appealing to keep pushing sales, and they called it Cloud Computing. The same people would think that cloud computing is just an extra layer on top of your virtualization management platform for better, coordinated resource management, providing things like billing, machine catalogues, self-provisioning, etc.

Cloud Computing actually has a much wider meaning (that sometimes makes it simply look like a marketing trend) so today I will narrow it down and focus on cloud infrastructures. The questions I will try to answer are: what is a cloud infrastructure, and when can you say you’re really running your business in the cloud?

To provide the right answers, you have to think of the applications you want to run on your IT infrastructure. Many of you have probably gone through the server consolidation process that made VMware a billion-dollar company: you had lots of unused hardware resources but you still wanted separate operating environments, so, no problem, hardware virtualization could solve that for you without the need to change anything in your application code or architecture. The same application you were running before on bare metal would run in exactly the same way inside a virtual machine.

After server consolidation practices became common, the evolution of hardware virtualization somehow went much faster than the evolution of applications. Hypervisor vendors started to provide more and more features to keep the underlying hardware always available for running applications, so applications could run endlessly without even caring about potential hardware failures.

What people tend to forget when buying powerful hardware platforms is that application failures, far more than hardware failures, are the primary cause of outages. Sooner or later you realize that, and you have to build application-level redundancy in order to implement a truly highly available system. But with application-level redundancy, do you still need expensive underlying hardware? Why not run your application on commodity servers?

This question leads to the real concept of cloud computing. Let's try to give a definition: an infrastructure can be called a cloud if it:

  • is scalable and elastic
  • provides process automation (self-provisioning / self-service / billing)
  • is highly available
  • provides full multi-tenancy

And what is the purpose of all of the above? If you think carefully, you'll realize that it's all aimed at commoditizing the infrastructure itself. Companies shouldn't spend any more time building up their IT foundations; they should concentrate on their actual business workflows, supported by really innovative applications. Infrastructure is something they want to take for granted.

In this scenario, a cloud platform should have another important characteristic: it has to be cheap.

So can you achieve all of that with a traditional hardware virtualization-powered infrastructure? No.

Scalability will be an issue if you’re using centralized resources (that can’t grow big forever) that are usually necessary for providing hardware-level HA.

You will feel safe thanks to all those automatic live machine migration features, but don't forget that they protect you only from hardware failures. If the application fails, there is not much they can do for you. You should protect yourself from application failures by building a redundant application architecture but, if you do so, do you still need expensive hardware-level HA? No, you don't (see the sketch below).
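To make "application-level redundancy" a bit less abstract, here is a minimal sketch using Node's built-in cluster module: several worker processes serve traffic, and a crashed worker is simply replaced. The same principle, applied across cheap machines instead of processes (multiple instances behind a load balancer), is what removes the need for expensive hardware-level HA. This is my own illustration, not a prescription from any particular vendor.

```js
// redundancy.js - application-level redundancy with Node's cluster module.
'use strict';

const cluster = require('cluster');
const http = require('http');
const os = require('os');

if (cluster.isPrimary) {               // cluster.isMaster on older Node versions
  for (let i = 0; i < os.cpus().length; i++) cluster.fork();

  cluster.on('exit', (worker) => {
    console.log('worker ' + worker.process.pid + ' died, starting a new one');
    cluster.fork();                    // self-healing at the application level
  });
} else {
  http.createServer((req, res) => res.end('ok\n')).listen(8080);
}
```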

And one more thing: cost. Hardware virtualization infrastructures require complex, high-end hardware that will never be cheap enough to turn the IT infrastructure into a commodity.

In the end, do you want to run your old legacy application in the cloud? Forget it. Just keep it on your powerful, expensive virtualization platform. That will work just fine. But if you're a visionary who believes in a future that requires performant, scalable, elastic and cheap commodity IT infrastructure, then choose your next applications to be cloud-aware. That will take you much further, much faster.