Logo F2FInterview

Cassandra Interview Questions

Q   |   QA
‹‹ previous12345678910 ... 1920

Apache Cassandra is a free, open source, distributed data storage system that differs sharply from relational database management systems.

Cassandra first started as an incubation project at Apache in January of 2009. Shortly thereafter, the committers, led by Apache Cassandra Project Chair Jonathan Ellis, released version 0.3 of Cassandra, and have steadily made minor releases since that time. Though as of this writing it has not yet reached a 1.0 release, Cassandra is being used in production by some of the biggest properties on the Web, including Facebook, Twitter, Cisco, Rackspace, Digg, Cloudkick, Reddit, and more.

Cassandra has become so popular because of its outstanding technical features. It is durable, seamlessly scalable, and tuneably consistent. It performs blazingly fast writes, can store hundreds of terabytes of data, and is decentralized and symmetrical so there’s no single point of failure. It is highly available and offers a schema-free data model.

The Cassandra data store is an open source Apache project available at http://cassandra.apache.org. Cassandra originated at Facebook in 2007 to solve that company’s inbox search problem, in which they had to deal with large volumes of data in a way that was difficult to scale with traditional methods. Specifically, the team had requirements to handle huge volumes of data in the form of message copies, reverse indices of messages, and many random reads and many simultaneous random writes.

The team was led by Jeff Hammerbacher, with Avinash Lakshman, Karthik Ranganathan, and Facebook engineer on the Search Team Prashant Malik as key engineers. The code was released as an open source Google Code project in July 2008. During its tenure as a Google Code project in 2008, the code was updateable only by Facebook engineers, and little community was built around it as a result. So in March 2009 it was moved to an Apache Incubator project, and on February 17, 2010 it was voted into a top-level project.

Cassandra today presents a kind of paradox: it feels new and radical, and yet it’s solidly rooted in many standard, traditional computer science concepts and maxims that successful predecessors have already institutionalized. Cassandra is a realist’s kind of database; it doesn’t depart from the relational model to be a fun art project or experiment for smart developers. It was created specifically to solve a real-world problem that existing tools weren’t able to solve. It acknowledges the limitations of prior methods and faces our new world of big data head-on.

I’m a little surprised how often people ask me where the database got its name. It’s not the first thing I think of when I hear about a project. But it is interesting, and in the case
of this database, it’s felicitously meaningful.

In Greek mythology, Cassandra was the daughter of King Priam and Queen Hecuba of Troy. Cassandra was so beautiful that the god Apollo gave her the ability to see the future. But when she refused his amorous advances, he cursed her such that she would still be able to accurately predict everything that would happen—but no one would believe her. Cassandra foresaw the destruction of her city of Troy, but was powerless to stop it. The Cassandra distributed database is named for her. I speculate that it is also named as kind of a joke on the Oracle at Delphi, another seer for whom a database is named.

Large Deployments
You probably don’t drive a semi truck to pick up your dry cleaning; semis aren’t well suited for that sort of task. Lots of careful engineering has gone into Cassandra’s high availability, tuneable consistency, peer-to-peer protocol, and seamless scaling, which are its main selling points. None of these qualities is even meaningful in a single-node deployment, let alone allowed to realize its full potential.

There are, however, a wide variety of situations where a single-node relational database is all we may need. So do some measuring. Consider your expected traffic, throughput needs, and SLAs. There are no hard and fast rules here, but if you expect that you can reliably serve traffic with an acceptable level of performance with just a few relational databases, it might be a better choice to do so, simply because RDBMS are easier to run on a single machine and are more familiar.

If you think you’ll need at least several nodes to support your efforts, however, Cassandra might be a good fit. If your application is expected to require dozens of nodes, Cassandra might be a great fit.

Lots of Writes, Statistics, and Analysis
Consider your application from the perspective of the ratio of reads to writes. Cassandra is optimized for excellent throughput on writes.

Many of the early production deployments of Cassandra involve storing user activity updates, social network usage, recommendations/reviews, and application statistics. These are strong use cases for Cassandra because they involve lots of writing with less predictable read operations, and because updates can occur unevenly with sudden spikes. In fact, the ability to handle application workloads that require high performance at significant write volumes with many concurrent client threads is one of the primary features of Cassandra.

According to the project wiki, Cassandra has been used to create a variety of applications, including a windowed time-series store, an inverted index for document searching, and a distributed job priority queue.

Geographical Distribution
Cassandra has out-of-the-box support for geographical distribution of data. You can easily configure Cassandra to replicate data across multiple data centers. If you have a globally deployed application that could see a performance benefit from putting the data near the user, Cassandra could be a great fit.

Evolving Applications
If your application is evolving rapidly and you’re in “startup mode,” Cassandra might be a good fit given its schema-free data model. This makes it easy to keep your database in step with application changes as you rapidly deploy.

Cassandra is still in its early stages in many ways, not yet seeing its 1.0 release at the time of this writing. There are few easy, graphical tools to help manage it, and the community has not settled on certain key internal and external design questions that have been revisited. But what does it say about the promise, usefulness, and stability of a data store that even in its early stages is being used in production by many large, well-known companies?

The list of companies using Cassandra is growing. These companies include:

  • Twitter is using Cassandra for analytics. In a much-publicized blog post (at http://engineering.twitter.com/2010/07/cassandra-at-twitter-today.html), Twitter’s primary Cassandra engineer, Ryan King, explained that Twitter had decided against using Cassandra as its primary store for tweets, as originally planned, but would instead use it in production for several different things: for real-time analytics, for geolocation and places of interest data, and for data mining over the entire user store.
  • Mahalo uses it for its primary near-time data store.
  • Facebook still uses it for inbox search, though they are using a proprietary fork.
  • Digg uses it for its primary near-time data store.
  • Rackspace uses it for its cloud service, monitoring, and logging.
  • Reddit uses it as a persistent cache.
  • Cloudkick uses it for monitoring statistics and analytics.
  • Ooyala uses it to store and serve near real-time video analytics data.
  • SimpleGeo uses it as the main data store for its real-time location infrastructure.
  • Onespot uses it for a subset of its main data store.

Cassandra is also being used by Cisco and Platform64, and is starting to see use at Comcast and bee.tv for personalized television streaming to the Web and to mobile devices. There are others. The bottom line is that the uses are real. A wide variety of companies are finding use cases for Cassandra and seeing success with it. As of this writing, the largest known Cassandra installation is at Facebook, where they have more than 150TB of data on more than 100 machines. Many more companies are currently evaluating Cassandra for production use in different projects, and a services company called Riptano, cofounded by Jonathan Ellis, the Apache Project Chair for Cassandra, was started in April of 2010. As more features are added and better tooling and support options are rolled out, anticipate even broader adoption.

‹‹ previous12345678910 ... 1920

In order to link this F2FInterview's page as Reference on your website or Blog, click on below text area and pres (CTRL-C) to copy the code in clipboard or right click then copy the following lines after that paste into your website or Blog.

Get Reference Link To This Page: (copy below code by (CTRL-C) and paste into your website or Blog)
HTML Rendering of above code: