Logo F2FInterview

HBase Interview Questions

Q   |   QA
‹‹ previous12

HBase is a type of "NoSQL" database. "NoSQL" is a general term meaning that the database isn't an RDBMS which supports SQL as it's primary access language, but there are many types of NoSQL databases: BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a distributed database. Technically speaking, HBase is really more a "Data Store" than "Data Base" because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages, etc.

However, HBase has many features which supports both linear and modular scaling. HBase clusters expand by adding RegionServers that are hosted on commodity class servers. If a cluster expands from 10 to 20 RegionServers, for example, it doubles both in terms of storage and as well as processing capacity. RDBMS can scale well, but only up to a point - specifically, the size of a single database server - and for the best performance requires specialized hardware and storage devices. HBase features of note are:

  •     Strongly consistent reads/writes: HBase is not an "eventually consistent" DataStore. This makes it very suitable for tasks such as high-speed counter aggregation.
  •     Automatic sharding: HBase tables are distributed on the cluster via regions, and regions are automatically split and re-distributed as your data grows.
  •     Automatic RegionServer failover
  •     Hadoop/HDFS Integration: HBase supports HDFS out of the box as it's distributed file system.
  •     MapReduce: HBase supports massively parallelized processing via MapReduce for using HBase as both source and sink.
  •     Java Client API: HBase supports an easy to use Java API for programmatic access.
  •     Thrift/REST API: HBase also supports Thrift and REST for non-Java front-ends.
  •     Block Cache and Bloom Filters: HBase supports a Block Cache and Bloom Filters for high volume query optimization.
  •     Operational Management: HBase provides build-in web-pages for operational insight as well as JMX metrics.

HBase isn't suitable for every problem.

First, make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand/million rows, then using a traditional RDBMS might be a better choice due to the fact that all of your data might wind up on a single node (or two) and the rest of the cluster may be sitting idle.

Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.) An application built against an RDBMS cannot be "ported" to HBase by simply changing a JDBC driver, for example. Consider moving from an RDBMS to HBase as a complete redesign as opposed to a port.

Third, make sure you have enough hardware. Even HDFS doesn't do well with anything less than 5 DataNodes (due to things such as HDFS block replication which has a default of 3), plus a NameNode.

HBase can run quite well stand-alone on a laptop - but this should be considered a development configuration only. 

HDFS is a distributed file system that is well suited for the storage of large files. It's documentation states that it is not, however, a general purpose file system, and does not provide fast individual record lookups in files. HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. This can sometimes be a point of conceptual confusion. HBase internally puts your data in indexed "StoreFiles" that exist on HDFS for high-speed lookups.

Not really. SQL-ish support for HBase via Hive is in development, however Hive is based on MapReduce which is not generally suitable for low-latency requests.

There's a very big difference between storage of relational/row-oriented databases and column-oriented databases. For example, if I have a table of 'users' and I need to store friendships between these users... In a relational database my design is something like:

Table: users(pkey = userid) Table: friendships(userid,friendid,...) which contains one (or maybe two depending on how it's impelemented) row for each friendship.

In order to lookup a given users friend, SELECT * FROM friendships WHERE userid = 'myid';

The cost of this relational query continues to increase as a user adds more friends. You also begin to have practical limits. If I have millions of users, each with many thousands of potential friends, the size of these indexes grow exponentially and things get nasty quickly. Rather than friendships, imagine I'm storing activity logs of actions taken by users.

In a column-oriented database these things scale continuously with minimal difference between 10 users and 10,000,000 users, 10 friendships and 10,000 friendships.

Rather than a friendships table, you could just have a friendships column family in the users table. Each column in that family would contain the ID of a friend. The value could store anything else you would have stored in the friendships table in the relational model. As column families are stored together/sequentially on a per-row basis, reading a user with 1 friend versus a user with 10,000 friends is virtually the same. The biggest difference is just in the shipping of this information across the network which is unavoidable. In this system a user could have 10,000,000 friends. In a relational database the size of the friendship table would grow massively and the indexes would be out of control. 

‹‹ previous12

In order to link this F2FInterview's page as Reference on your website or Blog, click on below text area and pres (CTRL-C) to copy the code in clipboard or right click then copy the following lines after that paste into your website or Blog.

Get Reference Link To This Page: (copy below code by (CTRL-C) and paste into your website or Blog)
HTML Rendering of above code: