MongoDB – Sharding


What is Sharding:

If you head over to the Wiki page then you’ll see that their definition for sharding is as follows:

“Database sharding is a method of horizontal partitioning in a database or search engine. Each individual partition is referred to as a shard or database shard.”

What does this mean for us?  It means that your data can be split across multiple database instances, which makes each collection smaller, which makes your indexes smaller, which in turn makes your Create, Read, Update, Delete (CRUD) operations faster.  In the context of MongoDB, the developer sets up shard keys, which are used to distribute the data across multiple shards.  MongoDB also offers a Master/Slave configuration, where there is only one Master and multiple Slaves.  This is for efficiency: you can set up your code to always write to the Master while all your reads are done against the Slaves.
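
To make the idea concrete, here is a minimal sketch in plain JavaScript (not the MongoDB API) of how a shard key can decide where a document lives.  The shard names, the split point of 1000 and the "customerId" field are all assumptions made up for illustration; a real cluster chooses and moves its own ranges.

```javascript
// Toy model of range-based sharding: each shard owns a contiguous
// range [min, max) of shard-key values. Names and ranges are hypothetical.
const chunks = [
  { shard: "shardA", min: -Infinity, max: 1000 }, // keys below 1000
  { shard: "shardB", min: 1000, max: Infinity },  // keys 1000 and above
];

// Return the name of the shard whose range contains the key.
function shardFor(key) {
  const chunk = chunks.find((c) => key >= c.min && key < c.max);
  return chunk.shard;
}

// Group documents by the shard their key value maps to.
function distribute(docs, keyField) {
  const placement = {};
  for (const doc of docs) {
    const shard = shardFor(doc[keyField]);
    (placement[shard] = placement[shard] || []).push(doc);
  }
  return placement;
}

const docs = [
  { customerId: 1 },
  { customerId: 999 },
  { customerId: 1000 },
  { customerId: 4200 },
];
const placement = distribute(docs, "customerId");
console.log(placement);
```

Each shard ends up holding only part of the data, which is why the collections and indexes on any one instance stay smaller.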

Why use Sharding

Sharding offers you the following features (from the MongoDB site):

  • Scaling out to thousands of nodes
  • Easy addition of new machines
  • Automatic balancing for changes in load and data distribution
  • Zero single points of failure
  • Automatic failover

All this and much more can be gleaned from the MongoDB web site here.

How do I create a Shard

Now let’s move on to how you would go about creating a sharding setup.  You would want two servers to start with; Mongo allows you to add more as and when you need them.  This section gives you the short commands that you should run from the Command Prompt in Windows.  I will then discuss the various parameters of each command and why they’re important to know about.  Once again, all this information can be found on the Mongo site, but if you’re new to Mongo like myself and you just want a quick overview and setup so you can start playing with this “new” technology, then this guide will help you get up and running quickly.  Please follow the steps on my Introduction page to get MongoDB downloaded and running before attempting this step in the process.

To set this up do as follows:

  1. Create a batch file named RunServer10000.bat in the root of your MongoDB installation’s bin folder.
  2. Add the following lines to the batch file:
    mkdir "data/localhost10000"
    mongod --rest --shardsvr --port 10000 --dbpath data/localhost10000 --logpath data/localhost10000/log.txt
    PAUSE
  3. Create another batch file named RunServer10001.bat in the root of your MongoDB installation’s bin folder.
  4. Add the following lines to the batch file:
    mkdir "data/localhost10001"
    mongod --rest --shardsvr --port 10001 --dbpath data/localhost10001 --logpath data/localhost10001/log.txt
    PAUSE
  5. Run RunServer10000.bat
  6. Run RunServer10001.bat

That’s it.  Now you should have two shard server instances running.  In the next section I’ll discuss how to configure the collections in the shards.  For now, let’s go through what we’ve just done.

If you open up the batch file called RunServer10000.bat in Notepad, then you’ll see three lines explained as follows:

  1. mkdir "data/localhost10000"
    If you’re not familiar with Command Line scripts then the simple answer to this one is that it creates the directory structure “data/localhost10000” for you under your default Mongo bin directory.
  2. mongod --rest --shardsvr --port 10000 --dbpath data/localhost10000 --logpath data/localhost10000/log.txt
    This has a couple of bits to take into consideration as follows:

    1. mongod
      This runs the Mongo Database instance.
    2. --rest
      This enables MongoDB’s REST interface, which allows you to open the built-in web admin UI, usually found at http://localhost:11000/ if you run MongoDB on port 10000 (the port number plus 1000).  You should be able to pick up the port number for the admin UI in the log file, which I’ll explain in point 6 below.
    3. --shardsvr
      This enables sharding on this specific instance, which I’ll explain in more detail later.
    4. --port 10000
      Specifies the port on which this specific instance of mongod.exe should run.  This is the port you’ll use to connect to this instance.
    5. --dbpath data/localhost10000
      This specifies where to put our database.  If you don’t add this parameter Mongo will try to create the database in the default location of “c:\data\db” and will not start if that directory structure does not exist.  Personally I don’t like things on my C drive, which is why I’m specifying this parameter and ensuring that the directory structure exists by adding it to my batch file as specified above.
    6. --logpath data/localhost10000/log.txt
      This specifies where Mongo should write its log files for this specific instance.  Personally I like having my log file and my database files in the same directory if I’m running in development/test mode, because it keeps things nicely together.  In Production I would split the logging off to its own drive for speed and backup purposes.
  3. PAUSE
    This is only in the batch file because I want to see if something went wrong that was not written to the Mongo log file.  You can remove it.
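
Since every shard server batch file follows the same pattern, the lines can be derived from the port number alone.  The following is a minimal sketch in plain JavaScript; the helper names "batchFileLines" and "adminUiUrl" are hypothetical and exist only for this illustration, and the flags are just the ones described above.

```javascript
// Hypothetical helper: produce the three lines of a RunServer<port>.bat file,
// following the naming pattern used in this guide (data/localhost<port>).
function batchFileLines(port) {
  const dir = `data/localhost${port}`;
  return [
    `mkdir "${dir}"`,
    `mongod --rest --shardsvr --port ${port} --dbpath ${dir} --logpath ${dir}/log.txt`,
    "PAUSE",
  ];
}

// Hypothetical helper: the web admin UI sits 1000 ports above the mongod port,
// as in the example above (mongod on 10000, admin UI on 11000).
function adminUiUrl(port) {
  return `http://localhost:${port + 1000}/`;
}

for (const port of [10000, 10001]) {
  console.log(`RunServer${port}.bat`);
  console.log(batchFileLines(port).join("\n"));
  console.log(`admin UI: ${adminUiUrl(port)}\n`);
}
```

Adding another shard server later then only means picking another free port.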

If you would like to get more information you can go here for further reading.  For this example though I’m not going to go into more detail seeing that I’m trying to explain things in as simple a manner as possible to get you up and running as fast as possible.

How do I set up a Configuration Server

A configuration database instance is where you’ll be saving the configuration for your shards.  To set this up do as follows:

  1. Create a batch file named RunConfigServer10002.bat in the root of your MongoDB installation’s bin folder.
  2. Add the following lines to the batch file:
    mkdir "data/localhost10002"
    mongod --rest --configsvr --port 10002 --dbpath data/localhost10002 --logpath data/localhost10002/log.txt
    PAUSE
  3. Run RunConfigServer10002.bat

The only settings that change when we’re running a Configuration database instance are the “--configsvr” parameter, which gets added, and the “--shardsvr” parameter, which falls away.  The “--configsvr” parameter lets the mongod process know that this instance will be used for configuration.

How do I set up a Routing Service

A routing service is the service which enables our applications to connect to the sharded cluster.  On startup it reads the configuration in the configuration database to understand how the underlying shards are configured.  To set this up do as follows:

  1. Create a batch file named RunRoutingService.bat in the root of your MongoDB installation’s bin folder.
  2. Add the following lines to the batch file:
    mongos --port 10003 --configdb localhost:10002 > tmp/RunRoutingService.txt
    PAUSE
  3. Run RunRoutingService.bat

There are a few new parameters to note here as follows:

  1. mongos --port 10003 --configdb localhost:10002 > tmp/RunRoutingService.txt
    This has a couple of bits to take into consideration as follows:

    1. mongos
      This is the routing process that we’ll be connecting to from our application.  This service will potentially be installed on every AppServer.
    2. --port 10003
      This specifies the port on which the routing service will be running.
    3. --configdb localhost:10002
      This parameter tells the routing process which Configuration Server to connect to for its configuration.
    4. > tmp/RunRoutingService.txt
      This redirects the routing service’s console output to a file, which serves as its log.  The “tmp” directory must already exist, because the redirect will not create it.

Now that you have your shards up and running, your configuration database has been set up and your routing service is connected to it you can start configuring collections and databases for sharding.
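
With four processes on four ports it is easy to mix something up, so here is a rough sketch, in plain JavaScript, of the topology built so far together with a small sanity check.  The "cluster" structure, the "configPort" field and the "validate" function are assumptions invented for this illustration; they are not part of MongoDB.

```javascript
// The processes started in this guide, described as plain data.
const cluster = [
  { role: "shard", exe: "mongod", port: 10000 },
  { role: "shard", exe: "mongod", port: 10001 },
  { role: "config", exe: "mongod", port: 10002 },
  { role: "router", exe: "mongos", port: 10003, configPort: 10002 },
];

// Report duplicate ports and routers that point at something
// other than a configuration server.
function validate(processes) {
  const problems = [];
  const seen = new Set();
  for (const p of processes) {
    if (seen.has(p.port)) problems.push(`port ${p.port} is used twice`);
    seen.add(p.port);
  }
  const configPorts = processes
    .filter((p) => p.role === "config")
    .map((p) => p.port);
  for (const router of processes.filter((p) => p.role === "router")) {
    if (!configPorts.includes(router.configPort)) {
      problems.push(
        `router on port ${router.port} points at port ${router.configPort}, which is not a config server`
      );
    }
  }
  return problems;
}

console.log(validate(cluster)); // an empty list means the layout is consistent
```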

How do I configure collections for Sharding

Now that we have our two “servers” named “localhost:10000” and “localhost:10001”, as well as our configuration server and our routing service, up and running, we should configure them so they can start talking to each other.  To make running the MongoDB console against the routing service a bit easier, do as follows:

  1. Create a batch file named RunMongoDBConsoleRoutingService.bat in the root of your MongoDB installation’s bin folder.
  2. Add the following lines to the batch file:
    mongo.exe localhost:10003
    PAUSE

Now that you have that let me quickly explain what it means.

  1. mongo.exe
    This is the application which gives you the MongoDB Console where you can do all kinds of funky things which you can read about on Mongo’s site.
  2. localhost:10003
    This specifies which instance you want to connect to.  In our case we want to connect to the routing service which is running on port 10003.

Once you have this up and running you should see something like:

MongoDB shell version: 1.6.2
connecting to: localhost:10003/test
>

Now that you have the console running you can type in the following:

  1. use admin
    This will switch you to the admin database.
  2. db.runCommand( { addshard : "localhost:10000", name : "shard10000" } );
    This will add the server “localhost:10000” to our cluster.  The name that we specify is an optional parameter which you can leave out.  MongoDB will assign a name automatically, but I like naming my shards, because then I can name them something that makes sense to me.
  3. db.runCommand( { addshard : "localhost:10001", name : "shard10001" } );
    This will add our second server to the shard configuration.
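
The two addshard commands differ only in the port, so the command documents can be built in a loop.  This is a small sketch in plain JavaScript; the "addShardCommand" helper is hypothetical, and the "allowLocal" option is taken from the reader comment at the end of this post rather than from the steps above.

```javascript
// Hypothetical helper: build the document passed to db.runCommand(...)
// when adding a shard. "name" and "allowLocal" are optional.
function addShardCommand(host, port, options = {}) {
  const command = { addshard: `${host}:${port}` };
  if (options.name !== undefined) command.name = options.name;
  if (options.allowLocal) command.allowLocal = true;
  return command;
}

// Build one command per shard server started earlier in this guide.
const commands = [10000, 10001].map((port) =>
  addShardCommand("localhost", port, { name: `shard${port}` })
);

for (const command of commands) {
  // In the MongoDB console you would pass this object to db.runCommand(...)
  // while on the admin database.
  console.log(JSON.stringify(command));
}
```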

Now that we’ve added the two servers to our sharding configuration we have to specify what we want to shard.  There are two options when it comes to our data: 1) we can shard whole databases; 2) we can shard only certain collections.  I’m not going to go into too much detail and will jump straight into the how.

To shard a database you can run the following command:

db.runCommand( { enablesharding : "<dbname>" } );

If you do sharding on a database level then the routing service will automatically place the collections of the database on different shards.  This is all handled by the routing service and you will never know where your collections are stored.  In this scenario each collection will only exist on one shard at a time and could be moved around without you knowing about it.

To shard a collection you can run the following command:

db.runCommand( { shardcollection : "<namespace>", key : <shardkeypatternobject> } );

In this scenario you will be splitting the collections based on the key pattern that you’ve specified.  I’ll explain how this works in a location based scenario a bit later on.  For now, you need to know that while MongoDB allows you to specify keys to split the underlying collection in the database, it does not allow you to specify which shard to write to.  So, if you set this up MongoDB will decide where to place the data.  There is a way to force this to happen here, but that doesn’t mean that MongoDB will keep that configuration.
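
As a rough illustration of what goes into that command, the sketch below builds the command document in plain JavaScript.  The helper name "shardCollectionCommand" and the example names "MyDatabase", "MyCollection" and "Region" are hypothetical; only the "shardcollection" and "key" fields come from the command used in this guide.

```javascript
// Hypothetical helper: build the document for the shardcollection command.
// The namespace is "<database>.<collection>" and the key is a pattern
// object such as { Region: 1 }.
function shardCollectionCommand(database, collection, key) {
  if (!database || !collection) {
    throw new Error("both a database and a collection name are required");
  }
  if (Object.keys(key).length === 0) {
    throw new Error("the key pattern needs at least one field");
  }
  return { shardcollection: `${database}.${collection}`, key };
}

const command = shardCollectionCommand("MyDatabase", "MyCollection", { Region: 1 });
console.log(JSON.stringify(command));
```

Note that the command only names the key; nothing in it says which shard a given value should land on, which matches the point above that MongoDB decides the placement.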

How do I add a new shard server to the configuration:

Now that you have your cluster of two servers up and running you may need to add another server into the mix.  This is extremely simple as follows:

  1. Create a batch file named RunServer10004.bat in the root of your MongoDB installation’s bin folder.
  2. Add the following lines to the batch file:
    mkdir "data/localhost10004"
    mongod --rest --shardsvr --port 10004 --dbpath data/localhost10004 --logpath data/localhost10004/log.txt
    PAUSE
  3. Run RunServer10004.bat

That should run your new sharding server instance.  Now to configure it, all you have to do is run the batch file you created in the “How do I configure collections for Sharding” section above, called RunMongoDBConsoleRoutingService.bat.  This will start the MongoDB console connected to our routing service, where you should then run the following simple command:

db.runCommand( { addshard : "localhost:10004", name : "shard10004" } );

This will configure this new server for you in the cluster and the routing service will start adding collections to this new shard.
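
To get a feel for what happens once the new shard joins, here is a deliberately simplified toy model in plain JavaScript.  It is not MongoDB’s actual balancing algorithm: the "rebalance" function and the starting counts of six units of data per existing shard are assumptions made up for this sketch.  It simply keeps moving one unit from the fullest shard to the emptiest one until they differ by at most one.

```javascript
// Toy balancer: move one unit of data at a time from the fullest shard
// to the emptiest shard until the counts differ by at most one.
function rebalance(initialCounts) {
  const counts = { ...initialCounts };
  const moves = [];
  for (;;) {
    const names = Object.keys(counts);
    const most = names.reduce((a, b) => (counts[b] > counts[a] ? b : a));
    const least = names.reduce((a, b) => (counts[b] < counts[a] ? b : a));
    if (counts[most] - counts[least] <= 1) break;
    counts[most] -= 1;
    counts[least] += 1;
    moves.push({ from: most, to: least });
  }
  return { counts, moves };
}

// Two existing shards with data, plus the freshly added empty shard.
const result = rebalance({ shard10000: 6, shard10001: 6, shard10004: 0 });
console.log(result.counts);
console.log(`${result.moves.length} moves`);
```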

How to configure a shardkeypatternobject

In this scenario I will be using the following Object Model to demonstrate how you can use a shardkeypatternobject to do location based sharding on your cluster.  This means that data for South Africa will be available on one shard and data for the UK will be available on a different shard.  I’ll explain how you can query these two shards separately and how you can have a global view of data through the routing service.  After everything is done, I’ll explain how you can move a collection from one shard to another one, but do remember that this can change based on what the routing service decides to do.

If our object model is a “Customer” class with the properties “MongoId”, “CustomerId”, “FirstName” and “Country”, then we would want to do sharding based on the “Country” property.  Because we would like location based sharding, we create the key using the “Country” property.  In this scenario my database’s name will be “Customers” and my collection will also be “Customers”, which makes my namespace “Customers.Customers”.  So, if you run the MongoDB console against our routing service by running the “RunMongoDBConsoleRoutingService.bat” file, then we can do the following steps:

  1. We’re going to set variables equal to our three databases as follows:
    db = db.getSisterDB( "Customers" );
    config = db.getSisterDB( "config" );
    admin = db.getSisterDB( "admin" );
  2. Once we have our databases in variables we insert a few documents as follows:
    db.Customers.save({"MongoId":"7ce9ce33-1b8b-4afa-bd54-7e0547ddb6d5","CustomerId":1,"FirstName":"John","Country":"South Africa"});
    db.Customers.save({"MongoId":"7ce9ce33-1b8b-4afa-bd54-7e0547ddb6d6","CustomerId":2,"FirstName":"Mary","Country":"England"});
    db.Customers.save({"MongoId":"7ce9ce33-1b8b-4afa-bd54-7e0547ddb6d7","CustomerId":3,"FirstName":"Mike","Country":"South Africa"});
  3. Now that we have some data in our “Customers” database, we’ll enable the database for sharding.  This command has to be run against the admin database, like so:
    admin.runCommand( { enablesharding : "Customers" } );
  4. MongoDB expects an index on the property that you would like to use in your key.  This is an optional step: if you don’t define an index, then MongoDB will do it for you.  You can create it as follows:
    db.Customers.ensureIndex( { "Country" : 1 } );
  5. Now that we have some data and our database “Customers” is enabled for sharding, we create the key as follows:
    admin.runCommand( { shardcollection : "Customers.Customers" , key : { Country : 1 } } );

That’s it.  Now you have a key on which Sharding will happen.  You can have a look here on the MongoDB web site if you are looking for more information.
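
To picture what the key does with the three customers, here is a sketch in plain JavaScript rather than the MongoDB console.  The split point "M" and the assignment of ranges to “shard10000” and “shard10001” are assumptions for illustration only, since MongoDB decides the real placement itself.  The sketch also shows why the key matters for queries: a query that includes “Country” can be sent to a single shard, while any other query has to go to all shards and be merged by the routing service.

```javascript
const customers = [
  { CustomerId: 1, FirstName: "John", Country: "South Africa" },
  { CustomerId: 2, FirstName: "Mary", Country: "England" },
  { CustomerId: 3, FirstName: "Mike", Country: "South Africa" },
];

// Hypothetical ranges: countries sorting before "M" on one shard,
// everything else on the other.
const chunks = [
  { shard: "shard10000", contains: (country) => country < "M" },
  { shard: "shard10001", contains: (country) => country >= "M" },
];

// Find the shard that owns a given Country value.
function shardFor(country) {
  return chunks.find((chunk) => chunk.contains(country)).shard;
}

// A query on the shard key targets one shard; any other query
// must be sent to every shard.
function shardsForQuery(query) {
  if (typeof query.Country === "string") return [shardFor(query.Country)];
  return chunks.map((chunk) => chunk.shard);
}

for (const customer of customers) {
  console.log(`${customer.FirstName} -> ${shardFor(customer.Country)}`);
}
console.log(shardsForQuery({ Country: "England" }));
console.log(shardsForQuery({ FirstName: "Mike" }));
```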

Conclusion

The only drawback that I can see with sharding in MongoDB is that you can’t specify where the data should be placed.  You can force it, but this may be changed as the routing service sees fit and the collections could be moved.  What is really nice is that the routing service knows where to go fetch the data and how to aggregate it all back to your application.  What you have to take into consideration with MongoDB is that it was built with fail-over and high availability in mind.  Sharding in combination with Replica Sets gives you an extremely powerful toolset to play with.  I’ll be covering these two toolsets in combination in an upcoming article.

  1. December 2, 2011 at 14:38

    Great job !!!

    The only hurdle I faced while setting up a shard on my localhost is that we should add allowLocal to the addshard command, else Mongo ignores sharding on local:

    db.runCommand( { addshard : "localhost:10000", name : "shard10000", allowLocal : true } );
