Big Data

Archive for the ‘Big Data’ Category

MongoDB – Part 1

Posted: November 23, 2014 in Big Data
Tags: BigData, MongoDB, NoSQL, RoboMongo

Introduction

Databases are at the heart of any software application, from small projects to enterprise systems. Choosing a cost-effective, high-performance and scalable database is therefore essential to the success of the project and of the business that depends on it.

MongoDB is a scalable, high performance open source NoSQL Database.

MongoDB stores data using a flexible document data model that is similar to JSON. Documents contain one or more fields, including arrays, binary data and sub-documents. Fields can vary from document to document. This flexibility allows development teams to evolve the data model rapidly as their application requirements change.
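To make the document model a little more concrete, here is a minimal sketch in plain Python. It does not talk to a MongoDB server or use a driver; the collection is simply modeled as a list of dictionaries, and the field names (tags, author, replies) are made up for illustration.

```python
import json

# Two documents in the same hypothetical "messages" collection.
# Their fields differ, which a document store permits.
messages = [
    {
        "name": "Kishore",
        "message": "Hello",
    },
    {
        "name": "Guest",
        "message": "Hi",
        "tags": ["greeting", "demo"],            # array field
        "author": {"country": "IN"},             # sub-document
        "replies": [{"name": "Kishore", "message": "Welcome"}],
    },
]

# Every document serializes to JSON-like text on its own terms.
for doc in messages:
    print(json.dumps(doc, sort_keys=True))

# Fields present in only one of the two documents.
only_in_one = sorted(set(messages[0]) ^ set(messages[1]))
print(only_in_one)
```

The point of the sketch is only that nothing forces the second document to look like the first one.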

Developers access documents through rich drivers available in all popular programming languages. Documents map naturally to the objects in modern languages, which allows developers to be extremely productive. Typically, there’s no need for an ORM layer.

Features

1) Document-Oriented Storage

MongoDB is document oriented: documents look like JSON (JavaScript Object Notation) and are actually stored as BSON (Binary JSON). This facilitates dynamic schemas, which offer simplicity and power.

Data in MongoDB has a flexible schema. Collections do not enforce document structure. This flexibility gives you data-modeling choices to match your application and its performance requirements.

2) Full Index Support

Index on any attribute, just like you’re used to.

3) Replication & High Availability

Mirror across LANs and WANs for scale and peace of mind.

4) Auto-Sharding

Scale horizontally without compromising functionality.

5) Querying

Rich, document-based queries.

6) Map/Reduce

Flexible aggregation and data processing.
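As a rough illustration of the Map/Reduce idea from the last item above, here is a small sketch in plain Python. In MongoDB itself the map and reduce functions are written in JavaScript and run on the server; the helper names below (map_message, reduce_counts, map_reduce) are hypothetical and only mimic the two phases in memory.

```python
from collections import defaultdict

messages = [
    {"name": "Kishore", "message": "Hello"},
    {"name": "Kishore", "message": "How are you?"},
    {"name": "Guest", "message": "Hi"},
]

def map_message(doc):
    """Map step: emit (key, value) pairs for one document."""
    yield doc["name"], 1

def reduce_counts(key, values):
    """Reduce step: combine all values emitted for one key."""
    return sum(values)

def map_reduce(docs, map_fn, reduce_fn):
    """Group the emitted values by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}

counts = map_reduce(messages, map_message, reduce_counts)
print(counts)  # {'Kishore': 2, 'Guest': 1}
```

Here the result is the number of messages per name; swapping in a different pair of map and reduce functions gives a different aggregation over the same documents.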

Relational databases save data in tables and rows, whereas our application code hardly ever works that way. There is a mismatch between the objects in application code and the tables and rows in the database. Hence we end up using an Object Relational Mapping technology such as NHibernate to translate the database structure into application layer objects. This can be avoided by using a NoSQL database like MongoDB.

In MongoDB there is no schema to define, no tables and no relations between collections of objects. Every document we save in MongoDB could be very simple or quite complex based on the needs of the application layer which makes the life of the developer very easy and helps in writing cleaner code with less effort. Also since there is no schema defined for any document saved in the database it can be extended/changed as needed.
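The idea that an application object can be saved more or less as it is, without a mapping layer, can be sketched in plain Python. This is an assumption-laden illustration and not a driver example: the BlogPost and Comment classes are hypothetical, and the standard library's dataclasses.asdict stands in for whatever a real driver would do when it turns an object into a document.

```python
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class Comment:
    author: str
    text: str

@dataclass
class BlogPost:
    title: str
    tags: List[str] = field(default_factory=list)
    comments: List[Comment] = field(default_factory=list)

post = BlogPost(
    title="MongoDB - Part 1",
    tags=["BigData", "MongoDB"],
    comments=[Comment(author="reader", text="Nice post")],
)

# asdict() walks nested dataclasses and lists, producing one nested
# dictionary that has the same shape as the object graph.
document = asdict(post)
print(document)
```

The nested comments travel with the post as a single document, so there is no separate comments table and no join to reassemble the object.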

Why should I go for MongoDB?

1) Data Set is Big & Schema is Not Stable

Adding a new column in an RDBMS can lock the table, or even the entire database in some products, and can cause major load and performance degradation in others. This usually starts to hurt once a table grows beyond about 1 GB. As MongoDB is schema-less, adding a new field does not affect old rows (or documents) and is instant. Another plus is that you do not need a DBA to modify your schema when the application changes.

2) No DBA Assistance

If you don’t have a DBA, and you don’t want to normalize your data and do joins, you should consider MongoDB. MongoDB is great for class persistence, as classes can be serialized to JSON and stored AS IS in MongoDB.

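3) Need High Availability (Cloud and Real Life)

Setting up a replica set (a set of servers in which one acts as the primary and the others as secondaries, much like a master and its slaves) is easy and fast. Moreover, recovery from the failure of a node (or a data center) is automatic and safe, and usually takes only a short time.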

4) You Expect a High Write Load

MongoDB by default prefers a high insert rate over transaction safety. If you need to load huge numbers of records that each carry little business value, MongoDB should be a good fit.
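The first point above, that a new field can be added without touching existing documents, can be sketched in plain Python. This is only an in-memory illustration under simple assumptions: the collection is a list of dictionaries, and the priority field and the priority_of helper are hypothetical.

```python
# Document written by an older version of the application.
messages = [
    {"name": "Kishore", "message": "Hello"},
]

# A newer version starts storing an extra field.
# The older document is not rewritten or migrated.
messages.append({"name": "Guest", "message": "Hi", "priority": "high"})

def priority_of(doc):
    """Read the new field, falling back to a default for older documents."""
    return doc.get("priority", "normal")

priorities = [priority_of(doc) for doc in messages]
print(priorities)  # ['normal', 'high']
```

The cost of this flexibility is that the reading code has to cope with documents that lack the newer field, which is what the default value in priority_of is for.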

Setting up MongoDB Server

Download the latest MongoDB setup from the following location and install it. Builds are available for both 32-bit and 64-bit operating systems.

http://www.mongodb.org/downloads

Once the installation is complete, navigate to the installation folder. You will find a number of executables there, one of which is the Mongo daemon executable (mongod.exe).

Open a command prompt using the Run as administrator option.

The default data directory for MongoDB is \data\db on the drive from which the daemon is started.

Once that folder exists on the file system, we can start the MongoDB server daemon. If we would like to store data in a different location, we first create that folder and then start the daemon with the folder passed through the --dbpath option.

I have created a folder called D:\Database\MongoDB.
Navigate to the folder in the command prompt.
Identify the path of the mongod.exe file on your machine.

Start the MongoDB server daemon, for example by running mongod.exe (using its full path if it is not on the PATH) with --dbpath D:\Database\MongoDB

Note: Press CTRL + C to shut down the server.

To learn about the additional options, type mongod --help | more

You will see options to write log files, set the verbosity level, limit the maximum number of connections, and supply all of these settings through a configuration file.

Let’s create a sample configuration file that holds these options.

Then start the service by passing the path of that file to mongod with the --config option.

If you do not want to start and stop the server daemon by hand, you can instead install it as a Windows service on the machine.

Use mongod's --install option, together with the configuration file, to install it as a Windows service.

Now we can use the net start mongodb and net stop mongodb commands to start and stop it.

Alternatively, use the Services console to start and stop the service as needed.

Test the connection to MongoDB server

Once the server is started, use the mongo.exe shell to connect to it.

Open a command prompt and type mongo.exe

It will connect to the server.

How to insert a document

db.messages.insert({name: 'Kishore', message: 'Hello'})

The collection used in the above command is named messages; MongoDB creates it on the first insert. A collection is the equivalent of a table in RDBMS terminology.

Now type db.messages.find()

It will retrieve all the entries in the collection.

To close the shell type exit in the command prompt
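For readers who want to see the shape of the insert and find calls above without a running server, here is a toy in-memory stand-in written in plain Python. It is not the MongoDB API: the ToyCollection class is hypothetical, it only supports exact-match queries, and it uses a simple counter for _id where MongoDB would generate an ObjectId.

```python
class ToyCollection:
    """In-memory stand-in for a collection; not the MongoDB API."""

    def __init__(self):
        self._docs = []
        self._next_id = 1

    def insert(self, doc):
        """Store a copy of the document, assigning an _id if it has none."""
        stored = dict(doc)
        stored.setdefault("_id", self._next_id)
        self._next_id += 1
        self._docs.append(stored)
        return stored["_id"]

    def find(self, query=None):
        """Return copies of all documents whose fields equal the query's."""
        query = query or {}
        return [
            dict(doc)
            for doc in self._docs
            if all(doc.get(key) == value for key, value in query.items())
        ]

messages = ToyCollection()
messages.insert({"name": "Kishore", "message": "Hello"})
messages.insert({"name": "Guest", "message": "Hi"})

print(messages.find())                     # every document
print(messages.find({"name": "Kishore"}))  # only the matching document
```

Calling find with no argument returns everything, as db.messages.find() does in the shell, while passing a dictionary narrows the result to documents with matching field values.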

RoboMongo

We can also download and install a tool called RoboMongo, which is a GUI-based client for MongoDB.

Hope you have enjoyed this post and in my next post, I will try to explain the concept of MongoDB Replication with examples.

BIG DATA & NO-SQL DATABASES

Posted: November 9, 2014 in Big Data
Tags: Big Data, MongoDB, No-SQL Databases

Of late, I have come across names like Cassandra, MongoDB and CouchDB from my friends who are working with open source technologies.

When I started studying and understanding these new technologies, I realized that these are all non-relational or no-SQL database solutions meant for Big Data Analytics.

So I thought it would be more appropriate for me to understand what Big Data is before I could proceed further with no-SQL databases exploration. In this article I am going to share some of my insights about Big Data and no-SQL databases with you all.

According to a recent study, we create 2.5 quintillion bytes of data every day, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data.

Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured.

Big Data is the term used to describe a massive volume of both structured and unstructured data that is so large that it’s difficult to process using traditional database and software techniques. In most enterprise scenarios the data is too big or it moves too fast or it exceeds current processing capacity.

Big data is a collection of data from traditional and digital sources inside and outside your company that represents a source for ongoing discovery and analysis.

Big data is important to business and society. More data may lead to more accurate analyses. More accurate analyses may lead to more confident decision making. And better decisions can mean greater operational efficiencies, cost reductions and reduced risk.

After the advent of the Internet, a company’s BI and analytics no longer depend solely on business data. As we evolved with the Internet, we started adding weblogs, videos, images, sensor data and third-party application data from services like Facebook and Twitter to our systems. In other words, organizations these days are seeing more unstructured data than structured data, and all of it should now be included in the analysis for decision making.

Popular sites like Facebook, Twitter, YouTube, Instagram and LinkedIn have all exploded in user numbers, and they are the major content producers on the Internet. The number of images uploaded to Facebook or videos uploaded to YouTube is unfathomable.

Data is different today.
80% of enterprise data is unstructured.
Unstructured data is growing 2X faster than structured

3 V’s of BIG data

In 2001, industry analyst Doug Laney (currently with Gartner) articulated the now mainstream definition of big data as the three Vs of big data: volume, velocity and variety.

Volume. Many factors contribute to the increase in data volume. Transaction-based data stored through the years. Unstructured data streaming in from social media. Increasing amounts of sensor and machine-to-machine data being collected. In the past, excessive data volume was a storage issue. But with decreasing storage costs, other issues emerge, including how to determine relevance within large data volumes and how to use analytics to create value from relevant data.

Velocity. Data is streaming in at unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time. Reacting quickly enough to deal with data velocity is a challenge for most organizations.

Variety. Data today comes in all types of formats. Structured, numeric data in traditional databases. Information created from line-of-business applications. Unstructured text documents, email, video, audio, stock ticker data and financial transactions. Managing, merging and governing different varieties of data is something many organizations still grapple with.

Why Should Big Data Matter to Any Organization?

The real issue is not about acquiring large amounts of data. It’s what we do with the data that counts. The hopeful vision is that organizations will be able to take data from any source, harness relevant data and analyze it to find answers that enable cost reductions, time reductions, new product development and optimized offerings.

For example, 80% of medical data is unstructured yet clinically relevant. Data resides in multiple places such as individual EMRs, lab and imaging systems, physician notes, medical correspondence and claims. By leveraging Big Data, we can build sustainable healthcare systems and collaborate to improve care and outcomes.

For instance, by combining big data and high-powered analytics, it is possible to:

Determine root causes of failures, issues and defects in near-real time, potentially saving billions of dollars annually.
Optimize routes for many thousands of package delivery vehicles while they are on the road.
Generate retail coupons at the point of sale based on the customer’s current and past purchases.
Send tailored recommendations to mobile devices while customers are in the right area to take advantage of offers.
Recalculate entire risk portfolios in minutes.
Quickly identify customers who matter the most.
Use data mining to detect fraudulent behavior.

Big Data is gaining popularity since storage is becoming cheaper and cheaper.

NOSQL DATABASES

NoSQL technology was pioneered by leading internet companies, including Google, Facebook, Amazon, and LinkedIn, to overcome the limitations of 40-year-old relational database technology for use with modern web applications. Today, enterprises are adopting NoSQL for a growing number of use cases, a choice that is driven by four interrelated megatrends: Big Users, Big Data, the Internet of Things, and Cloud Computing.

There are many NOSQL databases which are gaining momentum in the context of Big Data.

Big Data storage with key-value databases:

Azure Table Storage
Redis
MemcacheDB
HamsterDB
DynamoDB

Big Data storage with column-family store databases:

HBase
Cassandra
Amazon SimpleDB

Big Data storage with document store databases:

MongoDB
CouchDB

Being a NoSQL database does not mean lacking query capabilities. NoSQL databases typically offer features such as:

Query Language
Fast Performance
Horizontal Scalability
Replication
Load Balance
File Storage
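Horizontal scalability, one of the items listed above, usually rests on splitting the data across several servers by some key. The following is a minimal sketch of hash-based partitioning in plain Python, not the mechanism of any particular database: the shard names and the shard_for helper are hypothetical, and real systems typically use key ranges, chunks or consistent hashing so that adding a server does not move most of the data.

```python
import hashlib

SHARDS = ["shard-a", "shard-b", "shard-c"]  # hypothetical server names

def shard_for(key, shards=SHARDS):
    """Pick a shard from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]

# Route a few user names to their shards.
placement = {}
for user in ["kishore", "guest", "admin", "reader"]:
    placement.setdefault(shard_for(user), []).append(user)

print(placement)
```

Because the hash is deterministic, the same key is always routed to the same shard, so reads can find what writes stored without consulting every server.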

However we will miss some of the features which we are very much used to in any RDBMS database.

No Joins support
No support for Transactions
No support for constraints

If your business needs do not allow for an ideal database schema with a pre-defined list of tables, and the schema has to change with the ever-growing needs of the users, then you should look at NoSQL databases.

I have started exploring the very interesting features of MongoDB, which is a scalable, open source, high-performance, document-oriented database.

Developers access documents through rich, idiomatic drivers available in all popular programming languages. Documents map naturally to the objects in modern languages, which allows developers to be extremely productive. Typically, there’s no need for an ORM layer.

Stay tuned for my next post on MongoDB features, installation and usage details.

Kishore Borra
