Of late, I have come across names like Cassandra, MongoDB, CouchDB etc. from my friends who are working with open source technologies.
When I started studying these technologies, I realized that they are all non-relational, or NoSQL, database solutions commonly used for Big Data analytics.
So I thought it would be more appropriate to understand what Big Data is before exploring NoSQL databases any further. In this article I am going to share some of my insights about Big Data and NoSQL databases with you all.
According to one recent study, we create 2.5 quintillion bytes of data every day, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data.
Big data is a popular term used to describe the exponential growth and availability of data, both structured and unstructured.
Big Data is the term used to describe a massive volume of both structured and unstructured data that is so large that it’s difficult to process using traditional database and software techniques. In most enterprise scenarios the data is too big or it moves too fast or it exceeds current processing capacity.
Big data is a collection of data from traditional and digital sources inside and outside your company that represents a source for ongoing discovery and analysis.
Big data is important to business and society. More data may lead to more accurate analyses. More accurate analyses may lead to more confident decision making. And better decisions can mean greater operational efficiencies, cost reductions and reduced risk.
Since the advent of the Internet, a company’s BI and analytics no longer depend solely on business data. As we evolved with the Internet, we started adding weblogs, videos, images, sensor data and third-party application data from services like Facebook and Twitter to our systems. In other words, organizations these days are seeing more unstructured data than structured data, and all of it should now be included in the analysis for decision making.
Popular sites like Facebook, Twitter, YouTube, Instagram and LinkedIn have all exploded in user numbers, and they are the major content producers on the Internet. The number of images uploaded to Facebook or videos uploaded to YouTube is unfathomable.
- Data is different today.
- 80% of enterprise data is unstructured.
- Unstructured data is growing twice as fast as structured data.
The 3 Vs of Big Data
In 2001, industry analyst Doug Laney (currently with Gartner) articulated the now mainstream definition of big data as the three Vs of big data: volume, velocity and variety.
Volume. Many factors contribute to the increase in data volume. Transaction-based data stored through the years. Unstructured data streaming in from social media. Increasing amounts of sensor and machine-to-machine data being collected. In the past, excessive data volume was a storage issue. But with decreasing storage costs, other issues emerge, including how to determine relevance within large data volumes and how to use analytics to create value from relevant data.
Velocity. Data is streaming in at unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time. Reacting quickly enough to deal with data velocity is a challenge for most organizations.
Variety. Data today comes in all types of formats. Structured, numeric data in traditional databases. Information created from line-of-business applications. Unstructured text documents, email, video, audio, stock ticker data and financial transactions. Managing, merging and governing different varieties of data is something many organizations still grapple with.
Why Should Big Data Matter to Any Organization?
The real issue is not about acquiring large amounts of data. It’s what we do with the data that counts. The hopeful vision is that organizations will be able to take data from any source, harness relevant data and analyze it to find answers that enable cost reductions, time reductions, new product development and optimized offerings.
For example, 80% of medical data is unstructured yet clinically relevant. It resides in multiple places such as individual EMRs, lab and imaging systems, physician notes, medical correspondence and claims. By leveraging Big Data, we can build sustainable healthcare systems and collaborate to improve care and outcomes.
For instance, by combining big data and high-powered analytics, it is possible to:
- Determine root causes of failures, issues and defects in near-real time, potentially saving billions of dollars annually.
- Optimize routes for many thousands of package delivery vehicles while they are on the road.
- Generate retail coupons at the point of sale based on the customer’s current and past purchases.
- Send tailored recommendations to mobile devices while customers are in the right area to take advantage of offers.
- Recalculate entire risk portfolios in minutes.
- Quickly identify customers who matter the most.
- Use data mining to detect fraudulent behavior.
Big Data is gaining popularity since storage is becoming cheaper and cheaper.
NoSQL technology was pioneered by leading internet companies, including Google, Facebook, Amazon, and LinkedIn, to overcome the limitations of 40-year-old relational database technology for use with modern web applications. Today, enterprises are adopting NoSQL for a growing number of use cases, a choice that is driven by four interrelated megatrends: Big Users, Big Data, the Internet of Things, and Cloud Computing.
There are many NoSQL databases that are gaining momentum in the context of Big Data.
Big Data storage with key-value databases:
- Azure Table Storage
Big Data storage with column-family store databases:
- Amazon SimpleDB
Big Data storage with document store databases:
Choosing a NoSQL database does not mean giving up querying capabilities. Many NoSQL databases offer:
- A query language
- Fast performance
- Horizontal scalability
- Load balancing
- File storage
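To make horizontal scalability and load balancing a little more concrete, here is a minimal sketch of hash-based sharding, one common way to spread keys across several servers. The node names and the shard_for helper are hypothetical, and real databases use more elaborate schemes (consistent hashing, for example), so treat this as an illustration only.

```python
import hashlib
from collections import Counter

# Hypothetical server names, for illustration only.
NODES = ["node-a", "node-b", "node-c"]

def shard_for(key: str, nodes=NODES) -> str:
    """Pick a node for a key by hashing the key (simple modulo sharding)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

# Spread 3000 made-up user keys over the nodes and count the load on each.
load = Counter(shard_for(f"user:{i}") for i in range(3000))
print(load)
```

The same key always lands on the same node, and the keys end up spread over all the nodes. The weakness of plain modulo sharding is that adding a node moves most keys to a different server.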
However, with many NoSQL databases we will miss some of the features we are very much used to in an RDBMS. Often there is:
- No support for joins
- No support for transactions
- No support for constraints
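When the database will not join for you, the join usually moves into application code. The sketch below assumes two hypothetical collections, customers and orders, held as plain Python lists of dictionaries rather than fetched from any real database.

```python
# Hypothetical data standing in for two collections in a document store.
customers = [
    {"_id": 1, "name": "Asha"},
    {"_id": 2, "name": "Ravi"},
]
orders = [
    {"_id": 101, "customer_id": 1, "total": 250},
    {"_id": 102, "customer_id": 2, "total": 90},
    {"_id": 103, "customer_id": 1, "total": 40},
]

def join_orders_with_customers(orders, customers):
    """Attach the customer name to each order, as a SQL JOIN would."""
    by_id = {c["_id"]: c for c in customers}  # index customers once
    return [
        {**o, "customer_name": by_id[o["customer_id"]]["name"]}
        for o in orders
        if o["customer_id"] in by_id  # inner-join semantics: skip orphans
    ]

joined = join_orders_with_customers(orders, customers)
for row in joined:
    print(row)
```

The other common approach is to embed the related data inside a single document, so that no join is needed at read time.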
If your business needs do not allow a fixed database schema with a predefined list of tables, and the schema has to keep changing with the ever-growing needs of the users, then you should look at NoSQL databases.
I have started exploring some very interesting features of MongoDB, which is a scalable, open source, high-performance, document-oriented database.
MongoDB stores data using a flexible document data model that is similar to JSON. Documents contain one or more fields, including arrays, binary data and sub-documents. Fields can vary from document to document. This flexibility allows development teams to evolve the data model rapidly as their application requirements change.
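Here is a small sketch of what such flexible documents can look like. It uses plain Python dictionaries and the standard json module, not MongoDB itself, and the field names and values are made up for illustration.

```python
import json

# Two hypothetical "documents" in the same collection; the fields differ.
post_a = {
    "title": "What is Big Data?",
    "tags": ["bigdata", "nosql"],                # array field
    "author": {"name": "Asha", "city": "Pune"},  # sub-document
}
post_b = {
    "title": "Getting started with MongoDB",
    "views": 120,                                # field that post_a lacks
}

collection = [post_a, post_b]

# Code that reads the documents has to tolerate missing fields.
for doc in collection:
    print(doc["title"], doc.get("tags", []), doc.get("views", 0))

# Documents serialize directly to JSON text.
as_json = json.dumps(post_a, indent=2)
print(as_json)
```

Note that the flexibility cuts both ways: nothing stops two documents from disagreeing about their fields, so the reading code carries the defaults that a fixed schema would otherwise enforce.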
Developers access documents through rich, idiomatic drivers available in all popular programming languages. Documents map naturally to the objects in modern languages, which allows developers to be extremely productive. Typically, there’s no need for an ORM layer.
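To illustrate why an ORM layer is often unnecessary, the sketch below maps a Python object to a document-shaped dictionary and back using only the standard library. The Author and Post classes and the post_from_document helper are hypothetical, and no MongoDB driver is involved here.

```python
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class Author:
    name: str
    city: str

@dataclass
class Post:
    title: str
    author: Author
    tags: List[str] = field(default_factory=list)

def post_from_document(doc: dict) -> Post:
    """Rebuild a Post from a document-shaped dictionary."""
    return Post(
        title=doc["title"],
        author=Author(**doc["author"]),
        tags=list(doc.get("tags", [])),
    )

post = Post("What is Big Data?", Author("Asha", "Pune"), ["bigdata"])
document = asdict(post)  # nested dataclasses become nested dicts
restored = post_from_document(document)
print(document)
print(restored == post)
```

The nested object becomes a nested sub-document and the list becomes an array, with no mapping tables or join configuration in between.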
Stay tuned for my next post on MongoDB features, installation and usage details.