Friday, November 1, 2013

Business Intelligence

A guest post by Ben Keeping, a freelance developer based in the UK, specializing in applications written in .NET and Ruby, using lean and agile techniques.



The world runs on data, and businesses' demand for intelligence about how that data is used will only increase as more and more data is made public. Demand for real-time, or near-real-time, reporting, statistics, and aggregates is now the norm, rather than those reports being run once per week. This blog post explores some approaches for delivering business intelligence.




THE PROBLEM



You have a highly normalized database with millions of rows, and you cannot run real-time queries to generate statistics or aggregations because the queries would be too slow, or would consume too many of the database server's resources.



Imagine an online ordering site backed by a highly normalized schema: an Orders table referencing Users and Products, with lookup tables such as UserTypes, AccountTypes, ProductTypes, and Countries.



(Note that I'm not saying this is a great schema - it's just an example of a highly normalized database for the purposes of this post. Although NoSQL is gaining popularity, the majority of companies probably still have normalized SQL-based databases.)



The business wishes to run a suite of reports, which could include:



* Product types per country per order date

* Product types per account type

* Product types per user type



and so on.



CLASSIC BATCH JOB / ETL-LIKE PROCESS



Extract-Transform-Load (ETL) is a classic and popular method for generating data warehouses. The general theory is that you have one or more sources of data that you extract data from. That data is then transformed into your desired format, and the results are loaded into your target database.



Perhaps, as part of the transformation process, you de-normalize some data. For example, instead of a Users table holding a foreign key to UserTypes, you create a Users table that stores the actual user type - Commercial, Partner, or Individual.
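
As a concrete illustration, here is a minimal sketch of that transform step in Python, using an in-memory SQLite database so it runs standalone. The table and column names are assumptions for illustration, not taken from a real schema:

# A hedged sketch of the de-normalizing transform step. Table and column
# names are illustrative assumptions; sqlite3 keeps the example standalone.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE UserTypes (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Users (id INTEGER PRIMARY KEY, name TEXT, usertype_id INTEGER);
    INSERT INTO UserTypes VALUES (1, 'Commercial'), (2, 'Partner'), (3, 'Individual');
    INSERT INTO Users VALUES (1, 'Acme Ltd', 1), (2, 'Jane', 3);

    -- The reporting table stores the user type inline instead of a key.
    CREATE TABLE ReportingUsers (id INTEGER PRIMARY KEY, name TEXT, usertype TEXT);
""")

# Transform + load: resolve the foreign key once, at ETL time.
conn.execute("""
    INSERT INTO ReportingUsers (id, name, usertype)
    SELECT u.id, u.name, ut.name
    FROM Users u JOIN UserTypes ut ON ut.id = u.usertype_id
""")
print(conn.execute("SELECT * FROM ReportingUsers").fetchall())
# -> [(1, 'Acme Ltd', 'Commercial'), (2, 'Jane', 'Individual')]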



The reports are then run against the reporting database, and emailed out, or made available on a website, or perhaps you're using some tool like Crystal Reports or Business Objects, so people can query the data away from your production database.



This kind of approach works for a lot of people, and can be fairly powerful, because you're not customizing the data too much, you're retaining much of the original schema, and it's fairly quick to write new reports.



The downside is that because the data is still fairly normalized, if you start hitting the millions of rows mark, and you have lots of users querying your reporting database, performance could once again become an issue.



DISK IS CHEAP



Another approach would be to embrace de-normalization, because disk space is cheap. Instead of largely retaining the normalized schema, you create tables designed to serve a specific report.



So during your ETL process, you create a table per report. For example, for the report 'Product types per user type,' you create a Producttypesperusertype table, which can be queried much more simply because you've done all of the hard work of joining the tables during the ETL.
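
To make that concrete, here is a hedged sketch in Python with an in-memory SQLite database; the source tables, their names, and the sample rows are all assumptions for illustration:

# Sketch of the per-report ETL step: one aggregate query does all the hard
# joining up front and builds the whole report table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ProductTypes (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE UserTypes   (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Products    (id INTEGER PRIMARY KEY, producttype_id INTEGER);
    CREATE TABLE Users       (id INTEGER PRIMARY KEY, usertype_id INTEGER);
    CREATE TABLE Orders      (id INTEGER PRIMARY KEY, product_id INTEGER, user_id INTEGER);

    INSERT INTO ProductTypes VALUES (1, 'DVDs');
    INSERT INTO UserTypes    VALUES (1, 'Commercial');
    INSERT INTO Products     VALUES (1, 1);
    INSERT INTO Users        VALUES (1, 1);
    INSERT INTO Orders       VALUES (1, 1, 1), (2, 1, 1);

    CREATE TABLE Producttypesperusertype AS
    SELECT COUNT(*) AS count, pt.name AS producttype, ut.name AS usertype
    FROM Orders o
    JOIN Products p      ON p.id  = o.product_id
    JOIN ProductTypes pt ON pt.id = p.producttype_id
    JOIN Users u         ON u.id  = o.user_id
    JOIN UserTypes ut    ON ut.id = u.usertype_id
    GROUP BY pt.name, ut.name;
""")
print(conn.execute("SELECT * FROM Producttypesperusertype").fetchall())
# -> [(2, 'DVDs', 'Commercial')]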



So your table would look something like this:



COUNT   | PRODUCT TYPE         | USER TYPE
4345089 | Electrical equipment | Individual
1242368 | Electrical equipment | Commercial
352966  | Electrical equipment | Partner
493611  | DVDs                 | Individual
36583   | DVDs                 | Commercial
265892  | DVDs                 | Partner



With your data highly optimized for your report, the reports run quickly, and further subqueries and aggregations are possible in real time.



But what if the ETL batch job taxes your production database too much while it runs? Perhaps the job used to run overnight, when traffic to your site was lower, but now it's still running at 9 a.m. and impacting your production system.



DOMAIN EVENTS & PUB/SUB



Sometimes handling large amounts of business intelligence data requires a different approach.



A domain event captures the state of a domain object at a certain point in time. Often implemented using an asynchronous publish/subscribe pattern, domain events provide an excellent way of acting on domain changes non-invasively - i.e., without affecting the core control flow of the code issuing the event.



For example, in our ordering system, a domain event could be raised when an order is created:



OrdersService {
    Create(product, user) {
        order = new Order(product, user)
        order.Save()
        DomainEvents.Publish(Events.OrderCreated, order)
    }
}
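
The pseudocode above assumes some DomainEvents publisher exists. As a rough idea of the mechanism, here is a minimal sketch in Python - synchronous and in-process for brevity, whereas a production system would more likely dispatch asynchronously via a message queue; all names here are illustrative assumptions:

# Minimal in-process publish/subscribe sketch (names are assumptions).
from collections import defaultdict

class DomainEvents:
    # Maps an event name to the list of handlers subscribed to it.
    _handlers = defaultdict(list)

    @classmethod
    def subscribe(cls, event_name, handler):
        cls._handlers[event_name].append(handler)

    @classmethod
    def publish(cls, event_name, payload):
        # Synchronous dispatch for brevity; a real system might instead
        # put the event on a queue and process it out-of-band.
        for handler in cls._handlers[event_name]:
            handler(event_name, payload)

# Usage: a subscriber registers once, then every publish reaches it.
DomainEvents.subscribe("OrderCreated", lambda name, order: print(name, order))
DomainEvents.publish("OrderCreated", {"product": "DVDs", "user": "Jane"})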



Once this domain event is fired, the subscribed handlers will pick up the change, extract the required information for a particular report, and then save that change.

So for our report that aggregates product types by user types, you may implement something such as:



ProductByUserOrderCreatedHandler {
    Handle(event, order) {
        if (event != Events.OrderCreated) return
        productType = order.Product.Type
        userType = order.User.Type
        // Note: SQL used for brevity - you may use some other persistence method!
        sql = "update ProductsbyUsers set count = count + 1
               where producttype = productType and usertype = userType"
        sql.Execute()
    }
}



You'd implement a different handler to update the data for each report. The handler above extracts just the product type and user type in order to update the aggregated products-by-users table.



But you might have another handler focused on products by country, or products by account type, and so on. Each of these handlers pushes data into its own report-based table, meaning you generate near-real-time report data without impacting your production database.



NOSQL-BASED SOLUTIONS



So far we've been talking about SQL-based solutions, but there is also a suite of NoSQL technologies that can act as great reporting data stores. Using the domain event and pub/sub patterns means you can leverage a different technology without affecting your core codebase. NoSQL databases such as Mongo have some great features for aggregations.
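
For instance, Mongo's aggregation pipeline can compute grouped counts on the server. Here is a hedged sketch using pymongo, where the connection details, the orders collection, and its field names are all assumptions for illustration:

# Sketch: group raw order documents into per-(product type, user type)
# counts using Mongo's aggregation pipeline.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local instance
orders = client["reporting"]["orders"]             # hypothetical collection

pipeline = [
    {"$group": {
        "_id": {"producttype": "$producttype", "usertype": "$usertype"},
        "count": {"$sum": 1},
    }}
]
for row in orders.aggregate(pipeline):
    print(row["_id"], row["count"])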



In an asynchronous system, one problem with updating counts is thread safety. If several threads attempted to update the count of DVDs ordered by the user type 'Commercial' at the same time, we could end up with an incorrect count.



Mongo has a great operator called $inc, which increments a field by a given amount in a single atomic operation.



So imagine our 'products by user types' document looked like this:



{ count: 1638494, producttype: 'DVDs', usertype: 'Commercial' }



We could issue a mongo command against our reporting collection like this:



db.productsbyusertypes.update(
    { producttype: 'DVDs', usertype: 'Commercial' },
    { $inc: { count: 1 } }
)



This means that rather than trying to work out how to increment our aggregated products by user types collection in a thread-safe manner, Mongo can handle it for us.
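
The same atomic increment can be issued from application code - for example, from inside an OrderCreated handler. A sketch assuming pymongo and a local Mongo instance (database and collection names are illustrative):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
reports = client["reporting"]["productsbyusertypes"]

# $inc is applied atomically on the server, so concurrent handlers cannot
# lose updates; upsert=True creates the document the first time a
# (product type, user type) combination is seen.
reports.update_one(
    {"producttype": "DVDs", "usertype": "Commercial"},
    {"$inc": {"count": 1}},
    upsert=True,
)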



SUMMARY



Handling massive amounts of data, and delivering business intelligence reports in real time or near real time, is a complicated problem. There are lots of ways to implement it, and lots of technologies to help you do so. De-normalizing data may be enough to make a reporting system perform well, or you may need to generate specialized per-report tables. Domain events bring a certain complexity of their own, but can be an excellent solution that operates without impacting your production system.



For more details about Business Intelligence, see the resources below from Safari Books Online.






SAFARI BOOKS ONLINE HAS THE CONTENT YOU NEED



The first title is a full overview of what comprises business intelligence. It is intended to provide an introduction to the concepts and to uncomplicate the learning process when implementing a business intelligence program.



The second title illustrates how organizations can make better, faster decisions about their customers, partners, and operations by turning mountains of data into valuable business information that's always at the fingertips of decision makers. You'll learn what's involved in using business intelligence to bring together information, people, and technology to create successful business strategies - and how to execute those strategies with confidence.



The third title offers a deep look into the latest research, tools, implementations, frameworks, architectures, and case studies within the field of Business Intelligence Management.



ABOUT THE AUTHOR



Ben Keeping is a freelance developer based in the UK. He currently specializes in applications written in .NET and Ruby, using lean and agile techniques.