Exploring the Oracle Data Miner Interface

Posted: March 22nd, 2010 | Author: Brendan Tierney | Filed under: Business Intelligence, Oracle, Training, Uncategorized, data mining | Tags: , , , , | No Comments »

Once you have successfully installed Oracle Data Miner and established a connection you will be presented with the main ODM window with the navigator pane on the left hand side. All the tasks that you will need to do in ODM can be accessed from the Navigator pane or by the menu across the top of the window.2-1

    The Navigator

Before commencing any data mining exercise you will want to explore your data. To do this you will need to expand the Data Sources branch of the tree in the Navigator pane.

When expanded you will get a list of all users in the database who has some data visible to you or publicly to all users in the database. You only need to look for your Oracle Data Miner user. This will be the one you created as part of the installation steps given previously. By clicking on your ODM user you will see the branch will divided into Views and Tables. By expanding these branches you will be able to see what views and tables have been defined in your ODM user. The tables and views that you will see will be those that were defined as part of the installing steps given previously.

    2-2
    Loading/Importing Data

There are a number of ways of getting data into your ODM schema. As your ODM schema is like any other oracle schema you can use all the techniques you know to copy and load data into the schema. The ODM tool on the other hand has a data Import option, which can be found under the Data menu option. ODM uses SQL*Loader as the tool to load data into your ODM schema.

Before you select the Import option you will need to set the location of the SQL*Load executable. To do this, select Preferences from the Tool menu (Tool -> Preferences). In the Environment tab enter the location or browse for the executable.2-3

When you have specified the location of SQL*Loader you can now load CSV type files into you ODM schema. Select the Import option from the Data menu. This will open the File Import Wizard. After the introduction screen you can enter the name of a text file or search for the file.

2-4

In the example given here we will load a file of transactions from a grocery retailer for organic purchases.
On the next screen of wizard you can specify what the delimiter is, state if the first record contains the field headers. The Advanced Setting button allows you to enter some of the SQL*Loader specific setting, like skip count, max errors, etc.

2-5

The next screen of the wizard allows you to view the imported fields that were in the first row of the organics.csv file. You may need to do some tidying up of the column names, change the data types and the sizes for each attribute.

The next screen of the wizard allows you to specify the name of the table that the data should be inserted into. If a new table is needed then you can give the table name. If the data is to be appended to the data of an existing table then that table can be select from the drop down list. In our example you will need to select that the data will go into a new table and give the name of the new table name, e.g. ORGANIC_PURCHASES. At this point you have entered all the details and you can proceed to the last screen of the wizard and select the Finish button. It will take a minutes or so for the data to be loaded. Once the data is loaded the table will appear in the Navigator pane under Tables and the table structure will appear in the main part of the ODM window. You can now select the Data tab to look at the data that has been loaded.

2-6

    Exploring Data

Before beginning any data mining task we need to performs some data investigation. This will allow us to explore the data and to gain a better understanding of the data values. We can discover a lot by doing this can it can help us to identify areas for improvement in the source applications, as well as identifying data that does not contribute to our business problem (this is called feature reduction), and it can allow us to identify data that needs reformatting into a number of additional features (feature creation). A simple example of this is a date of birth field provides no real value, but by creating a number of additional attributes (features) we can now use the date of birth field to determine what age group they fit into.

To begin exploring the data in ODM, we need to select the table or view from the navigation tree. Hole the mouse over the table/view name and right click. From the menu that now displays, select the Show Summary Single-Record. We will come back to the other menu options when we cover the next topic, Data Transformations.

2-8

The Data Summarization screen will present some high level information for your data. Taking our Organics data we can explore the ranges of values for each attribute. There is a default setting of 10, which is the number of divisions that ODM will divide the data into. During the data exploration exercise you may want to vary this value from 5 to 20 to see if there are any groupings or trends in the data. This value can be changed by pressing the Select All button followed by the Preferences button.

The following couple of example will illustrate how using the data exploration tool in ODM can help you discover information about your data in a simple graphical manner. The alternative is to log into SQL*Plus or Oracle Developer to write and execute queries to do something similar.

For the fist example we will select he Gender attribute. Although we have our number of bins set to 10, we only get 5 bins displayed. The tool will try to divide the data in to 10 equally spaced bins, but if insufficient data exists then it will present the histogram with the existing set of distinct value.

2-9

From this example on the Gender attribute we can see that there are five distinct values. F for female, M for male, U for unknown, “” (space) for records that contain a space and Other for records that contain a null. From this data we can work out that maybe we should only have three possible vales (F, M, U) but we have 105 records where we do not have a value and this equates to just over 10% of the data samples. This is a sizeable number of records. So one of the steps in the Data Transforation phase (or data clean-up) is what do we work out what to do with these records. This can include removing the data from the data set, working our what the correct value should be or changing the value to U for unknown.

For our second example we will look at the MINING_DATA_BUILD_V view (this can be found under the views branch in the navigator pane). Again right click on this object and select Show Summary Single-Record to bring up the Data Summarisation window. One of the first things that you will notice is that we get a lot more details relating to each attribute.

2-10

Some of these extra details include the average, max, min variance and the number of null values. Again we can go exploring the data, changing the number of bins to varying sizes to see if there is any hidden information in the data. An example of this is if we select AGE and set the number of bins to 10. We get a nice histogram showing that most of our customers are in the 31 to 46 age ranges. So maybe we should be concentrating on these.

2-11

Now if we change the number of bins to 30 can get a completely different picture of what is going on in the data. Now we can see that there are a number of important age groups what stand out more than others. If we look at the 21 to 34 age range, in the first histogram we can see that there is not much change between each of the age bins. But when we look at the second histogram for the 30 bins for the same 21 to 34 age range we get a very different view of the data. In this second histogram we see that that the ages of the customers vary a lot. What does mean. Well it can mean lots of different things and it all depends on the business scenario. In our example we are looking at an electronic goods store. What we can deduce from this second histogram is that there are a small number of customers up to about age 21. Then there is a big jump. Is this due to people having obtained their main job after school having some disposable income. This peak is followed by a drop off in customers followed by another peak, drop off, peak, drop off etc. Maybe we can build a profile of our customer based on their age just like what our financial organisations due to determine what products to sell to use based on our age and life stage.

2-12

From this histogram we can maybe categorise the customers into the follow

• Early 20s – out of education, fist job, disposable income
• Late 20s to early 30s – settling down, own home
• Late 30s – maybe kids, so have less disposable income
• 40s – maybe people are trading up and need new equipment. Or maybe the kids have now turned into teenagers and are encouraging their parents to buy up todate equipment.
• Late 50s – These could be empty nesters where their children have left home, maybe setting up home by themselves and their parents are building things for their home. Or maybe the parents are treating themselves with new equipment as they have more disposable income
• 60s + – parents and grand-parents buying equipment for their children and grand-children. Or maybe we have very techie people who have just retired
• 70+ – we have a drop off here.

As you can see we can discover a lot in the day by changing the number of bins and examining the data. The important part of this examination is trying to relate what you are seeing from the graphical representation of the data on the screen, back to the type of business we are examining. A lot can be discovered but you will have to spend some time looking for it.

In my next posting, I will cover some of the Data Transformation function that are available in Oracle Data Miner.

Brendan Tierney


10 Reasons you really need predictive analytics

Posted: February 4th, 2010 | Author: Brendan Tierney | Filed under: Best Practice, Business Intelligence, Data Warehouse, analytic functions, data mining | Tags: , , | No Comments »

SPSS have recently posted and article called “10 Reasons you really need predictive analytics“. I thought it would be interesting to post the main points from this article to illustrate that not all predictive analytic projects involve Data Mining, but involve a number of different techniques and looking the the business data in a different way. Yes data mining can be a very important element in some of the following

1. Get a higher return on your data investment
Your organization has a significant investment in data – data that contains critical information about every aspect of your business. Today more than ever, you need to get the best return on the data you have collected–and predictive analytics is the most effective way to do this. Predictive analytics combines information on what has happened in the past, what is happening now, and what’s likely to happen in the future to give you a complete picture of your business.

2. Find hidden meaning in your data
Predictive analytics helps you maximize the understanding gained from your data. It enables you to uncover hidden patterns, trends, and relationships and transform this information into action.

3. Look forward, not backward
Unlike reporting and business intelligence solutions that are only valuable for understanding past and current conditions, predictive analytics helps organizations look forward. By leveraging sophisticated statistical and modeling techniques, you can use the data you already have to help you anticipate future events and be proactive, rather than reactive.

4. Deliver intelligence in real time
Your business is dynamic. With predictive analytics, you can automatically deploy analytical results to both individuals and operational systems as changes occur, helping to guide customer interactions and strategic nd tactical decision making.

5. See your assumptions in action
Advanced analytical methods give you the tools to develop hypotheses about your organization’s toughest challenges and test them by creating predictive models. You can then choose the scenario that is likely to result in the best outcome for your organization.

6. Empower data-driven decision making
Better processes help people throughout your organization make better decisions every day. Predictive analytics enables your organization to automate the flow of information to match your business practices and deliver the insights gained through this technology to people who can apply them in their daily work.

7. Build customer intimacy
When you know each of your customers or constituents intimately—including what they think, say, and do—you can build stronger relationships with them. Predictive analytics gives you a complete view of your customers, and enables you to capture and maximize the value of each and every interaction.

8. Mitigate risk and fraud
Predictive analytics helps you evaluate risk using a combination of business rules, predictive models, and information gathered from customer interactions. You can then take the appropriate actions to minimize your organization’s exposure to fraudulent activities or highrisk customers or transactions.

9. Discover unexpected opportunities
Your organization can use predictive analytics to respond with greater speed and certainty to emerging challenges and opportunities, helping you to keep pace in a constantly changing business environment.

10. Guarantee your organization’s competitive advantage
Predictive analytics can drive improved performance in every operational area, including customer relations, supply chain, financial performance and cost management, research and product development, and strategic planning. When your organization runs more efficiently and profitably, you have what it takes to out-think and out-perform your competitors

So what is Predictive Analytics. Check out the description on Wikipedia

Let me know you views and comments on the above.

Brendan Tierney


Oracle Data Miner – New Resources

Posted: January 27th, 2010 | Author: Brendan Tierney | Filed under: Best Practice, Business Intelligence, Data Warehouse, Training, data mining | Tags: , , , , | No Comments »

Over the past couple of weeks a couple of new web resources have appeared on Oracle Data Miner

The first one is that Charlie Berger, the director of Oracle Data Mining Product Management, has started a blog specifically for Oracle Data Miner. Check it out,
http://blogs.oracle.com/datamining/

If you are already using Oracle Data Miner or are interested in following its developments why not join the Oracle Data Miner Facebook group
http://www.facebook.com/pages/Oracle-Data-Mining/287065104533?ref=mf


Why Has Data Mining Struggled So Much?

Posted: November 20th, 2009 | Author: Brendan Tierney | Filed under: Business Intelligence, Data Warehouse, Oracle, data mining | Tags: , , , , | No Comments »

Bill Inmon has recently posted an article on “Why has Data Mining struggled so much?”

The article discusses 7 diferent reasons why data mining has struggled, as it has been around for a very long time.

The main points are
1. We have been waiting a long time for it to become available in a usable way
2. Data mining is considered an academic focused with very few practitioners. But this is become less so
3. Data mining requires a different set of skills. Yes you need data management skills but you also need some data mining skills. I will be making a posting focusing on the skill sets required for data mining in the coming weeks.
4. Some industries and application areas are more suited to data mining than others. The difficult is in identifying suitable projects.
5. Data for Data Mining is unclean. Not if you use a data warehouse. Idealy an organisation who has a matur-ish BI infrastrucure will benefit must from a Data Mining project
6. Data is incomplete. Yes you may need to enrich the data from various sources. But again if you have a Data Warehouse you will have most of these
7. Approaches to data mining inadequate. Alot of the approches to data mining projects as based on its statistical history. New problem areas are evolving all the time and we can use data mining in lots of different way.

To view Bill Inmon’s article – click here.

To view our 2 training courses on data mining – click here

Brendan Tierney


Good Oracle Data Mining Link & Book

Posted: November 3rd, 2009 | Author: Brendan Tierney | Filed under: Books, Business Intelligence, Oracle, Web Mining, data mining | Tags: , , , | No Comments »

The following link is a good resource giving details of various aspects of Oracle Data Mining. It is by BC Consulting.
http://www.dba-oracle.com/data_mining/

There is also a link to a book on Oracle Data Miner which covers the version of ODM for 10g, but some of the material in the book also applies for the 11g version. The book is by Dr. Ham and is available from Rampant Books

http://www.rampant-books.com/book_2006_1_oracle_data_mining.htm

 

Brendan Tierney


Data Warehousing Books: Design and architecture

Posted: October 31st, 2009 | Author: Uli Bethke | Filed under: Books, Business Intelligence, Data Warehouse, Data Warehousing Books | Tags: , , , , | 2 Comments »

In another post I have covered data warehousing books in the world of Oracle. We’ve also had a look at data warehousing and business intelligence books for project management and business analysis. Today we will look at data warehousing and business intelligence books that look at the technical design and architecture of a data warehouse solution.

Must Have

DW 2.0: The Architecture for the Next Generation of Data Warehousing: Bill Inmon revisits his data warehouse architecture. Addresses the following issues: Real-time BI, unstructured data, the enterprise data warehouse and change, the data life cycle, time variance of data. Very useful from a conceptual point of view, but not enough detail.

The Data Warehouse Toolkit- The Complete Guide to Dimensional Modelling. My first book on data warehousing. Still valuable today. Great for dimensional modelling data marts or small non-realtime Enterprise Data Warehouses based on Kimball’s conformed dimensions. It also has a good overview on industry specific data model patterns in a dimensional context. A must have.

The Data Model Resource Books Vol 1-3: The books describe fundamental data modeling patterns that can be applied and reused across the enterprise. If you are assigned the task of modelling an Enterprise Data Warehouse, these books give you great insight into best practices in data modelling. Volume 2 offers industry specific data model patterns and provides invaluable information to better understand the issues at hand in a particular industry. Personally I find it that you should actually start with volume 3 as this is the most generic of the three books. Also if you only get one of the books get volume 3.

If you have a requirement around near-real time data warehousing and operational business intelligence I recommend to look into Dan Linstedt’s data vault modelling techniques. The Business of Data Vault Modeling will get you started.

Some more recent additions to the data warehouse architecture league of books includes Building and Maintaining a Data Warehouse and Advanced Data Warehouse Design. The first of these walks us through all the technical areas of a data warehouse project: source system analysis, database design, bi reporting, data quality, metadata. In my opinion, the best chapter is on data integration and ETL. There are very few dedicated ETL books out there and this is one of the few that touches on the subject, albeit from a high level. In Advanced Data Warehouse Design the authors discuss the shortcomings of existing data warehouse implementations focusing mainly on spatial and temporal data, e.g. the shortcomings of slowly changing dimensions when capturing changes over time. They propose a truly temporal and spatial data warehouse. Examples are given in MS SQL Analysis Service (temporal) and Oracle OLAP (temporal and spatial).

To my knowledge the only book out there dedicated to the physical design of databases is Physical Database Design: the database professional’s guide to exploiting indexes, views, storage, and more. Most of the stuff covered here is for advanced users. It covers Oracle, DB2, SQL Server, and for some of the MPP stuff Teradata. Personally I found the chapter on physical design for a shared nothing architecture, and the chapter on hardware (CPU architecture, disks, server sizing etc.) the most useful.

               
               
               

Dr. Ronnie Abrahiem, Software Engineer at CIBER has recently published a book on combining SOA and data warehousing in a near-real time environment. This looks quite interesting but I haven’t read the book myself. It has the rather long title Data Warehousing with Service-oriented Architecture: Designing and Implementing Prototype Models For an Integration of Near-Real-Time Data Warehousing Architecture with Service-oriented Architecture. I am currently working on a project where we want to integrate a SOA based MDM solution with the data warehouse. The book may offer some interesting insights around this.

Should Have

If you have a lot of aggregate tables in your warehouse I recommend to have a look at Mastering Data Warehouse Aggregates for a formalised methodology and some really useful tips and tricks around an aggregate navigator.

Another recent addition to data warehouse design books is Data Warehouse Design: Modern Principles and Methodologies. Very useful chapter on ETL and quite affordable.

Could Have

Data Warehouse Design Solutions. This is useful as a second reference for industry specific dimensional models. However, it can not replace Kimball’s original book on the subject.

Clickstream Data Warehousing. If you are implementing a data warehouse for web analytics you should have a look here. However, in light of the explosion of data volumes and with Hadoop and MapReduce at hand this one is slightly obsolete.


What is Data Mining ?

Posted: October 28th, 2009 | Author: Brendan Tierney | Filed under: Business Intelligence, Training, Uncategorized, data mining | Tags: , , , , , , , , , | No Comments »

In this weeks topic I will explore what is Data Mining, the different meanings, how the term is used, etc. I will give you my interpretation of what it is and how other descriptions of data mining can be categorised.

Every article you read, every presentation you hear, etc. you get a slightly different description, or should it be that they hint to a description of how they use data mining in the products or their applications. By giving this hint at what data mining is they try to claim that they are using it, as it gives their products, applications and services a higher degree of sophistication compared to others. There is also the idea that it is a one of those trendy terms that is thrown out without them really knowing what it is really about.

Data Mining Definition

One of the most commonly cited definitions of what data mining is, “..it is the non-trivial extraction of previously unknown and potentially useful information from data” by Usama Fayyad et al (Chief Data Officer, Yahoo Inc) in their landmark paper back in 1996.

Based on this definition data mining is does not involve some basic analytics, decision making based on some defined rules, being able to identify events based on current data, etc. But these type of scenarios are typically talked about as being data mining. If we go back to the definition by Fayadd above, by say the “non-trivial” it means that we cannot write some code/queries to pull data out of our data that answers some simple questions. Another important part of the definition is “potentially useful information”, tells use that some times and may in a lot of cases, data mining does not give use anything useful. But it can give us useful information only if we have a good understanding of the data, the business rules of the data, the meta-data, how the rules and the data relate to each other, etc. All of this requires extensive experience of working with the data. Who is best at doing this, but database designers and developers. People with a statistics background (typical what you see in data mining roles) have to go and learn all about the data, the business rules, the meta-data etc. This can be a huge waste of time and resources as the database people are generally ignored.

Some examples

I was at an IT conference last week (I was co-author of a paper on Opinion Data Mining). One of the key note talks was given by a technical lead in IBM (one of two thousand in the company). He gave some good examples of how Business Intelligence (BI) could be used to manage the energy needs of a new city being build out in the middle east. He also gave another example of how BI is being used in and around Galway city and coast line. There were several mentions of data mining during his talk, but I don’t think any of his examples reflected what data mining is. Yes he did give examples of how you can intelligently use your data. For example, if an object is spotted out in Galway bay then you can predict where this object will come to shore. But data mining is not the technique that is used in this case. Instead it is a rules based type system, that takes into account a number of factors, link the size of the object, the current position, currents, wind direction, etc. Using these rules (and not data mining) they can identify the landing position and let all the necessary bodies know this (like the coast guard, Galway county council, environmental control, etc).

Generally data mining can be used when you have a mature BI environment in your organisation that includes not just transactional and business reporting, but also data warehousing, data analytics, prediction systems (based on rules), etc. Data mining allows you to explore for and identify patterns in your data (and you need lots of data really). Going back to the definition of data mining a lot of the results from a data mining project may not be of any value. What you are looking for are the nuggets of gold that exists in the data and you may take some time to fine these, if they exist at all.

One of the aims of this weeks posting was to explore what data mining really is. At this point I haven’t really talked much about what it is, but what I hope you have gotten so far is that the term data mining is overly used in the IT world and can be seen as one of those trendy words that organisation like use (and use incorrectly). Data mining is used as an umbrella term that covers any processing of your data that involves a bit a processing, applying some rules and some analytics.

Over the coming weeks we will explore what Data Mining really is and what are the different stages of a Data Mining project.

The next posting will be about CRISP-DM, which is a industry neutral, product neutral data mining life cycle.


Data warehousing for free! Terabyte sized data warehouse and business intelligence without license costs

Posted: October 26th, 2009 | Author: Uli Bethke | Filed under: Business Intelligence, Data Warehouse | Tags: , , , | No Comments »

Greenplum

This is no joke. Greenplum on 19 October announced a free single node edition of its analytical database.

For those of you who haven’t heard about Greenplum, they are a provider of an MPP database software that runs on commodity hardware (unlike some its competitors such as Teradata, Netezza, or recently Oracle with Exadata). The database is based on open source database software PostgreSQL, however, is closed source itself.

Features of the database include Massively Parallel Processing, redundancy, compression, row-level or column oriented data storage, compression, partitioning, SQL standard including SQL 2003 OLAP (analytic functions etc.), MapReduce support, ODBC & JDBC support.

So what restrictions are there for the single node edition. Obviously you are only allowed to run it on a single node. Below is an extract from the Greenplum datasheet:

  • Unlimited production usage on a single commodity x86 server using up to 2 CPU sockets (and unlimited CPU cores), or in a single virtual machine using up to 8 virtual CPU cores.
  • Fully parallel SQL and MapReduce processing leverages multi-core parallel-processing engine for every query.
  • No storage capacity cap: from GBs to 10s of TBs.
  • Hybrid row and column-oriented processing.
  • Free community support as well as a low-cost, paid support option.

Of course, the full power of Greenplum’s shared nothing architecture only materialises with multiple nodes. But the company says that you can expand seamlessly from a single-node to multi-node architecture.

Documentation is installed when you install the single-node edition. Couple of thousand pages long but tiny compared to the beast you get with the Oracle database.

Use cases

I can see two immediate use cases for this:

(1) Greenplum themselves promote this offering as part of their Enterprise Data Cloud. They have a vision of self service data marts. Based on this, data analysts can go to the Enterprise Data Warehouse and via interfaces create their own data marts for in depth analysis outside the EDW. Have a look at Curt Monash’s excellent article on the future of data marts.

(2) I can see another use case for departmental solutions. You could set up your first couple of subject areas or data marts on a single node machine and if you reach limits on this single node, add more nodes to scale out. Or if you don’t reach this limit just stay on this setup forever.

So why are they giving away data warehouses for free? In another article, Curt Monash gives the following reasons:

  • Adding value to its Enterprise Data Cloud story
  • Seeding the market for future enterprise sales
  • Depriving competitors of revenue, perhaps at enterprises too small to ever be paying Greenplum customers

Microstrategy

Combine the Greenplum offering with Microstrategy’s free Reporting Suite, and you have a best of breed departmental solution for zilch.

The following restrictions apply to the Microstrategy BI tool:

- 100 named users for the frontend of the BI tool and the BI server
- Two named users for the semantic layer module
- Limited to one CPU. I presume it is limited to one CPU core, but this is not clear from the website
- Two named users for the other modules in their BI suite, e.g. OLAP reporting etc.

Have a look at their website for a full set of features and conditions.

For the right set of requirements the above is an attractive and very cost-effective combination. On top of that it is scalable. So if you grow out of it just scale out and add on.


Getting Started with Oracle Data Mining

Posted: October 20th, 2009 | Author: Brendan Tierney | Filed under: Business Intelligence | Tags: , , | No Comments »

This blog posting is my first in the area of Data Mining, in particular performing Data Mining using Oracle.  Over the coming months I will be making regular postings in the areas of data mining and associated activities.  The approach that I take to data mining is slightly different to what would be considered the main stream approach.  Last year I gave a presentation at the UKOUG annual conference in Birmingham titled “You don’t need a PhD to do Data Mining”. What I have discovered over the past 12 years from working in this area is that it is dominated by people with a background in statistics (and many of them have PhDs). But having worked on many data mining projects, most of the work that needs to be done, can be done by and is generally done by Database People.

If you are interested in Data Mining then hopefully my postings over the coming months will help you gain a better understanding of what it is about, what you can do, what kind of outputs you get and how the outputs can be used. 

The following list of books and links will give you a good start in gaining some idea of what it entails

Oracle Data Mining Concepts Book – Parts A & B

Getting Started with Oracle 11g Data Mining

Why & What is Data Mining

As I add each posting I will add some additional links and resources, so over the next few months we will have a comprehensive list of resources for Data Mining in Oracle.  We will put together a separate web page to contain all these resources.  This will complement the list of resources that Uli Bethke is putting together on ODI.

We also offer two training courses on data mining. The first one of these will be aimed at managers and novices in the area. It will give a high level overview on key data mining concepts and issues. The second course will specifically deal with data mining on Oracle using PL/SQL and the Oracle data mining tool. Have a look at our training pages for details.

Let me know if you have any suggestions on topics that you would like me to cover.


Essential skills any BI Consultant should have. Or why a fool with a tool is still a fool.

Posted: June 26th, 2009 | Author: Uli Bethke | Filed under: Business Intelligence | Tags: , , , , | 4 Comments »

Today I’m gonna be mean.

Before we start though, here is my definition of a BI consultant. In my eyes, a BI consultant is the guy at the frontend of a data warehouse/business intelligence project. He creates and designs reports, and is responsible for setting up the logical layer between the data source(s) and the ad hoc query tool. His main task is to retrieve a s**tload of data in an ultra fast way to keep the business folks happy. In this narrow definition a BI consultant is not a data integration specialist or a data warehouse designer, and certainly not a data miner.

In medias res: A while back I was working for a consulting firm. They had this principal BI consultant and one day I had the “luck” to work in a project with him. We ran an INSERT statement to load a couple of million rows into a table. After a minute or so my friend got nervous and started to query the table to check on the progress of our data load. He got very pale when his query did not return any records and seriously thought he had discovered a major bug in the database system. But this guy had a tool. A BI tool. He was the hero of charts, ad hoc, and drag and drop. The master of the Business Objects universe.  Was he able to string a simple query together? No. Did he have the skills to get him out of his performance mess. Nope.

Why is it not enough to just master the BI tool? And more importantly what skills do you need to become a true master?

Let me try and explain with an analogy. Airline pilots are highly paid individuals. Rightly so. Do you think this is because of their fabolous auto pilot skills? Or has it to do with the fact that they are trained to do the right things in extremely precarious situations? If you want to become a pilot you need years and years of training and experience. But once you have mastered the art of flying, it shouldn’t matter much whether you fly around in a Boeing or an Airbus. It’ll take a while to adapt but at the end of the day a plane is just a plane. It obeys the laws of gravity. The same is true for BI tools. Once you have mastered the core skills and concepts you can quickly and easily transfer them to any BI tool in the world and be a great success. For example with my knowledge of Business Objects, it took me a long weekend to learn the core functionality of OBIEE.

So what are these skills? As we have learned, the main task of a BI consultant is to query data fast, ultra fast that is. A report should not take longer than ten seconds to run. Having an excellent knowledge of SQL is imperative for this. At a minimum this means to be able to write complex sub-queries and multi table joins. In a data warehouse environment, however, this is usually not enough. Often a BI consultant needs to be able to query data recursively and do complex inter-row calculations. In the past we had to use expensive self-joins for this type of requirement. Not anymore though. Most if not all RDBMS today allow you to use analytic functions to perform inter row comparisons and aggregations. I am still perplexed that few consultants out there have mastered this essential skill. I myself first learned about the real power of analytic functions from my very good friend Maciej Kocoon (Maciejk, pints next weekend?). A true magician with SQL. Another great source to learn more about analytic functions is the O’Reilly SQL Cookbook.

Of course, a BI consultant also needs to be able to troubleshoot performance problems. This normally means that he needs to have a good understanding of database concepts in general, and the particular  database system that is used in the project. At a minimum he needs to be able to trace queries and to read and interpret explain plans. Data modelling should be another core competency of a BI consultant. This includes both 3NF and dimensional modelling techniques. It also can’t do any harm to have business analysis skills to understand report requirements and visualisation techniques to create killer reports. Throw a good understanding of web technologies into the mix to customise the reporting environment, and you should be able to excel in any project. Ahh, I almost forgot OLAP, which should be easy once you have mastered analytic functions.

If you are an employer and have to choose between a guy who’s had years of experience in a particular tool but knows precious little about the inner workings of a database, data modelling, and SQL and someone who is a database expert with excellent knowledge of SQL, but has little knowledge of the BI tool at hand, always go for the latter. It will pay off. Guaranteed. And it will save you from creating embarassing support tickets..