Posted: November 20th, 2009 | Author: Brendan Tierney | Filed under: Business Intelligence, Data Warehouse, Oracle, data mining | Tags: Bill Inmon, data mining, data mining discussion, Data Mining Project, data mining training | No Comments »
Bill Inmon has recently posted an article on “Why has Data Mining struggled so much?”
The article discusses 7 diferent reasons why data mining has struggled, as it has been around for a very long time.
The main points are
1. We have been waiting a long time for it to become available in a usable way
2. Data mining is considered an academic focused with very few practitioners. But this is become less so
3. Data mining requires a different set of skills. Yes you need data management skills but you also need some data mining skills. I will be making a posting focusing on the skill sets required for data mining in the coming weeks.
4. Some industries and application areas are more suited to data mining than others. The difficult is in identifying suitable projects.
5. Data for Data Mining is unclean. Not if you use a data warehouse. Idealy an organisation who has a matur-ish BI infrastrucure will benefit must from a Data Mining project
6. Data is incomplete. Yes you may need to enrich the data from various sources. But again if you have a Data Warehouse you will have most of these
7. Approaches to data mining inadequate. Alot of the approches to data mining projects as based on its statistical history. New problem areas are evolving all the time and we can use data mining in lots of different way.
To view Bill Inmon’s article – click here.
To view our 2 training courses on data mining – click here
Brendan Tierney
Posted: November 19th, 2009 | Author: Uli Bethke | Filed under: Oracle Data Integrator (ODI) | No Comments »
OdiSqlUnload is a handy tool to quickly dump the content of a table into a CSV file. This is a lot quicker than first creating a file data server, interfaces etc. However, OdiSqlUnload has one major drawback. It does not allow you to load the header columns of your table(s) into the CSV. This makes the tool pretty much useless for a lot of purposes.
I propose the following workaround.
We create a procedure with three steps
(1) Create a header CSV from data dictionary or work reposiotry meta data via OdiSqlUnload
(2) Create a content CSV from your database table(s)
(3) Append file2 to file1 with OdiFileAppend
For step one we use the following command
OdiSqlUnload "-FILE=D:/xxx/<%=odiRef.getContext("CTX_NAME")%>/odi_files/DQError/<%=odiRef.getSession("SESS_NO")%>_header.csv" "-DRIVER=oracle.jdbc.driver.OracleDriver" "-URL=jdbc:oracle:thin:@xx.1.xx.xx:xxxx:odi" "-USER=<%=odiRef.getSchemaName("ORCL_XXX_TRG", "D")%>" "-PASS=xxxxxxxxxxxx" "-FILE_FORMAT=VARIABLE" "-FIELD_SEP=," "-ROW_SEP=\r\n" "-DATE_FORMAT=yyyy/MM/dd HH:mm:ss" "-CHARSET_ENCODING=ISO8859_1" "-XML_CHARSET_ENCODING=ISO-8859-1"
SELECT
LTRIM(SYS_CONNECT_BY_PATH(column_name, ','),',') path
FROM (
SELECT
MAX(column_id) OVER (PARTITION BY table_name) as tot,
column_name,
column_id
FROM user_tab_cols WHERE table_name = UPPER('<%=odiRef.getOption("TABLE_NAME")%>')
)
WHERE column_id = tot
START WITH column_id = 1
CONNECT BY PRIOR column_id = column_id -1
We put the context and our session_no into the file name for our header.csv.
We then query the user_tab_cols data dictionary table and use SYS_CONNECT_BY_PATH to create a CSV from the columns. If you are using Oracle 11GR2 I recommedn to use the new analytic function LISTAGG instead. As you can see from the script above we pass the name of the table into the procedure via the table_name option.
Technology for the step is Sunopsis API
For step two we use the following command
OdiSqlUnload "-FILE=D:/xxx/<%=odiRef.getContext("CTX_NAME")%>/odi_files/DQError/<%=odiRef.getSession("SESS_NO")%>_load.csv" "-DRIVER=oracle.jdbc.driver.OracleDriver" "-URL=jdbc:oracle:thin:@xx.1.xx.xx:xx:odi" "-USER=<%=odiRef.getSchemaName("ORCL_XXX_TRG", "D")%>" "-PASS=xxxxxxxxxxxxxxxx" "-FILE_FORMAT=VARIABLE" "-FIELD_SEP=," "-ROW_SEP=\r\n" "-DATE_FORMAT=yyyy/MM/dd HH:mm:ss" "-CHARSET_ENCODING=ISO8859_1" "-XML_CHARSET_ENCODING=ISO-8859-1"
SELECT * FROM <%=odiRef.getOption("TABLE_NAME")%>
The above will generate the content CSV for the table name that we specify via the option table_name of our procedure. Obviously the table name should be the same as in step one.
Finally in step three we append the content CSV to the header CSV
OdiFileAppend "-FILE=D:/xxx/<%=odiRef.getContext("CTX_NAME")%>/odi_files/DQError/<%=odiRef.getSession("SESS_NO")%>*.csv" "-TOFILE=D:/ODI/<%=odiRef.getContext("CTX_NAME")%>/odi_files/DQError/<%=odiRef.getSession("SESS_NO")%>_errors.csv" "-CASESENS=NO"
In order to master scripting in ODI I recommend the following books.
Java BeanShell
Scripting in Java: Languages, Frameworks, and Patterns
Jython
The Definitive Guide to Jython: Python for the Java Platform.
Jython Essentials (O’Reilly Scripting)
Posted: November 17th, 2009 | Author: Uli Bethke | Filed under: Oracle, Oracle Data Integrator (ODI) | Tags: odi purge log, oracle data integrator purge log, snp_exp_txt | No Comments »
We all know that we should purge the ODI log on a regular basis. if we log everything or if we run near real time batches. You can either purge the log manually from the Operator module or you can use the ODIPurgeLog tool.
As per Metalink 424663.1 the following tables contain information concerning Session runtime executions
SNP_SESS_TXT_LOG
SNP_SESS_TASK_LOG
SNP_STEP_LOG
SNP_TASK_TXT
SNP_SESS_TASK
SNP_SESS_STEP
SNP_SEQ_SESS
SNP_VAR_SESS
SNP_SESSION
SNP_EXP_TXT (starting with ODI 10.1.3 versions):
SNP_SESS_TASK_LOG.I_TXT_TASK_MESS = SNP_EXP_TXT.I_TXT
SNP_STEP_LOG.I_TXT_STEP_MESS = SNP_EXP_TXT.I_TXT
SNP_VAR_SESS.I_TXT_VAR = SNP_EXP_TXT.I_TXT
SNP_VAR_SESS.I_TXT_DEF_T = SNP_EXP_TXT.I_TXT
SNP_SESSION.I_TXT_SESS_MESS = SNP_EXP_TXT.I_TXT
SNP_SESSION.I_TXT_SESS_PARAMS = SNP_EXP_TXT.I_TXT
When you perform a purge log these are the tables that are cleaned up. The tables are cleaned up via DELETE DML. As a result space is not released and the is not lowered.
In order to release the unused space you can shrink the tables from Oracle 10 onwards, e.g.:
ALTER TABLE snp_exp_txt ENABLE ROW MOVEMENT;
ALTER TABLE snp_exp_txt SHRINK SPACE;
ALTER TABLE snp_exp_txt DISABLE ROW MOVEMENT;
Posted: November 16th, 2009 | Author: Uli Bethke | Filed under: Oracle, SQL for Analysis, analytic functions | Tags: analytic functions | No Comments »
I promise that this will be the last post on analytic functions this month; but I came across this proposal to extend analytic functions with a comparative window function on Tom Kyte’s blog yesterday (also make sure to review the actual paper). Basically this new type of analytic function would allow you to compare the current row in a window to any other row in the window. I recently had a requirement where this would have come in handy. Imagine a dataset made up of product category, product id, product referential id, e.g.:
Computers,1,2
Computers,1,3
Computers,1,4
Computers,4,9
The requirement was to find those products whose reference products were also in the dataset. In the example above this would have been the record Computers,1,4. Something that can only be easily solved with a self-join. With this new proposal you could write this as
SELECT * FROM (
SELECT
prod_cat.
prod_id.
prod_ref_id,
MAX(CASE WHEN prod_ref_id = INDEX(prod_id,anchor_row) THEN 1 ELSE 0) OVER (PARTITION BY prod_cat ROWS BETWEEN UNBOUNDED PRECEDING UNBOUNDED FOLLOWING) ref_exists_ind
FROM
prod_ref
)
WHERE ref_exists_ind = 1
Pretty cool…
Posted: November 15th, 2009 | Author: Uli Bethke | Filed under: Oracle, SQL for Analysis, analytic functions | Tags: analytic functions index, first_value oracle, last_value, performance analytic functions, row_number() oracle | No Comments »
Thank you all for your entries to the recent quiz. The guys from Packt should be shortly sending out the e-books to the winners.
Today I will walk you through some of the possible solutions and also briefly touch upon performance of these.
Last week’s contest was based on a classic SQL problem. Anthony Molinaro, the author of the SQL Cookbook has named this the Knight Values problem because it is analogous to a knight’s path in the game of chess: “You determine the result the way a knight determines a new location: by jumping to a row then turning and jumping to a different column”. In our example we first have to find the highest amount sold for a customer and from there then get the corresponding time_id. Another complicating factor is that we want the query to be deterministic, i.e. we just want to get back one record per customer. So what happens if there is a tie between the amount_sold, e.g. the same amount_sold exists on 5/11/2009 and 7/11/2009. If we come across such a tie we would like to select the amount_sold with the highest time_id.
Note: You can make this query only deterministic if there is a unique key on the table. Otherwise you may end up with a tie. There is a way around this though by randomly selecting one of the values or performing a SELECT DISTINCT. I will give two examples further down.
Of course you can do this in classic SQL using a subquery:
SELECT
cust_id,
amount_sold,
MAX(time_id)
FROM (
SELECT
a.cust_id,
a.time_id,
a.amount_sold
FROM
customer a JOIN (
SELECT
cust_id,
--amount_sold,
MAX(amount_sold) amount_sold
FROM
customer
GROUP BY
cust_id
) b ON (a.cust_id = b.cust_id AND
a.amount_sold = b.amount_sold)
)
GROUP BY
cust_id,
amount_sold
A query such as the following would not be deterministic as it does not resolve the potential tie outlined above.
SELECT
*
FROM
customer
WHERE
(cust_id, amount_sold) in (
SELECT
cust_id,
max(amount_sold)
FROM
customer
GROUP BY
cust_id
);
You can clearly see this if you insert the following record into our customer table
INSERT INTO customer VALUES (2,TO_DATE('05-FEB-2001','DD-MON-YYYY'),1545.78);
An alternative to the above is the use of analytic functions. You can either use ROW_NUMBER(),LAST_VALUE/FIRST_VALUE or KEEP FIRST/LAST with DENSE_RANK.
Let’s first have a look at LAST_VALUE.
SELECT cust_id, time_id, MAX(amount_sold) FROM (
SELECT
cust_id,
time_id,
LAST_VALUE(time_id) OVER (PARTITION BY cust_id ORDER BY amount_sold, time_id ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) time_max_amount,
amount_sold
FROM
customer
)
WHERE
time_max_amount = time_id
GROUP BY
cust_id, time_id
So what are we doing here? We first break up the customer table by cust_id and order the resultset by amount_sold. We then grab the last value in this ordered resultset, which will give us the time_id for the maximum amount_sold for each customer. If you want to avoid a tie between the same amount_sold to a customer on two different days you need to include a unique value in the ORDER BY clause. In our case we are including the time_id to avoid a tie. If you are using LAST_VALUE you need to include the windowing clause ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING. As per documentation: “If you omit the windowing_clause of the analytic_clause, it defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This default sometimes returns an unexpected value, because the last value in the window is at the bottom of the window, which is not fixed. It keeps changing as the current row changes. For expected results, specify the windowing_clause as RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. Alternatively, you can specify the windowing_clause as RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING.”
We could of course use the FIRST_VALUE function where we can use the default windowing behaviour to our advantage and omit the clause altogether. Thus making it easier to read. We have to slightly change our query though:
SELECT cust_id, time_id, MAX(amount_sold) FROM (
SELECT
cust_id,
time_id,
FIRST_VALUE(time_id) OVER (PARTITION BY cust_id ORDER BY amount_sold DESC, time_id DESC) time_max_amount,
amount_sold
FROM
customer
)
WHERE
time_max_amount = time_id
GROUP BY
cust_id, time_id
One reader sent in a query to avoid the subquery altogether. Well, in Teradata you could have done this using analytic functions in the predicate clause.
SELECT DISTINCT
cust_id,
FIRST_VALUE(amount_sold) OVER (PARTITION BY cust_id ORDER BY amount_sold DESC, time_id) amount_sold,
FIRST_VALUE(time_id) OVER (PARTITION BY cust_id ORDER BY amount_sold DESC, time_id) time_id
FROM
customer
Another way of solving this with analytic functions is to use KEEP FIRST/LAST with dense_rank
SELECT cust_id, time_id, MAX(amount_sold) FROM (
SELECT
cust_id,
time_id,
MAX(time_id) KEEP(DENSE_RANK LAST ORDER BY amount_sold) OVER ( PARTITION BY cust_id) time_max_amount,
amount_sold
FROM
customer
)
WHERE
time_max_amount = time_id
GROUP BY
cust_id, time_id
If you don’t have a unique key in your dataset you can make the result deterministic by randomly selecting one of the tied records. This is achieved via row_number(). The solution below is my favourite as it is the most readable of the lot and always deterministic.
SELECT
cust_id,
time_id
FROM (
SELECT
cust_id,
time_id,
row_number() OVER (PARTITION BY cust_id ORDER BY amount_sold DESC, time_id DESC) rn FROM customer
)
WHERE
rn = 1
A couple of words on performance.
The pure SQL approach will perform a full scan of the customer table twice. This means that it will generate twice the amount of logical I/O. However, the analytic functions will use up a lot more memory or even worse temp disk space for sorting based on the PARTITION and the ORDER BY clause. By creating an index on the columns used in the partition clause and the order by clause (such as the one below) you can generate a WINDOW NO SORT explain plain. As a result the query will have to do no sorting at all and should generally speaking perform better than the pure SQL. However, like everything in life it all comes down to your specific situation.
CREATE INDEX idx ON customer
(cust_id,AMOUNT_SOLD DESC,time_id desc)
If you want to read up on analytic functions I recommend the following books:
SQL Cookbook. This is my number one.
The Art of SQL.
SQL Hacks
SQL Tuning
Advanced SQL Functions in Oracle 10g. This one has more examples on analytic functions in Oracle.
Posted: November 7th, 2009 | Author: Uli Bethke | Filed under: Irish BI SIG | Tags: essbase, Irish BI SIG | 1 Comment »
Don’t miss Mark Rittman’s presentations on Essbase and OBIEE integration and a technical introduction to Hyperion Planning. There is still time to register. As usual the event will take place at the Oracle offices in Dublin. Paul Weston from Ireland’s leading Essbase consultancy BDMS will give an overview on new features in Essbase 11.
Personally I have no experience with Essbase myself. I am still not convinced that there is an awful lot you can do with OLAP that you can’t do with analytic functions and all the other new and exciting features for analysis in the Oracle core database. So I am looking forward to learn something new here.
For a full schedule of the day’s events have a look at the UK OUG website.
Posted: November 6th, 2009 | Author: Brendan Tierney | Filed under: Best Practice, Training, data mining | Tags: Data Mining Project, Data Mining Skills, data mining training, Getting Started with Data Mining, How to Prepare for Data Mining | No Comments »
There is a new article by Eric King of The Modeling Agency on How to Prepare for Data Mining. It has some interesting points on various aspects that you would need to look out for.
I think some of the important things is that you need to plan such a project carefully, that you understand what you can get out of a data mining project, and have an appreciation of the type of techniques/technologies involved.
Brendan Tierney
Posted: November 4th, 2009 | Author: Uli Bethke | Filed under: Books, Oracle, SQL for Analysis | Tags: sql quiz, win two ebooks | No Comments »
We’ll do something completely different today and have a quiz. And you can even win something. Thanks to the guys at Packt Publishing I have two e-books to giveaway.
Oracle Essbase 9 Implementation Guide
and
Business Process Execution Language for Web Services
When you visit their website make sure to have a look at their free section. They are always looking for bloggers to review books.
In medias res.
Here comes the question:
In retail sales analytics, business folks often need to know on which day a customer bought the item with the highest amount sold. How can we achieve this with pure SQL on the Oracle database? In Oracle there are at least three different ways to answer this question. Two of the solutions use analytic functions. The third one uses a self-join. In the resultset I would like to see this on a customer by customer basis. For our sample data below I would like to get the date on which the amount_sold was largest together with the cust_id. This should return two records on for cust_id 2 and one for cust_id 4.
And here comes the data.
The first two readers who post me two correct answers will win one of the books. Please let me know which book you prefer and I’ll try to accomodate this.
Tip: Two steps will get you to the right answer. First you need to find the highest amount sold for a customer and from there, get the date on which this transaction occured.
Posted: November 3rd, 2009 | Author: Brendan Tierney | Filed under: Books, Business Intelligence, Oracle, Web Mining, data mining | Tags: Brendan Tierney, Oracle Data Mining, Oracle Data Mining Book, Oracle Data Mining links | No Comments »
The following link is a good resource giving details of various aspects of Oracle Data Mining. It is by BC Consulting.
http://www.dba-oracle.com/data_mining/
There is also a link to a book on Oracle Data Miner which covers the version of ODM for 10g, but some of the material in the book also applies for the 11g version. The book is by Dr. Ham and is available from Rampant Books
http://www.rampant-books.com/book_2006_1_oracle_data_mining.htm
Brendan Tierney