10 Reasons you really need predictive analytics

Posted: February 4th, 2010 | Author: Brendan Tierney | Filed under: Best Practice, Business Intelligence, Data Warehouse, analytic functions, data mining | Tags: , , | No Comments »

SPSS have recently posted and article called “10 Reasons you really need predictive analytics“. I thought it would be interesting to post the main points from this article to illustrate that not all predictive analytic projects involve Data Mining, but involve a number of different techniques and looking the the business data in a different way. Yes data mining can be a very important element in some of the following

1. Get a higher return on your data investment
Your organization has a significant investment in data – data that contains critical information about every aspect of your business. Today more than ever, you need to get the best return on the data you have collected–and predictive analytics is the most effective way to do this. Predictive analytics combines information on what has happened in the past, what is happening now, and what’s likely to happen in the future to give you a complete picture of your business.

2. Find hidden meaning in your data
Predictive analytics helps you maximize the understanding gained from your data. It enables you to uncover hidden patterns, trends, and relationships and transform this information into action.

3. Look forward, not backward
Unlike reporting and business intelligence solutions that are only valuable for understanding past and current conditions, predictive analytics helps organizations look forward. By leveraging sophisticated statistical and modeling techniques, you can use the data you already have to help you anticipate future events and be proactive, rather than reactive.

4. Deliver intelligence in real time
Your business is dynamic. With predictive analytics, you can automatically deploy analytical results to both individuals and operational systems as changes occur, helping to guide customer interactions and strategic nd tactical decision making.

5. See your assumptions in action
Advanced analytical methods give you the tools to develop hypotheses about your organization’s toughest challenges and test them by creating predictive models. You can then choose the scenario that is likely to result in the best outcome for your organization.

6. Empower data-driven decision making
Better processes help people throughout your organization make better decisions every day. Predictive analytics enables your organization to automate the flow of information to match your business practices and deliver the insights gained through this technology to people who can apply them in their daily work.

7. Build customer intimacy
When you know each of your customers or constituents intimately—including what they think, say, and do—you can build stronger relationships with them. Predictive analytics gives you a complete view of your customers, and enables you to capture and maximize the value of each and every interaction.

8. Mitigate risk and fraud
Predictive analytics helps you evaluate risk using a combination of business rules, predictive models, and information gathered from customer interactions. You can then take the appropriate actions to minimize your organization’s exposure to fraudulent activities or highrisk customers or transactions.

9. Discover unexpected opportunities
Your organization can use predictive analytics to respond with greater speed and certainty to emerging challenges and opportunities, helping you to keep pace in a constantly changing business environment.

10. Guarantee your organization’s competitive advantage
Predictive analytics can drive improved performance in every operational area, including customer relations, supply chain, financial performance and cost management, research and product development, and strategic planning. When your organization runs more efficiently and profitably, you have what it takes to out-think and out-perform your competitors

So what is Predictive Analytics. Check out the description on Wikipedia

Let me know you views and comments on the above.

Brendan Tierney


Why Has Data Mining Struggled So Much?

Posted: November 20th, 2009 | Author: Brendan Tierney | Filed under: Business Intelligence, Data Warehouse, Oracle, data mining | Tags: , , , , | No Comments »

Bill Inmon has recently posted an article on “Why has Data Mining struggled so much?”

The article discusses 7 diferent reasons why data mining has struggled, as it has been around for a very long time.

The main points are
1. We have been waiting a long time for it to become available in a usable way
2. Data mining is considered an academic focused with very few practitioners. But this is become less so
3. Data mining requires a different set of skills. Yes you need data management skills but you also need some data mining skills. I will be making a posting focusing on the skill sets required for data mining in the coming weeks.
4. Some industries and application areas are more suited to data mining than others. The difficult is in identifying suitable projects.
5. Data for Data Mining is unclean. Not if you use a data warehouse. Idealy an organisation who has a matur-ish BI infrastrucure will benefit must from a Data Mining project
6. Data is incomplete. Yes you may need to enrich the data from various sources. But again if you have a Data Warehouse you will have most of these
7. Approaches to data mining inadequate. Alot of the approches to data mining projects as based on its statistical history. New problem areas are evolving all the time and we can use data mining in lots of different way.

To view Bill Inmon’s article – click here.

To view our 2 training courses on data mining – click here

Brendan Tierney


What is Data Mining ?

Posted: October 28th, 2009 | Author: Brendan Tierney | Filed under: Business Intelligence, Training, Uncategorized, data mining | Tags: , , , , , , , , , | No Comments »

In this weeks topic I will explore what is Data Mining, the different meanings, how the term is used, etc. I will give you my interpretation of what it is and how other descriptions of data mining can be categorised.

Every article you read, every presentation you hear, etc. you get a slightly different description, or should it be that they hint to a description of how they use data mining in the products or their applications. By giving this hint at what data mining is they try to claim that they are using it, as it gives their products, applications and services a higher degree of sophistication compared to others. There is also the idea that it is a one of those trendy terms that is thrown out without them really knowing what it is really about.

Data Mining Definition

One of the most commonly cited definitions of what data mining is, “..it is the non-trivial extraction of previously unknown and potentially useful information from data” by Usama Fayyad et al (Chief Data Officer, Yahoo Inc) in their landmark paper back in 1996.

Based on this definition data mining is does not involve some basic analytics, decision making based on some defined rules, being able to identify events based on current data, etc. But these type of scenarios are typically talked about as being data mining. If we go back to the definition by Fayadd above, by say the “non-trivial” it means that we cannot write some code/queries to pull data out of our data that answers some simple questions. Another important part of the definition is “potentially useful information”, tells use that some times and may in a lot of cases, data mining does not give use anything useful. But it can give us useful information only if we have a good understanding of the data, the business rules of the data, the meta-data, how the rules and the data relate to each other, etc. All of this requires extensive experience of working with the data. Who is best at doing this, but database designers and developers. People with a statistics background (typical what you see in data mining roles) have to go and learn all about the data, the business rules, the meta-data etc. This can be a huge waste of time and resources as the database people are generally ignored.

Some examples

I was at an IT conference last week (I was co-author of a paper on Opinion Data Mining). One of the key note talks was given by a technical lead in IBM (one of two thousand in the company). He gave some good examples of how Business Intelligence (BI) could be used to manage the energy needs of a new city being build out in the middle east. He also gave another example of how BI is being used in and around Galway city and coast line. There were several mentions of data mining during his talk, but I don’t think any of his examples reflected what data mining is. Yes he did give examples of how you can intelligently use your data. For example, if an object is spotted out in Galway bay then you can predict where this object will come to shore. But data mining is not the technique that is used in this case. Instead it is a rules based type system, that takes into account a number of factors, link the size of the object, the current position, currents, wind direction, etc. Using these rules (and not data mining) they can identify the landing position and let all the necessary bodies know this (like the coast guard, Galway county council, environmental control, etc).

Generally data mining can be used when you have a mature BI environment in your organisation that includes not just transactional and business reporting, but also data warehousing, data analytics, prediction systems (based on rules), etc. Data mining allows you to explore for and identify patterns in your data (and you need lots of data really). Going back to the definition of data mining a lot of the results from a data mining project may not be of any value. What you are looking for are the nuggets of gold that exists in the data and you may take some time to fine these, if they exist at all.

One of the aims of this weeks posting was to explore what data mining really is. At this point I haven’t really talked much about what it is, but what I hope you have gotten so far is that the term data mining is overly used in the IT world and can be seen as one of those trendy words that organisation like use (and use incorrectly). Data mining is used as an umbrella term that covers any processing of your data that involves a bit a processing, applying some rules and some analytics.

Over the coming weeks we will explore what Data Mining really is and what are the different stages of a Data Mining project.

The next posting will be about CRISP-DM, which is a industry neutral, product neutral data mining life cycle.


Getting Started with Oracle Data Mining

Posted: October 20th, 2009 | Author: Brendan Tierney | Filed under: Business Intelligence | Tags: , , | No Comments »

This blog posting is my first in the area of Data Mining, in particular performing Data Mining using Oracle.  Over the coming months I will be making regular postings in the areas of data mining and associated activities.  The approach that I take to data mining is slightly different to what would be considered the main stream approach.  Last year I gave a presentation at the UKOUG annual conference in Birmingham titled “You don’t need a PhD to do Data Mining”. What I have discovered over the past 12 years from working in this area is that it is dominated by people with a background in statistics (and many of them have PhDs). But having worked on many data mining projects, most of the work that needs to be done, can be done by and is generally done by Database People.

If you are interested in Data Mining then hopefully my postings over the coming months will help you gain a better understanding of what it is about, what you can do, what kind of outputs you get and how the outputs can be used. 

The following list of books and links will give you a good start in gaining some idea of what it entails

Oracle Data Mining Concepts Book – Parts A & B

Getting Started with Oracle 11g Data Mining

Why & What is Data Mining

As I add each posting I will add some additional links and resources, so over the next few months we will have a comprehensive list of resources for Data Mining in Oracle.  We will put together a separate web page to contain all these resources.  This will complement the list of resources that Uli Bethke is putting together on ODI.

We also offer two training courses on data mining. The first one of these will be aimed at managers and novices in the area. It will give a high level overview on key data mining concepts and issues. The second course will specifically deal with data mining on Oracle using PL/SQL and the Oracle data mining tool. Have a look at our training pages for details.

Let me know if you have any suggestions on topics that you would like me to cover.


Competitive Business Intelligence: web scraping with Oracle.

Posted: January 6th, 2009 | Author: Uli Bethke | Filed under: Business Intelligence, Web Mining | Tags: , , , , , , , | 1 Comment »

In my opinion, one of the trends for Business Intelligence in 2009 (and the years to come) will be the integration of externally available data (data not found within the organisation itself, e.g. data in magazines, the web, libraries etc.) into the data warehouse and into an organisation’s business processes. Using BI to monitor the external environment that an organisation operates in, will grow in importance for decision making.

“Decision makers [...] need information about what is going on outside the organization as well as inside.[...] Macroenvironmental analysis [...] examines the economic, political, social, and technological events that influence an industry”.
From: Document Warehousing and Text Mining: Techniques for Improving Business Operations, Marketing, and Sales p.4.

However, this is not fully understood by the wider Business Intelligence community, as can be seen from the quote below. (This is a quote from an article on BI in one of the local business weeklies here in Dublin):

“BI tools are fundamentally about using data which an organisation already has - whether in databases, CRM systems, financial and accounting packages, ERP systems or elsewhere”.

This perspective is too narrow. While it is fundamental to use BI to mine and analyse data that an organisation owns, it is as important to integrate data from external sources such as the web to optimize the internal decision-making process. Organisations that understand this requirement will have the edge over their competitors. For executives to make informed decisions they need to be able to look at intra-organisational events as well as the competitive environment.

“Strategic management is the art and science of directing companies in light of events both inside and outside the organization. In addition to understanding their own operations, managers must understand the rest of the industry. For example, should a company try to be a low-cost producer or a best-cost producer? How can a company differentiate its product line? Should the focus be on the entire market or on a niche? Without understanding what others are doing, making decisions about these types of issues leads to unexpected results.”
From: Document Warehousing and Text Mining: Techniques for Improving Business Operations, Marketing, and Sales.

Web mining, data mining and text mining techniques will be of fundamental importance to implement this new breed of BI.

In this series we will have a look at all three areas. In today’s article I will show you, how we can implement web mining techniques with Oracle. In part two of this series we will then look at how we can use data mining techniques in general and survival analysis in particular to analyse macro environmental data from the web. Finally, in the third part we will look at how we can use text mining to classify and cluster the extracted data.

So, what we will do today, is harvest macro environmental business intelligence of real estate data. I thought it might be interesting to look at property related data because of the recent bursting of the property bubble. The site we will extract data from is property.ie.

The information we harvest can be used to (amongst other things)

- Identify areas where houses sell the quickest (have a short survival rate).
- Identify features of houses that sell the quickest.
- Find properties that are near other properties
- Create a taxonomy/classification to browse properties by features
- Monitor price increases or decreases.
- Use a combination of all of the above.

In the case studies that follows I am using Oracle 11.1.0.6.

1. Create a user and assign the relevant permissions

Let’s log on as a DBA user, e.g. SYS and execute the following stuff:

SQL> create user real_estate identified by real_estate;

User created.

SQL> grant connect to real_estate;

Grant succeeded.

SQL> grant resource to real_estate;

Grant succeeded.

SQL> CREATE TABLESPACE reales
  2  DATAFILE 'D:\ORACLE\ORADATA\ORCL\REALES01.dbf' size 512M
  3  extent management local autoallocate;

Tablespace created.

This will give us user real_estate with connect and resource grants.

Next we need to create an Access Control List (ACL) for this user. The ACL will allow us to access to the property.ie website, but prevents access to any other websites. ACLs are new in Oracle 11. If you are using Oracle 10 you need to adapt permissions for this.

SQL> begin
  2          dbms_network_acl_admin.create_acl (
  3                  acl             => 'utl_http.xml',
  4                  description   => 'Normal Access',
  5                  principal       => 'REAL_ESTATE',
  6                  is_grant       => TRUE,
  7                  privilege       => 'connect',
  8                  start_date    => null,
  9                  end_date      => null
 10          );
 11  end;
 12  /

On line 5 the principal needs to be in capital letters. Otherwise Oracle will return an error.

Next we assign the property.ie site to the ACL:

SQL> begin
  2      dbms_network_acl_admin.assign_acl (
  3      acl => 'utl_http.xml',
  4      host => 'www.property.ie',
  5      lower_port => 1,
  6      upper_port => 10000);
  7  end;
  8  /

Finally we give execute permission on utl_http and dbms_lock

SQL> grant execute on utl_http to real_estate;

Grant succeeded.

SQL>
SQL> GRANT EXECUTE ON dbms_lock TO real_estate;

Grant succeeded.

SQL> spool off

2. Create Tables

Next we need to create the tables to store the extracted information.

SQL> CREATE TABLE seed_html (
  2     html CLOB
  3  )
  4  TABLESPACE REALES
  5  PCTFREE 0
  6  /

Table created.

SQL> CREATE TABLE seed (
  2     part_of_link VARCHAR2(30),
  3     num_pages NUMBER,
  4     num_property NUMBER,
  5     area VARCHAR2(30)
  6  )
  7  TABLESPACE REALES
  8  PCTFREE 0
  9  /

Table created.

SQL> CREATE TABLE property_html (
  2     part_of_link VARCHAR2(30),
  3     HTML CLOB,
  4     link VARCHAR2(255)
  5  )
  6  TABLESPACE REALES
  7  PCTFREE 0
  8  /

Table created.

SQL> CREATE TABLE property_description (
  2    property_id    NUMBER,
  3    prop_code      NUMBER,
  4    prop_desc      CLOB,
  5    activity_date  DATE,
  6    latitude       NUMBER,
  7    longitude      NUMBER
  8  )
  9  TABLESPACE REALES
 10  PCTFREE 0
 11  /

Table created.

SQL> CREATE TABLE property
  2  (
  3    PROPERTY_ID      NUMBER,
  4    LINK             VARCHAR2(1000),
  5    PROP_CODE        NUMBER,
  6    PRICE            NUMBER,
  7    ADDRESS          VARCHAR2(500),
  8    ROOMS            VARCHAR2(500),
  9    AREA             VARCHAR2(50),
 10    VALID_FROM_DATE  DATE,
 11    VALID_TO_DATE    DATE,
 12    DATE_REMOVED     DATE,
 13    VALID_IND        NUMBER,
 14    DELETE_IND       NUMBER
 15  )
 16  TABLESPACE REALES
 17  PCTFREE    10
 18  /

Table created.

SQL> CREATE TABLE PROPERTY_HELPER
  2  (
  3    LINK        VARCHAR2(1000),
  4    PROP_CODE   NUMBER,
  5    PRICE       NUMBER,
  6    ADDRESS     VARCHAR2(500),
  7    ROOMS       VARCHAR2(500),
  8    AREA        VARCHAR2(50),
  9    DELETE_IND  NUMBER
 10  )
 11  TABLESPACE REALES
 12  PCTFREE    0;

Table created.

SQL> CREATE TABLE PROPERTY_ATTRIBUTES
  2  (

  3    LINK       VARCHAR2(4000 BYTE),
  4    PROP_CODE  NUMBER,
  5    PRICE      NUMBER,
  6    ADDRESS    VARCHAR2(4000 BYTE),
  7    ROOMS      VARCHAR2(4000 BYTE),
  8    AREA       VARCHAR2(30 BYTE)
  9  )
 10  TABLESPACE REALES
 11  PCTFREE    0;

Table created.

SQL> CREATE SEQUENCE seq_property
  2    START WITH 1
  3    MAXVALUE 999999999999999999999999999
  4    MINVALUE 1
  5    NOCYCLE
  6    CACHE 20
  7    NOORDER
  8  /

Sequence created.

Note: Because we will be dealing with very little data initially I have not added any indexes to these tables. Once volume of data grows and we have a better understanding of query patterns we should add relevant indexes.

3. Extract the property seed

Before we get stuck into things I recommend you get familiar with the functionality, navigation etc. of the property.ie website. This will make it easier to understand what we will be dealing with in the next couple of sections. For the purpose of this exercise we will limit the extract process to properties in county Dublin, as we don’t want to put too much pressure on the property.ie web servers. At the same time, though, we want to gather enough information to perform some proper analysis: we will include all areas in Dublin in our extract process. If you have a look at the frontpage of the property.ie website you will see that each area also lists the number of properties available in this area. This information will become relevant for the later stages of our extract exercise.

The procedure below extracts the HTML part of the property.ie frontpage which contains the areas and the number of properties in each area.

SQL> CREATE OR REPLACE PROCEDURE extract_seed_html
  2
  3  IS
  4
  5  -- exec  extract_seed_html
  6
  7  BEGIN
  8
  9     EXECUTE IMMEDIATE 'TRUNCATE TABLE seed_html';
 10
 11    -- utl_http.set_proxy([http://][user[:password]@]host[:port])
 12
 13     INSERT INTO seed_html
 14     SELECT TO_CLOB(to_clob(DBMS_LOB.SUBSTR (html,4000,5900)) || to_CLOB(DBMS_LOB.SUBSTR (html,4000,9900))) FROM (
 15        SELECT HTTPURITYPE('http://www.property.ie/').getclob() AS html FROM dual
 16     )
 17
 18     COMMIT;
 19
 20  END extract_seed_html;
 21  /

Procedure created.

On line 11 I have commented out the use of a proxy server. If you are using a proxy or want to anonymize your requests remove the comment and fill in your proxy info such as username, password, host, and port.

On line 15, we are using the HTTPURITYPE function to retrieve the HTML code of the property.ie frontpage and extract the HTML content of the property area dropdown. HTTPURITYPE uses the http_utl package.

HTML                                                                              OCCURENCE
-------------------------------------------------------------------------------- ----------
<select id="area" name="s[a_id][]">                                                    1
<option value="">All areas</option>

<select id="area" name="s[a_id][]">                                                    2
<option value="">All areas</option>

<select id="area" name="s[a_id][]">                                                    3
<option value="">All areas</option>

<select id="area" name="s[a_id][]">                                                    4
<option value="">All areas</option>

<select id="area" name="s[a_id][]">                                                    5
<option value="">All areas</option>

We will now strip this piece of information of any HTML noise.

SQL> CREATE OR REPLACE PROCEDURE load_seed
  2
  3  IS
  4
  5  -- exec load_seed
  6
  7  BEGIN
  8
  9     EXECUTE IMMEDIATE 'TRUNCATE TABLE seed';
 10
 11     INSERT /*+ APPEND */ INTO seed
 12     SELECT
 13       REPLACE(TRIM(REGEXP_REPLACE(REGEXP_SUBSTR(html, '[A-Z][a-z].*\([0-9]{1,3}\)',1,occurence),'\([0-9]{1,3}\)')),' ','-') AS prep_for_link,
 14       CEIL(TO_NUMBER(REGEXP_REPLACE(REGEXP_SUBSTR(REGEXP_SUBSTR(html, '[A-Z][a-z].*\([0-9]{1,3}\)',1,occurence),'\([0-9]{1,3}\)'),'\(|\)'))/10) AS num_pages,
 15       TO_NUMBER(REGEXP_REPLACE(REGEXP_SUBSTR(REGEXP_SUBSTR(html, '[A-Z][a-z].*\([0-9]{1,3}\)',1,occurence),'\([0-9]{1,3}\)'),'\(|\)')) as num_property,
 16       REGEXP_SUBSTR(html, '[A-Z][a-z].*\([0-9]{1,3}\)',1,occurence) as area
 17     FROM
 18     ( SELECT html,occurence FROM seed_html
 19     CROSS JOIN (
 20        SELECT level occurence FROM dual CONNECT BY level <= 190) );
 21
 22     COMMIT;
 23
 24  END load_seed;
 25  /

Procedure created.

On lines 18-20, we do a cross join between our seed_html table with an inline view that returns the numbers 1 to 190. This is done using the CONNECT BY clause. We have chosen 190 here as the upper limit, because there will never be more than 190 areas in county Dublin.

The inline view returns the following.

SQL> SELECT html,occurence FROM seed_html
  2      CROSS JOIN (
  3         SELECT level occurence FROM dual CONNECT BY level <= 190);

We then use regular expressions to parse each occurrence of an area and the number of properties in this area on a step by step basis. At the end of this article there are a couple of links to regular expressions tutorials. This is the first time that I have used them myself, so I am sure the above could have been done in a more elegant and more performant way.

In our seed table, we should now have the following information

PART_OF_LINK                    NUM_PAGES NUM_PROPERTY AREA
------------------------------ ---------- ------------ ------------------------------
Adamstown                               1           10 Adamstown (10)
Ard-Na-Greine                           1            5 Ard Na Greine (5)
Artane                                  5           43 Artane (43)
Ashtown                                 3           29 Ashtown (29)
Aylesbury                               1            8 Aylesbury (8)
Ayrfield                                2           14 Ayrfield (14)
Balbriggan                             18          176 Balbriggan (176)
Baldonnell                              1            3 Baldonnell (3)
Baldoyle                                3           23 Baldoyle (23)
Balgriffin                              2           16 Balgriffin (16)
Ballinteer                              4           33 Ballinteer (33)
Ballsbridge                             6           51 Ballsbridge (51)

4. Extract HTML for property master pages

Each property area has one or more property master pages. On each master property page there are no more than 10 properties listed. Users of the property.ie site can page through these master pages. By clicking on a property on the master page they get to the details page for this property.

The URL template for the master page is

http://www.property.ie/property-for-sale/dublin//p_
/, e.g. http://www.property.ie/property-for-sale/dublin/balbriggan/p_2/

With the information from the seed table, we will iterate over the master property page in our next procedure and parse information that we are interested in from this page. What we will do first though is introduce an error handling procedures. This is necessary to handle errors in case we lose connectivity.

SQL> CREATE OR REPLACE  PROCEDURE raise_err (
  2        p_errcode   IN   NUMBER := NULL,
  3        p_errmsg    IN   VARCHAR2 := NULL
  4     )
  5     IS
  6        l_errcode   NUMBER := NVL (p_errcode, SQLCODE);
  7        l_errmsg    VARCHAR2(1000) := NVL (p_errmsg, SQLERRM);
  8     BEGIN
  9
 10
 11        IF l_errcode BETWEEN -20999 AND -20000
 12        THEN
 13           raise_application_error (l_errcode, l_errmsg);
 14        /* Use positive error numbers */
 15        ELSIF     l_errcode > 0
 16              AND l_errcode NOT IN (1, 100)
 17        THEN
 18           raise_application_error (-20000, l_errcode || '-' || l_errmsg);
 19        /* Can't EXCEPTION_INIT -1403 */
 20        ELSIF l_errcode IN (100, -1403)
 21        THEN
 22           RAISE NO_DATA_FOUND;
 23        /* Re-raise any other exception. */
 24        ELSIF l_errcode != 0
 25        THEN
 26           EXECUTE IMMEDIATE
 27             'DECLARE myexc EXCEPTION; ' ||
 28             '   PRAGMA EXCEPTION_INIT (myexc, ' ||
 29                   TO_CHAR (l_errcode) || ');' ||
 30             'BEGIN  RAISE myexc; END;';
 31
 32        END IF;
 33  END;
 34  /

Procedure created.

Procedure raise_err raises any errors during extract. But let’s move on to actually extracting the HTML for the master property pages via our seed table.

SQL> CREATE OR REPLACE PROCEDURE extract_prop_html (p_area IN VARCHAR2)
  2
  3  IS
  4
  5  -- exec extract_prop_html (NULL)
  6
  7     CURSOR c_seed
  8     IS
  9     SELECT
 10        part_of_link
 11     FROM
 12        seed a
 13     WHERE
 14        UPPER(part_of_link) = COALESCE(UPPER(p_area),UPPER(part_of_link)) AND
 15        NOT EXISTS (
 16          SELECT NULL FROM property_html b WHERE a.part_of_link = b.part_of_link );
 17
 18     l_part_of_link VARCHAR2(30);
 19
 20  BEGIN
 21
 22
 23     BEGIN
 24
 25         FOR r_seed IN c_seed
 26         LOOP
 27
 28            l_part_of_link := r_seed.part_of_link;
 29
 30            INSERT INTO property_html
 31            SELECT
 32               part_of_link,
 33               REGEXP_SUBSTR(REPLACE(REPLACE(REPLACE(html,CHR(10),''),CHR(13),''),CHR(9),''),'searchresults_container.*summary_info'),
 34               link
 35            FROM (
 36                SELECT
 37                   part_of_link,
 38                   HTTPURITYPE('http://www.property.ie/property-for-sale/dublin/' || part_of_link || '/p_' || occurence ||'/').getclob() AS html,
 39                   'http://www.property.ie/property-for-sale/dublin/' || part_of_link || '/p_' || occurence ||'/' AS link,
 40                   occurence
 41                FROM
 42                   seed a CROSS JOIN (
 43                      SELECT level occurence FROM dual CONNECT BY level <= ( SELECT MAX(num_pages) FROM seed )
 44                   )
 45                WHERE num_pages >= occurence AND part_of_link = r_seed.part_of_link
 46            );
 47
 48            COMMIT;
 49
 50
 51
 52            dbms_lock.sleep(1);
 53
 54         END LOOP;
 55
 56
 57
 58      EXCEPTION WHEN OTHERS THEN
 59
 60         EXECUTE IMMEDIATE ' DELETE FROM property_html WHERE part_of_link = ' ' ' || l_part_of_link || ' ' '';
 61         raise_err(SQLCODE,SUBSTR(SQLERRM,1,1000));
 62
 63      END;
 64
 65  END extract_prop_html;
 66  /

On lines 7-16 we define a cursor that will use the information from the seed table to browse the master property pages. This cursor allows us to either iterate over everything in the seed table (if we pass in NULL as a parameter to the procedure) or just a particular area. This is achieved via the COALESCE function.

On lines 42-57 we do the main work. We extract the html of all of the master property pages on an area by area basis. Again we use our cross join and CONNECT BY technique from earlier on to retrieve all master property pages for an area in one go. The results of this cross join just for one area would look similar to below:

PART_OF_LINK           HTML                                                                                                                         LINK
------------------------------ -------------------------------------------------------------------------------- --------------------------------------------------------------------------------------------------------------------------
Ballsbridge                    searchresults_container"><div class="search_result"><div class="sresult_address" http://www.property.ie/property-for-sale/dublin/Ballsbridge/p_1/
Ballsbridge                    searchresults_container"><div class="search_result"><div class="sresult_address" http://www.property.ie/property-for-sale/dublin/Ballsbridge/p_2/
Ballsbridge                    searchresults_container"><div class="search_result"><div class="sresult_address" http://www.property.ie/property-for-sale/dublin/Ballsbridge/p_3/
Ballsbridge                    searchresults_container"><div class="search_result"><div class="sresult_address" http://www.property.ie/property-for-sale/dublin/Ballsbridge/p_4/
Ballsbridge                    searchresults_container"><div class="search_result"><div class="sresult_address" http://www.property.ie/property-for-sale/dublin/Ballsbridge/p_5/
Ballsbridge                    searchresults_container"><div class="search_result"><div class="sresult_address" http://www.property.ie/property-for-sale/dublin/Ballsbridge/p_6/

...

We store the piece of html that contains the property attributes in our property_html table. Later on we will use this piece of HTML to parse the property attributes we are interested in.

On line 52 we pause for exactly one second to reduce the load on the property.ie web server before moving on to the next area.

On lines 58-63 we do some error handling in case we lose connectivity. If we lose connectivity we delete any entries for the area we were extracting at the moment the error occurred. This will allow us to pick up from where the extract stopped when we re-execute the procedure.

5. Extract and merge property attributes

We now have the relevant HTML from the master property pages to extract and merge the property attributes.

SQL> CREATE OR REPLACE PROCEDURE merge_prop
  2
  3  IS
  4
  5  -- exec merge_prop
  6
  7  BEGIN
  8
  9
 10      INSERT INTO property_attributes
 11      SELECT
 12         TO_CHAR(REGEXP_SUBSTR(SUBSTR(prop_info,1,INSTR(prop_info,'</h2>')),'http.*[0-9]{4}.')) AS link,
 13         to_number(replace((REGEXP_SUBSTR(SUBSTR(prop_info,1,INSTR(prop_info,'</h2>')),'/[0-9]{4,5}./')),'/','')) AS prop_code,
 14         TO_NUMBER(REGEXP_REPLACE(REGEXP_SUBSTR(prop_info,'[0-9],[0-9]{3},[0-9]{3}|[0-9]{3},[0-9]{3}'),'[,|.]')) AS price,
 15         TO_CHAR(REGEXP_REPLACE(SUBSTR(prop_info,1,INSTR(prop_info,'</h2>')-1),'<[^>]+>|[0-9]\.')) AS address,
 16         to_char(substr(prop_info,instr(prop_info,'<h4>',1)+4,instr(prop_info,'</h4>',1)-instr(prop_info,'<h4>',1)-4)) as rooms,
 17         part_of_link AS area
 18      FROM ( SELECT
 19         SUBSTR(html,instr(html,'<div class="sresult_address">',1,occurence+3),instr(html,'<div class="sresult_moredetail">',1,occurence+3)-instr(html,'<div class="sresult_address">',1,occurence+3)) AS prop_info,
 20          occurence,part_of_link FROM property_html
 21        CROSS JOIN (
 22         SELECT level occurence FROM dual CONNECT BY level <= 10) );
 23
 24
 25      COMMIT;
 26
 27     -- Get the properties that were updated or newly inserted
 28
 29      INSERT INTO property_helper
 30          SELECT
 31              link,
 32              prop_code,
 33              price,
 34              address,
 35              rooms,
 36              area,
 37              0
 38          FROM
 39             property_attributes
 40          WHERE prop_code IS NOT NULL
 41          MINUS
 42          SELECT
 43             link,
 44             prop_code,
 45             price,
 46             address,
 47             rooms,
 48             area,
 49             delete_ind
 50          FROM
 51             property
 52
 53      COMMIT;
 54
 55      -- Get the property codes that were deleted
 56
 57      INSERT INTO property_helper
 58          SELECT
 59             '-',
 60             prop_code,
 61             -1,
 62             '-',
 63             '-',
 64             '-',
 65             1
 66          FROM
 67             property
 68          WHERE delete_ind <> 1
 69      MINUS
 70          SELECT
 71             '-',
 72             prop_code,
 73             -1,
 74             '-',
 75             '-',
 76             '-',
 77             1
 78          FROM
 79              property_attributes;
 80
 81
 82      COMMIT;
 83
 84      -- Update the updated and deleted records
 85
 86      MERGE INTO property a USING (
 87          SELECT
 88             link,
 89             prop_code,
 90             price,
 91             address,
 92             rooms,
 93             area,
 94             delete_ind
 95          FROM
 96             property_helper
 97          ) b ON (a.prop_code = b.prop_code )
 98      WHEN MATCHED THEN UPDATE SET
 99          a.valid_to_date = CASE WHEN a.valid_ind = 1  THEN SYSDATE ELSE a.valid_to_date END,
100          a.date_removed  = CASE
101                              WHEN a.delete_ind = 1 THEN a.date_removed    -- It has been removed previously
102                              ELSE
103                                  CASE
104                                      WHEN b.delete_ind = 1 THEN SYSDATE
105                                      ELSE a.date_removed
106                                  END
107                            END,
108          a.valid_ind     = 0,
109          a.delete_ind    = CASE WHEN b.delete_ind = 1 THEN 1 ELSE a.delete_ind END;
110
111
112      COMMIT;
113
114      -- Create the updated and newly inserted records. Updated records get a new record to audit changes
115
116      INSERT INTO property
117      SELECT
118         seq_property.nextval,
119         link,
120         prop_code,
121         price,
122         address,
123         rooms,
124         area,
125         SYSDATE,
126         TO_DATE('31/12/9999','DD/MM/YYYY'),
127         TO_DATE('31/12/9999','DD/MM/YYYY'),
128         1,
129         0
130      FROM (
131          SELECT
132             link,
133             prop_code,
134             price,
135             address,
136             rooms,
137             area,
138             delete_ind
139          FROM
140              property_helper
141          MINUS
142          SELECT
143             link,
144             prop_code,
145             price,
146             address,
147             rooms,
148             area,
149             delete_ind
150          FROM
151             property
152              )
153      WHERE
154         delete_ind <> 1;
155
156      COMMIT;
157
158  END merge_prop;
159  /

Procedure created.

The above procedure consists of five parts.

On lines 11-22 we parse the relevant attributes from the HTML piece we extracted in the previous step. This includes the link to the property’s details page, the property_code (unique identifier for the property), the price, the address, and the room details. Again we are using Regular Expressions to achieve this.

On lines 29-51 we store properties that were either updated or added since our last extract batch in a helper table (property_helper). We have to do a full comparison between all our previously extracted properties in the property table and the properties we have just extracted. We do this via the MINUS operator.
Note: For a large volume of records and depending on our hardware, we might run into performance issues doing a full diff between the two result sets. Anything below 1M records should not be a problem though.

On lines 51-75 we store properties that were deleted since our last extract job in the property_helper table. Again the only option we have here is to do a full comparison between the records we have extracted previously and those we have extracted in our current batch cycle.

On lines 86-109 we merge records that were updated or deleted with previously extracted property records. For each record that was updated we update its valid period and set the valid_ind to 0, i.e. the valid indicator is set to false and as a result we have marked this record as invalid. For each record that was deleted we also update its valid period and valid_ind field. In addition, we update the record’s delete_ind field to 1, i.e. its delete indicator is set to true and as a result we have marked this record as deleted at source.

On lines 104-142 we insert the new records we came across in our current extract batch. We also create a new record for updated records (similar to a Slowly Changing Dimension Type 2). This will give us an audit trail for any updates that were made to records, e.g. when the price is increased or decreased.

6. Extract property details

As part of the previous step we extracted the link to the property’s details page. In this step we will use this link as part of an HTTP get request and scrape the information we are interested in from this page.

SQL> CREATE OR REPLACE PROCEDURE insert_prop_desc
  2
  3  IS
  4
  5  -- exec insert_prop_desc
  6
  7     CURSOR c_prop_desc
  8     IS
  9     SELECT
 10        link,
 11        property_id,
 12        prop_code
 13     FROM
 14        property a
 15     WHERE
 16        NOT EXISTS ( SELECT NULL FROM property_description b WHERE a.prop_code = b.prop_code);
 17
 18  BEGIN
 19
 20     FOR r_prop_desc IN c_prop_desc
 21     LOOP
 22
 23     INSERT INTO property_description
 24     SELECT
 25        r_prop_desc.property_id,
 26        r_prop_desc.prop_code,
 27        REPLACE(REGEXP_REPLACE(REGEXP_SUBSTR(REPLACE(REPLACE(REPLACE(html,CHR(10),''),CHR(13),''),CHR(9),''),'--></script></div>.*<div class="separator">'),'<[^>]+>'),'-->',''),
 28        TRUNC(SYSDATE),
 29        REGEXP_SUBSTR(TO_CHAR(REGEXP_SUBSTR(html,'show_map.*')),'(-|[0-9])[0-9].[0-9]{2,8}',1,1),
 30        REGEXP_SUBSTR(TO_CHAR(REGEXP_SUBSTR(html,'show_map.*')),'-[0-9].[0-9]{2,8}',1,1)
 31     FROM (
 32     SELECT
 33     HTTPURITYPE(r_prop_desc.link).getclob() AS html
 34     from dual  );
 35
 36     COMMIT;
 37
 38     dbms_lock.sleep(1);
 39
 40
 41     END LOOP;
 42
 43  END insert_prop_desc ;
 44  /

On lines 7-16 we define a cursor that will return us those properties for which no description has been added.

On lines 23-34 we iterate over the cursor and parse the description, the longitude, and the latitude from the HTML. We will use longitude and latitude in part 2 of this series to calculate distance between properties.

7. Bringing it all together

In a last step we bring all the individual procedures together in a master procedure.

SQL> CREATE OR REPLACE PROCEDURE prop_batch
  2
  3  IS
  4
  5  BEGIN
  6
  7      EXECUTE IMMEDIATE 'TRUNCATE TABLE property_html';
  8
  9      EXECUTE IMMEDIATE 'TRUNCATE TABLE property_helper';
 10
 11      EXECUTE IMMEDIATE 'TRUNCATE TABLE seed';
 12
 13      EXECUTE IMMEDIATE 'TRUNCATE TABLE seed_html';
 14
 15      EXECUTE IMMEDIATE 'TRUNCATE TABLE property_attributes';
 16
 17      extract_seed_html;
 18
 19      load_seed;
 20
 21      extract_prop_html (NULL);
 22
 23      merge_prop;
 24
 25      insert_prop_desc;
 26
 27  END prop_batch;
 28  /

Procedure created.

On lines 7 -15 we remove data from our previous extract batch and then, step by step, execute each extract procedure.

As a last step we need to add error handling and code instrumentation to our solution. However, this is out of scope for this article.