Discussion Notes

Jan 31, 2001

(courtesy Balaji K.S. with some edits and input by Rob Capra)

The discussions focused on three threads:

How useful is a web log?
What techniques are available for mining a web log?
Placing this in the larger context of an information system

Is a web log useful to draw inferences?

Multiple Windows

The user may have several windows open at the same time, which makes it tricky to draw inferences about the user's experience of a web site. For example, the user could open two windows to different sites (A and B) and then, for 15 minutes, do nothing on site A while working on site B. Based on its own log, site A might conclude that the user spent those 15 minutes on its page out of interest, when in reality the user spent all that time reading site B.

How to distinguish one user from another in the web log

Problem: due to proxies and shared computers, the "hostname" recorded in the log is not always a reliable way to distinguish one user from another
Options: cookies, scripts, and authentication
But these have privacy/security considerations
Can only look at information about the user's activity on this site
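
To make the hostname problem concrete, here is a minimal sketch that groups the same requests once by hostname and once by a cookie identifier. The record layout, the hostnames, and the cookie values are all hypothetical and only serve as illustration; a real log would need to be parsed first.

```python
from collections import defaultdict

# Hypothetical log records: (hostname, cookie_id, page).
# Two different users sit behind the same proxy host.
records = [
    ("proxy.example.net", "c1", "/index.html"),
    ("proxy.example.net", "c2", "/index.html"),
    ("proxy.example.net", "c1", "/search.html"),
    ("lab-pc.example.edu", "c3", "/index.html"),
]

def group_pages(records, key_index):
    """Group requested pages by the field at key_index (0 = host, 1 = cookie)."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key_index]].append(record[2])
    return dict(groups)

by_host = group_pages(records, 0)
by_cookie = group_pages(records, 1)

print(len(by_host), "apparent users when keyed by hostname")
print(len(by_cookie), "apparent users when keyed by cookie")
```

Keyed by hostname, the two users behind the proxy collapse into one apparent user; the cookie key separates them, at the privacy cost noted above.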

Caching problem

Web clients and proxies may cache previously viewed pages, so revisits to the same page may not appear in the site's web log: the page was retrieved from the cache, not from the site.
Cooley et al. describe a method that uses knowledge of the site's structure to determine whether a revisit has occurred or whether the accesses came from different users.
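
The idea of using site structure can be sketched roughly as follows. This is only an illustration based on the description in these notes, not the actual algorithm of Cooley et al.; the link graph, the page names, and the three labels are assumptions. Each request from one host is checked against the pages already seen: if it is linked from the current page it is an ordinary click, if it is linked only from an earlier page the user probably went back through cached pages, and if no visited page links to it, it may belong to a different user.

```python
def classify_requests(links, requests):
    """Label each request from one host using the site's link structure.

    links:    dict mapping a page to the set of pages it links to (assumed known)
    requests: pages requested by one host, in log order
    """
    visited = []
    labels = []
    for page in requests:
        if not visited:
            labels.append((page, "entry"))
        elif page in links.get(visited[-1], set()):
            labels.append((page, "link from current page"))
        else:
            # Search earlier pages, most recent first, for one that links here.
            origin = next(
                (p for p in reversed(visited[:-1]) if page in links.get(p, set())),
                None,
            )
            if origin is not None:
                # The return to `origin` was served from a cache and never logged.
                labels.append((page, "cached revisit of " + origin))
            else:
                labels.append((page, "possibly a different user"))
        visited.append(page)
    return labels

# Hypothetical site: A links to B and C, B links to D, X links to Y.
site_links = {"A": {"B", "C"}, "B": {"D"}, "C": set(), "D": set(), "X": {"Y"}}
for page, label in classify_requests(site_links, ["A", "B", "D", "C", "Y"]):
    print(page, "->", label)
```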

Web site design depends on perception

"Conclusions are only valid if the users perceive the site and understand its services as the designers have conceived them." (p.128 of the article)
Need to focus first on "personalizing the site in serving its users."
If users don't understand the site, drawing conclusions based on their data about what is popular or correlated on the site may be invalid.
By looking at how users navigate and interact with the site, we can learn things about the quality of the site

Concept hierarchies can help give insights

Concept hierarchies are like taxonomies: they lead from the concrete to the more abstract.
The paper mentions using this idea on hosts to find areas, regions, or countries instead of specific hosts
Another example is given about "generalizing" query types such as titleANDauthor and publisherANDyear to "TwoParametersSearch"; still it appears that all scenarios have been "enumerated" by the author
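
A small sketch of both generalizations follows. Only the name "TwoParametersSearch" and the two query types come from the notes; the other category names, the hostnames, and the way levels are cut from a hostname are assumptions made for illustration. Note that every level is written out by hand here as well, which mirrors the "enumerated" remark above.

```python
from collections import Counter

def generalize_query(query_type):
    """Map a concrete query type such as 'titleANDauthor' to a more abstract class."""
    parameter_count = len(query_type.split("AND"))
    # Only "TwoParametersSearch" appears in the notes; the other names are hypothetical.
    names = {1: "OneParameterSearch", 2: "TwoParametersSearch"}
    return names.get(parameter_count, "MultiParameterSearch")

def generalize_host(hostname, level):
    """Climb a simple host hierarchy: 0 = host, 1 = organization, 2 = top-level domain."""
    labels = hostname.lower().split(".")
    if level == 0:
        return hostname.lower()
    if level == 1:
        return ".".join(labels[-2:])
    return labels[-1]

queries = ["titleANDauthor", "publisherANDyear", "title"]
print(Counter(generalize_query(q) for q in queries))

hosts = ["pc1.cs.example.edu", "pc2.cs.example.edu", "www.example.de"]
print(Counter(generalize_host(h, 2) for h in hosts))
```

Counting at the abstract level lets patterns surface that are too sparse at the level of individual hosts or query types.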

Data Mining

For a particular scenario (or scenarios), several ideas are available to improve the efficiency of the data mining process: we survey them below.

Anti-monotonicity

If it is found (in a bottom-up fashion) that there is no support for {b}, then there is no need to look any higher at the {b,d}, {b,c}, and {b,c,d} nodes, since there will be no support for them either:

                    {b,c,d}
                     / | \
                    /  |  \
                   /   |   \
                  /    |    \
              {b,d}  {b,c}  {d,c}
                | \  /   \  / |
                |  \/     \/  |
                |  /\     /\  |
                | /  \   /  \ |
               {b}    {d}   {c}
                 \     |     /
                  \    |    /
                   \   |   /
                    \  |  /
                      { }

Query optimizers can make use of the anti-monotonicity constraint to selectively "reorder" query (mining) operations in an attempt to improve retrieval performance.
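
The pruning in the lattice above can be sketched with a level-wise, Apriori-style search over itemsets. This is a generic illustration of anti-monotone pruning, not the WUM miner itself; the transactions and the support threshold are made up. A candidate is counted only if every subset one item smaller was frequent, so once {b} fails, no set containing b is ever examined.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({item for t in transactions for item in t})
    level = {}
    for item in items:
        single = frozenset([item])
        count = support(single)
        if count >= min_support:
            level[single] = count

    frequent = {}
    size = 1
    while level:
        frequent.update(level)
        size += 1
        # Join frequent sets of the previous level into candidates one item larger.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        next_level = {}
        for candidate in candidates:
            # Anti-monotone pruning: skip counting if any smaller subset was infrequent.
            subsets = (frozenset(s) for s in combinations(candidate, size - 1))
            if all(s in level for s in subsets):
                count = support(candidate)
                if count >= min_support:
                    next_level[candidate] = count
        level = next_level
    return frequent

# Hypothetical sessions over pages b, c, d; {b} occurs only once.
sessions = [{"c", "d"}, {"c", "d"}, {"c"}, {"d"}, {"b"}]
for itemset, count in frequent_itemsets(sessions, min_support=2).items():
    print(sorted(itemset), count)
```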

Using Generality Orderings

Meta-Patterns = Patterns of Patterns. Thus, syntactic and semantic constraints on the nature of patterns can be used to prune the search space for hypotheses.

Anytime Results

Data mining can be terminated when results of the desired fidelity are achieved.

Caveats with the WUM approach

The mining language does not support closure in the sense of SQL (i.e., the output of a mining query cannot seamlessly serve as the input to another mining query). Moreover, the expressiveness of the language is limited to propositional logic; first-order predicate logic would help mine fundamentally relational patterns.

Placing weblog mining in a larger context

Enumeration

Each scenario is enumerated in advance to ensure that data mining and exploitation (of mined patterns) can make use of this information. There is some attempt to separate modeling of the system from targeting.

Different Sessions

Session information cannot be easily maintained. If a user accesses page2 and, in another window, goes to page0, both accesses might need to be considered a single session (e.g., a "manual" information integration scenario); according to the authors, however, they are modeled as different sessions. (Cooley et al. address only the caching issue, not the consistency of session information.)

Interaction between sites

Mining a web log yields some information, but its value for drawing inferences is limited. Today, what a user needs is typically spread across different web sites. The web log of a single site can show how the user navigates within that site, which links the user clicks, and so on. This can be useful for redesigning the site: if some links are not used, we can infer either that users did not like them or that they were not placed properly, and then redesign and analyze the resulting behaviour.

However, the designer cannot learn the larger pattern (context, scenario): (1) from which web site the user came to this site (and why), and (2) whether the output from this site is going to be used at a different site. That is, the interaction between different sites cannot be determined.

Evaluation

The author's idea of evaluation coupled with the usability study appears nice and should be developed. Some statements and observations about how users prefer to interact with the system deserve particular attention. Do the users do this because this is what they want or because that's how they think the system can work?