Discussion Notes
Jan 31, 2001
(courtesy Balaji K.S. with some edits and input by Rob Capra)
The discussions focused on three threads:
- How useful is a web log?
- What techniques are available for mining a web log?
- Placing this in the larger context of an information system
Is a web log useful for drawing inferences?
Multiple Windows
The user may have different windows open at the same time
which can make drawing inferences about the user's experience
of a web site tricky.
For example, the user could open two windows to different sites
(A and B). Then, for 15 minutes, the user does nothing on site A,
but is instead working on site B.
Based on its log alone, site A might be tempted to conclude that
the user likes its page, when in reality the user spent
all that time reading site B.
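To make the two-window pitfall concrete, here is a minimal sketch of how
a naive analysis might estimate time-on-page from request timestamps.
The function name, the log entries, and the URLs are hypothetical and
serve only as an illustration.

```python
from datetime import datetime


def naive_dwell_times(requests):
    """Estimate time on each page as the gap until the user's next request.

    `requests` is a list of (timestamp, url) pairs for ONE user on ONE site,
    sorted by time. The last request has no successor, so it gets no estimate.
    """
    dwell = []
    for (t0, url), (t1, _next_url) in zip(requests, requests[1:]):
        dwell.append((url, (t1 - t0).total_seconds()))
    return dwell


# Hypothetical log for site A: the user loads a page, works in a second
# window on site B for 15 minutes, then returns and clicks a link.
site_a_log = [
    (datetime(2001, 1, 31, 10, 0, 0), "/index.html"),
    (datetime(2001, 1, 31, 10, 15, 0), "/products.html"),
]

dwell = naive_dwell_times(site_a_log)
print(dwell)
```

Site A's log attributes the full 15 minutes to /index.html; nothing in
the log reveals that the user was actually reading site B.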
How to distinguish one user from another in the web log
- Problem: due to proxies and shared computers,
the "hostname" recorded in the log is not always
an accurate indicator to distinguish one user
from another
- Options: cookies, scripts, and authentication
- But these have privacy/security considerations
- A site can only look at information about the user's
activity on that site
Caching problem
- Web clients and proxies may cache previously
viewed pages, so revisits to the same page
may not be reflected in the site's web log: the page
was retrieved from the cache, not from the site.
- Cooley et al. propose a method for dealing with this that
uses knowledge of the site's structure to determine whether
a revisit has occurred or whether the access came from
a different user
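The notes do not spell out the method, so the following is only a rough
sketch of the general idea of using site structure to fill in cached
revisits. It assumes that when a logged page is not linked from the
previous page, the user backed up (through cached pages) to the most
recent page that does link to it. The function `complete_path`, the
`links` map, and the example site are hypothetical and are not taken
from the Cooley et al. paper.

```python
def complete_path(session, links):
    """Insert likely cache-served revisits into a logged page sequence.

    session: list of page names in the order they appear in the log.
    links:   dict mapping each page to the set of pages it links to.
    """
    if not session:
        return []
    completed = [session[0]]
    stack = [session[0]]  # pages the user could reach with the Back button
    for page in session[1:]:
        if page not in links.get(stack[-1], set()):
            # Search backwards for the most recent page that links to `page`.
            for i in range(len(stack) - 2, -1, -1):
                if page in links.get(stack[i], set()):
                    # Assume the user backed up to stack[i]; the pages passed
                    # on the way were served from cache, hence not logged.
                    completed.extend(reversed(stack[i:-1]))
                    del stack[i + 1:]
                    break
            # If no earlier page links to `page`, leave the path unchanged
            # (the user may have typed the URL, or may be a different user).
        completed.append(page)
        stack.append(page)
    return completed


# Hypothetical site: A links to B and C, B links to D.
links = {"A": {"B", "C"}, "B": {"D"}, "C": set(), "D": set()}
logged = ["A", "B", "D", "C"]
completed = complete_path(logged, links)
print(completed)
```

Here D does not link to C, so the sketch infers that the user went back
through B to A before following the link to C.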
Web site design depends on perception
- "Conclusions are only valid if the users perceive the
site and understand its services as the designers
have conceived them." (p.128 of the article)
- Need to focus first on "personalizing the site
in serving its users."
- If users don't understand the site, drawing conclusions
based on their data about what is popular or correlated
on the site may be invalid.
- By looking at how users navigate and interact
with the site, we can learn things about
the quality of the site
Concept hierarchies can help give insights
- Concept hierarchies are like taxonomies:
they take concrete items to more abstract ones.
- The paper mentions using this idea on hosts to find areas,
regions, or countries instead of specific hosts
- Another example is given about "generalizing" query types
such as titleANDauthor and publisherANDyear
to "TwoParametersSearch"; still, it appears that the author
has "enumerated" all scenarios
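The two generalizations above can be sketched as simple lookups. This is
a minimal illustration, not the paper's mechanism: the region table, the
hostnames, and every concept name other than "TwoParametersSearch" are
hypothetical, and the sketch assumes query types are field names joined
by the literal string "AND".

```python
from collections import Counter

# Hypothetical one-level hierarchy from top-level domain to region.
TLD_TO_REGION = {"de": "Europe", "fr": "Europe", "jp": "Asia"}


def generalize_host(hostname):
    """Map a concrete host to a more abstract region via its top-level domain."""
    tld = hostname.rsplit(".", 1)[-1].lower()
    return TLD_TO_REGION.get(tld, "Other")


def generalize_query(query_type):
    """Map e.g. 'titleANDauthor' to a concept named after its parameter count."""
    n = len(query_type.split("AND"))
    names = {1: "OneParameterSearch", 2: "TwoParametersSearch"}
    return names.get(n, "ManyParametersSearch")


# Hypothetical hosts taken from a web log.
hosts = ["www.example.de", "proxy.example.fr", "host.example.jp"]
region_counts = Counter(generalize_host(h) for h in hosts)
print(region_counts)
print(generalize_query("titleANDauthor"), generalize_query("publisherANDyear"))
```

Mining over the generalized values lets patterns surface at the level of
regions or query shapes that would be too sparse at the level of
individual hosts or field combinations.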
Data Mining
For a particular scenario (or scenarios), several ideas are available to
improve the efficiency of the data mining process: we survey them below.
Anti-monotonicity
If it is found (in a bottom-up fashion) that there is no support for {b},
then there is no need to look any higher, at the {b,d}, {b,c}, and {b,c,d}
nodes, since there will be no support for them either:
            {b,c,d}
           /   |   \
         /     |     \
       /       |       \
     /         |         \
  {b,d}      {b,c}      {d,c}
    | \      /   \      / |
    |    \/         \/    |
    |    /\         /\    |
    | /      \   /      \ |
   {b}        {d}        {c}
      \        |        /
        \      |      /
          \    |    /
            \  |  /
              { }
Query optimizers can make use of the anti-monotonicity constraint
to selectively "reorder" query (mining) operations in an attempt to
improve retrieval performance.
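The pruning argument can be sketched as a level-wise (Apriori-style)
search over the lattice. This is a minimal illustration of
anti-monotone pruning in general, not the WUM miner itself; the function
name, the toy transactions, and the support threshold are assumptions.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Bottom-up search that prunes supersets of unsupported itemsets.

    Returns a dict mapping each frequent itemset (a frozenset) to the number
    of transactions that contain it.
    """
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        size += 1
        # Anti-monotonicity: a set can only be frequent if all of its
        # subsets one item smaller are frequent, so anything built on an
        # unsupported set is never generated, let alone counted.
        survivors = sorted({item for c in level for item in c})
        candidates = [
            frozenset(combo)
            for combo in combinations(survivors, size)
            if all(frozenset(sub) in level for sub in combinations(combo, size - 1))
        ]
    return frequent


# Toy data in which {b} lacks support at min_support=2.
transactions = [{"c", "d"}, {"c", "d"}, {"b"}, {"c"}]
result = frequent_itemsets(transactions, min_support=2)
print(result)
```

With these transactions, {b} fails the threshold at the first level, so
{b,d}, {b,c}, and {b,c,d} are never counted against the data.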
Using Generality Orderings
Meta-Patterns = Patterns of Patterns. Thus, syntactic and semantic constraints
on the nature of patterns can be used to prune the search space for hypotheses.
Anytime Results
Data mining can be terminated when results of the desired fidelity are
achieved.
Caveats with the WUM approach
The mining language does not support closure in the sense of SQL (i.e., the output
of a mining query cannot seamlessly serve as the input to another mining query). Moreover,
the expressiveness of the language is limited to propositional logic; first-order
predicate logic would help mine fundamentally relational patterns.
Placing weblog mining in a larger context
Enumeration
Each scenario is enumerated in advance to ensure that data mining and exploitation (of
mined patterns) can make use of this information.
There is some attempt to separate modeling of the system from targeting.
Different Sessions
Session information cannot be easily maintained.
If a user accesses page2 and, in another window, goes to
page0, both might need to be considered a single session (e.g.,
a "manual" information integration scenario);
according to the authors, however, they are
modeled as different sessions.
(Cooley addresses only the caching issue, not session
information consistency.)
Interaction between sites
Mining a web log can give some information, but it is of limited use
for drawing inferences. In today's world, what a user needs is spread
across different websites. The web log from a single website can show
how the user navigates within that website, such as which links are
clicked. This can be useful for redesigning the website: if some links
are not used, we can infer that either users did not like them or they
were not placed properly, and we can then redesign and analyze the
behaviour again.
But the designer will not be able to know the pattern (context, scenario):
(1) from which website the user came to this site, and why;
(2) whether the output from this site is going to be used on a different
site. That is, the interaction between different sites cannot be determined.
Evaluation
The author's idea of coupling evaluation with a usability study is
appealing and should be developed. Some statements and
observations about how users prefer to interact with
the system deserve particular attention. Do the users do this because
this is what they want, or because that is how they think the system works?