Discussion Notes
Mar 16, 2001
(courtesy Iype Isac)
Content Modeling
Need for content modeling
- cluster according to content
- mining structure from websites
- information integration
- answering questions/queries
- more compact representation
Both papers discussed worked on webpages that have some schema, but they
differed in one respect. The Knoblock paper dealt with automatically
generated webpages (which would have been produced from a database and so
would have a db schema). The Craven paper worked on manually generated
webpages (hence the need to learn the structure).
Modeling Web Sources for Information Integration
- tries to create a database-like view of the web and is more database oriented
- AI is involved in generating the query planning algorithm. Why is proper query planning required?
- good planning can make the query processing cheaper and more efficient
- improper query processing may not result in an answer at all,
  e.g. in the case of embedded queries (a query-ordering sketch after this list illustrates this)
- an interesting concept was programming by demonstration (to create wrappers for the webpages).
- talks about the possibility of learning the landmark grammar of web pages using a greedy covering algorithm
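
To make the landmark-grammar point concrete, here is a minimal sketch of what a
learned landmark rule does when applied to a page. This is only an illustration
with made-up landmarks and a made-up page snippet; the paper's greedy covering
algorithm is about learning such landmark sequences from labeled examples, which
is not shown here.

    # Landmark-based extraction, in the spirit of the wrapper discussion above.
    # The landmarks and the sample page below are hypothetical.

    def extract(page, start_landmarks, end_landmark):
        """Skip past each start landmark in order, then return the text up to the end landmark."""
        pos = 0
        for lm in start_landmarks:
            idx = page.find(lm, pos)
            if idx == -1:
                return None          # the landmark rule does not match this page
            pos = idx + len(lm)
        end = page.find(end_landmark, pos)
        return page[pos:end].strip() if end != -1 else None

    page = "<b>Name:</b> Joe's Pizza <br><b>Phone:</b> (310) 555-1212 <br>"
    print(extract(page, ["<b>Phone:</b>"], "<br>"))   # -> "(310) 555-1212"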
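
On the query-planning question above: a toy sketch (made-up source and
attribute names) of why the order in which web sources are queried matters.
If a source requires some attributes as input before it can be queried, a bad
ordering is not just slower - it cannot be executed at all, which matches the
embedded-queries point.

    # Toy model of web sources with required inputs ("binding constraints").
    # Source and attribute names are hypothetical.
    SOURCES = {
        "review_site": {"gives": {"restaurant", "rating"},  "needs": set()},
        "map_site":    {"gives": {"restaurant", "address"}, "needs": {"restaurant"}},
    }

    def executable(plan):
        """A plan works only if every source's required inputs are produced by earlier steps."""
        available = set()
        for name in plan:
            src = SOURCES[name]
            if not src["needs"] <= available:
                return False                 # source cannot be queried yet
            available |= src["gives"]
        return True

    print(executable(["review_site", "map_site"]))   # True: restaurant names feed the map site
    print(executable(["map_site", "review_site"]))   # False: the map site needs restaurant names first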
Learning to Extract Symbolic Knowledge from the WWW
- this paper focuses more on learning about the webpages (the classes they fit into and the relationships between them)
- uses both positive and negative examples (closed world concept - anything not positive is negative)
- uses the naive Bayes method with a Kullback-Leibler divergence modification for classifying web pages.
  e.g. we can classify a document D into a class C using the Bayes method. In that
  case the KL modification takes, for each vocabulary word w, the form

      log( P(w|C) / P(w|D) )

  where P(w|C) is the probability that the vocabulary word occurs in the class and
  P(w|D) is the probability that it occurs in the document. We can calculate the
  denominator from the document and estimate the numerator for each candidate class;
  this is the approach used to classify the webpages into classes. (The actual method
  of estimating the probabilities is not mentioned - only a smoothing technique is.)
  A small code sketch of this scoring appears after this list.
- uses the FOIL algorithm to learn first-order rules
- results of the learning emphasize the idea, visible in the link structure,
  that the pages that link to a given page are often better descriptors of it
  than the page itself. These ideas are modeled in relational form by the
  learned rules.
- Adding probability measures and metrics to make the rules more quantitative
  would have complicated the learning process.
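
The classification scoring sketch referred to above. This is a minimal, assumed
version: bag-of-words counts, Laplace smoothing standing in for whatever
smoothing the paper actually uses, and the per-word ratio log(P(w|C)/P(w|D))
summed over the vocabulary, weighted by the document's own word frequencies.
The toy vocabulary and classes are made up.

    # Minimal naive-Bayes-with-KL scoring sketch; the smoothing and the toy
    # vocabulary/classes below are assumptions, not taken from the paper.
    import math
    from collections import Counter

    def word_probs(tokens, vocab, alpha=1.0):
        """Laplace-smoothed P(w | tokens) over a fixed vocabulary."""
        counts = Counter(tokens)
        total = sum(counts[w] for w in vocab) + alpha * len(vocab)
        return {w: (counts[w] + alpha) / total for w in vocab}

    def score(doc_tokens, class_tokens, vocab):
        p_d = word_probs(doc_tokens, vocab)    # P(w|D): how often w occurs in the document
        p_c = word_probs(class_tokens, vocab)  # P(w|C): how often w occurs in the class's pages
        # Sum of P(w|D) * log(P(w|C)/P(w|D)): higher means D looks more like C.
        return sum(p_d[w] * math.log(p_c[w] / p_d[w]) for w in vocab)

    vocab = {"course", "homework", "professor", "research", "student"}
    course_pages  = "course homework homework professor course".split()
    faculty_pages = "research student professor research".split()
    doc = "homework course course".split()
    # The document scores higher against the "course" class than the "faculty" class.
    print(score(doc, course_pages, vocab) > score(doc, faculty_pages, vocab))   # True

This covers only the page-classification step; the relational rules between
pages come from FOIL afterwards.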