Discussion Notes

Mar 16, 2001

(courtesy Iype Isac)


Content Modeling

Need for content modeling

  • cluster according to content
  • mining structure from websites
  • information integration
  • answering questions/queries
  • more compact representation

Both papers discussed work on web pages that have an underlying schema, but they differ in one respect. The Knoblock paper deals with automatically generated web pages (pages produced from a database, and therefore governed by a database schema). The Craven paper works on manually generated web pages, hence the need to learn the structure.

Modeling Web Sources for Information Integration

  • tries to create a database-like view of the web and is more database oriented
  • AI is involved in the query planning algorithm. Why is proper query planning required? It can make query processing cheaper and more efficient, and improper planning may fail to produce an answer at all, e.g. in the case of embedded queries.
  • an interesting concept is programming by demonstration, used to create wrappers for the web pages.
  • talks about the possibility of learning the landmark grammar of web pages using a greedy covering algorithm (a small sketch of landmark-based extraction follows this list)
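
As a rough illustration of the landmark idea (a simplified sketch, not the paper's wrapper system; the page, field, and rule below are invented), a landmark rule skips past a sequence of landmark strings and extracts the text that follows, up to an end landmark:

    def extract(page, start_landmarks, end_landmark):
        """Skip past each start landmark in order, then return the text
        up to (but not including) the end landmark."""
        pos = 0
        for landmark in start_landmarks:
            idx = page.find(landmark, pos)
            if idx == -1:
                return None          # the rule does not apply to this page
            pos = idx + len(landmark)
        end = page.find(end_landmark, pos)
        if end == -1:
            return None
        return page[pos:end].strip()

    # Invented, automatically generated page (as if dumped from a database).
    page = "<html><b>Name:</b> Joe's Pizza <b>Phone:</b> (310) 555-1212 <br>"

    # A landmark rule for the phone number: skip to "Phone:</b>", stop at "<br>".
    print(extract(page, ["<b>Phone:</b>"], "<br>"))   # -> (310) 555-1212

A greedy covering learner would propose candidate landmark rules from labeled example pages, keep the rule that correctly covers the most examples, and repeat on the examples that remain uncovered.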

Learning to Extract Symbolic knowledge from the WWW

  • this paper focuses more on learning about the web pages themselves (the classes they fall into and the relationships between them)
  • uses both positive and negative examples (closed-world concept: anything not positive is negative)
  • uses the naive Bayes method with a Kullback-Leibler divergence modification for classifying web pages.

    e.g. we can classify a document D into a class C using the naive Bayes method. In that case the KL modification scores each vocabulary word W by

        log( P(W|C) / P(W|D) )

    where P(W|C) is the probability that the vocabulary word occurs in the class and P(W|D) is the probability that it occurs in the document. We can compute the denominator from the document itself and estimate the numerator from the class's training pages; this is the approach used to classify the web pages into classes. (The exact probability-estimation method is not spelled out; only a smoothing technique is mentioned.) Summing this term over the vocabulary, weighted by P(W|D), gives the negative KL divergence between the document's word distribution and the class's word distribution. A small sketch of this scoring appears after this list.

  • uses the FOIL algorithm to learn first-order rules
  • results of the learning reinforce a concept from the graph strand: pages that link to a given page are often better descriptors of it than the page itself. These ideas are captured in relational form by the learned rules (see the second sketch below).
  • adding probability measures and metrics to make the rules more quantitative would have complicated the learning process.
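
As a concrete illustration of the KL-style scoring described above, here is a minimal sketch (not the paper's code; the class word counts and the sample document are invented, and Laplace smoothing stands in for the unspecified smoothing technique). It scores a document D against each class C by summing P(W|D) · log( P(W|C) / P(W|D) ) over the vocabulary and picks the highest-scoring class:

    import math
    from collections import Counter

    def word_probs(counts, vocab, alpha=1.0):
        """Laplace-smoothed estimate of P(w | .) over a fixed vocabulary."""
        total = sum(counts.get(w, 0) for w in vocab) + alpha * len(vocab)
        return {w: (counts.get(w, 0) + alpha) / total for w in vocab}

    def kl_score(doc_words, class_counts, vocab):
        """sum_w P(w|D) * log( P(w|C) / P(w|D) ): higher is better
        (it is the negative KL divergence of the class from the document)."""
        p_doc = word_probs(Counter(doc_words), vocab)
        p_cls = word_probs(class_counts, vocab)
        return sum(p_doc[w] * math.log(p_cls[w] / p_doc[w]) for w in vocab)

    # Invented training word counts for two classes of university pages.
    classes = {
        "course":  Counter({"homework": 8, "lecture": 10, "syllabus": 5, "research": 1}),
        "faculty": Counter({"research": 9, "publications": 7, "lecture": 2, "homework": 1}),
    }
    vocab = set().union(*classes.values())

    doc = ["lecture", "homework", "homework", "syllabus"]
    scores = {c: kl_score(doc, counts, vocab) for c, counts in classes.items()}
    print(max(scores, key=scores.get))   # -> course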
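
To make the point about linking pages concrete, a FOIL-learned rule is a Horn clause over relations such as link_to(A, B) and has_word(P, W). The rule below is an invented example in that spirit (it is not one of the paper's learned rules), applied to a tiny hand-made link graph; note that the deciding evidence ("courses") lives on the page that links to cs101, not on cs101 itself:

    # Invented link graph and word index; none of this is from the paper.
    links = {("dept", "cs101"), ("dept", "smith"), ("cs101", "smith")}
    words = {
        "dept":  {"courses", "faculty"},
        "cs101": {"homework"},
        "smith": {"publications"},
    }

    def link_to(a, b):
        return (a, b) in links

    def has_word(page, word):
        return word in words.get(page, set())

    def is_course_page(page):
        """Invented FOIL-style rule:
        course(P) :- link_to(Q, P), has_word(Q, 'courses'), has_word(P, 'homework').
        A page is a course page if a page containing 'courses' links to it
        and the page itself contains 'homework'."""
        return any(link_to(q, page) and has_word(q, "courses") for q in words) \
            and has_word(page, "homework")

    print([p for p in words if is_course_page(p)])   # -> ['cs101']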

