Modeling the Internet and the Web
Probabilistic Methods and Algorithms
Pierre Baldi, Paolo Frasconi, Padhraic Smyth
Table of contents
Preface
Mathematical Background
Probability and Learning from a Bayesian Perspective
Parameter Estimation from Data
Basic principles
A simple die example
Mixture Models and the Expectation Maximization Algorithm
Graphical Models
Bayesian networks
Belief propagation
Learning directed graphical models from data
Classification
Clustering
Power Law Distributions
Definition
Scale-free properties (80/20 rule)
Applications to languages: Zipf's and Heaps' laws
Origin of power-law distributions and Fermi's model
Exercises
Basic WWW Technologies
Web documents
SGML and HTML
General structure of an HTML document
Links
Resource identifiers: URI, URL, and URN
Protocols
Reference models and TCP/IP
The domain name system
The hypertext transfer protocol
Programming examples
Log files
Search engines
Overview
Coverage
Basic crawling
Exercises
Web Graphs
Internet and Web Graphs
Power-law size
Power-law connectivity
Small-world networks
Power law of PageRank
The bow-tie structure
Generative Models for the Web Graph and Other Networks
Web page growth
Lattice perturbation models: between order and disorder
Preferential attachment models, or the rich get richer
Copy models
PageRank models
Applications
Distributed search algorithms
Subgraph patterns and communities
Robustness and vulnerability
Notes and additional technical references
Exercises
Text Analysis
Indexing
Basic concepts
Compression techniques
Lexical Processing
Tokenization
Text conflation and vocabulary reduction
Content-Based Ranking
The vector-space model
Document similarity
Retrieval and evaluation measures
Probabilistic Retrieval
Latent Semantic Analysis
LSI and text documents
Probabilistic LSA
Text Categorization
k nearest neighbors
The Naive Bayes classifier
Support vector classifiers
Feature selection
Measures of performance
Applications
Supervised learning with unlabeled data
Exploiting Hyperlinks
Co-training
Relational learning
Document Clustering
Background and examples
Clustering algorithms for documents
Related approaches
Information Extraction
Exercises
Link Analysis
Early Approaches to Link Analysis
Nonnegative Matrices and Dominant Eigenvectors
Hubs and Authorities: HITS
PageRank
Stability
Stability of HITS
Stability of PageRank
Probabilistic Link Analysis
SALSA
PHITS
Limitations of Link Analysis
Advanced Crawling Techniques
Selective Crawling
Focused Crawling
Focused crawling by relevance prediction
Context Graphs
Reinforcement Learning
Related intelligent Web agents
Distributed crawling
Web Dynamics
Lifetime and aging of documents
Other measures of recency
Recency and synchronization policies
Modeling and Understanding Human Behavior on the Web
Introduction
Web Data and Measurement Issues
Background
Server-side data
Client-side data
Empirical Client-Side Studies of Browsing Behavior
Early studies from 1995 to 1997
The Cockburn and McKenzie study from 2002
Probabilistic Models of Browsing Behavior
Markov models for page prediction
Fitting Markov models to observed page-request data
Bayesian parameter estimation for Markov models
Predicting page requests with Markov models
Modeling runlengths within states
Modeling session lengths
A decision-theoretic surfing model
Predicting page requests using additional variables
Modeling and Understanding Search Engine Querying
Empirical studies of search behavior
Models for search strategies
Exercises
Commerce on the Web: Models and Applications
Introduction
Customer Data on the Web
Automated Recommender Systems
Evaluating recommender systems
Nearest-neighbor collaborative filtering
Model-based collaborative filtering
Model-based combining of votes and content
Networks and Recommendations
Email-based product recommendations
A diffusion model
Web Path Analysis for Purchase Prediction
Exercises
Appendix A Mathematical Complements
Graph Theory
Basic definitions
Connectivity
Random graphs
Distributions
Expectation, Variance, and Covariance
Discrete distributions
Continuous distributions
Weibull distribution
Exponential family
Extreme value distribution
Singular Value Decomposition
Markov Chains
Information Theory
Mathematical background
Information, surprise, and relevance
Appendix B List of Main Symbols and Abbreviations
References
Index