Modeling the Internet and the Web

Probabilistic Methods and Algorithms

Pierre Baldi, Paolo Frasconi, Padhraic Smyth

Table of contents

  1. Mathematical Background
    1. Probability and Learning from a Bayesian Perspective
    2. Parameter Estimation from Data
      1. Basic principles
      2. A simple die example
    3. Mixture Models and the Expectation Maximization Algorithm
    4. Graphical Models
      1. Bayesian networks
      2. Belief propagation
      3. Learning directed graphical models from data
    5. Classification
    6. Clustering
    7. Power Law Distributions
      1. Definition
      2. Scale-free properties (80/20 rule)
      3. Applications to languages: Zipf and Heaps laws
      4. Origin of power-law distributions and Fermi's model
    8. Exercises
  2. Basic WWW Technologies
    1. Web documents
      1. SGML and HTML
      2. General structure of an HTML document
      3. Links
    2. Resource identifiers: URI, URL, and URN
    3. Protocols
      1. Reference models and TCP/IP
      2. The domain name system
      3. The hypertext transfer protocol
      4. Programming examples
    4. Log files
    5. Search engines
      1. Overview
      2. Coverage
      3. Basic crawling
    6. Exercises
  3. Web Graphs
    1. Internet and Web Graphs
      1. Power-law size
      2. Power-law connectivity
      3. Small-world networks
      4. Power law of PageRank
      5. The bow-tie structure
    2. Generative Models for theWeb Graph and Other Networks
      1. Web page growth
      2. Lattice perturbation models: between order and disorder
      3. Preferential attachment models, or the rich get richer
      4. Copy models
      5. PageRank models
    3. Applications
      1. Distributed search algorithms
      2. Subgraph patterns and communities
      3. Robustness and vulnerability
    4. Notes and additional technical references
    5. Exercises
  4. Text Analysis (sample chapter available for download)
    1. Indexing
      1. Basic concepts
      2. Compression techniques
    2. Lexical Processing
      1. Tokenization
      2. Text conflation and vocabulary reduction
    3. Content-Based Ranking
      1. The vector-space model
      2. Document similarity
      3. Retrieval and evaluation measures
    4. Probabilistic Retrieval
    5. Latent Semantic Analysis
      1. LSI and text documents
      2. Probabilistic LSA
    6. Text Categorization
      1. k nearest neighbors
      2. The Naive Bayes classifier
      3. Support vector classifiers
      4. Feature selection
      5. Measures of performance
      6. Applications
      7. Supervised learning with unlabeled data
    7. Exploiting Hyperlinks
      1. Co-training
      2. Relational learning
    8. Document Clustering
      1. Background and examples
      2. Clustering algorithms for documents
      3. Related approaches
    9. Information Extraction
    10. Exercises
  5. Link analysis
    1. Early Approaches to Link Analysis
    2. Nonnegative Matrices and Dominant Eigenvectors
    3. Hubs and Authorities: HITS
    4. PageRank
    5. Stability
      1. Stability of HITS
      2. Stability of PageRank
    6. Probabilistic Link Analysis
      1. SALSA
      2. PHITS
    7. Limitations of Link Analysis
  6. Advanced Crawling Techniques
    1. Selective Crawling
    2. Focused Crawling
      1. Focused crawling by relevance prediction
      2. Context Graphs
      3. Reinforcement Learning
      4. Related intelligentWeb agents
    3. Distributed crawling
    4. Web Dynamics
      1. Lifetime and aging of documents
      2. Other measures of recency
      3. Recency and synchronization policies
  7. Modeling and Understanding Human Behavior on the Web
    1. Introduction
    2. Web Data and Measurement Issues
      1. Background
      2. Server-side data 
      3. Client-side data
    3. Empirical Client-Side Studies of Browsing Behavior
      1. Early studies from 1995 to 1997
      2. The Cockburn and McKenzie study from 2002
    4. Probabilistic Models of Browsing Behavior
      1. Markov models for page prediction
      2. Fitting Markov models to observed page-request data
      3. Bayesian parameter estimation for Markov models
      4. Predicting page requests with Markov models
      5. Modeling runlengths within states
      6. Modeling session lengths
      7. A decision-theoretic surfing model
      8. Predicting page requests using additional variables
    5. Modeling and Understanding Search Engine Querying
      1. Empirical studies of search behavior
      2. Models for search strategies
    6. Exercises
  8. Commerce on theWeb: Models and Applications
    1. Introduction
    2. Customer Data on theWeb
    3. Automated Recommender Systems
      1. Evaluating recommender systems
      2. Nearest-neighbor collaborative filtering
      3. Model-based collaborative filtering
      4. Model-based combining of votes and content
    4. Networks and Recommendations
      1. Email-based product recommendations
      2. A diffusion model
    5. Web Path Analysis for Purchase Prediction
    6. Exercises