Data Science and Predictive Analytics

Dinov, Ivo D. | 2018

  1. Type of Document: Book
  2. Publisher: Switzerland: Springer International Publishing AG, 2018
  3. Keywords: Big data; Mathematical statistics; Medical records -- Data processing; R (Computer program language)

  • Foreword
  • Preface
    • Genesis
    • Purpose
    • Limitations/Prerequisites
    • Scope of the Book
    • Acknowledgements
  • DSPA Application and Use Disclaimer
    • Biomedical, Biosocial, Environmental, and Health Disclaimer
  • Notations
  • Contents
  • Chapter 1: Motivation
    • 1.1 DSPA Mission and Objectives
    • 1.2 Examples of Driving Motivational Problems and Challenges
      • 1.2.1 Alzheimer's Disease
      • 1.2.2 Parkinson's Disease
      • 1.2.3 Drug and Substance Use
      • 1.2.4 Amyotrophic Lateral Sclerosis
      • 1.2.5 Normal Brain Visualization
      • 1.2.6 Neurodegeneration
      • 1.2.7 Genetic Forensics: 2013-2016 Ebola Outbreak
      • 1.2.8 Next Generation Sequence (NGS) Analysis
      • 1.2.9 Neuroimaging-Genetics
    • 1.3 Common Characteristics of Big (Biomedical and Health) Data
    • 1.4 Data Science
    • 1.5 Predictive Analytics
    • 1.6 High-Throughput Big Data Analytics
    • 1.7 Examples of Data Repositories, Archives, and Services
    • 1.8 DSPA Expectations
  • Chapter 2: Foundations of R
    • 2.1 Why Use R?
    • 2.2 Getting Started
      • 2.2.1 Install Basic Shell-Based R
      • 2.2.2 GUI-Based R Invocation (RStudio)
      • 2.2.3 RStudio GUI Layout
      • 2.2.4 Some Notes
    • 2.3 Help
    • 2.4 Simple Wide-to-Long Data Format Translation
    • 2.5 Data Generation
    • 2.6 Input/Output (I/O)
    • 2.7 Slicing and Extracting Data
    • 2.8 Variable Conversion
    • 2.9 Variable Information
    • 2.10 Data Selection and Manipulation
    • 2.11 Math Functions
    • 2.12 Matrix Operations
    • 2.13 Advanced Data Processing
    • 2.14 Strings
    • 2.15 Plotting
    • 2.16 QQ Normal Probability Plot
    • 2.17 Low-Level Plotting Commands
    • 2.18 Graphics Parameters
    • 2.19 Optimization and Model Fitting
    • 2.20 Statistics
    • 2.21 Distributions
      • 2.21.1 Programming
    • 2.22 Data Simulation Primer
    • 2.23 Appendix
      • 2.23.1 HTML SOCR Data Import
      • 2.23.2 R Debugging
        • Example
    • 2.24 Assignments: 2. R Foundations
      • 2.24.1 Confirm that You Have Installed R/RStudio
      • 2.24.2 Long-to-Wide Data Format Translation
      • 2.24.3 Data Frames
      • 2.24.4 Data Stratification
      • 2.24.5 Simulation
      • 2.24.6 Programming
    • References
  • Chapter 3: Managing Data in R
    • 3.1 Saving and Loading R Data Structures
    • 3.2 Importing and Saving Data from CSV Files
    • 3.3 Exploring the Structure of Data
    • 3.4 Exploring Numeric Variables
    • 3.5 Measuring the Central Tendency: Mean, Median, Mode
    • 3.6 Measuring Spread: Quartiles and the Five-Number Summary
    • 3.7 Visualizing Numeric Variables: Boxplots
    • 3.8 Visualizing Numeric Variables: Histograms
    • 3.9 Understanding Numeric Data: Uniform and Normal Distributions
    • 3.10 Measuring Spread: Variance and Standard Deviation
    • 3.11 Exploring Categorical Variables
    • 3.12 Exploring Relationships Between Variables
    • 3.13 Missing Data
      • 3.13.1 Simulate Some Real Multivariate Data
      • 3.13.2 TBI Data Example
      • 3.13.3 Imputation via Expectation-Maximization
        • Types of Missing Data
        • General Idea of EM Algorithm
        • EM-Based Imputation
        • A Simple Manual Implementation of EM-Based Imputation
        • Plotting Complete and Imputed Data
        • Validation of EM-Imputation Using the Amelia R Package
          • Comparison
          • Density Plots
    • 3.14 Parsing Webpages and Visualizing Tabular HTML Data
    • 3.15 Cohort-Rebalancing (for Imbalanced Groups)
    • 3.16 Appendix
      • 3.16.1 Importing Data from SQL Databases
      • 3.16.2 R Code Fragments
    • 3.17 Assignments: 3. Managing Data in R
      • 3.17.1 Import, Plot, Summarize and Save Data
      • 3.17.2 Explore Some Bivariate Relations in the Data
      • 3.17.3 Missing Data
      • 3.17.4 Surface Plots
      • 3.17.5 Unbalanced Designs
      • 3.17.6 Aggregate Analysis
    • References
  • Chapter 4: Data Visualization
    • 4.1 Common Questions
    • 4.2 Classification of Visualization Methods
    • 4.3 Composition
      • 4.3.1 Histograms and Density Plots
      • 4.3.2 Pie Chart
      • 4.3.3 Heat Map
    • 4.4 Comparison
      • 4.4.1 Paired Scatter Plots
      • 4.4.2 Jitter Plot
      • 4.4.3 Bar Plots
      • 4.4.4 Trees and Graphs
      • 4.4.5 Correlation Plots
    • 4.5 Relationships
      • 4.5.1 Line Plots Using ggplot
      • 4.5.2 Density Plots
      • 4.5.3 Distributions
      • 4.5.4 2D Kernel Density and 3D Surface Plots
      • 4.5.5 Multiple 2D Image Surface Plots
      • 4.5.6 3D and 4D Visualizations
    • 4.6 Appendix
      • 4.6.1 Hands-on Activity (Health Behavior Risks)
      • 4.6.2 Additional ggplot Examples
        • Housing Price Data
        • Modeling the Home Price Index Data (Fig. 4.48)
        • Map of the Neighborhoods of Los Angeles (LA)
        • Latin Letter Frequency in Different Languages
    • 4.7 Assignments: 4. Data Visualization
      • 4.7.1 Common Plots
      • 4.7.2 Trees and Graphs
      • 4.7.3 Exploratory Data Analytics (EDA)
    • References
  • Chapter 5: Linear Algebra and Matrix Computing
    • 5.1 Matrices (Second Order Tensors)
      • 5.1.1 Create Matrices
      • 5.1.2 Adding Columns and Rows
    • 5.2 Matrix Subscripts
    • 5.3 Matrix Operations
      • 5.3.1 Addition
      • 5.3.2 Subtraction
      • 5.3.3 Multiplication
        • Elementwise Multiplication
        • Matrix Multiplication
      • 5.3.4 Element-wise Division
      • 5.3.5 Transpose
      • 5.3.6 Multiplicative Inverse
    • 5.4 Matrix Algebra Notation
      • 5.4.1 Linear Models
      • 5.4.2 Solving Systems of Equations
      • 5.4.3 The Identity Matrix
    • 5.5 Scalars, Vectors and Matrices
      • 5.5.1 Sample Statistics (Mean, Variance)
        • Mean
        • Variance
        • Applications of Matrix Algebra: Linear Modeling
        • Finding Function Extrema (Min/Max) Using Calculus
      • 5.5.2 Least Square Estimation
        • The R lm Function
    • 5.6 Eigenvalues and Eigenvectors
    • 5.7 Other Important Functions
    • 5.8 Matrix Notation (Another View)
    • 5.9 Multivariate Linear Regression
    • 5.10 Sample Covariance Matrix
    • 5.11 Assignments: 5. Linear Algebra and Matrix Computing
      • 5.11.1 How Is Matrix Multiplication Defined?
      • 5.11.2 Scalar Versus Matrix Multiplication
      • 5.11.3 Matrix Equations
      • 5.11.4 Least Square Estimation
      • 5.11.5 Matrix Manipulation
      • 5.11.6 Matrix Transpose
      • 5.11.7 Sample Statistics
      • 5.11.8 Least Square Estimation
      • 5.11.9 Eigenvalues and Eigenvectors
    • References
  • Chapter 6: Dimensionality Reduction
    • 6.1 Example: Reducing 2D to 1D
    • 6.2 Matrix Rotations
    • 6.3 Notation
    • 6.4 Summary (PCA vs. ICA vs. FA)
    • 6.5 Principal Component Analysis (PCA)
      • 6.5.1 Principal Components
    • 6.6 Independent Component Analysis (ICA)
    • 6.7 Factor Analysis (FA)
    • 6.8 Singular Value Decomposition (SVD)
    • 6.9 SVD Summary
    • 6.10 Case Study for Dimension Reduction (Parkinson's Disease)
    • 6.11 Assignments: 6. Dimensionality Reduction
      • 6.11.1 Parkinson's Disease Example
      • 6.11.2 Allometric Relations in Plants Example
        • Load Data
        • Dimensionality Reduction
    • References
  • Chapter 7: Lazy Learning: Classification Using Nearest Neighbors
    • 7.1 Motivation
    • 7.2 The kNN Algorithm Overview
      • 7.2.1 Distance Function and Dummy Coding
      • 7.2.2 Ways to Determine k
      • 7.2.3 Rescaling of the Features
      • 7.2.4 Rescaling Formulas
    • 7.3 Case Study
      • 7.3.1 Step 1: Collecting Data
      • 7.3.2 Step 2: Exploring and Preparing the Data
      • 7.3.3 Normalizing Data
      • 7.3.4 Data Preparation: Creating Training and Testing Datasets
      • 7.3.5 Step 3: Training a Model on the Data
      • 7.3.6 Step 4: Evaluating Model Performance
      • 7.3.7 Step 5: Improving Model Performance
      • 7.3.8 Testing Alternative Values of k
      • 7.3.9 Quantitative Assessment (Tables 7.2 and 7.3)
    • 7.4 Assignments: 7. Lazy Learning: Classification Using Nearest Neighbors
      • 7.4.1 Traumatic Brain Injury (TBI)
      • 7.4.2 Parkinson's Disease
      • 7.4.3 kNN Classification in a High-Dimensional Space
      • 7.4.4 kNN Classification in a Lower-Dimensional Space
    • References
  • Chapter 8: Probabilistic Learning: Classification Using Naive Bayes
    • 8.1 Overview of the Naive Bayes Algorithm
    • 8.2 Assumptions
    • 8.3 Bayes Formula
    • 8.4 The Laplace Estimator
    • 8.5 Case Study: Head and Neck Cancer Medication
      • 8.5.1 Step 1: Collecting Data
      • 8.5.2 Step 2: Exploring and Preparing the Data
        • Data Preparation: Processing Text Data for Analysis
        • Data Preparation: Creating Training and Test Datasets
        • Visualizing Text Data: Word Clouds
        • Data Preparation: Creating Indicator Features for Frequent Words
      • 8.5.3 Step 3: Training a Model on the Data
      • 8.5.4 Step 4: Evaluating Model Performance
      • 8.5.5 Step 5: Improving Model Performance
      • 8.5.6 Step 6: Compare Naive Bayesian against LDA
    • 8.6 Practice Problem
    • 8.7 Assignments: 8. Probabilistic Learning: Classification Using Naive Bayes
      • 8.7.1 Explain These Two Concepts
      • 8.7.2 Analyzing Textual Data
    • References
  • Chapter 9: Decision Tree Divide and Conquer Classification
    • 9.1 Motivation
    • 9.2 Hands-on Example: Iris Data
    • 9.3 Decision Tree Overview
      • 9.3.1 Divide and Conquer
      • 9.3.2 Entropy
      • 9.3.3 Misclassification Error and Gini Index
      • 9.3.4 C5.0 Decision Tree Algorithm
      • 9.3.5 Pruning the Decision Tree
    • 9.4 Case Study 1: Quality of Life and Chronic Disease
      • 9.4.1 Step 1: Collecting Data
      • 9.4.2 Step 2: Exploring and Preparing the Data
        • Data Preparation: Creating Random Training and Test Datasets
      • 9.4.3 Step 3: Training a Model on the Data
      • 9.4.4 Step 4: Evaluating Model Performance
      • 9.4.5 Step 5: Trial Option
      • 9.4.6 Loading the Misclassification Error Matrix
      • 9.4.7 Parameter Tuning
    • 9.5 Compare Different Impurity Indices
    • 9.6 Classification Rules
      • 9.6.1 Separate and Conquer
      • 9.6.2 The One Rule Algorithm
      • 9.6.3 The RIPPER Algorithm
    • 9.7 Case Study 2: QoL in Chronic Disease (Take 2)
      • 9.7.1 Step 3: Training a Model on the Data
      • 9.7.2 Step 4: Evaluating Model Performance
      • 9.7.3 Step 5: Alternative Model 1
      • 9.7.4 Step 5: Alternative Model 2
    • 9.8 Practice Problem
    • 9.9 Assignments: 9. Decision Tree Divide and Conquer Classification
      • 9.9.1 Explain These Concepts
      • 9.9.2 Decision Tree Partitioning
    • References
  • Chapter 10: Forecasting Numeric Data Using Regression Models
    • 10.1 Understanding Regression
      • 10.1.1 Simple Linear Regression
    • 10.2 Ordinary Least Squares Estimation
      • 10.2.1 Model Assumptions
      • 10.2.2 Correlations
      • 10.2.3 Multiple Linear Regression
    • 10.3 Case Study 1: Baseball Players
      • 10.3.1 Step 1: Collecting Data
      • 10.3.2 Step 2: Exploring and Preparing the Data
      • 10.3.3 Exploring Relationships Among Features: The Correlation Matrix
      • 10.3.4 Visualizing Relationships Among Features: The Scatterplot Matrix
      • 10.3.5 Step 3: Training a Model on the Data
      • 10.3.6 Step 4: Evaluating Model Performance
    • 10.4 Step 5: Improving Model Performance
      • 10.4.1 Model Specification: Adding Non-linear Relationships
      • 10.4.2 Transformation: Converting a Numeric Variable to a Binary Indicator
      • 10.4.3 Model Specification: Adding Interaction Effects
    • 10.5 Understanding Regression Trees and Model Trees
      • 10.5.1 Adding Regression to Trees
    • 10.6 Case Study 2: Baseball Players (Take 2)
      • 10.6.1 Step 2: Exploring and Preparing the Data
      • 10.6.2 Step 3: Training a Model on the Data
      • 10.6.3 Visualizing Decision Trees
      • 10.6.4 Step 4: Evaluating Model Performance
      • 10.6.5 Measuring Performance with Mean Absolute Error
      • 10.6.6 Step 5: Improving Model Performance
    • 10.7 Practice Problem: Heart Attack Data
    • 10.8 Assignments: 10. Forecasting Numeric Data Using Regression Models
    • References
  • Chapter 11: Black Box Machine-Learning Methods: Neural Networks and Support Vector Machines
    • 11.1 Understanding Neural Networks
      • 11.1.1 From Biological to Artificial Neurons
      • 11.1.2 Activation Functions
      • 11.1.3 Network Topology
      • 11.1.4 The Direction of Information Travel
      • 11.1.5 The Number of Nodes in Each Layer
      • 11.1.6 Training Neural Networks with Backpropagation
    • 11.2 Case Study 1: Google Trends and the Stock Market: Regression
      • 11.2.1 Step 1: Collecting Data
        • Variables
      • 11.2.2 Step 2: Exploring and Preparing the Data
      • 11.2.3 Step 3: Training a Model on the Data
      • 11.2.4 Step 4: Evaluating Model Performance
      • 11.2.5 Step 5: Improving Model Performance
      • 11.2.6 Step 6: Adding Additional Layers
    • 11.3 Simple NN Demo: Learning to Compute
    • 11.4 Case Study 2: Google Trends and the Stock Market - Classification
    • 11.5 Support Vector Machines (SVM)
      • 11.5.1 Classification with Hyperplanes
        • Finding the Maximum Margin
        • Linearly Separable Data
        • Non-linearly Separable Data
        • Using Kernels for Non-linear Spaces
    • 11.6 Case Study 3: Optical Character Recognition (OCR)
      • 11.6.1 Step 1: Prepare and Explore the Data
      • 11.6.2 Step 2: Training an SVM Model
      • 11.6.3 Step 3: Evaluating Model Performance
      • 11.6.4 Step 4: Improving Model Performance
    • 11.7 Case Study 4: Iris Flowers
      • 11.7.1 Step 1: Collecting Data
      • 11.7.2 Step 2: Exploring and Preparing the Data
      • 11.7.3 Step 3: Training a Model on the Data
      • 11.7.4 Step 4: Evaluating Model Performance
      • 11.7.5 Step 5: RBF Kernel Function
      • 11.7.6 Parameter Tuning
      • 11.7.7 Improving the Performance of Gaussian Kernels
    • 11.8 Practice
      • 11.8.1 Problem 1: Google Trends and the Stock Market
      • 11.8.2 Problem 2: Quality of Life and Chronic Disease
    • 11.9 Appendix
    • 11.10 Assignments: 11. Black Box Machine-Learning Methods: Neural Networks and Support Vector Machines
      • 11.10.1 Learn and Predict a Power-Function
      • 11.10.2 Pediatric Schizophrenia Study
    • References
  • Chapter 12: Apriori Association Rules Learning
    • 12.1 Association Rules
    • 12.2 The Apriori Algorithm for Association Rule Learning
    • 12.3 Measuring Rule Importance by Using Support and Confidence
    • 12.4 Building a Set of Rules with the Apriori Principle
    • 12.5 A Toy Example
    • 12.6 Case Study 1: Head and Neck Cancer Medications
      • 12.6.1 Step 1: Collecting Data
      • 12.6.2 Step 2: Exploring and Preparing the Data
        • Visualizing Item Support: Item Frequency Plots
        • Visualizing Transaction Data: Plotting the Sparse Matrix
      • 12.6.3 Step 3: Training a Model on the Data
      • 12.6.4 Step 4: Evaluating Model Performance
      • 12.6.5 Step 5: Improving Model Performance
        • Sorting the Set of Association Rules
        • Taking Subsets of Association Rules
        • Saving Association Rules to a File or Data Frame
    • 12.7 Practice Problems: Groceries
    • 12.8 Summary
    • 12.9 Assignments: 12. Apriori Association Rules Learning
    • References
  • Chapter 13: k-Means Clustering
    • 13.1 Clustering as a Machine Learning Task
    • 13.2 Silhouette Plots
    • 13.3 The k-Means Clustering Algorithm
      • 13.3.1 Using Distance to Assign and Update Clusters
      • 13.3.2 Choosing the Appropriate Number of Clusters
    • 13.4 Case Study 1: Divorce and Consequences on Young Adults
      • 13.4.1 Step 1: Collecting Data
        • Variables
      • 13.4.2 Step 2: Exploring and Preparing the Data
      • 13.4.3 Step 3: Training a Model on the Data
      • 13.4.4 Step 4: Evaluating Model Performance
      • 13.4.5 Step 5: Usage of Cluster Information
    • 13.5 Model Improvement
      • 13.5.1 Tuning the Parameter k
    • 13.6 Case Study 2: Pediatric Trauma
      • 13.6.1 Step 1: Collecting Data
      • 13.6.2 Step 2: Exploring and Preparing the Data
      • 13.6.3 Step 3: Training a Model on the Data
      • 13.6.4 Step 4: Evaluating Model Performance
      • 13.6.5 Practice Problem: Youth Development
    • 13.7 Hierarchical Clustering
    • 13.8 Gaussian Mixture Models
    • 13.9 Summary
    • 13.10 Assignments: 13. k-Means Clustering
    • References
  • Chapter 14: Model Performance Assessment
    • 14.1 Measuring the Performance of Classification Methods
    • 14.2 Evaluation Strategies
      • 14.2.1 Binary Outcomes
      • 14.2.2 Confusion Matrices
      • 14.2.3 Other Measures of Performance Beyond Accuracy
      • 14.2.4 The Kappa (κ) Statistic
        • Summary of the Kappa Score for Calculating Prediction Accuracy
      • 14.2.5 Computation of Observed Accuracy and Expected Accuracy
      • 14.2.6 Sensitivity and Specificity
      • 14.2.7 Precision and Recall
      • 14.2.8 The F-Measure
    • 14.3 Visualizing Performance Tradeoffs (ROC Curve)
    • 14.4 Estimating Future Performance (Internal Statistical Validation)
      • 14.4.1 The Holdout Method
      • 14.4.2 Cross-Validation
      • 14.4.3 Bootstrap Sampling
    • 14.5 Assignment: 14. Evaluation of Model Performance
    • References
  • Chapter 15: Improving Model Performance
    • 15.1 Improving Model Performance by Parameter Tuning
    • 15.2 Using caret for Automated Parameter Tuning
      • 15.2.1 Customizing the Tuning Process
      • 15.2.2 Improving Model Performance with Meta-learning
      • 15.2.3 Bagging
      • 15.2.4 Boosting
      • 15.2.5 Random Forests
        • Training Random Forests
        • Evaluating Random Forest Performance
      • 15.2.6 Adaptive Boosting
    • 15.3 Assignment: 15. Improving Model Performance
      • 15.3.1 Model Improvement Case Study
    • References
  • Chapter 16: Specialized Machine Learning Topics
    • 16.1 Working with Specialized Data and Databases
      • 16.1.1 Data Format Conversion
      • 16.1.2 Querying Data in SQL Databases
      • 16.1.3 Real Random Number Generation
      • 16.1.4 Downloading the Complete Text of Web Pages
      • 16.1.5 Reading and Writing XML with the XML Package
      • 16.1.6 Web-Page Data Scraping
      • 16.1.7 Parsing JSON from Web APIs
      • 16.1.8 Reading and Writing Microsoft Excel Spreadsheets Using XLSX
    • 16.2 Working with Domain-Specific Data
      • 16.2.1 Working with Bioinformatics Data
      • 16.2.2 Visualizing Network Data
    • 16.3 Data Streaming
      • 16.3.1 Definition
      • 16.3.2 The stream Package
      • 16.3.3 Synthetic Example: Random Gaussian Stream
        • k-Means Clustering
      • 16.3.4 Sources of Data Streams
        • Static Structure Streams
        • Concept Drift Streams
        • Real Data Streams
      • 16.3.5 Printing, Plotting and Saving Streams
      • 16.3.6 Stream Animation
      • 16.3.7 Case-Study: SOCR Knee Pain Data
      • 16.3.8 Data Stream Clustering and Classification (DSC)
      • 16.3.9 Evaluation of Data Stream Clustering
    • 16.4 Optimization and Improving the Computational Performance
      • 16.4.1 Generalizing Tabular Data Structures with dplyr
      • 16.4.2 Making Data Frames Faster with data.table
      • 16.4.3 Creating Disk-Based Data Frames with ff
      • 16.4.4 Using Massive Matrices with bigmemory
    • 16.5 Parallel Computing
      • 16.5.1 Measuring Execution Time
      • 16.5.2 Parallel Processing with Multiple Cores
      • 16.5.3 Parallelization Using foreach and doParallel
      • 16.5.4 GPU Computing
    • 16.6 Deploying Optimized Learning Algorithms
      • 16.6.1 Building Bigger Regression Models with biglm
      • 16.6.2 Growing Bigger and Faster Random Forests with bigrf
      • 16.6.3 Training and Evaluation Models in Parallel with caret
    • 16.7 Practice Problem
    • 16.8 Assignment: 16. Specialized Machine Learning Topics
      • 16.8.1 Working with Website Data
      • 16.8.2 Network Data and Visualization
      • 16.8.3 Data Conversion and Parallel Computing
    • References
  • Chapter 17: Variable/Feature Selection
    • 17.1 Feature Selection Methods
      • 17.1.1 Filtering Techniques
      • 17.1.2 Wrapper Methods
      • 17.1.3 Embedded Techniques
    • 17.2 Case Study: ALS
      • 17.2.1 Step 1: Collecting Data
      • 17.2.2 Step 2: Exploring and Preparing the Data
      • 17.2.3 Step 3: Training a Model on the Data
      • 17.2.4 Step 4: Evaluating Model Performance
        • Comparing with RFE
        • Comparing with Stepwise Feature Selection
    • 17.3 Practice Problem
    • 17.4 Assignment: 17. Variable/Feature Selection
      • 17.4.1 Wrapper Feature Selection
      • 17.4.2 Use the PPMI Dataset
    • References
  • Chapter 18: Regularized Linear Modeling and Controlled Variable Selection
    • 18.1 Questions
    • 18.2 Matrix Notation
    • 18.3 Regularized Linear Modeling
      • 18.3.1 Ridge Regression
      • 18.3.2 Least Absolute Shrinkage and Selection Operator (LASSO) Regression
      • 18.3.3 Predictor Standardization
      • 18.3.4 Estimation Goals
    • 18.4 Linear Regression
      • 18.4.1 Drawbacks of Linear Regression
      • 18.4.2 Assessing Prediction Accuracy
      • 18.4.3 Estimating the Prediction Error
      • 18.4.4 Improving the Prediction Accuracy
      • 18.4.5 Variable Selection
    • 18.5 Regularization Framework
      • 18.5.1 Role of the Penalty Term
      • 18.5.2 Role of the Regularization Parameter
      • 18.5.3 LASSO
      • 18.5.4 General Regularization Framework
    • 18.6 Implementation of Regularization
      • 18.6.1 Example: Neuroimaging-Genetics Study of Parkinson's Disease Dataset
      • 18.6.2 Computational Complexity
      • 18.6.3 LASSO and Ridge Solution Paths
      • 18.6.4 Choice of the Regularization Parameter
      • 18.6.5 Cross Validation Motivation
      • 18.6.6 n-Fold Cross Validation
      • 18.6.7 LASSO 10-Fold Cross Validation
      • 18.6.8 Stepwise OLS (Ordinary Least Squares)
      • 18.6.9 Final Models
      • 18.6.10 Model Performance
      • 18.6.11 Comparing Selected Features
      • 18.6.12 Summary
    • 18.7 Knock-off Filtering: Simulated Example
      • 18.7.1 Notes
    • 18.8 PD Neuroimaging-Genetics Case-Study
      • 18.8.1 Fetching, Cleaning and Preparing the Data
      • 18.8.2 Preparing the Response Vector
      • 18.8.3 False Discovery Rate (FDR)
        • Graphical Interpretation of the Benjamini-Hochberg (BH) Method
        • FDR Adjusting the p-Values
      • 18.8.4 Running the Knockoff Filter
    • 18.9 Assignment: 18. Regularized Linear Modeling and Knockoff Filtering
    • References
  • Chapter 19: Big Longitudinal Data Analysis
    • 19.1 Time Series Analysis
      • 19.1.1 Step 1: Plot Time Series
      • 19.1.2 Step 2: Find Proper Parameter Values for ARIMA Model
      • 19.1.3 Check the Differencing Parameter
      • 19.1.4 Identifying the AR and MA Parameters
      • 19.1.5 Step 3: Build an ARIMA Model
      • 19.1.6 Step 4: Forecasting with ARIMA Model
    • 19.2 Structural Equation Modeling (SEM)-Latent Variables
      • 19.2.1 Foundations of SEM
      • 19.2.2 SEM Components
      • 19.2.3 Case Study - Parkinson's Disease (PD)
        • Step 1 - Collecting Data
        • Step 2 - Exploring and Preparing the Data
        • Step 3 - Fitting a Model on the Data
      • 19.2.4 Outputs of Lavaan SEM
    • 19.3 Longitudinal Data Analysis-Linear Mixed Models
      • 19.3.1 Mean Trend
      • 19.3.2 Modeling the Correlation
    • 19.4 GLMM/GEE Longitudinal Data Analysis
      • 19.4.1 GEE Versus GLMM
    • 19.5 Assignment: 19. Big Longitudinal Data Analysis
      • 19.5.1 Imaging Data
      • 19.5.2 Time Series Analysis
      • 19.5.3 Latent Variables Model
    • References
  • Chapter 20: Natural Language Processing/Text Mining
    • 20.1 A Simple NLP/TM Example
      • 20.1.1 Define and Load the Unstructured-Text Documents
      • 20.1.2 Create a New VCorpus Object
      • 20.1.3 To-Lower Case Transformation
      • 20.1.4 Text Pre-processing
        • Remove Stopwords
        • Remove Punctuation
        • Stemming: Removal of Plurals and Action Suffixes
      • 20.1.5 Bags of Words
      • 20.1.6 Document Term Matrix
    • 20.2 Case-Study: Job Ranking
      • 20.2.1 Step 1: Make a VCorpus Object
      • 20.2.2 Step 2: Clean the VCorpus Object
      • 20.2.3 Step 3: Build the Document Term Matrix
      • 20.2.4 Area Under the ROC Curve
    • 20.3 TF-IDF
      • 20.3.1 Term Frequency (TF)
      • 20.3.2 Inverse Document Frequency (IDF)
      • 20.3.3 TF-IDF
    • 20.4 Cosine Similarity
    • 20.5 Sentiment Analysis
      • 20.5.1 Data Preprocessing
      • 20.5.2 NLP/TM Analytics
      • 20.5.3 Prediction Optimization
    • 20.6 Assignment: 20. Natural Language Processing/Text Mining
      • 20.6.1 Mining Twitter Data
      • 20.6.2 Mining Cancer Clinical Notes
    • References
  • Chapter 21: Prediction and Internal Statistical Cross Validation
    • 21.1 Forecasting Types and Assessment Approaches
    • 21.2 Overfitting
      • 21.2.1 Example (US Presidential Elections)
      • 21.2.2 Example (Google Flu Trends)
      • 21.2.3 Example (Autism)
    • 21.3 Internal Statistical Cross-Validation is an Iterative Process
    • 21.4 Example (Linear Regression)
      • 21.4.1 Cross-Validation Methods
      • 21.4.2 Exhaustive Cross-Validation
      • 21.4.3 Non-Exhaustive Cross-Validation
    • 21.5 Case-Studies
      • 21.5.1 Example 1: Prediction of Parkinson's Disease Using Adaptive Boosting (AdaBoost)
      • 21.5.2 Example 2: Sleep Dataset
      • 21.5.3 Example 3: Model-Based (Linear Regression) Prediction Using the Attitude Dataset
      • 21.5.4 Example 4: Parkinson's Data (ppmi_data)
    • 21.6 Summary of CV Output
    • 21.7 Alternative Predictor Functions
      • 21.7.1 Logistic Regression
      • 21.7.2 Quadratic Discriminant Analysis (QDA)
      • 21.7.3 Foundation of LDA and QDA for Prediction, Dimensionality Reduction, and Forecasting
        • LDA (Linear Discriminant Analysis)
        • QDA (Quadratic Discriminant Analysis)
      • 21.7.4 Neural Networks
      • 21.7.5 SVM
      • 21.7.6 k-Nearest Neighbors Algorithm (k-NN)
      • 21.7.7 k-Means Clustering (k-MC)
      • 21.7.8 Spectral Clustering
        • Iris Petal Data
        • Spirals Data
        • Income Data
    • 21.8 Compare the Results
    • 21.9 Assignment: 21. Prediction and Internal Statistical Cross-Validation
    • References
  • Chapter 22: Function Optimization
    • 22.1 Free (Unconstrained) Optimization
      • 22.1.1 Example 1: Minimizing a Univariate Function (Inverse-CDF)
      • 22.1.2 Example 2: Minimizing a Bivariate Function
      • 22.1.3 Example 3: Using Simulated Annealing to Find the Maximum of an Oscillatory Function
    • 22.2 Constrained Optimization
      • 22.2.1 Equality Constraints
      • 22.2.2 Lagrange Multipliers
      • 22.2.3 Inequality Constrained Optimization
        • Linear Programming (LP)
        • Mixed Integer Linear Programming (MILP)
      • 22.2.4 Quadratic Programming (QP)
    • 22.3 General Non-linear Optimization
      • 22.3.1 Dual Problem Optimization
        • Motivation
        • Example 1: Linear Example
        • Example 2: Quadratic Example
        • Example 3: More Complex Non-linear Optimization
        • Example 4: Another Linear Example
    • 22.4 Manual Versus Automated Lagrange Multiplier Optimization
    • 22.5 Data Denoising
    • 22.6 Assignment: 22. Function Optimization
      • 22.6.1 Unconstrained Optimization
      • 22.6.2 Linear Programming (LP)
      • 22.6.3 Mixed Integer Linear Programming (MILP)
      • 22.6.4 Quadratic Programming (QP)
      • 22.6.5 Complex Non-linear Optimization
      • 22.6.6 Data Denoising
    • References
  • Chapter 23: Deep Learning, Neural Networks
    • 23.1 Deep Learning Training
      • 23.1.1 Perceptrons
    • 23.2 Biological Relevance
    • 23.3 Simple Neural Net Examples
      • 23.3.1 Exclusive OR (XOR) Operator
      • 23.3.2 NAND Operator
      • 23.3.3 Complex Networks Designed Using Simple Building Blocks
    • 23.4 Classification
      • 23.4.1 Sonar Data Example
      • 23.4.2 MXNet Notes
    • 23.5 Case-Studies
      • 23.5.1 ALS Regression Example
      • 23.5.2 Spirals 2D Data
      • 23.5.3 IBS Study
      • 23.5.4 Country QoL Ranking Data
      • 23.5.5 Handwritten Digits Classification
        • Configuring the Neural Network
        • Training
        • Forecasting
        • Examining the Network Structure Using LeNet
    • 23.6 Classifying Real-World Images
      • 23.6.1 Load the Pre-trained Model
      • 23.6.2 Load, Preprocess and Classify New Images - US Weather Pattern
      • 23.6.3 Lake Mapourika, New Zealand
      • 23.6.4 Beach Image
      • 23.6.5 Volcano
      • 23.6.6 Brain Surface
      • 23.6.7 Face Mask
    • 23.7 Assignment: 23. Deep Learning, Neural Networks
      • 23.7.1 Deep Learning Classification
      • 23.7.2 Deep Learning Regression
      • 23.7.3 Image Classification
    • References
  • Summary
  • Glossary
  • Index