- Type of Document: Book
- Publisher: Springer International Publishing AG, Switzerland, 2018
- Keywords: Big data; Mathematical statistics; Medical records -- Data processing; R (Computer program language)
- Foreword
- Preface
- Genesis
- Purpose
- Limitations/Prerequisites
- Scope of the Book
- Acknowledgements
- DSPA Application and Use Disclaimer
- Biomedical, Biosocial, Environmental, and Health Disclaimer
- Notations
- Contents
- Chapter 1: Motivation
- 1.1 DSPA Mission and Objectives
- 1.2 Examples of Driving Motivational Problems and Challenges
- 1.2.1 Alzheimer's Disease
- 1.2.2 Parkinson's Disease
- 1.2.3 Drug and Substance Use
- 1.2.4 Amyotrophic Lateral Sclerosis
- 1.2.5 Normal Brain Visualization
- 1.2.6 Neurodegeneration
- 1.2.7 Genetic Forensics: 2013-2016 Ebola Outbreak
- 1.2.8 Next Generation Sequence (NGS) Analysis
- 1.2.9 Neuroimaging-Genetics
- 1.3 Common Characteristics of Big (Biomedical and Health) Data
- 1.4 Data Science
- 1.5 Predictive Analytics
- 1.6 High-Throughput Big Data Analytics
- 1.7 Examples of Data Repositories, Archives, and Services
- 1.8 DSPA Expectations
- Chapter 2: Foundations of R
- 2.1 Why Use R?
- 2.2 Getting Started
- 2.2.1 Install Basic Shell-Based R
- 2.2.2 GUI Based R Invocation (RStudio)
- 2.2.3 RStudio GUI Layout
- 2.2.4 Some Notes
- 2.3 Help
- 2.4 Simple Wide-to-Long Data Format Translation
- 2.5 Data Generation
- 2.6 Input/Output (I/O)
- 2.7 Slicing and Extracting Data
- 2.8 Variable Conversion
- 2.9 Variable Information
- 2.10 Data Selection and Manipulation
- 2.11 Math Functions
- 2.12 Matrix Operations
- 2.13 Advanced Data Processing
- 2.14 Strings
- 2.15 Plotting
- 2.16 QQ Normal Probability Plot
- 2.17 Low-Level Plotting Commands
- 2.18 Graphics Parameters
- 2.19 Optimization and Model Fitting
- 2.20 Statistics
- 2.21 Distributions
- 2.21.1 Programming
- 2.22 Data Simulation Primer
- 2.23 Appendix
- 2.23.1 HTML SOCR Data Import
- 2.23.2 R Debugging
- Example
- 2.24 Assignments: 2. R Foundations
- 2.24.1 Confirm that You Have Installed R/RStudio
- 2.24.2 Long-to-Wide Data Format Translation
- 2.24.3 Data Frames
- 2.24.4 Data Stratification
- 2.24.5 Simulation
- 2.24.6 Programming
- References
- Chapter 3: Managing Data in R
- 3.1 Saving and Loading R Data Structures
- 3.2 Importing and Saving Data from CSV Files
- 3.3 Exploring the Structure of Data
- 3.4 Exploring Numeric Variables
- 3.5 Measuring the Central Tendency: Mean, Median, Mode
- 3.6 Measuring Spread: Quartiles and the Five-Number Summary
- 3.7 Visualizing Numeric Variables: Boxplots
- 3.8 Visualizing Numeric Variables: Histograms
- 3.9 Understanding Numeric Data: Uniform and Normal Distributions
- 3.10 Measuring Spread: Variance and Standard Deviation
- 3.11 Exploring Categorical Variables
- 3.12 Exploring Relationships Between Variables
- 3.13 Missing Data
- 3.13.1 Simulate Some Real Multivariate Data
- 3.13.2 TBI Data Example
- 3.13.3 Imputation via Expectation-Maximization
- Types of Missing Data
- General Idea of EM Algorithm
- EM-Based Imputation
- A Simple Manual Implementation of EM-Based Imputation
- Plotting Complete and Imputed Data
- Validation of EM-Imputation Using the Amelia R Package
- Comparison
- Density Plots
- 3.14 Parsing Webpages and Visualizing Tabular HTML Data
- 3.15 Cohort-Rebalancing (for Imbalanced Groups)
- 3.16 Appendix
- 3.16.1 Importing Data from SQL Databases
- 3.16.2 R Code Fragments
- 3.17 Assignments: 3. Managing Data in R
- 3.17.1 Import, Plot, Summarize and Save Data
- 3.17.2 Explore Some Bivariate Relations in the Data
- 3.17.3 Missing Data
- 3.17.4 Surface Plots
- 3.17.5 Unbalanced Designs
- 3.17.6 Aggregate Analysis
- References
- Chapter 4: Data Visualization
- 4.1 Common Questions
- 4.2 Classification of Visualization Methods
- 4.3 Composition
- 4.3.1 Histograms and Density Plots
- 4.3.2 Pie Chart
- 4.3.3 Heat Map
- 4.4 Comparison
- 4.4.1 Paired Scatter Plots
- 4.4.2 Jitter Plot
- 4.4.3 Bar Plots
- 4.4.4 Trees and Graphs
- 4.4.5 Correlation Plots
- 4.5 Relationships
- 4.5.1 Line Plots Using ggplot
- 4.5.2 Density Plots
- 4.5.3 Distributions
- 4.5.4 2D Kernel Density and 3D Surface Plots
- 4.5.5 Multiple 2D Image Surface Plots
- 4.5.6 3D and 4D Visualizations
- 4.6 Appendix
- 4.6.1 Hands-on Activity (Health Behavior Risks)
- 4.6.2 Additional ggplot Examples
- Housing Price Data
- Modeling the Home Price Index Data (Fig. 4.48)
- Map of the Neighborhoods of Los Angeles (LA)
- Latin Letter Frequency in Different Languages
- 4.7 Assignments 4: Data Visualization
- 4.7.1 Common Plots
- 4.7.2 Trees and Graphs
- 4.7.3 Exploratory Data Analytics (EDA)
- References
- Chapter 5: Linear Algebra and Matrix Computing
- 5.1 Matrices (Second Order Tensors)
- 5.1.1 Create Matrices
- 5.1.2 Adding Columns and Rows
- 5.2 Matrix Subscripts
- 5.3 Matrix Operations
- 5.3.1 Addition
- 5.3.2 Subtraction
- 5.3.3 Multiplication
- Elementwise Multiplication
- Matrix Multiplication
- 5.3.4 Element-wise Division
- 5.3.5 Transpose
- 5.3.6 Multiplicative Inverse
- 5.4 Matrix Algebra Notation
- 5.4.1 Linear Models
- 5.4.2 Solving Systems of Equations
- 5.4.3 The Identity Matrix
- 5.5 Scalars, Vectors and Matrices
- 5.5.1 Sample Statistics (Mean, Variance)
- Mean
- Variance
- Applications of Matrix Algebra: Linear Modeling
- Finding Function Extrema (Min/Max) Using Calculus
- 5.5.2 Least Square Estimation
- The R lm Function
- 5.6 Eigenvalues and Eigenvectors
- 5.7 Other Important Functions
- 5.8 Matrix Notation (Another View)
- 5.9 Multivariate Linear Regression
- 5.10 Sample Covariance Matrix
- 5.11 Assignments: 5. Linear Algebra and Matrix Computing
- 5.11.1 How Is Matrix Multiplication Defined?
- 5.11.2 Scalar Versus Matrix Multiplication
- 5.11.3 Matrix Equations
- 5.11.4 Least Square Estimation
- 5.11.5 Matrix Manipulation
- 5.11.6 Matrix Transpose
- 5.11.7 Sample Statistics
- 5.11.8 Least Square Estimation
- 5.11.9 Eigenvalues and Eigenvectors
- References
- Chapter 6: Dimensionality Reduction
- 6.1 Example: Reducing 2D to 1D
- 6.2 Matrix Rotations
- 6.3 Notation
- 6.4 Summary (PCA vs. ICA vs. FA)
- 6.5 Principal Component Analysis (PCA)
- 6.5.1 Principal Components
- 6.6 Independent Component Analysis (ICA)
- 6.7 Factor Analysis (FA)
- 6.8 Singular Value Decomposition (SVD)
- 6.9 SVD Summary
- 6.10 Case Study for Dimension Reduction (Parkinson's Disease)
- 6.11 Assignments: 6. Dimensionality Reduction
- 6.11.1 Parkinson's Disease Example
- 6.11.2 Allometric Relations in Plants Example
- Load Data
- Dimensionality Reduction
- References
- Chapter 7: Lazy Learning: Classification Using Nearest Neighbors
- 7.1 Motivation
- 7.2 The kNN Algorithm Overview
- 7.2.1 Distance Function and Dummy Coding
- 7.2.2 Ways to Determine k
- 7.2.3 Rescaling of the Features
- 7.2.4 Rescaling Formulas
- 7.3 Case Study
- 7.3.1 Step 1: Collecting Data
- 7.3.2 Step 2: Exploring and Preparing the Data
- 7.3.3 Normalizing Data
- 7.3.4 Data Preparation: Creating Training and Testing Datasets
- 7.3.5 Step 3: Training a Model on the Data
- 7.3.6 Step 4: Evaluating Model Performance
- 7.3.7 Step 5: Improving Model Performance
- 7.3.8 Testing Alternative Values of k
- 7.3.9 Quantitative Assessment (Tables 7.2 and 7.3)
- 7.4 Assignments: 7. Lazy Learning: Classification Using Nearest Neighbors
- 7.4.1 Traumatic Brain Injury (TBI)
- 7.4.2 Parkinson's Disease
- 7.4.3 KNN Classification in a High Dimensional Space
- 7.4.4 KNN Classification in a Lower Dimensional Space
- References
- Chapter 8: Probabilistic Learning: Classification Using Naive Bayes
- 8.1 Overview of the Naive Bayes Algorithm
- 8.2 Assumptions
- 8.3 Bayes Formula
- 8.4 The Laplace Estimator
- 8.5 Case Study: Head and Neck Cancer Medication
- 8.5.1 Step 1: Collecting Data
- 8.5.2 Step 2: Exploring and Preparing the Data
- Data Preparation: Processing Text Data for Analysis
- Data Preparation: Creating Training and Test Datasets
- Visualizing Text Data: Word Clouds
- Data Preparation: Creating Indicator Features for Frequent Words
- 8.5.3 Step 3: Training a Model on the Data
- 8.5.4 Step 4: Evaluating Model Performance
- 8.5.5 Step 5: Improving Model Performance
- 8.5.6 Step 6: Compare Naive Bayes Against LDA
- 8.6 Practice Problem
- 8.7 Assignments 8: Probabilistic Learning: Classification Using Naive Bayes
- 8.7.1 Explain These Two Concepts
- 8.7.2 Analyzing Textual Data
- References
- Chapter 9: Decision Tree Divide and Conquer Classification
- 9.1 Motivation
- 9.2 Hands-on Example: Iris Data
- 9.3 Decision Tree Overview
- 9.3.1 Divide and Conquer
- 9.3.2 Entropy
- 9.3.3 Misclassification Error and Gini Index
- 9.3.4 C5.0 Decision Tree Algorithm
- 9.3.5 Pruning the Decision Tree
- 9.4 Case Study 1: Quality of Life and Chronic Disease
- 9.4.1 Step 1: Collecting Data
- 9.4.2 Step 2: Exploring and Preparing the Data
- Data Preparation: Creating Random Training and Test Datasets
- 9.4.3 Step 3: Training a Model on the Data
- 9.4.4 Step 4: Evaluating Model Performance
- 9.4.5 Step 5: Trial Option
- 9.4.6 Loading the Misclassification Error Matrix
- 9.4.7 Parameter Tuning
- 9.5 Compare Different Impurity Indices
- 9.6 Classification Rules
- 9.6.1 Separate and Conquer
- 9.6.2 The One Rule Algorithm
- 9.6.3 The RIPPER Algorithm
- 9.7 Case Study 2: QoL in Chronic Disease (Take 2)
- 9.7.1 Step 3: Training a Model on the Data
- 9.7.2 Step 4: Evaluating Model Performance
- 9.7.3 Step 5: Alternative Model 1
- 9.7.4 Step 5: Alternative Model 2
- 9.8 Practice Problem
- 9.9 Assignments 9: Decision Tree Divide and Conquer Classification
- 9.9.1 Explain These Concepts
- 9.9.2 Decision Tree Partitioning
- References
- Chapter 10: Forecasting Numeric Data Using Regression Models
- 10.1 Understanding Regression
- 10.1.1 Simple Linear Regression
- 10.2 Ordinary Least Squares Estimation
- 10.2.1 Model Assumptions
- 10.2.2 Correlations
- 10.2.3 Multiple Linear Regression
- 10.3 Case Study 1: Baseball Players
- 10.3.1 Step 1: Collecting Data
- 10.3.2 Step 2: Exploring and Preparing the Data
- 10.3.3 Exploring Relationships Among Features: The Correlation Matrix
- 10.3.4 Visualizing Relationships Among Features: The Scatterplot Matrix
- 10.3.5 Step 3: Training a Model on the Data
- 10.3.6 Step 4: Evaluating Model Performance
- 10.4 Step 5: Improving Model Performance
- 10.4.1 Model Specification: Adding Non-linear Relationships
- 10.4.2 Transformation: Converting a Numeric Variable to a Binary Indicator
- 10.4.3 Model Specification: Adding Interaction Effects
- 10.5 Understanding Regression Trees and Model Trees
- 10.5.1 Adding Regression to Trees
- 10.6 Case Study 2: Baseball Players (Take 2)
- 10.6.1 Step 2: Exploring and Preparing the Data
- 10.6.2 Step 3: Training a Model on the Data
- 10.6.3 Visualizing Decision Trees
- 10.6.4 Step 4: Evaluating Model Performance
- 10.6.5 Measuring Performance with Mean Absolute Error
- 10.6.6 Step 5: Improving Model Performance
- 10.7 Practice Problem: Heart Attack Data
- 10.8 Assignments: 10. Forecasting Numeric Data Using Regression Models
- References
- Chapter 11: Black Box Machine-Learning Methods: Neural Networks and Support Vector Machines
- 11.1 Understanding Neural Networks
- 11.1.1 From Biological to Artificial Neurons
- 11.1.2 Activation Functions
- 11.1.3 Network Topology
- 11.1.4 The Direction of Information Travel
- 11.1.5 The Number of Nodes in Each Layer
- 11.1.6 Training Neural Networks with Backpropagation
- 11.2 Case Study 1: Google Trends and the Stock Market: Regression
- 11.2.1 Step 1: Collecting Data
- Variables
- 11.2.2 Step 2: Exploring and Preparing the Data
- 11.2.3 Step 3: Training a Model on the Data
- 11.2.4 Step 4: Evaluating Model Performance
- 11.2.5 Step 5: Improving Model Performance
- 11.2.6 Step 6: Adding Additional Layers
- 11.3 Simple NN Demo: Learning to Compute
- 11.4 Case Study 2: Google Trends and the Stock Market - Classification
- 11.5 Support Vector Machines (SVM)
- 11.5.1 Classification with Hyperplanes
- Finding the Maximum Margin
- Linearly Separable Data
- Non-linearly Separable Data
- Using Kernels for Non-linear Spaces
- 11.6 Case Study 3: Optical Character Recognition (OCR)
- 11.6.1 Step 1: Prepare and Explore the Data
- 11.6.2 Step 2: Training an SVM Model
- 11.6.3 Step 3: Evaluating Model Performance
- 11.6.4 Step 4: Improving Model Performance
- 11.7 Case Study 4: Iris Flowers
- 11.7.1 Step 1: Collecting Data
- 11.7.2 Step 2: Exploring and Preparing the Data
- 11.7.3 Step 3: Training a Model on the Data
- 11.7.4 Step 4: Evaluating Model Performance
- 11.7.5 Step 5: RBF Kernel Function
- 11.7.6 Parameter Tuning
- 11.7.7 Improving the Performance of Gaussian Kernels
- 11.8 Practice
- 11.8.1 Problem 1 Google Trends and the Stock Market
- 11.8.2 Problem 2: Quality of Life and Chronic Disease
- 11.9 Appendix
- 11.10 Assignments: 11. Black Box Machine-Learning Methods: Neural Networks and Support Vector Machines
- 11.10.1 Learn and Predict a Power-Function
- 11.10.2 Pediatric Schizophrenia Study
- References
- Chapter 12: Apriori Association Rules Learning
- 12.1 Association Rules
- 12.2 The Apriori Algorithm for Association Rule Learning
- 12.3 Measuring Rule Importance by Using Support and Confidence
- 12.4 Building a Set of Rules with the Apriori Principle
- 12.5 A Toy Example
- 12.6 Case Study 1: Head and Neck Cancer Medications
- 12.6.1 Step 1: Collecting Data
- 12.6.2 Step 2: Exploring and Preparing the Data
- Visualizing Item Support: Item Frequency Plots
- Visualizing Transaction Data: Plotting the Sparse Matrix
- 12.6.3 Step 3: Training a Model on the Data
- 12.6.4 Step 4: Evaluating Model Performance
- 12.6.5 Step 5: Improving Model Performance
- Sorting the Set of Association Rules
- Taking Subsets of Association Rules
- Saving Association Rules to a File or Data Frame
- 12.7 Practice Problems: Groceries
- 12.8 Summary
- 12.9 Assignments: 12. Apriori Association Rules Learning
- References
- Chapter 13: k-Means Clustering
- 13.1 Clustering as a Machine Learning Task
- 13.2 Silhouette Plots
- 13.3 The k-Means Clustering Algorithm
- 13.3.1 Using Distance to Assign and Update Clusters
- 13.3.2 Choosing the Appropriate Number of Clusters
- 13.4 Case Study 1: Divorce and Consequences on Young Adults
- 13.4.1 Step 1: Collecting Data
- Variables
- 13.4.2 Step 2: Exploring and Preparing the Data
- 13.4.3 Step 3: Training a Model on the Data
- 13.4.4 Step 4: Evaluating Model Performance
- 13.4.5 Step 5: Usage of Cluster Information
- 13.5 Model Improvement
- 13.5.1 Tuning the Parameter k
- 13.6 Case Study 2: Pediatric Trauma
- 13.6.1 Step 1: Collecting Data
- 13.6.2 Step 2: Exploring and Preparing the Data
- 13.6.3 Step 3: Training a Model on the Data
- 13.6.4 Step 4: Evaluating Model Performance
- 13.6.5 Practice Problem: Youth Development
- 13.7 Hierarchical Clustering
- 13.8 Gaussian Mixture Models
- 13.9 Summary
- 13.10 Assignments: 13. k-Means Clustering
- References
- Chapter 14: Model Performance Assessment
- 14.1 Measuring the Performance of Classification Methods
- 14.2 Evaluation Strategies
- 14.2.1 Binary Outcomes
- 14.2.2 Confusion Matrices
- 14.2.3 Other Measures of Performance Beyond Accuracy
- 14.2.4 The Kappa (κ) Statistic
- Summary of the Kappa Score for Calculating Prediction Accuracy
- 14.2.5 Computation of Observed Accuracy and Expected Accuracy
- 14.2.6 Sensitivity and Specificity
- 14.2.7 Precision and Recall
- 14.2.8 The F-Measure
- 14.3 Visualizing Performance Tradeoffs (ROC Curve)
- 14.4 Estimating Future Performance (Internal Statistical Validation)
- 14.4.1 The Holdout Method
- 14.4.2 Cross-Validation
- 14.4.3 Bootstrap Sampling
- 14.5 Assignment: 14. Evaluation of Model Performance
- References
- Chapter 15: Improving Model Performance
- 15.1 Improving Model Performance by Parameter Tuning
- 15.2 Using caret for Automated Parameter Tuning
- 15.2.1 Customizing the Tuning Process
- 15.2.2 Improving Model Performance with Meta-learning
- 15.2.3 Bagging
- 15.2.4 Boosting
- 15.2.5 Random Forests
- Training Random Forests
- Evaluating Random Forest Performance
- 15.2.6 Adaptive Boosting
- 15.3 Assignment: 15. Improving Model Performance
- 15.3.1 Model Improvement Case Study
- References
- Chapter 16: Specialized Machine Learning Topics
- 16.1 Working with Specialized Data and Databases
- 16.1.1 Data Format Conversion
- 16.1.2 Querying Data in SQL Databases
- 16.1.3 Real Random Number Generation
- 16.1.4 Downloading the Complete Text of Web Pages
- 16.1.5 Reading and Writing XML with the XML Package
- 16.1.6 Web-Page Data Scraping
- 16.1.7 Parsing JSON from Web APIs
- 16.1.8 Reading and Writing Microsoft Excel Spreadsheets Using XLSX
- 16.2 Working with Domain-Specific Data
- 16.2.1 Working with Bioinformatics Data
- 16.2.2 Visualizing Network Data
- 16.3 Data Streaming
- 16.3.1 Definition
- 16.3.2 The stream Package
- 16.3.3 Synthetic Example: Random Gaussian Stream
- k-Means Clustering
- 16.3.4 Sources of Data Streams
- Static Structure Streams
- Concept Drift Streams
- Real Data Streams
- 16.3.5 Printing, Plotting and Saving Streams
- 16.3.6 Stream Animation
- 16.3.7 Case-Study: SOCR Knee Pain Data
- 16.3.8 Data Stream Clustering and Classification (DSC)
- 16.3.9 Evaluation of Data Stream Clustering
- 16.4 Optimization and Improving the Computational Performance
- 16.4.1 Generalizing Tabular Data Structures with dplyr
- 16.4.2 Making Data Frames Faster with data.table
- 16.4.3 Creating Disk-Based Data Frames with ff
- 16.4.4 Using Massive Matrices with bigmemory
- 16.5 Parallel Computing
- 16.5.1 Measuring Execution Time
- 16.5.2 Parallel Processing with Multiple Cores
- 16.5.3 Parallelization Using foreach and doParallel
- 16.5.4 GPU Computing
- 16.6 Deploying Optimized Learning Algorithms
- 16.6.1 Building Bigger Regression Models with biglm
- 16.6.2 Growing Bigger and Faster Random Forests with bigrf
- 16.6.3 Training and Evaluating Models in Parallel with caret
- 16.7 Practice Problem
- 16.8 Assignment: 16. Specialized Machine Learning Topics
- 16.8.1 Working with Website Data
- 16.8.2 Network Data and Visualization
- 16.8.3 Data Conversion and Parallel Computing
- References
- Chapter 17: Variable/Feature Selection
- 17.1 Feature Selection Methods
- 17.1.1 Filtering Techniques
- 17.1.2 Wrapper Methods
- 17.1.3 Embedded Techniques
- 17.2 Case Study: ALS
- 17.2.1 Step 1: Collecting Data
- 17.2.2 Step 2: Exploring and Preparing the Data
- 17.2.3 Step 3: Training a Model on the Data
- 17.2.4 Step 4: Evaluating Model Performance
- Comparing with RFE
- Comparing with Stepwise Feature Selection
- 17.3 Practice Problem
- 17.4 Assignment: 17. Variable/Feature Selection
- 17.4.1 Wrapper Feature Selection
- 17.4.2 Use the PPMI Dataset
- References
- Chapter 18: Regularized Linear Modeling and Controlled Variable Selection
- 18.1 Questions
- 18.2 Matrix Notation
- 18.3 Regularized Linear Modeling
- 18.3.1 Ridge Regression
- 18.3.2 Least Absolute Shrinkage and Selection Operator (LASSO) Regression
- 18.3.3 Predictor Standardization
- 18.3.4 Estimation Goals
- 18.4 Linear Regression
- 18.4.1 Drawbacks of Linear Regression
- 18.4.2 Assessing Prediction Accuracy
- 18.4.3 Estimating the Prediction Error
- 18.4.4 Improving the Prediction Accuracy
- 18.4.5 Variable Selection
- 18.5 Regularization Framework
- 18.5.1 Role of the Penalty Term
- 18.5.2 Role of the Regularization Parameter
- 18.5.3 LASSO
- 18.5.4 General Regularization Framework
- 18.6 Implementation of Regularization
- 18.6.1 Example: Neuroimaging-Genetics Study of Parkinson's Disease Dataset
- 18.6.2 Computational Complexity
- 18.6.3 LASSO and Ridge Solution Paths
- 18.6.4 Choice of the Regularization Parameter
- 18.6.5 Cross Validation Motivation
- 18.6.6 n-Fold Cross Validation
- 18.6.7 LASSO 10-Fold Cross Validation
- 18.6.8 Stepwise OLS (Ordinary Least Squares)
- 18.6.9 Final Models
- 18.6.10 Model Performance
- 18.6.11 Comparing Selected Features
- 18.6.12 Summary
- 18.7 Knock-off Filtering: Simulated Example
- 18.7.1 Notes
- 18.8 PD Neuroimaging-Genetics Case-Study
- 18.8.1 Fetching, Cleaning and Preparing the Data
- 18.8.2 Preparing the Response Vector
- 18.8.3 False Discovery Rate (FDR)
- Graphical Interpretation of the Benjamini-Hochberg (BH) Method
- FDR Adjusting the p-Values
- 18.8.4 Running the Knockoff Filter
- 18.9 Assignment: 18. Regularized Linear Modeling and Knockoff Filtering
- References
- Chapter 19: Big Longitudinal Data Analysis
- 19.1 Time Series Analysis
- 19.1.1 Step 1: Plot Time Series
- 19.1.2 Step 2: Find Proper Parameter Values for the ARIMA Model
- 19.1.3 Check the Differencing Parameter
- 19.1.4 Identifying the AR and MA Parameters
- 19.1.5 Step 3: Build an ARIMA Model
- 19.1.6 Step 4: Forecasting with the ARIMA Model
- 19.2 Structural Equation Modeling (SEM)-Latent Variables
- 19.2.1 Foundations of SEM
- 19.2.2 SEM Components
- 19.2.3 Case Study - Parkinson's Disease (PD)
- Step 1 - Collecting Data
- Step 2 - Exploring and Preparing the Data
- Step 3 - Fitting a Model on the Data
- 19.2.4 Outputs of Lavaan SEM
- 19.3 Longitudinal Data Analysis-Linear Mixed Models
- 19.3.1 Mean Trend
- 19.3.2 Modeling the Correlation
- 19.4 GLMM/GEE Longitudinal Data Analysis
- 19.4.1 GEE Versus GLMM
- 19.5 Assignment: 19. Big Longitudinal Data Analysis
- 19.5.1 Imaging Data
- 19.5.2 Time Series Analysis
- 19.5.3 Latent Variables Model
- References
- Chapter 20: Natural Language Processing/Text Mining
- 20.1 A Simple NLP/TM Example
- 20.1.1 Define and Load the Unstructured-Text Documents
- 20.1.2 Create a New VCorpus Object
- 20.1.3 To-Lower Case Transformation
- 20.1.4 Text Pre-processing
- Remove Stopwords
- Remove Punctuation
- Stemming: Removal of Plurals and Action Suffixes
- 20.1.5 Bags of Words
- 20.1.6 Document Term Matrix
- 20.2 Case-Study: Job Ranking
- 20.2.1 Step 1: Make a VCorpus Object
- 20.2.2 Step 2: Clean the VCorpus Object
- 20.2.3 Step 3: Build the Document Term Matrix
- 20.2.4 Area Under the ROC Curve
- 20.3 TF-IDF
- 20.3.1 Term Frequency (TF)
- 20.3.2 Inverse Document Frequency (IDF)
- 20.3.3 TF-IDF
- 20.4 Cosine Similarity
- 20.5 Sentiment Analysis
- 20.5.1 Data Preprocessing
- 20.5.2 NLP/TM Analytics
- 20.5.3 Prediction Optimization
- 20.6 Assignment: 20. Natural Language Processing/Text Mining
- 20.6.1 Mining Twitter Data
- 20.6.2 Mining Cancer Clinical Notes
- References
- Chapter 21: Prediction and Internal Statistical Cross Validation
- 21.1 Forecasting Types and Assessment Approaches
- 21.2 Overfitting
- 21.2.1 Example (US Presidential Elections)
- 21.2.2 Example (Google Flu Trends)
- 21.2.3 Example (Autism)
- 21.3 Internal Statistical Cross-Validation is an Iterative Process
- 21.4 Example (Linear Regression)
- 21.4.1 Cross-Validation Methods
- 21.4.2 Exhaustive Cross-Validation
- 21.4.3 Non-Exhaustive Cross-Validation
- 21.5 Case-Studies
- 21.5.1 Example 1: Prediction of Parkinson's Disease Using Adaptive Boosting (AdaBoost)
- 21.5.2 Example 2: Sleep Dataset
- 21.5.3 Example 3: Model-Based (Linear Regression) Prediction Using the Attitude Dataset
- 21.5.4 Example 4: Parkinson's Data (ppmi_data)
- 21.6 Summary of CV Output
- 21.7 Alternative Predictor Functions
- 21.7.1 Logistic Regression
- 21.7.2 Quadratic Discriminant Analysis (QDA)
- 21.7.3 Foundation of LDA and QDA for Prediction, Dimensionality Reduction, and Forecasting
- LDA (Linear Discriminant Analysis)
- QDA (Quadratic Discriminant Analysis)
- 21.7.4 Neural Networks
- 21.7.5 SVM
- 21.7.6 k-Nearest Neighbors Algorithm (k-NN)
- 21.7.7 k-Means Clustering (k-MC)
- 21.7.8 Spectral Clustering
- Iris Petal Data
- Spirals Data
- Income Data
- 21.8 Compare the Results
- 21.9 Assignment: 21. Prediction and Internal Statistical Cross-Validation
- References
- Chapter 22: Function Optimization
- 22.1 Free (Unconstrained) Optimization
- 22.1.1 Example 1: Minimizing a Univariate Function (Inverse-CDF)
- 22.1.2 Example 2: Minimizing a Bivariate Function
- 22.1.3 Example 3: Using Simulated Annealing to Find the Maximum of an Oscillatory Function
- 22.2 Constrained Optimization
- 22.2.1 Equality Constraints
- 22.2.2 Lagrange Multipliers
- 22.2.3 Inequality Constrained Optimization
- Linear Programming (LP)
- Mixed Integer Linear Programming (MILP)
- 22.2.4 Quadratic Programming (QP)
- 22.3 General Non-linear Optimization
- 22.3.1 Dual Problem Optimization
- Motivation
- Example 1: Linear Example
- Example 2: Quadratic Example
- Example 3: More Complex Non-linear Optimization
- Example 4: Another Linear Example
- 22.4 Manual Versus Automated Lagrange Multiplier Optimization
- 22.5 Data Denoising
- 22.6 Assignment: 22. Function Optimization
- 22.6.1 Unconstrained Optimization
- 22.6.2 Linear Programming (LP)
- 22.6.3 Mixed Integer Linear Programming (MILP)
- 22.6.4 Quadratic Programming (QP)
- 22.6.5 Complex Non-linear Optimization
- 22.6.6 Data Denoising
- References
- Chapter 23: Deep Learning, Neural Networks
- 23.1 Deep Learning Training
- 23.1.1 Perceptrons
- 23.2 Biological Relevance
- 23.3 Simple Neural Net Examples
- 23.3.1 Exclusive OR (XOR) Operator
- 23.3.2 NAND Operator
- 23.3.3 Complex Networks Designed Using Simple Building Blocks
- 23.4 Classification
- 23.4.1 Sonar Data Example
- 23.4.2 MXNet Notes
- 23.5 Case-Studies
- 23.5.1 ALS Regression Example
- 23.5.2 Spirals 2D Data
- 23.5.3 IBS Study
- 23.5.4 Country QoL Ranking Data
- 23.5.5 Handwritten Digits Classification
- Configuring the Neural Network
- Training
- Forecasting
- Examining the Network Structure Using LeNet
- 23.6 Classifying Real-World Images
- 23.6.1 Load the Pre-trained Model
- 23.6.2 Load, Preprocess and Classify New Images - US Weather Pattern
- 23.6.3 Lake Mapourika, New Zealand
- 23.6.4 Beach Image
- 23.6.5 Volcano
- 23.6.6 Brain Surface
- 23.6.7 Face Mask
- 23.7 Assignment: 23. Deep Learning, Neural Networks
- 23.7.1 Deep Learning Classification
- 23.7.2 Deep Learning Regression
- 23.7.3 Image Classification
- References
- Summary
- Glossary
- Index
