- Type of Document: Book
- Publisher: Springer International Publishing AG, Switzerland, 2018
- Keywords: Big data; Mathematical statistics; Medical records -- Data processing; R (Computer program language)
- Foreword
- Preface
- Genesis
- Purpose
- Limitations/Prerequisites
- Scope of the Book
- Acknowledgements
- DSPA Application and Use Disclaimer
- Biomedical, Biosocial, Environmental, and Health Disclaimer
- Notations
- Contents
- Chapter 1: Motivation
- 1.1 DSPA Mission and Objectives
- 1.2 Examples of Driving Motivational Problems and Challenges
- 1.2.1 Alzheimer's Disease
- 1.2.2 Parkinson's Disease
- 1.2.3 Drug and Substance Use
- 1.2.4 Amyotrophic Lateral Sclerosis
- 1.2.5 Normal Brain Visualization
- 1.2.6 Neurodegeneration
- 1.2.7 Genetic Forensics: 2013-2016 Ebola Outbreak
- 1.2.8 Next Generation Sequence (NGS) Analysis
- 1.2.9 Neuroimaging-Genetics
- 1.3 Common Characteristics of Big (Biomedical and Health) Data
- 1.4 Data Science
- 1.5 Predictive Analytics
- 1.6 High-Throughput Big Data Analytics
- 1.7 Examples of Data Repositories, Archives, and Services
- 1.8 DSPA Expectations
- Chapter 2: Foundations of R
- 2.1 Why Use R?
- 2.2 Getting Started
- 2.2.1 Install Basic Shell-Based R
- 2.2.2 GUI Based R Invocation (RStudio)
- 2.2.3 RStudio GUI Layout
- 2.2.4 Some Notes
- 2.3 Help
- 2.4 Simple Wide-to-Long Data Format Translation
- 2.5 Data Generation
- 2.6 Input/Output (I/O)
- 2.7 Slicing and Extracting Data
- 2.8 Variable Conversion
- 2.9 Variable Information
- 2.10 Data Selection and Manipulation
- 2.11 Math Functions
- 2.12 Matrix Operations
- 2.13 Advanced Data Processing
- 2.14 Strings
- 2.15 Plotting
- 2.16 QQ Normal Probability Plot
- 2.17 Low-Level Plotting Commands
- 2.18 Graphics Parameters
- 2.19 Optimization and Model Fitting
- 2.20 Statistics
- 2.21 Distributions
- 2.21.1 Programming
- 2.22 Data Simulation Primer
- 2.23 Appendix
- 2.23.1 HTML SOCR Data Import
- 2.23.2 R Debugging
- Example
- 2.24 Assignments: 2. R Foundations
- 2.24.1 Confirm that You Have Installed R/RStudio
- 2.24.2 Long-to-Wide Data Format Translation
- 2.24.3 Data Frames
- 2.24.4 Data Stratification
- 2.24.5 Simulation
- 2.24.6 Programming
- References
- Chapter 3: Managing Data in R
- 3.1 Saving and Loading R Data Structures
- 3.2 Importing and Saving Data from CSV Files
- 3.3 Exploring the Structure of Data
- 3.4 Exploring Numeric Variables
- 3.5 Measuring the Central Tendency: Mean, Median, Mode
- 3.6 Measuring Spread: Quartiles and the Five-Number Summary
- 3.7 Visualizing Numeric Variables: Boxplots
- 3.8 Visualizing Numeric Variables: Histograms
- 3.9 Understanding Numeric Data: Uniform and Normal Distributions
- 3.10 Measuring Spread: Variance and Standard Deviation
- 3.11 Exploring Categorical Variables
- 3.12 Exploring Relationships Between Variables
- 3.13 Missing Data
- 3.13.1 Simulate Some Real Multivariate Data
- 3.13.2 TBI Data Example
- 3.13.3 Imputation via Expectation-Maximization
- Types of Missing Data
- General Idea of EM Algorithm
- EM-Based Imputation
- A Simple Manual Implementation of EM-Based Imputation
- Plotting Complete and Imputed Data
- Validation of EM-Imputation Using the Amelia R Package
- Comparison
- Density Plots
- 3.14 Parsing Webpages and Visualizing Tabular HTML Data
- 3.15 Cohort-Rebalancing (for Imbalanced Groups)
- 3.16 Appendix
- 3.16.1 Importing Data from SQL Databases
- 3.16.2 R Code Fragments
- 3.17 Assignments: 3. Managing Data in R
- 3.17.1 Import, Plot, Summarize and Save Data
- 3.17.2 Explore Some Bivariate Relations in the Data
- 3.17.3 Missing Data
- 3.17.4 Surface Plots
- 3.17.5 Unbalanced Designs
- 3.17.6 Aggregate Analysis
- References
- Chapter 4: Data Visualization
- 4.1 Common Questions
- 4.2 Classification of Visualization Methods
- 4.3 Composition
- 4.3.1 Histograms and Density Plots
- 4.3.2 Pie Chart
- 4.3.3 Heat Map
- 4.4 Comparison
- 4.4.1 Paired Scatter Plots
- 4.4.2 Jitter Plot
- 4.4.3 Bar Plots
- 4.4.4 Trees and Graphs
- 4.4.5 Correlation Plots
- 4.5 Relationships
- 4.5.1 Line Plots Using ggplot
- 4.5.2 Density Plots
- 4.5.3 Distributions
- 4.5.4 2D Kernel Density and 3D Surface Plots
- 4.5.5 Multiple 2D Image Surface Plots
- 4.5.6 3D and 4D Visualizations
- 4.6 Appendix
- 4.6.1 Hands-on Activity (Health Behavior Risks)
- 4.6.2 Additional ggplot Examples
- Housing Price Data
- Modeling the Home Price Index Data (Fig. 4.48)
- Map of the Neighborhoods of Los Angeles (LA)
- Latin Letter Frequency in Different Languages
- 4.7 Assignments 4: Data Visualization
- 4.7.1 Common Plots
- 4.7.2 Trees and Graphs
- 4.7.3 Exploratory Data Analytics (EDA)
- References
- Chapter 5: Linear Algebra and Matrix Computing
- 5.1 Matrices (Second Order Tensors)
- 5.1.1 Create Matrices
- 5.1.2 Adding Columns and Rows
- 5.2 Matrix Subscripts
- 5.3 Matrix Operations
- 5.3.1 Addition
- 5.3.2 Subtraction
- 5.3.3 Multiplication
- Elementwise Multiplication
- Matrix Multiplication
- 5.3.4 Element-wise Division
- 5.3.5 Transpose
- 5.3.6 Multiplicative Inverse
- 5.4 Matrix Algebra Notation
- 5.4.1 Linear Models
- 5.4.2 Solving Systems of Equations
- 5.4.3 The Identity Matrix
- 5.5 Scalars, Vectors and Matrices
- 5.5.1 Sample Statistics (Mean, Variance)
- Mean
- Variance
- Applications of Matrix Algebra: Linear Modeling
- Finding Function Extrema (Min/Max) Using Calculus
- 5.5.2 Least Square Estimation
- The R lm Function
- 5.6 Eigenvalues and Eigenvectors
- 5.7 Other Important Functions
- 5.8 Matrix Notation (Another View)
- 5.9 Multivariate Linear Regression
- 5.10 Sample Covariance Matrix
- 5.11 Assignments: 5. Linear Algebra and Matrix Computing
- 5.11.1 How Is Matrix Multiplication Defined?
- 5.11.2 Scalar Versus Matrix Multiplication
- 5.11.3 Matrix Equations
- 5.11.4 Least Square Estimation
- 5.11.5 Matrix Manipulation
- 5.11.6 Matrix Transpose
- 5.11.7 Sample Statistics
- 5.11.8 Least Square Estimation
- 5.11.9 Eigenvalues and Eigenvectors
- References
- Chapter 6: Dimensionality Reduction
- 6.1 Example: Reducing 2D to 1D
- 6.2 Matrix Rotations
- 6.3 Notation
- 6.4 Summary (PCA vs. ICA vs. FA)
- 6.5 Principal Component Analysis (PCA)
- 6.5.1 Principal Components
- 6.6 Independent Component Analysis (ICA)
- 6.7 Factor Analysis (FA)
- 6.8 Singular Value Decomposition (SVD)
- 6.9 SVD Summary
- 6.10 Case Study for Dimension Reduction (Parkinson's Disease)
- 6.11 Assignments: 6. Dimensionality Reduction
- 6.11.1 Parkinson's Disease Example
- 6.11.2 Allometric Relations in Plants Example
- Load Data
- Dimensionality Reduction
- References
- Chapter 7: Lazy Learning: Classification Using Nearest Neighbors
- 7.1 Motivation
- 7.2 The kNN Algorithm Overview
- 7.2.1 Distance Function and Dummy Coding
- 7.2.2 Ways to Determine k
- 7.2.3 Rescaling of the Features
- 7.2.4 Rescaling Formulas
- 7.3 Case Study
- 7.3.1 Step 1: Collecting Data
- 7.3.2 Step 2: Exploring and Preparing the Data
- 7.3.3 Normalizing Data
- 7.3.4 Data Preparation: Creating Training and Testing Datasets
- 7.3.5 Step 3: Training a Model on the Data
- 7.3.6 Step 4: Evaluating Model Performance
- 7.3.7 Step 5: Improving Model Performance
- 7.3.8 Testing Alternative Values of k
- 7.3.9 Quantitative Assessment (Tables 7.2 and 7.3)
- 7.4 Assignments: 7. Lazy Learning: Classification Using Nearest Neighbors
- 7.4.1 Traumatic Brain Injury (TBI)
- 7.4.2 Parkinson's Disease
- 7.4.3 KNN Classification in a High Dimensional Space
- 7.4.4 KNN Classification in a Lower Dimensional Space
- References
- Chapter 8: Probabilistic Learning: Classification Using Naive Bayes
- 8.1 Overview of the Naive Bayes Algorithm
- 8.2 Assumptions
- 8.3 Bayes Formula
- 8.4 The Laplace Estimator
- 8.5 Case Study: Head and Neck Cancer Medication
- 8.5.1 Step 1: Collecting Data
- 8.5.2 Step 2: Exploring and Preparing the Data
- Data Preparation: Processing Text Data for Analysis
- Data Preparation: Creating Training and Test Datasets
- Visualizing Text Data: Word Clouds
- Data Preparation: Creating Indicator Features for Frequent Words
- 8.5.3 Step 3: Training a Model on the Data
- 8.5.4 Step 4: Evaluating Model Performance
- 8.5.5 Step 5: Improving Model Performance
- 8.5.6 Step 6: Compare Naive Bayes Against LDA
- 8.6 Practice Problem
- 8.7 Assignments 8: Probabilistic Learning: Classification Using Naive Bayes
- 8.7.1 Explain These Two Concepts
- 8.7.2 Analyzing Textual Data
- References
- Chapter 9: Decision Tree Divide and Conquer Classification
- 9.1 Motivation
- 9.2 Hands-on Example: Iris Data
- 9.3 Decision Tree Overview
- 9.3.1 Divide and Conquer
- 9.3.2 Entropy
- 9.3.3 Misclassification Error and Gini Index
- 9.3.4 C5.0 Decision Tree Algorithm
- 9.3.5 Pruning the Decision Tree
- 9.4 Case Study 1: Quality of Life and Chronic Disease
- 9.4.1 Step 1: Collecting Data
- 9.4.2 Step 2: Exploring and Preparing the Data
- Data Preparation: Creating Random Training and Test Datasets
- 9.4.3 Step 3: Training a Model on the Data
- 9.4.4 Step 4: Evaluating Model Performance
- 9.4.5 Step 5: Trial Option
- 9.4.6 Loading the Misclassification Error Matrix
- 9.4.7 Parameter Tuning
- 9.5 Compare Different Impurity Indices
- 9.6 Classification Rules
- 9.6.1 Separate and Conquer
- 9.6.2 The One Rule Algorithm
- 9.6.3 The RIPPER Algorithm
- 9.7 Case Study 2: QoL in Chronic Disease (Take 2)
- 9.7.1 Step 3: Training a Model on the Data
- 9.7.2 Step 4: Evaluating Model Performance
- 9.7.3 Step 5: Alternative Model 1
- 9.7.4 Step 5: Alternative Model 2
- 9.8 Practice Problem
- 9.9 Assignments 9: Decision Tree Divide and Conquer Classification
- 9.9.1 Explain These Concepts
- 9.9.2 Decision Tree Partitioning
- References
- Chapter 10: Forecasting Numeric Data Using Regression Models
- 10.1 Understanding Regression
- 10.1.1 Simple Linear Regression
- 10.2 Ordinary Least Squares Estimation
- 10.2.1 Model Assumptions
- 10.2.2 Correlations
- 10.2.3 Multiple Linear Regression
- 10.3 Case Study 1: Baseball Players
- 10.3.1 Step 1: Collecting Data
- 10.3.2 Step 2: Exploring and Preparing the Data
- 10.3.3 Exploring Relationships Among Features: The Correlation Matrix
- 10.3.4 Visualizing Relationships Among Features: The Scatterplot Matrix
- 10.3.5 Step 3: Training a Model on the Data
- 10.3.6 Step 4: Evaluating Model Performance
- 10.4 Step 5: Improving Model Performance
- 10.4.1 Model Specification: Adding Non-linear Relationships
- 10.4.2 Transformation: Converting a Numeric Variable to a Binary Indicator
- 10.4.3 Model Specification: Adding Interaction Effects
- 10.5 Understanding Regression Trees and Model Trees
- 10.5.1 Adding Regression to Trees
- 10.6 Case Study 2: Baseball Players (Take 2)
- 10.6.1 Step 2: Exploring and Preparing the Data
- 10.6.2 Step 3: Training a Model on the Data
- 10.6.3 Visualizing Decision Trees
- 10.6.4 Step 4: Evaluating Model Performance
- 10.6.5 Measuring Performance with Mean Absolute Error
- 10.6.6 Step 5: Improving Model Performance
- 10.7 Practice Problem: Heart Attack Data
- 10.8 Assignments: 10. Forecasting Numeric Data Using Regression Models
- References
- Chapter 11: Black Box Machine-Learning Methods: Neural Networks and Support Vector Machines
- 11.1 Understanding Neural Networks
- 11.1.1 From Biological to Artificial Neurons
- 11.1.2 Activation Functions
- 11.1.3 Network Topology
- 11.1.4 The Direction of Information Travel
- 11.1.5 The Number of Nodes in Each Layer
- 11.1.6 Training Neural Networks with Backpropagation
- 11.2 Case Study 1: Google Trends and the Stock Market: Regression
- 11.2.1 Step 1: Collecting Data
- Variables
- 11.2.2 Step 2: Exploring and Preparing the Data
- 11.2.3 Step 3: Training a Model on the Data
- 11.2.4 Step 4: Evaluating Model Performance
- 11.2.5 Step 5: Improving Model Performance
- 11.2.6 Step 6: Adding Additional Layers
- 11.3 Simple NN Demo: Learning to Compute
- 11.4 Case Study 2: Google Trends and the Stock Market - Classification
- 11.5 Support Vector Machines (SVM)
- 11.5.1 Classification with Hyperplanes
- Finding the Maximum Margin
- Linearly Separable Data
- Non-linearly Separable Data
- Using Kernels for Non-linear Spaces
- 11.6 Case Study 3: Optical Character Recognition (OCR)
- 11.6.1 Step 1: Prepare and Explore the Data
- 11.6.2 Step 2: Training an SVM Model
- 11.6.3 Step 3: Evaluating Model Performance
- 11.6.4 Step 4: Improving Model Performance
- 11.7 Case Study 4: Iris Flowers
- 11.7.1 Step 1: Collecting Data
- 11.7.2 Step 2: Exploring and Preparing the Data
- 11.7.3 Step 3: Training a Model on the Data
- 11.7.4 Step 4: Evaluating Model Performance
- 11.7.5 Step 5: RBF Kernel Function
- 11.7.6 Parameter Tuning
- 11.7.7 Improving the Performance of Gaussian Kernels
- 11.8 Practice
- 11.8.1 Problem 1 Google Trends and the Stock Market
- 11.8.2 Problem 2: Quality of Life and Chronic Disease
- 11.9 Appendix
- 11.10 Assignments: 11. Black Box Machine-Learning Methods: Neural Networks and Support Vector Machines
- 11.10.1 Learn and Predict a Power-Function
- 11.10.2 Pediatric Schizophrenia Study
- References
- Chapter 12: Apriori Association Rules Learning
- 12.1 Association Rules
- 12.2 The Apriori Algorithm for Association Rule Learning
- 12.3 Measuring Rule Importance by Using Support and Confidence
- 12.4 Building a Set of Rules with the Apriori Principle
- 12.5 A Toy Example
- 12.6 Case Study 1: Head and Neck Cancer Medications
- 12.6.1 Step 1: Collecting Data
- 12.6.2 Step 2: Exploring and Preparing the Data
- Visualizing Item Support: Item Frequency Plots
- Visualizing Transaction Data: Plotting the Sparse Matrix
- 12.6.3 Step 3: Training a Model on the Data
- 12.6.4 Step 4: Evaluating Model Performance
- 12.6.5 Step 5: Improving Model Performance
- Sorting the Set of Association Rules
- Taking Subsets of Association Rules
- Saving Association Rules to a File or Data Frame
- 12.7 Practice Problems: Groceries
- 12.8 Summary
- 12.9 Assignments: 12. Apriori Association Rules Learning
- References
- Chapter 13: k-Means Clustering
- 13.1 Clustering as a Machine Learning Task
- 13.2 Silhouette Plots
- 13.3 The k-Means Clustering Algorithm
- 13.3.1 Using Distance to Assign and Update Clusters
- 13.3.2 Choosing the Appropriate Number of Clusters
- 13.4 Case Study 1: Divorce and Consequences on Young Adults
- 13.4.1 Step 1: Collecting Data
- Variables
- 13.4.2 Step 2: Exploring and Preparing the Data
- 13.4.3 Step 3: Training a Model on the Data
- 13.4.4 Step 4: Evaluating Model Performance
- 13.4.5 Step 5: Usage of Cluster Information
- 13.5 Model Improvement
- 13.5.1 Tuning the Parameter k
- 13.6 Case Study 2: Pediatric Trauma
- 13.6.1 Step 1: Collecting Data
- 13.6.2 Step 2: Exploring and Preparing the Data
- 13.6.3 Step 3: Training a Model on the Data
- 13.6.4 Step 4: Evaluating Model Performance
- 13.6.5 Practice Problem: Youth Development
- 13.7 Hierarchical Clustering
- 13.8 Gaussian Mixture Models
- 13.9 Summary
- 13.10 Assignments: 13. k-Means Clustering
- References
- Chapter 14: Model Performance Assessment
- 14.1 Measuring the Performance of Classification Methods
- 14.2 Evaluation Strategies
- 14.2.1 Binary Outcomes
- 14.2.2 Confusion Matrices
- 14.2.3 Other Measures of Performance Beyond Accuracy
- 14.2.4 The Kappa (κ) Statistic
- Summary of the Kappa Score for Calculating Prediction Accuracy
- 14.2.5 Computation of Observed Accuracy and Expected Accuracy
- 14.2.6 Sensitivity and Specificity
- 14.2.7 Precision and Recall
- 14.2.8 The F-Measure
- 14.3 Visualizing Performance Tradeoffs (ROC Curve)
- 14.4 Estimating Future Performance (Internal Statistical Validation)
- 14.4.1 The Holdout Method
- 14.4.2 Cross-Validation
- 14.4.3 Bootstrap Sampling
- 14.5 Assignment: 14. Evaluation of Model Performance
- References
- Chapter 15: Improving Model Performance
- 15.1 Improving Model Performance by Parameter Tuning
- 15.2 Using caret for Automated Parameter Tuning
- 15.2.1 Customizing the Tuning Process
- 15.2.2 Improving Model Performance with Meta-learning
- 15.2.3 Bagging
- 15.2.4 Boosting
- 15.2.5 Random Forests
- Training Random Forests
- Evaluating Random Forest Performance
- 15.2.6 Adaptive Boosting
- 15.3 Assignment: 15. Improving Model Performance
- 15.3.1 Model Improvement Case Study
- References
- Chapter 16: Specialized Machine Learning Topics
- 16.1 Working with Specialized Data and Databases
- 16.1.1 Data Format Conversion
- 16.1.2 Querying Data in SQL Databases
- 16.1.3 Real Random Number Generation
- 16.1.4 Downloading the Complete Text of Web Pages
- 16.1.5 Reading and Writing XML with the XML Package
- 16.1.6 Web-Page Data Scraping
- 16.1.7 Parsing JSON from Web APIs
- 16.1.8 Reading and Writing Microsoft Excel Spreadsheets Using XLSX
- 16.2 Working with Domain-Specific Data
- 16.2.1 Working with Bioinformatics Data
- 16.2.2 Visualizing Network Data
- 16.3 Data Streaming
- 16.3.1 Definition
- 16.3.2 The stream Package
- 16.3.3 Synthetic Example: Random Gaussian Stream
- k-Means Clustering
- 16.3.4 Sources of Data Streams
- Static Structure Streams
- Concept Drift Streams
- Real Data Streams
- 16.3.5 Printing, Plotting and Saving Streams
- 16.3.6 Stream Animation
- 16.3.7 Case-Study: SOCR Knee Pain Data
- 16.3.8 Data Stream Clustering and Classification (DSC)
- 16.3.9 Evaluation of Data Stream Clustering
- 16.4 Optimization and Improving the Computational Performance
- 16.4.1 Generalizing Tabular Data Structures with dplyr
- 16.4.2 Making Data Frames Faster with data.table
- 16.4.3 Creating Disk-Based Data Frames with ff
- 16.4.4 Using Massive Matrices with bigmemory
- 16.5 Parallel Computing
- 16.5.1 Measuring Execution Time
- 16.5.2 Parallel Processing with Multiple Cores
- 16.5.3 Parallelization Using foreach and doParallel
- 16.5.4 GPU Computing
- 16.6 Deploying Optimized Learning Algorithms
- 16.6.1 Building Bigger Regression Models with biglm
- 16.6.2 Growing Bigger and Faster Random Forests with bigrf
- 16.6.3 Training and Evaluating Models in Parallel with caret
- 16.7 Practice Problem
- 16.8 Assignment: 16. Specialized Machine Learning Topics
- 16.8.1 Working with Website Data
- 16.8.2 Network Data and Visualization
- 16.8.3 Data Conversion and Parallel Computing
- References
- Chapter 17: Variable/Feature Selection
- 17.1 Feature Selection Methods
- 17.1.1 Filtering Techniques
- 17.1.2 Wrapper Methods
- 17.1.3 Embedded Techniques
- 17.2 Case Study: ALS
- 17.2.1 Step 1: Collecting Data
- 17.2.2 Step 2: Exploring and Preparing the Data
- 17.2.3 Step 3: Training a Model on the Data
- 17.2.4 Step 4: Evaluating Model Performance
- Comparing with RFE
- Comparing with Stepwise Feature Selection
- 17.3 Practice Problem
- 17.4 Assignment: 17. Variable/Feature Selection
- 17.4.1 Wrapper Feature Selection
- 17.4.2 Use the PPMI Dataset
- References
- Chapter 18: Regularized Linear Modeling and Controlled Variable Selection
- 18.1 Questions
- 18.2 Matrix Notation
- 18.3 Regularized Linear Modeling
- 18.3.1 Ridge Regression
- 18.3.2 Least Absolute Shrinkage and Selection Operator (LASSO) Regression
- 18.3.3 Predictor Standardization
- 18.3.4 Estimation Goals
- 18.4 Linear Regression
- 18.4.1 Drawbacks of Linear Regression
- 18.4.2 Assessing Prediction Accuracy
- 18.4.3 Estimating the Prediction Error
- 18.4.4 Improving the Prediction Accuracy
- 18.4.5 Variable Selection
- 18.5 Regularization Framework
- 18.5.1 Role of the Penalty Term
- 18.5.2 Role of the Regularization Parameter
- 18.5.3 LASSO
- 18.5.4 General Regularization Framework
- 18.6 Implementation of Regularization
- 18.6.1 Example: Neuroimaging-Genetics Study of Parkinson's Disease Dataset
- 18.6.2 Computational Complexity
- 18.6.3 LASSO and Ridge Solution Paths
- 18.6.4 Choice of the Regularization Parameter
- 18.6.5 Cross Validation Motivation
- 18.6.6 n-Fold Cross Validation
- 18.6.7 LASSO 10-Fold Cross Validation
- 18.6.8 Stepwise OLS (Ordinary Least Squares)
- 18.6.9 Final Models
- 18.6.10 Model Performance
- 18.6.11 Comparing Selected Features
- 18.6.12 Summary
- 18.7 Knock-off Filtering: Simulated Example
- 18.7.1 Notes
- 18.8 PD Neuroimaging-Genetics Case-Study
- 18.8.1 Fetching, Cleaning and Preparing the Data
- 18.8.2 Preparing the Response Vector
- 18.8.3 False Discovery Rate (FDR)
- Graphical Interpretation of the Benjamini-Hochberg (BH) Method
- FDR Adjusting the p-Values
- 18.8.4 Running the Knockoff Filter
- 18.9 Assignment: 18. Regularized Linear Modeling and Knockoff Filtering
- References
- Chapter 19: Big Longitudinal Data Analysis
- 19.1 Time Series Analysis
- 19.1.1 Step 1: Plot Time Series
- 19.1.2 Step 2: Find Proper Parameter Values for the ARIMA Model
- 19.1.3 Check the Differencing Parameter
- 19.1.4 Identifying the AR and MA Parameters
- 19.1.5 Step 3: Build an ARIMA Model
- 19.1.6 Step 4: Forecasting with the ARIMA Model
- 19.2 Structural Equation Modeling (SEM)-Latent Variables
- 19.2.1 Foundations of SEM
- 19.2.2 SEM Components
- 19.2.3 Case Study - Parkinson's Disease (PD)
- Step 1 - Collecting Data
- Step 2 - Exploring and Preparing the Data
- Step 3 - Fitting a Model on the Data
- 19.2.4 Outputs of Lavaan SEM
- 19.3 Longitudinal Data Analysis-Linear Mixed Models
- 19.3.1 Mean Trend
- 19.3.2 Modeling the Correlation
- 19.4 GLMM/GEE Longitudinal Data Analysis
- 19.4.1 GEE Versus GLMM
- 19.5 Assignment: 19. Big Longitudinal Data Analysis
- 19.5.1 Imaging Data
- 19.5.2 Time Series Analysis
- 19.5.3 Latent Variables Model
- References
- Chapter 20: Natural Language Processing/Text Mining
- 20.1 A Simple NLP/TM Example
- 20.1.1 Define and Load the Unstructured-Text Documents
- 20.1.2 Create a New VCorpus Object
- 20.1.3 To-Lower Case Transformation
- 20.1.4 Text Pre-processing
- Remove Stopwords
- Remove Punctuation
- Stemming: Removal of Plurals and Action Suffixes
- 20.1.5 Bags of Words
- 20.1.6 Document Term Matrix
- 20.2 Case-Study: Job Ranking
- 20.2.1 Step 1: Make a VCorpus Object
- 20.2.2 Step 2: Clean the VCorpus Object
- 20.2.3 Step 3: Build the Document Term Matrix
- 20.2.4 Area Under the ROC Curve
- 20.3 TF-IDF
- 20.3.1 Term Frequency (TF)
- 20.3.2 Inverse Document Frequency (IDF)
- 20.3.3 TF-IDF
- 20.4 Cosine Similarity
- 20.5 Sentiment Analysis
- 20.5.1 Data Preprocessing
- 20.5.2 NLP/TM Analytics
- 20.5.3 Prediction Optimization
- 20.6 Assignment: 20. Natural Language Processing/Text Mining
- 20.6.1 Mining Twitter Data
- 20.6.2 Mining Cancer Clinical Notes
- References
- Chapter 21: Prediction and Internal Statistical Cross Validation
- 21.1 Forecasting Types and Assessment Approaches
- 21.2 Overfitting
- 21.2.1 Example (US Presidential Elections)
- 21.2.2 Example (Google Flu Trends)
- 21.2.3 Example (Autism)
- 21.3 Internal Statistical Cross-Validation is an Iterative Process
- 21.4 Example (Linear Regression)
- 21.4.1 Cross-Validation Methods
- 21.4.2 Exhaustive Cross-Validation
- 21.4.3 Non-Exhaustive Cross-Validation
- 21.5 Case-Studies
- 21.5.1 Example 1: Prediction of Parkinson's Disease Using Adaptive Boosting (AdaBoost)
- 21.5.2 Example 2: Sleep Dataset
- 21.5.3 Example 3: Model-Based (Linear Regression) Prediction Using the Attitude Dataset
- 21.5.4 Example 4: Parkinson's Data (ppmi_data)
- 21.6 Summary of CV Output
- 21.7 Alternative Predictor Functions
- 21.7.1 Logistic Regression
- 21.7.2 Quadratic Discriminant Analysis (QDA)
- 21.7.3 Foundation of LDA and QDA for Prediction, Dimensionality Reduction, and Forecasting
- LDA (Linear Discriminant Analysis)
- QDA (Quadratic Discriminant Analysis)
- 21.7.4 Neural Networks
- 21.7.5 SVM
- 21.7.6 k-Nearest Neighbors Algorithm (k-NN)
- 21.7.7 k-Means Clustering (k-MC)
- 21.7.8 Spectral Clustering
- Iris Petal Data
- Spirals Data
- Income Data
- 21.8 Compare the Results
- 21.9 Assignment: 21. Prediction and Internal Statistical Cross-Validation
- References
- Chapter 22: Function Optimization
- 22.1 Free (Unconstrained) Optimization
- 22.1.1 Example 1: Minimizing a Univariate Function (Inverse-CDF)
- 22.1.2 Example 2: Minimizing a Bivariate Function
- 22.1.3 Example 3: Using Simulated Annealing to Find the Maximum of an Oscillatory Function
- 22.2 Constrained Optimization
- 22.2.1 Equality Constraints
- 22.2.2 Lagrange Multipliers
- 22.2.3 Inequality Constrained Optimization
- Linear Programming (LP)
- Mixed Integer Linear Programming (MILP)
- 22.2.4 Quadratic Programming (QP)
- 22.3 General Non-linear Optimization
- 22.3.1 Dual Problem Optimization
- Motivation
- Example 1: Linear Example
- Example 2: Quadratic Example
- Example 3: More Complex Non-linear Optimization
- Example 4: Another Linear Example
- 22.4 Manual Versus Automated Lagrange Multiplier Optimization
- 22.5 Data Denoising
- 22.6 Assignment: 22. Function Optimization
- 22.6.1 Unconstrained Optimization
- 22.6.2 Linear Programming (LP)
- 22.6.3 Mixed Integer Linear Programming (MILP)
- 22.6.4 Quadratic Programming (QP)
- 22.6.5 Complex Non-linear Optimization
- 22.6.6 Data Denoising
- References
- Chapter 23: Deep Learning, Neural Networks
- 23.1 Deep Learning Training
- 23.1.1 Perceptrons
- 23.2 Biological Relevance
- 23.3 Simple Neural Net Examples
- 23.3.1 Exclusive OR (XOR) Operator
- 23.3.2 NAND Operator
- 23.3.3 Complex Networks Designed Using Simple Building Blocks
- 23.4 Classification
- 23.4.1 Sonar Data Example
- 23.4.2 MXNet Notes
- 23.5 Case-Studies
- 23.5.1 ALS Regression Example
- 23.5.2 Spirals 2D Data
- 23.5.3 IBS Study
- 23.5.4 Country QoL Ranking Data
- 23.5.5 Handwritten Digits Classification
- Configuring the Neural Network
- Training
- Forecasting
- Examining the Network Structure Using LeNet
- 23.6 Classifying Real-World Images
- 23.6.1 Load the Pre-trained Model
- 23.6.2 Load, Preprocess and Classify New Images - US Weather Pattern
- 23.6.3 Lake Mapourika, New Zealand
- 23.6.4 Beach Image
- 23.6.5 Volcano
- 23.6.6 Brain Surface
- 23.6.7 Face Mask
- 23.7 Assignment: 23. Deep Learning, Neural Networks
- 23.7.1 Deep Learning Classification
- 23.7.2 Deep Learning Regression
- 23.7.3 Image Classification
- References
- Summary
- Glossary
- Index
