Tan Zhou

I am a Senior Data Scientist with 8+ years of experience (both industry and academic) in developing innovative solutions to business problems using quantitative and machine learning models.

I obtained my Ph.D. from Texas A&M University, majoring in Applied Statistics and Geo-computation, with a focus on signal processing, Bayesian methods, and machine learning.

I specialize in writing for publication, algorithm development, and applying machine learning models to develop analytical solutions using various kinds of data such as time series, text, geospatial, and remote sensing data.

I am passionate about applying advanced statistical methods, such as machine learning and deep learning-based approaches, NLP, and time series analysis, to perform deep-dive analysis that identifies emerging trends, pain points, and opportunity areas in data, turns data into actionable insights, and supports decision making and business optimization.

I have technical expertise in statistics, machine learning, NLP, time series analysis, big data processing, remote sensing, computer vision, and geospatial data engineering, based on years of experience in university research, non-governmental organizations, start-ups, and the private sector.

Skills

Languages, Operating Systems & Tools
  • Python
  • R
  • git
  • linux
  • bash
Machine learning
  • Bayesian
  • random forest
  • Neural network & deep learning (CNN, ResNet, U-Net, LSTM, transfer learning)
  • decision tree
  • nearest neighbor
  • support vector machine
  • recommender system
  • Natural Language Processing
Statistical methods
  • Partial least squares regression
  • univariate and multivariate regression
  • linear discriminant analysis
  • logistic regression
  • time series analysis
  • factor analysis
  • mixed-effects modeling
Platform Development & Administration
  • MySQL
  • Domino Data Lab
Big data & Cloud

Sentiment analysis of reviewers' feedback using BERT vs. machine learning

To predict the sentiment (positive, neutral, negative) of customer feedback using tweet texts about different airline companies, and to compare the performance of different models on text classification.

Enhanced fraud detection using ML and PySpark framework with feature selection

To develop a generalized model for prediction on large, imbalanced data that is suitable for real-time fraud detection in the PySpark framework.

Fraud detection using ML and PySpark framework

To develop a generalized model for prediction on large, imbalanced data that is suitable for real-time fraud detection in the PySpark framework.

Ensemble models in PySpark

To exemplify the use of ensemble models in PySpark, following the ensemble models in a [previous project using sklearn and keras](https://github.com/tankwin08/ensemble-models-ML-DL-), and to predict whether a client will subscribe (yes/no) to a term deposit (variable y) using marketing campaign data.

Bayesian Uncertainty for time series data (EVI) prediction using LSTM and autoencoder

To investigate the trends and patterns of time series data (MODIS data) using Long Short-Term Memory (LSTM) networks and to quantify the uncertainty of the time series predictions of target variables.

Time series analysis using ARIMA & LSTM - MODIS

To investigate the trends and patterns of time series data (MODIS data) using Autoregressive Integrated Moving Average (ARIMA) models and Long Short-Term Memory (LSTM) networks, and to check whether the current model can be used to predict future values of target variables.

Sentiment analysis for review classification using SWIVEL and a small dataset

To retrain a pretrained model, the Submatrix-wise Vector Embedding Learner (SWIVEL), using a small collected review dataset and to classify customer feedback reviews as either positive or negative.

Ensemble models for classification (combine deep learning with machine learning)

To develop a robust approach to classification (whether or not a person is wearing glasses) using an ensemble of models that includes machine learning models (Random Forest, Gradient Boosting, and Extra Trees) and a deep learning model (a neural network tuned with Bayesian optimization).

Bayesian optimization deep learning

To construct the architecture of a neural network (NN) and optimize the parameters of the NN.

Waveform decomposition vs. deconvolution

To compare two approaches to waveform lidar processing: decomposition vs. deconvolution.

Bayesian decomposition of waveform lidar and uncertainty analysis

To quantify the uncertainty of waveform decomposition.

Publications

A brief introduction of articles, presentations or talks.

Paper - Bayesian and Classical Machine Learning Methods: A Comparison for Tree Species Classification with LiDAR Waveform Signatures

We all know there is information contained in the waveform; how can we extract this information and improve existing applications? The following is an example of combining advanced statistical methods, such as Bayesian and machine learning methods, with waveform signatures for tree species identification using waveform lidar data.

November 2019

Paper - waveformlidar: An R Package for Waveform LiDAR Processing and Analysis

A brief introduction to the waveformlidar package, its specific usages, and the corresponding logic. Examples of how to use it can be found at https://github.com/tankwin08/waveformlidar/tree/master/vignettes

October 2019

Paper - Estimating aboveground biomass and forest canopy cover with simulated ICESat-2 data

An example of using ICESat-2 data to estimate forest biomass at a regional scale.

January 2019

Paper - Mapping forest aboveground biomass with a simulated ICESat-2 vegetation canopy product and Landsat data

An example of combining ICESat-2 data with Landsat data to estimate forest biomass and forest cover at a regional scale.

January 2019

Paper - From LiDAR Waveforms to Hyper Point Clouds: A Novel Data Product to Characterize Vegetation Structure

A new way to visualize and analyze waveform lidar data by converting it into the traditional lidar format (point cloud).

December 2018

Paper - Photon counting LiDAR: An adaptive ground and canopy height retrieval algorithm for ICESat-2 data

This study mainly demonstrates how to process ICESat-2 ATL03 data into ATL08 data using an adaptive framework. (1) An adaptive methodological framework was developed to process upcoming ICESat-2 data. (2) Basic algorithms were developed for ground and canopy photon classification with ICESat-2-like data. (3) Terrain and canopy heights were measured with MABEL and simulated ICESat-2 data.

February 2018

Paper - Detecting and Quantifying Standing Dead Tree Structural Loss with Reconstructed Tree Models Using Voxelized Terrestrial Lidar Data

To detect the biomass and volume change of dead trees from multi-temporal terrestrial lidar scans using reconstructed tree models and voxelization.

January 2018

Paper - Bayesian decomposition of full waveform LiDAR data with uncertainty analysis

To better understand the uncertainty of our waveform processing, a Bayesian method was introduced to assess the performance of different methods for waveform lidar processing. The methods have been collected in an R package named waveformlidar, which is available on CRAN (see https://github.com/tankwin08/waveformlidar).

August 2017

Paper - Gold - A novel deconvolution algorithm with optimization for waveform LiDAR processing

This paper introduced a novel way to process full waveform lidar data and compared it with existing methods, such as decomposition and RL deconvolution, to show the advantages of the new method. These methods have been collected in an R package named waveformlidar, which is available on CRAN (see https://github.com/tankwin08/waveformlidar).

April 2017

Experience

Senior Data Scientist

Colaberry Inc./Bayer Crop Science

Make business count on data and statistics.

  • Built an automated pipeline to extract and update multi-level data with SQL (DbVisualizer and TOAD).
  • Cleaned and wrangled unstructured lab data into usable knowledge, and optimized the supply chain pipeline by using the data to predict bio-workflow status.
  • Combined machine learning models with sentiment analysis (using the pretrained embedding model SWIVEL) to predict and sync the predicted yield during the growing season (Python).
  • Developed an automated framework to process big geospatial datasets such as Sentinel-2 (on Domino and AWS S3) using machine learning (e.g., random forest, deep learning) and Bayesian methods to estimate soil attributes with uncertainty over a large scale (R & Python).
  • Developed ARIMA and LSTM models (autoencoders) with uncertainty estimates to understand the drivers of patterns in time series data (e.g., sales and price) and to forecast future values (Python).
  • Predicted customer churn using historical sales and metadata such as customer service records and feedback, identified the features with the greatest impact on customer attrition, and made engagement suggestions (Python).
  • Conducted interaction analysis of genetics (G), environment (E), and treatment (T, e.g., seed rate) for hybrids, and developed a framework to assess the benefits of different G, E, and T combinations and to support an outcome-based pricing model.

July 2018 - Present

Postdoctoral Research Associate

Texas A&M University, College Station

  • Developed a validation plan for upcoming ICESat-2 data using advanced algorithms (machine learning and deep learning).
  • Developed algorithms for waveform LiDAR visualization and an R package.
  • Predicted corn and sorghum yield from UAV-acquired photos and 3D point clouds using Random Forest and deep learning algorithms such as CNN and SegNet.

Jan 2018 - July 2018

Research Assistant

Texas A&M University

  • Developed deconvolution and decomposition algorithms for waveform LiDAR processing using National Ecological Observatory Network (NEON) data.
  • Developed an R package for waveform data processing (waveformlidar, https://github.com/tankwin08).
  • Applied Bayesian methods to the decomposition of a large waveform LiDAR dataset using Amazon AWS and TAMU's supercomputer.
  • Identified tree species with uncertainty using waveform LiDAR data and machine learning methods (e.g., random forest, Bayesian logistic regression, SVM).
  • Developed multiple models (step-wise model, random forest model, and hierarchical Bayesian model) to estimate forest biomass changes and related uncertainty.
  • Developed algorithms for processing upcoming ICESat-2 data (photon counting LiDAR).

Sep 2013 - Dec 2017

Research Assistant

Beijing Normal University

  • Processed Landsat and MODIS images to extract parameters for watershed models and developed models to simulate eco-hydrological processes for the Abujiao River.
  • Analyzed land use and land cover change for the program “Regional ecological environmental process and safety control under the intensive agricultural development in Sanjiang Plain, Northeast China” supported by the NSF of China.
  • Collected archival data and developed methods to generate the spatial pattern map for the program “Heavy metal environmental health risk prevention and control key regional division and classification technology research” supported by Special Environmental Research Funds for Public Welfare.
  • Conducted field work and wrote reports for the program “Water volume and level monitoring of Baiyangdian demonstration project”.

Sep 2010 - Jun 2013

Education

Texas A&M University

Ph.D. (title: Advances in waveform and photon counting LiDAR processing for forest vegetation applications)
Applied statistics and geocomputation, Ecosystem Science and Management

Awards:

  • TAMU Distinguished Graduate Student Award for Excellence in Research - Doctoral.
  • NSF Doctoral Dissertation Improvement Grant (DDIG).
  • Teaching Certificate, Center for Teaching Excellence, Texas A&M University.

2013 - 2017

Beijing Normal University

Master (title: The application of SWAT model in the un-gauged basin - A case study of Abujiao River)
Environmental Science

2010 - 2013

Tianjin University of Science & Technology

Bachelor
Environmental Science

2006 - 2010