Data Mining and Analysis Techniques for the Halt of Tobacco Use

Exploring how data mining models classify tobacco use by analyzing biometric and demographic indicators.

Authors: Seth Chritzman & Hunter Minteer
Department of Computer Science, Slippery Rock University
CPSC 423: Data Mining & Analysis
Professor: Wahbeh Abdullah
Date: December 9, 2022

Introduction

Tobacco use remains one of the major causes of death and disease in the United States, where it is responsible for almost 500,000 deaths annually. Globally, more than eight million deaths each year are linked to tobacco use (World Health Organization, 2022). Understanding the patterns behind tobacco use requires combining several analytical approaches rather than relying on any single indicator.

In this study, data mining techniques are applied to determine whether an individual is a smoker or a non-smoker. This is valuable in practice: when medical professionals try to assess whether a patient uses tobacco, the number of variables involved can make it difficult to reach an accurate conclusion. Data mining and artificial intelligence (AI) can help make such assessments more accurate.

Data mining is the process of exploring large datasets to discover hidden patterns, trends, and relationships. Its two primary goals are prediction and description.

Description focuses on patterns interpretable by humans.
Prediction focuses on using attributes within the data to forecast other variables.

Artificial intelligence and machine learning allow us to uncover relationships that are invisible to the human eye. By applying these methods, we can predict smoker status based on health indicators and demographic factors.

Physiological and behavioral factors that may influence tobacco use include:

Sociodemographic attributes
Past quitting attempts
Nicotine dependence level
Use of supplementary nicotine products
Gender differences

In addition to behavioral data, biochemical indicators can be valuable predictors. These include:

Alanine transaminase (ALT)
Aspartate transaminase (AST)
Gamma-glutamyl transpeptidase (GTP)
Hemoglobin
LDL and HDL cholesterol levels

The objective of this study is to use AI and data mining methods to identify whether a patient is a tobacco user by analyzing these various physical and biochemical attributes.

Literature Review

Study 1 — Predicting Quit Probability Using WEKA

A study conducted in India (Rijhwani et al., 2020) focused on predicting the probability that a smoker successfully quits. Tobacco use is a significant contributor to preventable deaths in India, the second-largest tobacco consumer in the world. The researchers noted that cutting tobacco use in half could prevent over 180 million deaths.

To model smoking cessation, they used data mining techniques and the WEKA tool to test different algorithms:

Naïve Bayes
SMO (Sequential Minimal Optimization)
Random Forest
J48 Decision Tree
Decision Stump

The dataset included 16 attributes, such as age, gender, education, marital status, religion, occupation, number of dependents, tobacco type, frequency of use, years of habit, previous quit attempts, nicotine dependence, intervention type, and outcome.

After running the models with 10-fold cross-validation, the Decision Stump algorithm performed best with 55.87% accuracy, followed by SMO and J48. Although the dataset was limited to 655 individuals from a single clinic, the study provided valuable insight into how decision trees could be used for tobacco cessation prediction.
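
As an illustration of the evaluation protocol described above, the following minimal sketch runs 10-fold cross-validation on a decision stump (a one-level decision tree). It is written in plain Python rather than WEKA, the data are synthetic and invented for the example, and the function names (train_stump, predict_stump, k_fold_accuracy) are hypothetical. It does not reproduce the study's dataset or its reported accuracy.

```python
import random


def train_stump(rows, labels):
    """Fit a one-level tree: pick the (feature, threshold) with the fewest training errors."""
    best = None
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            for left_label in (0, 1):
                right_label = 1 - left_label
                errors = sum(
                    (left_label if r[f] <= threshold else right_label) != y
                    for r, y in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, left_label)
    _, f, threshold, left_label = best
    return f, threshold, left_label


def predict_stump(stump, row):
    f, threshold, left_label = stump
    return left_label if row[f] <= threshold else 1 - left_label


def k_fold_accuracy(rows, labels, k=10, seed=0):
    """Mean accuracy over k folds; every row is used for testing exactly once."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        stump = train_stump(
            [rows[i] for i in train_idx], [labels[i] for i in train_idx]
        )
        correct = sum(predict_stump(stump, rows[i]) == labels[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / k


# Synthetic, hypothetical data: feature 0 determines the label, feature 1 is noise.
rng = random.Random(42)
rows = [[rng.random(), rng.random()] for _ in range(200)]
labels = [1 if r[0] > 0.5 else 0 for r in rows]

cv_accuracy = k_fold_accuracy(rows, labels, k=10)
print(f"10-fold CV accuracy: {cv_accuracy:.3f}")
```

Because each fold is held out in turn, the averaged score reflects performance on data the model did not see during training, which matters for a dataset as small as 655 individuals.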

Study 2 — Investigating Tobacco Habits in Croatia

Another study compared the use of manufactured cigarettes and rolling tobacco in Croatia’s Slavonia region (Martinović, 2015). The dataset came from an online survey of primarily female participants aged 18–25.

The study aimed to determine whether price influenced tobacco choice, as rolling tobacco is cheaper. Results showed that:

45.1% of young adults were daily smokers
74.5% began smoking before age 18
47.1% were heavily exposed to tobacco before starting
39.2% wanted to quit, and nearly half had attempted multiple times

Researchers applied a J48 decision tree algorithm using 10-fold cross-validation, which effectively classified patterns in the dataset. However, the small participant pool (only 51 individuals) limited the statistical power. Despite this, the decision tree approach showed promise for small-scale behavioral modeling.
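
J48 is WEKA's implementation of the C4.5 algorithm, which ranks candidate splits by gain ratio. The sketch below shows a simplified version of that calculation for categorical attributes. The survey rows, attribute names, and function names are invented for illustration and are not taken from the Croatian dataset.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def gain_ratio(attribute, labels):
    """Information gain from splitting on `attribute`, divided by the split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the labels remaining after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(attribute)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy survey with eight respondents.
daily_smoker      = ["yes", "yes", "yes", "no", "no", "no", "yes", "no"]
started_before_18 = ["yes", "yes", "yes", "no", "no", "no", "yes", "yes"]
tobacco_type      = ["cig", "roll", "cig", "roll", "cig", "roll", "cig", "roll"]

for name, column in [("started_before_18", started_before_18),
                     ("tobacco_type", tobacco_type)]:
    print(name, round(gain_ratio(column, daily_smoker), 3))
```

In this toy data the tree would split on started_before_18 first, because it has the higher gain ratio. With only 51 respondents, as in the study, such rankings can change when a few rows are added or removed.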

Methods

Building on prior research, our study tested multiple algorithms using a percentage split between male and female participants. The models tested were:

Dummy Classifier (baseline)
Decision Tree
Naïve Bayes

Each algorithm was evaluated for accuracy:

Algorithm	Accuracy	Notes
Dummy Classifier	0.63	Baseline; impractical since it ignores underlying patterns
Naïve Bayes	0.71	Moderate accuracy; fast and efficient
Decision Tree	0.77	Highest accuracy; selected as the final model
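
To put the baseline row in context, the sketch below computes the accuracy of a classifier that always predicts the most frequent class. It assumes, as the source does not state, that the Dummy Classifier used a most-frequent strategy, and it uses the 36% smoker share reported in the Results section. The label list and the function name are hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(y == majority for y in test_labels) / len(test_labels)


# Hypothetical labels matching the reported class balance: 36% smokers (1), 64% non-smokers (0).
labels = [1] * 36 + [0] * 64

baseline = majority_baseline_accuracy(labels, labels)
lift = 0.77 - baseline  # improvement of the reported Decision Tree accuracy over this baseline

print(f"baseline accuracy: {baseline:.2f}")
print(f"decision tree lift: {lift:.2f}")
```

Under these assumptions the baseline lands near the reported 0.63, so the meaningful figure for each model is how far it rises above that level, not its raw accuracy.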

Our dataset contained various biometric and lifestyle variables:

Demographic: gender, age
Body metrics: weight, height, waist size
Physiological: eyesight, hearing
Blood pressure and blood chemistry: systolic and diastolic pressure, fasting blood sugar, cholesterol, triglycerides, HDL, LDL, hemoglobin
Kidney function: urine protein, serum creatinine
Liver enzymes: AST, ALT, GTP
Other: insurance status, irritability, smoking status

To prepare the dataset:

String values (e.g., “Male/Female”, “Yes/No”) were converted into binary form using dummy variables.
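To avoid the dummy variable trap, one dummy column from each category was dropped; keeping every column would make the dummies perfectly collinear, because the omitted category is already implied when the others are zero.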
Data preprocessing followed guidance from LearnDataSci (2022).
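
The encoding steps above can be sketched as follows. This is a minimal plain-Python illustration, not the authors' preprocessing code; the column names (gender, hearing_normal) and the helper one_hot_drop_first are hypothetical.

```python
def one_hot_drop_first(records, column):
    """Replace a string column with 0/1 dummy columns, dropping the first category."""
    categories = sorted({r[column] for r in records})
    kept = categories[1:]  # the dropped category is implied when all kept dummies are 0
    encoded = []
    for r in records:
        row = {k: v for k, v in r.items() if k != column}
        for category in kept:
            row[f"{column}_{category}"] = 1 if r[column] == category else 0
        encoded.append(row)
    return encoded


# Hypothetical records with two string-valued columns.
records = [
    {"gender": "Male", "hearing_normal": "Yes", "age": 40},
    {"gender": "Female", "hearing_normal": "No", "age": 35},
]

for column in ("gender", "hearing_normal"):
    records = one_hot_drop_first(records, column)

for row in records:
    print(row)
```

Each binary column collapses to a single 0/1 indicator (here gender_Male and hearing_normal_Yes), so no column can be reconstructed from the others.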

Once the data were cleaned, linear regression was used to evaluate the relationship between each variable and smoking status. We then ran the classification algorithms and found that the Decision Tree model produced the most accurate and interpretable results.
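
As a rough illustration of relating a single variable to smoking status, the sketch below fits a one-variable least-squares line and computes the Pearson correlation against a 0/1 smoking label. The marker values are invented, the function names are hypothetical, and the sketch does not compute the p-values reported later in this paper.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ols_slope_intercept(x, y):
    """Least-squares line y = slope * x + intercept for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return slope, mean_y - slope * mean_x


# Hypothetical values of one biochemical marker and the matching smoking labels (1 = smoker).
marker_level = [13.1, 14.0, 15.2, 16.1, 12.8, 15.8]
smoker = [0, 0, 1, 1, 0, 1]

r = pearson_r(marker_level, smoker)
slope, intercept = ols_slope_intercept(marker_level, smoker)
print(f"r = {r:.3f}, slope = {slope:.3f}, intercept = {intercept:.3f}")
```

A correlation near zero suggests the variable carries little linear information about smoking status, although a decision tree may still exploit non-linear effects that this measure misses.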

Results

The Decision Tree model produced the following performance metrics:

Metric	Score
Precision	0.693
Recall	0.690
F1 Score	0.690
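
For reference, the sketch below shows how precision, recall, and F1 are derived from true and predicted labels. The labels are invented, and because the source does not say how the reported scores were averaged across classes, this example scores the positive (smoker) class only.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one class, computed from label lists."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical labels: 1 = smoker, 0 = non-smoker.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision penalizes non-smokers flagged as smokers, recall penalizes smokers who are missed, and F1 is the harmonic mean of the two.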

While Naïve Bayes achieved slightly higher precision (≈0.70), the Decision Tree offered a better overall balance of accuracy and interpretability. The statistical tests produced the following p-values:

Variable	P-Value
Waist	0.829
LDL	0.617
Urine Protein	0.690
AST	0.001
All other variables	< 0.001

At the conventional 0.05 significance level, AST and the remaining variables, including GTP and hemoglobin, were statistically significant predictors of smoking status, whereas waist size, LDL cholesterol, and urine protein were not.

Additional observations:

Only 36% of the participants were smokers.
Smokers exhibited lower GTP and hemoglobin levels.
The most common smoking age range was 25–35.
Most smokers possessed dental insurance, suggesting socioeconomic links.

The Decision Tree effectively identified non-linear relationships and produced interpretable classifications for real-world application.

Conclusion

Our findings indicate that data mining and AI models can identify tobacco use from biochemical and demographic data with moderate accuracy. Among the tested algorithms, the Decision Tree achieved the highest accuracy (0.77) while remaining interpretable.

The study highlights liver enzyme levels such as GTP and AST as significant predictors distinguishing smokers from non-smokers; LDL cholesterol, by contrast, was not statistically significant (p = 0.617). These findings have direct implications for healthcare, where such models could help nurses and doctors identify smoking habits without relying solely on direct questioning.

Ultimately, data mining offers a powerful tool for predictive analysis in public health. By applying these techniques, professionals can better understand risk factors, design targeted interventions, and improve tobacco prevention and cessation efforts.

References

World Health Organization. (2022, May 24). Tobacco [Fact sheet].
Rijhwani, K., Mohanty, V., Aswini, Y. B., & Hashmi, S. (2020). Applicability of Data Mining and Predictive Analysis for Tobacco Cessation: An Exploratory Study. Front Dent. Retrieved from ResearchGate.
Martinović, T. (2015). Investigating Tobacco Usage Habits Using Data Mining Approach. ENTerprise REsearch InNOVAtion Conference. Retrieved from EconStor.
Centers for Disease Control and Prevention. (2014). Patterns of Tobacco Use Among U.S. Youth, Young Adults, and Adults. Retrieved from NIH.
LearnDataSci. (2022). Dummy Variable Trap. LearnDataSci Glossary.