New York Times Spelling Bee

Classifying Puzzles Based on Their Features

Summary

Using data I collected from New York Times Spelling Bee puzzles, two classmates and I performed exploratory data analysis and statistical testing before creating machine learning models to classify each puzzle as having bingo or not having bingo. Both supervised and unsupervised learning models were tested, and the best-performing model from each was tuned. Our best model was logistic regression with an accuracy of 74.4%.

Project Details

In my Intro to Data Science class I got the opportunity to do a machine learning project in a small group. I worked with two classmates, and since we were allowed to use a dataset of our choosing, I suggested one I had been building myself.

Since August 1, 2023 I have recorded the letters and "hint line" in each day's New York Times Spelling Bee puzzle. At the time we worked on this project, I had just over 600 puzzles. For this project we focused on just one aspect of the puzzle; I have many more ideas to explore once I finish my official data collection on July 30, 2027.

Spelling Bee game in progress

Background Information

Each game has seven letters, one of which sits in the center. The center letter must appear in every word, making it the most important. Each puzzle has a set list of answer words and therefore a fixed number of words and points, with longer words worth more points.

Additionally, each puzzle has at least one pangram, a word containing all seven letters at least once. Pangrams are worth their length in points plus seven.
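The scoring above can be sketched as a small function. This is an illustration, not the official implementation; it assumes the commonly cited NYT rules that four-letter words score one point and longer words score their length, plus the pangram bonus described above:

```python
def score_word(word: str, letters: set[str]) -> int:
    """Score one Spelling Bee answer.

    Assumed rules: 4-letter words score 1 point, longer words score
    their length, and a pangram (a word using all seven letters at
    least once) earns an extra 7 points on top of its length.
    """
    points = 1 if len(word) == 4 else len(word)
    if set(word) >= letters:  # uses every puzzle letter: pangram bonus
        points += 7
    return points

# With letters {a, b, c, d, e, f, g}: "face" scores 1, "facade" scores 6.
```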

Bingo is another feature that some puzzles have and others don't: a puzzle has bingo if each of the seven letters begins at least one word. The puzzle shown does not have bingo, since "bingo" is not written in its hint line.
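The bingo rule is simple to express in code. A minimal check, assuming the full answer list is available (which the live puzzle hides until the next day):

```python
def has_bingo(letters: str, answers: list[str]) -> bool:
    """A puzzle has bingo when every one of its letters
    begins at least one word in the answer list."""
    first_letters = {word[0] for word in answers}
    return all(letter in first_letters for letter in letters)
```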

Finally, each puzzle is created by Sam Ezersky, who has some rules for creating a puzzle.

Spelling Bee hint page where data was collected from

Research Question and Hypothesis

Our research question was:

How do point total, number of words, and letters in a NYT Spelling Bee puzzle contribute to whether or not the puzzle has bingo?

This led to our null hypothesis:

If a classification machine learning model is trained on the letters and the number of points and words in a puzzle, then it will not be able to predict the presence of bingo with greater than 50% accuracy,

and our alternative hypothesis:

If a classification machine learning model is trained on the letters and the number of points and words in a puzzle, then it will be able to predict the presence of bingo with greater than 50% accuracy.

Our rationale was that more words may mean bingo, since there are more words available to begin with each letter, and that certain letters may matter on their own, since few words begin with X, for example. We set the bar at 50% accuracy because that is what random guessing would achieve. Additionally, accuracy is a reliable metric here because the two classes are fairly balanced and neither false positives nor false negatives are more costly.

Data Preprocessing

My data set originally had 40 columns, most of which were only necessary because I was storing it in an Excel sheet.

Most of the features were dropped, leaving only the seven letters, points, words, number of pangrams, number of vowels, and the dependent variable, bingo.

The letter columns were one-hot encoded: the center letter got its own set of columns, while the other six letters, which are of equal importance, shared one encoding, so each row has six 1s spread across the shared letter columns.

The other numerical columns had a StandardScaler applied to account for their large numeric ranges. Finally, one puzzle was removed as a qualitative outlier: the creator has been quoted as saying there will never be a puzzle with an S, yet the 2,500th puzzle contained an S, so it was dropped from the dataset.
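The preprocessing steps above can be sketched roughly as follows. The column names and toy values here are assumptions for illustration, not the real spreadsheet's labels:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy rows standing in for the real puzzle data.
df = pd.DataFrame({
    "center": ["e", "a"],
    "outer": [list("bcdfgh"), list("cdfghj")],
    "points": [180, 95],
    "words": [45, 28],
    "pangrams": [1, 2],
    "vowels": [2, 3],
    "bingo": [1, 0],
})

# Center letter: standard one-hot, one column per observed center letter.
center_ohe = pd.get_dummies(df["center"], prefix="center")

# Outer letters: the six non-center letters share one encoding, so each
# row ends up with six 1s spread across the letter columns.
outer_ohe = (
    pd.DataFrame([{f"letter_{l}": 1 for l in row} for row in df["outer"]])
    .fillna(0)
    .astype(int)
)

# Numeric features: scale to zero mean and unit variance.
numeric = ["points", "words", "pangrams", "vowels"]
scaled = pd.DataFrame(
    StandardScaler().fit_transform(df[numeric]), columns=numeric
)

X = pd.concat([center_ohe, outer_ohe, scaled], axis=1)
y = df["bingo"]
```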

Columns of the DataFrame showing one-hot encoded letters

Exploratory Data Analysis

Before creating our machine learning models, we needed to explore the other variables of the dataset.

Bingo was our target variable, but we needed to examine the relationships between Bingo and the other variables.

Data visualizations showing the relationships between Bingo and other variables

Statistical Testing

I worked on the statistical testing portion of the project. T-tests and chi-squared tests of independence were run to compare each variable to bingo. Only points, words, center E, and having N, X, or Y had a statistically significant relationship with bingo after applying a Bonferroni correction. This information was used to create two of each machine learning model: one with all of the features, and one with only the statistically significant features.
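These tests can be sketched with scipy on made-up data; the group values, contingency counts, and number of tests below are illustrative assumptions, not the project's actual numbers:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical point totals for bingo vs. non-bingo puzzles.
points_bingo = rng.normal(170, 30, 300)
points_no_bingo = rng.normal(150, 30, 300)

# Welch's t-test: does the mean point total differ between the groups?
t_stat, p_t = stats.ttest_ind(points_bingo, points_no_bingo, equal_var=False)

# Chi-squared test of independence: is having a given letter related
# to bingo? Rows: letter present/absent; columns: bingo / no bingo.
table = np.array([[80, 40],
                  [120, 160]])
chi2, p_chi, dof, _ = stats.chi2_contingency(table)

# Bonferroni correction: divide alpha by the number of tests run.
n_tests = 30  # e.g. points, words, and each letter column
alpha = 0.05 / n_tests
significant = p_t < alpha
```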

Supervised Learning

Four supervised learning models were tested: logistic regression, random forest classifier, gradient boosting classifier, and support vector classifier. Logistic regression with all features had the best results and was tuned to increase accuracy. The tuned logistic regression model had L2 regularization, the lbfgs solver, and C = 0.1. The resulting accuracy was 74.4%, which was found to be statistically significant compared to 50% using 10-fold cross-validation and a t-test.
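The tuned model and its significance check can be sketched like this. The data here is a synthetic stand-in for the real puzzle features; only the hyperparameters match the configuration reported above:

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the encoded letters, points, and word counts.
X, y = make_classification(n_samples=600, n_features=30, random_state=0)

# The tuned configuration: L2 penalty, lbfgs solver, C = 0.1.
model = LogisticRegression(penalty="l2", solver="lbfgs", C=0.1, max_iter=1000)

# 10-fold cross-validated accuracy, then a one-sample t-test of the
# fold scores against the 50% chance level.
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
t_stat, p_value = stats.ttest_1samp(scores, 0.5)
```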

Unsupervised Learning

I also worked on the unsupervised learning portion. I tested three models: K-means, spectral clustering, and Gaussian mixture clustering. Since we needed to classify puzzles as bingo or no bingo, each model was created with two clusters. The model with the best accuracy was the Gaussian mixture with all features.

This model was tuned to increase the accuracy, and the best-performing hyperparameters were two components and a full covariance type. The accuracy was 66.6%, which was found not to be statistically significant compared to 50% using 10-fold cross-validation and a t-test.
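A minimal sketch of the tuned mixture model on stand-in data. One subtlety worth showing: cluster ids from an unsupervised model are arbitrary, so they have to be aligned with the true labels before computing accuracy:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic two-cluster data standing in for the puzzle features.
X, y = make_blobs(n_samples=600, centers=2, random_state=0)

# Best-performing configuration reported above.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
labels = gmm.fit_predict(X)

# Cluster 0/1 assignment is arbitrary: score both orientations and
# keep the better one.
acc = max(np.mean(labels == y), np.mean(labels != y))
```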

As shown, the area under the receiver operating characteristic curve is only 0.68, which is a poor score, closer to guessing than to a perfect model.

Receiver Operating Characteristic (ROC) curve for the best performing unsupervised learning model
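For reference, the area under the ROC curve can be computed directly from predicted scores. A toy example (not the project's data) showing how a mid-range AUC arises; 0.5 is chance and 1.0 is a perfect ranking:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores: one negative outranks one positive,
# so 3 of the 4 positive/negative pairs are ordered correctly.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
auc = roc_auc_score(y_true, y_score)  # 3/4 = 0.75
```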

Discussion

We were able to reject the null hypothesis, since the logistic regression model did achieve significantly higher than 50% accuracy. The results of this project helped us understand which letters in a puzzle may be more significant and gave us a deeper understanding of what contributes to bingo appearing.

Future

I plan to continue collecting puzzles every day until I have exactly four years' worth. This larger sample size will allow for more analysis projects in the future, including an examination of repeating pangrams, relationships between letters and points, and using feature reduction to see whether accuracy can be improved.


© Malin Morris, 2025