How to Build a Multiple Regression Model for Passing Yards

Ruven Kotz
8 minute read


Predicting NFL quarterback passing yards is one of the most popular challenges in sports analytics, and for good reason. QB passing yards props are among the most liquid betting markets, with lines for starting quarterbacks typically set between 220 and 280 yards. Fantasy football players obsess over these projections, and daily fantasy sports platforms like DraftKings weight passing yards heavily in their scoring systems. But here's the brutal truth I learned building this model: even with sophisticated statistical approaches, accurately predicting passing yards is incredibly difficult.

In this tutorial, I'll walk you through building a multiple regression model using Parlay Savant, a tool that removes the technical friction of data collection and analysis. We'll gather historical quarterback data, build a predictive model incorporating multiple statistical factors, make real predictions, and then test those predictions against actual NFL results. I'll be honest about both the successes and spectacular failures along the way. By the end, you'll understand why sports prediction is so challenging and have realistic expectations about what statistical models can and cannot do.

Step 1: Getting the Data with Parlay Savant

I started by prompting Parlay Savant to gather comprehensive quarterback data that would give me the foundation for a multiple regression model:

Get quarterback passing stats with team and opponent information for the past 2 seasons to build a comprehensive dataset. Include passing attempts, completions, yards, touchdowns, interceptions, completion percentage, passer rating, team points, opponent points, weather data, and home/away status. Focus on games with at least 10 passing attempts and completed regular season games only.

Parlay Savant processed this request by generating a complex SQL query that joined multiple tables - player_game_stats, players, games, teams, team_game_stats, and weather. The query used proper filtering for position='QB', minimum attempts, completed games, and regular season only. It also calculated completion percentages on the fly and handled the home/away logic by comparing team_id with home_team_id.
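For readers who want to replicate this step outside Parlay Savant, here's a rough sketch of what such a pull might look like in Python with pandas. The table names follow the ones mentioned above, but the exact column names, the connection string, and the season/status filters are assumptions, so treat this as illustrative only.

```python
# Sketch of the data pull described above; column names beyond those mentioned
# in the article (and the connection string) are assumptions.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost/nfl")  # hypothetical database

query = """
SELECT p.name AS quarterback,
       g.season,
       g.week,
       pgs.passing_attempts,
       pgs.completions,
       pgs.passing_yards,
       ROUND(100.0 * pgs.completions / NULLIF(pgs.passing_attempts, 0), 1) AS completion_pct,
       CASE WHEN pgs.team_id = g.home_team_id THEN 'home' ELSE 'away' END AS home_away
FROM player_game_stats pgs
JOIN players p ON p.player_id = pgs.player_id
JOIN games   g ON g.game_id   = pgs.game_id
WHERE p.position = 'QB'
  AND pgs.passing_attempts >= 10
  AND g.season_type = 'regular'
  AND g.status = 'completed'
"""
# The full query described above would also join team_game_stats (team/opponent points) and weather.

df = pd.read_sql(query, engine)
print(len(df), "rows")
```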

The resulting dataset was impressive: 500 records covering 69 quarterbacks from the 2024 and 2025 seasons. Here's a sample of what we got:

| quarterback | season | week | opponent | passing_attempts | passing_yards | home_away |
|---|---|---|---|---|---|---|
| Josh Allen | 2025 | 3 | Miami Dolphins | 28 | 213 | home |
| Tua Tagovailoa | 2025 | 3 | Buffalo Bills | 34 | 146 | away |
| Russell Wilson | 2025 | 2 | Dallas Cowboys | 41 | 450 | away |
| Dak Prescott | 2025 | 2 | New York Giants | 52 | 361 | home |
| Jared Goff | 2025 | 2 | Chicago Bears | 28 | 334 | home |

What surprised me was the massive variance in the data. Passing yards ranged from 24 to 509 yards, with a mean of 224.1 yards and standard deviation of 79.3 yards. This immediately told me that any model would struggle with the inherent volatility of NFL passing games.

Data Distribution and Key Patterns

The correlation analysis revealed some expected relationships: completions (0.803) and passing attempts (0.707) had the strongest correlations with passing yards, while team points (0.463) showed a moderate positive relationship. Interestingly, opponent points had almost no correlation (0.104), suggesting that defensive performance isn't captured well by simple scoring metrics.
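To reproduce that kind of correlation check on the dataframe from the query above, a couple of pandas lines suffice. This assumes the team and opponent point columns came through from the team_game_stats join described earlier; your column names may differ.

```python
# Correlation of candidate features with passing yards (column names assumed)
cols = ["completions", "passing_attempts", "team_points", "opponent_points", "passing_yards"]
corr = df[cols].corr()["passing_yards"].drop("passing_yards").sort_values(ascending=False)
print(corr.round(3))
# expected ordering: completions, passing_attempts, team_points, opponent_points
```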

Step 2: Building the Model with Parlay Savant

Next, I prompted Parlay Savant to build the actual multiple regression model:

Build a multiple regression model for predicting quarterback passing yards using historical performance averages, home/away status, and game context factors. Use proper train/test splits, calculate feature importance, and provide detailed performance metrics including R-squared, MAE, and RMSE. Focus on creating a model that could be used for real predictions.

Parlay Savant generated Python code that created a sophisticated feature engineering pipeline. The code calculated historical averages for each quarterback using only past games (avoiding data leakage), created binary features for home field advantage and high-scoring games, and properly handled missing values by falling back to overall quarterback averages.

The statistical process worked like this: First, the code grouped quarterbacks by season to calculate historical averages for passing yards, attempts, and completion percentage. Then it created additional features like home field advantage (binary 0/1) and high-scoring game indicators (total points > 45). Finally, it used scikit-learn's LinearRegression with an 80/20 train-test split.
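Parlay Savant's generated code isn't reproduced here, but a minimal sketch of the same workflow, continuing from the dataframe above, looks roughly like this. The leakage-safe expanding average, the first-game fallback, and the exact column names are my reconstruction of the description, not the tool's actual pipeline.

```python
# Minimal sketch of the modeling pipeline described above (not the exact generated code).
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

df = df.sort_values(["quarterback", "season", "week"]).copy()

# Leakage-safe historical averages: expanding mean over *previous* games only.
for col in ["passing_yards", "passing_attempts", "completion_pct"]:
    df[f"hist_{col}"] = (
        df.groupby("quarterback")[col]
          .transform(lambda s: s.expanding().mean().shift(1))
    )
    # Fallback for a QB's first game (the article falls back to overall QB averages).
    df[f"hist_{col}"] = df[f"hist_{col}"].fillna(df[col].mean())

df["is_home"] = (df["home_away"] == "home").astype(int)
df["high_scoring"] = ((df["team_points"] + df["opponent_points"]) > 45).astype(int)

features = ["hist_passing_yards", "hist_passing_attempts", "hist_completion_pct",
            "is_home", "high_scoring"]
X, y = df[features], df["passing_yards"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print("Test R^2 :", round(r2_score(y_test, pred), 3))
print("Test MAE :", round(mean_absolute_error(y_test, pred), 1))
print("Test RMSE:", round(float(np.sqrt(mean_squared_error(y_test, pred))), 1))
print(dict(zip(features, model.coef_.round(3))))  # feature coefficients
```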

The model performance was sobering but realistic:

  • Training R²: 0.362
  • Test R²: 0.421
  • Test MAE: 49.5 yards
  • Test RMSE: 59.6 yards

The feature importance revealed some fascinating insights:

  • Historical avg yards: 0.831 (most important predictor)
  • High-scoring games: +37.6 yards (huge impact)
  • Historical avg attempts: 0.732 (volume matters)
  • Home field advantage: +7.2 yards (modest boost)
  • Historical completion %: 0.340 (efficiency helps)

What struck me was that the model only explained about 42% of the variance in passing yards. This means 58% of what determines a quarterback's passing yards in any given game is essentially unpredictable using these basic features - game script, defensive adjustments, weather, injuries, and countless other factors matter enormously.

The accuracy within different thresholds was humbling (a quick way to recompute these hit rates is sketched after the list):

  • Within 20 yards: 20.0%
  • Within 30 yards: 33.0%
  • Within 50 yards: 57.0%
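
Continuing from the test split above, the threshold hit rates are a one-liner per threshold:

```python
# Share of test-set predictions landing within a given number of yards of the actual result
import numpy as np
abs_err = np.abs(y_test.to_numpy() - pred)
for threshold in (20, 30, 50):
    print(f"Within {threshold} yards: {(abs_err <= threshold).mean():.1%}")
```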

Step 3: Making Predictions

For the prediction phase, I prompted Parlay Savant to apply our trained model to upcoming games:

Use the trained multiple regression model to make predictions for upcoming Week 4 quarterback performances. Include top quarterbacks like Josh Allen, Dak Prescott, Joe Burrow, Patrick Mahomes, and Lamar Jackson. Show confidence intervals and analyze the key factors driving each prediction.

Parlay Savant created a prediction dataset with historical averages for each quarterback and applied our model coefficients. The process involved looking up each QB's historical performance, determining home/away status, and estimating whether the game would be high-scoring based on team tendencies.
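A minimal version of that scoring step, continuing from the fitted model above, just assembles one feature row per quarterback and calls model.predict. The historical attempt and completion-percentage figures below are placeholders (the article only quotes the yardage averages), so the outputs won't exactly match the table that follows.

```python
# Sketch: score upcoming games with the fitted model (attempt/completion values are placeholders).
upcoming = pd.DataFrame([
    {"hist_passing_yards": 268.5, "hist_passing_attempts": 34.0, "hist_completion_pct": 66.0,
     "is_home": 1, "high_scoring": 1},   # Josh Allen vs. Miami (attempts/comp% assumed)
    {"hist_passing_yards": 303.1, "hist_passing_attempts": 36.0, "hist_completion_pct": 68.0,
     "is_home": 1, "high_scoring": 1},   # Joe Burrow vs. Carolina (attempts/comp% assumed)
], index=["Josh Allen", "Joe Burrow"])[features]

print(model.predict(upcoming).round(1))
```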

Here are our Week 4 predictions:

| Quarterback | Team | Opponent | Home/Away | Historical Avg (yards) | Predicted (yards) |
|---|---|---|---|---|---|
| Josh Allen | Buffalo Bills | Miami Dolphins | Home | 268.5 | 289.1 |
| Dak Prescott | Dallas Cowboys | New York Giants | Home | 260.9 | 243.6 |
| Lamar Jackson | Baltimore Ravens | Cleveland Browns | Away | 245.3 | 257.6 |
| Joe Burrow | Cincinnati Bengals | Carolina Panthers | Home | 303.1 | 321.9 |
| Patrick Mahomes | Kansas City Chiefs | Los Angeles Chargers | Home | 252.0 | 273.5 |

The model's logic was transparent: Joe Burrow got the highest prediction (321.9 yards) because he had the highest historical average (303.1) plus benefits from home field (+7.2) and expected high-scoring game (+37.6). Dak Prescott's prediction was actually below his historical average because the model didn't expect a high-scoring game against the Giants.

Each prediction came with a roughly 68% confidence interval of ±59.6 yards, one test RMSE, which treats the residuals as approximately normal. So Josh Allen's 289.1 prediction had a range of 229.5-348.7 yards - a massive spread that highlights the uncertainty inherent in these predictions.
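As a sanity check on that interval arithmetic, here is the calculation for Josh Allen's prediction; the ±1 RMSE band only corresponds to about 68% coverage under the normal-residuals assumption mentioned above.

```python
# ~68% interval from +/- one test RMSE (assumes roughly normal residuals)
predicted, rmse = 289.1, 59.6
print(f"{predicted - rmse:.1f} to {predicted + rmse:.1f} yards")  # 229.5 to 348.7
```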

[Chart: model performance and Week 4 predictions]

Step 4: Testing Predictions

To validate our model, I prompted Parlay Savant to test our predictions against actual NFL results:

Test our model predictions against actual Week 3 NFL results. Compare predicted vs actual passing yards for quarterbacks like Josh Allen and Tua Tagovailoa. Calculate prediction accuracy and analyze why certain predictions failed or succeeded.

Parlay Savant retrieved the actual Week 3 results and created retroactive predictions using our model. This was the moment of truth - how would our sophisticated multiple regression model perform against reality?

The results were brutal:

| Quarterback | Opponent | H/A | Historical Avg | Predicted | Actual | Error (yards) |
|---|---|---|---|---|---|---|
| Josh Allen | Miami Dolphins | H | 247.1 | 270.0 | 213 | -57.0 |
| Tua Tagovailoa | Buffalo Bills | A | 255.7 | 266.2 | 146 | -120.2 |

Epic Misses:

  • Josh Allen: Predicted 270.0, Actual 213 (off by 57.0 yards) - The model expected a typical Josh Allen performance, but Buffalo built an early lead and leaned heavily on their running game in the second half.
  • Tua Tagovailoa: Predicted 266.2, Actual 146 (off by 120.2 yards) - This was a disaster. Miami fell behind early and couldn't establish any offensive rhythm. The model had no way to predict Miami's offensive line struggles or Buffalo's defensive adjustments.

Success Stories: Unfortunately, there were none. Both predictions missed by enormous margins.

The validation metrics were sobering (recomputed by hand in the sketch after this list):

  • Mean Absolute Error: 88.6 yards
  • Predictions within 30 yards: 0.0%
  • Model bias: -88.6 yards (consistently overestimated)
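
With only two games, those numbers are easy to verify by hand; bias here is the mean signed error (actual minus predicted), so a negative value means the model overestimated.

```python
# Recompute the Week 3 validation metrics from the two games above (actual minus predicted)
errors = [213 - 270.0, 146 - 266.2]
mae = sum(abs(e) for e in errors) / len(errors)
bias = sum(errors) / len(errors)
print(round(mae, 1), round(bias, 1))  # 88.6 -88.6
```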

Conclusion: What We Learned

This experiment with multiple regression for passing yards prediction was both educational and humbling. Our model achieved a respectable 42% R-squared on test data, which means it explained less than half the variance in quarterback passing yards. The 88.6-yard average error on real predictions shows just how difficult sports prediction truly is.

What the model got right:

  • Historical performance is the strongest predictor (coefficient of 0.831)
  • High-scoring games significantly boost passing yards (+37.6 yards on average)
  • Home field advantage provides a modest boost (+7.2 yards)
  • The model's feature relationships made intuitive sense

What the model missed:

  • Game script effects (teams with early leads run more)
  • Defensive adjustments and matchup-specific factors
  • Injury impacts and player availability
  • Weather conditions and field conditions
  • Coaching decisions and play-calling tendencies

Using Parlay Savant removed all the technical friction - I didn't need to write SQL queries, wrangle data, or debug Python code. The tool handled the database connections, feature engineering, and model building seamlessly. This allowed me to focus on the analytical insights rather than getting bogged down in implementation details.

However, Parlay Savant can't solve the fundamental challenge of sports prediction: the inherent randomness and complexity of NFL games. About 58% of passing yard variance remains unexplained even with multiple predictive features. This suggests that successful sports betting requires either finding edges the market hasn't priced in, or accepting that long-term profitability comes from small advantages compounded over many bets.

Potential improvements for future models:

  • Incorporating defensive rankings and matchup data
  • Adding weather variables (wind speed, precipitation)
  • Including injury reports and player availability
  • Using more sophisticated algorithms (random forests, neural networks)
  • Adding real-time betting line movements as features

The harsh reality is that if predicting passing yards were easy, sportsbooks wouldn't offer these markets. Our 0% accuracy rate within 30 yards on real predictions serves as a reminder that sports prediction is extraordinarily difficult, and anyone claiming consistent success should be viewed with skepticism. The best approach is building models that capture general tendencies while maintaining realistic expectations about their limitations.
