R2 on RF models increasing between old and new data

Hi all, I'd like to ask for your insight. I am currently working on my thesis and I have run into something I just can't wrap my head around.

So, I have an old dataset (18,000 samples) and a new one (26,000 samples); the new one is made up of the old plus some extra samples. On both datasets I need to run a regression model to predict the fuel power consumption of an energy system (a cogenerator). The features I am using as predictors are ambient temperature, output thermal power, and output electrical power.
I trained an RF regression model on each dataset; both models were tuned with a hyperparameter grid search and cv = 5, and they turned out to be pretty different. I got significantly different results in terms of R² (old: 0.850, new: 0.935).
Such a difference in R² seems odd to me, and I would like to understand it better. I ran some further tests, in particular:

  1. Old model trained on the new dataset, and the new model on the old dataset: similar R² on the old and new ds;

  2. New model trained on increasing fractions of the new dataset: no significant change in R² (always similar to the final R² of the new model);

  3. Sub-datasets created as the old ds + increasing fractions of the difference between the new and old ds: here R² increases steadily from the old to the new ds.

Since test 2 seems to suggest that dataset size is not the driver, I am wondering whether test 3 means that the data added to the old set is more informative. Are there further tests I can run to assess this hypothesis, and how could I formulate it mathematically? Or are you aware of any other phenomena that may be going on here?

I am also adding some pics.

Thank you in advance! Every suggestion would be much appreciated.


From picture:


What your table is telling you (plain language)

In your “old + fraction of newly added samples” experiment, R² climbs a lot:

  • R²: 0.848 → 0.932 (big jump)

But your absolute error improves only a little:

  • RMSE: 48.0 → 46.6 (≈ 2.9% better)
  • MAE: 37.1 → 36.0 (≈ 3.0% better)
  • RRMSE%: 3.26 → 3.22 (almost unchanged)

That combination—R² increases strongly while RMSE/MAE barely move—is the classic signature of a changing “baseline difficulty” of the test data, not a dramatic improvement in the model.

Why: R² is not an absolute error metric. It measures how much better you are than a trivial model that always predicts the mean of the test set. In scikit-learn terms it's 1 − u/v, where u is your model's residual sum of squares and v is the variance-like "total sum of squares" of the target around its mean. (scikit-learn)
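To see that definition in action, here is a minimal sketch with made-up numbers (not your data): identical prediction errors give a very different R² once the spread of the target changes.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """R^2 as scikit-learn defines it: 1 - u/v, where u is the residual
    sum of squares and v is the total sum of squares around the mean."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    u = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    v = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - u / v

# Same absolute errors, different target spread -> very different R^2:
errors = np.array([1.0, -1.0, 1.0, -1.0])
narrow = np.array([10.0, 11.0, 12.0, 13.0])
wide = np.array([10.0, 20.0, 30.0, 40.0])
print(r2_score_manual(narrow, narrow + errors))  # -> 0.2
print(r2_score_manual(wide, wide + errors))      # -> 0.992
```

Both cases have an RMSE of exactly 1.0; only the baseline (the variance of the target) differs.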


The main driver in your table: the “spread” of the target is increasing

As you add more of the “diff” samples, your test set grows (from 3662 to 5411) and, more importantly, its composition changes: it contains an increasing share of the newly added samples (n_diff_test_used rises to 1749).

That matters because if the fuel-consumption values in the test set become more spread out (higher range/variance), then:

  • predicting the mean becomes a worse baseline,
  • and R² tends to go up, even if your RMSE stays similar.

Your numbers strongly suggest exactly that. If you back-calculate the implied target variance from your MSE and R² (just using the definition of R²), the “spread” of the target implied by your table increases by about 2× across the columns. Interpreting this without math: the added samples likely make fuel consumption values in the test folds much more varied than in the old-only case, so the same-sized errors look “better” in R² terms.
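You can reproduce that back-calculation from the table's headline numbers alone: since R² = 1 − MSE/Var(y), the implied target variance is Var(y) ≈ MSE/(1 − R²).

```python
def implied_target_variance(rmse, r2):
    """From R^2 = 1 - MSE / Var(y): Var(y) ~= MSE / (1 - R^2)."""
    return rmse ** 2 / (1.0 - r2)

# Headline numbers from the table: old-only folds vs full "old + diff" folds.
v_old = implied_target_variance(48.0, 0.848)
v_new = implied_target_variance(46.6, 0.932)
print(v_new / v_old)  # -> ~2.1: the implied target spread roughly doubles
```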

This is a known pitfall: R² depends on the range/variance of the data, so it can’t be compared cleanly across datasets with different target variability. (Dynamic Ecology)


Why this fits your three tests

Test 2 (fractions of the new dataset): little change

That looks like a typical plateau: once the model has enough examples from the “new” distribution, extra samples don’t change performance much.

Test 3 (old + fraction of added): R² rises steadily

Because the evaluation set in each step includes more of the “diff” samples, you’re effectively changing the “difficulty”/variance of what you score on at each step. In other words, test 3 is not only “more training data”; it’s also “different test data.”

So test 3 can show an increasing R² trend even if the model is only modestly improving (which your RMSE/MAE suggest).


What might be different about the newly added samples (domain context)

With cogenerator data, “new samples” often differ in ways that change both variance and predictability:

  1. Different operating regimes

    • more high-load points (bigger outputs, bigger fuel consumption)
    • fewer start/stop/transient points
    • different control mode mix (heat-led vs electric-led behavior)
  2. Different seasons / ambient ranges

    • broader temperature coverage tends to expand the range of fuel consumption and efficiency patterns.
  3. Different noise level

    • better sensors / filtering / logging quality
    • fewer bad or clipped measurements
  4. More repeated steady-state points

    • telemetry often has many near-duplicates; if random CV splits put near-duplicates across folds, performance can look much better than it will in a true “future” scenario.

Your table’s almost-flat RRMSE% is consistent with “the scale/spread of the target changed” rather than “the model got dramatically smarter.”


The most important issue in your current setup

You are not holding the test distribution fixed

In your table:

  • the number of “diff” points included in the test set grows from 0 to 1749
  • the total test size grows from 3662 to 5411

So each R² value is computed on a different mixture of old vs new points, with potentially different target variance and regime composition.

This alone can explain a large share of the R² jump.


Best next tests (these directly answer “is the new data more informative?”)

1) Freeze one test set and never change it

Pick one test set and keep it identical across comparisons:

  • Best for telemetry: an out-of-time split (train on earlier period, test on later period).
  • Or use TimeSeriesSplit to avoid training on “future-like” points and testing on “past-like” points. (scikit-learn)

Then compare models trained on:

  • old-only training data
  • old + added training data

Report RMSE/MAE as primary metrics and R² as secondary. R² is fine to report, but only when the test distribution is the same.
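A rough sketch of that comparison, using synthetic stand-in data (the shapes, coefficients, and noise level are all made up for illustration; your real pipeline would load the telemetry in time order instead):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)

# Hypothetical stand-in for the telemetry: three predictors, time-ordered rows;
# the first n_old rows play the role of the "old" dataset.
n_old, n_added = 1800, 800
X = rng.normal(size=(n_old + n_added, 3))
y = X @ np.array([5.0, 3.0, 2.0]) + rng.normal(scale=1.0, size=n_old + n_added)

# Frozen out-of-time test set: the last 20% of the combined data, never reused.
split = int(0.8 * len(X))
X_test, y_test = X[split:], y[split:]

results = {}
for name, end in {"old-only": n_old, "old+added": split}.items():
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[:end], y[:end])          # both models scored on the SAME test set
    pred = model.predict(X_test)
    results[name] = {"rmse": mean_squared_error(y_test, pred) ** 0.5,
                     "mae": mean_absolute_error(y_test, pred)}
    print(name, results[name])
```

Because the test set is identical in both rows, any metric difference here is attributable to the training data, not to a shifting baseline.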

2) Evaluate errors separately on “old” vs “added” points

Train a model on the full dataset (or train two models), then score on:

  • old-only test subset
  • added-only test subset

If the added-only subset has much lower MAE/RMSE, then “the new data is easier/more predictable given these 3 inputs” (often meaning less noise or more consistent regimes).
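A sketch of that per-subset scoring, assuming you can build a boolean `is_added` mask marking which rows came from the new batch (the synthetic data below deliberately makes the "added" rows less noisy, to show what the signature of "easier new data" looks like):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)

# Hypothetical stand-in data: the last third plays the "added" role and is cleaner.
n = 3000
X = rng.normal(size=(n, 3))
is_added = np.arange(n) >= 2000
noise = np.where(is_added, 0.3, 1.0)       # added samples: lower noise
y = X @ np.array([5.0, 3.0, 2.0]) + rng.normal(size=n) * noise

# One model, one frozen test set, then score the two subsets separately.
test = rng.random(n) < 0.25
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[~test], y[~test])
pred = model.predict(X[test])

mae_old = mean_absolute_error(y[test][~is_added[test]], pred[~is_added[test]])
mae_added = mean_absolute_error(y[test][is_added[test]], pred[is_added[test]])
print(f"MAE old-only subset:   {mae_old:.3f}")
print(f"MAE added-only subset: {mae_added:.3f}")
```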

3) Check whether old vs added are distributionally different (adversarial validation)

Create labels: old=0, added=1. Train a classifier to distinguish them.

  • AUC near 0.5 ⇒ distributions are similar
  • high AUC ⇒ strong dataset shift (the added data is meaningfully different)

This is a standard, practical drift diagnostic (often called adversarial validation). (arXiv)
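A minimal adversarial-validation sketch on synthetic data (here the "added" block is deliberately shifted in one feature, so the AUC comes out well above 0.5; on your data you would stack the real old and added feature matrices instead):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)

# Hypothetical example: "added" samples shifted in the first feature.
X_old = rng.normal(size=(2000, 3))
X_added = rng.normal(size=(1000, 3)) + np.array([1.5, 0.0, 0.0])

X = np.vstack([X_old, X_added])
y = np.r_[np.zeros(len(X_old)), np.ones(len(X_added))]   # old=0, added=1

clf = RandomForestClassifier(n_estimators=100, random_state=0)
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
print(f"adversarial-validation AUC: {auc:.3f}")  # near 0.5 => similar distributions
```

A useful bonus: the classifier's feature importances tell you *which* inputs drive the shift (ambient temperature vs load, say).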

4) Rule out “too-optimistic CV” from time dependence / grouping

If your samples are time-correlated or come from identifiable runs/days, avoid random K-fold:

  • TimeSeriesSplit for time-ordered data (scikit-learn)
  • GroupKFold if you can define groups (day/run/session/operating campaign) so near-duplicate conditions can’t leak across folds (scikit-learn)
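Both splitters give verifiable guarantees; this small self-check (with hypothetical "day" labels as the groups) confirms that no training fold ever sees the test fold's time period or groups:

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

n = 20
X = np.arange(n, dtype=float).reshape(-1, 1)
groups = np.repeat(np.arange(5), 4)   # hypothetical "day" labels, 4 samples per day

n_checked = 0

# TimeSeriesSplit: every training fold strictly precedes its test fold in time.
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    assert train_idx.max() < test_idx.min()
    n_checked += 1

# GroupKFold: all samples from one day/run end up in the same fold.
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, groups=groups):
    assert not set(groups[train_idx]) & set(groups[test_idx])
    n_checked += 1

print(n_checked, "folds checked, no leakage")
```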

5) Make the hyperparameter comparison fair (nested CV)

Grid search + reporting the best CV score can be optimistic. Use nested CV (outer loop evaluates, inner loop tunes). scikit-learn explicitly warns about optimism in non-nested tuning evaluation. (scikit-learn)
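A compact nested-CV sketch (synthetic data, and a deliberately tiny hypothetical grid so it runs quickly): the inner loop tunes `max_depth`, the outer loop produces the performance estimate you actually report.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3))
y = X @ np.array([5.0, 3.0, 2.0]) + rng.normal(size=300)

param_grid = {"max_depth": [3, 6, None]}            # small hypothetical grid
inner = KFold(n_splits=3, shuffle=True, random_state=0)   # tunes hyperparameters
outer = KFold(n_splits=5, shuffle=True, random_state=0)   # evaluates the tuned model

search = GridSearchCV(RandomForestRegressor(n_estimators=50, random_state=0),
                      param_grid, cv=inner)
nested_r2 = cross_val_score(search, X, y, cv=outer).mean()
print(f"nested-CV R^2: {nested_r2:.3f}")
```

The score you would have reported from `search.best_score_` alone is the optimistic one scikit-learn warns about; `nested_r2` is the honest estimate.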

6) A simple “variance-controlled” comparison (optional but very revealing)

If you want a quick sanity check without changing your whole pipeline:

  • Create test sets that have the same target distribution (e.g., stratify by fuel-consumption quantiles so each test set has similar ranges).
  • Then re-run your comparison.

If R² differences shrink a lot under matched target distributions, you’ve confirmed that the earlier R² gap was largely a variance/regime composition effect. The dependence of R² on variance/range is a known property. (Dynamic Ecology)
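One way to build such matched test sets is quantile stratification on the target; this helper (a hypothetical name, sketched here on synthetic data) samples the same fraction from every fuel-consumption quantile bin so the test set's spread tracks the full data's.

```python
import numpy as np

def quantile_stratified_test_mask(y, test_frac=0.2, n_bins=10, seed=0):
    """Pick a test set whose target distribution matches the full data:
    bin y into quantiles, then sample the same fraction from every bin."""
    rng = np.random.default_rng(seed)
    edges = np.quantile(y, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, y, side="right") - 1, 0, n_bins - 1)
    mask = np.zeros(len(y), dtype=bool)
    for b in range(n_bins):
        idx = np.flatnonzero(bins == b)
        take = rng.choice(idx, size=max(1, int(test_frac * len(idx))), replace=False)
        mask[take] = True
    return mask

# Synthetic target standing in for fuel consumption.
y = np.random.default_rng(4).normal(loc=300.0, scale=50.0, size=2000)
mask = quantile_stratified_test_mask(y)
print(y[mask].std() / y[~mask].std())   # close to 1: matched spread
```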


Bottom line for your specific case

  • Your table shows only modest improvement in absolute error (~3%).
  • The large R² increase is very plausibly driven by the test folds containing more variable fuel-consumption values (and possibly a different regime mix) as more “diff” samples are included.
  • The right way to confirm “new samples are more informative” is to hold the test set fixed, then compare MAE/RMSE (and R²) across training sets.

If you implement only one change: freeze an out-of-time test set and rerun the old-vs-new training comparison under that fixed test set, using MAE/RMSE as the primary metrics. (scikit-learn)