Stream: helpdesk (published)

Topic: Back Fill vs Forward Fill?


BryanB (Mar 19 2021 at 16:09):

When working with time series data, particularly financial data, how do you handle missing values? Do you omit those rows completely, forward fill, or back fill? I understand this may come down to personal preference, but it would be great to hear from others who've faced the same issue.

Andrew Dinhobl (Mar 19 2021 at 16:20):

I'm not familiar with financial data, but for other time series data (specifically industrial data), I think it depends. In some cases I've linearly interpolated, and in others I've forward filled. I personally don't use back fills, but others might in their situations. Impute.jl has some of these methods implemented, but I'm not sure if there are any others.

BryanB (Mar 19 2021 at 16:24):

@Andrew Dinhobl Thank you, I'm leaning towards forward filling the data. I was just about to ask how to forward fill, and it looks like the package you mentioned is perfect for this. From a brief look, I'm assuming that LOCF (Last Observation Carried Forward) is essentially another name for forward filling.
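
For readers comparing the two fills discussed in this thread, here is a minimal hand-written sketch of what forward fill (LOCF) and back fill do to a vector with `missing` entries. The helper names `ffill` and `bfill` and the sample `prices` are made up for illustration and are not part of Impute.jl or any other package.

```julia
# Forward fill (LOCF): replace each missing value with the most recent observed one.
# Leading missings stay missing because there is nothing earlier to carry forward.
function ffill(xs::AbstractVector)
    out = similar(xs, Union{Missing, nonmissingtype(eltype(xs))})
    prev = missing
    for i in eachindex(xs)
        if !ismissing(xs[i])
            prev = xs[i]
        end
        out[i] = prev
    end
    return out
end

# Back fill: the mirror image, carrying the next observation backward.
# Trailing missings stay missing.
bfill(xs::AbstractVector) = reverse(ffill(reverse(xs)))

prices = [missing, 10.0, missing, missing, 10.5, missing]
ffill(prices)  # [missing, 10.0, 10.0, 10.0, 10.5, 10.5]
bfill(prices)  # [10.0, 10.0, 10.5, 10.5, 10.5, missing]
```

Note that back fill writes a later observation into an earlier timestamp, so the filled rows contain information that was not yet available at that time.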

Andrew Dinhobl (Mar 19 2021 at 16:33):

I believe so, but I haven't used it personally. As for interpolation vs. forward fill, for me the decision depends on 1) what you are doing with the data and what model types you are using, and 2) what your process is like. For 2, I do a lot of forward filling for processes that are closer to piecewise step functions, and more interpolation for smoother processes. If the process isn't smooth or stepwise, then you probably can't fill it unless you have a specific model of your process. If your data has huge gaps, I find it scary to interpolate, but for smaller gaps I feel more comfortable. Maybe not the most statistically sound technique, but those are some considerations.
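
One way to act on the "small gaps only" rule of thumb is to interpolate linearly across short runs of missing values and leave long runs alone. The sketch below assumes evenly spaced samples; the function name `interp_small_gaps`, the `max_gap` keyword, and the sample data are hypothetical and not taken from any package.

```julia
# Linearly interpolate interior runs of missing values, but only when the run
# is at most `max_gap` samples long. Longer runs and leading/trailing missings
# are left untouched. Assumes evenly spaced samples.
function interp_small_gaps(xs::AbstractVector{<:Union{Missing, Real}}; max_gap::Integer = 3)
    out = Vector{Union{Missing, Float64}}(xs)   # working copy
    n = length(out)
    i = 1
    while i <= n
        if ismissing(out[i])
            j = i
            while j <= n && ismissing(out[j])   # find the end of this run
                j += 1
            end
            gap = j - i                         # number of missing samples in the run
            if i > 1 && j <= n && gap <= max_gap
                lo, hi = out[i - 1], out[j]     # observed values on either side
                for k in i:(j - 1)
                    out[k] = lo + (hi - lo) * (k - i + 1) / (gap + 1)
                end
            end
            i = j
        else
            i += 1
        end
    end
    return out
end

temps = [1.0, missing, 3.0, missing, missing, missing, missing, 8.0]
interp_small_gaps(temps; max_gap = 2)  # fills the single-sample gap, leaves the 4-sample gap missing
interp_small_gaps(temps; max_gap = 4)  # fills both gaps
```

The right value for `max_gap` depends on how quickly the underlying process changes relative to the sampling interval, so it is a judgment call rather than something the code can decide.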

Andrew Dinhobl (Mar 19 2021 at 16:35):

I imagine there are similar considerations for backfilling, but again, domain-specific.

BryanB (Mar 19 2021 at 16:45):

@Andrew Dinhobl Great points, thank you. In my case, I'd say the process is stepwise. Some symbols in the stock market have a lot more activity, which in some cases means double or triple the rows for one symbol compared to less active ones. So it seems like forward fill would be the best approach here. Financial data providers tend to include a row only when something changes, so if a symbol goes 30 seconds without a change, you end up with a 30-row gap in a time series at 1-second intervals. In this case, forward filling essentially reproduces the value that was actually in effect at each of those times, which the provider simply did not repeat.
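
The situation described above, where a provider emits a row only on change, can be sketched as expanding the sparse ticks onto a regular 1-second grid and carrying the last known value forward. This is an illustrative sketch using only the `Dates` standard library; `to_second_grid`, `ticks`, and `quotes` are hypothetical names, and the timestamps are assumed to be sorted in ascending order.

```julia
using Dates

# Expand change-only observations onto a regular 1-second grid, carrying the
# last known value forward. Assumes `times` is sorted ascending and that
# `times` and `vals` have the same length.
function to_second_grid(times::Vector{DateTime}, vals::Vector{Float64})
    grid = collect(first(times):Second(1):last(times))
    filled = Vector{Float64}(undef, length(grid))
    j = 1                                   # index of the latest tick at or before the grid time
    for (i, t) in enumerate(grid)
        while j < length(times) && times[j + 1] <= t
            j += 1
        end
        filled[i] = vals[j]
    end
    return grid, filled
end

# Two ticks 30 seconds apart, as in the example above.
ticks  = [DateTime(2021, 3, 19, 9, 30, 0), DateTime(2021, 3, 19, 9, 30, 30)]
quotes = [101.25, 101.30]
grid, filled = to_second_grid(ticks, quotes)
length(grid)   # 31 rows: 09:30:00 through 09:30:30
```

Every grid row between the two ticks carries the first quote, which matches the value that was in effect at that second.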

Andrew Dinhobl (Mar 19 2021 at 17:17):

There is also a JuliaFinance GitHub org, I believe, and there is a category of finance packages here: https://juliapackages.com/c/finance.

BryanB (Mar 19 2021 at 17:27):

@Andrew Dinhobl Awesome, thank you. I'll definitely check out those packages too. I'm sure there will be some really useful things to help with my analysis there. By the way, Impute.jl worked wonderfully and it was so simple to forward fill with just Impute.locf(df). Thanks again.

Brian Chen (Mar 20 2021 at 18:33):

Another option is to accept the sparsity and use a model that can handle discontinuities. I'm not sure about finance, but in some domains filling will actually introduce incorrect values.


Last updated: Nov 22 2024 at 04:41 UTC