When working with time series data, in particular financial data, how do you handle the missing values? Do you omit these rows completely, forward fill or back fill? I understand that this may be personal preference but it would be great to hear feedback from others who've faced the same issue.
I'm not familiar with financial data, but for other time series data (specifically industrial data), my answer is that it depends. In some cases I've linearly interpolated, and in others I've forward filled. I personally don't use back fills, but others might in their situations. There's Impute.jl, which has some of these methods implemented, but I'm not sure if there are any other packages.
@Andrew Dinhobl Thank you, I'm leaning towards forward filling the data. I was just about to ask how to forward fill, and it looks like the package you mentioned is perfect for this. From a brief overview, I'm assuming that LOCF (Last Observation Carried Forward) is essentially another way of saying forward filling.
I believe so, but I haven't used it personally. As far as interpolation vs. ffill, for me the decision depends on 1) what you are doing with the data and what model types you are using, and 2) what your process is like. For 2, I do a lot of ffilling for processes that are closer to piecewise step functions, and more interpolation for smoother processes. If the process is neither smooth nor stepwise, then you probably can't fill it unless you have a specific model of your process. If your data has huge gaps, then I find it scary to interpolate, but for smaller gaps I feel more comfortable. Maybe not the most statistically sound technique, but those are some considerations.
I imagine there are similar considerations for backfilling, but again, domain-specific.
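To make the forward-fill versus interpolation distinction above a bit more concrete, here is a minimal sketch in plain Julia. The helper names `ffill` and `lininterp` and the `maxgap` keyword are hypothetical, written only for illustration; they are not the Impute.jl API. The sketch assumes floating-point values with gaps marked as `missing`.

```julia
# Hypothetical helpers in plain Julia (not part of Impute.jl).

# Forward fill: carry the last observed value into each missing slot.
# A leading `missing` stays missing because there is nothing to carry.
function ffill(v::AbstractVector)
    out = copy(v)
    for i in 2:length(out)
        if ismissing(out[i])
            out[i] = out[i - 1]
        end
    end
    return out
end

# Linear interpolation across interior gaps. Gaps longer than `maxgap`
# and gaps touching either end of the vector are left as `missing`.
# Assumes floating-point data.
function lininterp(v::AbstractVector; maxgap::Int = typemax(Int))
    out = copy(v)
    n = length(out)
    i = 1
    while i <= n
        if ismissing(out[i])
            j = i
            while j <= n && ismissing(out[j])
                j += 1
            end
            # The gap covers indices i:(j - 1); its observed neighbours
            # are at i - 1 and j, when both exist.
            if i > 1 && j <= n && (j - i) <= maxgap
                lo, hi = out[i - 1], out[j]
                for k in i:(j - 1)
                    out[k] = lo + (hi - lo) * (k - (i - 1)) / (j - (i - 1))
                end
            end
            i = j
        else
            i += 1
        end
    end
    return out
end

x = Union{Missing, Float64}[1.0, missing, missing, 4.0]
stepwise = ffill(x)                  # holds 1.0 until the next observation
smooth   = lininterp(x)              # ramps linearly from 1.0 to 4.0
guarded  = lininterp(x; maxgap = 1)  # gap of 2 exceeds the limit, left as-is
```

The `maxgap` guard is one way to encode the "small gaps are fine, huge gaps are scary" rule of thumb; the right threshold is domain-specific.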
@Andrew Dinhobl Great points, thank you. In my case, I'd say the process is stepwise. Some symbols in the stock market have a lot more activity, and in some cases this leads to double or triple the rows for one symbol compared with less active ones. So it seems like forward fill would be the best approach here. The financial data providers tend to include a row only when something changes, so if 30 seconds pass without a change, you'll end up with a 30-row gap if your time series is in 1-second intervals. In this case, forward filling essentially reproduces the actual data that was in effect at each time, which the provider simply did not repeat.
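As a rough sketch of the situation described above, the snippet below expands change-only quotes onto a regular 1-second grid and forward fills the gaps. The `ticks` data, the `to_grid` helper, and the integer-second timestamps are all made up for illustration; real provider data would have its own timestamp type and columns.

```julia
# Hypothetical change-only quotes: (second, price) pairs, one per change.
ticks = [(0, 100.0), (3, 100.5), (4, 100.4)]

# Place each tick on a 1-second grid from `tstart` to `tstop`, then
# forward fill so that every second shows the last quoted price.
function to_grid(ticks, tstart::Int, tstop::Int)
    grid = Vector{Union{Missing, Float64}}(missing, tstop - tstart + 1)
    for (t, p) in ticks
        if tstart <= t <= tstop
            grid[t - tstart + 1] = p
        end
    end
    # Forward fill: seconds without a new quote inherit the previous price.
    for i in 2:length(grid)
        if ismissing(grid[i])
            grid[i] = grid[i - 1]
        end
    end
    return grid
end

prices = to_grid(ticks, 0, 6)
# Seconds 1 and 2 repeat the price from second 0; seconds 5 and 6
# repeat the price from second 4.
```

Seconds before the first quote would stay `missing` under this sketch, which seems like the honest result since no price was known yet.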
There is also a JuliaFinance GitHub org, I believe, and there is a category of finance packages here: https://juliapackages.com/c/finance.
@Andrew Dinhobl Awesome, thank you. I'll definitely check out those packages too. I'm sure there will be some really useful things to help with my analysis there. By the way, Impute.jl worked wonderfully and it was so simple to forward fill with just Impute.locf(df). Thanks again.
Another option is to accept the sparsity and use a model that can handle irregularly spaced observations. I'm not sure about finance, but in some domains filling will actually introduce incorrect values.
Last updated: Nov 22 2024 at 04:41 UTC