Read and Index your data with pandas | Time Series in Python Part 1


Hey there, welcome to this Data Science Dojo video tutorial series on time series. If you followed along with our automated web scraping meetup video, you would’ve scraped Reddit user comments every hour and scored the overall sentiment, or emotion, towards U.S. political news events. Now that you’ve collected all these observations of users’ sentiment, which is a numeric value anywhere between negative one and positive one (so from very negative feelings to very positive), the next thing you might want to do is predict the next few hours ahead, to see if users’ sentiment will likely continue to fall into negativity or change direction, so you can act on it beforehand. This could also apply to customer satisfaction, for example, where you might want to predict, say, a few hours in advance what customers would likely feel, so you know when it’s best to pay extra special attention to customers.

So in this video tutorial we’ll be covering some basics
of how to build a time series model to predict users’ sentiment in the next few hours into the future, that is, from the current or last timestamp in the data set. We’re going to be building this time series model in Python, as this was a top request from our viewers. However, if you are more comfortable with R, I have included the equivalent R script for you to follow along with. Just see the links below this video to access both the R script and the Python script.

So in part 1, in this video, we’ll go over reading and indexing our data for time series, checking that the data meets the requirements or assumptions for time series modeling, and transforming our data to ensure it meets those requirements. Then in part 2, we’ll get on to the modeling part of time series, input our terms, and predict users’ sentiment a few hours ahead. And in part 3, we’ll evaluate our predictions, and then we’ll discuss issues with our model and how we can take this further and learn more in depth. So let’s get to it.

So these are the Python
packages that we’ll be using to build our time series model. We’re going to be using pandas for importing and indexing our data as a time series. We’ll be using matplotlib to plot, and we’re using statsmodels to build our time series model. We’ll also be using statistics to help us calculate the mean absolute error to evaluate the model.

So, first things first, we need to read our CSV file of our train data set as a univariate, or single-outcome, series. We’re going to use date/time as the row index. So let’s go ahead and do this. We’ll just call it hourly sentiment series. Okay, and we’re going to use pandas’ read_csv function, and we’re going to give it our CSV file in the current working directory, our train data set, which is a subset of our full data. We’re going to use the date/time column to index our data, and that is the first column in our data set, so it’s at position 0. We want to parse these dates, obviously, and we’re also going to use this squeeze option here just to ensure it returns a series.

Okay, cool. Now we’re going to print this out, just to have a quick look at it. We’ll save this and we’re going to run it in our terminal. Okay, cool. So it’s indexed it properly from what I can see, but another way we can look at this is to print the index itself to make sure it’s a date/time index.
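
A minimal sketch of this read-and-index step, assuming a small inline CSV in place of the train data set (the column names `datetime` and `sentiment` and all values are made up for illustration). Recent pandas versions no longer accept a `squeeze` argument in `read_csv`, so this sketch calls `.squeeze("columns")` on the result instead, which should give the same kind of object.

```python
import io

import pandas as pd

# Hypothetical stand-in for the train CSV: the first column is the timestamp.
csv_text = """datetime,sentiment
2019-01-01 00:00:00,0.12
2019-01-01 01:00:00,-0.05
2019-01-01 02:00:00,-0.31
2019-01-01 03:00:00,0.08
"""

# Use column 0 as the row index and parse it as dates.
frame = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

# Collapse the single-column DataFrame into a Series.
hourly_sentiment_series = frame.squeeze("columns")

# Quick look at the values, then at the index type.
print(hourly_sentiment_series)
print(hourly_sentiment_series.index)
```

If the index prints as a `DatetimeIndex`, the rows are indexed by date/time as intended. With a real file, the `io.StringIO(...)` argument would simply be replaced by the CSV path.
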
So let’s go ahead and do this. So basically, print the index. Okay, cool. So all my rows have a date/time index, which is what we need.

The next step is to quickly preview the data to get an idea of the values and sample size. This is a common practice not only in time series but whenever you want to model some data: you want to get an idea of the kinds of values that you’re working with, the kinds of headers or features that you have, and how many rows you have to work with. So let’s go ahead and do this. I’ve commented these out as we no longer need them, and we’re just going to print the first few rows of data to get a general idea. Since we’re printing the first few rows, we might as well print the tail end as well, and also have a look at the dimensions of the data.

Okay, let’s have a look. So we have negative and positive values. As I said, this user sentiment score can be any number between negative one and positive one. We also have about nineteen rows of data, so I wouldn’t say this is much data to work with. It’s barely anything at all, really, but we’re going to do our best to model on the data that we have.

So the next step is to plot the data to check if it’s stationary, and by
stationary I mean if it follows a kind of constant mean and variance. The reason we’ve got to do this is that many time series models require the data to be stationary in order to model it. We basically want to check whether our data looks a little bit like this. As you can see, it has that consistent upward and downward movement, and it centers around here. So we’re going to check whether our data follows this kind of pattern.

So we use matplotlib. We’ll simply plot our data. Okay, let’s have a look. As you can see, it doesn’t follow any kind of stationary pattern at all. It’s quite all over the place, so what we need to do now is
difference the data to make it more stationary, and then we’re going to plot it again to check if the data looks more stationary after differencing. Differencing basically subtracts the current value from the next value. We’re probably not going to difference this more than twice, as it’s best not to over-difference the data; this could potentially lead to inaccurate estimates. We’re also going to make sure that we leave no missing values, as this could cause problems, especially when modeling later.

So let’s go ahead and do this. We’ll do our first round of differences and see if this makes a difference or not. We apply a diff onto that, and we’ll also need to fill in those missing values so that they can’t cause trouble for us. And we need to plot this as well, obviously. Okay, let’s have a look. I might just show both plots so you can actually see the difference after differencing the data. So that’s our original data, our original values, and this is after differencing. As you can see, it has a little bit more of a consistent
upward and downward movement centering around here. I wouldn’t say it’s fully there yet. I would say that we could probably take another round of differences just to get it more into shape, so let’s go ahead and do that. Let’s apply a second round of differences. We’re going to take differences from our first differences, and just in case, we’ll fill in any missing values. Okay, let’s plot this. So that’s our original data, our first round of differences, and our second round of differences. Now, it’s not exactly ideal, but I would say that from the second round of differences it’s starting to look a lot more stationary than what we had before.

Later in the video series we’ll further check if our data is stationary or not, but for now let’s just go ahead. Now that we’ve differenced the data to make it more stationary, we are ready to move on to the modeling part in part two of our video series. Thanks for watching. If you found this video tutorial useful, give us a like. Otherwise, you can check out our other videos at Data Science Dojo tutorials.
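
As a recap of the differencing step in this part, here is a minimal sketch on made-up values. The series contents and variable names are hypothetical, and filling the missing first value with 0 is an assumption; the video only says the missing values get filled in, not what with.

```python
import pandas as pd

# Hypothetical hourly sentiment scores between -1 and 1 (made-up values).
index = pd.to_datetime([
    "2019-01-01 00:00",
    "2019-01-01 01:00",
    "2019-01-01 02:00",
    "2019-01-01 03:00",
    "2019-01-01 04:00",
])
hourly_sentiment_series = pd.Series([0.25, 0.5, 0.0, -0.5, -0.25], index=index)

# First round: each value minus the value one hour before it.
# The first row has no predecessor, so diff() leaves NaN there;
# here it is filled with 0 (an assumption made for this sketch).
first_diff = hourly_sentiment_series.diff().fillna(0)

# Second round: difference the first differences, again filling the gap.
second_diff = first_diff.diff().fillna(0)

print(first_diff.tolist())   # [0.0, 0.25, -0.5, -0.5, 0.25]
print(second_diff.tolist())  # [0.0, 0.25, -0.75, 0.0, 0.75]
```

Each of the three series can then be plotted with its `.plot()` method (which uses matplotlib) to compare how stationary it looks.
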
