Creating long daily series

Once a useful combination of keywords has been selected, we might want to use it as an economic indicator in production and display it on trendecon.org. This vignette describes the steps for adding new indicators to the “Trendecon” production process.

Directory structure

The names of production functions within the trendecon package start with proc_. They usually operate on the file system, which means that calling these functions will write files to the local disk. The proc_ functions save files in two folders, data and raw, which are located under the working directory (getwd()). The raw folder holds all the data downloaded from Google, plus some transformations, such as seasonally adjusted or combined series. The data folder, on the other hand, collects the final indicators, i.e. the data displayed on the website. If these folders are not present, they will be created when proc_keyword_init is called for the first time.

Within these folders, each geographic location in use has its own subfolder. For example, if we used Swiss and Austrian indicators, we would end up with the following directory structure:

│
├── data
│   ├── at
│   └── ch
│
└── raw
    ├── at
    └── ch
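The layout above can be reproduced by hand to see what to expect on disk. The following is only a sketch: make_layout is a hypothetical helper, not part of trendecon, and the lower-casing of the geo code is inferred from the folder names shown above. In normal use the folders are created by proc_keyword_init.

```r
# Sketch only: build the data/ and raw/ layout under a temporary root.
# make_layout() is a hypothetical helper, not a trendecon function.
make_layout <- function(root, geos) {
  for (top in c("data", "raw")) {
    for (geo in geos) {
      # Folder names are assumed to be the lower-cased geo codes
      dir.create(file.path(root, top, tolower(geo)),
                 recursive = TRUE, showWarnings = FALSE)
    }
  }
  invisible(list.dirs(root, full.names = FALSE, recursive = TRUE))
}

root <- file.path(tempdir(), "trendecon_demo")
dirs <- make_layout(root, c("CH", "AT"))
print(sort(dirs[dirs != ""]))
```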

Creating a new series

In order to create a new series we need to go through several steps:

Select a set of Google Trends keywords comprising the new series.
Download the data for each keyword and store it in the trendecon/data repository.
Create the series using the downloaded data files.
Include the series and its keywords within the daily updates.

1) Selecting keywords

As a rule, a series is composed of multiple related keywords and a geographical location. As an example, a series “homeoffice” will be created for “ch” (Switzerland) using the keywords “headset”, “monitor”, “maus”, and “hdmi”.

2) Downloading data

To include a new series, each keyword must first be initialized. Our goal is to produce long daily time series, ideally starting in 2006. However, Google does not provide daily or weekly data for such a long time period. We work around this by applying a moving window of daily and weekly queries over the whole time period. Be careful: this sends a large number of queries to Google and might result in a temporary IP address ban.
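To make the moving-window idea concrete, here is a minimal sketch that splits a long period into overlapping date windows. make_windows is a hypothetical helper, and the window width, step, and the use of overlap are illustrative assumptions; the values used inside proc_keyword_init may differ.

```r
# Hypothetical helper: split [from, to] into overlapping date windows.
# width and step (in days) are illustrative, not the package's settings.
make_windows <- function(from, to, width = 90, step = 60) {
  from <- as.Date(from)
  to <- as.Date(to)
  starts <- seq(from, to, by = step)
  ends <- pmin(starts + width - 1, to)
  # Stop at the first window that already reaches the end of the period
  keep <- seq_len(which(ends >= to)[1])
  data.frame(start = starts[keep], end = ends[keep])
}

windows <- make_windows("2006-01-01", "2006-12-31")
print(windows)
```

Each row would correspond to one query, so the number of rows gives a rough idea of how many requests a single keyword generates at one frequency.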

The commands below will download the data for our example series:

proc_keyword_init("headset", "CH")
proc_keyword_init("monitor", "CH")
proc_keyword_init("maus", "CH")
proc_keyword_init("hdmi", "CH")

After running the code above, we will have the aggregated series at daily, weekly, and monthly frequency, stored in the raw folder. For example, the files of the “headset” keyword will be stored under: raw/ch/headset_d.csv, raw/ch/headset_w.csv, and raw/ch/headset_m.csv. Analogous files will be present for the remaining three keywords.
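A quick way to check which keywords have already been initialized is to look for these files. The sketch below assumes the naming scheme described above; raw_files is a hypothetical helper, not a trendecon function, and it does not account for any special handling of keywords containing spaces or other characters.

```r
# Hypothetical helper: expected raw file paths for one keyword,
# following the <keyword>_d/_w/_m.csv naming described above.
raw_files <- function(keyword, geo, root = ".") {
  file.path(root, "raw", tolower(geo),
            paste0(keyword, "_", c("d", "w", "m"), ".csv"))
}

files <- raw_files("headset", "CH")
print(files)

# FALSE entries indicate files that have not been created yet
print(file.exists(files))
```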

3) Creating a series

Four actions need to be performed in order to combine multiple keywords under a single series:

Update each keyword by downloading the latest data.
Combine daily, weekly, and monthly frequencies within each keyword.
Remove seasonal patterns from the keyword data.
Combine all the prepared keywords and create a series.

All the above steps are performed by a single function call:

proc_index(c("headset", "monitor", "maus", "hdmi"), "CH", "homeoffice")

This function will combine the keywords passed as the first argument and store the resulting series under the name specified by the third argument. In this example, a new file holding the data for the “homeoffice” series will be created under data/ch/homeoffice_sa.csv. In addition, two intermediate files will be created for each keyword, e.g. raw/ch/headset_mwd.csv and raw/ch/headset_sa.csv. Here “mwd” stores the combined monthly/weekly/daily data and “sa” contains the values after seasonal adjustment.

Note: the same function, proc_index(), is also used for updating an existing series.

4) Adding daily updates

The last step is to include the newly created series in a daily-update script. First, all the files prepared in steps 1-3 need to be added to the trendecon/data repository. Then, in order to schedule daily updates, the script proc_trendecon_ch.R in the trendecon/trendecon repository has to be modified. Note that the suffix of the file name (“ch”) specifies the geographic location, so other locations (like “de”) have a separate file (e.g. proc_trendecon_de.R).

To add the example series to this script, three changes are needed:

create a variable holding the set of keywords used for the series:

kw_homeoffice <- c(
  "headset",
  "monitor",
  "maus",
  "hdmi"
)

add a function call that creates/updates this series:
```
proc_index(kw_homeoffice, "CH", "homeoffice")
```

add the name of this series to production list:

indices_in_production <- c(
  <...>,
  "homeoffice"
)

After that, simply add the updated version of this script to the trendecon/trendecon repository and all is set.

The proc_trendecon_ functions produce the final indicators that we show on trendecon.org. They are called by an automated process set up on GitHub, so we do not need to call them manually when updating the data. The full list of active Swiss indicators can be found in the code of the proc_trendecon_ch() function.

Series creation steps

To get a better understanding of how multiple keywords are combined into a single series, we need to examine the inner workings of the proc_index() function. The steps of series preparation are outlined below. Note that the functions shown in this section are internal and are typically not called by users.

Step 1: update with latest data

In the first step, the raw data series for each keyword are updated with the latest daily, weekly, and monthly data. For example, to update the raw series for keyword "Rezession" for Switzerland, the script calls the following internal function:

proc_keyword_latest("Rezession", "CH")

This function downloads raw daily, weekly, and monthly data for the specified pair of keyword and geo location. If the data for a particular keyword is not yet available, proc_keyword_init() should be called instead.

Step 2: bending

To combine the three frequencies (monthly, weekly, and daily), we apply the following methodology: in a first step, we “bend” the daily series to the weekly values, by applying a variant of the Chow-Lin (1971) method. This preserves the movement of the daily series and ensures that weekly averages are identical to the original weekly series. We then use the same methodology to bend the series to the monthly values.

To combine the three frequencies for a given keyword, the script calls another internal function:

proc_combine_freq("Rezession", "CH")
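The two properties of bending (daily movement preserved, weekly averages matched) can be illustrated with a much simpler adjustment than the one the package uses. The sketch below is not Chow-Lin: it shifts the daily values of each week by a constant, which matches the weekly averages exactly but leaves jumps at week boundaries that a proper disaggregation method would smooth out. bend_to_weekly and the simulated numbers are illustrative assumptions.

```r
# Simplified stand-in for the bending step (NOT the Chow-Lin variant):
# shift the daily values of each week by a constant so that the
# weekly mean equals the weekly target value.
bend_to_weekly <- function(daily, week_id, weekly_target) {
  # weekly_target: named numeric vector, names are week ids
  week_mean <- vapply(split(daily, week_id), mean, numeric(1))
  shift <- weekly_target[names(week_mean)] - week_mean
  daily + unname(shift[as.character(week_id)])
}

# Simulated daily values for two weeks and made-up weekly targets
set.seed(1)
daily <- round(runif(14, 20, 80))
week_id <- rep(c("w1", "w2"), each = 7)
weekly_target <- c(w1 = 50, w2 = 60)

bent <- bend_to_weekly(daily, week_id, weekly_target)
print(tapply(bent, week_id, mean))
```

The same function could then be applied a second time with month ids and monthly targets, mirroring the two-stage procedure described above.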

Step 3: seasonal adjustment

Some keyword series display seasonal patterns. For example, it is not surprising that searches for gardening are higher in spring. To make meaningful comparisons over time, such seasonal patterns need to be removed from the data. To achieve this, trendecon uses the “Prophet” procedure, which estimates an additive model in which a non-linear trend is fit together with yearly and weekly seasonality and holiday effects.

To seasonally adjust a combined keyword, the script calls the following internal function:

proc_seas_adj("Rezession", "CH")
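The idea behind an additive seasonal adjustment can be shown with base R alone. The following sketch is not Prophet: it assumes a linear trend and a weekday effect only (no yearly seasonality, no holidays), fits both jointly with lm(), and subtracts the estimated weekday component. remove_weekday_effect and the simulated data are illustrative assumptions.

```r
# Simplified stand-in for the seasonal adjustment step (NOT Prophet).
# Fits y = intercept + linear trend + weekday effect, then removes
# the estimated weekday component.
remove_weekday_effect <- function(dates, y) {
  d <- data.frame(
    y = y,
    t = as.numeric(dates - min(dates)),
    wday = factor(weekdays(dates))
  )
  fit <- lm(y ~ t + wday, data = d)
  # type = "terms" returns the centred contribution of each model term
  seasonal <- predict(fit, type = "terms")[, "wday"]
  unname(y - seasonal)
}

# Simulated data: four full weeks with a known weekday pattern
dates <- seq(as.Date("2020-01-06"), by = "day", length.out = 28)
t <- as.numeric(dates - min(dates))
pattern <- rep(c(5, 0, 0, 0, 0, -2, -3), times = 4)
y <- 50 + 0.1 * t + pattern

adjusted <- remove_weekday_effect(dates, y)
print(round(head(adjusted, 7), 2))
```

Prophet follows the same additive logic but with a flexible trend and additional seasonal and holiday components, so its results on real data will differ from this sketch.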

Step 4: combining

Once the raw data for each keyword has been processed, we end up with multiple time series, one for each keyword within the indicator. As an example, the main indicator uses the following keywords: "Wirtschaftskrise", "Kurzarbeit", "arbeitslos", and "Insolvenz". To turn these keyword series into a single time series, the first principal component is used.

To do this, the script calls the ts_prcomp() function from the tsbox package.
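For readers unfamiliar with principal components, the sketch below shows the idea with base R's prcomp() on simulated data rather than with ts_prcomp(). The series are random numbers that share a common movement, not real Google Trends data, and centring and scaling the inputs is an assumption made for this illustration.

```r
# Sketch with base R prcomp(); the package itself calls tsbox::ts_prcomp().
# All data below are simulated for illustration only.
set.seed(42)
n <- 200
common <- cumsum(rnorm(n))  # shared latent movement

keywords <- cbind(
  Wirtschaftskrise = common + rnorm(n, sd = 0.5),
  Kurzarbeit       = 0.8 * common + rnorm(n, sd = 0.5),
  arbeitslos       = 1.2 * common + rnorm(n, sd = 0.5),
  Insolvenz        = common + rnorm(n, sd = 0.5)
)

# Centring and scaling are assumptions of this sketch
pca <- prcomp(keywords, center = TRUE, scale. = TRUE)
index <- pca$x[, 1]  # scores on the first principal component

# The sign of a principal component is arbitrary; orient it so that
# higher values mean more searches
if (cor(index, rowMeans(keywords)) < 0) index <- -index

print(summary(pca)$importance["Proportion of Variance", 1])
```

Because the sign of a principal component is not identified, some orientation step like the one above is needed before the result can be read as "higher means more search activity".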

Step 5: writing the data

Finally, the prepared index is saved to a file in the data folder:

write_keyword(prepared_data, "indicator", "CH")