6th anniversary

It has been a quiet year of blogging since my 5th anniversary; there have only been 13 posts since then. As I have mentioned before, though, I have been using GitHub to share tutorials and some of my work. Still, I will try to write at least twice a month, especially now that I have decided to learn more about tidyr, dplyr, and ggplot2.

For this year's anniversary post, I'll share some of my WordPress statistics, as I recently found out that there's an API to retrieve my site's stats. They are consistent with my Google Analytics stats, which don't take into account traffic generated by robots. This post is one day earlier than my actual anniversary date (October 1st) because I posted one day late last year. I've made the web traffic data available here if you want to recreate the plots; although I started this blog six years ago, I only started keeping track of the stats (using Jetpack) on 2013-01-22. Let's get started!

The WordPress stats only provide the date and the total views on that day; let's add some more information based on the dates.

library(ggplot2)
library(dplyr)

d <- read.csv('stats.csv.gz')
d$date    <- as.Date(d$date)
d$day     <- factor(weekdays(d$date), levels = c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'))
d$weekend <- grepl(pattern = "^S", x = d$day)
d$month   <- factor(months(d$date), levels = month.name)
d$quarter <- factor(quarters(d$date))
d$year    <- factor(format(d$date, "%Y"))

d <- as_tibble(d)  # tbl_df() is deprecated in current dplyr
d
# A tibble: 1,341 x 7
         date views       day weekend   month quarter   year
       <date> <int>    <fctr>   <lgl>  <fctr>  <fctr> <fctr>
1  2013-01-22   130   Tuesday   FALSE January      Q1   2013
2  2013-01-23   269 Wednesday   FALSE January      Q1   2013
3  2013-01-24   258  Thursday   FALSE January      Q1   2013
4  2013-01-25   146    Friday   FALSE January      Q1   2013
5  2013-01-26    52  Saturday    TRUE January      Q1   2013
6  2013-01-27    53    Sunday    TRUE January      Q1   2013
7  2013-01-28   170    Monday   FALSE January      Q1   2013
8  2013-01-29   179   Tuesday   FALSE January      Q1   2013
9  2013-01-30   223 Wednesday   FALSE January      Q1   2013
10 2013-01-31   205  Thursday   FALSE January      Q1   2013
# ... with 1,331 more rows

# or if you prefer viewing the data sideways
glimpse(d)
Observations: 1,341
Variables: 7
$ date    <date> 2013-01-22, 2013-01-23, 2013-01-24, 2013-01-25, 2013-01-26, 2013-01-27, 2013-01-28, 2013-01-29, 2013-0...
$ views   <int> 130, 269, 258, 146, 52, 53, 170, 179, 223, 205, 149, 44, 48, 162, 156, 195, 168, 90, 43, 49, 156, 185, ...
$ day     <fctr> Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, ...
$ weekend <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FA...
$ month   <fctr> January, January, January, January, January, January, January, January, January, January, February, Fe...
$ quarter <fctr> Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1...
$ year    <fctr> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, ...
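One caveat about the derived columns: weekdays() and months() return names in the current locale, so on a non-English system the English factor levels used above would not match and the columns would fill with NA. Here is a minimal sketch of a locale-independent variant, assuming a toy data frame (`toy`, my own name) built from the first seven rows shown above rather than the real stats file.

```r
library(dplyr)

# Toy data: the first seven rows of the stats shown above
toy <- data.frame(
  date  = as.Date("2013-01-22") + 0:6,
  views = c(130L, 269L, 258L, 146L, 52L, 53L, 170L)
)

day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

toy <- toy %>%
  mutate(
    # Numeric weekday: 0 = Sunday, 1 = Monday, ..., 6 = Saturday
    wday    = as.POSIXlt(date)$wday,
    # Shift so that Monday maps to index 1 and Sunday to index 7
    day     = factor(day_names[(wday + 6) %% 7 + 1], levels = day_names),
    weekend = wday %in% c(0, 6),
    # $mon is 0-based, month.name is always English
    month   = factor(month.name[as.POSIXlt(date)$mon + 1], levels = month.name),
    quarter = factor(quarters(date)),
    year    = factor(format(date, "%Y"))
  ) %>%
  select(-wday)

toy
```

The numeric weekday and month come from as.POSIXlt(), so the English labels are attached explicitly instead of depending on the system settings.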

Numbers are boring; let's see some plots!

ggplot(d, aes(day, views)) + geom_boxplot()

The traffic pattern of a work-related blog. Notice the decline in traffic as we approach the weekend.

ggplot(d, aes(month, views)) + geom_boxplot()

There seems to be a sinusoidal pattern.

# number of total views from
# 2013-01-22 until 2016-09-23
summarise(d, sum(views))
# A tibble: 1 x 1
  sum(views)
       <int>
1     795388

ggplot(d, aes(year, views)) + geom_boxplot()

# views broken down in years
group_by(d, year) %>% summarise(views = sum(views))
# A tibble: 4 x 2
    year  views
  <fctr>  <int>
1   2013  87415
2   2014 253844
3   2015 283973
4   2016 170156

This blog started to get more popular in 2014 and was on the rise. The lack of posts this year may be the reason for the decline in 2016.
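Note that the yearly totals are not directly comparable, because 2013 and 2016 are only partially covered by the stats. A rough sketch of a fairer comparison is to divide each total by the number of tracked days in that year; the variable names below are my own, and the totals are simply copied from the output above.

```r
# Daily dates covered by the stats, as stated in the post
dates <- seq(as.Date("2013-01-22"), as.Date("2016-09-23"), by = "day")

# Number of tracked days in each calendar year
days_per_year <- table(format(dates, "%Y"))

# Yearly totals copied from the summarise() output above
total_views <- c("2013" = 87415, "2014" = 253844, "2015" = 283973, "2016" = 170156)

# Mean daily views, which is comparable across partial and full years
mean_daily <- total_views / as.vector(days_per_year[names(total_views)])
round(mean_daily)
```

Even after this adjustment, the 2016 daily mean stays below the 2015 one, so the drop is not just an artefact of the year being incomplete.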

I'll focus specifically on the year 2015 from now on.

ggplot(filter(d, year == 2015), aes(day, views)) + geom_boxplot()

# the actual numbers
filter(d, year == 2015) %>% group_by(day) %>% summarise(views=sum(views))
# A tibble: 7 x 2
        day views
     <fctr> <int>
1    Monday 47703
2   Tuesday 50792
3 Wednesday 49249
4  Thursday 48841
5    Friday 43707
6  Saturday 21055
7    Sunday 22626

There's the same pattern of decline as we approach the weekend, which was seen above when we plotted the days without restricting to a specific year. However, now that we have focused on a specific year, we can see some clear outliers.

Why was there less traffic on certain weekdays? Let's check out the dates for the outliers.

filter(d, year == 2015, day == 'Monday', views < 750)
# A tibble: 5 x 7
        date views    day weekend     month quarter   year
      <date> <int> <fctr>   <lgl>    <fctr>  <fctr> <fctr>
1 2015-01-05   663 Monday   FALSE   January      Q1   2015
2 2015-05-25   656 Monday   FALSE       May      Q2   2015
3 2015-09-07   676 Monday   FALSE September      Q3   2015
4 2015-12-21   516 Monday   FALSE  December      Q4   2015
5 2015-12-28   449 Monday   FALSE  December      Q4   2015

Three out of five Mondays with unusually low traffic are near Christmas and New Year's; this makes sense since many people are on holiday during that time (unless you work in Japan, where Christmas is not a holiday). What happened on 2015-05-25 and 2015-09-07? I looked those dates up and they were both federal holidays in the US, namely Memorial Day and Labor Day, respectively. Federal holidays in the US will impact my web traffic since ~37.7% (107,103/283,973) of my 2015 blog traffic came from visitors in the US. (Second place is the UK at ~7.31%, followed by Germany at ~5.76%. Australia is ranked 7th at ~2.98%.) How about the outliers on the other days?

filter(d, year == 2015, day == 'Tuesday', views < 750)
# A tibble: 1 x 7
        date views     day weekend    month quarter   year
      <date> <int>  <fctr>   <lgl>   <fctr>  <fctr> <fctr>
1 2015-12-29   500 Tuesday   FALSE December      Q4   2015

filter(d, year == 2015, day == 'Wednesday', views < 750)
# A tibble: 2 x 7
        date views       day weekend    month quarter   year
      <date> <int>    <fctr>   <lgl>   <fctr>  <fctr> <fctr>
1 2015-12-23   566 Wednesday   FALSE December      Q4   2015
2 2015-12-30   443 Wednesday   FALSE December      Q4   2015

filter(d, year == 2015, day == 'Thursday', views < 750)
# A tibble: 3 x 7
        date views      day weekend    month quarter   year
      <date> <int>   <fctr>   <lgl>   <fctr>  <fctr> <fctr>
1 2015-01-01   230 Thursday   FALSE  January      Q1   2015
2 2015-12-24   395 Thursday   FALSE December      Q4   2015
3 2015-12-31   302 Thursday   FALSE December      Q4   2015

filter(d, year == 2015, day == 'Friday', views < 500)
# A tibble: 2 x 7
        date views    day weekend    month quarter   year
      <date> <int> <fctr>   <lgl>   <fctr>  <fctr> <fctr>
1 2015-01-02   478 Friday   FALSE  January      Q1   2015
2 2015-12-25   238 Friday   FALSE December      Q4   2015
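The cut-offs used in the filters above (750 and 500) were read off the boxplots by eye. As a minimal sketch of doing this programmatically, the helper below flags days that fall under the lower boxplot fence (first quartile minus 1.5 times the interquartile range) for their weekday, which to my knowledge is the default rule geom_boxplot() uses to draw outlier points. `flag_low_outliers()` and the toy data are made up for illustration; they are not part of dplyr or of the real stats.

```r
library(dplyr)

# Hypothetical helper: keep rows whose views fall below the lower
# boxplot fence (Q1 - 1.5 * IQR) computed within each weekday.
flag_low_outliers <- function(df) {
  df %>%
    group_by(day) %>%
    mutate(lower_fence = quantile(views, 0.25) - 1.5 * IQR(views)) %>%
    ungroup() %>%
    filter(views < lower_fence)
}

# Made-up data: one unusually quiet Monday, no unusual Tuesdays
toy <- data.frame(
  day   = rep(c("Monday", "Tuesday"), each = 6),
  views = c(900, 950, 1000, 1050, 1100, 450,
            980, 1000, 1020, 1040, 1060, 1010)
)

low <- flag_low_outliers(toy)
low
```

With the real data, something like flag_low_outliers(filter(d, year == 2015)) should return the low-traffic weekdays without having to pick a threshold per day.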

The other weekday outliers were dates near Christmas and New Year's. There are other federal holidays in the US, but I guess they didn't meet the criterion for being an outlier. (I did look up Martin Luther King Jr. Day in 2015, which was on 2015-01-19 [a Monday], and while the traffic [850] was less than the median, it wasn't an outlier.) Now let's take a look at the traffic broken down by month for 2015.

ggplot(filter(d, year == 2015), aes(month, views)) + geom_boxplot()

# the numbers
filter(d, year == 2015) %>% group_by(month) %>% summarise(views=sum(views))
# A tibble: 12 x 2
       month views
      <fctr> <int>
1    January 22110
2   February 24765
3      March 27268
4      April 26449
5        May 23895
6       June 23677
7       July 23958
8     August 21271
9  September 22924
10   October 24604
11  November 23678
12  December 19374

December (19374) has the least traffic as expected, followed by August (21271). I'm not sure why August is so low; perhaps it's because of the summer holiday in the US? August also had lower traffic in 2014, but it was only the fourth lowest month that year.
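Monthly totals slightly favour the longer months, so here is a quick sketch that divides each 2015 total by the number of days in the month. The totals are copied from the table above; `views_2015`, `days_2015`, and `daily_2015` are just names I picked.

```r
# Monthly totals for 2015, copied from the table above
views_2015 <- c(22110, 24765, 27268, 26449, 23895, 23677,
                23958, 21271, 22924, 24604, 23678, 19374)
names(views_2015) <- month.name

# Days per month in 2015 (not a leap year)
days_2015 <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)

# Mean views per day, so that short months are not penalised
daily_2015 <- views_2015 / days_2015
sort(round(daily_2015, 1))
```

On these numbers December and August remain the two quietest months, while February rather than March comes out on top once the month length is taken into account.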

Let's plot the traffic for all days in 2015, coloured by weekend status.

ggplot(filter(d, year == 2015), aes(date, views, colour=weekend)) + geom_point() +
  xlab("") + ylab("Daily Views") + geom_smooth()

The sine wave pattern we saw earlier is present, and it shows up much more cleanly on the weekends.

Let's use plotly with ggplot2 so that we can create an interactive version of the plot and hover over the points to find out the exact dates. I've made the interactive plot available on RPubs.

library(plotly)
p <- ggplot(filter(d, year == 2015), aes(date, views, colour=weekend)) + geom_point() +
  xlab("") + ylab("Daily Views") + geom_smooth()

ggplotly(p)

Check out http://rpubs.com/davetang/web_traffic_2015 for the interactive version of this plot. I found other holidays including 2015-04-03 (Good Friday), 2015-07-03 (the US Independence Day holiday, observed a day early because 2015-07-04 was a Saturday), and 2015-11-27 (the day after US Thanksgiving, which fell on 2015-11-26).
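Instead of looking dates up one by one, the dates singled out in this post can be joined against a list of US federal holidays. The sketch below assumes a hand-typed table of the 2015 federal holidays as observed; `us_holidays_2015` and `candidates` are my own names and do not come from any package.

```r
library(dplyr)

# Hand-typed list of the ten US federal holidays as observed in 2015.
# `us_holidays_2015` is a hypothetical helper table, not from any package.
us_holidays_2015 <- data.frame(
  date = as.Date(c("2015-01-01", "2015-01-19", "2015-02-16", "2015-05-25",
                   "2015-07-03", "2015-09-07", "2015-10-12", "2015-11-11",
                   "2015-11-26", "2015-12-25")),
  holiday = c("New Year's Day", "Martin Luther King Jr. Day",
              "Washington's Birthday", "Memorial Day",
              "Independence Day (observed)", "Labor Day", "Columbus Day",
              "Veterans Day", "Thanksgiving Day", "Christmas Day"),
  stringsAsFactors = FALSE
)

# Dates singled out in this post
candidates <- data.frame(
  date = as.Date(c("2015-01-19", "2015-04-03", "2015-05-25",
                   "2015-07-03", "2015-09-07", "2015-11-27"))
)

# Dates that are not federal holidays come back with holiday = NA
checked <- left_join(candidates, us_holidays_2015, by = "date")
checked
```

Two of the dates come back unmatched: Good Friday (2015-04-03) is not a federal holiday, and 2015-11-27 is the Friday after Thanksgiving rather than Thanksgiving itself.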

Finally, let's plot all the days from 2013-01-22 until 2016-09-23.

ggplot(d, aes(date, views, colour=weekend)) + geom_point() +
  xlab("") + ylab("Daily Views") + geom_smooth()

I hope that's not a sign of things to come.

Summary

Well, I hope you enjoyed the post; hopefully, I'll write more posts...

P.S. I recently visited The Pinnacles and used my anniversary as an excuse to change the header image of the site.

This work is licensed under a Creative Commons Attribution 4.0 International License.
