It has been a quiet year of blogging since my 5th anniversary; there have been only 13 posts since. As I have mentioned before, though, I am using GitHub to share tutorials and some of my work. Still, I will try to write at least twice a month, especially now that I have decided to learn more about tidyr, dplyr, and ggplot2.
For this year's anniversary post, I'll share some of my WordPress statistics, as I recently found out that there's an API to retrieve my site's stats. They are consistent with my Google Analytics stats, which don't take into account traffic generated by robots. This post is one day ahead of my actual anniversary date (October 1st) because I posted one day late last year. I've made the web traffic data available here if you want to recreate the plots. Although I started this blog six years ago, I only started keeping track of the stats (using Jetpack) on 2013-01-22. Let's get started!
The WordPress stats only provide the date and the total views on that day; let's add some more information based on the dates.
library(ggplot2)
library(dplyr)

d <- read.csv('stats.csv.gz')
d$date <- as.Date(d$date)
d$day <- factor(weekdays(d$date), levels = c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'))
d$weekend <- grepl(pattern = "^S", x = d$day)
d$month <- factor(months(d$date), levels = month.name)
d$quarter <- factor(quarters(d$date))
d$year <- format(d$date, "%Y")
d <- tbl_df(d)

d
# A tibble: 1,341 x 7
         date views       day weekend   month quarter   year
       <date> <int>    <fctr>   <lgl>  <fctr>  <fctr> <fctr>
1  2013-01-22   130   Tuesday   FALSE January      Q1   2013
2  2013-01-23   269 Wednesday   FALSE January      Q1   2013
3  2013-01-24   258  Thursday   FALSE January      Q1   2013
4  2013-01-25   146    Friday   FALSE January      Q1   2013
5  2013-01-26    52  Saturday    TRUE January      Q1   2013
6  2013-01-27    53    Sunday    TRUE January      Q1   2013
7  2013-01-28   170    Monday   FALSE January      Q1   2013
8  2013-01-29   179   Tuesday   FALSE January      Q1   2013
9  2013-01-30   223 Wednesday   FALSE January      Q1   2013
10 2013-01-31   205  Thursday   FALSE January      Q1   2013
# ... with 1,331 more rows

# or if you prefer viewing the data sideways
glimpse(d)
Observations: 1,341
Variables: 7
$ date    <date> 2013-01-22, 2013-01-23, 2013-01-24, 2013-01-25, 2013-01-26, 2013-01-27, 2013-01-28, 2013-01-29, 2013-0...
$ views   <int> 130, 269, 258, 146, 52, 53, 170, 179, 223, 205, 149, 44, 48, 162, 156, 195, 168, 90, 43, 49, 156, 185, ...
$ day     <fctr> Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, ...
$ weekend <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FA...
$ month   <fctr> January, January, January, January, January, January, January, January, January, January, February, Fe...
$ quarter <fctr> Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1, Q1...
$ year    <fctr> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, ...
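One caveat with the code above: weekdays() returns day names in the current locale, so both the factor levels and the "^S" pattern assume an English locale. Here is a small sketch of a locale-independent alternative, assuming the ISO weekday number from format(x, "%u") is available on your platform (1 = Monday, 7 = Sunday). The variable names are mine, and the four dates are taken from the table above.

```r
# A few dates from the stats table (Tuesday, Saturday, Sunday, Monday)
dates <- as.Date(c("2013-01-22", "2013-01-26", "2013-01-27", "2013-01-28"))

# ISO weekday number: 1 = Monday, ..., 7 = Sunday (does not depend on locale)
iso_day <- as.integer(format(dates, "%u"))

# Saturday (6) and Sunday (7) are the weekend
weekend <- iso_day >= 6

# English labels attached explicitly, in Monday-to-Sunday order
day <- factor(iso_day, levels = 1:7,
              labels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                         "Friday", "Saturday", "Sunday"))

data.frame(date = dates, day = day, weekend = weekend)
```

The plots would look the same either way on an English system; this only matters if the script is run elsewhere.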
Numbers are boring; let's see some plots!
ggplot(d, aes(day, views)) + geom_boxplot()
The traffic pattern of a work-related blog. Notice the decline in traffic as we approach the weekend.
ggplot(d, aes(month, views)) + geom_boxplot()
There seems to be a sinusoidal pattern.
# number of total views from
# 2013-01-22 until 2016-09-23
summarise(d, sum(views))
# A tibble: 1 x 1
  sum(views)
       <int>
1     795388

ggplot(d, aes(year, views)) + geom_boxplot()

# views broken down in years
group_by(d, year) %>% summarise(views = sum(views))
# A tibble: 4 x 2
    year  views
  <fctr>  <int>
1   2013  87415
2   2014 253844
3   2015 283973
4   2016 170156
This blog started to get more popular in 2014 and was on the rise. The lack of posts this year may be the reason for the decline.
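The yearly totals are not directly comparable, because 2013 starts on 2013-01-22 and 2016 stops at 2016-09-23. As a rough sketch, here are the same totals divided by the number of days covered in each year. The day counts are my own calculation and assume one row per calendar day, which agrees with the 1,341 rows in the data.

```r
# Yearly totals copied from the output above; days = calendar days covered
totals <- data.frame(
  year  = c(2013, 2014, 2015, 2016),
  views = c(87415, 253844, 283973, 170156),
  days  = c(344, 365, 365, 267)  # 2013 from Jan 22; 2016 (leap year) to Sep 23
)

# Sanity checks against the full data set
sum(totals$days)   # should match the 1,341 rows
sum(totals$views)  # should match the 795,388 total views

# Average daily views per year
totals$views_per_day <- round(totals$views / totals$days, 1)
totals
```

By this measure 2016 comes out at about 637 views per day versus about 778 in 2015, so the drop is still there after accounting for the partial year.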
I'll focus specifically on the year 2015 from now on.
ggplot(filter(d, year == 2015), aes(day, views)) + geom_boxplot()

# the actual numbers
filter(d, year == 2015) %>% group_by(day) %>% summarise(views=sum(views))
# A tibble: 7 x 2
        day views
     <fctr> <int>
1    Monday 47703
2   Tuesday 50792
3 Wednesday 49249
4  Thursday 48841
5    Friday 43707
6  Saturday 21055
7    Sunday 22626
There's the same pattern of decline as we approach the weekend that we saw above when we plotted the days for all years together. However, now that we have focused on a specific year, we can see some clear outliers.
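For reference, the points that geom_boxplot() draws individually are, by default, the values more than 1.5 times the interquartile range beyond the first or third quartile. Below is a minimal sketch of that rule in base R; the view counts are made up for illustration and are not my actual stats.

```r
# Made-up daily views for a single weekday (illustration only)
views <- c(940, 663, 950, 955, 960, 516, 965, 970, 975, 980,
           985, 449, 990, 995, 1000, 1005, 1010, 1015, 1020, 1030)

# First and third quartiles and the interquartile range
q <- quantile(views, c(0.25, 0.75))
iqr <- q[2] - q[1]

# Fences at 1.5 * IQR beyond the quartiles (the default boxplot coefficient)
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Values outside the fences would be drawn as individual points
outliers <- views[views < lower | views > upper]
outliers
```

In the sections below I simply read a cutoff off the plot (for example views < 750), which is quicker but less precise than computing the fences.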
Why was there less traffic on certain weekdays? Let's check out the dates for the outliers.
filter(d, year == 2015, day == 'Monday', views < 750)
# A tibble: 5 x 7
        date views    day weekend     month quarter   year
      <date> <int> <fctr>   <lgl>    <fctr>  <fctr> <fctr>
1 2015-01-05   663 Monday   FALSE   January      Q1   2015
2 2015-05-25   656 Monday   FALSE       May      Q2   2015
3 2015-09-07   676 Monday   FALSE September      Q3   2015
4 2015-12-21   516 Monday   FALSE  December      Q4   2015
5 2015-12-28   449 Monday   FALSE  December      Q4   2015
Three out of five Mondays with unusually low traffic are near Christmas and New Year's; this makes sense since most people are on holidays during that time (unless you work in Japan, where Christmas is not a holiday). What happened on 2015-05-25 and 2015-09-07? I looked those dates up and they were both federal holidays in the US, namely Memorial Day and Labor Day, respectively. Federal holidays in the US will impact my web traffic since ~37.7% (107,103/283,973) of my 2015 blog traffic came from visitors in the US. (Second place is the UK at ~7.31%, followed by Germany at ~5.76%. Australia is ranked 7th at ~2.98%.) How about the outliers on the other days?
filter(d, year == 2015, day == 'Tuesday', views < 750)
# A tibble: 1 x 7
        date views     day weekend    month quarter   year
      <date> <int>  <fctr>   <lgl>   <fctr>  <fctr> <fctr>
1 2015-12-29   500 Tuesday   FALSE December      Q4   2015

filter(d, year == 2015, day == 'Wednesday', views < 750)
# A tibble: 2 x 7
        date views       day weekend    month quarter   year
      <date> <int>    <fctr>   <lgl>   <fctr>  <fctr> <fctr>
1 2015-12-23   566 Wednesday   FALSE December      Q4   2015
2 2015-12-30   443 Wednesday   FALSE December      Q4   2015

filter(d, year == 2015, day == 'Thursday', views < 750)
# A tibble: 3 x 7
        date views      day weekend    month quarter   year
      <date> <int>   <fctr>   <lgl>   <fctr>  <fctr> <fctr>
1 2015-01-01   230 Thursday   FALSE  January      Q1   2015
2 2015-12-24   395 Thursday   FALSE December      Q4   2015
3 2015-12-31   302 Thursday   FALSE December      Q4   2015

filter(d, year == 2015, day == 'Friday', views < 500)
# A tibble: 2 x 7
        date views    day weekend    month quarter   year
      <date> <int> <fctr>   <lgl>   <fctr>  <fctr> <fctr>
1 2015-01-02   478 Friday   FALSE  January      Q1   2015
2 2015-12-25   238 Friday   FALSE December      Q4   2015
The other weekday outliers were dates near Christmas and New Year's. There are other federal holidays in the US but I guess they didn't satisfy the criteria of being an outlier. (I did look up Martin Luther King Jr. Day in 2015, which was on 2015-01-19 [a Monday] and while the traffic [850] was less than the median, it wasn't an outlier.) Now let's take a look at the traffic broken down by month for 2015.
ggplot(filter(d, year == 2015), aes(month, views)) + geom_boxplot()

# the numbers
filter(d, year == 2015) %>% group_by(month) %>% summarise(views=sum(views))
# A tibble: 12 x 2
       month views
      <fctr> <int>
1    January 22110
2   February 24765
3      March 27268
4      April 26449
5        May 23895
6       June 23677
7       July 23958
8     August 21271
9  September 22924
10   October 24604
11  November 23678
12  December 19374
December (19374) has the least traffic, as expected, followed by August (21271). I'm not sure why August is so low; perhaps it's because of the summer holiday in the US? August also had lower traffic in 2014, though it was only the fourth lowest month that year.
Let's plot the traffic for all days in 2015, factored by weekend status.
ggplot(filter(d, year == 2015), aes(date, views, colour=weekend)) + geom_point() + xlab("") + ylab("Daily Views") + geom_smooth()
The sine wave pattern we saw earlier is present, and it shows up much more cleanly on the weekends.
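If you wanted to put a number on a pattern like this, one option is to regress the daily views on a sine and cosine term. The sketch below uses simulated data, not my stats, and the 365-day period, baseline, and noise level are all assumptions picked for illustration; the real pattern may well have a different period.

```r
set.seed(1)

# Simulated year of daily views: baseline 800, seasonal swing of 100, plus noise
day_of_year <- 1:365
views <- 800 + 100 * sin(2 * pi * day_of_year / 365) + rnorm(365, sd = 20)

# Sine and cosine terms for an assumed period of 365 days
s <- sin(2 * pi * day_of_year / 365)
k <- cos(2 * pi * day_of_year / 365)

# Linear model with one harmonic; intercept estimates the baseline
fit <- lm(views ~ s + k)

# Amplitude of the fitted wave from the two harmonic coefficients
amplitude <- sqrt(sum(coef(fit)[c("s", "k")]^2))
amplitude
```

On the simulated data the fitted amplitude should land close to the 100 that was put in; on real traffic it would give a rough size for the seasonal swing.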
Let's use plotly with ggplot2 so that we can create an interactive version of the plot and hover over the points to find out the exact dates. I've made the interactive plot available on RPubs.
library(plotly)
p <- ggplot(filter(d, year == 2015), aes(date, views, colour=weekend)) + geom_point() + xlab("") + ylab("Daily Views") + geom_smooth()
ggplotly(p)
Check out http://rpubs.com/davetang/web_traffic_2015 for the interactive version of this plot. I found other holidays including 2015-04-03 (Good Friday), 2015-07-03 (the observed US Independence Day holiday, because 2015-07-04 was a Saturday), and 2015-11-27 (the day after US Thanksgiving, which fell on 2015-11-26).
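Instead of looking up each date by hand, the "n-th weekday of the month" holidays can be worked out with a bit of date arithmetic. This is only a sketch: nth_weekday() is a hypothetical helper I wrote for this post, not a function from any package, and it covers only the holidays defined this way (not fixed-date ones like Christmas, or Easter-based ones like Good Friday).

```r
# Hypothetical helper: date of the n-th given weekday in a month.
# wday uses ISO numbering from format(x, "%u"): 1 = Monday, ..., 7 = Sunday.
# A negative n counts from the end of the month (-1 = last).
nth_weekday <- function(year, month, wday, n) {
  first <- as.Date(sprintf("%d-%02d-01", year, month))
  days <- seq(first, by = "day", length.out = 31)
  # drop any days that spilled over into the next month
  days <- days[format(days, "%m") == sprintf("%02d", month)]
  hits <- days[as.integer(format(days, "%u")) == wday]
  if (n > 0) hits[n] else hits[length(hits) + n + 1]
}

mlk_day      <- nth_weekday(2015, 1, 1, 3)    # third Monday of January
memorial_day <- nth_weekday(2015, 5, 1, -1)   # last Monday of May
labor_day    <- nth_weekday(2015, 9, 1, 1)    # first Monday of September
thanksgiving <- nth_weekday(2015, 11, 4, 4)   # fourth Thursday of November

c(mlk_day, memorial_day, labor_day, thanksgiving)
```

The resulting dates could then be joined against the stats to flag holidays before looking for outliers.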
Finally, let's plot all the days from 2013-01-22 to 2016-09-23.
ggplot(d, aes(date, views, colour=weekend)) + geom_point() + xlab("") + ylab("Daily Views") + geom_smooth()
I hope that's not a sign of things to come.
Summary
Well I hope you enjoyed the post; hopefully, I'll write more posts...
P.S. I recently visited The Pinnacles and used my anniversary as an excuse to change the header image of the site.
This work is licensed under a Creative Commons
Attribution 4.0 International License.