Ruby on Rails Development Using Mongoid 5.0.0 - 2. How to Use Aggregation to Get Pageview Data

What is Aggregation?

Aggregations are operations that process data records and return computed results.

In my opinion, an aggregation is like a query, except that it can process the data step by step, with each stage feeding its output into the next, just like a pipeline.
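To build intuition before touching MongoDB, here is a plain-Ruby sketch of the pipeline idea (illustrative only, not MongoDB or driver code): each stage is a callable that transforms the documents and hands the result to the next stage.

```ruby
# Plain-Ruby analogy of an aggregation pipeline (illustrative only):
# each "stage" transforms the documents and passes them on.
docs = [
  { 'pathname' => '/',     'count' => 2 },
  { 'pathname' => '/blog', 'count' => 1 },
  { 'pathname' => '/',     'count' => 3 }
]

# Stage 1: keep only documents for the root path.
match_stage = ->(ds) { ds.select { |d| d['pathname'] == '/' } }
# Stage 2: sum their counts into a single number.
sum_stage = ->(ds) { ds.sum { |d| d['count'] } }

# Run the stages in order, feeding each output into the next stage.
result = [match_stage, sum_stage].reduce(docs) { |data, stage| stage.call(data) }
puts result # total count for pathname '/'
```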

MongoDB Aggregation

If you want more details, please check the official documentation.

Here are my models:

(image: aggregation model example)

Get Data using “match”

Suppose I want to find pageviews from 2016/1/1 to 2016/1/5:

sort_stage = {
  '$sort' => { 'created_at' => 1 }
}

match_stage = {
  '$match' => {
    'created_at' => {
      '$gte' => Time.parse('2016-01-01').beginning_of_day,
      '$lte' => Time.parse('2016-01-05').end_of_day
    }
  }
}

result = PageTracking.collection.aggregate([sort_stage, match_stage])
result.each do |page|
  puts page
end

Here is my result:

{"_id"=>BSON::ObjectId('568a1ef77db99bc968000002'), "pathname"=>"/", "daily_total_count"=>2, "created_at"=>2016-01-03 16:00:00 UTC, "user_id"=>BSON::ObjectId('568a1ef77db99bc968000000'), "updated_at"=>2016-01-04 07:27:51 UTC }
{"_id"=>BSON::ObjectId('568b28ad7db99b1ae8000002'), "pathname"=>"/", "daily_total_count"=>1, "created_at"=>2016-01-04 16:00:00 UTC, "user_id"=>BSON::ObjectId('568b28ad7db99b1ae8000000'), "updated_at"=>2016-01-05 02:21:33 UTC}
{"_id"=>BSON::ObjectId('568b28e67db99b1ae8000005'), "pathname"=>"/", "daily_total_count"=>3, "created_at"=>2016-01-04 16:00:00 UTC, "user_id"=>BSON::ObjectId('568b28e67db99b1ae8000003'), "updated_at"=>2016-01-05 02:22:30 UTC}
{"_id"=>BSON::ObjectId('568b7d307db99b9a30000011'), "pathname"=>"/activities", "daily_total_count"=>1, "created_at"=>2016-01-04 16:00:00 UTC, "user_id"=>BSON::ObjectId('568b28e67db99b1ae8000003'), "updated_at"=>2016-01-05 08:22:08 UTC}
{"_id"=>BSON::ObjectId('568b7d8e7db99b9a30000013'), "pathname"=>"/activities/hadalabo01", "daily_total_count"=>1, "created_at"=>2016-01-04 16:00:00 UTC, "user_id"=>BSON::ObjectId('568b28e67db99b1ae8000003'), "updated_at"=>2016-01-05 08:23:42 UTC}
{"_id"=>BSON::ObjectId('568b7e5e7db99b9a30000017'), "pathname"=>"/beautybuzz", "daily_total_count"=>1, "created_at"=>2016-01-04 16:00:00 UTC, "user_id"=>BSON::ObjectId('568b28e67db99b1ae8000003'), "updated_at"=>2016-01-05 08:27:10 UTC}
{"_id"=>BSON::ObjectId('568b78c07db99b9a30000001'), "pathname"=>"/beautynews", "daily_total_count"=>1, "created_at"=>2016-01-04 16:00:00 UTC, "user_id"=>BSON::ObjectId('568b28e67db99b1ae8000003'), "updated_at"=>2016-01-05 08:03:12 UTC}

P.S. I put the sort stage in front of the match stage following the official suggestion (see Aggregation Pipeline Optimization: when a $sort is followed by a $match, MongoDB moves the $match before the $sort to reduce the number of documents it has to sort).
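For intuition, the $match stage behaves like filtering an in-memory array by a date range. A plain-Ruby sketch of what it computes (illustrative only, not the driver API):

```ruby
require 'time'

# Plain-Ruby analogy of the $match stage above: keep only documents
# whose created_at falls inside the requested range (illustrative only).
docs = [
  { 'pathname' => '/',     'created_at' => Time.parse('2015-12-30') },
  { 'pathname' => '/',     'created_at' => Time.parse('2016-01-03') },
  { 'pathname' => '/news', 'created_at' => Time.parse('2016-01-07') }
]

from = Time.parse('2016-01-01')
to   = Time.parse('2016-01-05 23:59:59')

matched = docs.select { |d| d['created_at'] >= from && d['created_at'] <= to }
puts matched.size # only the 2016-01-03 document is in range
```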

Group Data using “group”

Oops! In fact, I want all records with the same pathname (e.g. pathname => '/') to be summed up into a single record, just like GROUP BY in MySQL.

So we can add a new stage:

group_stage = {
  '$group' => {
    '_id' => {
      'pathname' => '$pathname'
    },
    'page_count' => { '$sum' => '$daily_total_count' }
  }
}

result = PageTracking.collection.aggregate([sort_stage, match_stage, group_stage])

result.each do |page|
  puts page
end

This means we group by “pathname”, and while grouping we sum each document's daily_total_count into “page_count”.

Here is my result

{"_id"=>{"pathname"=>"/beautybuzz"}, "page_count"=>1}
{"_id"=>{"pathname"=>"/activities/hadalabo01"}, "page_count"=>1}
{"_id"=>{"pathname"=>"/activities"}, "page_count"=>1}
{"_id"=>{"pathname"=>"/beautynews"}, "page_count"=>1}
{"_id"=>{"pathname"=>"/"}, "page_count"=>6}

Add more information while grouping Data using “push”

OK, great! But then I got another requirement: “Hi Aloha, can you tell me who viewed each page?”

Easy! Let's add some code to group_stage:

group_stage = {
  '$group' => {
    '_id' => {
      'pathname' => '$pathname'
    },
    'page_count' => { '$sum' => '$daily_total_count' },
    'users' => {
      '$push' => {
        'user' => '$user_id'
      }
    }
  }
}

result = PageTracking.collection.aggregate([sort_stage, match_stage, group_stage])

result.each do |page|
  puts page
end

Here is my result

{"_id"=>{"pathname"=>"/beautybuzz"}, "page_count"=>1, "users"=>[{"user"=>BSON::ObjectId('568b28e67db99b1ae8000003')}]}
{"_id"=>{"pathname"=>"/activities/hadalabo01"}, "page_count"=>1, "users"=>[{"user"=>BSON::ObjectId('568b28e67db99b1ae8000003')}]}
{"_id"=>{"pathname"=>"/activities"}, "page_count"=>1, "users"=>[{"user"=>BSON::ObjectId('568b28e67db99b1ae8000003')}]}
{"_id"=>{"pathname"=>"/beautynews"}, "page_count"=>1, "users"=>[{"user"=>BSON::ObjectId('568b28e67db99b1ae8000003')}]}
{"_id"=>{"pathname"=>"/"}, "page_count"=>6, "users"=>[{"user"=>BSON::ObjectId('568a1ef77db99bc968000000')}, {"user"=>BSON::ObjectId('568b28ad7db99b1ae8000000')}, {"user"=>BSON::ObjectId('568b28e67db99b1ae8000003')}]}

Cool, our aggregation runs smoothly.
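The $push accumulator simply collects a value from every document in the group into an array. In plain Ruby it is a map inside each group (illustrative only):

```ruby
# Plain-Ruby analogy of $push: while grouping by pathname, also collect
# each document's user_id into a "users" array (illustrative only).
docs = [
  { 'pathname' => '/', 'daily_total_count' => 2, 'user_id' => 'u1' },
  { 'pathname' => '/', 'daily_total_count' => 1, 'user_id' => 'u2' }
]

grouped = docs.group_by { |d| d['pathname'] }.map do |pathname, ds|
  { '_id' => { 'pathname' => pathname },
    'page_count' => ds.sum { |d| d['daily_total_count'] },
    'users' => ds.map { |d| { 'user' => d['user_id'] } } }
end

puts grouped.first
```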

Aggregate Big Data

After three months, we found that some pageview-fetching operations fail.

Oops! These operations use a big time range (e.g. pageview data from 2015/9/1 to 2016/1/1):

match_stage = {
  '$match' => {
    'created_at' => {
      '$gte' => Time.parse('2015-09-01').beginning_of_day,
      '$lte' => Time.parse('2016-01-01').end_of_day
    }
  }
}

result = PageTracking.collection.aggregate([sort_stage, match_stage, group_stage])

result.each do |page|
  puts page
end

I got this error message:

aggregate | FAILED | exception: Exceeded memory limit for $group, but didn't allow external sort. Pass allowDiskUse:true to opt in. (16945) | 9.955263s
Mongo::Error::OperationFailure: exception: Exceeded memory limit for $group, but didn't allow external sort. Pass allowDiskUse:true to opt in. (16945)

How can we solve this problem?

Just as the message says, add allowDiskUse: true (in the Ruby driver, the option is named :allow_disk_use):

result = PageTracking.collection.aggregate(
  [sort_stage, match_stage, group_stage],
  :allow_disk_use => true
)
result.each do |page|
  puts page
end

Yay, bug fixed.

...
you will get a lot of data ...
...

Limit data using “limit”

If you only want to pass 1000 documents to the next stage:

sort_stage = {
  '$sort' => { 'created_at' => -1 }
}

limit_stage = {
  '$limit' => 1000
}

result = UserPageTracking.collection.aggregate(
  [sort_stage, limit_stage, match_stage, group_stage],
  :allow_disk_use => true
)
result.each do |page|
  puts page
end
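$limit simply caps how many documents flow into the next stage; combined with the descending $sort above, it keeps the 1000 newest documents. In plain Ruby (illustrative only):

```ruby
# Plain-Ruby analogy of $sort (descending) + $limit: keep only the
# newest N documents and pass them on (illustrative only).
docs = (1..5).map { |i| { 'id' => i, 'created_at' => i } } # fake timestamps

limited = docs.sort_by { |d| -d['created_at'] }.first(3)
puts limited.map { |d| d['id'] }.inspect # newest three ids: [5, 4, 3]
```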

Only want some specific data using “project”

In fact, we don't need the created_at and updated_at attributes in the group stage, so we can drop them with a project stage:

project_stage = {
  '$project' => { user_id: 1, pathname: 1, daily_total_count: 1 }
}

result = PageTracking.collection.aggregate(
  [sort_stage, match_stage, project_stage, group_stage]
)
result.each do |page|
  puts page
end
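$project keeps only the listed fields (plus _id by default), shrinking each document before the expensive group stage. A plain-Ruby sketch using Hash#slice (Ruby 2.5+, illustrative only):

```ruby
# Plain-Ruby analogy of $project: keep only the fields the later
# stages actually need (illustrative only).
doc = {
  'user_id' => 'u1', 'pathname' => '/', 'daily_total_count' => 2,
  'created_at' => 'unused', 'updated_at' => 'unused'
}

projected = doc.slice('user_id', 'pathname', 'daily_total_count')
puts projected
```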