Wikipedia pageviews analysis

After a few more hours spent on this, I've finally found a solution. I'm posting it in case someone has the same problem in the future. Wikipedia explains what can be found in the database; you can find these explanations here.

Based on that, you can see that rows have the following structure (a parsing sketch follows the column descriptions below):

domain_code page_title count_views total_response_size
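
For illustration, a row might look like this (values invented for the example):

en.m Berlin 431 0

Here 'en.m' is the mobile English Wikipedia, 'Berlin' is the page title, 431 is the number of views in that hour, and 0 is the (discontinued) response size.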

Some explanations for each column:

Column 1:

Domain name of the request, abbreviated. (...) Domain_code now can also be an abbreviation for mobile and zero domain names, in which case .m or .zero is inserted as second part of the domain name (just like with full domain name). E.g. 'en.m.v' stands for "en.m.wikiversity.org".

Column 2:

For page-level files, it holds the title of the unnormalized part after /wiki/ in the request URL (e.g. Main_Page, Berlin). For project-level files, it is '-'.

Column 3:

The number of times this page has been viewed in the respective hour.

Column 4:

The total response size caused by the requests for this page in the respective hour.

If I understand it correctly, the response size field was discontinued due to low accuracy; that's why it contains only 0s.

The pagecounts and projectcounts files also include total response byte sizes at their respective aggregation level, but this was dropped from the pageviews and projectviews files because it wasn't very accurate.
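
If it helps, here is a minimal Python sketch of how one of these files could be parsed, assuming a locally downloaded and decompressed hourly pageviews file; the filename in the usage comment is a placeholder, not a real file:

```python
from collections import Counter

def parse_line(line):
    """Split one row into (domain_code, page_title, count_views, total_response_size)."""
    parts = line.rstrip("\n").split(" ")
    if len(parts) != 4:
        return None  # skip malformed rows
    domain_code, page_title, count_views, total_response_size = parts
    return domain_code, page_title, int(count_views), int(total_response_size)

def views_per_title(path, domain="en.m"):
    """Sum hourly views per title for one domain_code, e.g. mobile English Wikipedia."""
    totals = Counter()
    # errors="replace" guards against the occasional non-UTF-8 byte in titles
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            row = parse_line(line)
            if row and row[0] == domain:
                totals[row[1]] += row[2]
    return totals

# Example usage (path is hypothetical):
# views = views_per_title("pageviews-20240101-000000", domain="en.m")
# print(views.most_common(10))
```

Splitting on single spaces works because page titles in these dumps use underscores instead of spaces.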

All details can be found here, and here is another useful link. Hope someone finds it useful.
