Difference between revisions of "Metrics/License statistics"
Line 1: | Line 1: | ||
+ | ''See [[metrics]] for a broader discussion of CC adoption and impact.'' | ||
+ | |||
=Caveats= | =Caveats= | ||
− | Estimating license adoption is a ''very'' inexact science. There is no authoritative source and we neither control nor have inside knowledge of the construction and volatility of the most comprehensive sources -- web search engines | + | Estimating license adoption is a ''very'' inexact science. There is no authoritative source and we neither control nor have inside knowledge of the construction and volatility of the most comprehensive sources -- web search engines -- primarily via Yahoo! link: queries (Google's link: operator obtains ''very'' incomplete results). |
+ | |||
+ | '''NOTE:''' Currently the best analysis of data similar to that discussed below (based on a snapshot independently gathered in January 2007) may be found in Giorgos Cheliotis' [http://hoikoinoi.wordpress.com/2007/07/02/cc-stats/ presentation on CC statistics] from June 2007. | ||
+ | |||
+ | = Raw data = | ||
+ | |||
+ | Creative Commons has irregularly run programs that collect estimated total results from search engine <code>link:{license_uri}</code> queries and queries filtered by license property (Yahoo! and Google advanced [[CcSearch|search]] support filtering by license). | ||
+ | |||
+ | This data is more fully described at [[Metrics/Data Catalog]]. | ||
+ | |||
+ | == Linkback data == | ||
+ | |||
+ | {{Infobox|This data is in the [http://creativecommons.org/publicdomain/zero/1.0/ public domain]. Read more about [[data|open data]]. | ||
+ | }} | ||
+ | |||
+ | You can download raw MySQL dumps that are generated nightly from http://labs.creativecommons.org/metrics/sql-dumps/ -- this includes all data gathered programmatically by CC to date. | ||
+ | |||
+ | Single day data is available in CSV format from http://labs.creativecommons.org/metrics/csv-dumps/. Here is a guide to the columns in the file: | ||
+ | |||
+ | # internal ID number for this row (e.g., 5041) | ||
+ | # License URI (e.g., http://creativecommons.org/licenses/by-nc/1.0/jp/) | ||
+ | # search engine (e.g., All The Web) | ||
+ | # number of hits (e.g., 4680) | ||
+ | # date and time this linkback query run started (e.g., 2004-Apr-04 0:00:00) | ||
+ | # short form of license jurisdiction (e.g., jp) | ||
+ | # short form of license name (e.g., by-nc) | ||
+ | # license version (e.g., 1) | ||
+ | # long form of license jurisdiction (e.g., Japan) | ||
+ | |||
+ | '''WARNING:''' There are gaps in the data and results from any given method may be volatile to extremely volatile. Take the raw numbers with a huge grain of salt. | ||
+ | |||
+ | == Flickr data == | ||
+ | |||
+ | Information generated from Flickr is also available in the database dump above or [http://labs.creativecommons.org/metrics/stats/flickr/ here] as one-day CSVs, like this one for [http://labs.creativecommons.org/metrics/stats/flickr/2008-06-23.csv.imported June 23, 2008]. | ||
+ | |||
+ | Also see [http://creativecommons.org/weblog/entry/13588 Analysis of Flickr data] as of reaching 100m CC licensed images and [http://creativecommons.org/weblog/entry/20870 around 135m CC licensed images], including a [http://wiki.creativecommons.org/images/1/19/Cc-flickr-20100225.ods spreadsheet snapshot] for the latter. | ||
+ | |||
+ | == Software == | ||
+ | |||
+ | === Data gathering === | ||
+ | |||
+ | The code used to gather the above data is available from the <code>stats</code> module from our [[Source Repository Information|subversion repository]]. | ||
+ | |||
+ | If you want to run it yourself, here's what to do: | ||
+ | |||
+ | Check out the software: | ||
+ | |||
+ | * svn co http://code.creativecommons.org/svnroot/stats | ||
+ | * cd stats | ||
+ | |||
+ | Configure database access: | ||
+ | |||
+ | * rename dbconfig_EXAMPLE.py to dbconfig.py with a MySQL database and password you want to use (note that if you want to use a non-MySQL database, most of the tools we use are actually database-agnostic and require only tiny changes) | ||
+ | * mysql -h '''dbserver''' -u'''username''' -p '''databasename''' < create_tables.sql | ||
+ | |||
+ | Check for dependencies. Note that the script expects Tor to be running on localhost! | ||
+ | |||
+ | * python sanity_check.py | ||
+ | |||
+ | Do a stats crawl! | ||
+ | |||
+ | * python link_counts.py log | ||
+ | |||
+ | === Charting === | ||
+ | The <code>stats</code> module also contains some chart generating code. To run this code: | ||
+ | * Import all.sql.gz into a MySQL database | ||
+ | * Configure the database in dbconfig.py | ||
+ | * Run chart generation software from the stats/reports/ directory. | ||
+ | * Warning: the output is raw and the program takes a long time to run | ||
=Baseline numbers from specific collections= | =Baseline numbers from specific collections= | ||
− | We can also know the number of works licensed at various [[ | + | We can also know the number of works licensed at various [[content curators]]. The largest of these based on recent (December 2006) for various formats ''may'' (there could easily be a larger CC-licensed video collection than Revver) be: |
− | + | {| border="1" | |
− | + | |+ Licensed work counts at leading repositories | |
− | + | ! Repository !! 2005-08 !! 2005-11 || 2005-12 !! 2006-01 || 2006-04 || 2006-05 || 2006-07 || 2006-09 || 2006-12 || 2007-03 || 2007-06 || 2008-01 | |
+ | |- | ||
+ | ! [http://flickr.com/creativecommons Flickr] (photos) | ||
+ | | 4.1m || || 7.1m || || 10.8m || 12.7m || || 19.7m || 25.5m || 32.5m || 38.7m || 57.9m | ||
+ | |- | ||
+ | ! [http://www.soundclick.com/business/license_list.cfm Soundclick] (audio) | ||
+ | | 159k || || || 200k || 220k || || 249k || || 294k || 324k || 372k || 430k | ||
+ | |- | ||
+ | ! [http://revver.com Revver]<sup>*</sup> (video) | ||
+ | | na || 0 || || || 19k || || || || 119k || 214k || 296k || 417k | ||
+ | |} | ||
+ | |||
+ | Also see [http://wayback.archive.org/web/*/http://www.jamendo.com/en/?p=stats old Jamendo stats] and [http://magnatune.com/info/stats/ Magnatune stats]. | ||
+ | |||
+ | <sup>*</sup> Revver is an overestimate, probably total number of uploads to date, some of which may have been removed or never published. | ||
=License property charts= | =License property charts= | ||
Line 19: | Line 103: | ||
=Estimates over time= | =Estimates over time= | ||
+ | |||
+ | 2010-06 -- 400+ million as of December, 2010 | ||
+ | |||
+ | 2008-07-01 -- 130 million total works estimated using [http://code.creativecommons.org/viewsvn/stats/ankit/cc_total_estimate_with_comments.py?view=log Ankit's implementation] of Giorgos' scaling algorithm. | ||
+ | |||
+ | 2008-05-02 -- 67 million photos licensed at Flickr http://flickr.com/creativecommons | ||
+ | |||
+ | 2007-06-14 -- Multifaceted metrics presented at iSummit [http://creativecommons.org/weblog/entry/7551] | ||
+ | |||
+ | 2007-03-31 -- 33 million photos licensed at Flickr and growth over 1 year [http://creativecommons.org/weblog/entry/7307] | ||
+ | :Based on a swivel.com user's data collection from http://flickr.com/creativecommons | ||
2006-06-13 -- 140 million pages licensed [http://creativecommons.org/weblog/entry/5936] | 2006-06-13 -- 140 million pages licensed [http://creativecommons.org/weblog/entry/5936] | ||
Line 49: | Line 144: | ||
2004-09-17 -- Searching for Creative Commons on Yahoo![http://creativecommons.org/weblog/entry/4405] | 2004-09-17 -- Searching for Creative Commons on Yahoo![http://creativecommons.org/weblog/entry/4405] | ||
:4.7m pages link to CC licenses according to Yahoo! queries. | :4.7m pages link to CC licenses according to Yahoo! queries. | ||
+ | |||
+ | 2003-12 | ||
+ | :1m | ||
+ | |||
+ | == Issues == | ||
+ | === Fixed === | ||
+ | * Until 2008-07-01, the backlinks (e.g. [http://labs.creativecommons.org/metrics/csv-dumps/2004-04-01/00:00:00/linkbacks-daily-Yahoo.csv] and [http://labs.creativecommons.org/metrics/csv-dumps/2004-04-01/00:00:00/linkbacks-daily-Yahoo.csv] for 2004-04-01, Yahoo and Google respectively) between 2004-04-01 and 2005-06-20 were incorrectly labeled. | ||
+ | ** The problem was bad importing between data formats in 2005. | ||
+ | ** The issue was fully corrected by 2008-07-01. | ||
+ | === Confirmed === | ||
+ | * Google API queries are not working properly right now (as of 2008-06-25). | ||
[[Category:FAQ]] | [[Category:FAQ]] | ||
+ | [[Category:Metrics]] |
Latest revision as of 15:46, 2 March 2014
See metrics for a broader discussion of CC adoption and impact.
Caveats
Estimating license adoption is a very inexact science. There is no authoritative source, and we neither control nor have inside knowledge of how the most comprehensive sources (web search engines) are constructed or how volatile they are. We rely primarily on Yahoo! link: queries, because Google's link: operator returns very incomplete results.
NOTE: Currently the best analysis of data similar to that discussed below (based on a snapshot independently gathered in January 2007) may be found in Giorgos Cheliotis' presentation on CC statistics from June 2007.
Raw data
Creative Commons has irregularly run programs that collect estimated total results from search engine link:{license_uri} queries and queries filtered by license property (Yahoo! and Google advanced search support filtering by license).
This data is more fully described at Metrics/Data Catalog.
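As a rough illustration of the link:{license_uri} query form, the sketch below builds the query string for one license. The URI pattern is inferred from the example license URI given on this page, the function names are hypothetical, and this is not the code CC actually ran.

```python
# Minimal sketch: build a "link:" query string for a CC license URI.
# Assumption: license URIs follow the pattern of the example on this page,
# http://creativecommons.org/licenses/by-nc/1.0/jp/ (the jurisdiction part is
# omitted for unported licenses). Function names are hypothetical.


def license_uri(code, version, jurisdiction=""):
    """Return a license URI such as .../licenses/by-nc/1.0/jp/."""
    uri = "http://creativecommons.org/licenses/{}/{}/".format(code, version)
    if jurisdiction:
        uri += jurisdiction + "/"
    return uri


def link_query(uri):
    """Return the search-engine query that asks for pages linking to uri."""
    return "link:" + uri


query = link_query(license_uri("by-nc", "1.0", "jp"))
print(query)
```

The resulting string is what would be submitted to a search engine that supports the link: operator; how the result count is then retrieved depends on the engine and is not shown here.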
Linkback data
This data is in the public domain. Read more about open data.
You can download raw MySQL dumps that are generated nightly from http://labs.creativecommons.org/metrics/sql-dumps/ -- this includes all data gathered programmatically by CC to date.
Single-day data is available in CSV format from http://labs.creativecommons.org/metrics/csv-dumps/. Here is a guide to the columns in the file, in order:
- internal ID number for this row (e.g., 5041)
- License URI (e.g., http://creativecommons.org/licenses/by-nc/1.0/jp/)
- search engine (e.g., All The Web)
- number of hits (e.g., 4680)
- date and time this linkback query run started (e.g., 2004-Apr-04 0:00:00)
- short form of license jurisdiction (e.g., jp)
- short form of license name (e.g., by-nc)
- license version (e.g., 1)
- long form of license jurisdiction (e.g., Japan)
WARNING: There are gaps in the data and results from any given method may be volatile to extremely volatile. Take the raw numbers with a huge grain of salt.
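The column guide above can be read into typed records with a few lines of Python. This is only a sketch under stated assumptions: it assumes the files are comma-delimited with no header row and that every row has the nine columns in the order listed; the field names are made up for illustration, and the sample row is assembled from the "e.g." values in the guide rather than copied from a real dump.

```python
import csv
import io
from datetime import datetime
from typing import NamedTuple


class LinkbackRow(NamedTuple):
    # Field names are hypothetical; the order follows the column guide.
    row_id: int
    license_uri: str
    search_engine: str
    hits: int
    started: datetime
    jurisdiction_code: str
    license_code: str
    version: str  # kept as text, since the guide shows "1" rather than "1.0"
    jurisdiction_name: str


def parse_linkbacks(text):
    """Parse CSV text into LinkbackRow records, skipping malformed rows."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if len(fields) != 9:
            continue  # does not match the nine-column guide
        rows.append(LinkbackRow(
            int(fields[0]), fields[1], fields[2], int(fields[3]),
            # Timestamp format assumed from the example "2004-Apr-04 0:00:00".
            datetime.strptime(fields[4], "%Y-%b-%d %H:%M:%S"),
            fields[5], fields[6], fields[7], fields[8]))
    return rows


# Sample row built from the example values in the column guide.
sample = ("5041,http://creativecommons.org/licenses/by-nc/1.0/jp/,"
          "All The Web,4680,2004-Apr-04 0:00:00,jp,by-nc,1,Japan\n")
rows = parse_linkbacks(sample)
print(rows[0].search_engine, rows[0].hits)
```

If the real files use a different delimiter or include a header line, pass the appropriate options to csv.reader or skip the first row.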
Flickr data
Information generated from Flickr is also available in the database dump above or here as one-day CSVs, like this one for June 23, 2008.
Also see the analyses of Flickr data upon reaching 100m CC licensed images and at around 135m CC licensed images; the latter includes a spreadsheet snapshot.
Software
Data gathering
The code used to gather the above data is available in the stats module of our Subversion repository.
If you want to run it yourself, here's what to do:
Check out the software:
- svn co http://code.creativecommons.org/svnroot/stats
- cd stats
Configure database access:
- rename dbconfig_EXAMPLE.py to dbconfig.py and edit it to point at the MySQL database and password you want to use (note that if you want to use a non-MySQL database, most of the tools we use are actually database-agnostic and require only tiny changes)
- mysql -h dbserver -uusername -p databasename < create_tables.sql
Check for dependencies. Note that the script expects Tor to be running on localhost!
- python sanity_check.py
Do a stats crawl!
- python link_counts.py log
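Because the scripts expect Tor on localhost, it can help to confirm that something is listening before starting a crawl. The sketch below is not part of the stats module and does not reproduce what sanity_check.py does; it only tests whether a TCP port accepts connections. Port 9050 is Tor's default SOCKS port and is an assumption here, so adjust it if your Tor is configured differently.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumption: Tor is using its default SOCKS port, 9050.
    if port_is_open("127.0.0.1", 9050):
        print("Something is listening on 127.0.0.1:9050 (likely Tor).")
    else:
        print("Nothing is listening on 127.0.0.1:9050; start Tor first.")
```

An open port only shows that some service is listening, not that it is Tor, so treat a positive result as a hint rather than proof.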
Charting
The stats module also contains some chart-generating code. To run this code:
- Import all.sql.gz into a MySQL database
- Configure the database in dbconfig.py
- Run chart generation software from the stats/reports/ directory.
- Warning: the output is raw and the program takes a long time to run
Baseline numbers from specific collections
We also know the number of works licensed at various content curators. Based on recent (December 2006) figures, the largest of these for various formats may be the following (there could easily be a larger CC-licensed video collection than Revver):
Repository | 2005-08 | 2005-11 | 2005-12 | 2006-01 | 2006-04 | 2006-05 | 2006-07 | 2006-09 | 2006-12 | 2007-03 | 2007-06 | 2008-01 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Flickr (photos) | 4.1m | | 7.1m | | 10.8m | 12.7m | | 19.7m | 25.5m | 32.5m | 38.7m | 57.9m |
Soundclick (audio) | 159k | | | 200k | 220k | | 249k | | 294k | 324k | 372k | 430k |
Revver* (video) | na | 0 | | | 19k | | | | 119k | 214k | 296k | 417k |
Also see old Jamendo stats and Magnatune stats.
* The Revver figure is an overestimate: it is probably the total number of uploads to date, some of which may have been removed or never published.
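The table uses shorthand such as 4.1m, 159k and na. For anyone who wants to compute with these figures, the sketch below converts the shorthand into integers and derives an overall growth factor from the first and last Flickr entries. The helper names are hypothetical, and the calculation is only an illustration of working with the table, not an analysis CC published.

```python
def parse_count(text):
    """Convert table shorthand ('4.1m', '159k', '0', 'na', '') to an int or None."""
    text = text.strip().lower()
    if text in ("", "na"):
        return None  # no measurement for that month
    multipliers = {"k": 1_000, "m": 1_000_000}
    if text[-1] in multipliers:
        # round() guards against floating-point noise such as 4099999.9999...
        return round(float(text[:-1]) * multipliers[text[-1]])
    return int(text)


def growth_factor(first, last):
    """Return how many times larger the last count is than the first."""
    return last / first


# First and last Flickr entries from the table (2005-08 and 2008-01).
flickr_start = parse_count("4.1m")
flickr_end = parse_count("57.9m")
print("Flickr grew by a factor of about %.1f" % growth_factor(flickr_start, flickr_end))
```

Empty cells and na are returned as None so that missing months are not mistaken for zero.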
License property charts
These charts show a breakdown of the types of licenses deployed and the properties of deployed licenses, based on Yahoo! queries as of 2006-06-13. (As above the Google API is now superior for an aggregate count, but Yahoo link: searches are superior for measuring the relative deployment of specific licenses and thus specific license types.)
Estimates over time
2010-06 -- 400+ million as of December, 2010
2008-07-01 -- 130 million total works estimated using Ankit's implementation of Giorgos' scaling algorithm.
2008-05-02 -- 67 million photos licensed at Flickr http://flickr.com/creativecommons
2007-06-14 -- Multifaceted metrics presented at iSummit [1]
2007-03-31 -- 33 million photos licensed at Flickr and growth over 1 year [2]
- Based on a swivel.com user's data collection from http://flickr.com/creativecommons
2006-06-13 -- 140 million pages licensed [3]
- Based on Google queries.
2005-12 -- 45 million pages licensed [4]
- Based on Google queries.
2005-08-09 -- 53 million pages licensed [5]
- Again based on Yahoo! queries, this number turned out to be overstated as Yahoo! tuned their results estimation after growing their index.
2005-06-13 -- CC search query breakdown[6]
- Breakdown of search requests and desired license properties -- people searching for video want the least freedom.
2005-05-27 -- CC in Yahoo! Advanced Search[7]
- Yahoo! queries say 16m pages linking to a CC license.
2005-03-23 -- Yahoo! Search for Creative Commons[8]
- Close to 14m pages link to a CC license according to Yahoo! queries.
2005-03-07 -- CC search index breakdown[9]
- Breakdown of (small) CC-nutch index -- audio publishers are most permissive, video publishers least.
2005-02-25 -- License Distribution [10]
- Based on Yahoo! queries there are now 10m licensed documents. Pie chart of what those licenses are.
2005-02-18 -- How many pages link to a CC license? [11]
- Based on Yahoo! queries, "well over 5m." At the end of 2003 it was 1m.
2004-09-17 -- Searching for Creative Commons on Yahoo![12]
- 4.7m pages link to CC licenses according to Yahoo! queries.
2003-12
- 1m
Issues
Fixed
- Until 2008-07-01, the backlinks (e.g. [13] and [14] for 2004-04-01, Yahoo and Google respectively) between 2004-04-01 and 2005-06-20 were incorrectly labeled.
- The cause was a faulty import between data formats in 2005.
- The issue was fully corrected by 2008-07-01.
Confirmed
- Google API queries are not working properly right now (as of 2008-06-25).