Summer of Code 2013

Creative Commons is participating in Google's Summer of Code as a mentoring organization. Student submissions for SoC 2013 will take place April-May; see the GSoC 2013 homepage for more details.

For the application process and tips, please see the Summer of Code Application page. If you want more information, you can email Dan Mills or visit our IRC channel at Freenode/#cc.

OK? Now you (hopefully) have a general sense of what the overall project is about. You may have also checked out the code and noticed it uses node.js--that means that to hack on it you'll need to be relatively fluent in JavaScript, and you'll also need to keep in mind some of the unique features of node.js: in particular, its heavy use of non-blocking IO via asynchronous callbacks. Here is a good starter book if you know JS but haven't used node.js before.

Here are some ideas to get you started. Keep in mind these are only starter ideas: feel free to mix and match them, or propose something new if you think it's worth exploring.

(By the way, if you visited this page in the last couple of days, you may notice that the ideas have changed somewhat. The same ideas are still here; we just combined a few of them because they seemed too easy on their own!)

Project: Homepage Themes

Brief Explanation: OpenHome has a basic view which currently displays images only. However, OpenHome is meant to be a mash-up of a variety of media, including images, documents, and more. We need smart ways to categorize, organize, and display this content. What's more, we may start by focusing on one kind of user, but we will quickly need to think about other user types. This project is about creating OpenHome templates and features for two kinds of users:

* A university professor who wants a home page for their written papers.

* An amateur artist looking to show off their artwork and writing.

Based on these users, we will perform user research and come up with specific needs, which we will then build our app to support. You don't need to do the user research itself--but you'll be involved in making rough prototypes we will user-test.

Expected Results: As mentioned before, you'll be responsible for early prototypes that we will user-test (for example, via usertesting.com), and based on that feedback we'll adjust the themes/views.

We expect to need different styles for each user, as well as at least a few different views for the content: a "timeline" view, a "content type" view, and a "metrics" view that provides graphs/counters for how many times the content has been viewed/favorited/etc.
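To make the three views a little more concrete, here is a minimal sketch of how the same list of content could feed each of them. The item fields (`type`, `added`, `views`, `favorites`) and the function names are assumptions made up for illustration; they are not taken from the OpenHome code.

```javascript
// Hypothetical content items; the field names are assumptions.
const items = [
  { title: 'Sunset', type: 'image', added: '2013-03-02', views: 40, favorites: 3 },
  { title: 'Thesis draft', type: 'document', added: '2013-01-15', views: 12, favorites: 1 },
  { title: 'Harbor', type: 'image', added: '2013-02-20', views: 25, favorites: 0 },
];

// "Timeline" view: newest first. ISO dates sort correctly as strings.
function timelineView(list) {
  return [...list].sort((a, b) => b.added.localeCompare(a.added));
}

// "Content type" view: group items under their media type.
function contentTypeView(list) {
  const groups = {};
  for (const item of list) {
    (groups[item.type] = groups[item.type] || []).push(item);
  }
  return groups;
}

// "Metrics" view: totals that a graph or counter could display.
function metricsView(list) {
  return list.reduce(
    (totals, item) => ({
      views: totals.views + item.views,
      favorites: totals.favorites + item.favorites,
    }),
    { views: 0, favorites: 0 }
  );
}
```

Keeping each view as a pure function over the same data would let a theme pick whichever views suit its user type.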

Knowledge Prerequisite: JavaScript, HTML/CSS, node.js

Skill Level: Medium

Mentor: Dan Mills or other CC tech staff member

Project: CC Web Content API

Brief Explanation: All CC licenses require attribution when using the licensed content. We've found that for most people, though, it's too cumbersome to do it correctly--even when they have the best intentions. Not only that, but in virtually every case, the content author would never find out that their content was used, even if it was attributed correctly.

So, we're interested in ways of making attribution completely automatic for websites, while also providing a way for content authors to find out where their content is being used, and even a way for those using the content to send a message or give "kudos" ("thanks") to the author.

This is where the CC Web Content API comes in: it's a JavaScript library that scans the content on a page, and communicates with the OpenHome service to find out if any of it is CC licensed. If it is, it can automatically add a widget to the content providing the correct attribution data (author, etc) as well as a couple of buttons for e.g. sending a message to the author, or marking it as a favorite.
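The scan-then-lookup flow described above could look roughly like the sketch below. Everything here is an assumption for illustration: page elements are plain objects rather than DOM nodes, `config` is a made-up configuration shape, and `lookup` stands in for a request to the OpenHome service, whose real API is not defined yet.

```javascript
// Hypothetical configuration: only scan images carrying a given class.
const config = { tag: 'img', className: 'cc-scan' };

// Pick the elements worth checking. `elements` is a plain array of
// { tag, classes, src } objects standing in for DOM nodes.
function selectCandidates(elements, cfg) {
  return elements.filter(
    (el) => el.tag === cfg.tag && el.classes.includes(cfg.className)
  );
}

// Ask a lookup function (standing in for the OpenHome service) about each
// candidate and keep only those it recognizes as CC licensed.
async function findLicensed(elements, cfg, lookup) {
  const results = [];
  for (const el of selectCandidates(elements, cfg)) {
    const meta = await lookup(el.src); // resolves to metadata or null
    if (meta) results.push({ src: el.src, meta });
  }
  return results;
}
```

The lookup is asynchronous because it would be a network request; a widget would then be attached to each entry in the returned list.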

Expected Results: For this to work, there are four things that need to be completed:

1. A way for OpenHome to fingerprint or add metadata into content when the content is first added to OpenHome. This is not the core focus of this project, but you need at least some basic metadata or fingerprinting to get to the next step.

2. An API for OpenHome to receive a URL and return the metadata for the file, if it is known.

3. A JS library that implements the CC Web Content API. The library should be easily configurable to allow scanning only certain kinds of media (e.g., only images with a certain class attribute), and it should also check if there is already embedded metadata on the page.

4. An HTML/CSS widget that the JS library will add to media files that were determined to be CC licensed. It should work on at least one kind of media file (for example, images).
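As a starting point for step 2, here is a minimal sketch of a lookup handler. The `/lookup?url=...` path, the in-memory `knownWorks` store, and the response shape are all assumptions for illustration; the real OpenHome API may look quite different. The handler returns a plain `{ status, body }` object so it could be wired into whichever node.js HTTP framework the project uses.

```javascript
// Hypothetical in-memory store mapping content URLs to metadata.
const knownWorks = new Map([
  ['http://example.org/photos/harbor.jpg', {
    title: 'Harbor',
    author: 'A. Artist',
    license: 'http://creativecommons.org/licenses/by/3.0/',
  }],
]);

// Handle a request path such as "/lookup?url=...".
function handleLookup(requestPath, store) {
  // The base URL is only needed so that a relative path can be parsed.
  const parsed = new URL(requestPath, 'http://localhost');
  const target = parsed.searchParams.get('url');
  if (parsed.pathname !== '/lookup' || !target) {
    return { status: 400, body: { error: 'expected /lookup?url=...' } };
  }
  const meta = store.get(target);
  return meta
    ? { status: 200, body: meta }
    : { status: 404, body: { error: 'unknown content' } };
}
```

An exact URL match like this only works for content that has not been moved or re-hosted; that is where fingerprinting or embedded metadata (step 1) would come in.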

Optional: If you want to go further, you can implement two features:

1. A "message author" button in the widget, which would (after a login or CAPTCHA) send a message to the author through the OpenHome service. The author can get an email, or the message can appear in a private area of their OpenHome page, whichever you prefer.

2. A "kudos" button. This is similar (in implementation) to the message feature, but instead of a message it is simply a very easy way to say "thanks" to the author--think "favorite" or "like" button. Kudos can appear in a public section of the author's OpenHome page.

Knowledge Prerequisite: JavaScript, HTML/CSS, node.js

Skill Level: Medium

Mentor: Dan Mills or other CC tech staff member

Project: OpenAttribute Integration

Brief Explanation: Creative Commons co-developed a set of browser add-ons called OpenAttribute. They look for attribution metadata on a page and allow users to extract it for use elsewhere.

This project would enable OpenAttribute to have some of the features of the JS Web Content API project (see above), even on web pages that don't explicitly use the JS Web Content API library. Because of that, there's one additional twist: we can enlist users to report back to the content author (if the author wishes it) when their content is not being attributed correctly.

Expected Results: Updated OpenAttribute add-ons for at least Firefox and Chrome with the ability to detect CC content on a webpage that lacks attribution markup (by using the OpenHome API) and report it as such.
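The detection step could be sketched along these lines. The input shape (`src`, `hasMarkup`) and the `isKnownCcWork` callback, which stands in for a call to the OpenHome API, are assumptions for illustration and are not part of the existing OpenAttribute add-ons.

```javascript
// Classify media found on a page. Each entry is a plain object
// { src, hasMarkup }, where hasMarkup means license metadata was found in
// the page itself. `isKnownCcWork` stands in for an OpenHome API call.
function buildReport(mediaOnPage, isKnownCcWork) {
  const report = { attributed: [], unattributed: [], unknown: [] };
  for (const media of mediaOnPage) {
    if (media.hasMarkup) {
      report.attributed.push(media.src);
    } else if (isKnownCcWork(media.src)) {
      // Known CC content with no attribution markup: worth reporting.
      report.unattributed.push(media.src);
    } else {
      report.unknown.push(media.src);
    }
  }
  return report;
}
```

Only the `unattributed` bucket would be offered to the user as something to report back to the author.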

Knowledge Prerequisite: Firefox & Chrome add-on programming

Skill Level: Medium-low

Mentor: Dan Mills or other CC tech staff member

Project: Media Fingerprinting Library

Brief Explanation: CC would prefer that all content on the Web include correct licensing metadata. Alas, that is not the case. So we're interested in code that will allow us to identify a given item across the Web, even if there's no metadata alongside (or within) it. The tricky part is: people often crop or resize images, clip videos, re-encode content, or quote only pieces of text. So a simple hash is not sufficient: we need more intelligent fuzzy matching. That's what this project is about.

Expected Results: A library that provides two methods: 1) Given a media file, output a fingerprint, and 2) Given a file and a fingerprint, return the likelihood of the file matching the original file. You can focus your efforts on only one or two media types, or you can do more if it's possible.
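To show the shape of the two methods, here is a minimal sketch for plain text only. It assumes a fingerprint made of 3-word "shingles" (overlapping runs of words) and scores a candidate by how many of its shingles occur in the original. The function names are hypothetical, and the score is a simple ratio between 0 and 1, not a calibrated likelihood.

```javascript
// Split text into lowercase words, then into overlapping runs of k words.
function shingles(text, k) {
  const words = text.toLowerCase().match(/[a-z0-9']+/g) || [];
  const out = new Set();
  for (let i = 0; i + k <= words.length; i++) {
    out.add(words.slice(i, i + k).join(' '));
  }
  return out;
}

// Method 1: given a (text) file's contents, output a fingerprint.
// Here the fingerprint is simply the sorted list of 3-word shingles.
function fingerprint(text) {
  return [...shingles(text, 3)].sort();
}

// Method 2: given contents and a fingerprint, return a score in [0, 1].
// The score is the fraction of the candidate's shingles that occur in the
// fingerprint, so an exact quote of part of the original still scores 1.
function matchScore(text, fp) {
  const known = new Set(fp);
  const candidate = shingles(text, 3);
  if (candidate.size === 0) return 0;
  let hits = 0;
  for (const s of candidate) if (known.has(s)) hits++;
  return hits / candidate.size;
}
```

A real fingerprint would need to be far more compact than the full shingle list, and other media types would need their own feature extraction.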

The library can be in a low-level language (C/C++) or you can use a higher-level language (JavaScript) if it's feasible. Speed is not a major concern at this point.

Bonus: An additional API/method to detect content inside other files (e.g., a PowerPoint file that includes a CC licensed image, or a still image inside a video).

Notes / Resources: The first task is to decide on a strategy to compare two items and decide how similar they are. Some choices are:

* Hamming distance (bitwise; on bit strings this equals the Manhattan distance)

* Euclidean distance (plane distance, also good in higher dimensions)

* Set similarity (Jaccard index; MinHash)
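For reference, the three measures can be written in a few lines each. This is only a sketch: it assumes 32-bit integer fingerprints for Hamming distance, equal-length numeric arrays for Euclidean distance, and arrays or sets of comparable values for the Jaccard index.

```javascript
// Hamming distance: number of differing bits between two unsigned 32-bit
// integers (e.g. two short binary fingerprints).
function hammingDistance(a, b) {
  let diff = (a ^ b) >>> 0;
  let count = 0;
  while (diff !== 0) {
    count += diff & 1;
    diff >>>= 1;
  }
  return count;
}

// Euclidean distance between two equal-length numeric vectors.
function euclideanDistance(u, v) {
  if (u.length !== v.length) throw new Error('vectors must have equal length');
  let sum = 0;
  for (let i = 0; i < u.length; i++) sum += (u[i] - v[i]) ** 2;
  return Math.sqrt(sum);
}

// Jaccard index: size of the intersection divided by size of the union.
function jaccardIndex(a, b) {
  const setA = new Set(a);
  const setB = new Set(b);
  if (setA.size === 0 && setB.size === 0) return 1; // two empty sets match
  let shared = 0;
  for (const x of setA) if (setB.has(x)) shared++;
  return shared / (setA.size + setB.size - shared);
}
```

Note that the first two are distances (0 means identical), while the Jaccard index is a similarity (1 means identical).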

For this project, set similarity seems like the best choice. It would potentially allow us to detect works remixed into other works, if some portion of them has remained intact in some way. The technique involves distilling a document into a set of things; the similarity of two documents is then simply the ratio of the things they have in common to all the things that appear in either one.
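Comparing full sets gets expensive as they grow, which is where MinHash comes in: each set is reduced to a short signature, and the fraction of matching signature positions estimates the set similarity. Below is a minimal sketch. The seeded FNV-1a-style string hash is an assumption chosen for brevity; it is not a carefully designed hash family, and a real implementation would want better-behaved hash functions.

```javascript
// 32-bit FNV-1a-style hash computed over the string's UTF-16 code units.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// MinHash signature: for each of `numHashes` seeded hash functions, keep
// the smallest hash value seen over all elements of the set. The seed is
// mixed in by prefixing it to the element.
function minHashSignature(elements, numHashes) {
  const sig = new Array(numHashes).fill(0xffffffff);
  for (const el of elements) {
    for (let i = 0; i < numHashes; i++) {
      const h = fnv1a(i + ':' + el);
      if (h < sig[i]) sig[i] = h;
    }
  }
  return sig;
}

// The fraction of positions where two signatures agree estimates the
// similarity of the underlying sets.
function estimateSimilarity(sigA, sigB) {
  let same = 0;
  for (let i = 0; i < sigA.length; i++) if (sigA[i] === sigB[i]) same++;
  return same / sigA.length;
}
```

More hash functions give a tighter estimate at the cost of a longer signature; the right number is something to settle experimentally.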

A good way to start is with text, using a technique called shingling. For something like images, we'll need more work to determine which "interesting" features of the image to consider (to generate the set of things). This is called "keypoint extraction" and involves using standard algorithms to find vectors of floats that describe each keypoint. Since two keypoint vectors from images might be very similar but not identical, some additional work in clustering and mapping to example keypoints is required.
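The "mapping to example keypoints" step could be sketched as follows: each keypoint vector is replaced by the index of its nearest example, so that near-identical vectors collapse to the same set member. The example keypoints here are hand-written placeholders; in practice they would come from a clustering step (for instance k-means over many images), which this sketch does not implement.

```javascript
// Squared Euclidean distance between two equal-length vectors.
function squaredDistance(u, v) {
  let sum = 0;
  for (let i = 0; i < u.length; i++) sum += (u[i] - v[i]) ** 2;
  return sum;
}

// Map one keypoint vector to the index of its nearest example keypoint.
function nearestExample(vector, examples) {
  let best = 0;
  for (let i = 1; i < examples.length; i++) {
    if (squaredDistance(vector, examples[i]) <
        squaredDistance(vector, examples[best])) {
      best = i;
    }
  }
  return best;
}

// Turn an image's keypoint vectors into a set of example indices, so that
// two images can be compared with set similarity.
function toFeatureSet(vectors, examples) {
  return new Set(vectors.map((v) => nearestExample(v, examples)));
}
```

Real keypoint vectors have many more dimensions than the two-element ones used to try this out, but the mapping works the same way.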

Some reading:

* Chapters 1 and 3 of Mining Massive Datasets

* building shingles in text

* Introduction to Information Retrieval

* OpenCV for extracting things (features) of images

* BRISK / FREAK: algorithms for "keypoint extraction", for images

* pHash.org might be something we can use.

Knowledge Prerequisite: Media formats/encodings, JavaScript, C/C++.

Skill Level: High

Mentor: Dan Mills or other CC tech staff member

Project: Metadata Embedding Library

Brief Explanation: In order to make it easier to track CC licensed content, it's possible to embed metadata into files (see our pages on XMP and). However, it's difficult for users to do this. We'd like to build a service that takes media files and is able to add licensing metadata, and the first step is to create a library that can do the low-level work.

Expected Results: A library that is able to get/set XMP metadata on as many file formats as possible. We'll make a prioritized list of file types and agree on a core set before you start. JS is preferred for the library, but it could be written in some other language and have JS bindings. It is preferable to have the library be interoperable with the on-disk format of Liblicense. That is not an absolute requirement, but you would need to present a detailed argument for why it would be better to break compatibility.
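For a feel of the data involved, here is a minimal sketch that builds and reads the RDF/XML part of an XMP packet carrying a CC license. The function names are hypothetical, only two properties are written, and the sketch does not embed anything into a media file: locating and rewriting the packet inside each file format is the real work of this project.

```javascript
// Escape the characters that are special in XML text and attributes.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Build the RDF/XML body of an XMP packet carrying license information.
function buildXmp({ licenseUrl, attributionName }) {
  return [
    '<x:xmpmeta xmlns:x="adobe:ns:meta/">',
    ' <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">',
    '  <rdf:Description rdf:about=""',
    '    xmlns:cc="http://creativecommons.org/ns#">',
    `   <cc:license rdf:resource="${escapeXml(licenseUrl)}"/>`,
    `   <cc:attributionName>${escapeXml(attributionName)}</cc:attributionName>`,
    '  </rdf:Description>',
    ' </rdf:RDF>',
    '</x:xmpmeta>',
  ].join('\n');
}

// Naive reader: pull the license URL back out with a regular expression.
// It returns the raw attribute value without un-escaping entities; a real
// library would use an XML parser instead.
function readLicense(xmp) {
  const match = /<cc:license rdf:resource="([^"]*)"/.exec(xmp);
  return match ? match[1] : null;
}
```

Whether the output matches what Liblicense writes on disk would need to be checked against Liblicense itself.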

Notes / Resources: You will want to read up on the following:

Liblicense

XMP

EXIF / exiv2 / exiv2node

Knowledge Prerequisite: JavaScript, possibly other languages/frameworks (see above).

Skill Level: Medium

Mentor: Dan Mills or other CC tech staff member

Project: Media Widget

Brief Explanation: The CC media widget will allow content from a user's CC homepage (a product we're working on) to be embedded into other sites, like Tumblr, WordPress, etc. Our main interest is in specialized widgets that excel at displaying particular media types, not a generic "file list" widget. For example, a great image widget probably looks and feels different from an academic paper widget.

Expected Results: A widget that is able to take CC licensed files and their metadata and visualize it in neat ways that users find compelling. The widget must also include the licensing metadata/attribution information: which license it's released under, the author's name/handle, and so on. Clever ways of displaying this information in a way that is accessible but not annoying will be a plus.
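Whatever the widget ends up looking like, it will need to turn a work's metadata into attribution markup. A minimal sketch, assuming a made-up `work` object and CSS class name, and using the `rel="license"` link convention for marking the license:

```javascript
// Escape text for safe inclusion in HTML.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Render an attribution line for one work as an HTML string.
function renderAttribution(work) {
  return (
    '<p class="cc-attribution">' +
    `"${escapeHtml(work.title)}" by ${escapeHtml(work.author)}, ` +
    `<a rel="license" href="${escapeHtml(work.licenseUrl)}">` +
    `${escapeHtml(work.licenseName)}</a></p>`
  );
}
```

Escaping matters here because titles and author names come from users and end up embedded in other people's sites.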

Knowledge Prerequisite: JavaScript, HTML/CSS, node.js

Skill Level: Low

Mentor: Dan Mills or other CC tech staff member
