You want to aggregate CC-licensed media (e.g., images, video, audio). How do you discover licensed content? There are two complementary high-level options.
First, have or obtain access to a content repository with CC-licensed content, e.g., one listed at content curators.
This is far from impossible, even without a relationship with the repository: feeds provide access to many repositories, and feeds should contain license information for licensed content (see syndication).
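As a minimal sketch of reading license information from a feed, the following assumes an Atom feed whose entries carry a link with rel="license" next to their media enclosures. That layout is an assumption; other feeds may state licenses in other ways (see syndication). The sample feed, its URLs, and the function name licensed_enclosures are hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical feed used only for illustration.
SAMPLE_FEED = """
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example photo feed</title>
  <entry>
    <title>Sunset</title>
    <link rel="enclosure" type="image/jpeg" href="https://example.org/sunset.jpg"/>
    <link rel="license" href="https://creativecommons.org/licenses/by-sa/4.0/"/>
  </entry>
  <entry>
    <title>Harbor</title>
    <link rel="enclosure" type="image/jpeg" href="https://example.org/harbor.jpg"/>
  </entry>
</feed>
"""


def licensed_enclosures(feed_xml):
    """Return (media_url, license_url) pairs for entries that state a license."""
    root = ET.fromstring(feed_xml.strip())
    results = []
    for entry in root.findall(ATOM + "entry"):
        links = entry.findall(ATOM + "link")
        # Assumption: the license is given as <link rel="license"> on the entry.
        licenses = [l.get("href") for l in links if l.get("rel") == "license"]
        if not licenses:
            continue  # no license statement, so skip this entry
        for link in links:
            if link.get("rel") == "enclosure":
                results.append((link.get("href"), licenses[0]))
    return results


if __name__ == "__main__":
    for media_url, license_url in licensed_enclosures(SAMPLE_FEED):
        print(media_url, license_url)
```

Entries without a license statement are skipped rather than guessed at, which errs on the side of not treating unlicensed media as licensed.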
Second, you can crawl the web and look for licensed media. If you don't already crawl, where should you start? Once you're crawling or otherwise have web pages, how do you know which media is licensed?
Yahoo! and Google web search are CC-enabled. You can use their APIs to get a list of URLs with some CC license associated (using keywords relevant to the media you are looking for). Use these as crawl seeds. Stop following a path once you have seen no licensed media for n consecutive hops.
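The stopping rule can be sketched as a breadth-first crawl that tracks, for each path, how many consecutive pages had no licensed media. This is an illustrative sketch only: the function crawl, the parameter max_barren_hops (the "n" above), and the fetch callback are hypothetical names, and fetch is assumed to return whether a page has licensed media together with its outgoing links.

```python
from collections import deque


def crawl(seeds, fetch, max_barren_hops=2):
    """Crawl from seeds; abandon a path after max_barren_hops pages in a row
    without licensed media.

    fetch(url) is a caller-supplied function (an assumption of this sketch)
    returning (has_licensed_media, outlinks).
    """
    licensed_pages = []
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)  # (url, barren hops so far)
    while queue:
        url, barren = queue.popleft()
        has_media, outlinks = fetch(url)
        if has_media:
            licensed_pages.append(url)
            barren = 0  # reset the counter on a hit
        else:
            barren += 1
        if barren >= max_barren_hops:
            continue  # do not follow links from this page
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                queue.append((link, barren))
    return licensed_pages


# Toy in-memory "web" standing in for real HTTP fetches.
TOY_WEB = {
    "seed": (True, ["a", "b"]),
    "a": (False, ["c"]),
    "b": (True, []),
    "c": (False, ["d"]),
    "d": (True, []),
}

if __name__ == "__main__":
    print(crawl(["seed"], TOY_WEB.__getitem__, max_barren_hops=2))
```

In the toy web, the page "d" is only reached when n is at least 3, because the path to it passes through two pages without licensed media.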
What media is licensed?
You have a web page. Are the images included in the page licensed? The video? Some approaches:
- Only accept license statements pertaining to specific media URLs. This has only been deployed experimentally so far. See RDFa.
- Accept page-level license statements as pertaining to media URLs with content type matching content type statements (e.g., dc:type StillImage).
- Assume that the page-level license applies to whatever media you're interested in.
- Use content- or site-specific heuristics, such as:
  - Assume the license applies to the largest image on the page.
  - Know that on flickr.com, the license applies to the image located at a particular location in the HTML.
- Look inside the media itself for embedded statements. So far, few images will be found with embedded license info, as Photoshop is currently the only tool supporting this, though more tools are expected in the coming months. See XMP.
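Two of the approaches above, a page-level license statement combined with the "largest image" heuristic, can be sketched as follows. This is a rough illustration under stated assumptions: the page states its license with rel="license" on an a or link element, and image size is taken from the declared width and height attributes rather than from the image files. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


def _declared_area(attrs):
    """Pixel area from width/height attributes; 0 if missing or not numeric."""
    try:
        return int(attrs.get("width", 0)) * int(attrs.get("height", 0))
    except (TypeError, ValueError):
        return 0


class LicenseAndImages(HTMLParser):
    """Collect the first rel="license" URL and all <img> sources on a page."""

    def __init__(self):
        super().__init__()
        self.license_url = None
        self.images = []  # list of (src, declared pixel area)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rels = (attrs.get("rel") or "").lower().split()
        if tag in ("a", "link") and "license" in rels:
            if self.license_url is None:
                self.license_url = attrs.get("href")
        elif tag == "img" and attrs.get("src"):
            self.images.append((attrs["src"], _declared_area(attrs)))


def licensed_largest_image(html):
    """Return (image_src, license_url), or None if either is missing."""
    parser = LicenseAndImages()
    parser.feed(html)
    if not parser.license_url or not parser.images:
        return None
    src, _ = max(parser.images, key=lambda item: item[1])
    return src, parser.license_url


# Hypothetical page used only for illustration.
PAGE = """<html><body>
<img src="logo.png" width="88" height="31">
<img src="photo.jpg" width="800" height="600">
<a rel="license" href="https://creativecommons.org/licenses/by/4.0/">CC BY 4.0</a>
</body></html>"""

if __name__ == "__main__":
    print(licensed_largest_image(PAGE))
```

A heuristic like this can mislabel media, for example when the largest image is an advertisement, so treat its results as candidates rather than confirmed license statements.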