Training data in the media landscape

Published: November 28, 2023

One of the greatest concerns for investors in generative AI companies is copyright. Will lawsuits in this uncharted territory sweep away profits and deter nascent businesses? Will winners simply move fast and break things, including the law? The answer is unknown. We do know, however, that a bottom-up approach to AI must appeal to the bottom. In other words, being legally pursued by an incumbent is expected, but being pursued by a consortium of independent authors, as OpenAI has been, calls its go-to-market into question and invites skepticism about its future rate of adoption.

Independent authors and artists are not the only ones to have spoken out about the threat AI poses to their works: the New York Times and Getty Images have insisted upon taking legal action against the makers of ChatGPT and Stable Diffusion, respectively. With consensus around “fair use” now wavering, both the entities that operate these AI models and the IP owners that supply their training data are simultaneously at risk and at odds with one another.

The tumultuous legal landscape has unsurprisingly affected growth in the sector. Despite a whopping $18B of venture funding deployed into the AI sector in Q3 of this year, consumer adoption is still in its nascent phase. Only about 180M people actively use ChatGPT, with hallucinations (defined as confident responses by an AI that do not seem to be justified by its training data) among the largest complaints from active users. AI hallucinations are, at least in part, a consequence of slow-moving copyright law: without access to recent and verifiable material from content owners, models break and customers churn.

In an ideal world, IP owners are comfortable being part of the AI ecosystem: they are opted in. In fact, they are made comfortable by standards that ensure compensation for their efforts and for the crucial training data they provide. Customers pay fair prices for the ability to leverage copyright-protected content and, over time, AI becomes a significant revenue stream for media conglomerates, authors, and artists alike. The fear that IP owners would be making a mistake in selling their work too cheaply would be eclipsed by the fear that they’d be leaving chips on the table if they didn’t opt in.

But today’s fear is neither nonsensical nor a passing phase. We’ve seen it play out over the past many months as both large and small copyright holders take action to protect their bread and butter in uncharted territory. As founders, VCs, and consumers alike work to craft a fair and lucrative AI ecosystem, we may recall the mid-90s and the internet’s ascent.

As the internet rose, traditionally print media companies began as skeptics, wondering whether the cost of building an internet business would truly grow their revenues and not just their readership. There was no proven digital ad-based business model, so incumbents hesitated to participate in the ecosystem. Skeptical content owners gatekept their IP, which really just meant keeping it in print for as long as they could, to avoid the inevitable web scraping or falling victim to a poorly executed banner ad. Wired and the Economist were notable first movers in the space, before CNN digital went live in 1995 and the initial skeptics followed suit.

Over time, as internet adoption grew, practices introduced through the internet, like web scraping, became a friend and not a foe to IP creators and owners. Today, for example, SEO (search engine optimization) is crucial to the discoverability of companies and content. It is made possible by the very thing that IP owners initially feared: scraping, albeit under agreed-upon standards that stave off worst-case scenarios for content owners.

The revenue mix of copyrighted written content, for example, has changed drastically over time, with digital ad sales cannibalizing print and digital subscription dollars over-indexing in more recent years. Digital ad and subscription businesses each seemed to threaten the fair price of IP, until they didn’t. Yesterday’s fears become today’s security blankets, and with growing interest and investment in generative AI, it is highly plausible that the revenue mix of content changes further to include AI-related content usage fees.

The internet grew media through a combination of fast consumer adoption and agreed-upon standards that enabled content owners’ participation in the ecosystem. Take web scraping, for example, which came to be governed by robots.txt, introduced relatively early on, in ’94, via the Robots Exclusion Protocol. For most of its life it was not an official internet standard (it was only formalized as RFC 9309 in 2022), yet it is mostly respected in the ecosystem, including by Google, letting website owners and operators protect their search presence and signal that protected content should stay out of the wrong hands. If history repeats itself, which I believe is particularly likely given that many of the same incumbent media companies and IP owners are prominent today, we’ll need standards around AI to spur adoption.
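To make the robots.txt mechanism concrete, here is a minimal sketch in Python of how a well-behaved crawler might check a publisher's rules before fetching a page. It uses the standard library's urllib.robotparser; the robots.txt contents, the bot names (ExampleAIBot, ExampleSearchBot), and the publisher.example URLs are all hypothetical and exist only for illustration. Note that the protocol is voluntary: nothing here stops a crawler that chooses to ignore the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary publisher: an AI training
# crawler is excluded entirely, while every other crawler is only kept
# out of the subscriber-only section.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /subscribers/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser that can answer access queries."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A compliant crawler asks before it fetches.
checks = [
    ("ExampleAIBot", "https://publisher.example/articles/1"),
    ("ExampleSearchBot", "https://publisher.example/articles/1"),
    ("ExampleSearchBot", "https://publisher.example/subscribers/archive"),
]
for agent, url in checks:
    verdict = "allowed" if parser.can_fetch(agent, url) else "disallowed"
    print(f"{agent} -> {url}: {verdict}")
```

In this sketch the AI crawler is turned away everywhere, while the search crawler can index public articles but not the subscriber archive. An AI-era standard would presumably need to go further than allow or disallow, for instance by expressing licensing terms, but that is beyond what robots.txt itself provides.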

At this moment, we very clearly face the classic chicken-and-egg problem and an awkward dance: customers are less likely to use AI models that have not been trained on licensed IP, as they may be worse, and IP owners are unlikely to offer up their prized possessions in exchange for unknown users and lesser-known revenues. Between the consumer and the IP owner, I believe that the latter is more likely to move first, despite often being labeled a slow incumbent.

As we saw during the gilded era of the NFT, some of the largest IP owners, like Nike, Gucci, and HBO, to name a few, were quick to adopt, perhaps in a brisk effort to activate their loyal consumers and find a new revenue stream that might last. While there may be some industry fatigue from these efforts, content owners, and subscription media businesses in particular, are only more keen to find ways to grow their revenues beyond price hikes in a bear market. AI activations will sound compelling, provided IP owners are able to get comfortable with the protections, explicit and implicit, within the ecosystem.

I believe that those who enforce trusted standards may be the ones who win in the space. To own the relationships with the IP owners that supply the high-quality training data is to own the consumer, as consumers and businesses alike look to generate high-quality, bespoke AI content and to cover themselves legally. If developers can easily access high-quality, curated data (e.g., data that avoids NSFW content), adoption begins. In my view, most of the additional value in the world of generative AI today will accrue in three places, in no particular order: 1) at the verticalized application layer, 2) via data management and governance tools that support these applications, and importantly, 3) to the original IP owners.

Those who enforce the standards in this legally messy realm, those who “get the ball rolling” so to speak, have the ability to generate value in all three of these places.

I look forward to enabling the progression of a landscape that brings true value to both consumers and creators.