João Fiadeiro • 2023-09-07
Decentralized data marketplaces connect AI teams with tailored training data to advance machine learning models. Built on blockchain rails, these exchanges simplify licensing and match buyers with niche datasets from a global pool of crowdsourced providers.
The availability of high-quality training data is crucial for developing effective machine learning models. While large, generic datasets have fueled rapid progress in AI in recent years, tailored, task-specific data remains hard to come by. This is especially true for settings that require many labeled examples of rare events, under-represented groups, or simply hard-to-get data. Synthetic data is great, but research shows that training generative AI models on AI-generated content results in “model collapse”. Data is the fuel in the ML engine, and while there’s a lot of it, we might not always have the exact data we need. Researchers can get creative with the datasets they use, but too many shortcuts result in suboptimal performance. Garbage in, garbage out. What then? Access to diverse, representative data that matches the precise needs of each project would greatly accelerate innovation in AI.
That's where the concept of a decentralized data marketplace comes in. By creating a platform where buyers and sellers can safely and efficiently exchange bespoke training datasets, we can unlock new sources of high-value data. This has the potential to supplement the standard datasets and enable more organizations to train AI models that work well for their specific application. A marketplace also incentivizes the production of currently non-existent datasets that could prove invaluable for pushing the frontiers of AI.
In this blog post, I'll make the case for how a decentralized data marketplace for machine learning could transform the practice of training AI models. I'll highlight the advantages of bespoke datasets tailored to each model's needs and outline how a platform facilitating data exchange could benefit both buyers and sellers. By democratizing access to specialized data, we can foster broader participation in the development of AI and allow new voices to shape its future. Read on to learn more about this promising approach to acquiring the diverse data necessary to build fair, robust and highly capable AI systems.
A few years ago at Google, I encountered the challenge of needing a large, specialized dataset to improve an AI system. My team was working on enhancing speech recognition for accented speakers. While existing models performed well for native English speakers, the word error rate skyrocketed for those with thick accents, especially non-native accents. Bridging this gap required training on abundant real-world examples of accented speech. But collecting such focused data in-house proved difficult and expensive:
We estimated needing over 100 hours of audio from Spanish-accented English speakers, but quality samples proved hard to find.
Ultimately, I managed to get the data, but only after many months of work and tens of thousands of dollars. At several points, I honestly considered flying to Mexico or Bogotá equipped with a microphone and just doing it myself. This costly process made clear the need for task-specific data. But specialized datasets are hard to obtain through traditional channels.
I wondered: what if there was a data bounty hunter who could coordinate this whole thing for me? Surely there was a better way. I dreamed of a platform where I could crowdsource the job:
For example, finding 100 students to each record 10 Spanish-accented speakers reading a script could have created my custom dataset faster and cheaper. The key insight is that for developing AI that performs well in the real world, training data must match the use case. By democratizing access to specialized datasets, a decentralized marketplace could enable easier creation of the precise, representative data needed for robust AI systems.
Yes, there are things like Mechanical Turk out there that allow us to crowdsource simple tasks, but the right person or entity could actually coordinate the end-to-end process of gathering, cleaning, and publishing the data. Collecting such a bespoke dataset takes much more than boots on the ground: it requires careful project management, attention and adherence to data privacy regulations, and the data engineering needed to make the result conform to the desired specifications.
Bounties are an effective mechanism for incentivizing the completion of well-defined tasks, especially ones requiring specialized skills. Rather than formal contracts, bounties are open calls that allow flexible participation. Let's explore why they can be so useful.
Flexibility Over Rigidity
Unlike contracted work, bounties allow variable participation without complex agreements. Anyone with the capability can opt-in to complete a bounty for the advertised reward. This flexible structure is well-suited for specialized tasks.
Need to quickly build a website? A development bounty allows leveraging skills of capable freelancers without paperwork. Want expert advice on a technical topic? An open bounty lets knowledgeable people provide guidance without long-term commitments.
This fluid participation makes bounties ideal for one-off tasks where upfront agreements would add unnecessary friction.
Outcomes Over Obligations
Bounties also shift focus to outcomes rather than prescribed obligations. Submitters work towards the end goal in whatever way makes sense to them. There are no strict requirements on how to complete the work.
For example, a translation bounty cares about accurate translations, not about which tools the translators use. This goal-oriented flexibility allows more custom approaches tuned to each task.
Merit-Based Rewards
With bounties, the best work gets rewarded, not just whoever you happened to contract. Open participation lets the most capable contributors self-select into tasks they can excel at.
Quality is ensured through competition and the vetting of submissions. Only work that meets expectations earns the bounty. This merit-based system surfaces top talent.
As we'll explore, these principles make bounties a compelling way to secure specialized datasets tailored to ML needs.
Here’s what I’m pitching: a decentralized marketplace that facilitates access to specialized datasets tailored to each model's needs. The market connects data seekers with providers to enable trading of datasets that do not yet exist.
Data requesters submit requests for proposals (RFPs) outlining their ideal training data.
On the supply side, data bounty hunters bid on those proposals and deliver datasets to specification. A minimal sketch of what these two artifacts might contain is shown below.
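To make the exchange concrete, here is one way a request and a bid might be represented. Every field name here is a hypothetical illustration of the kind of information the marketplace would carry, not a proposed standard.

```python
from dataclasses import dataclass, field


@dataclass
class DataRequest:
    """Hypothetical RFP a data requester might post to the marketplace."""
    title: str                   # e.g. "Spanish-accented English speech"
    description: str             # plain-language summary of the use case
    data_format: str             # e.g. "16 kHz WAV clips with transcripts"
    volume: str                  # e.g. "100+ hours of audio"
    quality_criteria: list[str] = field(default_factory=list)  # acceptance checks
    budget_tokens: float = 0.0   # maximum the requester is willing to pay
    deadline_days: int = 90      # requested turnaround


@dataclass
class DataBid:
    """Hypothetical bid a data bounty hunter might submit against an RFP."""
    provider_id: str
    price_tokens: float          # asking price for the full delivery
    delivery_days: int           # promised turnaround
    sample_uri: str = ""         # optional pointer to a small sample of the data
```

A requester could publish the accented-speech RFP from earlier as a `DataRequest` and collect competing `DataBid`s from bounty-hunting teams.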
Just as a service like a.team allows anyone to hire a team consisting of a product manager, a designer, and an engineer to build an app or website far more cheaply, I imagine teams of subject-matter experts, data engineers, and data scientists coming together to produce bespoke datasets.
Access to niche data helps train robust models for their end-use cases. Shared incentives cultivate resources that do not yet exist. By connecting seekers and providers, the marketplace facilitates trading of bespoke data at any scale, on demand.
Unlike traditional fixed pricing models, auctions provide an effective method for determining the true market value of datasets. By facilitating price discovery through competitive bidding, auctions incentivize providers to offer fair prices tailored to demand.
Several auction formats can help elicit real-world pricing, and refinements such as intelligent reservation pricing, staggered auction timing, and bundle auctions further optimize the process.
Overall, well-designed auctions minimize pricing inefficiencies by incentivizing truthful bids. Market forces naturally drive prices toward equitable rates based on actual demand. This benefits both data consumers and providers compared to arbitrary fixed prices.
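As one concrete illustration of a truth-inducing format (a sketch of a well-known mechanism, not a design this marketplace would necessarily adopt): in a sealed-bid second-price procurement auction, providers bid the price they are willing to accept, the lowest ask wins, and the winner is paid the second-lowest ask, which makes bidding one's true cost the dominant strategy.

```python
def reverse_vickrey(asks: dict[str, float]) -> tuple[str, float]:
    """Pick a winner in a sealed-bid second-price (Vickrey) procurement auction.

    `asks` maps provider id -> asking price. The lowest ask wins, but the
    winner is paid the second-lowest ask, so overstating or understating
    one's true cost never helps.
    """
    if len(asks) < 2:
        raise ValueError("a second-price auction needs at least two bids")
    ranked = sorted(asks.items(), key=lambda kv: kv[1])
    winner = ranked[0][0]
    clearing_price = ranked[1][1]   # second-lowest ask sets the payment
    return winner, clearing_price


# Three bounty-hunting teams bid on the accented-speech RFP from earlier.
winner, price = reverse_vickrey({"team_a": 12_000, "team_b": 9_500, "team_c": 14_000})
print(winner, price)   # team_b wins and is paid 12000
```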
The success of marketplaces like Uber and TaskRabbit has shown the power of matching supply and demand for simple, commodity-like services. However, as the "gig economy" matures, there is a shift towards more complex services requiring skilled workers - the "talent economy." This is the niche between basic gig work and high-end professional services dominated by platforms like LinkedIn.
New vertical marketplaces are emerging to better connect employers with specialized workforces in areas like nursing, construction, and more. By focusing on labor categories with standardized skills and credentials, these marketplaces improve search and matching. This benefits employers by making hiring faster and more efficient. Workers also benefit from access to more job opportunities.
A decentralized data marketplace stands to bring similar advantages to the world of AI and machine learning. Right now, options for obtaining training data are limited: generic datasets only get you so far, and collecting specialized, task-specific data in-house is expensive and time-consuming. A platform connecting data buyers and producers could make trading bespoke datasets easier and more affordable.
By focusing specifically on the needs of the ML community, a data marketplace can facilitate better matching based on particular data requirements. For requesters, this means faster access to tailored datasets. For providers, it unlocks new revenue streams for generating currently non-existent data. Enabling efficient exchange to meet precise demands is key.
As data becomes increasingly valuable for AI development, an ecosystem for trading specialized datasets can help democratize access. Much like how vertical talent marketplaces are embedding themselves into company HR workflows, a data marketplace could integrate with ML training pipelines. This creates symbiotic value for both sides of the transaction.
A number of platforms have emerged to facilitate the exchange of third-party datasets. These marketplaces aim to simplify data discovery, procurement, and integration.
While capabilities vary, these platforms aim to lower barriers to third-party data acquisition. More importantly, they validate demand for accessible data exchange ecosystems.
On the other hand, there are several service providers who build bespoke datasets. Several of them (Innodata, Cogito) are primarily focused on data annotation for use cases like computer vision and content moderation: these shops take existing data and annotate, label, and enrich it. Appen is an example of a more complete data provider which, alongside annotating video, image, and audio data, actually collects data with dedicated expert teams and a crowd of 1M users.
There are many highly specialized datasets ML practitioners might want to obtain through a decentralized marketplace; the accented-speech recordings from earlier are just one example.
The key is identifying where current datasets fall short for a given ML task, then obtaining customized data that fills the gaps. By connecting data seekers and providers, an open marketplace enables cheaper and faster access to specialized data at any scale. The main insight is this: we produce data everywhere we go, often without even knowing how useful it might be in the right context. Information is data in context: one man’s rubbish may be another man’s treasure. Who knew that Minecraft videos could one day form the basis of cutting-edge AI research?
Many organizations sit on troves of proprietary data that, if sufficiently anonymized and aggregated, could provide value for training AI systems. Financial firms have records of transactions and consumer activity. Customer support logs contain myriad real-world conversations. HR data offers insights into workplace dynamics. The list goes on. While this data needs to be carefully processed to comply with regulations like GDPR and HIPAA, there is an opportunity for companies to generate additional revenue streams by selling access to compliant, sanitized versions. For example, chat logs from a bank's customer service could help train more capable conversational agents after removing any personal information. Transaction data can give retailers insights into purchasing habits if individual details are abstracted away.
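As a toy illustration of the kind of scrubbing involved before any such data could be listed (deliberately simplistic; as the next paragraph notes, real anonymization takes far more than pattern matching), a first pass over chat logs might redact the most obvious identifiers:

```python
import re

# Toy patterns for obvious identifiers. Production anonymization needs NER
# models, k-anonymity checks, and legal review on top of anything like this.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "CARD":  re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}


def redact(text: str) -> str:
    """Replace obvious personal identifiers with typed placeholders."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


print(redact("Call me at +1 415-555-0199 or email jane.doe@example.com"))
# -> "Call me at [PHONE] or email [EMAIL]"
```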
Of course, enterprises would need to implement robust anonymization and compliance procedures before attempting to monetize data. But by partnering with experts in data privacy and security, sharing select data through a trusted marketplace could enable new business models. The appetite for high-quality training data makes proprietary organizational datasets a potentially lucrative asset.
With proper consent, individuals could also opt in to share anonymized behavioral data through a decentralized marketplace. For example, someone might be willing to provide their YouTube watch history or Spotify listening habits. Aggregated workout data from apps like Strava offers insights into exercise trends. The apps and services we use generate troves of data that, if tokenized to protect privacy, could have value for training AI models. The key is giving users control over their data and sharing the proceeds.
An ethical marketplace would allow people to grant access to certain data in exchange for compensation, similar to selling compute power. Strict protocols would be necessary to prevent re-identification and abuse. But for users comfortable sharing select data, a marketplace could provide revenue while fueling innovation. Enabling individuals to monetize data consensually has the potential to democratize access to diverse training resources. With thoughtful implementation, data could become a new asset class for users to extract value from.
Tokenization provides finer-grained control over data sharing and monetization. Rather than relinquishing data to centralized entities, users maintain ownership of tokenized datasets. They grant access to buyers under agreed terms, with payments settling automatically via smart contracts. This ensures users retain agency over their data while still benefiting financially.
For enterprises and organizations, tokenized data enables new revenue streams without sacrificing proprietary data sovereignty. Granular permissions facilitate safe sharing of subsets of data.
On the buyer side, tokenization streamlines access and compliance. Usage terms can be codified into datasets, with auditable chains of custody. For sensitive applications like healthcare, tokenized data with embedded policies could expand sharing of vital resources for research.
A project like Ocean Protocol demonstrates the possibilities of web3 data marketplaces. Encrypted data containers allow sharing select datasets while retaining control. The protocol establishes data provenance and automates licensing agreements and payments via tokens. This balances open access with privacy and security.
By aligning incentives and automating trust, tokenized datasets minimize friction in data exchange. Users are empowered to share data on their terms. Buyers get tailored resources. This can unlock new sources of high-quality training data to advance AI in a responsible way: no more Sarah Silverman situations, where creators discover their work was swept into training data without consent or compensation!
Filecoin provides efficient decentralized data storage and retrieval. This capability enables novel data sharing models powered by dataset tokenization.
Specific datasets can be represented as non-fungible tokens (NFTs) on Filecoin. These NFTs act as licenses that control access to the underlying data.
Data owners can tokenize datasets while still retaining control through encrypted storage and smart contract-governed sharing policies.
NFT license-holders can access the Filecoin-hosted data for a duration or application specified in the contract. Usage terms and payments are automated.
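To show the shape of such a license in plain code, here is an illustrative Python model of the logic. It is not actual Filecoin or FVM contract code, and every name in it is hypothetical.

```python
import time
from dataclasses import dataclass


@dataclass
class DatasetLicense:
    """Illustrative stand-in for an NFT license governing a tokenized dataset."""
    dataset_cid: str     # content identifier of the encrypted data on Filecoin
    holder: str          # address of the current license holder
    expires_at: float    # unix timestamp at which access lapses
    allowed_use: str     # e.g. "model-training-only"

    def grants_access(self, requester: str, use: str, now: float | None = None) -> bool:
        """Access is granted only to the holder, for the licensed use, before expiry."""
        now = time.time() if now is None else now
        return requester == self.holder and use == self.allowed_use and now < self.expires_at


lic = DatasetLicense(
    dataset_cid="bafy...",                      # placeholder CID
    holder="0xBuyer",
    expires_at=time.time() + 30 * 24 * 3600,    # 30-day term
    allowed_use="model-training-only",
)
assert lic.grants_access("0xBuyer", "model-training-only")
assert not lic.grants_access("0xSomeoneElse", "model-training-only")
```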
This allows new data-focused organizations like DataDAOs to aggregate valuable datasets and monetize access via tokenization. DataDAOs can curate subject-specific data collections and sell access to token holders.
Shared ownership of the data resource aligns incentives between contributors. The DAO benefits collectively from its governance and curation.
On-chain provenance tracking also enables usage auditing. The licensing history is immutable, which helps resolve disputes.
In summary, Filecoin's decentralized infrastructure combined with dataset tokenization enables new participatory data models. Collective data ownership and sharing unlocks an open, trust-minimized data economy.
OCEAN Protocol facilitates the exchange and monetization of tokenized datasets and AI services. It allows data owners to publish datasets while retaining control through encryption and fine-grained usage policies.
Datasets on OCEAN are registered as data NFTs which point to an encrypted data container. Access to the data is governed by smart contracts. To gain access, buyers must purchase temporary access tokens which decrypt the data for the duration of the contract. This access can be scoped to only portions of a dataset if desired.
Once access is purchased, the data remains securely encrypted while in use by buyers. Compute-to-data patterns keep data private even during wrangling, transformation, and training. OCEAN enforces data provenance and auditing throughout the process.
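Schematically, the buyer's code travels to the data and only pre-approved outputs (aggregate metrics, trained weights) ever leave the provider's environment. The sketch below is a plain-Python caricature of that flow; every function name is an assumption for illustration, not part of OCEAN's actual API.

```python
from typing import Callable


def load_encrypted_dataset(uri: str) -> list[dict]:
    """Hypothetical stand-in: decryption happens only inside the provider's environment."""
    return [{"text": "example customer message", "label": 1},
            {"text": "another message", "label": 0}]


def compute_to_data(dataset_uri: str,
                    job: Callable[[list[dict]], dict],
                    approved_outputs: set[str]) -> dict:
    """Run a buyer-supplied job next to the data; release only approved outputs."""
    records = load_encrypted_dataset(dataset_uri)   # raw records never leave the provider
    results = job(records)                          # e.g. train a model or compute statistics
    leaked = set(results) - approved_outputs
    if leaked:
        raise PermissionError(f"job tried to export unapproved outputs: {leaked}")
    return results


def positive_rate(records: list[dict]) -> dict:
    # Aggregate-only job: the buyer learns a statistic, never the records themselves.
    return {"positive_rate": sum(r["label"] for r in records) / len(records)}


print(compute_to_data("did:op:...", positive_rate, approved_outputs={"positive_rate"}))
# -> {'positive_rate': 0.5}
```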
The protocol handles licensing, payments and contract management automatically via smart contracts. This reduces legal hurdles and operational overhead for sharing data.
OCEAN enables owners to monetize datasets on their terms while allowing buyers easy on-demand access. Automated trust and built-in privacy preserve control for providers while benefiting consumers of data. This alignment of incentives facilitates fluid data exchange at scale. I believe this protocol is uniquely suited to build on top of.
Web2 marketplaces like Uber and Airbnb have demonstrated how matching supply and demand can create tremendous value. However, the pressure to deliver shareholder returns often strains these networks over time.
Acquiring users requires subsidizing participation with minimal fees initially. But once network effects kick in, platforms face pressure to increase monetization, which shows up as tactics that extract ever more value from both sides of the market.
These moves may boost revenue short-term but undermine user trust and platform quality long-term. The misalignment between users and shareholders manifests as the platform evolves.
A web3 marketplace based on tokenized participation offers a more sustainable alternative. Shared ownership via a native token provides incentives better aligned with user value.
In a data marketplace like Ocean Protocol, data providers earn tokens for supplying datasets. Consumers pay tokens to access resources. The token thus powers the entire ecosystem.
Shared ownership gives participants a voice in governance. Voting rights allow adjusting policies to benefit the network. There are no outside shareholders to appease. Furthermore, if the quality of certain datasets suffers because of sub-par providers or malicious actors, the token price will fall, further incentivizing the community to self-monitor and provide adequate trust and safety measures.
This motivates users to help the marketplace grow. More data supply and consumption makes the network more valuable, in turn raising the token's value.
Staking tokens on outcomes is one way a decentralized data marketplace could incentivize high quality and reliable service. For instance, data bounty hunters could stake tokens on being able to deliver datasets faster than the requested timeline. Exceeding expectations would lead to gaining the staked tokens as a reputation bonus. Requesters could also stake extra tokens when submitting proposals to boost visibility and indicate seriousness. Higher stakes signal priority jobs that may warrant higher bids from providers. Well-designed staking mechanisms give both sides skin in the game for achieving good outcomes. With proper governance, thoughtful staking protocols can create the right production incentives without requiring centralized oversight.
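Here is a sketch of how such a delivery stake might settle. The numbers and the policy are illustrative assumptions, not a protocol specification.

```python
def settle_delivery_stake(stake: float, promised_days: int, actual_days: int,
                          accepted: bool, bonus_rate: float = 0.1,
                          slash_rate: float = 0.5) -> float:
    """Return the tokens paid back to a bounty hunter who staked on a deadline.

    Illustrative policy: an accepted, on-time delivery returns the stake plus a
    reputation bonus; a late or rejected delivery forfeits part of the stake.
    """
    if accepted and actual_days <= promised_days:
        return stake * (1 + bonus_rate)     # stake back, plus a bonus
    return stake * (1 - slash_rate)         # partial slash for missing the commitment


print(settle_delivery_stake(stake=100, promised_days=30, actual_days=25, accepted=True))   # on time: stake plus bonus
print(settle_delivery_stake(stake=100, promised_days=30, actual_days=45, accepted=True))   # late: stake partially slashed
```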
A great model of how tokens can be used productively is Braintrust, and Packy McCormick wrote a thorough analysis of the potential of web3-based marketplaces.
For a data marketplace to maintain credibility, rigorous checks are needed to validate providers. This prevents illegal or unethical data practices that could undermine trust. While staking mechanisms can be powerful in incentivizing good behavior, strong KYC and reputation-based mechanisms (potentially leveraging on-chain data) must be in place.
Know Your Customer (KYC) Requirements
Reputable marketplaces mandate Know Your Customer procedures before approving new data suppliers. This due diligence verifies who providers are and whether they have legitimate rights to the data they intend to sell.
Thorough KYC protects consumers by filtering out bad actors before they can join the marketplace.
Ongoing Monitoring
Vetting cannot stop at onboarding. Active monitoring helps maintain marketplace integrity over time as new providers join.
Well-designed oversight reinforces the marketplace's reputation as a trusted destination for valuable data exchange. By upholding standards, consumers and ethical providers mutually benefit.
The Value of Reputation
In data exchange especially, past behavior predicts future trustworthiness. Reputation systems that document provider conduct over time help guide consumer decisions.
Profile histories, ratings, and reviews give buyers transparent insights into each vendor's track record. Ethical actors are rewarded with positive reputations that drive business. Bad behavior leads to exclusion.
Much like identity verification establishes baseline trust, robust reputation systems maintain confidence in marketplace transactions. This virtuous cycle cements ecosystem credibility and stimulates growth.
Without shareholder pressure, web3 platforms can focus on benefiting users over the long term. This stems from ownership and governance resting with the participants themselves.
For data providers and consumers, this results in lower fees, better rewards, and higher quality over time. Their shared success powers network growth.
By aligning incentives around data exchange rather than extraction, a decentralized data marketplace can sustain long-term value for participants. A web3 platform built this way is optimized for openness, quality, and sustainability, and the collective success of participants drives innovation and growth.
One risk as builders in the web3 ecosystem is getting over-enamored with the underlying technology. It's easy to assume that novel capabilities like tokenization and decentralization are inherent value propositions. But for most users, these complexities are secondary to getting things done efficiently.
As we explore models like data marketplaces, web3 rails offer genuine utility: verifiable provenance, automated licensing and payments, and user-controlled ownership of data.
However, elevating these intricacies over user experience is a recipe for failure. We cannot forget that solving real problems smoothly is the only sustainable value proposition.
For data providers, the priority is monetizing their resources, not wrestling with cryptographic credentials. For ML teams, it's about rapid access to tailored data, not understanding blockchain governance. As builders, we must abstract away unnecessary complexity into seamless workflows.
To drive mainstream adoption, web3 marketplaces need to feel as simple and intuitive as web2 services, while unlocking new economic possibilities. Users will flock to services that help them accomplish jobs efficiently, not marvel at novel tech.
For example, an ML engineer needs to quickly browse available datasets, purchase the rights tokens, and integrate the data into their training pipeline. Providers want tooling to securely package datasets into revenue-generating assets without headaches.
Delivering these user journeys smoothly is paramount. Novel incentive schemes and governance models enable that, but they should facilitate usage, not drive it.
If web3 marketplaces focus too much on evangelizing protocol intricacies, they risk confusing users. We must remember that crypto tooling exists to serve users, not dazzle them.
The measure of success is whether providers can seamlessly monetize data and buyers can conveniently access bespoke datasets. Not whether users are wowed by staking mechanisms and quadratic voting.
By abstracting away unnecessary complexity into clean workflows, we allow decentralized technology to elevate marketplaces invisibly - growing the pie for all participants without burdening them.
The potential of web3 data ecosystems is vast, but realizing that potential requires putting user needs first. Crypto-powered marketplaces should feel like the most usable and empowering services, not tech demos. User-centric design is still paramount even as capabilities expand exponentially. If we lead with intuitive experiences that solve real problems, mainstream adoption will follow.
While decentralized protocols and incentives models enable new data marketplace paradigms, we should also leverage AI to directly benefit users. The goal is smoothing friction points through automation and augmentation.
For requesters, AI tools could help scope out data needs and generate realistic synthetic samples to kickstart projects. This gives requesters a tangible prompt when formulating proposals.
Natural language interfaces allow requesters to describe needs conversationally versus having to formalize rigid specifications. AI then helps craft well-formed proposals to maximize responses.
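A minimal sketch of that flow, where `complete` stands in for whatever LLM API the platform integrates (the prompt and the field names are assumptions that mirror the RFP sketch from earlier):

```python
import json

RFP_PROMPT = """Turn the following plain-language data request into a JSON proposal
with the keys: title, description, data_format, volume, quality_criteria,
budget_tokens, deadline_days. Respond with JSON only.

Request: {request}"""


def draft_proposal(request: str, complete) -> dict:
    """Use an LLM to turn a conversational description into a structured RFP draft.

    `complete` is any function mapping a prompt string to the model's text
    response (a hosted API, a local model, etc.); it is assumed, not specified.
    """
    raw = complete(RFP_PROMPT.format(request=request))
    return json.loads(raw)   # a real system would validate against a schema and ask follow-ups
```

The requester can then review and edit the draft before publishing it to the marketplace.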
For data suppliers, AI-powered workflows can simplify the capture, cleaning, and packaging of datasets to spec.
With better tooling, more users can participate in supplying valuable data. Lowering barriers to entry expands data diversity.
AI should also facilitate matching requests to providers through semantic search and recommendations. This saves time browsing proposals.
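For instance, requests and provider profiles could both be embedded and ranked by cosine similarity. The sketch below uses the open-source sentence-transformers library; the model choice and the example profiles are arbitrary assumptions.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # small open-source embedding model

providers = [
    "Team of linguists collecting accented English speech across Latin America",
    "Drone operators capturing aerial imagery of agricultural land",
    "Medical annotators specializing in radiology report labeling",
]
request = "Need 100+ hours of Spanish-accented English audio with transcripts"

# Embed the request and the provider profiles; with normalized embeddings,
# the dot product is the cosine similarity.
vectors = model.encode([request] + providers, normalize_embeddings=True)
scores = vectors[1:] @ vectors[0]

for provider, score in sorted(zip(providers, scores), key=lambda pair: -pair[1]):
    print(f"{score:.2f}  {provider}")   # the accented-speech team should rank first
```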
By automatically handling rote tasks and augmenting decision-making, AI allows users to focus on high-value activities. The marketplace experience centers on data exchange goals, not operational headaches.
In this article, we explored the need for more robust access to specialized training data to advance AI development. While large generic datasets drove initial progress, bespoke resources tailored to each use case are crucial for real-world systems.
However, collecting customized data in-house can be prohibitively expensive and time-consuming. A decentralized data marketplace offers a more efficient model by connecting data seekers and providers.
Well-designed data bounties allow crowdsourcing niche datasets like accented speech recordings. Built on web3 protocols, the market facilitates licensing and payments while preserving privacy. Participants are incentivized through shared ownership and governance.
Integration of AI can further optimize workflows like proposal matching, data packaging, and search. The end goal is an intuitive platform unlocking abundant bespoke data to train more capable ML models.
Though still early, data marketplaces present a promising path to democratize access to the diverse, representative resources needed for ethical, robust AI development. Aligned incentives and user-centric design can drive sustainable growth. By empowering participants to exchange value openly, data can fuel innovation far into the future.
What do you think? Should we build this? If you're interested in building it, reach out and let's talk.