Generative AI Models Are Sucking Up Data from All Over the Internet, Yours Included
In the rush to build and train ever-larger AI models, developers have swept up much of the searchable Internet, quite possibly including some of your own public data and potentially some of your private data as well.
How do AI companies gather data?
AI companies typically use automated programs known as web crawlers and web scrapers to gather data. Web crawlers navigate the Internet, following links and cataloging the URLs they find, while web scrapers download the content at those cataloged URLs. For example, OpenAI has used data from Common Crawl, a nonprofit that maintains a public archive of crawled web pages, as training data for its models.
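To make the crawler/scraper distinction concrete, here is a minimal sketch in Python. It is illustrative only and does not represent any company's actual pipeline. The names LinkExtractor, crawl, and FAKE_WEB are hypothetical, and the "web" is a small in-memory dictionary standing in for real network requests, so the example runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl starting from start_url.

    `fetch` is a hypothetical callable that returns the HTML for a URL,
    or None if the page is unavailable. Returns a dict of URL -> HTML.
    """
    seen = {start_url}          # crawler: catalog of discovered URLs
    queue = deque([start_url])
    scraped = {}                # scraper: downloaded page content
    while queue and len(scraped) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        scraped[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return scraped


# Hypothetical stand-in for the real web, so no network access is needed.
FAKE_WEB = {
    "https://example.com/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "https://example.com/about": '<a href="/">Home</a>',
    "https://example.com/blog": "<p>No links here.</p>",
}

pages = crawl("https://example.com/", FAKE_WEB.get)
print(list(pages))
```

In this sketch the crawler's job is the `seen` set and the queue of links to visit, while the scraper's job is storing each page's content in `scraped`. Real systems add many things this sketch omits, such as rate limiting and honoring site access rules.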
Is my private data safe from AI models?
While the data used to train generative AI models is primarily publicly accessible, privacy concerns remain. For instance, Meta has acknowledged using public posts from platforms like Facebook and Instagram to train its AI. Although locked-down accounts are generally not included, private information can still end up in training datasets inadvertently because of lax privacy settings or digital leaks.
What are the implications of biased data in AI?
Bias in the data used to train AI models can lead to skewed outputs that reflect harmful stereotypes. For example, AI image generators may produce more sexualized depictions of women than of men. This bias arises because the Internet itself overrepresents certain perspectives, often those of wealthier, Western demographics, which can result in AI models that do not accurately represent the broader population.

Published by Divergent IT
Divergent IT is a tech service, operational consulting, and strategy firm. Divergent IT partners with CIOs, business owners, and nonprofits to develop strategy and implementation across their business, including cybersecurity, remote maintenance management (RMM), IT strategy, on-site maintenance, and more.