ChatGPT is full of sensitive private information and spits out verbatim text from CNN, Goodreads, WordPress blogs, fandom wikis, Terms of Service agreements, Stack Overflow source code, Wikipedia pages, news blogs, random internet comments, and much more.

  • @unipadfox@pawb.social
    46 · 1 year ago

    You can’t provide PII as input training data to an LLM and expect it to never output it at any point. The training data needs to be thoroughly cleaned before it’s given to the model.
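
To give a rough idea of what "cleaning" could look like at its most basic, here is a minimal sketch that redacts email addresses and phone-like numbers before text goes into a training corpus. The patterns, placeholder tokens, and the `scrub_pii` name are assumptions made up for illustration, not anything a particular vendor is known to use. Regexes alone will miss names, addresses, and plenty of other PII, so treat this as a starting point only.

```python
import re

# Hypothetical patterns for illustration only; real PII detection needs
# far more than two regular expressions.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


if __name__ == "__main__":
    sample = "Contact jane@example.com or +1 (555) 123-4567."
    print(scrub_pii(sample))
```

A filter like this would have to run over every document before training, and anything it fails to catch can still end up memorized by the model.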