Dropsitenews published a list of websites Facebook uses to train its AI. Multiple Lemmy instances are on the list, as noticed by user BlueAEther.

Hexbear is on there too. Also Facebook is very interested in people uploading their massive dongs to lemmynsfw.

Full article here.

Link to the full leaked list download: Meta leaked list pdf

  • Canaconda@lemmy.ca
    20 days ago

    Does this mean that some of the more unhinged users might actually be chatbots? Or are they just scraping our comments, Reddit-style?

    • mesa@piefed.social
      20 days ago

      Scraping by the look of it.

      Also, if you have ever spun up a Lemmy or PieFed instance, you will quickly see these bots pop up. They don’t respect robots.txt AT ALL. I estimate 95% of the traffic I get on my tiny little server is AI crawlers.

      A good way to hurt them is to either use Cloudflare’s service or create a page that has a link…to another page that gets generated…to another page. And each time, it slows down. No human would ever click the link, but bots ALWAYS do. It’s so funny to see how many are out there in the quagmire of links on my little Python script.
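
The link trap described above could look roughly like the sketch below. This is not the commenter's actual script: the `/trap/` prefix, the delay values, and every function name are assumptions made for illustration, and it only uses Python's standard `http.server` module.

```python
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

TRAP_PREFIX = "/trap/"   # hypothetical URL prefix for the generated pages
BASE_DELAY = 0.5         # assumed: seconds of delay added per hop
MAX_DELAY = 30.0         # assumed cap so one request never hangs forever


def trap_depth(path):
    """Return the hop number encoded in a trap URL, or None if it is not one."""
    if not path.startswith(TRAP_PREFIX):
        return None
    tail = path[len(TRAP_PREFIX):].strip("/")
    return int(tail) if tail.isascii() and tail.isdigit() else None


def delay_for(depth):
    """The delay grows linearly with depth, up to MAX_DELAY."""
    return min(BASE_DELAY * depth, MAX_DELAY)


def trap_page(depth):
    """HTML for one trap page: a single link to the next, deeper page."""
    next_url = f"{TRAP_PREFIX}{depth + 1}"
    return f'<html><body><a href="{next_url}">more</a></body></html>'


class TrapHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        depth = trap_depth(self.path)
        if depth is None:
            self.send_error(404)
            return
        time.sleep(delay_for(depth))  # each hop answers a little slower
        body = trap_page(depth).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host="127.0.0.1", port=8080):
    """Run the trap; a threaded server keeps slow hops from blocking others."""
    ThreadingHTTPServer((host, port), TrapHandler).serve_forever()
```

Calling `serve()` starts it locally. In practice the entry link would be hidden from human visitors and the prefix listed under `Disallow` in robots.txt, so only crawlers that ignore the file end up inside.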

      • tpyo@lemmy.world
        20 days ago

        Does it generate any form of visuals? Like could you post a screenshot of something that shows how far a bot has traveled? I’ve heard about these traps but I’m curious about what you’re describing looks like

        • mesa@piefed.social
          20 days ago

          I just have an id. 1/2… An href id, if that makes sense.

          So it’s the logs that show the number of iterations. Thousands on a couple of IPs. Script kiddies.

          Honestly I didn’t think the black hole would work that well. But it reduces the actual traffic by a huge factor.
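
Counting iterations from the logs, as described above, might be sketched like this. The log line shape (client IP first, then a quoted `GET /trap/<n>` request) is an assumption modelled on common access-log formats, not the commenter's real log layout, and `deepest_hop_per_ip` is a made-up helper name.

```python
import re
from collections import defaultdict

# Assumed log shape: '<ip> ... "GET /trap/<n> HTTP/1.1" ...'
TRAP_HIT = re.compile(r'^(\S+) .*"GET /trap/(\d+)[ ?]')


def deepest_hop_per_ip(log_lines):
    """Map each client IP to the deepest trap page it requested."""
    deepest = defaultdict(int)
    for line in log_lines:
        match = TRAP_HIT.match(line)
        if match:
            ip, depth = match.group(1), int(match.group(2))
            deepest[ip] = max(deepest[ip], depth)
    return dict(deepest)


# Made-up sample lines using documentation-reserved IP ranges.
sample = [
    '203.0.113.7 - - [01/Jan/2025:00:00:01 +0000] "GET /trap/1 HTTP/1.1" 200 64',
    '203.0.113.7 - - [01/Jan/2025:00:00:03 +0000] "GET /trap/2 HTTP/1.1" 200 64',
    '198.51.100.9 - - [01/Jan/2025:00:00:04 +0000] "GET /post/5 HTTP/1.1" 200 900',
]
print(deepest_hop_per_ip(sample))  # {'203.0.113.7': 2}
```

An IP whose deepest hop runs into the thousands is almost certainly automated, since no visible link leads a human into the trap in the first place.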

    • davidgro@lemmy.world
      20 days ago

      I assume scraping at this point. There’s likely a few hobby ones now, but if Lemmy becomes popular then there will be lots of bots for sure.

    • zeca@lemmy.ml
      20 days ago

      I guess they mostly scrape it. To waste resources posting here, they would have to find a way to make money doing so. They put bots posting on Facebook because they think it increases user engagement. They don’t want to increase engagement on Lemmy (not that it would work…).

  • flamingos-cant (hopepunk arc)@feddit.uk
    20 days ago

    There’s like half a dozen feddits and somehow feddit.uk is the only one to make it onto this?

    Here’s a list of feddit.uk’s linked instances that appear in the list:

    List of instances
    beehaw.org
    furry.engineer
    ibe.social
    fediworld.de
    framatube.org
    trailers.ddigest.com
    nrw.social
    lemmynsfw.com
    video.hardlimit.com
    digitalcourage.social
    xn--baw-joa.social
    tube.kockatoo.org
    equestria.social
    wisskomm.social
    social.anoxinon.de
    freiburg.social
    toobnix.org
    toot.bike
    mstdn.lalafell.org
    peertube.linuxrocks.online
    social.rebellion.global
    mastodon.cipherbliss.com
    social.sdf.org
    corteximplant.com
    typo.social
    www.404media.co
    mastodon.ml
    video.liberta.vip
    tilvids.com
    todon.eu
    hessen.social
    digipres.club
    shigusegubu.club
    mastodon.me.uk
    zdf.social
    mastodon.sdf.org
    spore.social
    kolektiva.media
    gruene.social
    share.tube
    nso.group
    mastouille.fr
    masto.es
    vivaldi.com
    literatur.social
    mstdn.mx
    kirche.social
    mastodon.hams.social
    federation.network
    lile.cl
    todon.nl
    betweenthelions.link
    ipv6.social
    linuxrocks.online
    peertube.otakufarms.com
    pawb.social
    mastodon-belgium.be
    jasette.facil.services
    machteburch.social
    mastodont.cat
    mastodon.eus
    eupolicy.social
    social.bau-ha.us
    toot.berlin
    amicale.net
    hexbear.net
    mastodon.bida.im
    reddthat.com
    shelter.moe
    mastodon.nl
    dju.social
    bonn.social
    mstdn.chrisalemany.ca
    social.sciences.re
    tldr.nettime.org
    lemy.lol
    climatejustice.social
    rollenspiel.social
    mastodon.org.uk
    social.kyiv.dcomm.net.ua
    pouet.chapril.org
    ecoevo.social
    social.politicaconciencia.org
    darmstadt.social
    peertube.tv
    lemmus.org
    libretooth.gr
    hackers.town
    tooter.social
    anarchism.space
    diode.zone
    video.infosec.exchange
    mastodon.thirring.org
    aussie.zone
    social.bund.de
    apobangpo.space
    shitpost.cloud
    berlin.social
    toot.aquilenet.fr
    social.beachcom.org
    lemmygrad.ml
    mastodon.radio
    nerdculture.de
    programming.dev
    decayable.ink
    kafeneio.social
    functional.cafe
    things.uk
    fuzzies.wtf
    diaspodon.fr
    dalek.zone
    sunbeam.city
    tooting.ch
    fediscience.org
    mastodon.tetaneutral.net
    social.librem.one
    im-in.space
    lemmy.sdf.org
    legal.social
    post.lurk.org
    mastodon.uy
    noc.social
    tube.pol.social
    lemmy.ml
    don.linxx.net
    infosec.pub
    kolektiva.social
    masto.bike
    furries.club
    zhub.link
    lemmy.world
    openbiblio.social
    mastodon.zaclys.com
    mamot.fr
    clacks.link
    discuss.tchncs.de
    cyberplace.social
    graz.social
    pl.kitsunemimi.club
    mastodonczech.cz
    masto.nobigtech.es
    hostux.social
    pawb.fun
    mastodon.trueten.de
    norden.social
    systemli.social
    mander.xyz
    ciberlandia.pt
    woem.men
    sopuli.xyz
    lemmy.ca
    
    • poVoq@slrpnk.net
      20 days ago

      Given that we used to see lots of Meta scraping a while back on our instance and had to implement Anubis as a result, it is interesting to see that slrpnk.net doesn’t seem to be on this list (anymore).

    • usernamesAreTricky@lemmy.ml
      20 days ago

      The linked article in the post body suggests that likely wouldn’t have made a difference anyway:

      The scrapers ignored common web protocols that site owners use to block automated scraping, including “robots.txt” which is a text file placed on websites aimed at preventing the indexing of context

      • mesa@piefed.social
        20 days ago

        Yeah, I’ve seen the argument in blog posts that since they are not search engines, they don’t need to respect robots.txt. It’s really stupid.
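
For context, honoring robots.txt is cheap for any crawler that chooses to. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the `ExampleBot` user agent, and the `allowed` helper are invented for illustration and do not describe any real crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking every crawler to stay out of /trap/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed("ExampleBot", "https://example.org/post/1"))  # True
print(allowed("ExampleBot", "https://example.org/trap/1"))  # False
```

The file is purely advisory: nothing in it stops a client that never performs this check, which is why the thread falls back on traps and challenge pages.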

    • Pamasich@kbin.earth
      20 days ago

      If they have a brain, and they do have the experience from Threads, they don’t need to scrape Lemmy. They can just set up a shell instance, subscribe to Lemmy communities, and then use federation to get their data for free. That doesn’t use robots.txt at all regardless.

  • socsa@piefed.social
    20 days ago

    Definitely called this. Can we have private voting now? These people are scraping the fediverse and the current state of things is a privacy nightmare.

    • Deceptichum@quokk.au
      20 days ago

      You cannot have private voting. The Fediverse is open; that information has to be shared for it to work, unless you want to make it more open to vote manipulation.

      Even the PieFed implementation wasn’t great, basically giving every user a second account that sends the vote instead.

      • socsa@piefed.social
        20 days ago

        Vote manipulation only matters if votes matter. Just make downvotes placebo or get rid of them entirely. There are other engagement metrics to use for sorting. Just make votes a small portion of a bigger algorithm and it dilutes the problem away. On the other hand, it seems like a ton of people on here outright refuse to consider that this is a problem, and are instead choosing to live with their heads in the sand.

        Either way, right now public voting does nothing to stop vote manipulation, it just gives the sockpuppet and astroturfing accounts great feedback to target certain demographics.

        The piefed implementation was a great compromise imo, and the only reason it was abandoned was idiotic forum politics. It did exactly what it set out to do - provide a layer of protection against large scale data mining and long term storage, and added a significant barrier to vote stalking, while still leaving mechanisms to ban voting agents.

        • Deceptichum@quokk.au
          19 days ago

          I don’t want engagement metrics, I want the collective opinion of users.

          People may engage more with content they dislike; that doesn’t mean they want it to be on the front page.

          The sooner people stop expecting privacy from an open, publicly broadcasting platform, the better.

          • socsa@piefed.social
            19 days ago

            So your argument is that meaningless internet points are more important than user privacy? I just want to make sure we have that on record.

            The quickest path to enshittification of the fediverse is precisely this kind of large-scale scraping and data mining. There are extremely simple ways to avoid this, but the collective admin cohort has decided they prefer this tiny bit of internet power over innovation, because innovation is a tiny bit more difficult.

            • Deceptichum@quokk.au
              19 days ago

              There is no user privacy on an open system, just as there is no privacy when you walk down the street. If you want privacy, go into your house and talk (use Signal or any other privacy app).

              Likewise, people’s opinions are not meaningless.

              The enshittification of the fediverse will come from corporate or similarly aligned instances that play it safe for their brand. The scraping is irrelevant. Enshittification is a social issue, not a technical one.

                • Deceptichum@quokk.au
                  19 days ago

                  By intent there is none, and it should remain that way. This works on public openness; everything needs to be visible, not hidden further away, out of our reach, on our own platform.

  • scintilla@crust.piefed.social
    20 days ago

    Can someone explain why they would need to scrape multiple instances? Are they intentionally going after the fediverse, or is it just a byproduct of Meta trying to get all of human communication?