I saw this post and I was curious what was out there.

https://neuromatch.social/@jonny/113444325077647843

I'd like to put my lab servers to work archiving US federal data that's likely to get pulled - climate and biomed data seem most likely. The most obvious strategy to me seems like setting up mirror torrents on academictorrents. Anyone compiling a list of at-risk data yet?

  • Otter@lemmy.ca (OP)
    2 months ago

    One option that I’ve heard of in the past:

    https://archivebox.io/

    ArchiveBox is a powerful, self-hosted internet archiving solution to collect, save, and view websites offline.

    • tomtomtom@lemmy.world
      2 months ago

      I am using ArchiveBox; it is pretty straightforward to self-host and use.

      However, it is very difficult to archive most news sites with it, and many other sites as well. Cookie consent pop-ups and similar overlays on a site will usually render the archived page unusable, and often archiving won’t work at all because some bot protection (Cloudflare etc.) kicks in when ArchiveBox tries to access the site.

      If anyone else has more success using it, please let me know if I am doing something wrong…

      • Daniel Quinn@lemmy.ca
        2 months ago

        Monolith has the same problem here. I think the best resolution might be some sort of browser-plugin-based solution where you could say “archive this” and have it push the result somewhere.

        I wonder if I could combine a dumb plugin with Monolith to do that… A weekend project perhaps.
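
A rough sketch of the "push the result somewhere" half of that idea, in Python: a tiny local receiver that a browser plugin could POST the already-rendered page to. Because the browser has already dealt with cookie banners and bot checks, the receiver only has to save what it is given. Everything here is an assumption made for illustration, not an existing plugin or Monolith feature: the `X-Page-Url` header, the `slug_for`, `make_handler`, and `serve` names, and the choice to write one `.html` file per page are all hypothetical.

```python
import re
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path
from urllib.parse import urlparse


def slug_for(url: str) -> str:
    """Turn a page URL into a filesystem-safe name (query string is dropped)."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}".strip("/")
    return re.sub(r"[^A-Za-z0-9.-]+", "_", raw) or "page"


def make_handler(out_dir: Path):
    """Build a request handler class that saves POSTed pages into out_dir."""

    class ArchiveHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Hypothetical contract: the plugin sends the page URL in the
            # X-Page-Url header and the rendered HTML as the request body.
            page_url = self.headers.get("X-Page-Url", "")
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length)
            target = out_dir / f"{slug_for(page_url)}.html"
            target.write_bytes(body)
            self.send_response(201)
            self.end_headers()

        def log_message(self, *args):
            # Keep the sketch quiet; a real tool would log each save.
            pass

    return ArchiveHandler


def serve(out_dir: Path, port: int = 0) -> HTTPServer:
    """Start the receiver on localhost in a background thread (port 0 = any free port)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    server = HTTPServer(("127.0.0.1", port), make_handler(out_dir))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `serve(Path("archive"), port=8765)` would start it; the plugin side would then only need to POST `document.documentElement.outerHTML` to that port. Pages saved this way still reference remote images and stylesheets, so a later pass with Monolith (or similar) to inline those assets would be a separate step that this sketch does not attempt.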