In light of this news, we need a browser that looks like a search engine crawler.

This would neutralise the problem of websites giving preferential treatment to crawlers and lousy treatment to everyone else.

My question is: assuming a browser could mimic all of a crawler’s headers, would that be sufficient? Or do paywalls also take the IP address into account? And if so, would it work to subscribe to Google Cloud just to get an IP address in Google’s ranges and crawl from there?
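For the header part alone, a minimal sketch might look like the following (Python; the URL is a placeholder and the User-Agent string is Googlebot’s documented “compatible” token):

```python
import urllib.request

# Googlebot's documented "compatible" User-Agent token.
# "https://example.com/" is a placeholder, not a real target.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

req = urllib.request.Request(
    "https://example.com/",
    headers={"User-Agent": GOOGLEBOT_UA},
)
# urllib.request.urlopen(req) would then fetch the page under that identity.
print(req.get_header("User-agent"))
```

Whether this is enough is exactly the question: a site that only sniffs the User-Agent will be fooled; one that verifies the source IP will not.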

    • freedomPusher@sopuli.xyzOP · 11 days ago

      Indeed, but is that sufficient? Your browser fingerprint goes far beyond the user agent; it includes attributes like screen dimensions.

      • B-TR3E@feddit.org · 11 days ago

        If you turn off JavaScript, there’s no browser fingerprinting. Crawlers don’t interpret JS either.

        • freedomPusher@sopuli.xyzOP · 11 days ago

          The Tor Browser devs went to great lengths merely to reduce fingerprinting. If you run TB, drag the corner of the window: you will see the geometry change in discrete steps. Whatever dimensions you choose, you share a fingerprint only with others on that same geometry step, and you remain distinguishable from TB users who chose different steps.

          User Agent is another example of a fingerprinting attribute that’s orthogonal to JS.

          I just went to an arbitrary webpage and saw that my browser sends these headers:

          accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
          accept-encoding: gzip, deflate, br
          accept-language: en-US,en;q=0.9
          dnt: 1
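Even with JS off, that header set is itself a passive fingerprint: a server can canonicalise and hash it to bucket visitors. An illustrative sketch (not any particular site’s code):

```python
import hashlib

# The header set a browser sends, even without JS, can be hashed into a
# stable passive fingerprint; identical header sets land in the same bucket.
headers = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
              "image/avif,image/webp,image/apng,*/*;q=0.8,"
              "application/signed-exchange;v=b3;q=0.7",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-US,en;q=0.9",
    "dnt": "1",
}
# Sort keys so the hash is order-independent.
canonical = "\n".join(f"{k}: {v}" for k, v in sorted(headers.items()))
fingerprint = hashlib.sha256(canonical.encode()).hexdigest()
print(fingerprint[:16])
```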
          

          The IP address would also be part of the fingerprint. A very sophisticated server could even look at factors like SYN-ACK response timing to fingerprint the TCP/IP stack of the software involved, though that would be unlikely in this context.
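On the Google Cloud idea from the top post: the standard defence is the double reverse-DNS check that Google documents for verifying Googlebot, and a Google Cloud VM would typically fail it, since its IP reverse-resolves to a googleusercontent.com name rather than googlebot.com. A sketch of the server-side check (the function name is mine):

```python
import socket

def is_verified_googlebot(ip: str) -> bool:
    """Double reverse-DNS check: reverse-resolve the client IP, require a
    googlebot.com/google.com hostname, then forward-resolve that hostname
    and require it to map back to the same IP."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False  # no reverse record at all
    if not host.endswith((".googlebot.com", ".google.com")):
        return False  # reverse name isn't Google's crawler domain
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False

print(is_verified_googlebot("192.0.2.1"))  # TEST-NET address: False
```

Spoofed headers pass this test only if the reverse and forward DNS records, which the spoofer does not control, also line up.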

          (update) Is it even fair to say crawlers do not run JS? There are websites that withhold information unless JS is executed; Lemmy was one such instance in its early days.