In light of this news, we need a browser that looks like a search engine crawler.

This would neutralise the problem of websites giving preferential treatment to crawlers and lousy treatment to everyone else.

My question is: assuming a browser could mimic all of a crawler’s headers, would that be sufficient? Or do paywalls also take the IP address into account? And if so, would it work to subscribe to Google Cloud just to get an IP address in Google’s ranges and crawl from there?
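For the header part alone, a minimal sketch might look like the following (Python; the URL is a placeholder and the User-Agent string is Googlebot’s documented “compatible” token):

```python
import urllib.request

# Googlebot's documented "compatible" User-Agent token.
# "https://example.com/" is a placeholder, not a real target.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

req = urllib.request.Request(
    "https://example.com/",
    headers={"User-Agent": GOOGLEBOT_UA},
)
# urllib.request.urlopen(req) would then fetch the page under that identity.
print(req.get_header("User-agent"))
```

Whether this is enough is exactly the question: a site that only sniffs the User-Agent will be fooled; one that verifies the source IP will not.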

    • freedomPusher@sopuli.xyzOP · 11 days ago

      Indeed, but is that sufficient? Your browser fingerprint goes far beyond the user agent; it includes attributes like screen dimensions.

      • B-TR3E@feddit.org · 11 days ago

        If you turn off JavaScript, there’s no browser fingerprinting. Crawlers don’t interpret JS either.

        • freedomPusher@sopuli.xyzOP · 11 days ago

          The Tor Browser devs went to great lengths merely to reduce fingerprinting. If you run TB, drag the corner of the window: you will see the geometry change in discrete steps. Whatever dimensions you choose, you share a fingerprint only with others on that same geometry step, and you remain distinguishable from TB users who chose different steps.

          User Agent is another example of a fingerprinting attribute that’s orthogonal to JS.

          I just went to an arbitrary webpage and saw that my browser sends these headers:

          accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
          accept-encoding: gzip, deflate, br
          accept-language: en-US,en;q=0.9
          dnt: 1
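Even with JS off, that header set is itself a passive fingerprint: a server can canonicalise and hash it to bucket visitors. An illustrative sketch (not any particular site’s code):

```python
import hashlib

# The header set a browser sends, even without JS, can be hashed into a
# stable passive fingerprint; identical header sets land in the same bucket.
headers = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
              "image/avif,image/webp,image/apng,*/*;q=0.8,"
              "application/signed-exchange;v=b3;q=0.7",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-US,en;q=0.9",
    "dnt": "1",
}
# Sort keys so the hash is order-independent.
canonical = "\n".join(f"{k}: {v}" for k, v in sorted(headers.items()))
fingerprint = hashlib.sha256(canonical.encode()).hexdigest()
print(fingerprint[:16])
```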
          

          The IP address would also be part of the fingerprint. A very sophisticated server could even look at factors like SYN-ACK response timing to fingerprint the TCP/IP stack of the software involved, though that would be unlikely in this context.
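On the Google Cloud idea from the top post: the standard defence is the double reverse-DNS check that Google documents for verifying Googlebot, and a Google Cloud VM would typically fail it, since its IP reverse-resolves to a googleusercontent.com name rather than googlebot.com. A sketch of the server-side check (the function name is mine):

```python
import socket

def is_verified_googlebot(ip: str) -> bool:
    """Double reverse-DNS check: reverse-resolve the client IP, require a
    googlebot.com/google.com hostname, then forward-resolve that hostname
    and require it to map back to the same IP."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False  # no reverse record at all
    if not host.endswith((".googlebot.com", ".google.com")):
        return False  # reverse name isn't Google's crawler domain
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False

print(is_verified_googlebot("192.0.2.1"))  # TEST-NET address: False
```

Spoofed headers pass this test only if the reverse and forward DNS records, which the spoofer does not control, also line up.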

          (update) Is it even fair to say crawlers do not run JS? There are websites that withhold information unless JS is executed; Lemmy was one such instance in its early days.