I need to scan very large JSONL files efficiently and am considering a parallel grep-style approach over line-delimited text.

Would love to hear how you would design it.
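To make the question concrete, here is a minimal sketch of the kind of design I have in mind: split one large file into byte ranges, re-align each range to a line boundary, and scan the ranges in parallel. All names here are made up for illustration, and the thread pool is just a placeholder (for CPU-heavy patterns a process pool would be the obvious swap):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def scan_chunk(path: str, start: int, end: int, needle: bytes) -> int:
    """Count lines containing `needle` whose first byte lies in [start, end)."""
    hits = 0
    with open(path, "rb") as f:
        if start > 0:
            # Re-align to a line boundary: if the byte before `start` is not a
            # newline, we are mid-line, and the previous chunk owns that line.
            f.seek(start - 1)
            if f.read(1) != b"\n":
                f.readline()
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            if needle in line:
                hits += 1
    return hits

def parallel_grep_count(path: str, needle: bytes, workers: int = 4) -> int:
    """Split the file into `workers` byte ranges and sum per-range hit counts."""
    size = os.path.getsize(path)
    step = max(1, size // workers)
    bounds = [(i * step, size if i == workers - 1 else (i + 1) * step)
              for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda b: scan_chunk(path, b[0], b[1], needle), bounds))
```

The boundary rule is the part I most want feedback on: each chunk owns exactly the lines that start inside its byte range, so a line straddling a boundary is counted once, by the chunk where it begins.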

  • Eager Eagle@lemmy.world · 7 hours ago
    1. How many grep-like ops per file?
    2. Is it interactive or run by another process?
    3. Do you know which files ahead of time?
    4. Do you have any control over that file creation?
    5. Is the JSONL append-only? Is the grep running while the file is modified?
    6. How large is very large? 100s of MB? A few GB? 100s of GB? Whether or not it fits in memory could change the approach.
    7. You're using "files", plural; would parallelizing at the file level (e.g., one thread per file) be enough?
    8. How many files and how often is that executed?
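If the answer to question 7 is yes, the whole design collapses to something much simpler: one worker per file, with a plain serial scan inside each worker. A hedged sketch, assuming the files are independent (the helper names are invented, and the thread pool again suits I/O-bound scans; a process pool would replace it for heavy per-line work):

```python
from concurrent.futures import ThreadPoolExecutor

def count_matches(path: str, needle: bytes) -> int:
    """Serial scan of one JSONL file: count lines containing `needle`."""
    hits = 0
    with open(path, "rb") as f:
        for line in f:
            if needle in line:
                hits += 1
    return hits

def grep_files(paths: list[str], needle: bytes) -> dict[str, int]:
    """File-level parallelism: one worker per file, results keyed by path."""
    with ThreadPoolExecutor(max_workers=max(1, len(paths))) as pool:
        counts = pool.map(lambda p: count_matches(p, needle), paths)
        return dict(zip(paths, counts))
```

No boundary handling is needed here at all, which is why the answers to questions 7 and 8 matter so much before reaching for anything fancier.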