(bash?) matching Array of Arrays against a simple k:v json at ~200k lines?

𞋴𝛂𝛋𝛆@lemmy.world · edit-2 2 months ago

PumaStoleMyBluff@lemmy.world · 2 months ago

readable script that I will understand at a glance a year later

Well that rules out bash

lime!@feddit.nu · 2 months ago

this sounds like a python kind of problem. i’m all for abusing bash features but there comes a point when there’s just too many brackets.

𞋴𝛂𝛋𝛆@lemmy.world · 2 months ago

Yeah, I probably should use Python, it just takes me longer and I am much less likely to keep hacking around with it later. It feels like a foreign language relative to bash. The problem came from grepping with for loops, so I’m in that mindset. Point taken though.

INeedMana@piefed.zip · 2 months ago

I think this post could use some small example with what you have vs what you want to get

Depending on how I read your description, it’s either doable with bash and grep, or probably doable but a lot of hassle compared to using something other than bash.

𞋴𝛂𝛋𝛆@lemmy.world · 2 months ago

jake
Jake
j4ke
Jak3
j@k3
JAK€
jπ⸦kE
𝚥ᎪᏦ⋲
ꓙᏎ🅺Ꮛ
𞋕ꮜ𝈲𝈁
᜴ᚣᜩᗕ
It is not this, but same problem scope. Resolve all to “jake” for further processing. Also specifically looking for that ck.

INeedMana@piefed.zip · edit-2 2 months ago

If you want ᜴ᚣᜩᗕ to get resolved to jake then bash will be a pain to use. I would use python

For each ASCII letter create a list of non-ASCII characters that look similar. Then, for each word you want to match construct a regex

dictionary = { ...'j': ['j', 'J', '𝚥', '𞋕', ...

regex = ''  
for letter in 'jake':  
    regex += f'[{"".join(dictionary[letter])}]'

So after the whole loop, the regex would look a bit like this (I’m cutting out a lot of characters to save on copy-pasting): [jJ𝚥𞋕][aA4@][kKᏦ🅺][eE3€]

>>> import re  
>>> r='[jJ𝚥𞋕][aA4@][kKᏦ🅺][eE3€]'  
>>> re.fullmatch(r, 'jake')  
<re.Match object; span=(0, 4), match='jake'>  
>>> re.fullmatch(r, 'joke')  
>>>
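
For reference, here is the loop above as a small self-contained sketch. The `LOOKALIKES` table and the `build_pattern` name are made up for illustration, and the table only holds a handful of characters, so treat it as a starting point rather than a complete mapping. The one addition is `re.escape`, since characters like `]` or `^` would otherwise break the character class if they ever end up in a list.

```python
import re

# Hypothetical lookalike table: each ASCII letter maps to the characters
# that should count as that letter. Only a few entries are shown.
LOOKALIKES = {
    "j": ["j", "J", "𝚥", "𞋕"],
    "a": ["a", "A", "4", "@"],
    "k": ["k", "K"],
    "e": ["e", "E", "3", "€"],
}

def build_pattern(word, table=LOOKALIKES):
    """Return a compiled regex with one character class per letter of `word`."""
    parts = []
    for letter in word:
        # Escape each character so regex metacharacters stay literal.
        chars = "".join(re.escape(c) for c in table[letter])
        parts.append(f"[{chars}]")
    return re.compile("".join(parts))

jake = build_pattern("jake")
print(bool(jake.fullmatch("J4k3")))  # True
print(bool(jake.fullmatch("joke")))  # False
```

Compiling the pattern once and reusing it should matter at ~200k lines, since the regex does not have to be rebuilt for every line.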

In general, the group of problems you are touching on here is https://en.wikipedia.org/wiki/String_metric but I’m not sure there is an algorithm that does this kind of “visual” matching.

𞋴𝛂𝛋𝛆@lemmy.world · 2 months ago

Thanks. This was helpful.

lime!@feddit.nu · 2 months ago

tr could definitely do some work here. maybe something like, echo each word and its translated counterpart, sort on first column, and then echo $col1 >> $col0 for each line? it’s a start at least
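
If this ends up in Python instead, the closest thing to tr is probably `str.translate`. A minimal sketch, assuming a hand-made table (`FOLD` and `normalize` are hypothetical names, and the table only covers a few of the characters from the example list):

```python
# Hypothetical table: lookalike character -> the ASCII letter it stands for.
# Every key must be a single character.
FOLD = str.maketrans({
    "4": "a",
    "@": "a",
    "3": "e",
    "€": "e",
})

def normalize(word):
    """Map known lookalikes to ASCII letters, then lowercase the result."""
    return word.translate(FOLD).lower()

for w in ["Jake", "j4ke", "Jak3", "j@k3", "JAK€"]:
    print(w, "->", normalize(w))  # each line ends in "jake"
```

Unlike the regex approach, this rewrites the word itself, so the result can be looked up directly as a key. The catch is that a mapping like "4" to "a" will also mangle strings where the digit really is a digit.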

Brickfrog@lemmy.dbzer0.com · 2 months ago

Similar to the other comment, not sure if you’ve ruled out writing a Python script? For what you’re describing Python would be able to easily tackle your requirements and still be readable since it’s just a script you can launch whenever you need. Python is also pretty easy to pick up so if you’re familiar with scripting then it could be a fun learning experience (if you don’t already know it).

Other scripting languages could work too, just feel like Bash will be less readable if you’re writing a massive script like that.

Matt The Horwood@lemmy.horwood.cloud · 2 months ago

If you have raw JSON, then take a look at jq. It’s amazing for reading JSON text in bash.

cymor@midwest.social · 2 months ago

If it’s a regular enough format, you could just use DuckDB. Are you using it as a challenge?