(bash?) matching Array of Arrays against a simple k:v json at ~200k lines?

𞋴𝛂𝛋𝛆@lemmy.world · edit-2 4 months ago

(bash?) matching Array of Arrays against a simple k:v json at ~200k lines?

INeedMana@piefed.zip · 4 months ago

I think this post could use some small example with what you have vs what you want to get

Depending on how I read your description it’s either doable with bash and grep or probably doable but a lot of hassle compared to using something else than bash

𞋴𝛂𝛋𝛆@lemmy.world · 4 months ago

jake
Jake
j4ke
Jak3
j@k3
JAK€
jπ⸦kE
𝚥ᎪᏦ⋲
ꓙᏎ🅺Ꮛ
𞋕ꮜ𝈲𝈁
᜴ᚣᜩᗕ
It is not this, but same problem scope. Resolve all to “jake” for further processing. Also specifically looking for that ck.

INeedMana@piefed.zip · edit-2 4 months ago

If you want ᜴ᚣᜩᗕ to get resolved to jake then bash will be a pain to use. I would use python

For each ASCII letter create a list of non-ASCII characters that look similar. Then, for each word you want to match construct a regex

dictionary = { ...'j': ['j', 'J', '𝚥', '𞋕', ...

regex=''  
for letter in 'jake':  
regex += f'[{"".join(dictionary[letter])}]'

so after whole loop the regex would look a bit like (I’m cutting out a lot of characters to save on copy-pasting) [jJ𝚥𞋕][aA4@][kKᏦ🅺][eE3€]

>>> import re  
>>> r='[jJ𝚥𞋕][aA4@][kKᏦ🅺][eE3€]'  
>>> re.fullmatch(r, 'jake')  
<re.Match object; span=(0, 4), match='jake'>  
>>> re.fullmatch(r, 'joke')  
>>>

In general the group of problems that you are touching here is https://en.wikipedia.org/wiki/String_metric but I’m not sure if there is an algorithm that would be so “visual” matching

𞋴𝛂𝛋𝛆@lemmy.world · 4 months ago

Thanks. This was helpful.

lime!@feddit.nu · 4 months ago

tr could definitely do some work here. maybe something like, echo each word and its translated counterpart, sort on first column, and then echo $col1 >> $col0 for each line? it’s a start at least