I have a file that contains a lot of odd slang and dialects that were written as they sound, and I want to standardize them to the ASCII character set. I want a readable script that I will understand at a glance a year later despite not touching a computer in the interim.
Maybe I am going about this the wrong way, but I want to initialize individual arrays for each character [a-z]. Then step through each character of the input word or string, passing these to a Case that matches them to the respective [a-z] array while passing unmatched characters unchanged. In the end I need to retain correlation with the original file line.
In my first attempt, I got to the point of matching each character to the name of the array using the case, but only as the name of the array as a string of text. So like, the “a” array is “aaa”. Now I’m trying to relearn how to call that placeholder as an array again, like a pointer. I can make it a variable with printf -v, but then calling that variable as a pointer to the array eludes me. I don’t know how to double expand a variable inside an array like “${$var[@]}”. I’ll figure that out. This is just where I am at in terms of abstract reference of ideas. Solve it, don’t; I do not care about that aspect; solving my method is not related to what I am asking here.
What I am asking is what ways are used to solve this type of problem in general, with the constraint of readability? Egrep, sed, awk? Do it all within the json to maintain the relationship to the original key/value? Associative arrays have never really clicked for me in bash. Maybe that is the better solution? It is just a hobby thing, not work, school, or whatnot. I’m asking hackers that find this kind of problem casual fun social smalltalk.


I think this post could use a small example of what you have vs. what you want to get.
Depending on how I read your description, it’s either doable with bash and grep, or probably doable but a lot of hassle compared to using something other than bash.
jake Jake j4ke Jak3 j@k3 JAK€ jπ⸦kE 𝚥ᎪᏦ⋲ ꓙᏎ🅺Ꮛ 𞋕ꮜ𝈲𝈁 ᜴ᚣᜩᗕ
It is not this, but the same problem scope. Resolve all to “jake” for further processing. Also specifically looking for that ck.
If you want ᜴ᚣᜩᗕ to get resolved to jake then bash will be a pain to use. I would use Python.
For each ASCII letter, create a list of non-ASCII characters that look similar. Then, for each word you want to match, construct a regex:
    dictionary = {
        ...
        'j': ['j', 'J', '𝚥', '𞋕', ...

    regex = ''
    for letter in 'jake':
        regex += f'[{"".join(dictionary[letter])}]'

For 'jake' this gives [jJ𝚥𞋕][aA4@][kKᏦ🅺][eE3€]:

    >>> import re
    >>> r = '[jJ𝚥𞋕][aA4@][kKᏦ🅺][eE3€]'
    >>> re.fullmatch(r, 'jake')
    <re.Match object; span=(0, 4), match='jake'>
    >>> re.fullmatch(r, 'joke')
    >>>

In general the group of problems that you are touching here is https://en.wikipedia.org/wiki/String_metric but I’m not sure if there is an algorithm that does this kind of “visual” matching.
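In case a fuller version helps, here is a rough sketch of the regex-per-letter idea above that also keeps the line number of each hit, since the original post wants to retain the link to the source line. The `LOOKALIKES` table and the function names `build_pattern` and `find_matches` are made up for illustration, and the table only covers the four letters of the example; a real one would need an entry for every letter you care about.

```python
import re

# Hypothetical lookalike table, limited to the letters in the "jake" example.
# Each ASCII letter maps to the characters that should count as that letter.
LOOKALIKES = {
    "j": "jJ𝚥𞋕",
    "a": "aA4@",
    "k": "kKᏦ🅺",
    "e": "eE3€",
}


def build_pattern(word, table=LOOKALIKES):
    """Return a compiled regex matching any lookalike spelling of `word`."""
    parts = []
    for letter in word:
        # Letters missing from the table only match themselves.
        chars = table.get(letter, letter)
        parts.append("[" + re.escape(chars) + "]")
    return re.compile("".join(parts))


def find_matches(lines, word):
    """Return (line_number, matched_text) pairs, numbering lines from 1."""
    pattern = build_pattern(word)
    hits = []
    for number, line in enumerate(lines, start=1):
        for match in pattern.finditer(line):
            hits.append((number, match.group()))
    return hits


if __name__ == "__main__":
    sample = ["hello j4ke", "joke", "J@k3 and JAK€"]
    for number, text in find_matches(sample, "jake"):
        print(number, text)
```

Keeping the table as plain strings per letter is meant to make it easy to read and extend a year from now: adding a new lookalike is just appending one character to the right row.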
tr could definitely do some work here. Maybe something like: echo each word and its translated counterpart, sort on the first column, and then echo $col1 >> $col0 for each line? It’s a start at least.
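For comparison, the same character-for-character translation that tr does can be sketched in Python with `str.maketrans` and `str.translate`, which handle multi-byte characters without any fuss. This is only a sketch: the `LOOKALIKES` table and the names `make_table` and `normalize_lines` are invented here, and the table is a tiny sample. Each result keeps the line number and the original text next to the normalized text, so the correlation with the source file is not lost.

```python
# Hypothetical table: each ASCII letter lists the characters that collapse to it.
LOOKALIKES = {
    "a": "A4@",
    "e": "E3€",
    "j": "J",
    "k": "K",
}


def make_table(lookalikes):
    """Build a translation table usable with str.translate."""
    mapping = {}
    for ascii_letter, variants in lookalikes.items():
        for variant in variants:
            mapping[variant] = ascii_letter
    return str.maketrans(mapping)


def normalize_lines(lines, table):
    """Return (line_number, original, normalized) for every input line."""
    return [
        (number, line, line.translate(table))
        for number, line in enumerate(lines, start=1)
    ]


if __name__ == "__main__":
    table = make_table(LOOKALIKES)
    for number, original, normalized in normalize_lines(["J4k3", "J@KE x"], table):
        print(number, original, normalized)
```

One caveat with a blanket translation like this: digits such as 4 and 3 get rewritten everywhere, including in real numbers, so it probably wants to run only on the fields that hold words.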