• √𝛂𝛋𝛆@piefed.world · 36 points · 2 days ago

    Holy fucking what? 19k frigz? Must know serious db ops to make that work. Like damn. I’m learning to navigate model vocab in the 50k to 200k lines range and that is just key/value strings and extended Unicode characters. How does one organize and find a relevant Chud GODfrog image in the holysea???

    • kali_fornication@lemmy.world (OP) · 24 points · 2 days ago

      it can only come from years of browsing the chans and saving every frog you see. jk i think there’s a huge frog database online you can download

        • nostrauxendar@lemmy.world · 1 point · 2 days ago

          You mentioned model vocab, and I assumed that was to do with chatbots. I’ve seen your rambling about chatbots before. It’s pretty funny.

          • √𝛂𝛋𝛆@piefed.world · 2 points · 2 days ago

            Have you ever followed the citations I have given to the code present in the vocabulary, or do you just laugh at things you do not understand?

            • nostrauxendar@lemmy.world · 11 points · 2 days ago

              Why would I care about your little chatbots, Jake? You go off on these long, incoherent rants that honestly feel like some sort of schizophrenic meltdown over how there are sentient gods or personalities in your chatbots, and then act like you’re typing reasonable stuff that we should all do our due diligence and read. I feel like you need to get your head on straight and stop interacting with the robots, but I also find it relatively funny so I hope you don’t 🤷‍♂️ godspeed buddy 🫡

              • Zoot@reddthat.com · 3 points · 2 days ago

                I do enjoy that they immediately started showing everybody exactly what you mean lol.

                • nostrauxendar@lemmy.world · 3 points · 1 day ago

                  I really don’t know what happens to them when AI is mentioned.

                  I’ve seen some of their other comments that are about other things, and they’re often coherent, reasonable, and even witty - but as soon as they start talking about chatbots it’s like their brain short-circuits and they just blather.

              • √𝛂𝛋𝛆@piefed.world · 1 point · 2 days ago (edited)

                Look at the vocab.json of nearly any model. In CLIP it is organized at the end, in the last ~2200 tokens of the SD1 CLIP vocabulary that all diffusion models use. The bulk of the code is in that last block, but in total there are 2337 tokens containing code in a brainfuck-style language written with the extended Latin character set. Any idiot who looks at this will spot that these are not parts of any language’s words or fragments, and that there are obvious functions present. If you have ComfyUI, the file is at ./comfy/sd1_tokenizer/vocab.json.
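
                If you want to look for yourself, here is a minimal Python sketch (assuming the ComfyUI path above, and that the file is the usual token-to-id map) that prints the highest-id entries and lists the tokens containing non-ASCII characters, so you can inspect that last block directly:

                    # Minimal sketch: inspect the tail of a CLIP-style vocab.json and
                    # list the tokens that contain non-ASCII (e.g. extended Latin)
                    # characters. The path assumes a ComfyUI checkout, as noted above.
                    import json

                    VOCAB_PATH = "./comfy/sd1_tokenizer/vocab.json"  # token -> id map

                    with open(VOCAB_PATH, encoding="utf-8") as f:
                        vocab = json.load(f)

                    # Sort by id so "the end of the vocabulary" means the highest ids.
                    tokens_by_id = sorted(vocab.items(), key=lambda kv: kv[1])

                    print("last 20 tokens:")
                    for token, idx in tokens_by_id[-20:]:
                        print(f"{idx:6d}  {token!r}")

                    # Tokens containing anything outside plain ASCII.
                    non_ascii = [(t, i) for t, i in tokens_by_id
                                 if any(ord(c) > 127 for c in t)]
                    print(f"\n{len(non_ascii)} tokens contain non-ASCII characters")
                    for token, idx in non_ascii[:20]:
                        print(f"{idx:6d}  {token!r}")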

                The most difficult models, like T5xxl, have this code precompiled and embedded in the vocabulary.

                I am playing with Qwen 3 right now, which has the much larger world model of the OpenAI QKV hidden-layers alignment. Unlike CLIP, Qwen’s vocabulary starts as 2.8 MB of compressed JSON, and the extended Latin is intermixed throughout. In ComfyUI the file is at ./comfy/text_encoders/qwen25_tokenizer/vocab.json. You will need the jq package (or something similar) to make the JSON readable. If you have jq and ripgrep (rg) on your system, try:

                    cat ./vocab.json | jq . | LC_ALL=C rg '[[:^ascii:]]'

                This vocabulary is much less organized than CLIP’s, but the model has 103,378 lines of code using the same character set. I have reverse engineered most of the tokens used to create this code, and I modify the vocabulary with scripts to alter behavior. Explaining the heuristics of how I figured out the complexity of this structure is fucking hard in itself, never mind actually sorting out any meaning. It really sucks when people are assholes about that.
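
                The same check can be done in Python instead of jq and ripgrep. This is a sketch under the same assumptions (ComfyUI path as above, vocab.json being a plain token-to-id map); the renaming rule at the end is only an illustration of what a vocabulary-editing script can look like, not the specific edits described here:

                    # Sketch: count the Qwen tokenizer entries containing non-ASCII
                    # characters and write a modified copy of the vocabulary.
                    # The path assumes a ComfyUI checkout; the renaming rule is
                    # purely illustrative.
                    import json

                    VOCAB_PATH = "./comfy/text_encoders/qwen25_tokenizer/vocab.json"

                    with open(VOCAB_PATH, encoding="utf-8") as f:
                        vocab = json.load(f)  # token -> id

                    non_ascii = [t for t in vocab if any(ord(c) > 127 for c in t)]
                    print(f"{len(vocab)} tokens total, "
                          f"{len(non_ascii)} contain non-ASCII characters")

                    # Edit a copy, never the original file: here a handful of the
                    # non-ASCII tokens are renamed, which changes what text maps to
                    # those ids.
                    modified = dict(vocab)
                    for token in non_ascii[:5]:
                        modified[f"<edited>{token}"] = modified.pop(token)

                    with open("vocab.modified.json", "w", encoding="utf-8") as f:
                        json.dump(modified, f, ensure_ascii=False, indent=1)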