• I have a fairly substantial 16gb AMD GPU, and when I load in Llama 3.1 8B Instruct 128k (Q4_0), it gives me about 12 tokens per second. That’s reasonably fast enough for me, but only 50% faster than CPU (which I test by loading mlabonne’s abliterated Q4_K_M version, which runs on CPU in GPT4All, though I have no idea if that’s actually meant to be comparable in performance).

    Then I load in Nous Hermes 2 Mistral 7B DPO (also Q4_0) and it blazes through at 50+ tokens per second. So I don’t really know what’s going on there. Seems like performance varies a lot from model to model, but I don’t know enough to speculate why. I can’t even try Gemma2 models, GPT4All just crashes with them. I should probably test Alpaca to see if these perform any different there…

    • Hi I just wanted let you know that I managed to get Gemma 2 model to work (didn’t work previously too).

      These are the new ones Gemma 2. I wasn’t 100% sure first, so looked up at Gemma models list: https://ai.google.dev/gemma/docs/get_started and the only 9b variants are the new Gemma 2 versions (Edit: I mislooked. There are Gemma 1 versions with 9b too, so never mind this comment. ). If this works on my low end GPU, it should work on yours too.

    • Wow it got worse for me. Maybe through last update? Is this probably related to he application? Now I get 12 t/s on my CPU and switching to GPU it’s only 1.5 t/s. Something is fishy. With Nous hermes 2 Mistral 7B DPO with q4 I get 33 t/s (I believe it was up to 44 before).

      Now I’m curious if this will happen with a different application too, but I have nothing else than GPT4All installed.

      • Unfortunately I can’t even test Llama 3.1 in Alpaca because it refuses to download, showing some error message with the important bits cut off.

        That said, the Alpaca download interface seems much more robust, allowing me to select a model and then select any version of it for download, not just apparently picking whatever version it thinks I should use. That’s an improvement for sure. On GPT4All I basically have to download the model manually if I want one that’s not the default, and when I do that there’s a decent chance it doesn’t run on GPU.

        However, GPT4All allows me to plainly see how I can edit the system prompt and many other parameters the model is run with, and even configure multiple sets of parameters for the same model. That allows me to effectively pre-configure a model in much more creative ways, such as programming it to be a specific character with a specific background and mindset. I can get the Mistral model from earlier to act like anything from a very curt and emotionally neutral virtual intelligence named Jarvis to a grumpy fantasy monster whose behavior is transcribed by a narrator. GPT4All can even present an API endpoint to localhost for other programs to use.

        Alpaca seems to have some degree of model customization, but I can’t tell how well it compares, probably because I’m not familiar with using ollama and I don’t feel like tinkering with it since it doesn’t want to use my GPU. The one thing I can see that’s better in it is the use of multiple models at the same time; right now GPT4All will unload one model before it loads another.

        • That’s quite unfortunate. Alpaca needs to support those explicitly to work with the new 3.1 128k models; GPT4All was not compatible with it before update either. There was a bug in some library they was using and needed a patch. So maybe that’s why you can’t use the new Llama 3.1 in Alpaca. (Edit: Never mind. On the webpage they advertise and talk about 3.1 being working, so a wrong guess by me probably.)

          Actually that sounds very useful and I missed that option, to be able to select from a set of related models. One thing that GPT4All can also do is, analyzing text files and then using the data to ask questions about it. It will also output the exact lines of the file in relation to the answer. I only experimented a little bit with this, but sounds useful too. The team also experiments and works on a web search using, but no idea how that would work with a local model if ever.