General question about the purpose of this project in the scope of hardware other than Intel #772
Comments
Thanks for your comments. This Python extension project focuses on Intel hardware.
Yes, thank you. Actually I'm building something more awesome than Ollama: it will use SGLang (better than vLLM), OpenVINO, some other stuff, and my custom bridge adapters to provide a unified API compatible with the OpenAI API. The goal is to go crazy, think out of the box like a mad scientist, and create something smart, helpful, and innovative, so I will need to compile almost everything. Basically, what I was expecting you would offer was my own idea, and I will implement it myself. It will utilize the NPU, XPU, and GPU, plus provide GPUDirect Storage to use NVMe (or a RAM disk, like we did in the Windows 95 era :D) and bypass the CPU.

It is aimed at inference, not primarily training. I will design an agentic workflow using several models, and because my focus is on developers and people with a single GPU, it will provide very fast hot serving of models while keeping them in RAM or on NVMe for fast swapping (see the sketch below). It will also detect and utilize all the hardware the person has and decide what to put on the NPU and what to put on the GPU. My vision is that modern software must be very smart and offer a simple UI or API, while under the hood there is very complex, fully automated orchestration. I want my apps to represent me: the computer gets just basic instructions from the user, the user does nothing manually, and the computer must be smart enough to solve the task, optimize it, even benchmark various combinations, store them in history, and then have data for informed decisions in the future.

Because of my thinking and perfectionism, my life is not easy. I am often not even popular at the workplace, because I think ahead to future customer needs and often criticize bad decisions about the tech stack, or the boss ordering the whole team to do some manual task for a month or two. They told me I'm slow and don't meet deadlines (because I care about quality), so I got upset and over a weekend wrote only 20 lines of Java using AOP (aspect-oriented programming) that generate everything automatically, in the same style and template the team had been producing manually. For some reason the boss was not happy :D He had not prepared any other tasks for the team, and I even educated everyone that the computer should work for us: if someone with a software engineering education does stereotyped work manually for a month, that person breaks the basic paradigms of software engineering and is just a highly trained monkey. I find a good company once every 10 years. I have my standards and vision, maybe a similar attitude to Steve Jobs. So even when I'm complaining, I analyze the situation and try to offer a solution.

Thank you for your advice. The documentation confused me quite a lot and I thought it would work like that, but hope dies last. I bought a new PC and have not even gotten to have fun with AI models yet. First I will need to compile all these extensions into one package. It is great that you have Intel AI Tools, and there is even an AIO library for Windows that is about 90% POSIX compatible, so I can use it to fix DeepSpeed, Triton, NCCL, and so on. We have a fairy tale for kids in which a cat and a dog cook a cake and put in everything that kids like, so I will act like that :D
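For illustration, the "hot serving with fast swapping" idea above might look like this minimal Python sketch: keep recently used models resident in RAM and evict the least recently used one when a budget is exceeded. The `load_model` callback and the cache budget are hypothetical placeholders, not any existing API.

```python
# Minimal sketch of a "hot model" cache: recently used models stay resident
# in RAM; the least recently used one is evicted when the budget is exceeded.
from collections import OrderedDict

class HotModelCache:
    def __init__(self, max_models: int = 3):
        self.max_models = max_models
        self._cache = OrderedDict()  # model_id -> loaded model object

    def get(self, model_id: str, load_model):
        # Cache hit: move to the end so the model stays "hot".
        if model_id in self._cache:
            self._cache.move_to_end(model_id)
            return self._cache[model_id]
        # Cache miss: evict the least recently used model first.
        if len(self._cache) >= self.max_models:
            self._cache.popitem(last=False)
        # load_model is a placeholder for whatever backend (SGLang,
        # OpenVINO, ...) actually materializes the weights.
        self._cache[model_id] = load_model(model_id)
        return self._cache[model_id]
```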
OPEA is our open source project, which supports NVIDIA GPU, AMD GPU, Intel CPU/XPU/Gaudi, and more. You can ask questions in each repo. Grasp and keep your passion; not everyone has it in every moment.
Thank you very much. I learned to read Chinese scientific papers about AI, because when I want real, state-of-the-art knowledge about AI, I need to browse the Chinese internet. I bought a $6000 Win11 AI PC and cannot use it yet. I need to recompile all the C++ on Windows and enable all the features to get at least normal performance. I will make a custom model-serving server for inference, exposing Ollama/OpenAI-compatible APIs, and then share it with some nerds in my network, including Ph.D.s. Most of them complain about Microsoft and NVIDIA.

The strange thing is that Microsoft has DeepSpeed but compiles it only for Linux, not for Windows. They claim they don't have AIO support, which is Linux-only, yet the Intel AI Toolkit has AIO, and the Intel website says the Microsoft SDK has it too. The plan looked doable until I found that NVIDIA, for some strange reason, includes cuFile in the CUDA Toolkit only on Linux, not on Windows. Anyone who asks about this on the NVIDIA forum never receives any response, even after 4 years. Windows 11 has a DirectStorage equivalent, and it shows that my GPU is enabled and supports DirectStorage out of the box. I should check whether the Intel GPU/XPU supports something like DirectStorage (offloading to NVMe instead of HDD/RAM, bypassing the CPU). Because I am a perfectionist, and the single missing cuFile makes our AI community collapse like a house of cards, I found that it may be possible to load cuFile.so in Python on Windows using cffi. But I have not tested it yet.

In the European Union we have social capitalism (a combination of capitalism and socialism), so the government here pays more attention to the social needs of people rather than giving corporations all the power. Something simple turned into a big effort, and it costs me a lot of time; I cannot be in Vietnam with my family for Lunar New Year. BTW: Enjoy the Lunar New Year, year of the dragon: 新年快乐 (Happy New Year)! (I had a girlfriend from Shenzhen; I think Shenzhen must be the best place in the world for innovators and hardware creators. My country does not make anything.)
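The cffi pattern mentioned above would look roughly like the sketch below on a Linux host where libcufile.so from the CUDA Toolkit is on the loader path; whether the Linux shared object can actually be loaded on Windows, as speculated, is untested. The struct declaration is a simplified approximation of cuFile's `CUfileError_t` (two enum fields); consult cufile.h for the authoritative declarations.

```python
# Hedged sketch: loading libcufile with cffi and opening the GDS driver.
from cffi import FFI

ffi = FFI()
ffi.cdef("""
    typedef struct { int err; int cu_err; } CUfileError_t;
    CUfileError_t cuFileDriverOpen(void);
    CUfileError_t cuFileDriverClose(void);
""")
cufile = ffi.dlopen("libcufile.so")  # Linux-only artifact of the CUDA Toolkit

status = cufile.cuFileDriverOpen()
print("cuFileDriverOpen err code:", status.err)  # 0 means CU_FILE_SUCCESS
cufile.cuFileDriverClose()
```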
Describe the issue
Hi, I have an Intel Core Ultra 9 285K with AVX-512 VNNI instructions (only INT8, meh), an Intel AI NPU (only 128 MB of memory), and some integrated Intel graphics.
Also 128 GB of DDR5 RAM.
But mostly I use an RTX 4090 with 24 GB of VRAM, with float16 or bfloat16.
I read a lot of your papers and the GitHub here, and it seems that if I want to use Intel Neural Compressor or DL Boost (AI Kit), I have to download a PyTorch build compiled for CPU-only support? I use the CUDA build of PyTorch, and that will always be faster.
What are Intel AI Kit, Neural Compressor, Intel Extension for PyTorch, MKL-DNN, and the other fancy stuff good for in real usage with a dedicated NVIDIA GPU (Tensor Cores are nowadays preferred over plain CUDA cores)?
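As a point of reference, a minimal sketch of what Intel Extension for PyTorch does alongside a CUDA setup: it rewrites a model for Intel CPU execution (oneDNN kernels, bf16 paths), independently of whatever runs on the NVIDIA device. This assumes the CPU wheels of torch and intel_extension_for_pytorch are installed; the toy model is a placeholder.

```python
# Sketch: optimize a CPU-side model with IPEX while the GPU serves CUDA work.
import torch
import intel_extension_for_pytorch as ipex

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024),
).eval()

# ipex.optimize applies Intel CPU optimizations and an optional bf16 cast;
# the CUDA device is untouched and can run another model in parallel.
model = ipex.optimize(model, dtype=torch.bfloat16)

with torch.no_grad(), torch.autocast("cpu", dtype=torch.bfloat16):
    out = model(torch.randn(8, 1024))
```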
Could your library help me with faster loading and unloading of models? Let's assume scenario 1: the model is already in ONNX INT8 format; and scenario 2: the model is in bfloat16 from Hugging Face.
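For concreteness, the two scenarios might be loaded as in the sketch below; the file path and model id are hypothetical placeholders, and this uses stock ONNX Runtime and Transformers APIs rather than anything Intel-specific.

```python
# Scenario 1: an INT8 ONNX model served through ONNX Runtime on the CPU.
import onnxruntime as ort
session = ort.InferenceSession("model-int8.onnx",
                               providers=["CPUExecutionProvider"])

# Scenario 2: a bfloat16 Hugging Face checkpoint; low_cpu_mem_usage avoids
# materializing a full fp32 copy in RAM while loading.
import torch
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-model",            # hypothetical model id
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
)
```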
The Intel AI NPU is still waiting for software support, and Microsoft and Intel seem to support only Linux, while I am using Windows 11 for development. But I guess I could offload some small text-to-speech model... at least something.
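A hedged sketch of that offload with OpenVINO, assuming a recent openvino package and a small model file on disk (the path is a placeholder); `available_devices` shows whether the NPU plugin detected the hardware at all.

```python
# Sketch: compile a small model for the Intel NPU via OpenVINO.
import openvino as ov

core = ov.Core()
print(core.available_devices)  # e.g. ['CPU', 'GPU', 'NPU'] if detected

# Compile an IR/ONNX model directly for the NPU device.
compiled = core.compile_model("tts-small.onnx", device_name="NPU")
```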
Is there some fancy trick for how I could utilize this expensive Intel hardware together with an NVIDIA GPU? You only advertise Intel XPU everywhere, and it's a little bit annoying. While I respect your marketing and business, we are in a field of science, vision, and creativity, so we should talk more openly instead of only pushing Intel GPUs in every paper.
I am building an OpenAI-API-compatible gateway that supports even Ollama and incompatible model-serving servers such as SGLang or OpenVINO. I know OpenVINO is also Intel-backed and does not even work with non-Intel GPUs. But at least I will develop a full-cycle ecosystem for GPU, NPU, and XPU.
But I would like to know some workflows where these tools can cooperate, and whether there is library support for that specific case.
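A minimal sketch of the gateway idea described above: one OpenAI-compatible endpoint that routes requests to different backends by model name. FastAPI, httpx, and the routing table are assumptions for illustration, not part of any Intel library.

```python
# Sketch: an OpenAI-compatible gateway that forwards by model name.
import httpx
from fastapi import FastAPI, Request

app = FastAPI()

# Hypothetical routing table: model name -> backend base URL.
BACKENDS = {
    "sglang-llama": "http://localhost:30000/v1",
    "openvino-phi": "http://localhost:8000/v1",
}

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    base_url = BACKENDS[body["model"]]
    # Forward the unmodified OpenAI-style payload to the chosen backend.
    async with httpx.AsyncClient() as client:
        resp = await client.post(f"{base_url}/chat/completions", json=body)
    return resp.json()
```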
Also, I was thinking the Intel AI Kit, Neural Compressor, or this Python extension could help by speeding up model loading and offloading, and also by providing direct GPU transfers, from VRAM to RAM and back, so the hard drive is not used. The model could then be quantized by the CPU in the background, in parallel, and served to the NPU. Basically, my OpenAI-API-like gateway must support scenarios where you have limited memory and are multitasking, so you need to switch between many AI/ML models quickly.
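The VRAM-to-RAM round trip without touching disk is a standard PyTorch technique: page-locked (pinned) host buffers let CUDA copy asynchronously over PCIe. A minimal sketch, with the tensor shape as a stand-in for real model weights:

```python
# Sketch: swap a tensor between VRAM and pinned RAM without using disk.
import torch

weights = torch.randn(4096, 4096, device="cuda")  # resident in VRAM
host_buf = torch.empty(weights.shape, dtype=weights.dtype,
                       pin_memory=True)           # page-locked RAM buffer

host_buf.copy_(weights, non_blocking=True)        # VRAM -> pinned RAM
torch.cuda.synchronize()                          # wait for the async copy

weights.copy_(host_buf, non_blocking=True)        # pinned RAM -> VRAM
torch.cuda.synchronize()
```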
Thank you!
I love Intel ;)