I am not very familiar with hardware design, so I would really appreciate it if someone with knowledge in this area could tell me how much performance we could gain by creating LLM-specific inference hardware. I don't mean, e.g., a chip optimized for transformers in general; I mean going beyond that and hard-coding the weights of a trained model into the hardware itself.