Right, except that such super-instructions have to be useful in the first place. For large vector/matrix operations, it would be better to simply implement these in optimized C/C++/... (or even offload to a GPU) and expose them as functions to the (scripting) language (this is what Python does, binding to optimized C libraries).
Both Lua and Python already have such power instructions: table/dictionary lookups are single opcodes, and Lua even has support opcodes to speed up for loops, etc.
One could expose something like vector instructions (SSE/AVX; I believe WebAssembly has a proposal for those), but then you need a compiler that can auto-vectorize your code, which is no easy task. And let's not forget that you still need control flow instructions to do something useful (like playing chess).
I saw a nice trick back in the 32-bit days, where the bytecode (32-bit instructions) was interleaved with 32-bit pointers containing the address of the handler for the next opcode, bypassing the cost of decoding the opcode and dispatching via an indirect jump table. But that doesn't change the fact that interpreters are very slow in general.
My point is that no interpreter can ever get close to optimized native code, and so far I've found nothing that comes even remotely close in the general case in terms of performance. And performance always matters, contrary to the popular belief that it doesn't.
And of course a JIT compiler is not an interpreter... in fact, I believe it should be called a JIT optimizer rather than a compiler.