I originally thought numerical linear algebra was the course to start learning how to optimize numerical computations.
Numerical linear algebra is a start for learning how to do optimized numerical computation. It just isn't an end. Trefethen's book is a solid introduction to the foundations of algorithmic design, which is the most mathematical aspect of numerical computation, though not all of it is linear-algebraic.
There are, however, many further mathematical topics in the field that you can pursue to deepen your understanding. Broader coverage of numerical analysis is helpful (see, for example, Higham's book Accuracy and Stability of Numerical Algorithms), as is the wealth of literature on parallel computation, much of which is unfortunately still too new and fast-moving for there to be a single concise, up-to-date textbook. During my own graduate studies I followed online course materials; see, for example, UC Berkeley's parallel computation course.
However, the reality is that numerical computation is not solely mathematical. We perform our computations not on abstract Turing machines but on electronic computers, so the hardware constraints of specific machine configurations become extremely important. As a simple example, one basic consideration when coding an algorithm is which data to keep in memory for later reuse. If performance matters, that decision depends on the physical access times of the different forms of memory: modern CPUs have several levels of cache that are much smaller than the main system memory but much faster to access. These access times are determined by the electronic wiring and can vary not only with the particular CPU model, but also with physical manufacturing properties that differ very slightly between individual CPUs of the same model.
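To make the caching point concrete, here is a minimal sketch in C (my own illustration, not taken from any of the references above). It sums the same row-major matrix twice, once in storage order and once column by column. The arithmetic is identical; only the memory access pattern differs. The function names, including the helper compare_traversals, are hypothetical, and the timings it prints will depend heavily on the CPU, the compiler, and the optimization flags.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Sum an n-by-n matrix stored row-major, visiting elements in storage
   order (unit stride), so consecutive reads share cache lines. */
double sum_row_order(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            s += a[i * n + j];
    return s;
}

/* Same sum, but visiting column by column: each read is n doubles away
   from the previous one, which tends to defeat the cache for large n. */
double sum_col_order(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            s += a[i * n + j];
    return s;
}

/* Hypothetical helper: fill an n-by-n matrix with ones, time both
   traversals with clock(), and print the results.
   Returns 0 if the two sums agree, 1 if they differ,
   and -1 if the allocation fails. */
int compare_traversals(size_t n)
{
    double *a = malloc(n * n * sizeof *a);
    if (a == NULL)
        return -1;

    /* All ones, so the sum is exact regardless of summation order. */
    for (size_t k = 0; k < n * n; k++)
        a[k] = 1.0;

    clock_t t0 = clock();
    double s_row = sum_row_order(a, n);
    clock_t t1 = clock();
    double s_col = sum_col_order(a, n);
    clock_t t2 = clock();

    printf("n = %zu\n", n);
    printf("row order:    sum = %.0f, %.3f s\n",
           s_row, (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("column order: sum = %.0f, %.3f s\n",
           s_col, (double)(t2 - t1) / CLOCKS_PER_SEC);

    free(a);
    return s_row == s_col ? 0 : 1;
}
```

Calling compare_traversals(4096) from a main function uses a 128 MiB matrix, which is larger than the caches on typical current CPUs. On many machines the column-order traversal is noticeably slower, but treat that as something to measure rather than assume: an aggressively optimizing compiler may interchange the loops, and the gap shrinks or vanishes once the matrix fits in cache.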
Another large topic I encourage you to read up on is GPU-based computation. Once again, the primary considerations are hardware-driven rather than mathematical, stemming from the highly parallel, shared-memory nature of GPUs.