How come the most accurate microphone isn't a big metal box? What is "audio science" really capable of?

A giant metal box would be about the worst possible microphone shape, and not for aesthetic reasons. The source's sound pressure waves would reflect off the box and disturb the sound field at the capsule, which corrupts the recorded signal. It would also be impractical to use. Prototype engineering equipment can be ugly, but after an iteration or two it's usually robust and practical for the task at hand. Industrial design seems like a waste of effort to people who don't know better, but it helps make things practical to use, which is just as important as any performance metric. That said, plenty of mics do have a huge brick nearby: tube mics, for instance, typically come with a separate power supply box.
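To put rough numbers on the reflection point: an object starts to disturb the sound field noticeably once its size is comparable to the wavelength. Here is a minimal sketch, assuming a speed of sound of about 343 m/s (air near 20 °C) and two illustrative sizes I picked myself: a hypothetical 0.5 m box and a half-inch (12.7 mm) capsule.

```python
# Rough rule of thumb: a body disturbs the sound field noticeably at
# frequencies where the wavelength is comparable to (or smaller than) its size.
SPEED_OF_SOUND_M_S = 343.0  # assumption: dry air near 20 °C


def wavelength_m(frequency_hz: float) -> float:
    """Wavelength in metres of a sound wave at the given frequency."""
    return SPEED_OF_SOUND_M_S / frequency_hz


def comparable_frequency_hz(body_size_m: float) -> float:
    """Frequency whose wavelength equals the body's size."""
    return SPEED_OF_SOUND_M_S / body_size_m


if __name__ == "__main__":
    # Both sizes are illustrative assumptions, not measurements of a real product.
    for label, size_m in [("0.5 m box", 0.5), ("12.7 mm capsule", 0.0127)]:
        f = comparable_frequency_hz(size_m)
        print(f"{label}: wavelength matches size at about {f:.0f} Hz")
```

Under those assumptions the box becomes acoustically "large" within the midrange, while the small capsule stays smaller than the wavelength across the whole nominal audible band (up to 20 kHz).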

In audio technology R&D, GRAS condenser microphones (or similar) are the standard for all sorts of calibration and measurement. They lower noise by making the preamp and power supply separate modules. They are also a bit closer to what you're alluding to: a small cylinder that is basically just the capsule. https://www.gras.dk/products/measurement-microphone-cartridge/externally-polarized-cartridges-200-v/product/167-40ag

MEMS microphones are getting better all the time and I'd bet they'll replace condensers like the 40AG someday.

/r/audioengineering Thread