I think the original poster is misinterpreting the dimensions of the output produced by melspectrogram.
melspectrogram, like all of the other feature extraction functions in librosa, produces a set of values for each frame of audio. By convention, the frame index is the last dimension of the result, and the number of channels/values per frame is the first dimension. So a mel spectrogram output with shape (128, 177) means you have 177 frames of audio, each with 128 mel channel values.
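For example, you can confirm the convention by inspecting the shape directly. A minimal sketch, assuming librosa >= 0.8 for the bundled example clip (any mono signal works the same way):

import librosa

y, sr = librosa.load(librosa.example('trumpet'))  # demo clip; substitute your own audio
melspec = librosa.feature.melspectrogram(y=y, sr=sr)
print(melspec.shape)  # (n_mels, n_frames): 128 mel bands by default, one column per frame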
If you want to feed an individual mel spectral frame into a neural network (or any other kind of model), you need to slice along the second axis (the frame axis), like:
melspec = librosa.feature.melspectrogram(y=y, sr=sr)  # y, sr as in the snippet above
observed_frame = melspec[:, 10]  # slice the 11th frame (index 10) out of the mel spectrogram
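If you'd rather hand a model one feature vector per row (a batch of frames), transposing flips the axes; this is plain numpy indexing, nothing librosa-specific:

frames = melspec.T       # shape (n_frames, n_mels): one 128-dimensional vector per frame
print(frames[10].shape)  # (128,) -- the same vector as observed_frame above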
Does that clear things up?