AMA: the OpenAI Research Team

Hey folks! Question about a model idea I've been toying with:

tl;dr: make RL agents honest with NeuralTalk? And/or: are you interested in research on debugging tools?

After hearing Ilya share how NeuralTalk was put together - that is, just jamming a decoder on the end of a convnet and training it - I've been thinking about using something similar to analyze large reinforcement learners that can make plans. In particular, since RL agents can make plans, try them out, and learn from the gathered data, they run some risk of learning that lying is useful for their reward. I'm concerned that if we build complex enough brain equivalents - say, around the capabilities of a teenager - and they have learned to lie, it may become rather difficult to get them to "listen".

By using the RL agent's intermediate hidden states in something similar to NeuralTalk - just take the hidden state and train a decoder on it - I was thinking you might be able to serialize whatever plans those hidden states contain into English or images, via an external decoder trained without any data or signals from the reinforcement system. For instance, do exactly what was done with NeuralTalk: feed MS COCO (or, if possible, a larger dataset with similar labels) through the learned RL model, copy the hidden state out at various points in the RL model's pipeline, and use it as the thought vector for an external decoder predicting COCO's labels. My thinking on this isn't fully formed - I'd expect that setup to have trouble generalizing far beyond COCO - but what are your thoughts on the general idea?
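To make the wiring concrete, here's a rough PyTorch sketch of what I have in mind. Everything in it is made up for illustration - `PolicyNet` is a stand-in for whatever pretrained agent you'd actually probe, `trunk[3]` is an arbitrary probe point, and the dummy loader stands in for COCO-style (image, caption) pairs. The one property I care about is that the decoder trains on its own data and never sends gradients or signals back into the RL system:

```python
# Illustrative sketch only: PolicyNet, the trunk[3] probe point, and the
# dummy data loader are all hypothetical stand-ins, not a real API.
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Stand-in for a pretrained RL agent whose hidden states we want to read."""
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3 * 64 * 64, 512),
            nn.ReLU(),
            nn.Linear(512, 512),  # <- the intermediate state we'll probe
        )
        self.policy_head = nn.Linear(512, 8)  # e.g. 8 discrete actions

    def forward(self, obs):
        return self.policy_head(self.trunk(obs))

class CaptionDecoder(nn.Module):
    """NeuralTalk-style decoder: treat an RL hidden state as the 'thought
    vector' that conditions an LSTM emitting caption tokens."""
    def __init__(self, hidden_dim, vocab_size, embed_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_proj = nn.Linear(hidden_dim, embed_dim)
        self.lstm = nn.LSTM(embed_dim, embed_dim, batch_first=True)
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, rl_hidden, captions):
        # Condition the LSTM's initial state on the (detached) RL hidden state.
        h0 = torch.tanh(self.init_proj(rl_hidden)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        emb = self.embed(captions[:, :-1])        # teacher forcing
        seq, _ = self.lstm(emb, (h0, c0))
        return self.out(seq)                      # next-token logits

# Capture the intermediate activation with a forward hook. detach() plus
# frozen parameters keep the decoder strictly external: nothing flows back
# into the agent or its reward signal.
captured = {}
rl_model = PolicyNet().eval()
for p in rl_model.parameters():
    p.requires_grad_(False)
rl_model.trunk[3].register_forward_hook(
    lambda module, inp, out: captured.update(state=out.detach()))

decoder = CaptionDecoder(hidden_dim=512, vocab_size=10_000)
opt = torch.optim.Adam(decoder.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

# Dummy (observation, caption-token) batches standing in for MS COCO.
loader = [(torch.randn(4, 3, 64, 64), torch.randint(0, 10_000, (4, 12)))
          for _ in range(3)]

for obs, caption in loader:
    rl_model(obs)                                 # fills captured["state"]
    logits = decoder(captured["state"], caption)
    loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                   caption[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The interesting experiment would then be hooking several points in the pipeline and seeing which ones decode into anything coherent.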

I don't expect this to be a big deal for several years at the bare minimum, but since NeuralTalk was as useful as it was for getting an idea of what convnets like to erase, analysis tools of this general category seem like they'd be immensely valuable for crafting reward systems and training environments for large RL agents. Thoughts?

And, more importantly to me: is this the sort of research your team has an interest in? That is, building out interpretable debugging tools?

/r/MachineLearning Thread