Learning the way of life from Reinforcement Learning

Apr 18, 2022


This is the first blog post I’ve ever written under my own name on Medium. It’s okay if I write some gibberish, though, since this is my first epoch.

Photo by Andrea De Santis on Unsplash

Studying RL, I’ve found myself reflecting on my own life as I watch agents learn to win a game after thousands of rounds of trial and error. I watch these blank-slate babies get better and better, sometimes struggling, yet making it in the end with a little help.

Although there’s still a long way to go, RL algorithms are already powerful enough to give us insights into how to learn, and maybe even how to live. The advance of RL models is itself a global game of RL, played by our smart fellow human researchers, who distil their own ways of learning into their baby creatures to optimize the trajectory to a better world.

To me, RL is a mirror of myself.

RL in a nutshell

Photo by KDnuggets

Let’s take a real-life example of a university student. You’re a student, aka an agent. Your goal is to maximize your reward in an environment called a university. You know nothing about the environment yet. You don’t know where the rewards come from. You don’t know what you’re into. So, you carry out actions at random. Then, you happen to earn rewards by making new friends, learning deep learning, going to a party, or becoming a member of a student society. Over time, you get to know what you can expect.
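Here is a minimal Python sketch of that first semester. The action names and reward numbers are made up for illustration; nothing here comes from a real environment. The student picks actions at random and keeps a running average of how much each one paid off.

```python
import random

# Hypothetical actions and their average rewards (made-up numbers).
MEAN_REWARD = {
    "make_friends": 1.0,
    "learn_deep_learning": 2.0,
    "go_to_party": 0.5,
    "join_society": 1.5,
}


def step(action, rng):
    """The 'university' environment: return a noisy reward for an action."""
    return MEAN_REWARD[action] + rng.gauss(0.0, 0.1)


def explore(num_steps, seed=0):
    """Act at random and track the average reward seen for each action."""
    rng = random.Random(seed)
    actions = list(MEAN_REWARD)
    counts = {a: 0 for a in actions}
    estimates = {a: 0.0 for a in actions}
    for _ in range(num_steps):
        action = rng.choice(actions)      # no knowledge yet, so pick randomly
        reward = step(action, rng)
        counts[action] += 1
        # Incremental mean: move the estimate a little toward the new reward.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates


estimates = explore(2000)
best_guess = max(estimates, key=estimates.get)
```

After enough random trials, `best_guess` points at the action that paid off the most on average, which is the “you get to know what you can expect” part.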

Value / Q

We are not determined by our experiences, but the meaning we give them is self-determining — Courage to be Hated

A value-based model learns a value function over states, or over state-action pairs (Q). It learns to estimate the value of each situation, and then selects the action that leads to the state likely to give it the best rewards.
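To make this a bit more concrete, here’s a rough sketch of tabular Q-learning, one common value-based method. The states, actions, rewards, and the `ALPHA` and `GAMMA` settings are all hypothetical, picked only for illustration.

```python
from collections import defaultdict

ALPHA = 0.5  # learning rate (assumed value)
GAMMA = 0.9  # discount factor (assumed value)


def q_update(q, state, action, reward, next_state, next_actions):
    """One Q-learning step: nudge Q(s, a) toward reward + discounted best next value."""
    # A terminal next state has no actions, so its future value is 0.
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + GAMMA * best_next
    q[(state, action)] += ALPHA * (target - q[(state, action)])


def greedy_action(q, state, actions):
    """Pick the action with the highest estimated value in this state."""
    return max(actions, key=lambda a: q[(state, a)])


q = defaultdict(float)  # every Q-value starts at 0
actions = ["study", "netflix"]
terminal_states = {"tired", "done"}

# Hypothetical experience: (state, action, reward, next_state).
experience = [
    ("home", "netflix", 1.0, "tired"),    # small reward right away
    ("home", "study", 0.0, "prepared"),   # nothing yet...
    ("prepared", "study", 5.0, "done"),   # ...but a big reward later
    ("home", "study", 0.0, "prepared"),   # the later reward now flows back
]
for s, a, r, s2 in experience:
    next_actions = [] if s2 in terminal_states else actions
    q_update(q, s, a, r, s2, next_actions)

choice = greedy_action(q, "home", actions)
```

In this toy run, studying at home earns nothing immediately, yet its Q-value ends up higher than Netflix’s because the later reward is propagated back to it.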

How many times have you asked yourself:

  • What’s the value of my current situation?
  • What can I expect from it?
  • What’s the value of the next possible situation and which one should I choose?
  • Am I doing well?

It’s about knowing which state you are in and extracting the meaning from it.

I myself am also on a journey to finding my own value function. I believe everyone has his or her own. To some, money has the best value; to others, learning.

As in RL, a value function keeps changing until it converges. I think that’s why we regret the past. To me, our brain is a machine that always selects the situation with the best estimated value. So why do we err? Because we judge the situation poorly, and many of us don’t even try to judge it. Without second thoughts, watching YouTube videos or playing games can look like the state that gives us the maximum reward in the shortest time. Why does addiction even exist? Because our brain always chooses the best-looking state.

There’s another reason why we regret even though we do our best. Your value function constantly changes. That means the dream you were striving for in the past might not have the same value under your current value function. We all have regrets, even though we did our best. It’s only natural, since how we view the world changes over time.

Action & Policy

The important thing is not what one is born with, but what use one makes of the equipment — Courage to be Hated

In RL, actions are the possible ways you can change yourself and the environment in a given state. A policy is a probability distribution over actions, i.e. how likely you are to choose each action. A well-trained agent chooses, in each state, the action expected to give him the best rewards.
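As a small sketch of what “a distribution over actions” can look like, here is a softmax policy in Python. Softmax is just one common way to turn preferences into probabilities, and it is my assumption here, as are the action names and preference numbers.

```python
import math
import random


def softmax_policy(preferences, temperature=1.0):
    """Turn action preferences into a probability distribution over actions."""
    top = max(preferences.values())  # subtract the max for numerical stability
    exps = {a: math.exp((p - top) / temperature) for a, p in preferences.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def sample_action(policy, rng):
    """Draw one action according to the policy's probabilities."""
    actions = list(policy)
    weights = [policy[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


# Hypothetical preferences for the "home" state.
preferences = {"read_paper": 1.0, "watch_netflix": 2.0, "go_for_a_run": 0.0}
policy = softmax_policy(preferences)

rng = random.Random(0)
action = sample_action(policy, rng)
```

Raising the preference for `read_paper` shifts probability toward it, while a lower `temperature` makes the policy stick more firmly to whatever it currently prefers.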

To me, it sounds like a habit. When you set foot in a “home” state, what’s the first thing you do? What are you likely to do when you wake up? What is your policy? What’s your probability of choosing to read a new research paper over watching Netflix? We run into hundreds of different states every day, decide which action to choose, and move on to the next state. It is your policy that determines the outcome in the long run. Without careful thought, it is your unconscious that dictates your every action, a fate waiting for you.

An agent that doesn’t update its policy, or that keeps choosing an action destined to ruin the game, clearly isn’t learning.




  1. It’s all about learning. An agent that beat the Go champion learned from nothing through millions of rounds of trial and error.
  2. Do exploration to step outside your comfort zone.
  3. Know your value function.
  4. Examine your policy: it’s the habit that determines your fate.