# Artificial Intelligence Info

These are questions and answers from AIT. Questions are in italics, may be out-of-order of lecture chronology.

## lecture 5 Quizzes Q&A

### What is learning?

Induction learning, learning from a set of observations, about a model with unknown distributions.

Bayesian learning, learning from the observations from a prior belief. Posterior probability = P(H|d), d=available data. maximizing the H function that is used. Learning by inference => P(H|d) = P(d|H)P(H)/P(d). After that you act, deciding on the utilities and probabilities of the next state.

Drawback of Maximum Likelihood: prone to overfitting. A better alternative is Maximum a Posteriori (MAP), takes priors into account. h_map = max_h P(d|h)P(h). MAP can still overfit.

### Machine Learning 101

Which is an example of supervised learning? Clustering or Regression?

Which is the ultimate objective of Machine Learning?
Approximating the original (unknown) function f with an hypothesis h such that h generalizes well also on unseen data.

True of false: Having a training loss equal to 0 guarantees the same result on the test set.

#### Hidden variables

Which is an advantage of using hidden variables? It greatly reduces the number of parameters requested by our model.

#### EM algorithm

Which two steps take turns in the EM algorithm? Estimating the value of hidden variables given the current values of parameters and updating the parameters using the estimated values.

How can we simplify the Expectation step of the EM algorithm when applied to HMMs? It is equal to computing the pairwise marginal probabilities of the current and next state at each timestamp.

## Lecture 6: Markov Decision Process

### Sequential Decision Processes

Which n-tuple does formally define a Markov Decision Process? A 4-tuple: (S, A, P, R) where S is the set of states, A is the set of actions, P is the transition probability from a state to another taking a certain action, R is the reward function for moving from a state to another through a certain action.

How can we define the best policy ${\displaystyle \pi ^{*}(s)}$? It is the function mapping every state to the best action we can take (i.e., that maximises the expected reward for that state). Function: max(U(s,a)).

Why might we need a discounted utility function? By using a discounted utility function we set a priority on maximising the immediate rewards over the future ones.

What can we compute by using the Bellman equation? The utility of states.

How many Bellman equations do we have to solve? We have to solve n nonlinear equations, where n is the number of states.

### Dynamic Programming

How are the utility values initialized in Value Iteration? Randomly.

How are the utility values updated in Value Iteration? Using the Bellman equation for every state utility value.

In the Dynamic Programming approach, from which states do we start computing the expected utilities? From the terminal ones, then we “climb back” the state-action tree.

Which is a problem of Dynamic Programming when computing utilities of states? In environments with many states and actions it does not scale well and becomes very expensive.

### Solving MDPs - Policy Iteration

Which are the two steps alternating in Policy Iteration? Policy evaluation and policy improvement.

What is a drawback of policy iteration? It is expensive to compute because we need to repeat the evaluation at each iteration.

How can we improve policy iteration? By doing just a few evaluation steps: as long as we update all the states every once in a while, the algorithm will converge anyway.

### Markov assumptions, recap MDPs and intro to POMDPs

What is the Markov property (applied to the state transition probabilities)? The conditional probability distribution of the future states given present and past states only depends on the current state of the system: ${\displaystyle p(s_{\text{t+1}}|s_{t},s_{\text{t-1}},...,s_{0})=p(s_{\text{t+1}}|s_{t})}$

### Online Planning

Which is an advantage of Online Planning compared to the techniques we have seen so far? We do not need to use a tabular policy representation, which does not scale well.

### Monte Carlo Tree Search

Which two ideas does MCTS implement to improve over the DP approach? It constructs a sampled states-action tree incrementally and it focuses only on the most promising regions.

Why do we require a good exploration strategy when implementing MCTS? Because we need to balance exploitation and exploration such that we try many actions (to have a good estimate of the expected values) and also focus on the best tree branches (to reach the best value).

## Lecture 7: Partially Observable Markov Decision Process

### POMDP, observervation model

What are the two major problems with idealized MDPs that POMDPs aim to solve? MDPs can be successfully applied if there is an observable Markovian state which typically is difficult(impossible) to obtain.

1. Real agents do not directly observe the state of the environment, they do it through their sensors.
2. Sensors could be noisy and may not give the full picture of the environment

How does the observation model help in dealing with the issues of partial observability? The observation model tries to capture the distribution of the observations conditioned on the state and action allowing us to factor in the noise of the sensors that result in the observation. Now the transition model is no longer single layered based on the state and action, but also the observation that the agent made based on that state.

For POMDP, does keeping track of more than just the previous state and action beneficial for optimal decision making? Why? Yes, it is beneficial to keep a track of more states and actions in memory. Since this is not the ideal markovian setting, the agent typically has little information about the current state. Tracking a number of past states, actions and rewards could help the agent “know” which state it is in.

What is a typical way of encapsulating the previous state and actions of the agent in a POMDP setting? POMDP agents maintain a belief over all states of the environment. A belief b(s) of an agent being in a particular state is the probability with which it “thinks” it is in state s. Beliefs over all states add up to 1.

## Lecture 8 (W11) Reinforcement Learning

### General RL

Which is the objective of Reinforcement Learning? Learning a policy optimizing its expected value.

Which is the approach of Model Based RL? We learn the model (transition and reward functions) interacting with the environment, and then we compute the optimal policy using the model.

Which is the approach of Model Free RL? We learn the optimal policy directly using experience provided by the environment.

### Passive Learning

When using Monte Carlo methods, what is the main drawback? It usually has high variance and needs many samples to improve the accuracy.

What is an advantage of TD-learning? We can update our estimate as soon as we observe the next state, without waiting for the end of the episode.

### Active Learning

How do we implement an ε-greedy strategy? We take a random action with probability ε and the greedy action with probability 1-ε.

What is an on-policy method? An on-policy method (like Sarsa) evaluates and improves the same policy used by our algorithm to make decisions.

What is an off-policy method? An off-policy method (like Q-learning) evaluates and improves a policy different from the one used to make decisions.

### Function Approximation

Why would we need to use function approximation? Tabular representations do not scale well in large environments (many states/actions, possibly continuous space): we need to approximate value functions without keeping track of all possible state-action pairs.

What is a problem when using non-linear function approximation with Q-learning? Convergence is no more guaranteed.

### Policy Search

Which is the founding idea of policy gradient methods? Directly parametrize the policy function.

Which are the roles of actor and critic in the omonimous methods? Actor selects the actions (policy structure) and the critic uses the value function to give feedback on the actions taken.