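Overview:
Understanding the Q-learning algorithm requires first understanding Markov processes and reward functions. Q-learning replaces the plain value term in the value update with value(max), the maximum value available from the next state.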
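To make the value(max) idea concrete, here is a minimal sketch of a single update step. It assumes a deterministic setting in which taking action j from node i moves the agent to node j. The helper name `q_update` and the 3-node reward table are hypothetical and chosen only for illustration.

```python
import numpy as np

def q_update(Q, R, state, action, gamma=0.8):
    """One Q-learning backup, assuming action j deterministically leads to node j."""
    next_state = action
    # immediate reward plus the discounted best value reachable from the next node
    Q[state, action] = R[state, action] + gamma * Q[next_state].max()
    return Q[state, action]

# Hypothetical 3-node reward table (illustration only)
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 10.0],
              [0.0, 0.0, 0.0]])
Q = np.zeros_like(R)

q_update(Q, R, 1, 2)   # Q[1, 2] = 10 + 0.8 * max(Q[2]) = 10
q_update(Q, R, 0, 1)   # Q[0, 1] = 1 + 0.8 * max(Q[1]) = 9
print(Q)
```

The second call shows the effect of the max: node 0 already "sees" the reward two steps away because it bootstraps from the best entry in row 1.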
Code:
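import numpy as np

GAMA = 0.8     # discount factor
FINALLY = 5    # goal node

# Build a small 6*6 maze: R[i, j] is the reward for moving from node i to node j
R = np.random.randint(1, 100, [6, 6])

# Initialize the Q table (float dtype, so discounted values are not truncated to integers)
Q = np.zeros_like(R, dtype=float)

# Q-table update function
def updataq(i, j):
    try:
        Q[i, j] = R[i, j] + GAMA * Q[j].max()
        if j != FINALLY:
            # keep following the greedy action until the goal node is reached
            return updataq(j, Q[j].argmax())
    except RecursionError:
        # the greedy chain never reached the goal node; abandon this update
        pass

# Test function: follow the greedy action from node and record each visited node in ways
def findway(node):
    # the length check stops the walk if the greedy actions form a cycle
    if node != FINALLY and len(ways) < len(Q):
        way = int(Q[node].argmax())
        ways.append(way)
        return findway(way)

for _ in range(600):
    updataq(*np.random.randint(0, 6, 2))

ways = []
findway(2)
print(ways)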
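The recursive update stops a non-terminating chain only when Python's recursion limit is hit. As an alternative, the sketch below writes the same update as a loop with an explicit step cap. The names `train`, `max_steps`, and `seed` are assumptions added for illustration and are not part of the original code.

```python
import numpy as np

GAMMA = 0.8   # discount factor (same value as GAMA in the post)
GOAL = 5      # goal node (same value as FINALLY in the post)

def train(R, episodes=600, max_steps=50, seed=0):
    """Iterative variant of the update loop; max_steps replaces the recursion limit."""
    rng = np.random.default_rng(seed)
    n = R.shape[0]
    Q = np.zeros(R.shape, dtype=float)
    for _ in range(episodes):
        # start each episode from a random (node, action) pair
        i, j = rng.integers(0, n, size=2)
        for _step in range(max_steps):
            Q[i, j] = R[i, j] + GAMMA * Q[j].max()
            if j == GOAL:
                break
            # move to node j and take its current greedy action
            i, j = j, int(Q[j].argmax())
    return Q

R = np.random.default_rng(1).integers(1, 100, size=(6, 6))
Q = train(R)
print(Q.round(1))
```

Because every reward is below 100 and the discount is 0.8, no entry of `Q` can exceed 99 / (1 - 0.8) = 495, which is a quick sanity check on the result.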
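Test result:
The goal is to find a path from node 2 to node 5. Because R is generated randomly, the path differs from run to run; one run printed: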
[3, 1, 5]
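The printed path comes from repeatedly taking the argmax of a row of the Q table. A minimal sketch of that lookup as a loop follows. It assumes a hypothetical helper `greedy_path` and a hand-built table `Q_demo`, neither of which is in the original code. It also stops if a node is revisited, since a random reward table can produce greedy actions that cycle without reaching the goal.

```python
import numpy as np

def greedy_path(Q, start, goal):
    """Follow the argmax of each Q row from start; stop at goal or on the first revisit."""
    path, node, seen = [], start, {start}
    for _ in range(Q.shape[0]):
        if node == goal:
            break
        node = int(Q[node].argmax())
        path.append(node)
        if node in seen:
            break  # greedy actions form a cycle; the goal is not reachable this way
        seen.add(node)
    return path

# Hand-built table (illustration only) whose greedy actions are 2 -> 3 -> 1 -> 5
Q_demo = np.zeros((6, 6))
Q_demo[2, 3] = 1.0
Q_demo[3, 1] = 1.0
Q_demo[1, 5] = 1.0

print(greedy_path(Q_demo, 2, 5))   # [3, 1, 5]
```

A returned path that does not end in the goal node signals that the table has not learned a route from that start node.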
Author: 圣_狒司机
Link: https://www.jianshu.com/p/58d51a3f1e71