Codeforces科学刷题指南

{"managingGroups":{},"collectingGroups":[],"author":"wyyweb","updateTime":1575160901000,"title":"Codeforces科学刷题指南","dislikeCnt":0,"content":"简要介绍如何科学地刷算法题，来提高自己解决问题的能力，并利用爬虫抓取Codeforces的题库，来分析题目难度以及算法分类的关系\n\n无论做什么事，多尝试、找套路、然后刻意练习都是至关重要的。对信息科学竞赛（Olympiad in Informatics）爱好者来说，找套路的关键就是多刷题。然而题海茫茫，单以Codeforces来说，截止2017年1月3日，总共有3206道题。换言之，如果一个人足够勤奋，能够一天刷三道题，那也得快三年才能把题目刷完，而且题目数量还在扩充。所以盲目的刷题简直是浪费生命，本人从16年上半年一直按照题目解决人数从高到低排序，不断的刷水题。显然易见，刷水题的后果就是没有长进，熟悉的还是熟悉，不懂的还是不懂，唯一让自己开心的就是刷题数量的累积。所以科学刷题的本质在于不断挑战新高度，在一个平台练习足够久足够熟练之后，就要进入下一个难度平台。为了方便大家，我把Codeforces上截止2017年1月3日的所有题目的基本信息用爬虫收集了下来，并存储到excel里。更进一步，本文试图分析不同算法在不同难度等级上的出现频率分布，以及不同算法在不同难度等级上被解决次数的分布。最后，我会简要介绍的我的刷题观，以及如何爬取Codeforces上的信息。\n\n\n先说结论\n一张图\nCodeforces_Algorithms_Tag_Frequency.jpgCodeforces_Algorithms_Tag_Frequency.jpg\n上面这张图反映了不同算法（第一列）在不同问题难度（第一行）上的频率分布，基于该图，大概就可以知道在什么样的水平下应该掌握什么样的算法。不过这里我没有区分Div1和Div2之间的差别，仅仅是按照题号（A、B、C等等）来推断难度。可以看到对简单的A题而言，大部分都是考察基本的编程功底，诸如implementation（大概就是题目说什么，你做什么就是了），math（四则运算、取模取整等等）以及brute force（暴力枚举）。而随着难度的增加，比如说E题，主要就在于考察对dp（动态规划），data structures（数据结构）。当然了，从图中也可以看出，高难度题目主要在math，geometry（计算几何），shortest \npath（图论）以及games（博弈）上。下面再免费附送领一张图，反映了不同算法在不同问题难度上被解决次数的频率分布。\n\nCodeforces_Algorithms_Tag_Solved.jpgCodeforces_Algorithms_Tag_Solved.jpg\n\n一张表\nCodeforces-ProblemSet.jpgCodeforces-ProblemSet.jpg\n然后祭上刷题目录，也就是这一张表，汇总了截止2017年1月3日Codeforces题目上的所有算法题。基于这张表，一来可以按照解决人数来进行刷题，二来可以按照题目难度进行刷题，三来还可以进行主题刷题。具体的文件下载链接请见文末。\n\n\n我的刷题观\n这里搬来我在知乎上的回答，详见 LeetCode按照怎样的顺序来刷题比较好？\n\n如果想提升自己的思维能力 ，可以按照AC率或者解决人数由低到高二分查找匹配自己当前水平难度的题目，然后适当挑战高难度题（二分时间复杂度是  ，至少比从易到难的  节省时间）\n如果想巩固某一专题 ，那自然应该按照tag来刷题，但是因为所用的方法在求解前已知，不太利于思维能力的提升\n如果什么都不懂 ，那么建议随机刷题，一来可以涨见识，二来进步空间比较大\n如果想提高AC率或者增加自信 ，那么建议刷水题\n混搭以上策略 ，比如针对某一专题，然后用二分查找来选择问题求解\n再有个建议，题目如果太难超过自己当前能力的话，尝试一定时间后还是老老实实看题解吧，人与人之间还是有天赋差别的，但区别在于经验可以慢慢积累。特别是即使做对题之后，还要想尽办法看有没有提高的余地，并参考别人的代码，看如何精简代码以及精简时间空间复杂度。\n\ntourist.jpgtourist.jpg\n据说大神们的刷题量都是上万的，所以正式比赛里可以看到诸多大神不到一分钟就秒了一道题，手速太快。对Competitive Programming而言，把题目做对是基本要求（题目太难则另当别论），用更快的速度求解才是顶尖高手之间的核心区别。如果说真的有天赋存在的话，那我们也无能为力；但希望能像卖油翁一样说出，『无他，但手熟尔』。\n\n\n\n如何用爬虫获取信息\n必要的库\n1: import re\n2: import urllib.request\n3: from bs4 import BeautifulSoup\n4: import os\n5: import csv\n6: import time\n爬取Codeforces的所有算法题\n 1: #%% retrieve the problem set\n 2: def spider(url):\n 3:     response \u003d urllib.request.urlopen(url)\n 4:     soup \u003d BeautifulSoup(response.read())\n 5:     pattern \u003d {\u0027name\u0027: \u0027tr\u0027}\n 6:     content \u003d soup.findAll(**pattern)\n 7:     for row in content:\n 8:  item \u003d row.findAll(\u0027td\u0027)\n 9:  try:\n10:         # get the problem id\n11:        id \u003d item[0].find(\u0027a\u0027).string.strip()\n12:       col2 \u003d item[1].findAll(\u0027a\u0027)\n13:         # get the problem title\n14:         title \u003d col2[0].string.strip()\n15:         # get the problem tags\n16:      tags \u003d [foo.string.strip() for foo in col2[1:]]\n17:        # get the number of AC submissions\n18:      solved \u003d re.findall(\u0027x(\\d+)\u0027, str(item[3].find(\u0027a\u0027)))[0]\n19:       # update the problem info\n20:       codeforces[id] \u003d {\u0027title\u0027:title, \u0027tags\u0027:tags, \u0027solved\u0027:solved, \u0027accepted\u0027:0,}\n21:  except:\n22:      continue\n23:     return soup\n24: \n25: codeforces \u003d {}\n26: wait \u003d 15 # wait time to avoid the blocking of spider\n27: last_page \u003d 33 # the total page number of problem set page\n28: url \u003d [\u0027http://codeforces.com/problemset/page/%d\u0027 % page for page in range(1,last_page+1)]\n29: for foo in url:\n30:     print(\u0027Processing URL %s\u0027 % foo)\n31:     spider(foo)\n32:     print(\u0027Wait %f seconds\u0027 % wait)\n33:     time.sleep(wait)\n标记已解决的算法题\n 1: #%% mark the accepted problems\n 2: def accepted(url):\n 3:     response \u003d urllib.request.urlopen(url)\n 4:     soup \u003d BeautifulSoup(response.read())\n 5:     pattern \u003d {\u0027name\u0027:\u0027table\u0027, \u0027class\u0027:\u0027status-frame-datatable\u0027}\n 6:     table \u003d soup.findAll(**pattern)[0]\n 7:     pattern \u003d {\u0027name\u0027: \u0027tr\u0027}\n 8:     content \u003d table.findAll(**pattern)\n 9:     for row in content:\n10:     try:\n11:         item \u003d row.findAll(\u0027td\u0027)\n12:      # check whether this problem is solved\n13:      if \u0027Accepted\u0027 in str(row):\n14:        id \u003d item[3].find(\u0027a\u0027).string.split(\u0027-\u0027)[0].strip()\n15:        codeforces[id][\u0027accepted\u0027] \u003d 1\n16:  except:\n17:      continue\n18:     return soup\n19: \n20: wait \u003d 15 # wait time to avoid the blocking of spider\n21: last_page \u003d 10 # the total page number of user submission\n22: handle \u003d \u0027Greenwicher\u0027 # please input your handle\n23: url \u003d [\u0027http://codeforces.com/submissions/%s/page/%d\u0027 % (handle, page) for page in range(1, last_page+1)]\n24: for foo in url:\n25:     print(\u0027Processing URL %s\u0027 % foo)\n26:     accepted(foo)\n27:     print(\u0027Wait %f seconds\u0027 % wait)\n28:     time.sleep(wait)\n输出爬取信息到csv文本\n 1: #%% output the problem set to csv files\n 2: root \u003d os.getcwd()\n 3: with open(os.path.join(root,\"CodeForces-ProblemSet.csv\"),\"w\", encoding\u003d\"utf-8\") as f_out:\n 4:     f_csv \u003d csv.writer(f_out)\n 5:     f_csv.writerow([\u0027ID\u0027, \u0027Title\u0027, \u0027Tags\u0027, \u0027Solved\u0027, \u0027Accepted\u0027])\n 6:     for id in codeforces:\n 7:    title \u003d codeforces[id][\u0027title\u0027]\n 8:   tags \u003d \u0027, \u0027.join(codeforces[id][\u0027tags\u0027])\n 9:   solved \u003d codeforces[id][\u0027solved\u0027]\n10:     accepted \u003d codeforces[id][\u0027accepted\u0027]\n11:     f_csv.writerow([id, title, tags, solved, accepted])\n12:     f_out.close()\n分析题目难度以及算法分类的关系\n 1: #%% analyze the problem set\n 2: # initialize the difficult and tag list\n 3: difficult_level \u003d {}\n 4: tags_level \u003d {}\n 5: for id in codeforces:\n 6:     difficult \u003d re.findall(\u0027([A-Z])\u0027, id)[0]\n 7:     tags \u003d codeforces[id][\u0027tags\u0027]\n 8:     difficult_level[difficult] \u003d difficult_level.get(difficult, 0) + 1\n 9:     for tag in tags:\n10:   tags_level[tag] \u003d tags_level.get(tag, 0) + 1\n11: import operator\n12: tag_level \u003d sorted(tags_level.items(), key\u003doperator.itemgetter(1))[::-1]\n13: tag_list \u003d [foo[0] for foo in tag_level]\n14: difficult_level \u003d sorted(difficult_level.items(), key\u003doperator.itemgetter(0))\n15: difficult_list \u003d [foo[0] for foo in difficult_level]\n16: \n17: # initialize the 2D relationships matrix\n18: # matrix_solved: the number of AC submission for each tag in each difficult level\n19: # matrix_freq: the number of tag frequency for each diffiicult level\n20: matrix_solved, matrix_freq \u003d [[[0] * len(difficult_list) for _ in range(len(tag_list))] for _ in range(2)]\n21: \n22: \n23: # construct the 2D relationships matrix\n24: for id in codeforces:\n25:     difficult \u003d re.findall(\u0027([A-Z])\u0027, id)[0]\n26:     difficult_id \u003d difficult_list.index(difficult)\n27:     tags \u003d codeforces[id][\u0027tags\u0027]\n28:     solved \u003d codeforces[id][\u0027solved\u0027]\n29:     for tag in tags:\n30:     tag_id \u003d tag_list.index(tag)\n31:     matrix_solved[tag_id][difficult_id] +\u003d int(solved)\n32:   matrix_freq[tag_id][difficult_id] +\u003d 1\n\n下载本文源代码以及分析结果\n本文源代码以及分析结果请见 我的Github ，或者点击链接下载: https://pan.baidu.com/s/1o7P8oT8 密码: 8dcb。\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n        \u003c/div\u003e\n            \u003c/div\u003e\n        \u003c/article\u003e\n1\n2\n3\n","threadId":60218,"likeCnt":7,"createTime":1575160901000,"isWorkbook":false,"viewCnt":5558,"openness":2,"fav":false,"id":2111,"trustable":false}