
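Skills Practice

(1) What is a support vector machine?

(2) Briefly describe the hard-margin support vector machine.

(3) Briefly describe the nonlinear support vector machine.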

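(4) Hands-on practice with the support vector machine (SVM) big-data algorithm.
① Purpose.
This assignment helps students understand what hard-margin, soft-margin, and nonlinear support vector machines mean and where each applies. Students also observe how four kernel functions (Linear Kernel, Polynomial Kernel, Gaussian Kernel, and Sigmoid Kernel) affect the behavior of a learner (Learner) and compare their similarities and differences, which deepens their understanding of how the support vector machines on the Orange platform implement classification.
② Preparation.
Download and install Orange3.
The source data consists of three files: adult-data.txt (training set), adult-test.txt (test set), and adult-attribute.txt (data source and attribute descriptions).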
| This data was extracted from the census bureau database found at
| http://www.census.gov/ftp/pub/DES/www/welcome.html
| Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random).
| 48842 instances, mix of continuous and discrete (train=32561, test=16281)
| 45222 if instances with unknown values are removed (train=30162, test=15060)
| Duplicate or conflicting instances : 6
| Class probabilities for adult.all file
| Probability for the label '>50K' : 23.93% / 24.78% (without unknowns)
| Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)
|
| Extraction was done by Barry Becker from the 1994 Census database. A set of
| reasonably clean records was extracted using the following conditions:
| ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)).
|
| Prediction task is to determine whether a person makes over 50K
| a year.
|
| C4.5 : 84.46+-0.30
| Naive-Bayes : 83.88+-0.30
| NBTree : 85.90+-0.28
|
| Following algorithms were later run with the following error rates,
| all after removal of unknowns and using the original train/test split.
| All these numbers are straight runs using MLC++ with default values.
|
|    Algorithm               Error
| -- ----------------        -----
| 1  C4.5                    15.54
| 2  C4.5-auto               14.46
| 3  C4.5 rules              14.94
| 4  Voted ID3 (0.6)         15.64
| 5  Voted ID3 (0.8)         16.47
| 6  T2                      16.84
| 7  1R                      19.54
| 8  NBTree                  14.10
| 9  CN2                     16.00
| 10 HOODG                   14.82
| 11 FSS Naive Bayes         14.05
| 12 IDTM (Decision table)   14.46
| 13 Naive-Bayes             16.12
| 14 Nearest-neighbor (1)    21.42
| 15 Nearest-neighbor (3)    20.35
| 16 OC1                     15.04
|
| Description of fnlwgt (final weight):
| The weights on the CPS files are controlled to independent estimates of the
| civilian noninstitutional population of the US. These are prepared monthly
| for us by Population Division here at the Census Bureau. We use 3 sets of
| controls.
| These are:
| 1. A single cell estimate of the population 16+ for each state.
| 2. Controls for Hispanic Origin by age and sex.
| 3. Controls by Race, age and sex.

age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
a. Data source analysis.
First identify where the data comes from: an open machine learning repository website.
b. Data attributes and data configuration.
Data configuration:
Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random).
48842 instances, mix of continuous and discrete (train=32561, test=16281)
45222 if instances with unknown values are removed (train=30162, test=15060)
Class probabilities for adult.all file
Probability for the label '>50K' : 23.93% / 24.78% (without unknowns)
Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)
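The counts in the data configuration can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above. The variable names are illustrative. It may also serve as a sanity check on the row counts that remain after the unknown-value records are filtered out.

```python
# Counts quoted in the data configuration above.
total, train, test = 48842, 32561, 16281
clean_total, clean_train, clean_test = 45222, 30162, 15060

# The train/test parts should add up to the stated totals.
assert train + test == total
assert clean_train + clean_test == clean_total

# Records dropped because they contain unknown values.
removed_train = train - clean_train
removed_test = test - clean_test
removed_total = total - clean_total

print(f"train share of all records: {train / total:.4f}")
print(f"records dropped for unknown values: {removed_total} "
      f"({removed_total / total:.2%})")
print(f"  from training set: {removed_train}, from test set: {removed_test}")
```

If the filtered training and test sheets do not end up with 30162 and 15060 rows (excluding the header), the filter most likely missed or over-matched some records.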
Feature attributes (15 columns in total: the 14 attributes below plus the income class label):
age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
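The attribute names above can double as the header row of the cleaned dataset. The following is a minimal sketch of that preprocessing in Python's standard library rather than Excel, under several assumptions: the files are comma-separated, unknown values are marked with "?" (as the preprocessing step of this assignment indicates), and the label column is called `income`, a name chosen here because the attribute file leaves it unnamed. The two sample records are made up for illustration and are not rows from the dataset.

```python
import csv
import io

# Column names from the attribute list; "income" is an assumed name
# for the class label column.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",
]

def load_clean_rows(text_file):
    """Return (header, rows), skipping blank lines and rows containing '?'."""
    rows = []
    for raw in csv.reader(text_file, skipinitialspace=True):
        fields = [field.strip() for field in raw]
        if len(fields) != len(COLUMNS):
            continue  # blank or malformed line
        if "?" in fields:
            continue  # unknown value: drop the whole record
        rows.append(fields)
    return COLUMNS, rows

# Two made-up records in the assumed comma-separated layout.
sample = io.StringIO(
    "41, Private, 100000, Bachelors, 13, Never-married, Sales, "
    "Not-in-family, White, Female, 0, 0, 40, United-States, <=50K\n"
    "\n"
    "52, ?, 200000, HS-grad, 9, Married-civ-spouse, ?, "
    "Husband, Black, Male, 0, 0, 45, United-States, >50K\n"
)
header, rows = load_clean_rows(sample)
print(header)
print(len(rows), "record(s) kept")
```

To process the real files, the `io.StringIO` sample would be replaced by `open("adult-data.txt", newline="")`, and the result could be written back with `csv.writer` for loading into Orange.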
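③ Assignment content.
The assignment has three parts:
● Data cleaning and conversion;
● Hands-on work on the Orange platform;
● Writing the analysis report.
a. Data cleaning and conversion.
The downloaded data comes as txt files. A txt file is plain text with no formatting, so it is hard to read and awkward to work with in Orange. It therefore needs to be converted and preprocessed.
● Open the txt files in Excel.
Requirement: create two Excel datasets, one for the training set and one for the test set. File names are up to you.
● Preprocess the data.
Requirement: add a header row, then use a filter to delete in bulk every record that contains a "?" character.
b. Hands-on work on the Orange platform.
The overall requirement is to build one learner for each of the four kernel functions and compare how well the learners perform. The workflow should be complete, the logic clear, and the output reasonable. Key points:
▶ Set up a learner for each of the four kernel functions;
▶ Assign the training set and the test set sensibly;
▶ Tune each learner by adjusting the penalty term and the parameter settings;
▶ Connect the datasets to the correct ends of the workflow links so that no errors are reported;
▶ Use a visualization widget to display the support vectors.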
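To get a feel for how the four kernels differ before tuning them in Orange, it may help to evaluate them by hand on a pair of points. The sketch below is plain Python and is not Orange code. It assumes the common LIBSVM-style parameterization (`gamma`, `coef0`, `degree`), and the default values are arbitrary choices for illustration, not Orange's defaults.

```python
import math

def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    # K(x, y) = x . y
    return dot(x, y)

def polynomial_kernel(x, y, gamma=1.0, coef0=1.0, degree=3):
    # K(x, y) = (gamma * x . y + coef0) ** degree
    return (gamma * dot(x, y) + coef0) ** degree

def gaussian_kernel(x, y, gamma=0.5):
    # K(x, y) = exp(-gamma * ||x - y||^2), also known as the RBF kernel
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(x, y, gamma=0.5, coef0=0.0):
    # K(x, y) = tanh(gamma * x . y + coef0)
    return math.tanh(gamma * dot(x, y) + coef0)

KERNELS = {
    "Linear": linear_kernel,
    "Polynomial": polynomial_kernel,
    "Gaussian": gaussian_kernel,
    "Sigmoid": sigmoid_kernel,
}

# Two arbitrary example points.
x, y = [1.0, 2.0], [3.0, 0.5]
for name, kernel in KERNELS.items():
    print(f"{name:10s} K(x, y) = {kernel(x, y):.4f}")
```

The Gaussian kernel depends only on the distance between the two points and equals 1 when they coincide, while the other three depend on the inner product. This is one reason the learners can behave so differently on the same data, and why rescaling continuous attributes such as fnlwgt can change the results noticeably.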



c. Write the data analysis report.