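Skills Training
(1) What is a support vector machine?
(2) Briefly describe the hard-margin support vector machine.
(3) Briefly describe the nonlinear support vector machine.
(4) Hands-on practice with the support vector machine (SVM) big-data algorithm.
① Assignment objectives.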
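The aim is for students to understand what hard-margin, soft-margin, and nonlinear support vector machines mean and where each is applied; to see how four kernel functions (Linear Kernel, Polynomial Kernel, Gaussian Kernel, and Sigmoid Kernel) affect the behavior of the learner, and to appreciate their similarities and differences; and thereby to deepen their understanding of how the SVM variants in the Orange platform implement classification.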
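To make the four kernels concrete, here is a minimal sketch of each one in plain Python. The formulas follow the common LIBSVM-style definitions; the parameter names `g`, `c`, `d` and their default values are assumptions chosen for illustration, not settings prescribed by this assignment.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    # K(x, y) = x . y
    return dot(x, y)


def polynomial_kernel(x, y, g=1.0, c=1.0, d=3):
    # K(x, y) = (g * x . y + c) ** d
    return (g * dot(x, y) + c) ** d


def gaussian_kernel(x, y, g=0.5):
    # K(x, y) = exp(-g * ||x - y||^2), also called the RBF kernel
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-g * squared_distance)


def sigmoid_kernel(x, y, g=0.5, c=0.0):
    # K(x, y) = tanh(g * x . y + c)
    return math.tanh(g * dot(x, y) + c)


# Two illustrative points (hypothetical values).
x, y = [1.0, 2.0], [2.0, 0.0]
kernel_values = {
    "linear": linear_kernel(x, y),
    "polynomial": polynomial_kernel(x, y),
    "gaussian": gaussian_kernel(x, y),
    "sigmoid": sigmoid_kernel(x, y),
}
for name, value in kernel_values.items():
    print(f"{name:10s} {value:.4f}")
```

The linear kernel is an unbounded inner product, the Gaussian kernel always lies in (0, 1] and equals 1 when the two points coincide, and the sigmoid kernel is bounded in (-1, 1). These differences in range and shape are one reason the four learners can behave differently on the same data.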
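② Assignment preparation.
Download and install the Orange3 software.
The source data consists of three files: adult-data.txt (training set), adult-test.txt (test set), and adult-attribute.txt (data source and attribute descriptions).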


| This data was extracted from the census bureau database found at
| http://www.census.gov/ftp/pub/DES/www/welcome.html
| Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random).
| 48842 instances, mix of continuous and discrete (train=32561, test=16281)
| 45222 if instances with unknown values are removed (train=30162, test=15060)
| Duplicate or conflicting instances : 6
| Class probabilities for adult.all file
| Probability for the label '>50K' : 23.93% / 24.78% (without unknowns)
| Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)
|
| Extraction was done by Barry Becker from the 1994 Census database. A set of
| reasonably clean records was extracted using the following conditions:
| ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)).
|
| Prediction task is to determine whether a person makes over 50K
| a year.
|
| C4.5 : 84.46+-0.30
| Naive-Bayes : 83.88+-0.30
| NBTree : 85.90+-0.28
|
| Following algorithms were later run with the following error rates,
| all after removal of unknowns and using the original train/test split.
| All these numbers are straight runs using MLC++ with default values.
|
|    Algorithm               Error
| -- ----------------        -----
|  1 C4.5                    15.54
|  2 C4.5-auto               14.46
|  3 C4.5 rules              14.94
|  4 Voted ID3 (0.6)         15.64
|  5 Voted ID3 (0.8)         16.47
|  6 T2                      16.84
|  7 1R                      19.54
|  8 NBTree                  14.10
|  9 CN2                     16.00
| 10 HOODG                   14.82
| 11 FSS Naive Bayes         14.05
| 12 IDTM (Decision table)   14.46
| 13 Naive-Bayes             16.12
| 14 Nearest-neighbor (1)    21.42
| 15 Nearest-neighbor (3)    20.35
| 16 OC1                     15.04
|
| Description of fnlwgt (final weight):
|
| The weights on the CPS files are controlled to independent estimates of the
| civilian noninstitutional population of the US. These are prepared monthly
| for us by Population Division here at the Census Bureau. We use 3 sets of
| controls.
| These are:
| 1. A single cell estimate of the population 16+ for each state.
| 2. Controls for Hispanic Origin by age and sex.
| 3. Controls by Race, age and sex.
age: continuous.
workclass: Private,Self-emp-not-inc,Self-emp-inc,Federal-gov,Local-gov,State-gov,Without-pay,Never-worked.
fnlwgt: continuous.
education: Bachelors,Some-college,11th,HS-grad,Prof-school,Assoc-acdm,Assoc-voc,9th,7th-8th,12th,Masters,1st-4th,10th,Doctorate,5th-6th,Preschool.
education-num: continuous.
marital-status: Married-civ-spouse,Divorced,Never-married,Separated,Widowed,Married-spouse-absent,Married-AF-spouse.
occupation: Tech-support,Craft-repair,Other-service,Sales,Exec-managerial,Prof-specialty,Handlers-cleaners,Machine-op-inspct,Adm-clerical,Farming-fishing,Transport-moving,Priv-house-serv,Protective-serv,Armed-Forces.
relationship: Wife,Own-child,Husband,Not-in-family,Other-relative,Unmarried.
race: White,Asian-Pac-Islander,Amer-Indian-Eskimo,Other,Black.
sex: Female,Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States,Cambodia,England,Puerto-Rico,Canada,Germany,Outlying-US(Guam-USVI-etc),India,Japan,Greece,South,China,Cuba,Iran,Honduras,Philippines,Italy,Poland,Jamaica,Vietnam,Mexico,Portugal,Ireland,France,Dominican-Republic,Laos,Ecuador,Taiwan,Haiti,Columbia,Hungary,Guatemala,Nicaragua,Scotland,Thailand,Yugoslavia,El-Salvador,Trinadad&Tobago,Peru,Hong,Holand-Netherlands.
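a. Data source analysis.
First, examine where the source data comes from: it is published on an open machine-learning data repository website.
b. Analysis of data attributes and data configuration.
Data configuration: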
Split into train-test using MLC++ GenCVFiles (2/3, 1/3 random).
48842 instances, mix of continuous and discrete (train=32561, test=16281)
45222 if instances with unknown values are removed (train=30162, test=15060)
Class probabilities for adult.all file
Probability for the label '>50K' : 23.93% / 24.78% (without unknowns)
Probability for the label '<=50K' : 76.07% / 75.22% (without unknowns)
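The figures in the data configuration can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted above; the variable names are chosen for illustration.

```python
# Instance counts quoted in the data configuration.
train_all, test_all = 32561, 16281
train_clean, test_clean = 30162, 15060

total = train_all + test_all              # all instances
clean_total = train_clean + test_clean    # instances without unknown values
removed = total - clean_total             # instances that contain unknowns
train_fraction = train_all / total        # should be close to 2/3

# Class percentages quoted above: (all instances, without unknowns).
pct_over_50k = (23.93, 24.78)
pct_at_most_50k = (76.07, 75.22)

print(total, clean_total, removed)
print(f"training share: {train_fraction:.4f}")
print([round(a + b, 2) for a, b in zip(pct_over_50k, pct_at_most_50k)])
```

The class percentages show that the two labels are imbalanced at roughly 1:3, so plain accuracy should be read with care when the learners are compared later.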
Attributes (15 in total, counting the class label; the 14 input attributes are listed below):
age:continuous.
workclass:Private,Self-emp-not-inc,Self-emp-inc,Federal-gov,Local-gov,State-gov,Without-pay,Never-worked.
fnlwgt:continuous.
education:Bachelors,Some-college,11th,HS-grad,Prof-school,Assoc-acdm,Assoc-voc,9th,7th-8th,12th,Masters,1st-4th,10th,Doctorate,5th-6th,Preschool.
education-num:continuous.
marital-status:Married-civ-spouse,Divorced,Never-married,Separated,Widowed,Married-spouse-absent,Married-AF-spouse.
occupation:Tech-support,Craft-repair,Other-service,Sales,Exec-managerial,Prof-specialty,Handlers-cleaners,Machine-op-inspct,Adm-clerical,Farming-fishing,Transport-moving,Priv-house-serv,Protective-serv,Armed-Forces.
relationship:Wife,Own-child,Husband,Not-in-family,Other-relative,Unmarried.
race:White,Asian-Pac-Islander,Amer-Indian-Eskimo,Other,Black.
sex:Female,Male.
capital-gain:continuous.
capital-loss:continuous.
hours-per-week:continuous.
native-country:United-States,Cambodia,England,Puerto-Rico,Canada,Germany,Outlying-US(Guam-USVI-etc),India,Japan,Greece,South,China,Cuba,Iran,Honduras,Philippines,Italy,Poland,Jamaica,Vietnam,Mexico,Portugal,Ireland,France,Dominican-Republic,Laos,Ecuador,Taiwan,Haiti,Columbia,Hungary,Guatemala,Nicaragua,Scotland,Thailand,Yugoslavia,El-Salvador,Trinadad&Tobago,Peru,Hong,Holand-Netherlands.
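The attribute list above maps directly onto column names when the text files are read programmatically. The following is a minimal sketch using only the Python standard library. It assumes the data files are comma-separated with one record per line and the class label in the last field. The label column name `income` and the two sample records are hypothetical, invented for illustration rather than taken from the files.

```python
import csv
import io

# The 14 attributes listed above, plus an assumed name for the class label.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",  # hypothetical column name for the '>50K' / '<=50K' label
]
CONTINUOUS = {"age", "fnlwgt", "education-num",
              "capital-gain", "capital-loss", "hours-per-week"}


def read_adult(handle):
    """Read records into dicts, converting continuous fields to int."""
    rows = []
    for fields in csv.reader(handle, skipinitialspace=True):
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        row = dict(zip(COLUMNS, fields))
        for name in CONTINUOUS:
            if row[name] != "?":  # leave unknown markers untouched
                row[name] = int(row[name])
        rows.append(row)
    return rows


# Hypothetical sample records in the assumed file layout.
SAMPLE = """\
35, Private, 100000, Bachelors, 13, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K
52, Self-emp-inc, 200000, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 5000, 0, 50, United-States, >50K

"""

rows = read_adult(io.StringIO(SAMPLE))
print(len(rows), rows[0]["age"], rows[1]["income"])
```

To read the real training file, `io.StringIO(SAMPLE)` would be replaced by `open("adult-data.txt", newline="")`, provided the file matches the assumed layout.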
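③ Assignment content.
The assignment has three parts:
● data organization and conversion;
● hands-on work on the Orange platform;
● writing an analysis report.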
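a. Data organization and conversion.
Generally, the downloaded data comes as txt files. A txt file is plain text with no formatting, so it is hard to read at a glance and inconvenient to work with in Orange; it therefore needs to be converted and preprocessed.
● Open the txt files in Excel.
Requirement: create two Excel datasets, one for the training set and one for the test set; the file names are up to you.
● Preprocess the data.
Requirement: add a header row with column titles, and use filtering to bulk-delete the records that contain the "?" character.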
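The assignment asks for this preprocessing to be done in Excel. For readers who want to check their result, the same two steps (add a header row, drop every record containing "?") can be sketched as a script. This is an illustrative equivalent, not the required method. It assumes comma-separated input, and the `income` column name and the sample records are hypothetical.

```python
import csv
import io

HEADER = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "income",  # hypothetical name for the class label column
]


def clean_adult(src, dst):
    """Write a header plus every complete record with no '?' field to dst.

    Returns (kept, dropped) record counts.
    """
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(HEADER)
    kept = dropped = 0
    for fields in csv.reader(src, skipinitialspace=True):
        if len(fields) != len(HEADER):
            continue  # blank or malformed line, not a record
        if "?" in fields:  # any field equal to the unknown marker
            dropped += 1
            continue
        writer.writerow(fields)
        kept += 1
    return kept, dropped


# Hypothetical sample records; the second one has unknown values.
SAMPLE = """\
35, Private, 100000, Bachelors, 13, Never-married, Sales, Not-in-family, White, Female, 0, 0, 40, United-States, <=50K
28, ?, 150000, HS-grad, 9, Never-married, ?, Own-child, Black, Male, 0, 0, 30, United-States, <=50K
52, Self-emp-inc, 200000, Masters, 14, Married-civ-spouse, Exec-managerial, Husband, White, Male, 5000, 0, 50, United-States, >50K
"""

out = io.StringIO()
kept, dropped = clean_adult(io.StringIO(SAMPLE), out)
cleaned_lines = out.getvalue().splitlines()
print(kept, dropped, len(cleaned_lines))
```

On the full files, the kept counts can be compared with the figures quoted in the data description (train=30162, test=15060 after unknowns are removed) as a sanity check on the Excel filtering.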
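b. Hands-on work on the Orange platform.
The overall requirement is to build a separate learner for each of the four kernel functions and compare how well the learners perform. The workflow should be complete, its logic clear, and its output reasonable. Key points:
▶ set up learners for the four kernel functions;
▶ deploy the training and test sets sensibly;
▶ tune the learners by adjusting the penalty term and the other parameter settings;
▶ make sure the datasets are correctly assigned on the connection links between widgets, with no errors reported;
▶ use the visualization widgets to display the support vectors.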
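The workflow itself is built graphically in Orange, but the same comparison can be sketched in a script to see what each step does. The sketch below uses scikit-learn's `SVC` rather than Orange widgets, and a synthetic dataset from `make_classification` stands in for the adult data, which would first need its categorical attributes encoded numerically. The penalty value `C=1.0`, `gamma="scale"`, and the 2/3 to 1/3 split are illustrative assumptions to be tuned, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the adult dataset).
X, y = make_classification(n_samples=600, n_features=6, n_informative=4,
                           n_redundant=0, random_state=0)

# Mirror the 2/3 train, 1/3 test split described in the data configuration.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0, stratify=y)

results = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    # Scale features first: SVMs are sensitive to feature magnitudes.
    model = make_pipeline(StandardScaler(),
                          SVC(kernel=kernel, C=1.0, gamma="scale"))
    model.fit(X_train, y_train)
    svc = model.named_steps["svc"]
    results[kernel] = {
        "accuracy": model.score(X_test, y_test),
        # n_support_ holds the number of support vectors per class.
        "n_support": int(svc.n_support_.sum()),
    }

for kernel, r in results.items():
    print(f"{kernel:8s} accuracy={r['accuracy']:.3f} "
          f"support vectors={r['n_support']}")
```

Re-running the loop with different values of `C` shows the effect of the penalty term: a larger `C` penalizes margin violations more heavily and typically changes both the test accuracy and the number of support vectors.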
