This page is an excerpt and archive of the corresponding thread on 未名空间 (MITBBS).
DataSciences board - NLP question (Python)
v*********9
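Posts: 1
1
Python beginner here.
I need to split a column. Each value in the column has two parts.
One part looks like
01jan2019_order_1977663:
or
877920_jan_19799"
What is a good way to write the regular expression for this kind of pattern?
I tried *_*_*" and *_*_*:
but neither works.
Thanks!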
j****w
Posts: 11
2
"877920_jan_19799".split("_")
or
re.split("_", "877920_jan_19799")

[In reply to v*********9]
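A note on why the *_*_*" attempts fail: in a regular expression, `*` is a quantifier that repeats the preceding token, not a shell-style wildcard, so a pattern that starts with `*` does not even compile in Python's `re`. The sketch below shows this, along with one possible regex for the two sample prefixes. The prefix shape (letters or digits, underscore, letters or digits, underscore, digits, then `:` or `"`) and the name `PREFIX` are assumptions drawn from the two examples only.

```python
import re

# In a regex, "*" repeats the previous token; it is not a wildcard.
# A pattern that begins with "*" has nothing to repeat, so compiling fails.
try:
    re.compile('*_*_*:')
    glob_style_compiles = True
except re.error:
    glob_style_compiles = False

# Assumed prefix shape, based only on the two samples in the question:
# <letters/digits>_<letters/digits>_<digits>, followed by ':' or '"'.
PREFIX = re.compile(r'^[0-9A-Za-z]+_[0-9A-Za-z]+_[0-9]+[:"]')

samples = ['01jan2019_order_1977663:', '877920_jan_19799"', 'no prefix here']
matches = [bool(PREFIX.match(s)) for s in samples]

print(glob_style_compiles)
print(matches)
```

The regex counterpart of a glob `*` is usually `.*` (any character, repeated), or a narrower class such as `[0-9A-Za-z]+` when the shape of the field is known.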

v*********9
Posts: 1
3
Oops, I didn't explain the problem clearly.
For example, one column holds customer comments, but the values look like this:
01JAN2019_order_3879940"I like this product, but the parkage is broken"
already replace an order and sent to customer
01mar2019_SAP_3879940:the parkage is broken, all things are mess
01JAN2019_order_3879940-3778"wrong color, I order golden, but comes yellow"
contacted customer to refund
01JAN2019_order_3879940
01JAN2019_dfegf_3879940"I like this product, but the parkage is broken"
already replace an order and sent to customer
01JAN2019_order_3879940:"I like this product, but the parkage is broken"
already replace an order and sent to customer
01JAN2019_order_3879940_"I like this product, but the parkage is broken"
already replace an order and sent to customer
it feels mold inside
color is not right" contacted with customer
What I want is to strip the leading part that is not the customer's comment. I've tried several things and none of them worked.
H**********f
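Posts: 2978
4
Sounds like you're already working as a DS or DA in e-commerce. In that case I'd advise you to properly learn the common string functions and
regular expressions. There isn't much to it, a day's work at most; otherwise this kind of task will keep coming up and you'll have to keep asking.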


[In reply to v*********9]

j****w
Posts: 11
5
import re
re.split('[0-9a-zA-Z]+_[a-zA-Z]+_[0-9]+_?', '01JAN2019_order_3879940_I like this product, but the parkage is broken"')[1]

[In reply to v*********9]

m******n
Posts: 453
6
This is data cleaning,
not NLP.
g*****g
Posts: 390
7
I'm learning Python too, so here's some practice:
"any_mon" can be used as a capturing group (it is non-capturing right now) to grab the date of the
feedback, if needed.
import re
text = '01JAN2019_order_3879940:"I like this product, but the parkage is broken"'
any_mon = "(?:Jan|Feb|Mar)"
# prefix: digits, month name, digits, "_", anything, "_", digits, then ':' or '"'
pattern = r'\d+{}\d+_.+_\d+[:"]'.format(any_mon)
res = re.split(pattern, text, flags=re.I)
if len(res) == 1:
    print('No Split')
else:
    print(res[1])  # output: "I like this product, but the parkage is broken"
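The remark about turning "any_mon" into a group could look roughly like the sketch below, which uses named groups to pull the date, the middle tag, and the order number out of the prefix. The group names, the helper `parse_prefix`, the fixed two-digit day and four-digit year, and the extension of the month list to all twelve months are assumptions for illustration, not part of the post above.

```python
import re

# Assumption: all twelve English month abbreviations may occur
# (the post above lists only Jan, Feb, Mar).
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# Named groups: date (e.g. 01JAN2019), tag (e.g. order, SAP), order number.
PREFIX = re.compile(
    r'^(?P<date>\d{2}(?:' + MONTHS + r')\d{4})_(?P<tag>[^_]+)_(?P<order>\d+)',
    re.IGNORECASE,
)

def parse_prefix(text):
    """Return (date, tag, order) if text starts with a prefix, else None."""
    m = PREFIX.match(text)
    if m is None:
        return None
    return m.group('date'), m.group('tag'), m.group('order')

text = '01JAN2019_order_3879940:"I like this product, but the parkage is broken"'
print(parse_prefix(text))
print(parse_prefix('contacted customer to refund'))
```

Anchoring with `^` and using `[^_]+` for the middle field keeps the match from running into the comment text, which a greedy `.+` can do when the comment itself contains underscores.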
c*****m
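Posts: 1160
15
Looking at your examples, my conclusion is:
If a line contains a double quote, delete everything before the double quote;
if a line contains a colon, delete the colon and everything before it;
if it has neither a double quote nor a colon, check whether it has an underscore; if it does, delete the whole line.
That's just three Python statements, and they will clean up the examples you posted.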
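The three rules above can be sketched as a small function. The name `clean_line` and the choice to act on the first quote or colon in the line are assumptions; the rules themselves are taken as stated, applied in the order given.

```python
def clean_line(line):
    """Apply the three rules from the post, in the order given."""
    if '"' in line:
        # Rule 1: drop everything before the first double quote (quote kept).
        return line[line.index('"'):]
    if ':' in line:
        # Rule 2: drop the first colon and everything before it.
        return line.split(':', 1)[1]
    if '_' in line:
        # Rule 3: no quote, no colon, but an underscore: prefix-only line.
        return ''
    return line

samples = [
    '01JAN2019_order_3879940"I like this product, but the parkage is broken"',
    '01mar2019_SAP_3879940:the parkage is broken, all things are mess',
    '01JAN2019_order_3879940',
    'already replace an order and sent to customer',
]
cleaned = [clean_line(s) for s in samples]
for c in cleaned:
    print(repr(c))
```

One caveat: a continuation line that holds only a closing quote, such as `color is not right" contacted with customer`, would lose the text before the quote under rule 1, so the rules may need a guard for lines without a prefix. For a pandas column, something like `df[col].map(clean_line)` might apply it row by row, where `df` and `col` are hypothetical names.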