$ cat arquivo1.txt
6|1000|121|999
1|1000|2000|3001
2|1000|2000|3001
3|2000|11|11
4| 100|22|1
5|1000|2000|4000
1000|10|11|12

$ cat arquivo2.txt
5
1000
7

$ cat arquivo3.txt
20

I want to output all lines from arquivo1.txt whose second field is not in arquivo2.txt and whose second field's first two characters are not in arquivo3.txt.


In this example, the output would be:


4| 100|22|1
1000|10|11|12

So I wrote the filter for arquivo2.txt:


$ awk -F'|' 'FNR==NR { a[$1]; next } !($2 in a)' arquivo2.txt arquivo1.txt

And I wrote the filter for arquivo3.txt:


$ awk -F'|' 'FNR==NR { a[$1]; next } !(substr($2,1,2) in a)' arquivo3.txt arquivo1.txt

Is it possible to combine these two commands into a single one?


All I need is performance, because these files are big (arquivo1.txt has 1 million lines, and arquivo2.txt and arquivo3.txt have 200k lines each). Is this the best approach to achieve the best response time?

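For what it's worth, the two commands as written can already be chained with a pipe (a sketch only; this still makes two passes over the arquivo1.txt data, so a single awk invocation may well be faster). The `-` tells the second awk to read from standard input:

```shell
# Chain the two existing filters: the first awk drops lines whose second
# field appears in arquivo2.txt, the second drops lines whose second
# field's first two characters appear in arquivo3.txt
awk -F'|' 'FNR==NR { a[$1]; next } !($2 in a)' arquivo2.txt arquivo1.txt |
awk -F'|' 'FNR==NR { a[$1]; next } !(substr($2,1,2) in a)' arquivo3.txt -
```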

3 Answers

#1



I have a solution, but it requires gawk (a plain awk solution is at the end of this post). Maybe it is usable.


Using a hash is a good idea: it makes each lookup run in constant time.

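As a tiny illustration of why this is fast: awk arrays are hash tables, and `key in a` is a single constant-time lookup which, unlike referencing `a[key]`, does not create the entry:

```shell
# `$1 in a` is an O(1) hash-table membership test on the awk array a
echo '1000' | awk 'BEGIN { a["1000"]; a["7"] } { if ($1 in a) print "hit"; else print "miss" }'
# prints "hit"
```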

awk -F\| '
  ARGIND == 1 {a[$1]=1;next}
  ARGIND == 2 {b[$1]=1;next}
  !($2 in a) && !(substr($2,1,2) in b)
' arquivo2.txt arquivo3.txt arquivo1.txt

Output:


4| 100|22|1
1000|10|11|12
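ARGIND is gawk-specific. Where gawk is not available, a similar single pass can be sketched by switching on FILENAME instead (this assumes the literal file names from the question):

```shell
# Portable variant: pick the lookup array by the file currently being read
awk -F'|' '
  FILENAME == "arquivo2.txt" { a[$1]; next }
  FILENAME == "arquivo3.txt" { b[$1]; next }
  !($2 in a) && !(substr($2,1,2) in b)
' arquivo2.txt arquivo3.txt arquivo1.txt
```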

I did some measurements. I generated the three files with the following awk script:


time awk ' BEGIN {
  for(i=0;i<1000000;++i) print i"|"i"|1000|123">"arquivo1.txt"
  for(i=0;i<200000;++i) print (i*10)>"arquivo2.txt"
  for(i=0;i<200000;++i) print (i*10+5)>"arquivo3.txt"
}' || exit 1

Then I measured the time needed to run the second script by adding time before awk, and I redirected the output to /dev/null so that printing to the screen was not measured. Here is the result of three independent runs:


$./test.sh
real    0m2.880s
user    0m2.816s
sys     0m0.044s
$./test.sh
real    0m2.931s
user    0m2.892s
sys     0m0.032s
$./test.sh
real    0m2.924s
user    0m2.864s
sys     0m0.040s

(The creation of the test files finished in 1.5 sec.) With 1 million rows in the input table and 2×200,000 rows in the filter tables, the script finishes in about 3 sec and prints 809,999 lines (so both conditions are evaluated at least that many times).


Is this what you expected, or is it still too slow? My machine is a somewhat old laptop with a Pentium(R) Dual-Core T4300 @ 2.10GHz CPU.


ADDED


Here is a slightly faster solution in plain (non-gawk) awk:


awk -F\| '
BEGIN {
  while((getline<"arquivo2.txt")>0) a[$0];
  while((getline<"arquivo3.txt")>0) b[$0];
}
!($2 in a) && !(substr($2,1,2) in b)
' arquivo1.txt

For the big test files the run time is:


real    0m2.544s
user    0m2.452s
sys     0m0.048s

real    0m2.458s
user    0m2.420s
sys     0m0.032s

real    0m2.493s
user    0m2.448s
sys     0m0.036s

So this runs in 2.5 sec.
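One caveat with the getline loops above: getline returns -1 on a read error (for example, a missing file), which the `> 0` test silently treats like end-of-file. A defensive variant could check the return value, and reading into a variable also avoids clobbering $0:

```shell
# getline returns 1 on success, 0 at EOF, -1 on error; check for -1 so a
# missing filter file aborts instead of silently filtering nothing
awk -F'|' '
BEGIN {
  while ((r = (getline line < "arquivo2.txt")) > 0) a[line]
  if (r < 0) { print "cannot read arquivo2.txt" > "/dev/stderr"; exit 1 }
  while ((r = (getline line < "arquivo3.txt")) > 0) b[line]
  if (r < 0) { print "cannot read arquivo3.txt" > "/dev/stderr"; exit 1 }
}
!($2 in a) && !(substr($2,1,2) in b)
' arquivo1.txt
```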


I hope this helps a bit!

