The Hacker Factor Blog: Tools, Techniques, and Tangents
Hannah Montana is Dead
Tuesday, March 30. 2010
I recently came across a new search engine called Spokeo. This is a data mining service for finding people. Although it is similar to an online phone book or people-search engines like Intelius, Veromi, and Zaba Search, Spokeo has a twist... it incorporates data from sources like social networks and mailing lists. Because of the data they harvest, the results from Spokeo are both inaccurate and hysterically funny.
I mean, seriously, most mailing lists have incorrect information about me. It isn't that I gave them wrong data. They just made up stuff. Most likely, some contractor was paid by the amount of data entered and just decided to enter garbage. The inaccurate information from mailing lists seems to grossly outweigh the accurate data. And I'm not alone... It seems like everyone I know receives letters for people who don't live at their addresses, letters addressed to the right person but with a misspelled name, or age-inappropriate offerings. For example, I keep getting letters addressed to my mother, asking her to enroll at the local junior college -- but she's never lived at my house and already has a college degree!
A few examples of things that I found on Spokeo:
Good thing Spokeo includes a disclaimer: "Profile data is derived from marketing surveys, consumer records, and public data sources and is not guaranteed to be 100% accurate." That's an understatement. I've looked up dozens of people and all of them had mostly inaccurate information. (Anything accurate seems coincidental.)

Unsubscribing
The nicest thing about Spokeo (besides the humor factor) is their unsubscribe policy. Getting your personal information removed is fast and painless. Click on the privacy link at the bottom, paste the URL to your bad entry, and give an email address. (I suggest using a throwaway email account.) They will email you a hyperlink that will remove the entry. There is actually no validation here. Anyone can remove anyone. But after two removals, you will need to choose a new throwaway email address.

Ashes to Ashes
The rush to gather and data mine these massive databases has led to some interesting problems. For example, Wired recently featured a story about Google. Google has a subsystem that intelligently identifies word correlations. For example, it managed to correlate that "puppies are dogs" and "boiling water is hot". But it also determined that a "hot dog" was a boiled puppy.
I can't wait until other public records are combined into massive, intelligent databases. For example, the name "Hannah Montana" is actually a registered trademark. Trademark 3,735,365 is owned by Disney Enterprises. And while it was filed in 2006, it was awarded this year. According to this trademark, "The name 'Hannah Montana' in the mark does not identify a living individual."
So while you used to be able to see Hannah Montana Live in concert, you can't anymore. The trademark says that Hannah Montana is not a living individual. No wonder Miley Cyrus is leaving Hannah Montana -- Hannah Montana is dead.
Then again, Spokeo says that there are 465 people named Hannah Montana in the United States. And all of them have the exact same photos on file.
Waiting for Godot
Tuesday, March 23. 2010
Whenever people hear that I analyze pictures, they always ask the same thing: Do you analyze videos?
While I do a little bit with videos (there's no real way to get away from it), I mainly focus on static pictures. As critical as I am toward image formats, video formats are on an entirely different (and chaotic) level.

Back to Basics
If you want to support image formats, how many different formats do you need to support? Keep in mind, there are literally hundreds of proprietary and lesser-used formats. However, there are only a handful of widely used image formats. If you support these, then you will support more than 90% of the images you will come across on the Internet: GIF, PNG, and JPEG.
Other formats are used in certain fields, but are rare outside of those fields. For example, BMP is common on older Windows systems, but you are unlikely to run into it on Linux or Unix systems. The bitmap formats (PPM, PGM, PBM, and XPM) are used in certain niches where people fear lossy formats (even though PNG is lossless and compressed). And in certain cases, there is a need to edit image data -- raw bitmaps can be better than PNG.

Matryoshka Images
Beyond the basic image formats are container formats. There are really three types of containers: mono-format, multi-format, and any-format. (My terms for them.)
The mono-format containers can hold many images, but all internal images are the same format. For example, the ICO format contains only BMPs. Windows Thumbs.db files only contain JPEGs (but depending on the version of Windows, they are either standard or non-standard JPEGs). And animated GIFs are really just a file containing a series of GIF images.
The multi-format containers hold many images, but permit a limited number of different formats. For example, the JPEG format (the JFIF and EXIF containers) can hold bitmaps, JPEG, and lossless JPEG images. TIFF can contain bitmaps or JPEGs. And PDFs can contain bitmaps, JPEG, and PNG images as well as vector graphics.
While each of these formats permits extensions to store other image formats, you are very unlikely to come across a JPEG with an embedded GIF or a PDF that contains a TIFF.
Finally, there are a few any-format containers. These are file formats that can hold anything, like Word and PPT documents.
Frankly, if you can support a dozen image and container formats, then you are likely to support the vast majority of image formats that you are ever going to run across.

Moving Pictures... Moving Targets
So supporting GIF, PNG, JPEG, BMP, TIFF, and a handful of other formats is enough to easily support more than 90% of the image formats you will come across. What about videos?
Video formats are a nightmare. There is a huge variety of formats and subformats that can be mixed and matched. What do you need to support in order to claim that 90% compatibility mark? MP4, MPG, AVI, FLV, WMV, RealMedia, DIVX... Oh, and don't forget DVDs! And even then, you might not be at 90%.
Each of the video formats I listed is actually a container format. There are a huge number of audio and video formats that can exist inside these containers. The MPlayer video player (MPlayer-1.0rc2) supports at least 279 different video codec variations and 131 different audio formats! And that's only counting the ones that are "working" -- there are dozens more that are under development. Simply trying to support "MPG" means you still need to support dozens of different audio and video codecs. Whereas, if you support PNG then you only need to support one format.
Most open source applications use MPlayer or FFmpeg as the basis for video format support (and MPlayer uses FFmpeg). This will easily get you over that 90% support mark. However, I've found the open source code to be poorly documented and very complicated to follow.
If you happen to hit some of the poorly supported formats, then your program is likely to crash, and even if you are a hard-core developer, you are unlikely to be able to trace the code and identify or fix problems. More importantly, there isn't a great way to tell these applications to not support some file formats.
Commercial applications actually have more control here. While QuickTime and Windows Media Player don't support every possible combination that MPlayer/FFmpeg support, they provide stronger support options. (For example, you can rewind and fast forward with QuickTime and Media Player, but seeking with FFmpeg is very inexact.)
Basically, there are so many video formats and codec combinations that you really have to make a choice: do you want to use open source and kind of support everything, or do you want to go closed source and support fewer formats but do them really well?

Better than Better
For people old enough, you know that records were replaced by CDs virtually overnight. While a few people still claim that records are better, the fact is, CDs are better in almost every way. They can store more audio, last longer, and offer good-enough sound (better than records when you take pops and scratches into account). Switching from records to CDs was a no-brainer.
With image formats, we still have a handful of competing formats but the battle for dominance is pretty much settled. Proprietary formats exist because they are proprietary, but few people are developing new formats for the purpose of being "better".
Unfortunately, the same cannot be said for video formats. There is no strongly dominant format and newer (but not better) formats keep coming out. Right now we're stuck with a hodgepodge of audio, video, and container formats.
Posted by Dr. Neal Krawetz in Forensics, Image Analysis, Programming at 20:12
Con Census
Sunday, March 21. 2010
I received my 2010 Census form last week. I was lucky: I got the short form. But there are so many things about the 2010 Census that bother me... is the census even needed anymore?
A Better Life
According to the flood of TV and radio commercials, the census is needed to help improve our way of life. One of the examples claims that the census ensures that schools have enough teachers. Huh?
The census is conducted every 10 years. Kids who were born in 2001 are already 9 years old and have been in school for over 4 years. The census doesn't tell schools how many students they will have. Instead, the number of students is known because households pay taxes to their school districts (number of potential families), hospitals track birth records (how many new students), real-estate sales track the number of incoming and outgoing households, and most importantly: school districts know that if they have x students this year, then they will likely have x students next year. There is always a little fluctuation, but it isn't in the hundreds of students between years.
The census may tell Congress how to allocate funds for schools, but it isn't the only method. Congress knows where the money should be spent because they get annual numbers from the individual states. Using the census to identify teacher shortages? That sounds bogus to me.

Taking the High Road
Another commercial says that the census will help cities determine which roads to fix. Again: the census is taken every 10 years. In less than 10 years, unfixed potholes can consume cars. And city planners already know where the traffic problems are.
For example, when my city installed a traffic light for my neighborhood, they didn't wait 10 years. Instead, the city measured the traffic (those rubber hoses that go across the street). They looked at the traffic volume and installed a light -- less than two years. Saying that the census helps cities fix roads is bogus.

Unequal
The census is required by law. However, laws are supposed to be applied equally. With the census, most people get the short form but a few get the long form. You are legally required to complete whatever form you receive.
While I can certainly understand and agree with the use of a statistical sample for more detailed information, this isn't applying the law equally. If it were equal, then everyone would receive the same form.
I also have to wonder why my form asked for (1) my name, (2) my age, and (3) am I Hispanic? Is there some particular reason why Hispanics are called out in the census and other ethnic backgrounds are not?

Almost Private
The 2010 Census says that the information provided "is protected by law". But what does that really mean? If you assumed that the information will be kept private, then you are grossly mistaken. The census will likely release a summary of names and potentially identifiable metrics within a year. (If your parents gave you a unique name, then you have no privacy.) The full details of the information provided today will become public record in 72 years.

All In The Family
So ignoring all of the issues about inequality, bogus claims of relevancy, and untrue privacy claims... what does the census provide?
If you are into genealogy then the census is a goldmine. It is one of the few sets of records that document families in the United States. Today, there are many records that track families, but few are official, government, public records. And even fewer are all located in one convenient location.
However, there are some serious limitations. For example, many marriages and cohabitation relationships last less than 10 years. Those will be completely missed by the census. Better resources for tracking people are available than any snapshot that the census provides. Today there are so many different documents tracking people that data mining the records is much more valuable than the census records.
As a valuable resource, I have my serious doubts about today's census. I mean, seriously, what value does it provide? As I previously mentioned, the census is slow, expensive, and inaccurate.
While it was a great idea 100 years ago, today it just seems to be a waste of taxpayer money.

Anti Social Networking
Thursday, March 18. 2010
One of my coworkers attended a productivity presentation a few months ago. This person came back fully convinced that Facebook, Yahoo! Mail, and other social networking sites were primary causes of procrastination. At this person's request, I created a bunch of firewall rules. The rules block access to these social sites during business hours. Access is granted outside work hours, from 4:00pm to 8:00am. I also permitted access during lunch (11:30am to 1:00pm) and on weekends. Access is blocked at all other times.
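The schedule above boils down to a small time-window test. Here is a minimal sketch in C of that logic, not the router's actual rule syntax. The function names (`to_minutes`, `is_blocked`) are made up for illustration, and treating each window as half-open (8:00 is blocked, 11:30 is not) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Minutes since midnight for a 24-hour clock time. */
static int to_minutes(int hour, int minute) {
    return hour * 60 + minute;
}

/*
 * Returns true when the social sites should be blocked.
 * weekday: 0 = Sunday ... 6 = Saturday (same convention as tm_wday).
 * Assumption: windows are half-open, so 11:30 and 16:00 are already open.
 */
bool is_blocked(int weekday, int hour, int minute) {
    assert(weekday >= 0 && weekday <= 6);
    assert(hour >= 0 && hour <= 23 && minute >= 0 && minute <= 59);

    if (weekday == 0 || weekday == 6)
        return false;                /* weekends are always open */

    int now = to_minutes(hour, minute);
    bool morning   = now >= to_minutes(8, 0)  && now < to_minutes(11, 30);
    bool afternoon = now >= to_minutes(13, 0) && now < to_minutes(16, 0);
    return morning || afternoon;     /* lunch and after-hours fall through */
}
```

With these windows, `is_blocked(1, 11, 45)` (Monday at 11:45) is false, while `is_blocked(1, 10, 45)` is true. A clock that runs one hour slow turns the first case into the second.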
My coworker has been thrilled with the results. By blocking access to social networking sites during office hours, my addicted coworker is forced to focus on the task at hand.

Working Smarter
I have many different filtering rules in place. For example, the local DNS server intercepts requests for domains associated with pop-up marketing sites and malware. My router blocks other sites that are frequently used for banner ads. The result is that web pages load significantly faster because you don't have to wait for ads.
Another great time saver is the NoScript plugin for Firefox. Most Flash, JavaScript, and ads on sites are not needed. If the site requires it, then you can always add that specific site to the whitelist. It takes seconds to permit sites, and by not adding in things like Google Analytics, banner ads, and quick-links for posting to Digg, Facebook, and Reddit, most sites load almost immediately.
As a side-effect of NoScript, phishing sites are no longer an issue. Your bank should be in your NoScript whitelist, but phishing sites are not. One of my associates actually remarked that NoScript saved them from compromising their bank account. "Why doesn't my bank's page look right? Why is NoScript blocking my bank? Oh! It isn't the correct URL!" I can only wonder -- how many malware sites have been blocked over the years because NoScript wouldn't load that portion of the page?

Time's Up!
Unfortunately, this week I've received a number of complaints about the router's configuration. "It's 11:45 and I can't get to my Facebook page!" (The sign of a true addict. Remember: my coworker asked for the router block; I didn't impose it without permission.)
The problem turned out to be related to the router itself. You see, the date when Daylight Saving Time begins changed in 2007. Unfortunately the router, a D-Link DI-604, has no means for updating when DST occurs. The clock was off by an hour.
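The one-hour error is easy to reproduce with calendar arithmetic. Through 2006, DST in the United States began on the first Sunday of April. Since 2007 it begins on the second Sunday of March. The sketch below computes both start dates for a given year. It only illustrates the calendar rule and makes no claim about how the DI-604 firmware stores its DST table. The function names are made up, and the day-of-week helper uses Sakamoto's table method.

```c
#include <assert.h>

/* Day of week for a Gregorian date: 0 = Sunday ... 6 = Saturday.
 * Uses Sakamoto's table method. */
int day_of_week(int year, int month, int day) {
    static const int offsets[12] = {0, 3, 2, 5, 0, 3, 5, 1, 4, 6, 2, 4};
    assert(month >= 1 && month <= 12);
    if (month < 3)
        year -= 1;
    return (year + year / 4 - year / 100 + year / 400
            + offsets[month - 1] + day) % 7;
}

/* Day of the month on which the nth Sunday (n = 1, 2, ...) falls. */
int nth_sunday(int year, int month, int n) {
    int first_dow = day_of_week(year, month, 1);
    int first_sunday = 1 + (7 - first_dow) % 7;
    return first_sunday + 7 * (n - 1);
}

/* US DST start under the rule used from 2007 on: second Sunday of March. */
int dst_start_new_rule(int year) { return nth_sunday(year, 3, 2); }

/* US DST start under the rule used through 2006: first Sunday of April. */
int dst_start_old_rule(int year) { return nth_sunday(year, 4, 1); }
```

For 2010 the new rule gives March 14 and the old rule gives April 4. A device that still follows the old rule runs an hour slow between those dates, so at 11:45 it would believe it is 10:45 and keep the block in place.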
Making matters worse, the DI-604 is no longer supported by D-Link; they dropped support in 2008, before fixing the timezone information. We didn't notice this earlier because we had not used time-sensitive router rules before.
My solution? I changed the timezone on the router (was MST, now CST). Now, I just need to remember to set the router's timezone whenever DST rolls around. That's going to be easier than waiting for my coworker to overcome a serious Facebook addiction.

Thumbs Up
Friday, March 12. 2010
Images hide everywhere. I have some MP3 files that each contain a 200K embedded JPEG for the album cover. (Every song from that album contains the same 200K embedded JPEG.) I've extracted pictures from DOC and PPT files. And data carving image files from network streams is always exciting.
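In its simplest form, data carving means scanning a byte stream for a known signature. For JPEG, the start-of-image marker is the bytes FF D8 followed by another FF, and the end-of-image marker is FF D9. The sketch below is a deliberately naive carver built on those two markers. The function name `carve_jpeg` is made up for illustration. It assumes the image is stored contiguously, and it will stop early on a JPEG that embeds its own thumbnail, since the thumbnail carries its own end marker.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Naive JPEG carver: finds the first FF D8 FF (start of image) at or after
 * `from`, then the first FF D9 (end of image) that follows it.
 * On success stores the offset and length of the candidate and returns 1;
 * returns 0 when no complete candidate is found.
 */
int carve_jpeg(const unsigned char *buf, size_t len, size_t from,
               size_t *offset, size_t *length) {
    assert(buf != NULL && offset != NULL && length != NULL);

    for (size_t i = from; i + 3 <= len; i++) {
        if (buf[i] != 0xFF || buf[i + 1] != 0xD8 || buf[i + 2] != 0xFF)
            continue;
        for (size_t j = i + 3; j + 2 <= len; j++) {
            if (buf[j] == 0xFF && buf[j + 1] == 0xD9) {
                *offset = i;
                *length = j + 2 - i;   /* include the end marker */
                return 1;
            }
        }
        return 0;   /* start marker found, but no end marker after it */
    }
    return 0;
}
```

Calling it in a loop, with `from` set to the end of the previous hit, walks through every candidate in a buffer such as an MP3 file or a reassembled network stream. Each candidate still needs to be validated with a real JPEG decoder.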
From a forensic viewpoint, images -- especially the ones that users usually forget -- are really valuable finds. I mean, people know that JPEGs contain metadata that identifies the camera, time, and even GPS location. But how many people remember that the metadata is retained when the JPEG is imported into PowerPoint? The more obscure the file format, the more likely it is that people (both investigators and suspects) will forget to look at it.
The Thumbs.db file is a good example of this. Windows creates Thumbs.db files to store directory thumbnail/preview images. The catch is that the Thumbs.db contents are always appended and never deleted. Even if you delete the files from a directory, their thumbnail images will live on.
The Thumbs.db files don't just contain images, they also contain timestamps and filenames! You might have deleted your porn collection, but if you did not remove the Thumbs.db, then your wife will still be able to see that you downloaded "3_Midgets.jpg" on 24-Feb-2010 19:21:22.
Adding to the complexity, every directory can have its own Thumbs.db file. If you really want to keep a file a secret, don't copy it to a bunch of different directories.

Parsing Problems
The Thumbs.db files would probably be more widely evaluated if they were in some type of easy-to-parse format. However, this is a Microsoft format, so it is totally weird. The actual format is laid out more like a hard drive than a data file. The Thumbs.db contains a FAT catalog, sectors, and clusters. Images are stored in the clusters, and extended metadata is arbitrarily stuffed in unused space. (Seriously -- it is a weird format.)
Assuming you can parse the file, there are still problems with the image format. For example, Windows 98 and 2000 use a non-standard JPEG format. The colors are RGBA instead of the standard YUV. (This confuses libraries like FreeImage since a 4-channel JPEG is assumed to be CMYK and not RGB with an alpha channel for transparency.)
Windows XP, Me, and 2003 all use a standard JPEG for the thumbnail image.

Newer is Better
With Vista and Windows 7, Microsoft completely revamped the thumbnail system. The new system uses a centralized database for storing all thumbnails (no more Thumbs.db files all over the place). And the files are laid out more like a flat file containing data records than a disk partition. That's the good news.
The bad news is, the full file format is proprietary and there are lots of data fields that are unknown. Fortunately, the Various Oddities blog has done a great job reverse-engineering the format. However, I've found that some of the fields are a little different than their blog describes. Here are my revisions to his reverse engineering.

Index File Format
First, the thumbnails are stored across a couple of different files. The thumbcache_idx.db file is the master index. It assigns a unique ID to each file and gives file offsets to the various image cache files. The data structures look like this: (All values are stored in Little Endian.)
typedef struct {

Image Cache Files
The remaining thumbcache_*.db files store various images. For example, thumbcache_256.db stores all 256x256 thumbnail images. The thumbcache_1024.db seems to be the only odd file -- the images are 1024x768 instead of 1024x1024.
These files begin with a short header and then contain variable length data records. (But don't worry -- they are not too complicated!)
typedef struct {
Each CMMM record consists of the CMMM structure followed by the variable-length data.
First comes the name (in multibyte format). The name is nameLen bytes long. In some cases, it looks like a long hex string encoded in a multibyte format, but in other cases it contains a filename and/or path.
The name is followed by optional padding (paddingSize bytes). Usually there is no padding, but sometimes there are 1 or 2 bytes.
I still haven't figured out why there is sometimes padding, but the "why" isn't needed for parsing -- the paddingSize tells you how much padding there is.
Finally, there is the data. The data size needs to be computed (sizeHeaderAndData - sizeof(CMMM)). If there is no image, then the size is zero. But if there is data, then it is an image. The image can be a JPEG, PNG, or Bitmap. (Other formats may be possible, but this is all I have seen so far.)

Almost There...
The part that bothers me is the 24 bytes of unknown data in the CMMM (unk[6]). While a few of the uint32_t fields seem to always be empty, a few definitely contain some kind of data. If anyone has a clue what they contain, please let me know! I wouldn't be surprised if some of the unk fields identified the inode for the actual file on the disk.
There are also a few unknown fields in the IMMM header, but right now I'm not as concerned about those. If you happen to know what any of these unknown fields contain -- or if I got anything wrong -- let me know and I'll be happy to update this entry with the corrected information.
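Once a record's data has been located, its leading bytes are enough to tell which of the three image types it holds. Here is a minimal sketch of that check. The enum and the `sniff_image` name are made up for illustration, and the bitmap test assumes the data starts with a standard BMP file header ("BM"), which I have not confirmed for every record.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { IMG_UNKNOWN, IMG_JPEG, IMG_PNG, IMG_BMP } image_type;

/* Guess the image type of a data blob from its leading signature bytes. */
image_type sniff_image(const unsigned char *data, size_t size) {
    static const unsigned char png_sig[8] =
        {0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};

    assert(data != NULL || size == 0);

    /* JPEG: start-of-image marker FF D8, followed by another marker (FF). */
    if (size >= 3 && data[0] == 0xFF && data[1] == 0xD8 && data[2] == 0xFF)
        return IMG_JPEG;
    /* PNG: fixed 8-byte file signature. */
    if (size >= 8 && memcmp(data, png_sig, 8) == 0)
        return IMG_PNG;
    /* BMP: assumes a standard "BM" file header is present. */
    if (size >= 2 && data[0] == 'B' && data[1] == 'M')
        return IMG_BMP;
    return IMG_UNKNOWN;   /* includes the zero-length "no image" case */
}
```

Anything that comes back as `IMG_UNKNOWN` with a non-zero size is worth a closer look, since it may be one of the other formats that could exist but have not turned up yet.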
Posted by Dr. Neal Krawetz in Forensics, Image Analysis, Programming at 21:00
